raFTI: Matching "Chaotic" Wine Names
Hi, I'm Vit Glinca, a backend programmer at Deeplace, a company that actively works in winetech, among other areas. I'd like to present our latest feature in this area: raFTI.v5.3, a full-text search system.
When working with wine catalogs, one quickly encounters an unexpected problem. The same wine may appear under several completely different names. For example:
Barolo Riserva 2016 / Barolo 2016 Riserva / Barolo DOCG Riserva 2016
Sometimes the differences are even more significant:

As a result, what seems to be a simple search task turns into a text interpretation problem. Several years ago, I started experimenting with a system for matching wine names. Over time, it evolved into a rather unusual solution built around a large collection of heuristics and domain-specific rules. The system was eventually named raFTI.v5.3 (Relate Assemblies for Full Text Indexation).
The Chaos of Wine Names
Wine names often resemble loosely structured text rather than formal identifiers. They may contain: grape varieties / regions / wineries / wine styles / vintages / colors / …
Moreover, the order of these elements may vary. Sometimes part of the information is missing altogether. In other cases, marketing terms are added. As a result, multiple textual representations may refer to exactly the same wine.
Anchors and Modifiers
Some terms in a title act as anchors, while others act as modifiers. Anchors define the main semantic points of a line, while modifiers refine them.
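A minimal sketch of the anchor/modifier split, assuming a tiny hand-made vocabulary. The term sets and the `split_roles` helper are hypothetical and are not taken from raFTI's actual dictionaries; which words count as anchors or modifiers here is an assumption made purely for illustration.

```python
# Hypothetical vocabularies, chosen only to illustrate the idea.
ANCHORS = {"reserva", "riserva"}       # assumed class-type terms
MODIFIERS = {"rosu", "sec", "docg"}    # assumed refining terms


def split_roles(title):
    """Sort the tokens of a title into anchors, modifiers, and everything else."""
    roles = {"anchor": [], "modifier": [], "other": []}
    for token in title.lower().split():
        if token in ANCHORS:
            roles["anchor"].append(token)
        elif token in MODIFIERS:
            roles["modifier"].append(token)
        else:
            roles["other"].append(token)
    return roles


roles = split_roles("Barolo DOCG Riserva 2016")
print(roles)
```

In a real system the vocabularies would be far larger and multi-word terms would need their own handling; the sketch only shows that the same title can be read as a small set of anchors refined by modifiers.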
From Groups to Templates
The first versions of the system relied on simple term groups. For example:
grape varieties / regions / wineries / …
Over time, however, it became clear that wine names are not assembled randomly. Most of them follow recurring structural patterns. The next step was therefore to identify term subgroups and build name templates. For example:
[Winery] [Wine name] [Color] [Type] or [Region] [Grape] [Typology]
When the internal structure of a wine name is not explicitly visible, such templates begin to serve as matching anchors.
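The template idea can be sketched as follows. This is a simplified illustration under stated assumptions: the `LEXICON` mapping, the template identifiers, and both helper functions are hypothetical, every term is a single token, and the template must match the token order exactly (the real system clearly has to cope with reordering and missing elements).

```python
# Hypothetical token-to-class lexicon; a real one would come from dictionaries.
LEXICON = {
    "vartely": "Winery",
    "taraboste": "WineName",
    "rosu": "Color",
    "sec": "Type",
    "barolo": "Region",
    "nebbiolo": "Grape",
    "riserva": "Typology",
}

# The two templates quoted in the text, written as class sequences.
TEMPLATES = {
    "winery_name_color_type": ("Winery", "WineName", "Color", "Type"),
    "region_grape_typology": ("Region", "Grape", "Typology"),
}


def classify(title):
    """Map each token to its term class, or 'Unknown' if it is not in the lexicon."""
    return tuple(LEXICON.get(tok, "Unknown") for tok in title.lower().split())


def match_template(title):
    """Return the name of the first template whose class sequence equals the title's."""
    classes = classify(title)
    for name, pattern in TEMPLATES.items():
        if classes == pattern:
            return name
    return None


print(match_template("Vartely Taraboste Rosu Sec"))
print(match_template("Barolo Nebbiolo Riserva"))
```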
Synonym Mania
One of the core ideas behind raFTI is the aggressive use of synonyms. Over time, this mechanism evolved into several distinct layers.
Primary Synonymy
At the token level, different forms of the same concept are treated as equivalent:

All such variants receive the same base index component.
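A minimal sketch of token-level synonymy, assuming a small lookup table. The word lists and the component labels (`color:red`, `sweetness:dry`) are illustrative assumptions, not raFTI's real data; the point is only that different surface forms collapse to one base index component.

```python
# Hypothetical synonym table: several surface forms map to one base component.
BASE_COMPONENT = {
    "rosu": "color:red",
    "rosso": "color:red",
    "rouge": "color:red",
    "red": "color:red",
    "sec": "sweetness:dry",
    "secco": "sweetness:dry",
    "dry": "sweetness:dry",
}


def index_components(title):
    """Return the sorted set of index components for a title.

    Tokens without a known base component are kept as plain words.
    """
    return sorted({BASE_COMPONENT.get(tok, "word:" + tok) for tok in title.lower().split()})


a = index_components("Taraboste Rosu Sec")
b = index_components("Dry Red Taraboste")
print(a)
print(a == b)
```

Because both titles reduce to the same components, they become comparable even though they share only one literal word.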
Secondary Synonymy
During index construction, additional synonyms are generated automatically:

Tertiary Synonymy
This layer is derived from dictionaries and object catalogs:

Quaternary Synonymy
Some synonym relationships emerge naturally during data processing. For example, winery names frequently appear together with city names. Such associations may later be used as additional matching hints.
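One plausible way to surface such associations is simple co-occurrence counting. The sketch below is an assumption about how this could be done, not a description of raFTI's internals; the winery names, city names, and records are invented placeholders.

```python
from collections import Counter
from itertools import product

# Invented placeholder vocabularies and records, for illustration only.
WINERIES = {"alpha", "beta"}
CITIES = {"northtown", "southport"}
RECORDS = [
    "Alpha Northtown Merlot",
    "Alpha Merlot Northtown 2019",
    "Beta Southport Riesling",
    "Alpha Southport Rose",
]


def cooccurrence(records, wineries, cities):
    """Count how often each (winery, city) pair appears in the same record."""
    counts = Counter()
    for text in records:
        tokens = set(text.lower().split())
        for pair in product(sorted(wineries & tokens), sorted(cities & tokens)):
            counts[pair] += 1
    return counts


def matching_hints(counts, min_count=2):
    """Keep only pairs seen at least min_count times."""
    return sorted(pair for pair, n in counts.items() if n >= min_count)


counts = cooccurrence(RECORDS, WINERIES, CITIES)
hints = matching_hints(counts)
print(hints)
```

The `min_count` threshold is an arbitrary choice here; it stands in for whatever evidence level a real system would require before trusting an emergent association.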
Semantic Triggers
Over time, it became clear that excessive synonymy creates side effects of its own. To control this process, a mechanism called semantic triggers was introduced.
A trigger is a contextual condition that allows or blocks the use of specific synonyms and/or activates a group of attributes. For example:

Over time, this mechanism evolved into a powerful semantic description tool in its own right.
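A minimal sketch of a trigger as a contextual condition, assuming triggers are keyed on the other tokens in the title. The `Trigger` class, the rule set, and the component labels are hypothetical. The example rule relies on one real-world fact: on Champagne labels, "Sec" denotes a noticeably sweeter style than "Brut", so treating "sec" as plain "dry" is unsafe in that context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    token: str                          # token the synonym applies to
    component: str                      # index component it would produce
    required: frozenset = frozenset()   # context tokens that must be present
    blocked_by: frozenset = frozenset() # context tokens that forbid the synonym

    def applies(self, context):
        return self.required <= context and not (self.blocked_by & context)


# Hypothetical rules: "sec" means dry unless the context is sparkling.
TRIGGERS = [
    Trigger("sec", "sweetness:dry", blocked_by=frozenset({"champagne", "spumant"})),
    Trigger("sec", "sweetness:off-dry", required=frozenset({"champagne"})),
]


def resolve(title):
    """Return the components produced by triggers that fire for this title."""
    tokens = title.lower().split()
    context = set(tokens)
    resolved = []
    for tok in tokens:
        for trig in TRIGGERS:
            if trig.token == tok and trig.applies(context):
                resolved.append(trig.component)
    return resolved


print(resolve("Taraboste Rosu Sec"))
print(resolve("Champagne Sec"))
print(resolve("Spumant Sec"))
```

In the third call no rule fires, so the ambiguous token contributes nothing rather than a possibly wrong synonym.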
Decision Signature
To understand why the system chooses one match over another, every candidate is represented by a compact 11-digit diagnostic code. Each position reflects the contribution of a specific component:
FF NGV RMC SAU
Component | Meaning
FF | Final score
N | Name: how completely the name template is matched
G | Grape (high impact on score)
V | Vineyard, i.e. the winery (high impact on score)
R | Region (medium impact on score)
M | Modifiers for class (low impact on score)
C | Color (high impact on score)
S | Sparkling, i.e. CO₂ (high impact on score)
A | Anchor: class type (high impact on score)
U | Unique words (high impact on score)
For each component digit, a value of 3 is neutral, a value above 3 is a bonus, and a value below 3 is a penalty. Components marked as high impact incur an additional penalty when their value falls below 3. Region is an important characteristic, but its impact is deliberately weakened because this field is heavily contaminated with errors and inaccuracies in real data sources. Example:
(search sample) VIN CHATEAU VARTELY TARABOSTE ROSU SEC 0.75L
(comparison results with wines from the database)
score | vineyard | wine name
19 339 339 934 | Chateau Vartely | Taraboste Pinot Noir
9 339 349 932 | Chateau Vartely | Taraboste Pur Aristocratic Rosu
9 339 339 924 | Chateau Vartely | Taraboste Reserva Cabernet Sauvignon & Merlot
(explanations) The first result received the highest score thanks to matches on winery, color, and CO₂ (characteristics that do not appear in the name are taken from the database record). Grape varieties and region are not specified in the search sample and do not affect the result. A match on the unique word Taraboste earns a bonus if the name contains no extra words. The second result was penalized for extra words. The third result was penalized because it contains the important word Reserva (Anchor: class type), which is absent from the search sample.
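The signatures above can be decoded mechanically. The sketch below is an illustrative reading of the format, not the system's own code: it assumes the last nine digits are the components N G V R M C S A U in that order and that everything before them is the final score (single-digit scores appear without a leading zero in the examples).

```python
FIELDS = "NGVRMCSAU"   # component order taken from the layout FF NGV RMC SAU
NEUTRAL = 3            # 3 is neutral; above is a bonus, below is a penalty


def parse_signature(code):
    """Split a signature into (final score, {component: digit})."""
    digits = code.replace(" ", "")
    score, parts = int(digits[:-9]), digits[-9:]
    return score, {name: int(d) for name, d in zip(FIELDS, parts)}


def describe(code):
    """Return (score, components with a bonus, components with a penalty)."""
    score, comps = parse_signature(code)
    bonuses = [k for k in FIELDS if comps[k] > NEUTRAL]
    penalties = [k for k in FIELDS if comps[k] < NEUTRAL]
    return score, bonuses, penalties


for code in ("19 339 339 934", "9 339 349 932", "9 339 339 924"):
    print(code, describe(code))
```

Read this way, the second candidate is the only one penalized on U (unique words) and the third is the only one penalized on A (anchor), which agrees with the explanations above.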
When a Person Makes a Mistake
(search sample) DIVUS Rara Neagra rosu sec 0.75l, anul 2022
(comparison results with wines from the database)
score | vineyard | wine name
24 588 339 933 | Divus Winery | Rara Neagra
12 582 369 933 | Gitana Winery | Rara Neagra Rosu Sec
(explanations) At first glance, the second option seems more convincing. When the words Rosu Sec are stated explicitly in a name, they act both as characteristics and as modifiers. As characteristics, they earn a bonus for both candidates; as modifiers, they earn a bonus only for the second candidate, which is nevertheless penalized more heavily for the winery mismatch. (The human operator chose the second option based on the number of matching words.)
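Comparing the two signatures position by position makes the trade-off explicit. This is a small illustrative helper, assuming the last nine digits are the components N G V R M C S A U in that order; the function names are hypothetical.

```python
FIELDS = "NGVRMCSAU"   # component order taken from the layout FF NGV RMC SAU


def components(code):
    """Map the last nine digits of a signature to their component letters."""
    digits = code.replace(" ", "")[-9:]
    return dict(zip(FIELDS, map(int, digits)))


def diff(code_a, code_b):
    """Return {component: (value in a, value in b)} for components that differ."""
    a, b = components(code_a), components(code_b)
    return {k: (a[k], b[k]) for k in FIELDS if a[k] != b[k]}


delta = diff("24 588 339 933", "12 582 369 933")
print(delta)
```

The two candidates differ only in V and M: the Gitana record gains on modifiers (6 versus 3) but loses far more on the winery component (2 versus 8).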
Comparing Matchers
During development, raFTI was continuously compared against traditional full-text search engines. In particular, several experiments were performed using the full-text search mechanisms available in MySQL (MySQL.FTI below). In practice, a number of characteristic differences emerged. Full-text search performs well when:

Wine names, however, often violate all three assumptions. Typical issues include:

In such situations, conventional search tends to produce a large number of irrelevant candidates. raFTI addresses this problem through:

As a result, the candidate space becomes significantly smaller and the matching results more stable. Numerous statistically significant comparisons of the matching results yielded the following shares of correct answers:
Matcher | Correct answers
MySQL.FTI | 40%
raFTI.v5.3 | 96%
The Anti-Combinatorial Effect
At first glance, a system built around numerous heuristics should suffer from combinatorial explosion. In practice, the opposite effect is observed. Real-world wine names are generated within a relatively small set of structural patterns. As a result, additional rules do not expand the search space. Instead, they help eliminate impossible combinations at an earlier stage. Rather than increasing complexity, many heuristics act as filters that progressively narrow the set of plausible candidates.
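The filtering effect can be illustrated with a toy pipeline. Everything here is a hypothetical sketch: the candidate records are invented, and the three predicates merely stand in for real heuristics. The point is that each added rule shrinks the candidate list instead of multiplying it.

```python
# Invented candidate records, for illustration only.
CANDIDATES = [
    {"id": 1, "winery": "alpha", "color": "red", "sparkling": False},
    {"id": 2, "winery": "alpha", "color": "white", "sparkling": False},
    {"id": 3, "winery": "beta", "color": "red", "sparkling": False},
    {"id": 4, "winery": "alpha", "color": "red", "sparkling": True},
    {"id": 5, "winery": "gamma", "color": "rose", "sparkling": False},
]

# Each heuristic is a predicate that keeps only plausible candidates.
FILTERS = [
    lambda c: c["winery"] == "alpha",
    lambda c: c["color"] == "red",
    lambda c: not c["sparkling"],
]


def narrow(candidates, filters):
    """Apply filters in order; return survivors and the list size after each step."""
    sizes = [len(candidates)]
    for keep in filters:
        candidates = [c for c in candidates if keep(c)]
        sizes.append(len(candidates))
    return candidates, sizes


remaining, sizes = narrow(CANDIDATES, FILTERS)
print(sizes)
print([c["id"] for c in remaining])
```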
The Core Idea Behind raFTI
raFTI is based on the Relate Assemblies (RA) methodology. Among other concepts, the methodology includes:

In particular, two complementary forms of referential integrity are used:

The broader RA methodology is discussed in more detail: here
Final Thoughts
In the age of neural networks, raFTI may appear somewhat old-fashioned. That is probably true. Nevertheless, it remains an effective approach for domains where transparency, controllable knowledge representation, and predictable behavior are more important than raw statistical inference. Several ideas turned out to be considerably more useful than originally expected:

This article describes only a subset of the mechanisms used by the system. Many implementation details were intentionally omitted to avoid turning the discussion into technical documentation. Practical use of the system continues to reveal new directions for development. In particular, the accumulated synonym data has proven valuable not only for improving recognition quality, but also for discovering missing relationships in the domain model and refining heuristics for rare or erroneous data forms.
Comments, criticism, alternative approaches, or simply a fresh perspective on the problem are always welcome.
TECHNOLOGIES CHANGE; THE CRYSTALLIZATION OF DATA DOES NOT
(Help, translations, graphics by ChatGPT) |