Question:
To make it easier to search for commonly misspelled information, we use phonetic algorithms. They are an extremely useful resource that tends to be neglected, especially outside the English language, for which there are well-known algorithms such as those mentioned on Wikipedia (en), particularly Soundex, available in several databases and languages.
Our language has particularities that make algorithms from other languages unfeasible. In fact, it is even more regionalized: an algorithm that works for Brazil does not work for Portugal, and perhaps not for other Portuguese-speaking countries either. I even wonder whether I would need specializations for Ceará, Rio Grande do Sul, or even for Piracicaba and its surroundings (hundreds of kilometers 🙂 ), just to name a few.
- Is there any official source or conclusive study of what our phonetic algorithm should look like?
- Based on those studies or on experience, what would this algorithm look like? Can we take advantage of Soundex, Metaphone, or other algorithms well known in the English language, just by changing the phonemes?
- Can or should we (optionally) use common foreign phonemes as well, since we use foreign names?
- I'm interested in the algorithm itself, that is, how to develop it in detail (broad outlines are not enough; there is already plenty of that around); pseudo-code or real code in some language would be useful, but not essential.
Since I can port code from almost any mainstream language, and in fact I will use it (as will others) in a few different languages, I don't care whether it's in C, C++, C#, Java, JavaScript, PHP, Perl, Python, Ruby, Lua, Delphi, Go, D, Scala, F#, or variants of BASIC, xBase (Clipper, Harbour, FoxPro), SQL, etc., or even COBOL :). Use whatever you already have or feel most comfortable with.
Note that it's the best algorithm that counts, not the implementation, so the language doesn't matter. The same algorithm in a different language will be considered a duplicate.
Some references I know (some are pretty bad):
- Phonetic Search in Brazilian Portuguese
- Information Retrieval by Phoneme Similarity Adapted to the Portuguese Language
- example in Delphi
- example in C#
- another example in C#
- VB.NET example
- PHP example
- Java example
- another example in C#
- one more in C#.
Although I have accepted an answer, I think something better could still come, and I am willing to change the acceptance if that happens. I still want to see your answer.
Answer:
The idea behind the "Pt-BR Metaphone" comes precisely from the success story of algorithms such as Soundex, used during the early American censuses, and from the later refinement of that idea with the appearance of Metaphone.
The niche of this algorithm is very specific. It is not an accurate phonetic representation following the IPA, but a simplification of words based on "sounds like…". Like Soundex, it was used mainly for cross-referencing textual data and identifying misspelled duplicate names. Its advantage is that it can reduce the computational effort needed to find a word similar to another using algorithms such as similar_text() (in PHP), levenshtein(), and others.
How could Metaphone be useful in this case? Algorithms like similar_text, levenshtein, and others return a measure of the degree of proximity between two strings, usually normalized to between 0 and 1, where 0 means no similarity and 1 means total similarity.
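As a concrete illustration of such an index, here is a minimal sketch in Python (rather than the PHP functions named above); the similarity() helper and its normalization by the longer string's length are choices of mine for the sketch, not part of any library mentioned here:

```python
# Sketch: a similarity index in [0, 1] built on Levenshtein distance,
# in the spirit of what similar_text()/levenshtein() give you in PHP.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(m*n)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """0 = no similarity, 1 = identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(similarity("REBECA", "REBBECA"))  # one insertion over 7 chars: ~0.86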
So imagine cross-referencing a database of thousands of street names, each averaging three to four words, against a street name of the same average length from another database, which may be misspelled or abbreviated, in short, not exactly equal to what you have. It would be necessary to check the similarity between each pair of words and take an average for the street. Considering that these similarity algorithms have complexities of O(m*n) and O(n^3), we are faced with a huge computational effort to find a single street.
This is where Metaphone comes in. By simplifying the string, reducing it to at most 4 characters, it becomes possible to create an "index" of similar words. For example, REBECA, REBBECA, RABEKA, and many other variations all share the same Metaphone string: RBK. With that, I can apply the Levenshtein algorithm to a reduced set of words (usually one or two dozen), cutting the computational effort needed.
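A sketch of that bucketing strategy, again in Python. Be warned that phonetic_key() below is a deliberately naive stand-in written only for illustration, not the actual Metaphone Pt-BR rules (those are in the README linked at the end of this answer); the indexing idea is the point.

```python
from collections import defaultdict

def phonetic_key(word: str, length: int = 4) -> str:
    """Toy stand-in for a Metaphone-style key: NOT the real Pt-BR rules.
    Keeps the first letter, drops later vowels, maps a few 'sounds
    like' letters, collapses repeats, and truncates to `length`."""
    word = word.upper()
    mapped = word[0] + "".join(
        {"C": "K", "Q": "K", "W": "V", "Y": "I"}.get(c, c)
        for c in word[1:] if c not in "AEIOU")
    key = ""
    for c in mapped:
        if not key or key[-1] != c:   # collapse immediate repeats
            key += c
    return key[:length]

# Build the "index": one bucket of candidate words per key.
index = defaultdict(list)
for name in ["REBECA", "REBBECA", "RABEKA", "RENATA", "ROBERTO"]:
    index[phonetic_key(name)].append(name)

# Lookup: the expensive comparison only runs inside the small bucket.
query = "REBEKKA"                      # a misspelling, key is also RBK
candidates = index[phonetic_key(query)]
best = min(candidates, key=lambda c: levenshtein(query, c))
# levenshtein() is the function from the previous sketch
```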
Of course, there are other approaches to the problem described, but Metaphone is one of them, and it only requires one extra column in the database to serve as an index.
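In database terms, that extra column could look like the following sketch, using Python's built-in sqlite3; the table layout and names are mine, and phonetic_key() is the toy function from the previous sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE street (name TEXT, metaphone TEXT)")
conn.execute("CREATE INDEX idx_street_metaphone ON street (metaphone)")

# On insert, store the precomputed phonetic key alongside the name.
for name in ["REBECA", "REBBECA", "RABEKA"]:
    conn.execute("INSERT INTO street VALUES (?, ?)",
                 (name, phonetic_key(name)))

# On lookup, the index narrows the scan to the matching bucket only;
# the expensive similarity check then runs on just those rows.
rows = conn.execute("SELECT name FROM street WHERE metaphone = ?",
                    (phonetic_key("REBEKKA"),)).fetchall()
```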
The Portuguese version was made because the original Metaphone uses English grammar rules, so many Portuguese words end up falling into different groups because they would sound different to an American.
The article that @Luiz-Vieira cites in the comments to the question is of my authorship, motivated by the REDECA project I took part in, where the challenge was to build a registry of people who could hold several different documents, with none of them mandatory. Those who develop systems know that the CPF, being unique and numeric, is commonly used as a primary key and to avoid duplicate registrations. A phonetic key was one of the approaches that came up in the group. From there, I refined the algorithm using a database of people's names, 1 million names and 220,000 unique words, resulting in the study mentioned above, where I concluded that 4 characters are enough and defined what the rules for Brazilian Portuguese Metaphone should be. The implementation of these rules is in the C code available at SourceForge and in the JavaScript that @João-Paraná makes available.
I believe that, given the date of the REDECA project, this was one of the first open variants of Metaphone for Brazilian Portuguese, and it yielded a scientific article.
And to complement the answer to the question, here is the link to the README of the C code, which contains the Metaphone Pt-BR conversion rules published in the article: http://sourceforge.net/p/metaphoneptbr/code/ci/master/tree/README