I have the following data in a table
------------------- **usuarios** ------------------- nome ------------------ Wallace de Souza Vizerra ------------------------ Gustavo Carmo da Costa
I need to return the records from the
usuarios table that contain the given value in the
nome field, but I need to ignore some existing words in the
nome field values
SELECT * FROM usuarios WHERE nome LIKE 'Wallace Souza Vizerra' #Sem o "de" no nome
I would like the
nome query to ignore the words
das existing in the database.
How can I make this query, with LIKE or REGEXP, excluding words?
I had a similar problem here at my company. Basically, here we use two ways to identify homonyms with typos.
The first is the levenshtein distance and the second is the soundex() function.
To paraphrase Wikipedia "the Levenshtein distance or edit distance between two strings is given by the minimum number of operations required to transform one string into the other."
For example, the levenshtein distance between "Guilherme Silva" and "Guilherme da Silva" is 3. Between "Maria Dores" and "Maria das Dores" it is 4.
Here is the code to create the levenshtein() function in Mysql.
DELIMITER $$ CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) ) RETURNS INT DETERMINISTIC BEGIN DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; DECLARE s1_char CHAR; -- max strlen=255 DECLARE cv0, cv1 VARBINARY(256); SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; IF s1 = s2 THEN RETURN 0; ELSEIF s1_len = 0 THEN RETURN s2_len; ELSEIF s2_len = 0 THEN RETURN s1_len; ELSE WHILE j <= s2_len DO SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; END WHILE; WHILE i <= s1_len DO SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; WHILE j <= s2_len DO SET c = c + 1; IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF; SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; IF c > c_temp THEN SET c = c_temp; END IF; SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; IF c > c_temp THEN SET c = c_temp; END IF; SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; END WHILE; SET cv1 = cv0, i = i + 1; END WHILE; END IF; RETURN c; END$$ DELIMITER ;
To use, in your case, you can make the following query:
SELECT * FROM usuario as us WHERE levenshtein(us.nome, 'Wallace Silva') < 5 //Ou outro indice. Coloquei 5 para que ele pegue somente nomes que tenham no maximo indice 4 na distancia levenshtein.
If you have the following records in the bank
-------------------- 1 Wallace Silva 2 Wallace João da Silva 3 Wallace das Silva 4 Guilherme da Silva --------------------
The query will return, in this case, only records 1 and 2.
The soundex() function is a function that can be used in database searches where you know the pronunciation but not exactly how to spell it.
The soundex() is already native to Mysql and we don't need to create it manually.
In your case, the query would look something like this:
SELECT * FROM pessoa WHERE soundex(nome) = soundex('Wallace Silva')
The problem with using soundex() is that it works better with English words and doesn't work 100% in Portuguese.
We can also create a query in which we combine the two forms of search so that we can have a higher probability of success.
We can use it like this:
SELECT * FROM pessoa WHERE levenshtein(soundex(nome), soundex('Wallace Silva')) < 3
SELECT * FROM pessoa WHERE (levenshtein(nome, 'Wallace Silva') < 5) OR soundex(nome) = soundex('Wallace Silva')
- The more functions we use, the more processing we need to have on the machine, that is, if we create a 'little monster' query, it can take a while to run
- Both functions have some negative points, I recommend studying them well and testing A LOT before implementing them in some code
- So far, these have been the most effective ways to look up similar names directly in the SQL query that I've found that have worked for me. Doesn't mean there aren't any better ones, it doesn't hurt to Google them 😉
Anyway, I hope I helped! Enjoy it!