mysql – How can I make a query with LIKE or REGEXP ignoring table words?

Question:

I have the following data in a table

-------------------
**usuarios**
-------------------
nome
------------------
Wallace de Souza Vizerra
------------------------
Gustavo Carmo da Costa

I need to return the records from the usuarios table that contain the given value in the nome field, but I need to ignore some existing words in the nome field values

Example:

SELECT * FROM usuarios WHERE nome LIKE 'Wallace Souza Vizerra'
#Sem o "de" no nome

I would like the nome query to ignore the words de , da , dos , das existing in the database.

How can I make this query, with LIKE or REGEXP, excluding words?

Answer:

I had a similar problem here at my company. Basically, here we use two ways to identify homonyms with typos.

The first is the levenshtein distance and the second is the soundex() function.

LEVENSHTEIN

To paraphrase Wikipedia "the Levenshtein distance or edit distance between two strings is given by the minimum number of operations required to transform one string into the other."

For example, the levenshtein distance between "Guilherme Silva" and "Guilherme da Silva" is 3. Between "Maria Dores" and "Maria das Dores" it is 4.

Here is the code to create the levenshtein() function in Mysql.

DELIMITER $$
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END$$
DELIMITER ;

To use, in your case, you can make the following query:

SELECT * FROM usuario as us WHERE levenshtein(us.nome, 'Wallace Silva') < 5 //Ou outro indice. Coloquei 5 para que ele pegue somente nomes que tenham no maximo indice 4 na distancia levenshtein.

If you have the following records in the bank

--------------------
1 Wallace Silva
2 Wallace João da Silva
3 Wallace das Silva
4 Guilherme da Silva
--------------------

The query will return, in this case, only records 1 and 2.

SOUNDEX

The soundex() function is a function that can be used in database searches where you know the pronunciation but not exactly how to spell it.

The soundex() is already native to Mysql and we don't need to create it manually.

In your case, the query would look something like this:

SELECT * FROM pessoa WHERE soundex(nome) = soundex('Wallace Silva')

The problem with using soundex() is that it works better with English words and doesn't work 100% in Portuguese.

COMBINED MODE

We can also create a query in which we combine the two forms of search so that we can have a higher probability of success.

We can use it like this:

SELECT * FROM pessoa WHERE levenshtein(soundex(nome), soundex('Wallace Silva')) < 3

Or still:

SELECT * FROM pessoa WHERE (levenshtein(nome, 'Wallace Silva') < 5) OR soundex(nome) = soundex('Wallace Silva')

SOME CONSIDERATIONS

  • The more functions we use, the more processing we need to have on the machine, that is, if we create a 'little monster' query, it can take a while to run
  • Both functions have some negative points, I recommend studying them well and testing A LOT before implementing them in some code
  • So far, these have been the most effective ways to look up similar names directly in the SQL query that I've found that have worked for me. Doesn't mean there aren't any better ones, it doesn't hurt to Google them 😉

Anyway, I hope I helped! Enjoy it!

Scroll to Top