Question:
It's very common to compare strings
, which is usually done by comparing for equality. However, today I have a need to compare the similarity between two strings
so that I can compare how similar they are.
For example:
"City" is different from "city", but is similar (consider c
!= C
).
"City" is different from "city" but is similar (assuming the user typed it wrong).
"city" is different from "city" but is similar (same thing).
My question is: how do I check the similarity between strings
? Is there a ready-made algorithm to do this?
I'm looking for a way to do this:
if(checkSimilarity("cidade","cdade") > 0.8)
{
// as duas strings são muito parecidas
// sendo que checkSimilarity() retornando 1 para strings iguais
}
It doesn't matter what programming language is used in the answer. My question is more related to the algorithm.
Answer:
I believe that Levenshtein's algorithm is a solution that works well for what you are intending to do, since with it it is possible to know the number of modifications that a word needs to go through before it equals the other.
A possible implementation of it (in C++) would be:
#include <iostream.h>
#include <string.h>
#include <math.h>
#include <algorithm>
using namespace std;
int VerificarSimilaridade(string palavra1, string palavra2)
{
int tam1 = palavra1.size();
int tam2 = palavra2.size();
int verif[tam1 + 1][tam2 + 1];
// Se uma das palavras tiver coprimento igual a zero, é necessária uma modificação do tamanho da outra:
if (tam1 == 0)
return tam2;
if (tam2 == 0)
return tam1;
// Atribuir ordem numérica resultante das palavras ao vetor de modo, respectivamente, "vertical" e "horizontal":
int i = 0;
while(i <= tam1)
verif[i][0] = i++;
int j = 0;
while(j <= tam2)
verif[0][j] = j++;
// Verificação:
for (int i = 1; i <= tam1; i++)
{
for (int j = 1; j <= tam2; j++)
{
// Definindo custos de modificação (deleção, inserção e substituição):
int custo = (palavra2[j - 1] == palavra1[i - 1]) ? 0 : 1;
verif[i][j] = min(
min(verif[i - 1][j] + 1, verif[i][j - 1] + 1),
verif[i - 1][j - 1] + custo);
}
}
return verif[tam1][tam2];
}
int main()
{
string pala1, pala2;
cout << "Informe a primeira palavra: " << endl;
cin >> pala1;
cout << "Informe a segunda palavra: " << endl;
cin >> pala2;
cout << "O numero de modificacoes necessarias para que as duas palavras se igualem e: "
<< VerificarSimilaridade(pala1, pala2) << endl;
system("pause");
return 0;
}
Furthermore, from the number of modifications you can create your own criteria to quantify the level of similarity between two strings; something interesting would be to calculate the percentage of change according to the length of the words, so through a degree of reference it would be possible to see that 5 modifications can have a high value in a string with 7 letters and, at the same time, low in a string with 200 for example.