How to check similarity between strings?

Question:

It's very common to compare strings , which is usually done by comparing for equality. However, today I have a need to compare the similarity between two strings so that I can compare how similar they are.

For example:

"City" is different from "city", but is similar (consider c != C ).

"City" is different from "city" but is similar (assuming the user typed it wrong).

"city" is different from "city" but is similar (same thing).

My question is: how do I check the similarity between strings ? Is there a ready-made algorithm to do this?

I'm looking for a way to do this:

if(checkSimilarity("cidade","cdade") > 0.8)
{
    // as duas strings são muito parecidas
    // sendo que checkSimilarity() retornando 1 para strings iguais
}

It doesn't matter what programming language is used in the answer. My question is more related to the algorithm.

Answer:

I believe that Levenshtein's algorithm is a solution that works well for what you are intending to do, since with it it is possible to know the number of modifications that a word needs to go through before it equals the other.

A possible implementation of it (in C++) would be:

#include <iostream.h>
#include <string.h>
#include <math.h>
#include <algorithm>

using namespace std;

int VerificarSimilaridade(string palavra1, string palavra2)
{
    int tam1 = palavra1.size();
    int tam2 = palavra2.size();
    int verif[tam1 + 1][tam2 + 1];

    // Se uma das palavras tiver coprimento igual a zero, é necessária uma modificação do tamanho da outra:
    if (tam1 == 0)
        return tam2;

    if (tam2 == 0)
        return tam1;

    // Atribuir ordem numérica resultante das palavras ao vetor de modo, respectivamente, "vertical" e "horizontal":
    int i = 0;
    while(i <= tam1)
        verif[i][0] = i++;

    int j = 0; 
    while(j <= tam2)
        verif[0][j] = j++;

    // Verificação:
    for (int i = 1; i <= tam1; i++)
    {
        for (int j = 1; j <= tam2; j++)
        {
            // Definindo custos de modificação (deleção, inserção e substituição):
            int custo = (palavra2[j - 1] == palavra1[i - 1]) ? 0 : 1;

            verif[i][j] = min(
                min(verif[i - 1][j] + 1, verif[i][j - 1] + 1), 
                verif[i - 1][j - 1] + custo);
        }
    }

    return verif[tam1][tam2];
 }

 int main()
 {
     string pala1, pala2;

     cout << "Informe a primeira palavra: " << endl;
     cin >> pala1;
     cout << "Informe a segunda palavra: " << endl;
     cin >> pala2;

     cout << "O numero de modificacoes necessarias para que as duas palavras se igualem e: " 
          << VerificarSimilaridade(pala1, pala2) << endl;

     system("pause");
     return 0;   
 }

Furthermore, from the number of modifications you can create your own criteria to quantify the level of similarity between two strings; something interesting would be to calculate the percentage of change according to the length of the words, so through a degree of reference it would be possible to see that 5 modifications can have a high value in a string with 7 letters and, at the same time, low in a string with 200 for example.

Scroll to Top