python – How to compare if the contents of two string columns of a data frame are similar

Question:

I have a data frame where I need to compare how much the contents of two columns are similar.

For example: coluna a = “José Luiz da Silva” and coluna b = “José L. Silva” . How can I indicate that column a and column b are similar?

Answer:

Here's a possible solution, in Python:

#-*- coding: utf-8 -*-
from unidecode import unidecode

ignore_list = ['de', 'do', 'da', 'dos', 'das']

def parse_name(full_name):
    name_list = full_name.split() # Separa cada nome
    new_name_list = []
    for name in name_list: # Percorre cada nome
        name = name.strip('.') # Remove pontos
        name = name.lower() # Converte todas as letras em minúsculas
        if name in ignore_list: # Remove preposições
            continue
        name = unidecode(name.decode('utf8')) # Remove acentos (necessita da biblioteca 'unidecode')
        new_name_list.append(name)
    return new_name_list

def is_similar(a, b):
    a = parse_name(a)
    b = parse_name(b)
    if len(a) != len(b): # Se o número de palavras for diferente, retorna falso
        return False
    for x, y in zip(a, b):
        if (len(x) == 1) or (len(y) == 1): # Se uma das palavras possuir apenas uma letra...
            if x[0] != y[0]: #...compara apenas a primeira letra
                return False
        else: # Caso contrário...
            if x != y: #...compara a palavra toda
                return False
    return True # Se todas as palavras forem iguais, retorna verdadeiro

Example of use:

a = 'José Luiz da Silva'
b = 'José L. Silva'
print is_similar(a, b) # Retorna True

In this solution, the is_similar() function only returns true or false . Depending on your need, you might want to think of a more flexible metric that returns a measure of distance. For example:

  • Names like 'José L. Silva' and 'José Luiz da Silva' would have a distance of 0 (they would be considered equal);
  • Names like 'José Silva' and 'José Luiz da Silva' would have a small distance value (they would be considered similar);
  • Names like 'José Silva' and 'Maria Souza' would have a large distance value (they would be considered quite different).
Scroll to Top