java – Recognize repeated words in a String

Question:

I have text inside a StringBuffer and I need to find and mark the words that appear more than once. At first I used a circular queue with 10 positions, since I am only interested in words repeated within a "radius" of 10 words.
Note that a repeated word should only be marked if its occurrences are within 10 words of each other. If the occurrences are more than 10 words apart, they must not be marked.
The Contem method returns "null" if there is no repetition, or returns the word that is repeated. The variable string simply holds the full text.

StringBuffer stringProximas = new StringBuffer();
String has = "";
Pattern pR = Pattern.compile("[a-zA-Zà-úÀ-Ú]+");
Matcher mR = pR.matcher(string);
while(mR.find()){
  word = mR.group();
  nextWord.Inserir(word);//insert into the list
  has = nextWord.Contem();//check whether the list contains equal words
  //an if to check whether has is null or not
  //and here the repeated word is marked, if has is not null
  mR.appendReplacement(stringProximas, "");
  stringProximas.append(has);
}
public void Inserir(String palavra){
    if(this.list[9].equals("null")){
        if(this.list[0].equals("null")){
            this.list[this.fim]=palavra;
        }else{
            this.fim++;
            this.list[this.fim] = palavra;
        }
    }else{
        //wrap the fim pointer around to position 0
        if(this.inicio == 0 && this.fim == 9){
            this.inicio++;
            this.fim = 0;
            this.list[this.fim] = palavra;
        }else if(this.inicio == 9 && this.fim == 8){//wrap the inicio pointer around to position 0
            this.inicio = 0;
            this.fim++;
            this.list[this.fim] = palavra;
        }else{
            this.inicio++;
            this.fim++;
            this.list[this.fim] = palavra;                    
        }
    }
}
public String Contem() throws Exception{
    for(int i=0;i<this.list.length;i++){
        for(int j=i+1;j<this.list.length;j++){
            if(this.list[i].equals(this.list[j]) && (!this.list[i].equals("null") || !this.list[j].equals("null"))){
                //do not pick up the same repetition more than once
                if(!this.list[i].equals("?")){
                    this.list[i] = "?";//this will probably be removed
                    return this.list[j];
                }
            }
        }
    }
    return "null";
}

My big problem: when I find a repeated word, I can only mark the second occurrence. Even though the first occurrence is still in the queue, the word variable holds the second one, and because of how the while loop works I can't go back and mark the first one.

I'm using this text as an example:
Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.
The method should return, for example, the same text with the repeated words marked (I put them in bold here, but that's not necessarily how I will mark them):
Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.

Answer:

Solution:

Using regular expressions, you can solve this with small, very expressive code and few ifs: in fact, only one if and only one loop:

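// imports needed by the class that contains this method (AnalisaTexto in the tests)
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public String assinalaRepetidas(String texto, String marcadorInicio, 
                                            String marcadorFim, int qtdPalavrasAnalisar) {

    // \p{L}+ matches a run of Unicode letters, i.e. one whole word
    String palavraInteiraPattern = "\\p{L}+"; 
    Pattern p = Pattern.compile(palavraInteiraPattern);
    // the matcher keeps scanning the original text, even after texto is reassigned below
    Matcher matcher = p.matcher(texto);

    ArrayList<String> palavras = new ArrayList<String>();          // words seen so far
    ArrayList<String> palavrasRepetidas = new ArrayList<String>(); // words already marked
    
    // stop once qtdPalavrasAnalisar words have been analyzed
    while (matcher.find() && palavras.size() < qtdPalavrasAnalisar) {
        
        String palavra = matcher.group();

        // mark a word only the first time it shows up again
        if (palavras.contains(palavra) && !palavrasRepetidas.contains(palavra)) {
            // wrap every whole-word occurrence in the text with the markers
            texto = texto.replaceAll(
                    String.format("\\b%s\\b", palavra), 
                    String.format("%s%s%s", marcadorInicio, palavra, marcadorFim));

            palavrasRepetidas.add(palavra);
        }
        palavras.add(palavra);
    }
    return texto;
}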

And that's all!

Below, some explanation and also the consumer code.

Explaining the solution:

I used a regular expression to get every word in the text, ignoring spaces, parentheses, symbols, commas and other punctuation that are not part of a word. The regular expression that does this in Java for accented text is \p{L}+ , which matches one or more Unicode letters.
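
To make the word-matching step concrete, here is a minimal sketch that only extracts the words with \p{L}+ . The class name WordExtractor and the method name words are made up for this illustration; they are not part of the solution above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, used only to illustrate the \p{L}+ pattern.
class WordExtractor {
    // \p{L} matches any Unicode letter, so accented words stay in one piece.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group()); // punctuation and spaces are never part of a match
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hoje em dia, é necessário ser esperto."));
    }
}
```

Unlike the [a-zA-Zà-úÀ-Ú]+ pattern from the question, this does not depend on a hand-written range of accented characters.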

In the same loop that collects the words found by the regular expression, I replace each repeated word with itself wrapped in the markers.

The consumer code (unit test) looked like this:

@Test
public void assinalaPrimeirasPalavrasRepetidas() {
  String texto = "Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.";
  String esperado = "Hoje em [dia], é necessário ser esperto. O nosso [dia] a [dia] é complicado.";
    
  assertEquals(esperado, new AnalisaTexto().assinalaRepetidas(texto, "[", "]", 10));
}

Although the question says it wants only the first 10 words, the expected result in its example seems to consider all of them. So I added an overload that does not need the number of words (the "radius") to analyze:

public String assinalaPalavrasRepetidas(String texto, String marcadorInicio, String marcadorFim) {
    return assinalaRepetidas(texto, marcadorInicio, marcadorFim, Integer.MAX_VALUE);
}

Using this other method, since more than 10 words are analyzed, "é" is also identified as repeated:

@Test
public void assinalaTodasPalavrasRepetidas() {
  String texto = "Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.";
  String esperado = "Hoje em [dia], [é] necessário ser esperto. O nosso [dia] a [dia] [é] complicado.";
    
  assertEquals(esperado, new AnalisaTexto().assinalaPalavrasRepetidas(texto, "[", "]"));
}
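
Both methods above look at the first N words of the text (or at all of them). If the requirement is read as a sliding window instead, where two equal words are marked only when they are at most N words apart, one possible sketch is the following. It is a different technique from the replaceAll approach above: it records where each word starts and ends, so both the first and the later occurrence can be marked. The class name WindowedRepeatMarker, the method name mark, and the reading of "distance" as the difference between word positions are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a sliding-window variant; not the method shown above.
class WindowedRepeatMarker {
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    /** Wraps every word that has an equal word at most `radius` words away. */
    static String mark(String text, String open, String close, int radius) {
        // First pass: record each word together with its position in the text.
        List<String> words = new ArrayList<>();
        List<int[]> spans = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
            spans.add(new int[] {m.start(), m.end()});
        }

        // Second pass: compare each word with the previous `radius` words
        // and flag both occurrences when they are equal (case-sensitive).
        boolean[] repeated = new boolean[words.size()];
        for (int i = 0; i < words.size(); i++) {
            for (int j = Math.max(0, i - radius); j < i; j++) {
                if (words.get(i).equals(words.get(j))) {
                    repeated[i] = true;
                    repeated[j] = true;
                }
            }
        }

        // Third pass: rebuild the text, adding markers around flagged words.
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int i = 0; i < words.size(); i++) {
            int[] span = spans.get(i);
            out.append(text, last, span[0]); // text between the previous word and this one
            if (repeated[i]) {
                out.append(open).append(words.get(i)).append(close);
            } else {
                out.append(words.get(i));
            }
            last = span[1];
        }
        return out.append(text, last, text.length()).toString();
    }
}
```

For example, mark("a b a c d e a", "[", "]", 2) gives "[a] b [a] c d e a": the last "a" is four words away from the previous one, so it is left unmarked.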

Finally, note that I also used a regular expression when replacing words with their marked equivalents: see the \b word boundaries in the pattern passed to texto.replaceAll . Without them, part of another word that happens to match would also be flagged. For example, in the text of the test below, "servir" would be marked as "[ser]vir" .

The test that shows this precaution works is:

@Test
public void assinalaApenasPalavraInteira() {
    
    String texto = "Hoje em dia, pode ser necessário servir ao ser esperto.";
    String esperado = "Hoje em dia, pode [ser] necessário servir ao [ser] esperto.";
    
    assertEquals(esperado, new AnalisaTexto().assinalaPalavrasRepetidas(texto, "[", "]"));
}
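
As a variation on the whole-word replacement, the sketch below uses lookarounds on \p{L} instead of \b , so it does not depend on how \b treats accented letters, and it quotes both the word and the markers so that characters such as $ are taken literally. The class name WholeWordMarker and the method name markWord are hypothetical; this is an alternative under those assumptions, not the code used in the solution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating whole-word replacement without \b.
class WholeWordMarker {
    static String markWord(String text, String word, String open, String close) {
        // The lookarounds require that no letter touches the word on either side.
        Pattern wholeWord = Pattern.compile(
                "(?<!\\p{L})" + Pattern.quote(word) + "(?!\\p{L})");
        // quoteReplacement stops '$' and '\' in the markers from being
        // interpreted as group references in the replacement string.
        String replacement = Matcher.quoteReplacement(open + word + close);
        return wholeWord.matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(markWord("pode ser servir ao ser", "ser", "[", "]"));
    }
}
```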