How to return most common words from a text with PHP?

Question:

I would like to know the best way to find the most frequent words in a string of text. Example:

$texto = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";

And the output:

array(
    "PHP" => 2,
    "de" => 2,
    //...
);

The idea is to return an array with the most used words in a given string.

I'm currently using the substr_count() function, but it only works when you pass it the word to look for, so I would need to know the words in the text beforehand and check them one by one.

Is there any other way to do this?

Answer:

Try it like this:

print_r(array_count_values(str_word_count($texto, 1, "óé")));

Result:

Array ( 
   [Hoje] => 1 
   [nós] => 1 
   [vamos] => 1 
   [falar] => 1 
   [de] => 2 
   [PHP] => 2 
   [é] => 1 
   [uma] => 1 
   [linguagem] => 1 
   [criada] => 1 
   [no] => 1 
   [ano] => 1 
)

To understand how array_count_values works, see the PHP manual.
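
The question asks for the most used words, while the call above returns the counts in order of first appearance. A minimal sketch of one way to rank them follows, assuming the same counting step as above. The function name mostFrequentWords and its $limit parameter are hypothetical and only for illustration.

```php
<?php
// Hypothetical helper (not part of the original answer):
// returns the $limit most frequent words with their counts.
function mostFrequentWords(string $text, int $limit = 5): array
{
    // Same counting step as above; "óé" lists the extra letters to accept.
    $counts = array_count_values(str_word_count($text, 1, "óé"));

    // Sort by count, highest first, keeping the words as keys.
    arsort($counts);

    // Keep only the first $limit entries.
    return array_slice($counts, 0, $limit, true);
}

$texto = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";
print_r(mostFrequentWords($texto, 2));
```

For the sample text this keeps the two words that occur twice, de and PHP. The order among words with equal counts may vary between PHP versions.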

Edit

A smarter solution (independent of language)

With the previous solution, every special (non-ASCII) UTF-8 character that can occur in the text has to be listed in the third argument of str_word_count, just as was done with ó and é.

A slightly more complicated solution follows that eliminates the special-character problem.

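// Remove the periods first so that "PHP." and "PHP" are counted as the same word.
$text = str_replace(".", "", "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...");
// Split on runs of whitespace and common punctuation; the u modifier enables UTF-8 handling.
$namePattern = '/[\s,:?!]+/u';
// PREG_SPLIT_NO_EMPTY drops empty pieces, such as the one left after the trailing space.
$wordsArray = preg_split($namePattern, $text, -1, PREG_SPLIT_NO_EMPTY);
// Count how many times each word occurs.
$wordsArray2 = array_count_values($wordsArray);
print_r($wordsArray2);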

In this solution I use a regular expression to split the text into words and then array_count_values to count them. The result is:

Array 
( 
  [Hoje] => 1 
  [nós] => 1 
  [vamos] => 1 
  [falar] => 1 
  [de] => 2 
  [PHP] => 2 
  [é] => 1 
  [uma] => 1 
  [linguagem] => 1 
  [criada] => 1 
  [no] => 1 
  [ano] => 1 
)

This solution also meets the need. However, the periods must be removed before splitting the words; otherwise the result will contain the same word both with and without a trailing period. For example:

  ...
  [PHP.] => 1 
  [PHP] => 1 
  ...
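
One way to avoid stripping punctuation beforehand is to match the words themselves instead of splitting on separators. The sketch below is an alternative technique, not the approach used above. It assumes the input is valid UTF-8, and the function name countWords is hypothetical.

```php
<?php
// Hypothetical helper (not part of the original answer):
// counts words by matching runs of Unicode letters.
function countWords(string $text): array
{
    // \p{L}+ matches runs of letters in any script; the u modifier
    // treats the pattern and the subject as UTF-8.
    preg_match_all('/\p{L}+/u', $text, $matches);

    // $matches[0] holds every full match, in order of appearance.
    return array_count_values($matches[0]);
}

$texto = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";
print_r(countWords($texto));
```

Because only letters are matched, "PHP." and "PHP" are counted as the same word without any prior replacement. The trade-off is that hyphenated words and words with apostrophes are split into their parts, and digits are ignored. Words that differ only in case are still counted separately.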

Counting words is never that simple a task. You need to know the string whose words you want to count well before settling on a definitive solution.
