I would like to know how best to return the most frequent occurrences of substrings in a string containing text. Example:
$texto = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";
And the output:
array(
    "PHP" => 2,
    "de" => 2,
    //...
);
The idea is to return an array with the most used words in a given string.
I'm currently using the substr_count() function, but the problem is that it only works if you already pass it the word to be checked, meaning I would need to know the words of the text in advance and check them one by one.
Is there any other way to do this?
Try it like this:
print_r(array_count_values(str_word_count($texto, 1, "óé")));
Array ( [Hoje] => 1 [nós] => 1 [vamos] => 1 [falar] => 1 [de] => 2 [PHP] => 2 [é] => 1 [uma] => 1 [linguagem] => 1 [criada] => 1 [no] => 1 [ano] => 1 )
To understand how array_count_values works, see the PHP manual.
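Since the question asks for the most used words, the counts still need to be sorted. Here is a minimal sketch of that extra step, assuming the same counting approach as above. The topWords function and its $limit parameter are hypothetical names introduced only for illustration:

```php
<?php
// Hypothetical helper (not part of the original answer):
// returns the $limit most frequent words of $texto.
function topWords(string $texto, int $limit = 2): array
{
    // Same counting step as above; "óé" lists the accented
    // characters that occur in the sample text.
    $counts = array_count_values(str_word_count($texto, 1, "óé"));

    // Sort by count, highest first, keeping the words as keys.
    arsort($counts);

    // Keep only the first $limit entries (keys preserved).
    return array_slice($counts, 0, $limit, true);
}

$texto = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";
print_r(topWords($texto));
```

For the sample text this should return only "de" and "PHP", each with a count of 2. The relative order of words with equal counts may depend on the PHP version.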
A smarter solution (independent of language)
With the previous solution it is necessary to specify the entire set of UTF-8 special characters that may appear in the text (just as was done with "óé").
A more complicated solution follows; it eliminates the special-character-set problem.
$text = str_replace(".", "", "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...");
$namePattern = '/[\s,:?!]+/u';
$wordsArray = preg_split($namePattern, $text, -1, PREG_SPLIT_NO_EMPTY);
$wordsArray2 = array_count_values($wordsArray);
print_r($wordsArray2);
In this solution I use a regular expression to split the text into words and then array_count_values to count them. The result is:
Array ( [Hoje] => 1 [nós] => 1 [vamos] => 1 [falar] => 1 [de] => 2 [PHP] => 2 [é] => 1 [uma] => 1 [linguagem] => 1 [criada] => 1 [no] => 1 [ano] => 1 )
This solution also meets the need; however, the dots must be removed before splitting the words, otherwise the result will contain both words with a trailing . and the same words without it. For example:
... [PHP.] => 1 [PHP] => 1 ...
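One possible way to avoid the separate dot-removal step is to match the words themselves instead of splitting on separators. This is only a sketch, under the assumption that a "word" is any run of Unicode letters. That means digits are ignored and hyphenated or apostrophized words are split, which may or may not suit your text:

```php
<?php
$text = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";

// \p{L} matches any Unicode letter; the /u modifier makes the
// pattern treat the subject as UTF-8, so "nós" and "é" stay intact.
preg_match_all('/\p{L}+/u', $text, $matches);

// $matches[0] holds every full match, i.e. every word found.
$counts = array_count_values($matches[0]);
print_r($counts);
```

With this approach, punctuation such as . never becomes part of a word, so [PHP.] and [PHP] are no longer counted separately.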
Counting words is never that simple a task. You need to know the string whose words you want to count well before settling on a definitive solution.