Question:
I would like to know the best way to find the most frequent words in a string of text. Example:
$texto = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";
And the output:
array(
"PHP" => 2,
"de" => 2,
//...
);
The idea is to return an array with the most used words in a given string.
I'm currently using the substr_count() function, but it only works if you pass it the word to look for, meaning I would need to know the words in the text beforehand and check them one by one.
Is there any other way to do this?
Answer:
Try it like this:
print_r(array_count_values(str_word_count($texto, 1, "óé")));
Result:
Array (
[Hoje] => 1
[nós] => 1
[vamos] => 1
[falar] => 1
[de] => 2
[PHP] => 2
[é] => 1
[uma] => 1
[linguagem] => 1
[criada] => 1
[no] => 1
[ano] => 1
)
To understand how array_count_values works, see the PHP manual.
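Since the question asks for the most used words, the counts can then be sorted from highest to lowest. A minimal sketch, assuming the same $texto as in the question; the names $contagem and $maisUsadas and the cutoff of 3 are only illustrative and not part of the original answer:

```php
<?php
$texto = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";

// Count each word, exactly as in the one-liner above.
$contagem = array_count_values(str_word_count($texto, 1, "óé"));

// Sort by count, highest first, keeping the words as keys.
arsort($contagem);

// Keep only the three most frequent entries (3 is an arbitrary choice).
$maisUsadas = array_slice($contagem, 0, 3, true);

print_r($maisUsadas);
```

Here "de" and "PHP", each with a count of 2, come before the words that appear only once.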
Edit
A smarter solution (independent of language)
With the previous solution you have to list every accented UTF-8 character the text may contain (as was done with ó and é).
A more complicated solution follows that eliminates this problem.
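// Remove the dots first so that "PHP." and "PHP" count as the same word.
$text = str_replace(".","", "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...");
// Split on runs of whitespace and common punctuation; the u modifier treats pattern and text as UTF-8.
$namePattern = '/[\s,:?!]+/u';
$wordsArray = preg_split($namePattern, $text, -1, PREG_SPLIT_NO_EMPTY);
// Count how many times each word occurs.
$wordsArray2 = array_count_values($wordsArray);
print_r($wordsArray2);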
In this solution I use a regular expression to split the text into words and then array_count_values to count them. The result is:
Array
(
[Hoje] => 1
[nós] => 1
[vamos] => 1
[falar] => 1
[de] => 2
[PHP] => 2
[é] => 1
[uma] => 1
[linguagem] => 1
[criada] => 1
[no] => 1
[ano] => 1
)
This solution also does the job, but the dots must be removed before splitting the words; otherwise the same word shows up in the result both with and without a trailing dot. For example:
...
[PHP.] => 1
[PHP] => 1
...
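One possible way to avoid stripping the dots by hand is to match runs of letters instead of splitting on separators. This is only a sketch and not part of the original answer; it assumes PCRE is built with Unicode support, so that the u modifier and \p{L} are available:

```php
<?php
$text = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";

// \p{L}+ matches one or more Unicode letters, so punctuation
// such as "." never ends up attached to a word.
preg_match_all('/\p{L}+/u', $text, $matches);

// $matches[0] holds every full match, i.e. every word found.
$wordsArray2 = array_count_values($matches[0]);
print_r($wordsArray2);
```

The trade-off is that anything that is not a letter ends a word, so digits are dropped and hyphenated words are split in two. Whether that is acceptable depends on the text being counted.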
Counting words is never that simple a task. You need to know the string whose words you want to count well before settling on a definitive solution.