python – What is the easiest way to mark up the keywords found by the stemmer in the text? Through the list?


There is a text – a set of sentences. His words are reduced to the basic form by a stemmer.

Input: String. Usually it is about 20 kB of text. An article from the sentences of the Russian text. Encoding (utf8) and other similar parameters can be done by any, it is hardly important.

Example "Oleg has a friend Oleg"

After the stemmer works, you will get an object that is not yet clear in what to store. For example, it can be a list or a tuple: ['U', 'Oleg', 'appear', 'Oleg'] processing this list, we find the most frequent word

At the output: you need to get a line (of the source text) in which the keywords are wrapped in html tags, for example, with bold text tags: «У <b>Олега</b> появился друг <b>Олег</b>»

What is the best way to do this? The only thing that comes to mind is to insert the source words into the list, then after the stemmer, knowing the number of the word in the list, wrap it with tags. And then go back from the list to the regular text.

Maybe there is a simpler option which one?


Since the stemmer is aimed at finding the stem of a word, and the stem of a word is an immutable part of a word, in my opinion, highlighting text using a regular expression would be a good approach. The regularity for the search will look something like this [^\s]*Олег[^\s]* . To wrap matching expressions in tags, you will need to use groups.

Scroll to Top