python – Identify common themes among word sets

Question:

After a lot of searching, filtering, parsing, stemming and other analysis, I arrived at the top 10 words for each talk on ted.com. They are quite distinctive: across the 2346 word sets, at most 50 words are repeated.

The task is to identify which of the 2346 sets of 10 words are similar to each other. A simple pairwise intersection of the sets leads nowhere (the intersection has only 1-3 words), so synonyms have to be taken into account. I tried to work it out with gensim and nltk but got nowhere, and I could not get LSA set up.

Please point me in the right direction on this difficult problem.

Answer:

If you want to do without word2vec and the like, you can try an algorithm along these lines:

  1. For each source word from your 2346 sets, build a frequency table of the other words that occur in the same sentences as that word. You do not even need to take all the words, only the nearest ones before and after the source word.
  2. For each word found this way, collect its neighbouring words (i.e. its context) in the same manner, then merge them into a single list sorted by frequency and filtered so that only words present in your 2346 sets remain.
  3. The top 10 of this list will consist of words that can stand in for the source word.
  4. PROFIT
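
A minimal sketch of steps 1-3, assuming the talk transcripts are already split into sentences and tokenized. The function names, the `window` parameter and the toy data are made up for illustration; treat it as one possible reading of the steps, not a tuned implementation.

```python
from collections import Counter, defaultdict


def context_counts(sentences, window=2):
    """Step 1: for every word, count the words seen within `window`
    positions of it inside the same sentence."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def replacement_candidates(word, counts, vocabulary, top_n=10):
    """Steps 2-3: merge the contexts of the word's neighbours, keep only
    words from `vocabulary`, and return the most frequent ones."""
    merged = Counter()
    for neighbour in counts.get(word, {}):
        for candidate, freq in counts.get(neighbour, {}).items():
            if candidate != word and candidate in vocabulary:
                merged[candidate] += freq
    return [w for w, _ in merged.most_common(top_n)]


# Toy data (hypothetical): two tokenized sentences and a tiny vocabulary
# standing in for the union of the 2346 word sets.
sentences = [
    ["brain", "neuron", "memory"],
    ["mind", "neuron", "memory"],
]
vocabulary = {"brain", "mind", "memory"}

counts = context_counts(sentences, window=1)
candidates = replacement_candidates("brain", counts, vocabulary)
print(candidates)
```

Once every word has its list of candidates, two word sets can be compared after expanding each word with its candidates, so that sets using different but interchangeable words start to overlap.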

Before that, of course, all the text needs to be run through stemming and the other usual preprocessing.

As for the simple intersection of sets, here is another idea: intersect not the words themselves but their extracted roots (stems), so that words sharing a root count as a match.
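
A rough sketch of the root-based intersection. The `crude_stem` helper is a deliberately naive suffix stripper invented for this example so that it runs without extra packages; in practice a real stemmer (for instance the Snowball stemmer shipped with nltk) would be a better choice.

```python
def crude_stem(word):
    """Naive illustrative stemmer: strip one common English suffix,
    keeping at least three characters of the word."""
    for suffix in ("ations", "ation", "ing", "ers", "er", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_overlap(set_a, set_b, stem=crude_stem):
    """Intersect two word sets by stem instead of by exact word."""
    return {stem(w) for w in set_a} & {stem(w) for w in set_b}


# Hypothetical top-word sets for two talks.
talk_a = {"teachers", "learning", "school"}
talk_b = {"teacher", "learn", "music"}

plain = talk_a & talk_b                 # exact-word intersection
by_stem = stem_overlap(talk_a, talk_b)  # intersection of stems
print(plain, by_stem)
```

In this toy case the exact-word intersection is empty, while the stem-based one finds two shared roots.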

And finally, a useful link: https://ru.wikipedia.org/wiki/%D0%94%D0%B8%D1%81%D1%82%D1%80%D0%B8%D0%B1%D1%83%D1%82%D0%B8%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D0%B5%D0%BC%D0%B0%D0%BD%D1%82%D0%B8%D0%BA%D0%B0
