python – How do you isolate key concepts from multiple natural language descriptions?


Good afternoon.

Which approaches and Python libraries can be used to solve the following problem?

I have a large number of similar job descriptions. For each vacancy, I can extract the block that lists the requirements for applicants. This gives 5-15 lines that look something like this:

  • confident knowledge of Python;
  • familiarity with Django, Tornado, Twisted;
  • understanding and ability to apply design patterns;
  • experience in setting up, optimizing and working with MySQL, PostgreSQL, MongoDB;
  • ability to work not only with ORM, but also with “naked” SQL;
  • experience in the development of high-load projects;
  • love of testing code;
  • a frantic desire to constantly learn new things and learn;
  • Experience from 2 years;
  • reading documentation and literature in English;
  • sociability, responsibility, ability to work in a team.

I would like to analyze this whole set of vacancies and find out:

  1. Which knowledge, skills, and technologies are most in demand.
  2. Which of them most often appear together. For example, Tornado and Twisted often appear on the same line. Other words may often occur within the same vacancy, but on different lines.

So far, the only approach that comes to mind is to split everything into separate words and somehow analyze how often individual words and their combinations occur.

How exactly should I estimate the frequency of co-occurrence so that it takes into account how close together or far apart the words are in the text?
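One way to make "close" and "far" concrete, as a rough sketch rather than a finished method, is to count word pairs at two levels: pairs that share a line and pairs that merely share a vacancy. The sample data and the names `tokens` and `pair_counts` are made up for illustration, and the code assumes each vacancy has already been split into requirement lines.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical sample data: each vacancy is a list of requirement lines.
vacancies = [
    ["confident knowledge of Python",
     "familiarity with Django, Tornado, Twisted"],
    ["experience with Python and Django",
     "experience with Tornado, Twisted"],
]

TOKEN = re.compile(r"[a-z]+", re.IGNORECASE)


def tokens(text):
    """Return the set of lowercased Latin-letter words found in text."""
    return {word.lower() for word in TOKEN.findall(text)}


def pair_counts(vacancies):
    """Count word pairs that share a line and pairs that share a vacancy."""
    same_line = Counter()
    same_vacancy = Counter()
    for lines in vacancies:
        vacancy_words = set()
        for line in lines:
            words = tokens(line)
            vacancy_words |= words
            # Sorting gives every pair one canonical order: (a, b) with a < b.
            same_line.update(combinations(sorted(words), 2))
        same_vacancy.update(combinations(sorted(vacancy_words), 2))
    return same_line, same_vacancy


same_line, same_vacancy = pair_counts(vacancies)
print(same_line[("tornado", "twisted")])    # 2: same line in both vacancies
print(same_line[("django", "python")])      # 1: same line in one vacancy only
print(same_vacancy[("django", "python")])   # 2: same vacancy in both
```

A pair with a high `same_line` count is tightly related, while a pair that scores high only in `same_vacancy` is related more loosely. How to weight the two counts against each other is a judgment call that this sketch leaves open.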

What other approaches are there?

How can I solve the following problems:

  1. Filter out conjunctions, prepositions, and common words. Almost every vacancy mentions "understanding" of something, "experience" with something, or the "ability" to do something. As a result, such words occur very frequently, but they are of no value for the analysis.

  2. Take into account that the same concept can be expressed in different words. For example: "testing", "writing tests", "unit tests", "unit testing".
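Both problems can be approached, at least as a first pass, with two hand-maintained tables: a stop list of words to ignore and a map from phrases to one canonical concept. The sketch below assumes English text; `STOP_WORDS`, `SYNONYMS`, `normalize`, and `concept_counts` are illustrative names, and the table contents are examples only. A library such as NLTK ships ready-made stop-word lists and stemmers that could replace part of the manual work.

```python
import re
from collections import Counter

# Assumption: a hand-written stop list; extend it as noise words show up.
STOP_WORDS = {
    "a", "the", "of", "and", "to", "with", "in",
    "understanding", "experience", "ability", "knowledge",
}

# Assumption: a hand-written map from phrases to one canonical concept.
SYNONYMS = {
    "unit testing": "testing",
    "unit tests": "testing",
    "writing tests": "testing",
}


def normalize(line):
    """Lowercase a line, fold synonyms into one concept, drop stop words."""
    text = line.lower()
    # Replace longer phrases first so a shorter phrase cannot split them.
    for phrase in sorted(SYNONYMS, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, SYNONYMS[phrase], text)
    words = re.findall(r"[a-z]+", text)
    return [word for word in words if word not in STOP_WORDS]


def concept_counts(lines):
    """Count in how many lines each concept appears (once per line)."""
    counts = Counter()
    for line in lines:
        counts.update(set(normalize(line)))
    return counts


lines = [
    "love of testing code",
    "unit tests",
    "experience of unit testing and writing tests",
]
print(concept_counts(lines)["testing"])  # 3
```

Counting each concept once per line keeps a line that says the same thing twice from inflating the total.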


When I was just learning to program in Python, I wrote a script like this to find the most common words in a text file:

import re

normal_dict = {}  # maps each word to the number of times it occurs

def readfilelines(filename, dictionary):
    """Count every Latin-letter word in the file, updating dictionary in place."""
    with open(filename, "r") as file:
        for line in file:
            line = line.strip()
            result = re.findall(r'\b[a-z]+\b', line, re.IGNORECASE)

            for word in result:
                if word in dictionary:
                    dictionary[word] += 1
                else:
                    # First occurrence of this word
                    dictionary[word] = 1

def writefile(filename, pairs):
    """Write (word, count) pairs to the file as "word - count", one per line."""
    with open(filename, "w") as file:
        for k, v in pairs:
            s = str(k) + " - " + str(v) + "\n"
            file.write(s)

def sort_dictionary_by_value(dictionary):
    """Return a list of (word, count) pairs, most frequent first."""
    sorted_dict = [(k, dictionary[k]) for k in sorted(dictionary, key=dictionary.get, reverse=True)]
    return sorted_dict

readfilelines("input.txt", normal_dict)
sorted_dict = sort_dictionary_by_value(normal_dict)
writefile("output.txt", sorted_dict)

You give it a text file as input, and as output you get a file that lists the words from most frequent to least frequent in this format:

Word - N (where N is how many times the word occurred)

P.S. Each word is written on its own line.

I think you can try changing the regex from \w+ to [a-z]+ with re.IGNORECASE in order to look specifically for frameworks and technologies, because their names are usually written in Latin letters (except for 1C, although the C in it may be typed as a Latin letter as well).
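To illustrate the difference between the two patterns, here is a small sketch on a made-up Russian requirement line. `ALNUM_NAME` is a hypothetical wider pattern, not part of the script above, that also accepts digits so that names such as 1C or Python3 survive while bare numbers are skipped.

```python
import re

# Made-up sample line: "experience with Django and PostgreSQL"
line = "опыт работы с Django и PostgreSQL"

ANY_WORD = re.compile(r"\w+")
LATIN_WORD = re.compile(r"\b[a-z]+\b", re.IGNORECASE)
# Assumption: letters and digits, with at least one Latin letter in the token.
ALNUM_NAME = re.compile(r"\b(?=\d*[a-z])[a-z0-9]+\b", re.IGNORECASE)

print(ANY_WORD.findall(line))
# ['опыт', 'работы', 'с', 'Django', 'и', 'PostgreSQL']
print(LATIN_WORD.findall(line))
# ['Django', 'PostgreSQL']

# "1C" has no word boundary between the digit and the letter,
# so the letters-only pattern skips it entirely.
print(LATIN_WORD.findall("1C"))
# []

# Made-up sample line: "experience from 2 years, 1C, Python3"
print(ALNUM_NAME.findall("опыт от 2 лет, 1C, Python3"))
# ['1C', 'Python3']
```

Names containing punctuation, such as C++ or Node.js, are still split or truncated by all three patterns and would need separate handling.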

In any case, one option is to write all the requirements into a single file and then run this script on it with the modified regular expression. The output file will then list frameworks and technologies from the most popular to the least in demand.

As a test, I copied the descriptions of several vacancies into a file and ran the script on it with the regular expression r'\b[a-z]+\b' and re.IGNORECASE.

Here is a link to the input and output
