algorithm – Finding a pattern among strings


I have a huge number of strings, and I know they are built from a limited set of patterns, but I don't know which ones. I need to compute these patterns fairly quickly on a very large dataset.

For example, this set of templates (unknown to me):

"Hello, my {0} is: {1}"  
"{0} has left our team."  
"{0} has joined the team {1}!"  

matches the set of lines below, which is all the data I have:

Hello, my name is: Slim Shady  
Mike has left our team.  
Mike has joined the team Liquid!  
Hello, my name is Nikole  
Nick has joined the team Spirit!  
Elizabeth has left our team.  

Task: find the patterns behind similar strings, i.e. discover regularities in the text that let the program derive such templates automatically. There can be a very large number of lines.

So far, I've come up with something like edit distance computed over N-grams.
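A minimal sketch of the N-gram idea, assuming the similarity of two strings is approximated by the Jaccard overlap of their character trigram sets (the function names and the choice of n=3 are illustrative, not from the question):

```python
# Approximate string similarity via character N-grams.
# n=3 (trigrams) is an arbitrary illustrative choice.

def ngrams(s, n=3):
    """Set of all character N-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity of the two strings' N-gram sets, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

sim = ngram_similarity("Mike has left our team.", "Elizabeth has left our team.")
# High overlap: both strings share all trigrams of " has left our team."
```

This avoids full edit-distance computation, but as noted below, word-level matching can exploit the data's structure better.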

But that approach is more general than I need, and run time matters no less to me than the accuracy of the discovered patterns.

Since whole words will match exactly and without errors, it seems to me that performance could be improved by exploiting this.

I thought about building an alphabet of words rather than letters, so that sentences are compared word by word.
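That word-alphabet idea could be sketched like this: assign every distinct word an integer ID, so comparisons run over short ID sequences instead of raw characters (the function name is illustrative):

```python
# Sketch: encode each line as a sequence of word IDs.
# Comparing two lines then means comparing short integer lists.

def build_word_alphabet(lines):
    """Assign a unique integer ID to every distinct word; encode each line."""
    alphabet = {}
    encoded = []
    for line in lines:
        ids = []
        for word in line.split():
            if word not in alphabet:
                alphabet[word] = len(alphabet)
            ids.append(alphabet[word])
        encoded.append(ids)
    return alphabet, encoded

lines = [
    "Mike has left our team.",
    "Elizabeth has left our team.",
]
alphabet, encoded = build_word_alphabet(lines)
# The two encoded sequences differ only in their first ID.
```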

UPD: Just in case, I'll clarify that at compile time neither the templates nor the strings themselves exist. The strings arrive as program input, and nobody will give us the templates.

UPD2: We need to discover patterns in real time from the lines seen so far, so that subsequent lines can be matched against them.

(Comment) Suppose we have n sentences Мама наша мыла раму and m sentences Мама мыла мою раму. Should there be one template Мама {0} мыла {1} раму, or two templates: Мама {0} мыла раму and Мама мыла {0} раму? There are many examples like this. To avoid such ambiguities, you either need to state your task more precisely, or specify a concrete criterion so that a unique solution exists.

The pattern should be as general as possible, i.e. Мама {0} мыла {1} раму. But only for similar lines — not a single template {0} covering everything.

(Comment) Then the question is: why not use the template {0}? It is the most general one. The underlying problem seems to be that some ordering has to be introduced over the set of templates, and in general that order is only partial, not linear. It would have to be linearized somehow, which looks like a daunting task. If you explain why you need this, we can think about it further.

That's the whole task :D I need to find patterns for these lines, so that strongly similar lines are merged into templates; you could, for example, take a percentage of similarity as the criterion.

UPD3: A template should merge similar strings. Strings will be considered similar if they match by at least N%.
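The N% criterion could be computed at the word level, for instance with `difflib` from the standard library (the threshold and function name here are assumptions, not part of the question):

```python
# Sketch: two strings are "similar" when the ratio of matching
# word tokens exceeds some threshold N (chosen by the user).
from difflib import SequenceMatcher

def word_similarity(a, b):
    """Similarity ratio in [0, 1], computed over word tokens."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

s1 = "Mike has joined the team Liquid!"
s2 = "Nick has joined the team Spirit!"
sim = word_similarity(s1, s2)
# 4 of 6 words match in each string: ratio = 2*4 / (6+6) = 2/3
```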


If the number of patterns is limited and they are fairly evenly distributed in the data, then there is no need for a very large dataset: you can recover the templates from a smaller sample just as well.

The brute-force solution looks like this:

Come up with a function that extracts words from strings (for example, just split on spaces and strip punctuation).

Then build a frequency dictionary. For example, the word 'Hello' occurs twice in the example set. In addition, you can count occurrences of a word at a given position, so that later you can look not just at a word's overall frequency but at its frequency within a window of, say, 3–5 positions.

Now, for any word, you can decide whether it is part of a template or not: if the word occurs often enough around the current position, it is most likely part of the template.

For each line in the dataset, build a template and keep only the unique ones.

In this case the data has to be processed twice (the first pass builds the frequency dictionary, the second builds the templates), because we first need information about the set as a whole.
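The two-pass idea above can be sketched as follows. This is a minimal illustration with exact positions rather than a position window, a frequency threshold picked arbitrarily, and made-up sample data ("Ann" is not from the question):

```python
# Pass 1: count how often each (position, word) pair occurs.
# Pass 2: rebuild each line, replacing rare words with {i} placeholders.
from collections import Counter

def extract_templates(lines, min_count=2):
    freq = Counter()
    tokenized = [line.split() for line in lines]
    for words in tokenized:                 # pass 1: positional frequencies
        for pos, word in enumerate(words):
            freq[(pos, word)] += 1
    templates = set()
    for words in tokenized:                 # pass 2: build templates
        parts, slot = [], 0
        for pos, word in enumerate(words):
            if freq[(pos, word)] >= min_count:
                parts.append(word)          # frequent -> part of the template
            else:
                parts.append("{%d}" % slot)  # rare -> placeholder
                slot += 1
        templates.add(" ".join(parts))
    return templates

lines = [
    "Mike has left our team.",
    "Elizabeth has left our team.",
    "Nick has joined the team Spirit!",
    "Ann has joined the team Liquid!",
]
templates = extract_templates(lines)
# -> {'{0} has left our team.', '{0} has joined the team {1}'}
```

Note that a name repeated at the same position (e.g. many lines starting with "Mike") would leak into the template under this naive counting — that is exactly where the thresholds mentioned below need tuning.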

A neural network could help you tune the thresholds and get better results.
