algorithm – How to determine the degree of similarity of two texts?


Let's say we take an original text of three paragraphs. In a copy of it, the last sentence has been removed entirely, a link address has been changed somewhere in the text, a couple of prepositions have been replaced, and a couple of words have been swapped for synonyms.

What algorithm could conclude that "these texts are 65% similar; most likely they share a common source"? Is there something like wavelet analysis for texts?


In bioinformatics, questions like this (determining the similarity of two different sequences of nucleic acids or proteins, which are in effect texts) are the central problem. They are solved with various alignment algorithms. In your case, you can apply global alignment, the simplest of them. Read more about it at the link provided. If anything is unclear, I can recommend literature.
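
To make the idea concrete, here is a minimal sketch of global alignment in the Needleman-Wunsch style, applied to words rather than nucleotides. The function name `global_alignment_similarity`, the scoring values (match +1, mismatch -1, gap -1), whitespace tokenization, and the "matched columns divided by alignment length" percentage are all my own assumptions for illustration; real tools use tuned scoring schemes, and synonyms would count as mismatches here.

```python
def global_alignment_similarity(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two token sequences and return the fraction of
    alignment columns that are exact matches (0.0 to 1.0).

    Scoring values are illustrative assumptions, not tuned parameters.
    """
    n, m = len(a), len(b)

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only

    def pair(i, j):
        # Score for placing a[i-1] and b[j-1] in the same column
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align the two tokens
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back one optimal alignment, counting columns and exact matches
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            if a[i - 1] == b[j - 1]:
                matches += 1
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1

    # Two empty sequences are treated as identical
    return matches / columns if columns else 1.0


original = "the quick brown fox jumps".split()
copy = "the quick red fox".split()
print(global_alignment_similarity(original, copy))
```

In this toy example the alignment has five columns, three of which are exact matches, so the result is 0.6. The table takes time and memory proportional to the product of the two lengths, so for long texts it may be more practical to align sentences or paragraphs instead of individual words.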
