Say we take an original text of three paragraphs. In a copy of it, the last sentence has been removed entirely, a link address has been changed somewhere in the text, a couple of prepositions have been replaced, and a couple of words have been swapped for synonyms.
What algorithm can determine that "these texts are 65% similar and most likely share a common source"? Is there something like wavelet analysis for texts?
In bioinformatics, this kind of question (determining how similar two sequences of nucleic acids or proteins are; read: texts) is the central problem. It is solved using various alignment algorithms. In your case, you can apply global alignment, the simplest of them. Read more about it at the link provided. If anything is unclear, I can recommend some literature.
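To make the idea concrete, here is a minimal sketch of global alignment applied to text, using the classic Needleman-Wunsch dynamic-programming scheme. Several things here are my own assumptions rather than anything fixed by the method: the function name `global_alignment_similarity`, splitting the text into words on whitespace, the scores (+1 match, -1 mismatch, -1 gap), and defining "percent similar" as the share of matching columns in the alignment.

```python
def global_alignment_similarity(a, b, match=1, mismatch=-1, gap=-1):
    """Align two texts word by word (Needleman-Wunsch) and return the
    fraction of alignment columns that are exact matches (0.0 to 1.0).
    Scoring values and the similarity definition are illustrative choices."""
    x, y = a.split(), b.split()
    n, m = len(x), len(y)

    # score[i][j] = best score for aligning x[:i] with y[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # x[:i] aligned against nothing
    for j in range(1, m + 1):
        score[0][j] = j * gap          # y[:j] aligned against nothing

    def sub(i, j):
        # Score for pairing word x[i-1] with word y[j-1]
        return match if x[i - 1] == y[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two words
                score[i - 1][j] + gap,            # word deleted from b
                score[i][j - 1] + gap,            # word inserted in b
            )

    # Trace back one optimal alignment, counting columns and matches
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            if x[i - 1] == y[j - 1]:
                matches += 1
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1

    return matches / columns if columns else 1.0


original = "the quick brown fox jumps over the lazy dog"
copy = "the quick brown fox leaps over the lazy"
print(f"{global_alignment_similarity(original, copy):.0%} similar")
```

In this sketch a synonym counts as an ordinary mismatch and a removed sentence shows up as a run of gaps, so both lower the percentage without hiding the common source. The table takes time and memory proportional to the product of the two word counts, which is fine for a few paragraphs but would need a more economical variant for long documents.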