file – An algorithm to find duplicate files?

Question:

I need to write a program that finds duplicate files on my computer, so that the user can decide what to do with them (e.g., delete the copies). For now, I only care about a binary comparison between files (i.e., a file counts as a duplicate only if it is 100% identical to another).

I know that searching by file name alone is insufficient, as the same file may have been saved under another name.

Is there any algorithm to compare the files?

I imagine that generating a checksum for every file and comparing all of them against each other is wasteful, since most files will not have any duplicate. I also imagine that file size alone is not enough. And there may be cases where a file has more than one copy.

Answer:

Step by step:

  1. list every file with its basic information: location (disk/directory), name, date and size;
  2. group the files that have the same name (exactly the same, including upper and lower case);
  3. likewise, group the files that have the same size (in bytes);
  4. discard the "non-repeated" files (those that share neither name nor size with any other file);
  5. take the "repeated level 1" files (same name, size and date), compute a checksum for each file within each group separately, and mark the ones that are REALLY equal;
  6. take the "repeated level 2" files (same name or size, and same date), compute the checksums within each group separately, and mark the ones that are REALLY equal;
  7. take the "repeated level 3" files (same name and size, different date), compute the checksums within each group separately, and mark the ones that are REALLY equal;
  8. take the "repeated level 4" files (same name or size, different date), compute the checksums within each group separately, and mark the ones that are REALLY equal;
  9. present each group of REALLY equal files to the user, so that they can decide which ones will be deleted.
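As a rough illustration, the core of these steps could be sketched in Python as below. This is a simplified sketch under stated assumptions, not the exact procedure above: it groups by size only, because two byte-identical files necessarily have the same size, so name and date would only serve to prioritise which groups to check first. The function names `file_digest` and `find_duplicates`, the choice of SHA-256, and the chunk size are all assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return a list of groups (sorted lists of paths) with equal digests."""
    # Steps 1 and 3: list regular files and group them by size in bytes.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        # Step 4: a file with a unique size cannot have a duplicate.
        if len(paths) < 2:
            continue
        # Steps 5 to 8: compute checksums only inside each same-size group.
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return groups
```

Calling `find_duplicates("some/folder")` returns one list per set of matching files, which maps onto step 9. Because only files that share a size are ever hashed, most files are never read at all. If a strict 100% guarantee is required, each reported group could additionally be confirmed with a byte-by-byte comparison.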

I suggest adding some options: let the user open the location of each file, open a file in the default editor to view its content, and move the chosen duplicate to a specific folder.

Another option that I believe would be very useful: when a single file is selected (in Windows, for example), the program could be invoked from the context menu (right mouse click) to find any duplicates of the selected file.
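The single-file search could look roughly like the sketch below. It assumes the selected file and a folder to search are passed in as arguments; the function name `find_copies_of` is hypothetical, and registering the program in the context menu is platform-specific and not shown. Since there is only one reference file, it skips checksums and compares the bytes directly with the standard library's `filecmp.cmp`.

```python
import filecmp
import os


def find_copies_of(target, root):
    """Return the paths under root whose bytes are identical to target."""
    target = os.path.abspath(target)
    target_size = os.path.getsize(target)
    copies = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.abspath(os.path.join(dirpath, name))
            # Skip the selected file itself and anything that is not a file.
            if path == target or not os.path.isfile(path):
                continue
            # Cheap filter first: files of different sizes can never match.
            if os.path.getsize(path) != target_size:
                continue
            # shallow=False makes filecmp compare the actual file contents.
            if filecmp.cmp(target, path, shallow=False):
                copies.append(path)
    return sorted(copies)
```

Only candidates of the same size as the selected file are opened, so the search stays fast even on large folders.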

Keep in mind that this approach ignores the contents of compressed archives: if a duplicate is inside a ZIP/RAR file, it will never be evaluated and therefore never reported as repeated (state this in the usage instructions of your future application). And then send me a copy to test 😉
