Question:
I need to write a program that finds duplicate files on my computer, so that the user can decide what to do with them (e.g., delete the copies). For now, I only care about a binary comparison between files (i.e., a file counts as a duplicate only if it is 100% identical to another).
I know that searching by file name alone is insufficient, as the same file may have been saved under another name.
Is there any algorithm to compare the files?
I imagine that generating a checksum for every file and comparing all of them against each other is inefficient, since it is unusual to have that many duplicate files. I also imagine that file size alone is not enough. And there may be cases where a file has more than one copy.
Answer:
Step by step:
- list every file with its basic information: location (disk/directory), name, date and size;
- group files that have the same name (exactly the same, including upper and lower case);
- likewise, group files that have the same size (in bytes);
- discard the "non-repeated" files (those that share neither name nor size with any other file);
- select the "repeated level 1" files (same name, size and date), compute a checksum for each file within each group, and mark the ones that are REALLY equal;
- select the "repeated level 2" files (same name or size, with the same date), compute a checksum for each file within each group, and mark the ones that are REALLY equal;
- select the "repeated level 3" files (same name and size, with a different date), compute a checksum for each file within each group, and mark the ones that are REALLY equal;
- select the "repeated level 4" files (same name or size, with a different date), compute a checksum for each file within each group, and mark the ones that are REALLY equal;
- once you have the files that are REALLY the same, present each group to the user so that they can decide which ones to delete.
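The steps above can be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions, not a definitive implementation: it groups by size alone (two files of different sizes can never be byte-identical, so name and date only help with ordering the work, not with deciding equality), it uses SHA-256 as the checksum, and it confirms each match byte by byte with `filecmp.cmp(..., shallow=False)`. The names `file_digest`, `split_identical` and `find_duplicates` and the chunk size are made up for illustration, and unreadable files are not handled.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def split_identical(paths):
    """Partition paths into lists of files that are equal byte for byte."""
    remaining = list(paths)
    while remaining:
        first, rest = remaining[0], remaining[1:]
        same = [first] + [p for p in rest
                          if filecmp.cmp(first, p, shallow=False)]
        yield same
        remaining = [p for p in rest if p not in same]


def find_duplicates(root):
    """Return a list of groups (sorted lists of paths) of identical files."""
    # List regular files under root and group them by size.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a size that occurs once cannot have a duplicate
        # Checksum only the files that share a size.
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        # Confirm equal checksums with a full byte comparison.
        for candidates in by_digest.values():
            for same in split_identical(candidates):
                if len(same) > 1:
                    groups.append(sorted(same))
    return groups
```

Calling `find_duplicates("some/folder")` returns one list per set of identical files, so a file that was copied several times shows up as a single group with all of its copies.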
I suggest you add some options: let the user open the location of each file, open the file in the default editor to view its content, and move the "chosen/repeated" file to a specific folder.
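As a rough illustration of the "move to a specific folder" option, here is a small Python sketch. The function name `move_to_folder` and the " (1)", " (2)" renaming scheme are assumptions of mine; the point is only that duplicates often share a name, so a plain move could overwrite a file that was moved earlier.

```python
import os
import shutil


def move_to_folder(path, target_dir):
    """Move path into target_dir without overwriting; return the new path."""
    os.makedirs(target_dir, exist_ok=True)
    base = os.path.basename(path)
    stem, ext = os.path.splitext(base)
    dest = os.path.join(target_dir, base)
    counter = 1
    # Pick a free name such as "report (1).txt" if the name is already taken.
    while os.path.exists(dest):
        dest = os.path.join(target_dir, f"{stem} ({counter}){ext}")
        counter += 1
    shutil.move(path, dest)
    return dest
```

Moving instead of deleting also gives the user a chance to change their mind before the copies are gone for good.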
Another option that I believe is very useful: when a single file is selected (in the Windows environment, for example), the program can be invoked from the context menu (right mouse click) to find any REPEATED copies of the selected file.
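For this single-file case, a sketch along these lines might be enough. With only one reference file there is no need for checksums: compare the size first and read the contents only when the sizes match. The name `find_copies_of` is hypothetical, and registering the program in the Windows context menu is not shown here.

```python
import filecmp
import os


def find_copies_of(selected, root):
    """Return paths under root whose content is identical to selected."""
    selected = os.path.abspath(selected)
    size = os.path.getsize(selected)
    copies = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.abspath(os.path.join(dirpath, name))
            # Skip the selected file itself, symlinks and non-regular files.
            if (path == selected or os.path.islink(path)
                    or not os.path.isfile(path)):
                continue
            # Cheap size check first; compare bytes only when sizes match.
            if (os.path.getsize(path) == size
                    and filecmp.cmp(selected, path, shallow=False)):
                copies.append(path)
    return sorted(copies)
```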
Consider simply ignoring the contents of compressed folders; that is, if a repeated file is inside a ZIP/RAR archive, it will never be evaluated and therefore will never be considered repeated (put this in the usage instructions of your future application). And then send me a copy to test 😉