GNU/Linux Desktop Survival Guide
by Graham Williams
Duplicate Files
20191229 A common challenge is to find duplicate files, such as photos, music, or documents. When available disk space becomes tight, it is also a good time for a clean-up.
A simple trick to find duplicates is to calculate an MD5 signature for each file and then to use that signature to find duplicates of the file, knowing that in general different file contents map to different signatures. Two different files can in principle share a signature, but this is very unlikely to happen by accident.
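The signature trick can be tried directly with standard tools. The following is a minimal sketch, assuming GNU coreutils and findutils (md5sum, uniq with -w, and xargs with -r). The function name find_md5_dupes is hypothetical, and this is not how fdupes itself is implemented.

```shell
# Hypothetical helper: list groups of files under a directory (default: the
# current one) that share an MD5 signature.
# Assumes GNU md5sum, uniq and xargs; file names containing newlines or
# backslashes are not handled.
find_md5_dupes() {
    find "${1:-.}" -type f -print0 |
        xargs -0 -r md5sum |
        sort |
        uniq -w32 --all-repeated=separate
}
```

Calling find_md5_dupes with a directory prints one line per duplicated file, signature first, with a blank line between groups. The first 32 characters of each md5sum line are the signature, which is why uniq compares only that many characters.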
The fdupes package provides the fdupes command, which incorporates the MD5 signature into a more thorough pipeline to guarantee that the files are duplicates. The pipeline for checking for duplicate files begins with a file size comparison, followed by a partial MD5 signature comparison, a full MD5 signature comparison, and finally a byte-by-byte comparison.
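The idea of running the cheapest checks first can be illustrated for a single pair of files. This is only a sketch of the staged approach described above, not the fdupes source code. The function name same_content and the 4096-byte size for the partial signature are assumptions made for illustration.

```shell
# Hypothetical helper: succeed (exit status 0) only if two files have
# identical content, running the cheapest checks first.
same_content() {
    # 1. Files of different sizes can never be duplicates.
    [ "$(wc -c < "$1")" -eq "$(wc -c < "$2")" ] || return 1
    # 2. Partial signature over the first 4096 bytes (size is an assumption).
    [ "$(head -c 4096 "$1" | md5sum)" = "$(head -c 4096 "$2" | md5sum)" ] || return 1
    # 3. Full signature over the whole file.
    [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ] || return 1
    # 4. A byte-by-byte comparison settles the matter.
    cmp -s "$1" "$2"
}
```

Most non-duplicates are rejected at the first or second step without reading the whole file, so the expensive full read is reserved for the few candidates that survive.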
A summary as obtained using the --summarize or -m option is often useful to begin with:
$ fdupes --summarize .
13567 duplicate files (in 6407 sets), occupying 16996.0 megabytes
fdupes requires at least one command line argument (a path to a directory). In the above a period (.) is used to indicate the current directory.
With no options fdupes lists groups of duplicated files in the specified directory:
$ fdupes .
./20180323_thesis_02.pdf
./20180323_thesis_01.pdf
./20180323_thesis.pdf

./20030102_pakdd01_03.pdf
./20031012_pakdd01.pdf

./20200531_siunits_01.pdf
./20200531_siunits.pdf
Use the --recurse or -r option to recurse into subdirectories.
fdupes can delete duplicates, retaining the first listed file. A general heuristic is to keep the original rather than files with versioned file names, noting they contain exactly the same content. Ordering the list by name and then reversing the order can be useful:
$ fdupes --order='name' --reverse .
./20180323_thesis.pdf
./20180323_thesis_01.pdf
./20180323_thesis_02.pdf

./20031012_pakdd01.pdf
./20030102_pakdd01_03.pdf

./20200531_siunits.pdf
./20200531_siunits_01.pdf
The following command will delete duplicates, keeping the first file in the list, the list being ordered in reverse by the filename:
$ fdupes --delete --noprompt --order='name' --reverse .
The --omitfirst or -f option will generate a list of duplicate files excluding the first of each set of duplicates. This list can then be saved to a file and used to generate a script to manually delete the duplicate files if desired.
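One way to turn such a list into a script that can be reviewed before anything is deleted is sketched below. The function name make_rm_script is hypothetical. It assumes one path per line on standard input, with blank lines between groups, and it does not handle file names that contain newlines.

```shell
# Hypothetical helper: read file paths from standard input, one per line, and
# write a shell script with one quoted rm command per path.
make_rm_script() {
    printf '#!/bin/sh\n'
    # Drop the blank lines separating groups, escape any single quotes in a
    # path, then wrap the path in single quotes after "rm --".
    grep -v '^$' | sed "s/'/'\\\\''/g; s/^/rm -- '/; s/\$/'/"
}
```

With fdupes installed, something like `fdupes --recurse --omitfirst . | make_rm_script > remove_dupes.sh` would produce a script (the file name remove_dupes.sh is only an example) to inspect and edit before running it with `sh remove_dupes.sh`.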