This is a program that was quickly hacked together to identify duplicate files in an I/O-efficient manner. Most dedupe utilities I've found tend to hash the entire file's contents, which is suboptimal for archives of large media files stored on slow storage (i.e., commodity hard drives).
Instead, this program identifies large files above a given size threshold and reads a fixed-size block from the beginning and end of each file. These blocks are then fed to a digest algorithm to create a "fuzzy hash." The advantages of this approach are that (a) because the chunks are small, there is less chance that rotational drives will need to read many fragmented extents, and (b) there is much less read I/O to be done in general.
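As a rough illustration of the idea, the sketch below hashes the first and last block of a file. It is not the program's actual implementation: the block size, the `fuzzy_hash` function name, and the use of the standard library's `DefaultHasher` (rather than the real digest) are all assumptions made for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read, Seek, SeekFrom};

// Illustrative block size: 1 MiB read from each end of the file.
const BLOCK_SIZE: u64 = 1024 * 1024;

/// Hash the leading and trailing BLOCK_SIZE bytes of a file into one digest.
fn fuzzy_hash(path: &str) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let len = file.metadata()?.len();

    // DefaultHasher stands in for the real digest algorithm in this sketch.
    let mut hasher = DefaultHasher::new();
    let mut buf = vec![0u8; BLOCK_SIZE as usize];

    // Leading block. A single read() is enough for a sketch; a robust
    // version would loop or use read_exact-style handling.
    let n = file.read(&mut buf)?;
    hasher.write(&buf[..n]);

    // Trailing block, only if the file is large enough to have a distinct one.
    if len > BLOCK_SIZE {
        file.seek(SeekFrom::End(-(BLOCK_SIZE as i64)))?;
        let n = file.read(&mut buf)?;
        hasher.write(&buf[..n]);
    }

    Ok(hasher.finish())
}

fn main() -> io::Result<()> {
    // Hypothetical path, used only to demonstrate the call.
    let digest = fuzzy_hash("example.bin")?;
    println!("fuzzy hash: {:016x}", digest);
    Ok(())
}
```

Two files whose fuzzy hashes collide are treated as candidate duplicates; everything else can be skipped without reading it in full.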
This tends to work well enough on sufficiently random files (e.g. photos, video) but will fail on files that share large extents. (For example, this approach may fail on two copies of a database file where the only delta between them is an update in the middle of the file.)
As such, it is crucial to note that this tool is not meant to have a 100% success rate. The output of this program should not be used for any sort of automated deduplication; it is meant to quickly generate a list of candidate duplicates to be checked by a human.
git clone https://github.com/drbawb/deduper
cd deduper
cargo build
deduper <path>

This will scan <path> recursively for duplicate files.