#deduper - a fast file dedupe utility

This is a program that was quickly hacked together to identify duplicate files in an I/O-efficient manner. Most dedupe utilities I've found tend to hash the entire file's contents -- this is suboptimal on archives of large media files stored on slow storage (e.g. commodity hard drives).

Instead, this program identifies large files above a given size threshold and reads a block of fixed size from the beginning and end of each file. These blocks are then fed to a digest algorithm to create a "fuzzy hash." The advantages of this approach are that (a) because the chunks are small there is less of a chance that rotational drives will need to read many fragmented extents, and (b) there is much less read I/O to be done in general.
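
As a rough illustration, here is a minimal sketch of that scheme in Rust. The chunk size, the function name fuzzy_hash, and the use of the standard library's DefaultHasher (standing in for the project's actual digest algorithm) are assumptions made for the example, not the program's real implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// Size of the chunk read from each end of the file (hypothetical value).
const CHUNK_SIZE: u64 = 64 * 1024;

/// Digest only the first and last CHUNK_SIZE bytes of a file (plus its
/// length) to produce a cheap "fuzzy hash" of its contents.
fn fuzzy_hash(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let len = file.metadata()?.len();

    let mut hasher = DefaultHasher::new();
    hasher.write_u64(len);

    // Leading chunk (or the whole file, if it is smaller than one chunk).
    let mut buf = vec![0u8; CHUNK_SIZE.min(len) as usize];
    file.read_exact(&mut buf)?;
    hasher.write(&buf);

    // Trailing chunk, only when the file is larger than a single chunk.
    if len > CHUNK_SIZE {
        file.seek(SeekFrom::End(-(CHUNK_SIZE as i64)))?;
        file.read_exact(&mut buf)?;
        hasher.write(&buf);
    }

    Ok(hasher.finish())
}
```

Mixing the file length into the digest cheaply separates files that merely happen to share their first and last bytes.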

This tends to work well enough on sufficiently random files (e.g. photos, video) but will fail on files that share large extents. (For example, this approach may fail if you had two copies of a database file where the only delta between them is updates in the middle of the file.)

As such it is crucial to note that this tool is not meant to have a 100% success rate. The output of this program should not be used for any sort of automated deduplication; it is meant to quickly generate a list of candidate duplicates to be checked by a human.

#build instructions

  1. You will need a copy of the Rust toolchain for your operating system.
  2. git clone https://github.com/drbawb/deduper
  3. cd deduper
  4. cargo build

#usage instructions

  1. deduper <path>: will scan the path recursively for duplicate files
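
The scan can be thought of as a recursive walk that buckets files by their fuzzy hash, with buckets containing more than one file reported as candidate duplicates. The sketch below is a hypothetical illustration of that grouping pass (reusing the fuzzy_hash sketch from above), not the actual program's internals.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Walk `root` recursively and bucket files by their fuzzy hash.
/// Buckets with more than one entry are candidate duplicates.
fn find_candidates(root: &Path) -> io::Result<HashMap<u64, Vec<PathBuf>>> {
    let mut groups: HashMap<u64, Vec<PathBuf>> = HashMap::new();
    let mut stack = vec![root.to_path_buf()];

    while let Some(dir) = stack.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                stack.push(path);
            } else if path.is_file() {
                // NB: the real tool only hashes files above a size
                // threshold; that check is omitted here for brevity.
                let digest = fuzzy_hash(&path)?;
                groups.entry(digest).or_default().push(path);
            }
        }
    }

    Ok(groups)
}
```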

#TODO

  • [ ] add flag to digest full file if desired
  • [ ] add flag to not recurse through directories
  • [ ] alternative (machine-readable) output formats?
  • [ ] configurable chunk size
  • [ ] perhaps consider a chunk in the middle of the file for add'l accuracy?
  • [ ] optimization work