~shiny/image-dedup

Image deduplication helper
Add script to display duplicate groups with feh image viewer
Add script to remove deleted photos from database
Assume fingerprint script is in the same folder as show-dupes

clone

read-only
https://git.sr.ht/~shiny/image-dedup
read/write
git@git.sr.ht:~shiny/image-dedup

#Photo Deduplication

This is a very small, very simple tool to help you deduplicate images.

Dependencies: Python 3, OpenCV, numpy, and the shell tools sort, uniq, cut, find and grep.

I'm using the GNU versions of the shell utilities with bash. YMMV with others: I've not checked whether all the options I use are POSIX.

show-dupes maintains a "database": a file whose lines are in the format output by fingerprint-files. Its first argument is the database file, followed by zero or more folders, which are processed recursively. Only file paths not already in the database are added. Be consistent about how you specify paths: the filenames in the db are exactly what find outputs and nothing clever is done to resolve them, so a path must match its db entry exactly or the file will be re-processed. All files considered to be duplicates are printed as groups separated by an empty line, with one filename per line within each group.
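To make the database behaviour concrete, here is a minimal Python sketch of the same logic. It is not the actual show-dupes script (which is built from shell tools). The function names are made up for illustration, and it assumes that "duplicate" means "identical fingerprint", which is my reading of the description rather than something stated outright.

```python
from collections import defaultdict


def parse_db(lines):
    """Yield (fingerprint, path) pairs from database lines.

    Each line is '<fingerprint> <path>'; the path may itself contain spaces,
    so only the first space is treated as the separator.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fingerprint, _, path = line.partition(" ")
        yield fingerprint, path


def paths_to_add(db_lines, found_paths):
    """Return the found paths that are not in the database yet.

    The comparison is a plain string match, so 'photos/a.jpg' and
    './photos/a.jpg' count as different files.
    """
    known = {path for _, path in parse_db(db_lines)}
    return [p for p in found_paths if p not in known]


def duplicate_groups(db_lines):
    """Group paths by fingerprint and keep groups with more than one file.

    Assumption: files are duplicates when their fingerprints are identical.
    """
    by_fingerprint = defaultdict(list)
    for fingerprint, path in parse_db(db_lines):
        by_fingerprint[fingerprint].append(path)
    return [paths for paths in by_fingerprint.values() if len(paths) > 1]


def format_groups(groups):
    """Render groups as empty-line separated blocks of filenames."""
    return "\n\n".join("\n".join(group) for group in groups)
```

The string-only comparison in paths_to_add is why the same folder should always be passed the same way: a path spelled differently looks like a new file.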

Example usage:

./show-dupes photos/db photos/
# Examine the output, delete any you want to etc.
# ... later you add some more photos
./show-dupes photos/db photos/
# any new files now processed and dupes output again
./show-dupes photos/db
# Any duplicates in the db are output without adding any new files

That should be all you need to find duplicates. But read on if you want more info.

clean-db deletes database entries whose files no longer exist on disk.

./clean-db photos/db photos/
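As a rough illustration of what that cleanup amounts to, the sketch below filters database lines down to those whose file still exists. It is a hypothetical Python equivalent, not the clean-db script itself: the function name is invented, and it ignores the folder argument shown in the command above, since how the real script uses it is not described here.

```python
import os


def clean_db_lines(db_lines, exists=os.path.exists):
    """Return only the database lines whose file is still present.

    `exists` is injectable so the filter can be tried out without touching
    the disk; by default it checks the real filesystem.
    """
    kept = []
    for line in db_lines:
        # Line format: '<fingerprint> <path>'; split on the first space only.
        _, _, path = line.rstrip("\n").partition(" ")
        if path and exists(path):
            kept.append(line)
    return kept
```

Writing the kept lines to a temporary file and then renaming it over the database would avoid losing entries if the process is interrupted part-way.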

display-dupes uses feh to display detected duplicates. Each group of duplicates is displayed in a multiwindow feh invocation. You can use the built-in functions of feh to delete some of the files if desired. Closing all the feh windows (or exiting from the feh menu) moves on to the next group.

./show-dupes photos/db | ./display-dupes
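If you want to consume the show-dupes output yourself, the group format is easy to parse. The sketch below splits the output into groups and builds one feh command per group. It is only an illustration and not the display-dupes script: the function names are hypothetical, and I'm assuming feh's --multiwindow option is what produces the one-window-per-file behaviour described above.

```python
def read_groups(lines):
    """Split show-dupes style output into lists of filenames.

    Groups are separated by empty lines; extra empty lines are ignored.
    """
    groups, current = [], []
    for line in lines:
        name = line.rstrip("\n")
        if name:
            current.append(name)
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def feh_command(group):
    """Build the argument list to view one group, one window per file."""
    return ["feh", "--multiwindow"] + list(group)
```

Passing each command to subprocess.run in turn, for example subprocess.run(feh_command(group)) for every group returned by read_groups(sys.stdin), blocks until feh exits, which gives the same "close the windows to see the next group" flow.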

#Details

fingerprint-files.py reads a list of (newline-separated) filenames from stdin and outputs the fingerprint and the filename to stdout: the fingerprint is 64 hex characters, then a space, and then the filename (verbatim as read). Filenames containing newlines won't work (don't do that!). If OpenCV can't read a file then an error is printed to stderr (but otherwise processing continues).
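The stdin/stdout contract above can be sketched as a small loop. This is not the code from fingerprint-files.py: the function name and the fingerprint_fn parameter are placeholders, and the wording of the error message is invented. It only shows the shape of the protocol: one filename in, one "fingerprint filename" line out, with errors reported on stderr and processing continuing.

```python
def fingerprint_stream(lines, fingerprint_fn, out, err):
    """Write '<fingerprint> <filename>' to `out` for each readable file.

    `fingerprint_fn` stands in for the real image fingerprinting and is
    expected to raise an exception when a file cannot be read.
    """
    for line in lines:
        # Strip only the newline so the filename is echoed verbatim.
        filename = line.rstrip("\n")
        if not filename:
            continue
        try:
            fingerprint = fingerprint_fn(filename)
        except Exception as exc:
            # Report the failure and carry on with the next file.
            print(f"error: cannot read {filename}: {exc}", file=err)
            continue
        print(f"{fingerprint} {filename}", file=out)
```

In a script this would be called as fingerprint_stream(sys.stdin, some_fingerprint_function, sys.stdout, sys.stderr). Because the filename is everything after the first space, names containing spaces survive the round trip, but names containing newlines cannot.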

#Fingerprint

The method used to generate the fingerprint is the same as in findimagedupes:

  • Resize image to 160x160
  • Convert to grayscale
  • Blur
  • Normalise/equalise brightness
  • Resize to 16x16
  • Threshold to 1 bit per pixel
  • Fingerprint is those 256 bits hex encoded

Note that whilst the method is the same, this implementation uses OpenCV rather than ImageMagick, so its fingerprints will not be comparable with those produced by findimagedupes.
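The last two steps of the method (thresholding to 1 bit per pixel and hex encoding) are what turn a 16x16 image into 64 hex characters, since 256 bits / 4 bits per hex digit = 64. Here is a pure-Python sketch of just those two steps, operating on a plain list of rows rather than an OpenCV image. The threshold level (the mean pixel value by default) and the bit order (row by row, most significant bit first) are assumptions made for the example, not details taken from the script.

```python
def bits_to_hex(pixels, threshold=None):
    """Encode a 16x16 grid of grayscale values (0..255) as 64 hex characters.

    Assumptions: a pixel becomes 1 when it is brighter than the threshold,
    the threshold defaults to the mean pixel value, and bits are packed
    row by row with the first pixel as the most significant bit.
    """
    flat = [value for row in pixels for value in row]
    if len(flat) != 256:
        raise ValueError("expected a 16x16 grid of pixel values")
    if threshold is None:
        threshold = sum(flat) / len(flat)
    packed = 0
    for value in flat:
        packed = (packed << 1) | (1 if value > threshold else 0)
    # 256 bits always render as exactly 64 zero-padded hex digits.
    return format(packed, "064x")
```

For example, a grid whose top eight rows are white (255) and bottom eight rows are black (0) has a mean of 127.5, so the first 128 bits are set and the result is 32 "f" characters followed by 32 "0" characters.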