JUniq - Duplicate file remover


Backing up your files is important. But if you're anything like me, many of your files end up backed up several times over on your backup server. JUniq helps you remove all but a single copy of each file; i.e., it makes removing duplicate files easy.


About JUniq

JUniq recurses through the directories that you select, finding files that have the same content. It uses a cryptographic hash to compare files, and only bothers computing a hash of the whole file when a quick "smoke" test cannot rule out a match; consequently, it's quite fast.
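
To make the two-stage idea concrete, here is a minimal sketch, not JUniq's actual code. It assumes the quick "smoke" test is a file-size comparison and that the hash is SHA-256; neither detail is stated above, and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: the names and the size-based smoke test are assumptions.
final class DuplicateFinderSketch {

    /** Returns the filesets (groups of two or more identical files) under root. */
    static List<List<Path>> findFilesets(Path root)
            throws IOException, NoSuchAlgorithmException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        // Cheap test first: files of different sizes cannot have the same content.
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path p : files) {
            bySize.computeIfAbsent(Files.size(p), k -> new ArrayList<>()).add(p);
        }

        List<List<Path>> filesets = new ArrayList<>();
        for (List<Path> sameSize : bySize.values()) {
            if (sameSize.size() < 2) {
                continue; // unique size, so the file is never read in full
            }
            // Expensive test only for the remaining candidates.
            Map<String, List<Path>> byHash = new HashMap<>();
            for (Path p : sameSize) {
                byHash.computeIfAbsent(sha256Hex(p), k -> new ArrayList<>()).add(p);
            }
            for (List<Path> group : byHash.values()) {
                if (group.size() > 1) {
                    filesets.add(group);
                }
            }
        }
        return filesets;
    }

    /** Streams the file through SHA-256 and returns the digest as lowercase hex. */
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Most files on a typical disk have a size that no other file shares, so under these assumptions only a small fraction of the data ever needs to be hashed.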

After building a database of filesets (files that have the same content), JUniq generates a shell script which will actually do the removal for you. In other words, JUniq is completely safe: it gives you an opportunity to look over its decisions, and to make exceptions on a one-by-one basis. You can actually do more than just delete the files; the shell script generator can generate arbitrary shell script code.
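
As a rough illustration of the script-generation step, the sketch below turns a list of filesets into a shell script. It is not taken from JUniq: the class name, the convention that the first entry of each fileset is the survivor, and the command parameter are all assumptions made for this example.

```java
import java.util.List;

// Hypothetical sketch of a script generator; not JUniq's real implementation.
final class ScriptGeneratorSketch {

    /** Wraps s in single quotes so that sh treats it as one literal word. */
    static String shellQuote(String s) {
        return "'" + s.replace("'", "'\\''") + "'";
    }

    /**
     * Emits one command per duplicate. By assumption, the first path in each
     * fileset is the survivor and is only mentioned in a comment.
     * The command is arbitrary shell code, e.g. "rm --" or "echo".
     */
    static String generate(List<List<String>> filesets, String command) {
        StringBuilder sb = new StringBuilder("#!/bin/sh\n");
        for (List<String> fileset : filesets) {
            for (String path : fileset) {
                if (path.indexOf('\n') >= 0) {
                    // Keeps the "# keep" comment on a single line.
                    throw new IllegalArgumentException("newline in path name");
                }
            }
            sb.append("\n# keep ").append(fileset.get(0)).append('\n');
            for (String duplicate : fileset.subList(1, fileset.size())) {
                sb.append(command).append(' ')
                  .append(shellQuote(duplicate)).append('\n');
            }
        }
        return sb.toString();
    }
}
```

Passing "echo" instead of "rm --" as the command gives a harmless dry run, which is the same look-before-you-delete idea that the generated script is meant to support.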

How to use

  1. Run JUniq (java -jar juniq.jar).
  2. Add paths (under the Operations menu) to index.
  3. Wait patiently! JUniq may have to read gigabytes of data, depending on what paths you selected. Note that you can save the generated database (from the Database menu).
  4. Click Generate Script from the Operations menu. Check the configuration, then click Generate.
  5. Manually verify the contents of the script (in emacs, for example) and make any desired changes.
  6. Execute the script!

How does JUniq choose which files should be preserved?

JUniq is designed to support multiple "survivor" strategies: i.e., ways of picking which file from a set of identical files will be retained. At present, only two strategies are implemented, though it's easy to add your own (see Generate.java):

  1. Delete all but the file with the longest path name. This works pretty well: if you have multiple directories containing the same files, this strategy tends to select a single directory (rather than selecting files from different directories, which would be annoying). It also tends to ignore source control metadata this way.

  2. Delete all but the file with the shortest path name. For photo and mp3 albums, this tends to preserve those files that have been most carefully organized and sorted into subdirectories, deleting the files that are in a "miscellaneous to-be-sorted" directory.
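
The two strategies above could be sketched roughly as follows. This is an illustration only: it does not reproduce the code in Generate.java, the names are invented, and it assumes that "path name length" means the number of characters in the path, with ties broken alphabetically so the result is deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of survivor selection; not the code in Generate.java.
final class SurvivorSketch {

    // Orders paths by character count, then alphabetically to break ties.
    static final Comparator<String> BY_LENGTH = (a, b) ->
            a.length() != b.length()
                    ? Integer.compare(a.length(), b.length())
                    : a.compareTo(b);

    /** Strategy 1: the file with the longest path name survives. */
    static String longestPath(List<String> fileset) {
        return Collections.max(fileset, BY_LENGTH);
    }

    /** Strategy 2: the file with the shortest path name survives. */
    static String shortestPath(List<String> fileset) {
        return Collections.min(fileset, BY_LENGTH);
    }

    /** Everything in the fileset except the survivor is slated for deletion. */
    static List<String> toDelete(List<String> fileset, String survivor) {
        List<String> rest = new ArrayList<>(fileset);
        rest.remove(survivor);
        return rest;
    }
}
```

Under this sketch, a new strategy is just another function from a fileset to the one path that should be kept.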

Download

JUniq is free (GPL). Get it here.

Version History:

FAQ/Troubleshooting

How do I install it?
Please read the README. If someone wants to contribute an INSTALL file, that'd be great.

Everything you see here is (C) 2003 by Edwin Olson, eolson@mit.edu. Please feel free to link, but don't copy my content. Last modified: May 07 2007 16:14:29.