Development Log: Duplicate File Finder

I have thousands of files stored on a 1TB external USB drive, which is currently 95% full. I know I have duplicate files throughout the drive because over time I have been lazy and made backups of backups (or copies of copies) of images and other documents.
Time to clean house.
I’ve searched online for a tool that does the following things relatively easily, with a decently designed user interface:

  • Find duplicates based on hash (SHA-256)
  • List duplicates at end of scan
  • Give me an option to delete duplicates, or move them somewhere
  • Be somewhat fast

Every tool I’ve used fell short somewhere.  So I decided to write my own application to do what I want.
What will my application do?
Hash each file recursively, given a starting path, and store the following information in an SQLite database for reporting and/or cleanup purposes:

  • SHA-256 Hash
  • File full path
  • File name
  • File extension
  • File mimetype
  • File size
  • File last modified time
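As a rough illustration of gathering these seven fields for a single file, here is a minimal sketch. It is written in Python rather than the VB.NET the application itself uses, purely so it can be run without extra libraries. The function names and dictionary keys are my own placeholders, and the mimetype is only guessed from the file name, which is an assumption about how that field might be filled.

```python
import hashlib
import mimetypes
import os
from datetime import datetime


def hash_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_file(path):
    """Collect the seven fields listed above for one file (key names are hypothetical)."""
    info = os.stat(path)
    name = os.path.basename(path)
    extension = os.path.splitext(name)[1].lstrip(".").upper()
    mimetype, _encoding = mimetypes.guess_type(path)  # guessed from the name; may be None
    modified = datetime.fromtimestamp(info.st_mtime)
    return {
        "hash": hash_file(path),
        "fullname": os.path.abspath(path),
        "shortname": name,
        "extension": extension,
        "mimetype": mimetype,
        "size": info.st_size,  # bytes
        "modified": modified.isoformat(timespec="seconds"),
    }
```

Reading in chunks keeps memory use flat even for very large files, which matters on a drive full of images and backups.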

With this information, I could run a report such as the following pseudo report:
Show me a list of all duplicate files with an extension of JPG and a file size over 1MB that were modified in the past 180 days.

That’s just a simple query, something like this (the modified-date filter is left out here, and fileSize is assumed to be stored in bytes):

    SELECT fileHash, fileName, filePath, fileSize, COUNT(fileHash)
    FROM indexed_files
    WHERE fileExtension = 'JPG' AND fileSize > 1048576
    GROUP BY fileHash
    HAVING COUNT(fileHash) > 1
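To try the full pseudo report, including the 180-day filter, here is a small sketch against an in-memory SQLite database, in Python for convenience rather than VB.NET. The fileLastModified column name and its text timestamp format are assumptions of mine, and 1MB is written as 1048576 on the assumption that sizes are stored in bytes.

```python
import sqlite3

REPORT = """
SELECT fileHash, COUNT(*) AS copies
FROM indexed_files
WHERE fileExtension = 'JPG'
  AND fileSize > 1048576                                  -- 1 MB, assuming bytes
  AND fileLastModified >= datetime('now', '-180 days')    -- hypothetical column
GROUP BY fileHash
HAVING COUNT(*) > 1
"""


def make_demo_db():
    """Create an empty in-memory table shaped like the report expects."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE indexed_files (
               fileHash TEXT, fileName TEXT, filePath TEXT,
               fileExtension TEXT, fileSize INTEGER, fileLastModified TEXT)"""
    )
    return conn


def duplicate_report(conn):
    """Return (hash, number of copies) for every duplicated hash that passes the filters."""
    return conn.execute(REPORT).fetchall()
```

Selecting only the hash and the count avoids relying on which fileName or filePath SQLite happens to return for a grouped row; the individual paths can be fetched in a second query per hash.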

My application can show me this list and let me decide whether to move or delete the duplicates after the query runs.

One problem comes to mind in automating the removal or moving of duplicates… What if there is more than one duplicate of a file; how do I handle this?

So on to the bits and pieces…

The hashing function is pretty straightforward in VB.NET (did I mention I was writing this in .NET?).

    Imports System.IO
    Imports System.Security.Cryptography

    ' Returns the SHA-256 hash of a file as a lowercase hex string.
    Function hashFile(ByVal fileName As String) As String
        Using hash As SHA256 = SHA256.Create()
            ' Using blocks close the stream even if hashing throws.
            Using fileStream As FileStream = File.OpenRead(fileName)
                Dim hashValue() As Byte = hash.ComputeHash(fileStream)
                Return PrintByteArray(hashValue)
            End Using
        End Using
    End Function

    ' Converts a byte array to a lowercase hex string.
    Public Function PrintByteArray(ByVal array() As Byte) As String
        Dim hexValue As String = ""
        Dim i As Integer
        For i = 0 To array.Length - 1
            hexValue += array(i).ToString("X2")
        Next i
        Return hexValue.ToLower()
    End Function

    Dim path As String = "Z:"
    ' Insert recursion function here and inside, use the following
    ' (path stands in for the file currently being visited):
    Dim fHash = hashFile(path)  ' The SHA-256 hash of the file
    Dim fPath = Nothing         ' The full path to the file
    Dim fName = Nothing         ' The filename
    Dim fExt = Nothing          ' The file's extension
    Dim fSize = Nothing         ' The file's size in bytes
    Dim fLastMod = Nothing      ' The timestamp the file was last modified
    Dim fMimeType = Nothing     ' The mimetype of the file

OK, cool, so I have a somewhat workable code idea here. I’m not sure how long this is going to take to process, so I want to sample a few hundred files and maybe even think about some options I can pass to my application, such as only hashing specific extensions or specific file names like *IMG_*, or even being able to exclude something.
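The filtering options floated above could be sketched roughly as follows. This is Python rather than VB.NET, and the function name and its three parameters are hypothetical, not options the application actually has yet. Pattern matching uses shell-style wildcards, whose case sensitivity follows the operating system.

```python
import os
from fnmatch import fnmatch


def should_hash(path, extensions=None, include=None, exclude=None):
    """Decide whether a file should be hashed.

    extensions: allowed extensions such as ["jpg", "png"], or None for any
    include:    wildcard patterns the file name must match at least one of
    exclude:    wildcard patterns that rule a file name out
    """
    name = os.path.basename(path)
    ext = os.path.splitext(name)[1].lstrip(".").lower()
    if extensions and ext not in {e.lower().lstrip(".") for e in extensions}:
        return False
    if include and not any(fnmatch(name, pattern) for pattern in include):
        return False
    if exclude and any(fnmatch(name, pattern) for pattern in exclude):
        return False
    return True
```

Checking the cheap name-based rules before opening a file means excluded files cost almost nothing, which should help when sampling only a few hundred files.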
But first… a proof of concept.

Update: 11/28/2016

Spent some time working on the application. Here’s a rendition of the GUI; there isn’t much to it, since it is being used as a testing application.

I have also implemented some SQLite code to store the results in a database. Here’s a screenshot of the database.

Continuing on with some brainstorming, I’ve been thinking about how to handle the multiple duplicates.
I think what I want to do is:

  • Add new table “duplicates”
  • Link “duplicates” to “files” table by “id” based on duplicate hashes
  • Store all duplicates found in this table for later management (deleting, archiving, etc.)

After testing some SQL queries and using some test data, I came up with this query:

    SELECT *
    FROM file
    WHERE hash IN (
        SELECT hash FROM file GROUP BY hash HAVING COUNT(*) > 1
    )

This gives me the correct results as illustrated in the screenshot below.

Now that I can pick out the duplicate files and display them via a query, I can treat the file with the lowest “id”, or choose one by last modified date, as the original and move the other copies to a table to be removed or archived.
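Here is one way the lowest-“id” idea could be sketched, which also covers the case of a file having more than one duplicate. It is in Python with an in-memory SQLite database rather than the application's VB.NET. The layout of the “duplicates” table (file_id and original_id columns) is my assumption, and the “file” table is trimmed to three columns for brevity.

```python
import sqlite3

SETUP = """
CREATE TABLE file (id INTEGER PRIMARY KEY AUTOINCREMENT, hash TEXT, fullname TEXT);
CREATE TABLE duplicates (
    file_id     INTEGER PRIMARY KEY REFERENCES file(id),  -- the redundant copy
    original_id INTEGER NOT NULL REFERENCES file(id)      -- the copy to keep
);
"""

FLAG_DUPLICATES = """
INSERT INTO duplicates (file_id, original_id)
SELECT f.id, keep.original_id
FROM file AS f
JOIN (SELECT hash, MIN(id) AS original_id
      FROM file
      GROUP BY hash
      HAVING COUNT(*) > 1) AS keep ON keep.hash = f.hash
WHERE f.id <> keep.original_id
"""


def flag_duplicates(conn):
    """Rebuild the duplicates table; the lowest id per hash is kept as the original."""
    with conn:  # commit both statements together
        conn.execute("DELETE FROM duplicates")
        conn.execute(FLAG_DUPLICATES)
    return conn.execute(
        "SELECT file_id, original_id FROM duplicates ORDER BY file_id"
    ).fetchall()
```

With three copies of the same file, the two higher ids land in “duplicates” pointing at the first, so every extra copy is handled the same way no matter how many there are.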
I’m running my first test on a local NAS with thousands of files. It’s been running for about 3 hours and the database file is at 1.44MB.

Update: 12/1/2016

I’ve worked on the application off and on over the past few days, trying to optimize the file recursion method. I ended up implementing a faster method than the one I created above, and I wrote about it here.

Here’s a piece of the code within the recursion function. I’m running the first test on my user directory, C:\Users\rkreider. The recursive count took about 1.5 seconds to count all the files (27k). I will need to add logic because the file count doesn’t actually attempt to open each file and create a hash like my hash function does, so 27k files may actually end up being only 22k or whatever.

Just a file count of C:\users\rkreider (SSD) took about 1.5 seconds for 26k files.

File count of my user directory (SSD disk), no file hashing or other processing done.
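For comparison, a count-only pass can be sketched like this. It is a Python stand-in, not the faster VB.NET method mentioned above, and the function name is hypothetical. It only reads directory entries and never opens a file, which is why a count like this can finish in seconds while hashing takes minutes.

```python
import os


def count_files(root):
    """Count regular files under root without opening any of them."""
    total = 0
    pending = [root]  # explicit stack instead of recursive calls
    while pending:
        folder = pending.pop()
        try:
            with os.scandir(folder) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        pending.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        total += 1
        except OSError:
            continue  # an unreadable folder ends that folder's scan; the rest carries on
    return total
```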

Hashing Test Run 1
On this pass, I decided to run the hash on the files.  It took considerably longer, just under 5 minutes.

File hashing recursively of my user directory (SSD).

Something important to note: not all 26,683 of the original files scanned were actually hashed, for various reasons such as access permissions, the file already being opened by something else, etc.
For comparison, the database (SQLite) created 26,505 records and is 5.4MB in size.

Hashing Test Run 2
I moved the file counter further into the hash loop and only increment the counter when a file is successfully hashed.  Here are my results now.

Recursive hash of my user directory (SSD) with a found/processed indicator now.

As you can see, it found 26,684 files and could only process (hash) 26,510.
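The found/processed split can be illustrated with a short sketch, again in Python rather than the VB.NET used in the application and with names of my own choosing. The found counter goes up for every file the walk sees, while the processed counter goes up only after a hash has actually been produced.

```python
import hashlib
import os


def hash_tree(root):
    """Hash every readable file under root.

    Returns (found, processed, hashes), where hashes maps path -> SHA-256 hex digest.
    """
    found = 0
    processed = 0
    hashes = {}
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            found += 1
            path = os.path.join(folder, name)
            digest = hashlib.sha256()
            try:
                with open(path, "rb") as handle:
                    for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                        digest.update(chunk)
            except OSError:
                continue  # permission denied, locked, vanished: found but not processed
            hashes[path] = digest.hexdigest()
            processed += 1  # counted only after a successful hash
    return found, processed, hashes
```

The gap between the two numbers is then exactly the set of files that could not be opened or read.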

Comparing the result in GUI to the database with SELECT COUNT(*) FROM file, it matches properly.  The database size remains about the same, 5.39MB.

One thing that I’m trying to decide is whether or not to put some type of progress indicator on the interface.
The thing is, this adds overhead because I have to first get a count of files and that will take x seconds.  In the case of the NAS scan, it took 500+ seconds, over 5 minutes.  So I’d be waiting 5 minutes JUST for a count and then I’d start the file hashing which will take time.  I just don’t know if it is worth it, but it sure would be nice I believe.
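One possible compromise, sketched here in Python and not something the application currently does, is to start hashing right away and run the count on a background thread, showing a plain running tally until the total is known. The class and method names are hypothetical. On a slow NAS the two traversals would compete for the same disk, so I have not verified that this would be a net win there.

```python
import os
import threading


class ProgressTracker:
    """Tracks hashed files while a background thread works out the total."""

    def __init__(self, root):
        self.processed = 0
        self.total = None  # unknown until the background count finishes
        self._thread = threading.Thread(target=self._count, args=(root,), daemon=True)
        self._thread.start()

    def _count(self, root):
        # Count-only walk; no files are opened here.
        self.total = sum(len(files) for _folder, _subfolders, files in os.walk(root))

    def wait_for_total(self, timeout=None):
        """Block until the count is done (mainly useful for testing)."""
        self._thread.join(timeout)

    def advance(self):
        """Call once per successfully hashed file."""
        self.processed += 1

    def label(self):
        """Text suitable for a status line in the interface."""
        if self.total is None:
            return f"{self.processed} files hashed (still counting)"
        return f"{self.processed} of {self.total} files hashed"
```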

Database Schema

    CREATE TABLE [file] (
        [id] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        [hash] TEXT NULL,
        [fullname] TEXT NULL,
        [shortname] TEXT NULL,
        [extension] TEXT NULL,
        [mimetype] TEXT NULL,
        [size] INTEGER NULL,
        [modified] TIMESTAMP NULL
    );
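To show the schema in use, here is a minimal sketch that creates the table and stores a batch of records, using Python's built-in sqlite3 module rather than the application's VB.NET. The index on hash and the helper names are my own additions, not part of the schema above. The thinking is that an index should help the duplicate lookup as the table grows, and that writing a batch inside one transaction tends to be much faster in SQLite than committing row by row.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS [file] (
    [id] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    [hash] TEXT NULL,
    [fullname] TEXT NULL,
    [shortname] TEXT NULL,
    [extension] TEXT NULL,
    [mimetype] TEXT NULL,
    [size] INTEGER NULL,
    [modified] TIMESTAMP NULL
);
-- Hypothetical addition: speeds up grouping and lookups by hash.
CREATE INDEX IF NOT EXISTS idx_file_hash ON [file] ([hash]);
"""

INSERT = """
INSERT INTO [file] (hash, fullname, shortname, extension, mimetype, size, modified)
VALUES (:hash, :fullname, :shortname, :extension, :mimetype, :size, :modified)
"""


def open_database(path=":memory:"):
    """Open (or create) the database and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_records(conn, records):
    """Insert a batch of record dictionaries inside a single transaction."""
    with conn:
        conn.executemany(INSERT, records)
```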
