Get File Count Recursively

I’ve been working on a small tool to aid in removing duplicate files and as I’m going back over my roughed in code, I’m trying to optimize it for some performance gains.

This snippet of code works really well for recursively counting files given a specific path.  I originally found it at StackOverflow and slightly modified to suit my needs.

Sub ProcessFile(ByVal path As String)
        fileCounter += 1
    End Sub

    Sub ApplyAllFiles(ByVal folder As String, ByVal extension As String, ByVal fileAction As ProcessFileDelegate)
        For Each file In Directory.GetFiles(folder, extension)
            fileAction.Invoke(file)
        Next
        For Each subDir In Directory.GetDirectories(folder)
            Try
                ApplyAllFiles(subDir, extension, fileAction)
            Catch ex As Exception
            End Try
        Next

    End Sub

It processes about 27k files in 1.5 seconds on my SSD disk.  I have it running against a NAS with considerably larger amount of files, so I’ll see how well it performs.

In my sub, I use the following to kick it off.

        Dim fileCounter as Long = 0L
        Dim path = "z:"
        Dim ext = "*.*"

        ToolStripStatusLabel1.Text = "Calculating files..."

        stpw.Start()

        Dim runProcess As ProcessFileDelegate = AddressOf ProcessFile
        ApplyAllFiles(path, ext, runProcess)

        stpw.Stop()

        Dim rslts3 As String = String.Format("Total files = {0:n0}. Took {1:n0} ms.", fileCounter, stpw.ElapsedMilliseconds)
        ToolStripStatusLabel1.Text = rslts3.ToString

Graphically speaking, this isn’t much to look at – but the important part is in the ToolStripStatus. I have a timer on my form that updates the latest file count every 15 seconds so that a user would know it’s still working. Interestingly enough, if I update the ToolStripStatus with every single file that is found, it exponentially increases the time it takes to go through the files, so I decided to just update every 15 seconds.

Development Log: Duplicate File Finder

I have thousands of files stored on an external USB attached 1TB drive.  My drive is currently 95% full.  I know I have duplicate files throughout the drive because over time I have been lazy and made backups of backups (or copies of copies) of images or other documents.

Time to clean house.

I’ve searched online for a tool to do the following things, relatively easily and in a decent designed user interface:

  • Find duplicates based on hash (SHA-256)
  • List duplicates at end of scan
  • Give me an option to delete duplicates, or move them somewhere
  • Be somewhat fast

Every tool I’ve used fell short somewhere.  So I decided to write my own application to do what I want.

What will my application do?

Hash each file recursively given a starting path and store the following information into an SQLite database for reporting and/or cleanup purposes.

  • SHA-256 Hash
  • File full path
  • File name
  • File extension
  • File mimetype
  • File size
  • File last modified time

With this information, I could run a report such as the following pseudo report:

Show me a list of all duplicate files with an extension of JPG over a file size of 1MB modified in the past 180 days.

That’s just a simple query, something like:

SELECT fileHash, fileName, filePath, fileSize COUNT(fileHash) FROM indexed_files WHERE fileExtension='JPG' and fileSize > 1024 GROUP BY fileHash HAVING COUNT(fileHash)>1

My application can show me a list of these and make some decisions to allow me to move or delete the duplicates after the query runs.

One problem comes to mind in automating removal or moving duplicates… What if there are more than 1 duplicate file; how do I handle this?

So on to the bits and pieces…

The hashing function is pretty straight-forward in VB.NET (did I mention I was writing this in .NET?).

Imports System.IO
Imports System.Security
Imports System.Security.Cryptography

Function hashFile(ByVal fileName As String)

  Dim hash
  hash = SHA256.Create()

  Dim hashValue() As Byte

  Dim fileStream As FileStream = File.OpenRead(fileName)
  fileStream.Position = 0
  hashValue = hash.ComputeHash(fileStream)
  Dim hashHex = PrintByteArray(hashValue)

  fileStream.Close()

  Return hashHex

End Function

Public Function PrintByteArray(ByVal array() As Byte)

  Dim hexValue As String = ""

  Dim i As Integer

  For i = 0 To array.Length - 1
    hexValue += array(i).ToString("X2")
  Next i

  Return hexValue.ToLower

End Function

Dim path As String = "Z:"
' Insert recursion function here and inside, use the following:
Dim fHash = hashFile(path) ' The SHA-256 hash of the file
Dim fPath = Nothing ' The full path to the file
Dim fName = Nothing ' The filename
Dim fExt = Nothing ' The file's extension
Dim fSize = Nothing ' The file's size in bytes
Dim fLastMod = Nothing ' The timestamp the file was last modified
Dim fMimeType = Nothing ' The mimetype of the file

Ok cool, so I have a somewhat workable code idea here. I’m not sure how long this is going to take to process, so I want to sample a few hundred files and maybe even think about some options I can pass to my application such as only hashing specific exensions or specific file names like *IMG_* or even be able to exclude something.

But first… a proof of concept.

Update: 11/28/2016

Spent some time working on the application.  Here’s a GUI rendition;  not much since it is being used as a testing application.

I have also implemented some code for SQLite use to store this to a database.  Here’s a screenshot of the database.

Continuing on with some brainstorming, I’ve been thinking about how to handle the multiple duplicates.

I think what I want to do is

  • Add new table “duplicates”
  • Link “duplicates” to “files” table by “id” based on duplicate hashes
  • Store all duplicates found in this table for later management (deleting, archiving, etc.)

After testing some SQL queries and using some test data, I came up with this query:

SELECT * FROM file a
WHERE ( hash ) IN ( SELECT hash FROM file GROUP BY hash HAVING COUNT(*) > 1 )

This gives me the correct results as illustrated in the screenshot below.

So with being able to pick out the duplicate files and display them via a query, I can then use the lowest “id” as the base or even the last modified date as the original and move the duplicates to a table to be removed or archived.

Running my first test on a local NAS with thousands of file.  It’s been running about 3 hours and the database file is at 1.44MB.

Update 12/1/2016

I’ve worked on the application off and on over the past few days trying to optimize the file recursion method.  I ended up implementing a faster method than I created above, and I wrote about it here.

Here’s a piece of the code within the recursion function.  I’m running the first test on my user directory, C:Users
kreider.  The recursive count took about 1.5 seconds to count all the files (27k).  I will need to add logic because the file count doesn’t actually attempt to open and create a hash like my hash function does;  so 27k files may actually end up only being 22k or whatever.

Just a file count of C:users
kreider (SSD) took about 1.5 seconds for 26k files.

File count of my user directory (SSD disk), no file hashing or other processing done.

Hashing Test Run 1

On this pass, I decided to run the hash on the files.  It took considerably longer, just under 5 minutes.

File hashing recursively of my user directory (SSD).

Something important to note.  Not all 26,683 of the original files scanned were actually hashed for various reasons such as Access Permissions, file already opened by something, etc.

For comparison, the database (SQLite) created 26,505 records and is 5.4MB in size.

Hashing Test Run 2

I moved the file counter further into the hash loop and only increment the counter when a file is successfully hashed.  Here are my results now.

Recursive hash of my user directory (SSD) with a found/processed indicator now.

As you can see, it found 26,684 file and could only process (hash) 26,510.

Comparing the result in GUI to the database with SELECT COUNT(*) FROM file, it matches properly.  The database size remains about the same, 5.39MB.

One thing that I’m trying to decide is whether or not to put some type of progress identifier on the interface.

The thing is, this adds overhead because I have to first get a count of files and that will take x seconds.  In the case of the NAS scan, it took 500+ seconds, over 5 minutes.  So I’d be waiting 5 minutes JUST for a count and then I’d start the file hashing which will take time.  I just don’t know if it is worth it, but it sure would be nice I believe.

Database Schema

CREATE TABLE [file] (
[id] INTEGER  PRIMARY KEY AUTOINCREMENT NOT NULL,
[hash] text  NULL,
[fullname] text  NULL,
[shortname] text  NULL,
[extension] text  NULL,
[mimetype] text  NULL,
[size] intEGER  NULL,
[modified] TIMESTAMP  NULL
);