Death by 30,000 Files

One of my colleagues humorously referred to this post as “how to delete a thousand files.”  We all had a good laugh.  And it only served to highlight how something that seems so simple is complex enough to deserve a detailed post about it!  Fear of the unknown is a powerful motivator.  Let’s break through that together …

Who Cares?

Our largest build artifact was 256 MB and 30,000 files.  Nothing in today’s world.  No big deal, right?  Sure, without context … but let’s put some color on this black & white picture:  20,000 – 66% – of those files were cruft. That eats up small amounts of time in lots of ways that really add up:

  • Cloning the Git repo and checking out the branch (CI builds)
  • Zipping the files to create the artifact
  • Unzipping the files to deploy
  • Uploading / downloading the artifact to/from Nexus
  • At 30+ builds a day, that’s 218 GB per month to store in Nexus

And even if you ignore all of that, reducing the cognitive load of your project will increase the velocity and quality of your teams.

Cheaply Determining if You Have a Problem

So, how did we know we had 20,000 crufty files without authorizing someone to do the work to investigate?  In our case, it was a combination of the following tactics, any of which you can use:

  • Someone who’s been around a long time and has a good gut feel
  • Whitespace programs
  • Hackathons
  • “0” point research spikes (it took nearly no effort to do the Splunk query below to find how many files are in use and compare against how many files are on disk)
  • No one knows what half of the files are

In our case, the www directory contained the majority of the files in the artifact.  That meant we could look at our IIS access logs to see what was actually in use and trim out the rest.

Safely Analyzing Mountains of Data

When I solve these types of problems, I find I work best with grep, sed, awk, and xargs instead of writing, testing, and debugging a larger script that covers all cases.  It was also key to have Splunk at my side: I needed to reduce 60 days’ worth of access logs (billions of lines and hundreds of gigs of data).

I let Splunk do the first-step heavy lifting.  We have a Splunk index for our IIS logs which automatically extracts the fields so I can query them.  In the search below, I select that index, use the W3SVC1 logs (the main website), filter for GET (other verbs like OPTIONS were causing false positives), filter for successful HTTP status codes (20x and 304; I especially don’t care about 404s), and then remove any irrelevant paths.  I grouped by cs_uri_stem (after forcing everything to lower-case to prevent dupes based on case differences), which gave me a list of active files and how many hits each file had.

 index=qb-prod-iis (source="D:\\IISLogs\\W3SVC1\\*" AND cs_method="GET" AND sc_status IN (20*, 304) AND NOT cs_uri_stem IN ("/foo/*", "/bar/*")) | eval cs_uri_stem=lower(cs_uri_stem) | stats count by cs_uri_stem | sort count 
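If you don't have Splunk handy, roughly the same reduction can be sketched with awk against the raw logs.  This is only a sketch: it assumes W3C extended format logs whose #Fields: directive names space-separated columns, the helper name count_active_paths and the sample log lines are made up for illustration, and the /foo/ and /bar/ path exclusions from the search are left out.

```shell
# Count hits per lower-cased path for successful GETs in W3C-format logs.
# count_active_paths is a hypothetical helper, not part of the original work.
count_active_paths() {
  awk '
    /^#Fields:/ {
      # The directive name is $1, so the Nth listed field is data column N.
      for (i = 2; i <= NF; i++) col[$i] = i - 1
      next
    }
    /^#/ { next }   # skip the other directive lines
    {
      method = $(col["cs-method"])
      status = $(col["sc-status"])
      if (method == "GET" && (status ~ /^20/ || status == "304"))
        hits[tolower($(col["cs-uri-stem"]))]++
    }
    END { for (path in hits) printf "%d %s\n", hits[path], path }
  ' "$@" | sort -k1,1n -k2
}

# Made-up sample log, for demonstration only.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
#Software: sample
#Fields: date time cs-method cs-uri-stem sc-status
2020-01-01 00:00:00 GET /CSS/Foo.css 200
2020-01-01 00:00:01 GET /css/foo.css 304
2020-01-01 00:00:02 GET /missing.gif 404
2020-01-01 00:00:03 OPTIONS /css/foo.css 200
2020-01-01 00:00:04 GET /i/a.gif 200
EOF

active_counts=$(count_active_paths "$sample_log")
printf '%s\n' "$active_counts"
```

The output is one "count path" pair per line, sorted by ascending hit count, which is the same shape as the Splunk result.  At billions of lines, Splunk's index will be far faster; this is mostly useful for spot-checking a single log file.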

I downloaded those results to a CSV file that was all of 320 KB covering 60 days’ worth of logs.  That file had lines that looked like:


I wrote this little utility script to help me use that file to:

  1. Determine which files were actively being accessed
  2. List the files in Git that were not accessed

# In this script we convert everything to lower-case to simplify things.
# We'll later need to convert back to the original case in order to
# remove dead files from Git

# Get latest Splunk export file
CSV=$(ls -t ~/Downloads/15*.csv | head -1)

# Our CDN assets are placed in /res by the build system as duplicates
# of files from elsewhere in Git (they are not checked in).  This
# requires us to fold away the CDN path to get at the real file.
# For example:
#   /res/xxx/css/foo.css is the same as /css/foo.css
# Here, get a unique list of all non-CDN assets that were accessed
grep -v /res/ $CSV | awk -F, '{print $1}' | sed 's/"//g'|tr '[:upper:]' '[:lower:]'|sort|uniq > /tmp/files.1

# Now, Get a unique list of all CDN assets that were accessed and
# trim out the /res/xxx path
egrep -E "/res/[a-z,0-9]+-" $CSV | cut -c 19- | awk -F, '{print $1}' | sed 's/"//' | tr '[:upper:]' '[:lower:]' | sort | uniq > /tmp/files.2

# Get the union of the two file sets above.  This is our list of
# active files
cat /tmp/files.1 /tmp/files.2 | sort | uniq > /tmp/

# Get a list of all files in Git
pushd ~/git/QuickBase/www
find . -type f | sed 's/^\.//' | tr '[:upper:]' '[:lower:]' | sort > /tmp/files.all

# Finally, diff the active file list with what's in Git
# Files that are being accessed but not in Git will start with "<";
# files in Git that were never accessed will start with ">"
diff /tmp/ /tmp/files.all
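As an alternative to reading diff markers, comm can split the two lists directly.  The sketch below is not what I ran; it uses made-up sample paths and hypothetical temp files (active, all) standing in for the two sorted lists built above, and forces LC_ALL=C so that sort and comm agree on ordering.

```shell
work=$(mktemp -d)

# Made-up sample data standing in for the active list and the Git list.
printf '%s\n' /css/foo.css /i/a.gif /js/gone.js \
  | LC_ALL=C sort -u > "$work/active"
printf '%s\n' /css/foo.css /i/a.gif /i/b.gif /old/x.html \
  | LC_ALL=C sort -u > "$work/all"

# Lines only in the second file: in Git but never accessed (removal candidates).
unused=$(LC_ALL=C comm -13 "$work/active" "$work/all")

# Lines only in the first file: accessed but not in Git.
missing=$(LC_ALL=C comm -23 "$work/active" "$work/all")

printf 'unused:\n%s\n' "$unused"
printf 'missing:\n%s\n' "$missing"
```

Because comm prints bare paths with no prefix, the removal candidates can be piped straight into grep or xargs without stripping anything first.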

This was a large list; out of the ~15,000 files in www checked into Git, only ~1,500 were in use.  The list itself was a cognitive load problem!  I also didn’t want my GitHub PRs to be so large that they didn’t load or were impossible to review.  I categorized the list into about 10 parts and created a JIRA story for each.  Using grep, I could execute on various pieces of the list.  For example, this would give me the files in /i/ that started with the letters a through g.

./ | grep '> /i/' | sed 's#> /i/##' | egrep -e '^[a-g].*' > /tmp/foo
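To carve out other slices the same way, the pipeline can be wrapped in a small function.  This is a sketch under assumptions: slice_unused is a hypothetical helper, the diff output is assumed to be saved in a file, and the sample lines are made up.

```shell
# slice_unused FILE DIR RANGE
# Print the unaccessed files under /DIR/ whose names start with a character
# in RANGE, with the "> /DIR/" prefix stripped.  Hypothetical helper.
slice_unused() {
  grep "^> /$2/" "$1" | sed "s#^> /$2/##" | grep -E "^[$3]"
}

# Made-up diff output, for demonstration only.
diff_out=$(mktemp)
cat > "$diff_out" <<'EOF'
> /css/old.css
> /i/arrow.gif
> /i/grid.png
> /i/zebra.gif
< /js/gone.js
EOF

slice=$(slice_unused "$diff_out" i a-g)
printf '%s\n' "$slice"
```

Calling it again with a different directory or range (for example `slice_unused "$diff_out" i h-z`) gives the next reviewable chunk.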

I would use the following to double-check that I’m not removing something important.  It uses the contents of /tmp/foo (one file per line) by transforming it into a single line regular expression.  So if the file contained


The result of the expression inside the backticks (in the code below) would be

egrep -r `cat /tmp/foo | xargs | sed 's/ /|/g'` .
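To make the transformation concrete, here is a small sketch with made-up file names and a throwaway source tree (all of the names and paths are hypothetical).  xargs with no command echoes its input on one line, and sed then turns the spaces into regex alternation.

```shell
work=$(mktemp -d)

# Made-up candidate list, one file name per line.
printf '%s\n' arrow.gif grid.png > "$work/foo"

# Join the lines with spaces, then turn each space into "|".
pattern=$(xargs < "$work/foo" | sed 's/ /|/g')
printf 'pattern: %s\n' "$pattern"

# Throwaway source tree: one file still references a candidate.
mkdir "$work/src"
echo '<img src="/i/arrow.gif">' > "$work/src/page.html"
echo 'nothing to see here' > "$work/src/other.html"

# List the source files that mention any candidate.
referenced=$(grep -rlE "$pattern" "$work/src")
printf 'still referenced in: %s\n' "$referenced"
```

The dots in the file names are regex wildcards here, so the check can over-match slightly.  For a safety check that errs toward keeping files, that seemed acceptable; file names containing spaces would need different handling.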

Removing the Files

When I was ready to start removing files, I needed to convert the file case back to what’s on disk, so I used the power of xargs to take the original list and run grep (once per file) to find the original entry in /tmp/files.all:

cat /tmp/foo | xargs -xn1 -I % egrep -Ei "^%$" /tmp/files.all > /tmp/foo2
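A single grep can do the same lookup without spawning one process per file.  This sketch assumes a listing that still has its original case (the files.orig name is hypothetical, as are the sample paths) and that the candidate list holds full lower-cased paths.  -F treats each candidate as a fixed string, -x requires a whole-line match, and -i ignores case.

```shell
work=$(mktemp -d)

# Made-up listing with the on-disk case preserved.
printf '%s\n' /i/Arrow.GIF /i/grid.png /css/Old.css > "$work/files.orig"

# Made-up lower-cased removal candidates.
printf '%s\n' /i/arrow.gif /css/old.css > "$work/candidates"

# Whole-line, fixed-string, case-insensitive match of every candidate.
restored=$(grep -Fxi -f "$work/candidates" "$work/files.orig")
printf '%s\n' "$restored"
```

Fixed-string matching also sidesteps the problem of dots or other regex metacharacters in file names matching more than intended.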

And now I can use xargs again to automate the git command:

git rm `cat /tmp/foo2 | xargs`
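Before deleting thousands of files, a dry run is cheap insurance.  The sketch below runs against a throwaway repository with made-up files (nothing here touches a real repo), and it assumes the list entries carry a leading slash relative to the repository root, which is stripped so git treats them as relative paths.

```shell
repo=$(mktemp -d)
list=$(mktemp)

# Throwaway repository with two made-up files.
git init -q "$repo"
mkdir "$repo/i"
echo a > "$repo/i/Arrow.gif"
echo b > "$repo/i/logo.gif"
git -C "$repo" add .
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
  commit -q -m "sample commit"

# Made-up removal list, one path per line.
printf '%s\n' /i/Arrow.gif > "$list"

# Dry run first: prints what would be removed without touching anything.
planned=$(sed 's#^/##' "$list" | xargs git -C "$repo" rm --dry-run --)
printf '%s\n' "$planned"

# Then the real removal, and a look at what is still tracked.
sed 's#^/##' "$list" | xargs git -C "$repo" rm -q --
remaining=$(git -C "$repo" ls-files)
printf 'still tracked: %s\n' "$remaining"
```

Feeding the list through xargs rather than backticks also lets xargs split a very long list across several git invocations instead of hitting the shell's argument-length limit.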


The artifact has gone from 256 MB to 174 MB but more importantly it’s gone from 29,000 files to 12,600.  This means:

  • Downloads are 32% faster (previous baselines vary by location but were anywhere from 20 to 120 seconds)
  • Unzips are 50% faster (baselines also vary but were 25 seconds in prod and are now 12 seconds)
  • The CI build is 1 minute faster (zip/unzip, Git checkout, and Nexus upload speedups)
  • Our local workspaces have gone from ~32,500 files to ~21,000 files
  • The source is back to being browsable in GitHub because it can now do complete directory listings


There are intangible benefits such as reduced cognitive load, a smaller attack surface, reduced complexity, and simpler future projects where we break apart this old monolith.  But there are hard numbers, too!

For the purposes of this exercise, assume $200k annually for a fully-loaded SWE, which is about $96/hour (200000 / 52 / 40).

Using $96/hour, the average team saves 60 person-minutes/day (6 people * 5 builds per day per team * (1 minute per build + 1 minute for download & install)), or about $2,000/mo or $24k annually.
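For anyone who wants to plug in their own numbers, the arithmetic can be sketched in shell.  The figures come from the text above except working_days, which is my assumption (about 21 working days per month) to get from a daily to a monthly figure; integer division rounds the hourly rate down to $96.

```shell
salary=200000                                # fully-loaded annual cost
rate=$(( salary / 52 / 40 ))                 # dollars per hour (integer)
minutes_per_day=$(( 6 * 5 * (1 + 1) ))       # people * builds * minutes saved
daily=$(( rate * minutes_per_day / 60 ))     # dollars saved per team per day
working_days=21                              # assumption, not from the post
monthly=$(( daily * working_days ))
annual=$(( monthly * 12 ))

echo "rate:    \$$rate/hour"
echo "saved:   $minutes_per_day person-minutes/day"
echo "monthly: \$$monthly"
echo "annual:  \$$annual"
```

With these inputs the monthly figure comes out just over $2,000 and the annual figure just over $24k, in line with the rounded numbers above.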

How Does This Help You?

Before I answer that, I’ll backfill a little more history.  Cleaning up the www directory has been something we’ve talked about for many years.  And everyone was literally afraid to do it.  I kept saying I’d just do it, or help someone do it.  But it was never a “funded” project.

One of my colleagues and fellow blog authors, Ashish, in concert with others (including our chief architect), decided to track all the things rattling around in our heads as new JIRA stories in a “tech debt” backlog project so we could really start to see what’s out there.  This was one item on that list.  Having it as a real item in JIRA was the first step towards making it possible.

After I’d completed one of my stories, I was able to take that now-tangible tech debt story and break it down into something manageable.  After spending maybe an hour, I was able to quantify the problem (15,000 files, only 1,500 of which were in use) which lent credence to the effort.  As I mentioned, I broke it down into 10 stories, estimated them, and was able to show that this was an achievable goal.

No one, and I mean no one, is unhappy that I spent time doing this.  We often spend a lot of time not even trying to do what’s right because we believe we’re not authorized.  It’s crucial that we, as engineers, work closely with the rest of the business to collaborate, build relationships, and increase trust so that we can have the open conversations that lead to working on stories that are difficult to tie back to customer value.  But remember, there’s a lot of customer value in being efficient!

My advice:

  • Just spend a little time quantifying a problem and its solution so you can have an informed discussion with your team about working on it.
  • If you’re passionate about something, use that to your advantage!  Instead of complaining about a problem, use your passion to solve the heck out of it!  People will thank you for it.
  • Break the problem down into manageable units.
    • Get those units into the backlog as stories and talk about them during your grooming sessions.  Make them real.
  • Don’t boil the ocean; keep the problem clear and defined.
  • Use the 80/20 rule.  I didn’t clean everything up.  I cleaned up the easy stuff.  And that still has incredible impact.
  • Find allies / supporters on your team and other teams.  They increase your strength.
  • Use the right tools for the job.  It would have been easy for me to build a long, complex script but in the end, it’s not necessary.  It would have taken longer and would have been more error prone.  Tackling the problem in groups helped me stay focused and make progress in chunks.