2-Minute Builds

Background

The core of Quick Base is a large Microsoft Visual C++ project (technically it’s a solution with multiple projects).  Our build/deploy/test/upload-artifact cycle was 90 minutes.  It was automated.  Not bad for a 20-year-old code base, right?  Nah, we can do better.  We can do it in 2 minutes!

At least, that’s how we pitched it.  You can imagine the response.  Besides the obvious “that’s impossible!” sentiments, we were asked “Why?  It takes us longer than that to fully QA the resulting artifact, so what’s the point?”  And so, our journey began.

If there’s one thing I’ve learned over the years, it’s that the hard part isn’t the technology; it’s the human equation.  Here, we’d lived with a very slow build for years (because the belief was that we’d deprecate the old stack in favor of the re-architecture that was in progress).  Once we’d refocused our efforts on iterating on the existing stack, we knew things had to change.  We were operating using Agile methodologies (both Scrum and Kanban are in use), but the tools weren’t properly supporting us.  A few engineers close to the build knew there was low-hanging fruit; what better way to demonstrate “yes, we can!” and gather excitement than to make significant progress with relatively little effort?

Organizationally, we were now better suited to support these kinds of improvements.  We have a Site Reliability Engineering team that consists of both Ops and Dev.  Together, we started to break down the problem.  We deconstructed the long Jenkins job into this diagram:

[Picture1: the original 90-minute Jenkins job broken down into its stages]

Now we knew where to focus for the biggest gains.

Our First Big Win

The “Tools Nexus Deploy” step was literally just Maven uploading a 250-MB zip file from servers in our Cambridge, MA office to our Nexus server in AWS (Oregon).  It definitely shouldn’t take 22 minutes to move a file that size; we have a very fat Internet pipe in the office.  We did packet traces with Wireshark and ran other network tests to try to determine the cause.  We didn’t uncover anything.

So, let’s break down the problem and isolate the issue.  Is the network in the office OK?  AWS?  Is the Nexus server slow?  Here’s some of what we did:

  • Download data directly from Nexus using wget (remove Maven from the equation)
  • Upload directly to Nexus using wget (ditto)
  • Do the above from the office servers (is it the server network?)
  • Do the above from office workstations (is it the entire Cambridge network?)
  • Do the above from EC2 instances in AWS (Oregon) (remove Cambridge from the equation)
  • Try a (much) newer, unhardened version of Windows (to rule out TCP windowing and other high-latency tuning issues introduced by hardening)
  • Do the above from Linux instead of Windows (remove Windows from the equation)

When we switched from Windows to Linux, we stood back in disbelief.  The upload now took 90 seconds instead of 22 minutes.  We found that Maven on Windows has extremely poor network performance.  As a stopgap, we switched to running the upload with Maven on Linux by splitting the build job so that a separate upload job ran on the Jenkins master node (which runs Linux).
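To give a flavor of what “remove Maven from the equation” looked like, here’s a minimal sketch (in Java, not the wget commands we actually ran, and not our real tooling) that times a raw HTTP PUT of the artifact straight to a Nexus repository.  The repository URL, the NEXUS_BASIC_AUTH environment variable, and the artifact path are placeholders.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Path;
    import java.time.Duration;

    // Times a single raw PUT of the build artifact to a Nexus repository,
    // with Maven taken out of the picture. URL, credentials, and path are placeholders.
    public class RawUploadTimer {
        public static void main(String[] args) throws Exception {
            Path artifact = Path.of("build/tools-bundle.zip");  // stand-in for the 250-MB zip
            URI target = URI.create(
                    "https://nexus.example.com/repository/tools/tools-bundle.zip");  // placeholder URL

            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(30))
                    .build();

            HttpRequest request = HttpRequest.newBuilder(target)
                    // NEXUS_BASIC_AUTH is a placeholder for base64("user:password")
                    .header("Authorization", "Basic " + System.getenv("NEXUS_BASIC_AUTH"))
                    .PUT(HttpRequest.BodyPublishers.ofFile(artifact))
                    .build();

            long start = System.nanoTime();
            HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
            long seconds = (System.nanoTime() - start) / 1_000_000_000L;

            System.out.printf("HTTP %d; upload took %d seconds%n", response.statusCode(), seconds);
        }
    }

The same idea, repeated from office servers, office workstations, EC2 instances, Windows, and Linux, is how we narrowed the problem down step by step.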

Our Second Big Win

The next thing we tackled was the “PD CI-Test” group.  These are TestNG Java tests that exercise the Quick Base API.  We found one simple area to improve: adding test data using bulk import instead of per-record inserts.  Since this was in setup code that runs over and over, the several-second savings added up to … drum roll … 18 minutes!
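For illustration, here’s a rough sketch of the before-and-after shape of that setup code.  It is not our actual test code: the table URL, field IDs, clist, auth handling, and the API_AddRecord / API_ImportFromCSV payloads are simplified placeholders standing in for “per-record insert” versus “one bulk import.”

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;
    import java.util.stream.Collectors;

    import org.testng.annotations.BeforeClass;

    // Illustrative only: seeds a test table either one record at a time or via a single
    // bulk CSV import. The table DBID, field IDs (6 and 7), and auth are placeholders.
    public class SeedDataSetup {
        private static final URI TABLE_URI =
                URI.create("https://example.quickbase.com/db/abcd1234");  // placeholder table
        private final HttpClient client = HttpClient.newHttpClient();

        // BEFORE: one HTTP round trip per record; multiplied across every setup run,
        // that per-request latency is where the 18 minutes came from.
        public void seedOneRecordAtATime(List<String[]> rows) throws Exception {
            for (String[] row : rows) {
                postXml("API_AddRecord",
                        "<field fid=\"6\">" + row[0] + "</field><field fid=\"7\">" + row[1] + "</field>");
            }
        }

        // AFTER: one bulk import call for the whole data set.
        @BeforeClass
        public void seedWithBulkImport() throws Exception {
            String csv = seedRows().stream()
                    .map(row -> String.join(",", row))
                    .collect(Collectors.joining("\n"));
            postXml("API_ImportFromCSV",
                    "<records_csv><![CDATA[" + csv + "]]></records_csv><clist>6.7</clist>");
        }

        private void postXml(String action, String body) throws Exception {
            HttpRequest request = HttpRequest.newBuilder(TABLE_URI)
                    .header("Content-Type", "application/xml")
                    .header("QUICKBASE-ACTION", action)  // placeholder for how the action is passed
                    .POST(HttpRequest.BodyPublishers.ofString("<qdbapi>" + body + "</qdbapi>"))
                    .build();
            client.send(request, HttpResponse.BodyHandlers.discarding());
        }

        private List<String[]> seedRows() {
            // Tiny placeholder data set; the real tests seed far more rows.
            return List.of(new String[] {"Widget", "42"}, new String[] {"Gadget", "7"});
        }
    }

The exact calls matter less than the shape of the change: collapsing a loop of per-record requests into a single bulk request turned several seconds of setup per run into something negligible.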

Number Three

There was still plenty of room for improvement in the “PD CI-Test” group, and we found one more quick win.  After encountering the Maven slowness, we started to question the speed of Ant on Windows.  The server sat at only 20% CPU while the tests ran, so we suspected something wasn’t moving as fast as it could.  Switching to invoking our tests via Gradle instead of Ant saved us another 12 minutes!

Assessing Where We Are Now

After 2 months, our diagram looked like this:

[Picture2: the same build breakdown after 2 months of improvements]

You can bet that was exciting!  Now we have momentum, and people believe it can be done.

We’ve continued to make further improvements, such as moving from the aging hardware in the Cambridge server room to AWS using the Jenkins EC2 plugin, and then taking advantage of the C5 instance types (which boot our Windows AMI in 4 minutes instead of 10) by building our own version of the plugin and submitting a PR for it here.  Build times currently average 26 minutes, and we’ve got items on the roadmap (including moving to Jenkins Pipeline so we can easily take advantage of parallelism) that should get us closer to 15.  After that, we run into limitations of the MSVC++ linker, which does a few things single-threaded; one of our projects is quite large and produces a single binary.  The next step there is to break that project up (e.g., into libraries).  That will take more effort, so we’ve left it for last.

Will we ever get to 2 minutes?  Who’s to say?  The purpose of setting the goal that low was to fire up people’s imaginations.  And it has.

Once Upon a Time …

Quick Base is the platform that businesses use to quickly turn ideas about better ways to work into apps that make them more efficient, informed, and productive.  It has been around for nearly 20 years.  It’s a successful SaaS offering serving billions of requests per month.  It’s primarily written in MSVC++ running on Windows.  If you’ve been in the software industry long enough, you can imagine some of the tech debt acquired over its lifetime.  It makes very efficient use of server hardware, but it has grown to the point where it needs to let go of the old ways of doing things (which were appropriate “back then”).  Namely, there are monoliths to break down, automated test harnesses to build, code to rewrite to be testable, and build systems to re-think.

This story begins about 5 years ago, when we started having the re-architecture discussions that most software companies do as they start having “success-based problems.”  At that point, Quick Base was essentially 100% C++ on Windows, with ever-increasing success among companies that wanted to store more data in their apps, have more users accessing their apps, and create more complicated apps than we could ever imagine.  Internally, we constantly refer to the performance characteristics of an application ecosystem as the combination of those 3 things: size, concurrency/activity, and complexity.  That means there’s no single lever we can pull to improve how apps scale and succeed on our platform.

As a way to better meet these challenges, we became hyper-focused on solving for developer productivity.  That included looking at which languages the talent pool was most familiar with, which languages had strong testability characteristics and support, and which languages would support an evolution: we wanted both code bases (C++ and the candidate) to be able to, for example, share a connection with SQL Server without having to manage what flags mean what in two places and risk getting that wrong.  C#/.NET was an obvious choice and became the winner … at least for a short while.  We did build some things in .NET (and continue to do so today; you’ll read more about that in later posts), but this approach didn’t last long.

The belief that consolidating technologies would give us better economies of scale (software contracts, support, staffing, you name it) was overwhelming and ultimately sent us down the wrong path.  We started building on technologies that had integration challenges with the existing platform, and we couldn’t take advantage of our existing SDLC (think: build/test/deploy as well as the IDE).

And then we fell into the trap that many software companies do, as our approach evolved into a complete re-architecture.  We believed the only viable way to go from old to new was to start over and migrate our customers.  We believed that incrementally breaking down the monolith was not possible.  So, we spawned a small Scrum team to do a PoC, which turned into 2 teams, then 3, and then a business decision to put most of our engineering effort into the re-architecture in order to focus and just get it done and behind us.

All along the way, there was that little voice inside that kept telling us this was wrong.  It occasionally came out during moments of frustration, or over lunch, or over a drink down the street.  But we succumbed to the inertia of the high-speed train.  We further exacerbated the issue by materially changing execution strategies at least 3-4 times because we discovered how difficult it was to recreate even an MVP of Quick Base.  After 4 years of producing something that ultimately didn’t deliver value to our customers, we worked up the courage to have a heart-to-heart with ourselves and canceled the project.  Why?  We obviously weren’t delivering value yet, and (once we were honest with ourselves) we knew we wouldn’t for a while, and that was too long.  It’s excruciatingly hard to abandon something you’ve poured years of your heart into and that feels “so close to shipping” (but in reality, it’s not).  It feels like you’re abandoning a child.  We found that belligerently asking “will this deliver customer value (in a timely fashion)?” gave us the strength and clarity to make the hard decision.

Did we all come to work one day and just stop working on the re-architecture and begin working on the existing platform?  For many reasons, no.  We needed to go back to the drawing board with our roadmap, and we had to somehow shift our development organization from 10% C++ / 90% Java/Node.js to 75% C++ / 25% Java/Node.js.  That’s right: we are continuing to work with the newer technologies.  We didn’t throw everything away; in fact, we kept a lot of it.  We discovered through our journey that the fastest and most sustainable way to deliver more value to our customers was to iterate on the technology we have and tactically augment it with the new services and paradigms we’d originally built to serve the re-architecture.

Just as mistakes are our biggest teachers, so was the re-architecture.  We didn’t completely waste 4 years of our lives and money.  In fact, we now know more about our customers, ourselves, the market, the technology, our own “secret sauce,” how to build & test & deploy software, and much more.  We have a new strategy that allows us to deliver value to our customers on an ongoing basis (starting right yesterday) while making meaningful progress on our software architecture as well as our build/test/deploy systems.  Personally, I learned at the highest rate I’ve ever learned during these last several years.  And that learning (along with some of the systems built during the re-architecture) is serving our new approach well.  Many of the upcoming posts will discuss things we built and learned over the last 4 years.