Building a robust Application Audit capability using AWS services (and a few others)

Have you ever built a house using Lego blocks of different sizes and colors? Have you started with the foundation only to rip it out and start again because it was too wobbly as the house went up? That’s what happened when Quick Base’s Platform Governance team started building an Application Audit capability over the past few months.

Read on to uncover what the challenges were and what we learned along the way…

The Foundation

Our work started with putting the main blocks together. Our technical decisions needed to meet the following parameters and architectural guidelines:

  • We wanted to process and store audit data in AWS. We preferred AWS-managed services over AWS-hosted or homegrown services.
  • Any service that audit data containing user data passes through must be HIPAA compliant
  • Cost of the solution is known and acceptable
  • The infrastructure needs to handle the following loads:
    • Per day: ~190 million events -> 96 GB
    • Per year: ~69 billion events -> 35 TB

First, we needed to capture the events related to user requests. The majority of our user requests get processed by a C++ application hosted on Windows machines in a data center.

We piggybacked on an already existing logging infrastructure. The existing mechanism logged events to disk using an asynchronous logger process. We enhanced the logger service to also capture auditable events and pipe them to an event log file.
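For illustration, an auditable event in that log file might look like the newline-delimited JSON sketched below. The field names and the Python are hypothetical (the real logger is a C++ service); the point is simply that each event is one self-describing line that a forwarder can tail.

import json
import time

def write_audit_event(log_file, actor, action, app_id):
    # One audit event per line, so a log forwarder can tail the file and
    # ship events downstream without custom parsing. Field names are
    # illustrative, not our actual schema.
    event = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "actor": actor,      # user who performed the action
        "action": action,    # e.g. "user_login" or "create_table"
        "app_id": app_id,    # the Quick Base application affected
    }
    log_file.write(json.dumps(event) + "\n")

with open("audit_events.log", "a") as events:
    write_audit_event(events, actor="jdoe", action="user_login", app_id="app123")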

[Diagram: capturing auditable events via the existing logger service]

Lesson #1: Estimate the load as early as possible; it needs to be factored in and tested when evaluating options and determining cost

Lesson #2: Re-use existing mechanisms as much as you can

We investigated different solutions for log forwarding and aggregation. We looked at the following options:

  • Apache Flume: Eliminated as it requires significant effort to host
  • Kinesis Agents: Eliminated as it doesn’t work on Windows
  • nsqio: Eliminated as it requires significant developer overhead
  • Native Kafka: Eliminated as it requires significant effort to host
  • Logstash: Eliminated as its strength lies in collecting a variety of data sources, which is different from our usage
  • FluentD

We decided to go with FluentD because it is:

  • Easy to install and configure
  • Able to handle the loads we are expecting. We tested FluentD with the estimated loads.
  • Lightweight. It consumes less than 1% of CPU and 40MB of RAM when processing 1000 audit events per second.

The next decision we needed to make was where to persist the audit events and, as a corollary, how to query them. Per the architectural guidelines above, we knew we wanted to persist these in AWS. The storage options we explored were:

  • Glacier: Eliminated because it didn’t fit our model of frequent access of data.
  • DynamoDB: Eliminated due to its 400 KB item size limit.
  • Redshift: Eliminated due to its 64 KB column size limit.
  • Aurora RDS
  • S3 / Athena

Two contenders remained:

  • Aurora RDS: Store audit data in a relational database and use SQL to query it.
  • S3 / Athena: Store audit data in S3 buckets and use Athena to query it.

Both solutions were comparable based on our criteria. In the end, we went with S3 / Athena because it aligns with the modern best practice of separating storage from the query engine. This allows us to swap out either one later if necessary.

We went with S3 / Athena knowing that Athena was not yet HIPAA compliant. We relied on assurances from our AWS rep that Athena would be HIPAA certified before our launch. The certification came a quarter later than we were originally told. Luckily, our timeline was not impacted, but we cut it really close.

We also added Kinesis Streams to move audit events from FluentD toward S3, with Firehose as the buffering mechanism that batches the events before writing them to S3. Our audit events data lake was taking shape.
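Conceptually, each event lands on the stream the way the sketch below does. In production the publishing is done by FluentD’s Kinesis output plugin rather than our own code, and the stream name and partition-key choice here are assumptions for illustration.

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-west-2")

def publish_audit_event(event):
    # The partition key determines which shard an event lands on; keying by
    # customer keeps a customer's events together. "audit-events" is a
    # placeholder stream name.
    kinesis.put_record(
        StreamName="audit-events",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["customer_id"],
    )

publish_audit_event({"customer_id": "acme", "action": "user_login"})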

[Diagram: audit pipeline from FluentD through Kinesis and Firehose to S3/Athena]

The first floor gets stood up

The first type of audit events we delivered were user actions with no user data. Examples of user actions include:

  • User login / logout
  • User creates a Quick Base application
  • User creates a table in an application

We started with these events to avoid worrying about HIPAA compliant services and complex audit event payloads at the outset, while still delivering significant customer value.

We decided to partition the audit events in S3 hourly. This meant that for each hour during which events occurred, there would be a corresponding folder (key prefix) in S3.

Strengthen the foundation while building more floors

After we delivered the first slice of customer value, we focused on improving the performance of querying audit events.

We knew Athena queries were slow, and we researched options to improve their performance. The one challenge we faced was that Athena query execution is a black box. Unlike a traditional RDBMS, there is no peeking into a query’s execution plan in order to optimize it.

We looked at using different file formats and compression schemes, based on AWS recommendations for how to improve Athena performance:

  • JSON vs. Parquet vs. ORC: Since we query almost all the columns of the event data, we didn’t see significant performance improvements from changing the file format.
  • bzip2 vs. gzip: We decided to go with bzip2 as it was slightly faster than gzip.

As another attempt to improve Athena performance, we partitioned audit events per customer and per hour. This meant getting rid of Firehose, which doesn’t support custom partitioning. On top of that, Firehose was still not HIPAA compliant, despite early indications from our AWS rep that it would be. We replaced Firehose with a homegrown Lambda function, which resulted in a multitude of small files.
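A minimal sketch of what such a partitioning Lambda could look like is below. The bucket name, key layout, and field names are assumptions rather than our production code, but it shows the idea: read the Kinesis batch, group by customer and hour, and write one object per group.

import base64
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "audit-events-data-lake"  # placeholder bucket name

def handler(event, context):
    # Group the incoming Kinesis batch into keys shaped like
    #   customer=acme/2018/10/04/13/<request-id>.json
    batches = {}
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        ts = payload["timestamp"]                     # e.g. "2018-10-04T13:41:11Z"
        hour = ts[:13].replace("-", "/").replace("T", "/")
        prefix = "customer={}/{}".format(payload["customer_id"], hour)
        batches.setdefault(prefix, []).append(payload)

    for prefix, events in batches.items():
        body = "\n".join(json.dumps(e) for e in events)
        s3.put_object(
            Bucket=BUCKET,
            Key="{}/{}.json".format(prefix, context.aws_request_id),
            Body=body.encode("utf-8"),
        )

Every invocation writes at least one object per customer per hour it sees, which is exactly where the flood of small files came from.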

We then decided to further aggregate the audit events, compacting the small files above into one file per customer per hour. This is done by an EMR job triggered every hour by a Lambda function (sketched after the list below). We went with EMR rather than building our own aggregation script because we needed a highly performant, distributed ETL mechanism. We could have gone with Glue, which would have been the slightly more straightforward option, but we decided against it for two reasons:

  • When testing with Glue, we ran into scaling issues.
  • At the time of this writing, Glue is still not HIPAA compliant.
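The hourly trigger itself is small: a scheduled Lambda that submits a compaction step to a long-running EMR cluster. The sketch below assumes a Spark script location and a cluster ID that are placeholders, not our actual job.

from datetime import datetime, timedelta
import boto3

emr = boto3.client("emr")

def handler(event, context):
    # Compact the previous hour's small files into one object per customer.
    prev_hour = (datetime.utcnow() - timedelta(hours=1)).strftime("%Y/%m/%d/%H")
    emr.add_job_flow_steps(
        JobFlowId="j-PLACEHOLDER",                         # long-running cluster id
        Steps=[{
            "Name": "compact-audit-events",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit",
                         "s3://audit-etl/compact_hourly.py",  # placeholder script
                         "--hour", prev_hour],
            },
        }],
    )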

After doing this work, we found that creating sub-folders per customer per hour caused additional performance problems in Athena. So at the end of each day we aggregate the events further, into one file per customer per day, again using an EMR ETL job.

[Diagram: audit pipeline with the partitioning Lambda and hourly/daily EMR aggregation]

Lesson #3: A lot of small files cause a series of performance problems in Athena and EMR.

Lesson #4: It is risky to use new-ish AWS services because they come with limitations and performance profiles that are not well understood.

Enable others to maintain and extend

The team is now focused on enabling other teams to add audit events using the Audit infrastructure. Everything is in place to make this effort as simple as identifying what to capture (including the event payload) and ensuring the load is well understood.

Overall, we learned that it makes business sense to work on the infrastructure while also delivering customer value on top of the infrastructure – as long as it is deliberate. This eliminates long research projects with no customer delivery.

Et voilà….

[Diagram: the finished Application Audit architecture]

Serverless Application Model – First Impressions

If you’ve been paying attention, you’ll know that deploying software applications is evolving from bare metal servers -> virtual machines -> containers -> serverless. Here at Quick Base, we have a handful of services now running in containers in the cloud via AWS ECS. We were able to put together an automated software delivery life-cycle (SDLC) for these services in order to build, test, and deploy them all in the same way. While we were already using AWS Lambda for some behind-the-scenes orchestration, we were intentionally choosing to not deploy customer-facing serverless services because we had yet to find a good solution for the serverless SDLC.

We wanted a serverless framework to be able to support the following:

  • Deployment via CloudFormation. All of our other infrastructure and services already use CloudFormation, so we don’t want to introduce another deployment tool into the mix
  • Local development and debugging.
  • Deployment to multiple environments across multiple AWS accounts.

I was made aware of a whitespace project underway on my team to deploy an internal-facing dashboard showing the status of our many environments. The back end of this dashboard was just some API calls that would poll the information and return it to a Node front end. A simple Lambda seemed like the right solution for the API, and I was aware of the AWS Serverless Application Model (SAM), so I decided to attempt to use it to deploy the back end.

Deployment

The first benefit I noticed about SAM was the template format.

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
    HelloWorldFunction:
        Type: AWS::Serverless::Function
        Properties:
            CodeUri: hello_world/
            Handler: app.lambda_handler
            Runtime: nodejs8.10
            Events:
                HelloWorld:
                    Type: Api
                    Properties:
                        Path: /hello
                        Method: get

With this template, you can deploy a Lambda function and an API Gateway. Creating these same resources with a standard CloudFormation template would be a much larger effort: you would have to either embed the source code in the template (where you’re limited to 4,096 characters), or use a task runner to package up your function and libraries, push them to S3, and then point your CloudFormation template at that S3 object.

The AWS CLI makes deployment of a SAM application easy. With 2 commands, you can package up your function (either standalone, or a zip with any libraries), and deploy it.

aws cloudformation package --template-file template.yaml --output-template-file packaged.yaml --s3-bucket mybucket
aws cloudformation deploy --template-file packaged.yaml --stack-name mystack

Local Development and Debugging

The next thing I found was the SAM CLI. It allows you to run your function locally in a Docker container that simulates the Lambda environment. It can also simulate an API Gateway, and it lets you generate and send simulated events to trigger your function from sources such as S3, DynamoDB, Kinesis, etc. You can also attach a debugger to your function to leverage the debugging features of your IDE.

For example, to test the function behind API Gateway:

$ sam local start-api
2018-08-21 12:31:34 Mounting HelloWorldFunction at http://127.0.0.1:3000/hello [GET]
2018-08-21 12:31:34 You can now browse to the above endpoints to invoke your functions. You do not need to restart/reload SAM CLI while working on your functions changes will be reflected instantly/automatically. You only need to restart SAM CLI if you update your AWS SAM template
2018-08-21 12:31:34  * Running on http://127.0.0.1:3000/ (Press CTRL+C to quit)
2018-08-21 12:31:36 Invoking app.lambda_handler (nodejs8.10)
2018-08-21 12:31:36 Starting new HTTP connection (1): 169.254.169.254

Fetching lambci/lambda:nodejs8.10 Docker container image......
2018-08-21 12:31:38 Mounting ~/dev/scratch/sam-app/hello_world as /var/task:ro inside runtime container

With this running, I’m able to browse to http://127.0.0.1:3000/hello, which executes the function and shows me the output.

START RequestId: c023d97e-b667-1a95-38c0-9e2c71ffc9dc Version: $LATEST
END RequestId: c023d97e-b667-1a95-38c0-9e2c71ffc9dc
REPORT RequestId: c023d97e-b667-1a95-38c0-9e2c71ffc9dc  Duration: 355.89 ms     Billed Duration: 400 ms Memory Size: 128 MB     Max Memory Used: 35 MB
2018-08-21 12:31:39 No Content-Type given. Defaulting to 'application/json'.
2018-08-21 12:31:39 127.0.0.1 - - [21/Aug/2018 12:31:39] "GET /hello HTTP/1.1" 200 -

I’m also able to make a change to the function in my IDE, and just refresh my browser to reflect those changes (hot reloads). Of course, I also run automated tests against this solution locally.
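Those local tests can be as simple as anything that hits the local endpoint. Here’s a throwaway example in Python; it assumes that sam local start-api is running as shown above and that the sample function returns a JSON body with a message field.

import json
import unittest
from urllib.request import urlopen

class HelloWorldApiTest(unittest.TestCase):
    def test_hello_endpoint_returns_200_and_json(self):
        # Requires `sam local start-api` to be listening on port 3000.
        with urlopen("http://127.0.0.1:3000/hello") as response:
            self.assertEqual(response.status, 200)
            body = json.loads(response.read())
            self.assertIn("message", body)

if __name__ == "__main__":
    unittest.main()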

Deployment to multiple environments across multiple AWS accounts

Now that we had our internal dashboard API running via SAM, I wanted to take it a step further. We needed a solution to send our AWS Load Balancer logs to our Splunk Cloud instance. Splunk provides some great solutions to accomplish this, and they are essentially one-click deployments. We had to make some minor modifications to the provided solution due to security concerns, which was simple enough. I also wanted to see how we could automate the deployment of this SAM application to all of our AWS accounts. We are currently using Jenkins pipelines to deploy our AWS infrastructure and services, so I was easily able to create a new Jenkinsfile which grabbed the SAM application from GitHub, ran tests, packaged it up, and deployed it to each of our accounts.

The cross-account deployment of the application is a stage in a Jenkins pipeline:


stage ("Deploy") {
    steps {
        unstash name: "package"
        getAWSFunctions()
        script {
            def accounts = ["${AWS_ACCOUNT1}": "splunk-index1",
                            "${AWS_ACCOUNT2}": "splunk-index2",
                            "${AWS_ACCOUNT3}": "splunk-index3",
                            "${AWS_ACCOUNT4}": "splunk-index4"]          

            accounts.each{ accountId, splunkindex ->
                sh ". bin/functions;with_role arn:aws:iam::${accountId}:role/CrossAccountRole-us-west-2 aws cloudformation deploy --template-file serverless/splunk-hec-forwarder/template.output.yaml --stack-name splunk-hec-forwarder --parameter-overrides SplunkIndex=${splunkindex} --capabilities CAPABILITY_IAM --no-fail-on-empty-changeset"
            }
        }
    }
}

Final Thoughts

I believe that SAM could be a solution for our business to successfully develop, test, and deploy serverless services to our customers. At this point, only the Site Reliability Team is using it, but I’m eager to have a service team adopt this model to see how it fits (or doesn’t) in our customer facing services.

Death by 30,000 Files

One of my colleagues humorously referred to this post as “how to delete a thousand files.”  We all had a good laugh.  And it only served to highlight how something that seems so simple is complex enough to deserve a detailed post about it!  Fear of the unknown is a powerful motivator.  Let’s break through that together …

Who Cares?

Our largest build artifact was 256 MB and 30,000 files.  Nothing in today’s world.  No big deal, right?  Sure, without context … but let’s put some color on this black & white picture:  20,000 – 66% – of those files were cruft. That eats up small amounts of time in lots of ways that really add up:

  • Cloning the Git repo and checking out the branch (CI builds)
  • Zipping the files to create the artifact
  • Unzipping the files to deploy
  • Uploading / downloading the artifact to/from Nexus
  • At 30+ builds a day, that’s 218 GB per month to store in Nexus

And even if you ignore all of that, reducing the cognitive load of your project will increase the velocity and quality of your teams.

Cheaply Determining if You Have a Problem

So, how did we know we had 20,000 crufty files without authorizing someone to do the work to investigate?  In our case, it was a combination of the following tactics, any of which you can use:

  • Someone who’s been around a long time and has a good gut feel
  • Whitespace programs
  • Hackathons
  • “0” point research spikes (it took nearly no effort to do the Splunk query below to find how many files are in use and compare against how many files are on disk)
  • No one knows what half of the files are

In our case, the www directory contained the majority of the files in the artifact. That meant we could look at our IIS access logs to see what was actually in use and trim out the rest.

Safely Analyzing Mountains of Data

When I solve these types of problems, I find I work best with grep, sed, awk, and xargs instead of writing, testing, and debugging a larger script that covers all cases.  It was also key to have Splunk at my side — I needed to reduce 60 days’ worth of access logs (billions of lines and hundreds of gigs of data).

I let Splunk do the first-step heavy lifting. We have a Splunk index for our IIS logs which automatically extracts the fields so I can query them. In the search below, I select that index, use the W3SVC1 logs (the main website), filter for GET (other verbs like OPTIONS were causing false positives), filter for successful HTTP status codes (I especially don’t care about 404s), and then remove any irrelevant paths.  I grouped by cs_uri_stem (after forcing everything to lower-case to prevent dupes based on case differences), which gave me a list of active files and how many hits each file had.

 index=qb-prod-iis (source="D:\\IISLogs\\W3SVC1\\*" AND cs_method="GET" AND sc_status IN (20*, 304) AND NOT cs_uri_stem IN ("/foo/*", "/bar/*")) | eval cs_uri_stem=lower(cs_uri_stem) | stats count by cs_uri_stem | sort count 

I downloaded those results to a CSV file that was all of 320 KB covering 60 days’ worth of logs.  That file had lines that looked like:

"/css/themes/classic/images/icons/icon_eye_gray.png",8918397

I wrote this little utility script to help me use that file to:

  1. Determine which files were actively being accessed
  2. List the files in Git that were not accessed

#!/bin/bash

# In this script we convert everything to lower-case to simplify things.
# We'll later need to convert back to the original case in order to
# remove dead files from Git

# Get latest Splunk export file
CSV=`ls -lt ~/Downloads/15*.csv|head -1 | awk '{print $10}'`

# Our CDN assets are placed in /res by the build system as duplicates
# of files from elsewhere in Git (they are not checked in).  This
# requires us to fold away the CDN path to get at the real file.
# For example:
#   /res/xxx/css/foo.css is the same as /css/foo.css
# Here, get a unique list of all non-CDN assets that were accessed
grep -v /res/ $CSV | awk -F, '{print $1}' | sed 's/"//g'|tr '[:upper:]' '[:lower:]'|sort|uniq > /tmp/files.1

# Now, Get a unique list of all CDN assets that were accessed and
# trim out the /res/xxx path
egrep -E "/res/[a-z,0-9]+-" $CSV | cut -c 19- | awk -F, '{print $1}' | sed 's/"//' | tr '[:upper:]' '[:lower:]' | sort | uniq > /tmp/files.2

# Get the union of the two file sets above.  This is our list of
# active files
cat /tmp/files.1 /tmp/files.2 | sort | uniq > /tmp/files.active

# Get a list of all files in Git
pushd ~/git/QuickBase/www
find . -type f | sed 's/^\.//' | tr '[:upper:]' '[:lower:]' | sort > /tmp/files.all
popd

# Finally, diff the active file list with what's in Git
# Files that are being accessed but not in Git will start with "<"
diff /tmp/files.active /tmp/files.all

This was a large list; out of the ~15,000 files in www checked into Git, only ~1,500 were in use.  The list itself was a cognitive load problem!  I also didn’t want my GitHub PRs to be so large that they didn’t load or were impossible to review.  I categorized the list into about 10 parts and created a JIRA story for each.  Using grep, I could execute on various pieces of the list.  For example, this would give me the files in /i/ that started with the letters a through g.

./dodiff.sh | grep '> /i/' | sed 's#> /i/##' | egrep -e '^[a-g].*' > /tmp/foo

I would use the following to double-check that I’m not removing something important.  It uses the contents of /tmp/foo (one file per line) by transforming it into a single line regular expression.  So if the file contained

one
two
three

The result of the expression inside the backticks (in the code below) would be

one|two|three

egrep -r `cat /tmp/foo | xargs | sed 's/ /|/g'` .

Removing the Files

When I was ready to start removing files, I needed to convert the file case back to what’s on disk, so I used the power of xargs to take the original list and run grep (once per file) to find the original entry in /tmp/files.all:

cat /tmp/foo | xargs -xn1 -I % egrep -Ei "^%$" /tmp/files.all > /tmp/foo2

And now I can use xargs again to automate the git command:

git rm `cat /tmp/foo2 | xargs`

Results

The artifact has gone from 256 MB to 174 MB but more importantly it’s gone from 29,000 files to 12,600.  This means:

  • Downloads are 32% faster (previous baselines vary by location but were anywhere from 20 to 120 seconds)
  • Unzips are 50% faster (baselines also vary but were 25 seconds in prod and are now 12 seconds)
  • The CI build is 1 minute faster (zip/unzip, Git checkout, and Nexus upload speedups)
  • Our local workspaces have gone from ~32,500 files to ~21,000 files
  • The source is back to being browsable in GitHub because it can now do complete directory listings

Impact

There are intangible benefits such as cognitive load, attack surface, reduced complexity, and simplifying future projects where we break apart this old monolith.  But there are hard numbers, too!

For the purposes of this exercise, assume $200k annually for a fully-loaded SWE, which is about $96/hour (200000 / 52 / 40).

Using $96/hour, the average team saves 60 person-minutes per day (6 people * 5 builds per day per team * (1 minute per build + 1 minute for download & install)), which comes to about $2,000/month, or $24k annually.
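Spelled out as arithmetic (same numbers as above):

fully_loaded_cost = 200_000                    # dollars per engineer per year
hourly_rate = fully_loaded_cost / 52 / 40      # ~= $96/hour

people = 6
builds_per_day = 5
minutes_saved_per_build = 1 + 1                # build time + download & install

minutes_per_day = people * builds_per_day * minutes_saved_per_build  # 60
monthly_savings = (minutes_per_day / 60) * hourly_rate * 21          # ~21 working days
print(round(hourly_rate), round(monthly_savings), round(monthly_savings * 12))
# -> roughly 96, 2000, 24000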

How Does This Help You?

Before I answer that, I’ll backfill a little more history.  Cleaning up the www directory has been something we’ve talked about for many years.  And everyone was literally afraid to do it.  I kept saying I’d just do it, or help someone do it.  But it was never a “funded” project.

One of my colleagues and fellow blog authors, Ashish (in concert with others, including our chief architect), decided to track all the things rattling around in our heads as new JIRA stories in a “tech debt” backlog project so we could really start to see what’s out there.  This was one item on that list.  Having it as a real item in JIRA was the first step toward making it possible.

After I’d completed one of my stories, I was able to take that now-tangible tech debt story and break it down into something manageable.  After spending maybe an hour, I was able to quantify the problem (15,000 files, only 1,500 of which were in use) which lent credence to the effort.  As I mentioned, I broke it down into 10 stories, estimated them, and was able to show that this was an achievable goal.

No one, and I mean no one, is unhappy that I spent time doing this.  We often spend a lot of time not even trying to do what’s right because we believe we’re not authorized.  It’s crucial that we, as engineers, work closely with the rest of the business to collaborate, build relationships, and increase trust so that we can have the open conversations that lead to working on stories that are difficult to tie back to customer value.  But remember, there’s a lot of customer value in being efficient!

My advice:

  • Just spend a little time quantifying a problem and its solution so you can have an informed discussion with your team about working on it
  • If you’re passionate about something, use that to your advantage!  Instead of complaining about a problem, use your passion to solve the heck out of it!  People will thank you for it.
  • Break the problem down into manageable units.
    • Get those units into the backlog as stories and talk about them during your grooming sessions.  Make them real.
  • Don’t boil the ocean; keep the problem clear and defined
  • Use the 80/20 rule.  I didn’t clean everything up.  I cleaned up the easy stuff.  And that still has incredible impact.
  • Find allies / supporters on your team and other teams.  They increase your strength.
  • Use the right tools for the job.  It would have been easy for me to build a long, complex script but in the end, it’s not necessary.  It would have taken longer and would have been more error prone.  Tackling the problem in groups helped me stay focused and make progress in chunks.

Deprecating Old Features: Legacy Home Pages

If the Quick Base product were a person, we would already have the ability to vote and buy lottery tickets. Nearing two decades as a trusted solution for business app builders, we’ve learned and improved a lot. But like a young adult heading off to college or the proverbial workforce, we realize there are some things — like that beloved, old blankey or teddy bear — that need to be left behind.

My team, the End User team, tries to put ourselves in the shoes of end users of Quick Base (as opposed to app builders or account admins) in order to solve their problems. An end user is someone who uses a Quick Base app that someone else in their company builds. Our goals are to empower end users to:

  1. Be more efficient and productive while using Quick Base
  2. Learn how to use and navigate Quick Base more easily
  3. Enjoy using Quick Base

In service to these goals, we aim to simplify Quick Base and make it more intuitive to use.

What does this mean in practice? It means making sure that there is only one, easily discoverable way to do something.

We have a lot of functionality and UI that, in their day, may have been cutting edge, but are now woefully outdated. Our customers are endlessly creative in finding workarounds to add functionality that we previously didn’t offer. One of the reasons we keep a lot of those features around is because customers are still using them.  Even once we realize, “Hey, we can do that natively in the product for everyone,” there isn’t always a strong impetus for those with existing solutions to change. So we end up keeping old stuff around to make sure customer functionality remains intact.

Home pages are a perfect example of this. Home pages are the entry point to nearly all of our apps. An app builder can curate the content of these home pages for their end users by presenting customized reports and charts, providing links to other important places in the app, displaying a README with instructions on using the app, and much more. End users then have all the context they need to start being productive right away.

[Screenshot: Quick Base’s new widget-based, drag-and-drop home page builder with built-in preview]

A few years ago, we created a modern, widget-based home page, with a drag-and-drop building process. Builders loved it because they could directly manipulate the elements of the home page while previewing what their users would see. At the time, we decided to keep the older homepage, built with checkboxes and drop-down select menus, so that customers with older home pages could still view and modify them. We’ve seen that our builders, especially builders new to Quick Base, love the new home pages and their drag-and-drop construction. Beyond that, there are costs of supporting old functionality as we try to innovate. We decided to deprecate our old home pages for three reasons:

  1. As we started updating some of our report types (most recently Calendar Reports) and adding new ones (Kanban), the tax of ensuring these new reports play well with the rest of the UI on the old home pages was becoming too much.
  2. Removing old home pages allows us to simplify the product by providing fewer and more intuitive ways to do the same thing.
  3. Several features used in old home pages expose our users to code that may no longer be supported, which opens both our customers and Quick Base to potential security risks.

When we saw an opportunity to deprecate old home pages, while maintaining customer functionality, we took the opening.

Quick Base makes prolific use of feature switches (aka feature flags or feature toggles) — customer-specific server-side and client-side checks that enable or disable a given feature. They allow us to release functionality in early access, receive feedback on usability, and make corrections before releasing to all customers. They also allow us to easily deprecate old functionality by simply turning a given switch on for everyone before we clean up and remove the associated code.
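In spirit, a switch check is nothing more than the sketch below. Our real switches live in the C++ server and the JavaScript client, so the names and shape here are made up for illustration.

# Hypothetical switch registry: a switch is either on for everyone or on for
# an explicit set of customers (e.g. early-access participants).
FEATURE_SWITCHES = {
    "new_home_pages": {"enabled_for_all": False, "enabled_customers": {"acme", "globex"}},
}

def is_enabled(switch_name, customer_id):
    switch = FEATURE_SWITCHES.get(switch_name)
    if switch is None:
        return False
    return switch["enabled_for_all"] or customer_id in switch["enabled_customers"]

# Deprecation is the same mechanism run backwards: flip enabled_for_all to True
# for everyone, then delete the old code path and the switch itself.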

[Screenshot: Quick Base’s old checkbox and drop-down menu home page builder, with no built-in preview]

Removing old home pages is more than a change to the presentation layer; it is potentially a change to user data, because the old pages are persisted in a different format than the new ones. That precludes simply flipping a feature switch. We therefore created a process that allows app builders to convert their old home pages into new ones and switch over when ready. Old home pages follow a common pattern in our legacy C++ server code: they are HTML templates generated and filled on the server side. The new widget-based home pages use Backbone and serialize to and from JSON to communicate with the server and persist the home page object/data. It was simple enough to write a JSON serialization function for the old home page objects and use it to populate a new home page.

The tricky part was how to replicate a certain piece of functionality of the old home pages. Previously, we allowed users to enter arbitrary HTML into a rich text element, and we would dutifully render it, executing scripts and all. This is incredibly important in allowing customers to customize their home pages, but also presents Cross Site Scripting security vulnerabilities, and therefore is not supported on new home pages.

To solve this problem, as part of the old home page conversion, we take every rich text section on the old home page, sanitize it to remove script tags, inline event handlers, and the like, and place it in a rich text widget on the new home page. If the old section contained only HTML, with no added functionality, the new one will work great. In case important business logic was happening on the old home page, we also create a raw HTML page with the full contents of the section. The customer can then take this page and link to it within an iframe widget on the new home page, or use it to start a Code Page, a blank slate where advanced builders can build their own HTML, CSS, and JS based home pages. The last step customers have to take is to swap over the role-based settings of the old home pages. For example, if the default home page for the sales rep role was an old home page, they just need to point that role at the new one.
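As a rough illustration of the scrubbing involved, the sketch below drops script elements, inline on* event handlers, and javascript: URLs while leaving the rest of the markup alone. Our conversion actually runs in the C++ server, and a production sanitizer should rely on a well-vetted library rather than a toy parser like this one.

from html import escape
from html.parser import HTMLParser

class RichTextSanitizer(HTMLParser):
    """Re-emits markup while dropping script blocks, on* handlers, and javascript: URLs."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skipping = 0                      # depth inside a blocked <script> element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._skipping += 1
            return
        if self._skipping:
            return
        safe_attrs = []
        for name, value in attrs:
            if name.lower().startswith("on"):                             # inline event handler
                continue
            if value and value.strip().lower().startswith("javascript:"):
                continue
            safe_attrs.append((name, value))
        rendered = "".join(
            ' {}="{}"'.format(n, escape(v, quote=True)) if v is not None else " " + n
            for n, v in safe_attrs
        )
        self.parts.append("<{}{}>".format(tag, rendered))

    def handle_endtag(self, tag):
        if tag == "script":
            self._skipping = max(0, self._skipping - 1)
        elif not self._skipping:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self._skipping:
            self.parts.append(escape(data, quote=False))

def sanitize_rich_text(markup):
    parser = RichTextSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)

For example, sanitize_rich_text('<div onclick="steal()">Hi<script>alert(1)</script></div>') comes back as '<div>Hi</div>'.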

Software companies are continually weighing the costs and benefits of various types of projects. Tech debt work, such as deprecating old features, is often easy to deprioritize in favor of new feature implementation. In our case, it was easy to justify the work required to make this change. We wanted to avoid future roadblocks and to make the product safer, simpler, and more intuitive while maintaining customer functionality. Working with a legacy platform certainly has its challenges, but like a teenager coming of age, it’s our approach to these challenges that allows us to adapt, grow, and become successful. As we continue to learn, simplify, and improve our product and process, we look forward to problems and to solving them in efficient, creative ways.

My First Steps as SRE at Quick Base

When I first joined our Platform Infrastructure team back in August 2017, I was worried about getting lost in the mix of server management and requests coming in from the Engineering team. This was around the time our re-architecture project was cancelled and we were going back to doing things the old way (at least that’s what I had in the back of my mind).  But then we became the Site Reliability Engineering (SRE) team.

To give you my background, I am a Staff Software Engineer on the team and have been working with Quick Base for 7+ years.  I have been part of different projects across the different frameworks and stacks we use at Quick Base.

Our first task was to migrate our product code base from Subversion to Git.  Here is my experience with my first SRE project and what I learned along the way.

Migration to Git and GitHub

The Quick Base repository is 18 years old and consists of more than 40,000 files with a history of more than 76,000 commits.  We had been using Subversion on a server we operated in our Waltham, MA office since 2005.

Why GitHub?

Consistency

During our re-architecture days, we started using git. We had a few projects deployed in production that are source-controlled with git, so we wanted to be consistent across the board and have all of our projects under one source control system.

Integrated Code reviews

With Subversion, whether self-hosted or with another provider, code reviews were hard. You could create a patch file and upload it to a tool that would create a review for you, but you had to switch context to see how functions were being accessed or to look at the code surrounding the change. With GitHub pull requests you can do all of that and compare your changes with any other branch (bonus).

Whole repository local

With git, you have the whole repository available locally, which means you don’t need to be connected to a central server just to look at the logs.

Prepare for Open Source projects

As we grow as a tech company, we want to contribute to the community, and we have made plans to open source some of the tools we use at Quick Base. GitHub is a great place to host such open source projects.

Add Ons / Extensibility

With the GitHub APIs, it is easy to add extensions and webhooks. We have a Jenkins plugin that posts build statuses to pull requests to make sure no breaking changes sneak in.

Usage

Focus on the re-architecture project was high over the past 3+ years, so most of our development team was already working with git and GitHub.

Challenges/ Concerns

When we decided to move off of Subversion, we had to convince people of the benefits and address their concerns before we started the migration.

Losing history

This was a no-brainer: everyone wanted to make sure we maintained the history of our 18-year-old product and its more than 76,000 commits.

Handling OEM modules (external modules)

We had a few OEM modules in our project, integrated into our Visual Studio solution. They had not been updated for a very long time, and some were no longer even in use, but we wanted to make sure that all the external modules still worked after we migrated.

Handling Quick Base release branches

We release our code once a month, and we mostly work on the trunk branch. In the last week before a release we create a release branch, which is long-lived (forever), and we work on it until it is ready to be deployed to production. Any change we make to the release branch needs to be added to TRUNK as well (a manual process that is highly error-prone). Also, for every patch we do in production, we create a new branch off of the release branch, which is also long-lived. Any change that goes to this branch needs to go to TRUNK as well, but not into the previously released code.

The build

Since most of our deployments are done by running scripts, we discovered that we had a lot of places depending on the SVN revision number, which was incremented by 1 on every successful commit. Our artifact version was also the SVN revision number, so we relied on being able to sort by revision number. We started using the short git hash (“GIT_HASH=%GIT_COMMIT:~0,7%” in Windows Batch) for artifact versions, but since a git hash is an alphanumeric string, sorting it doesn’t give chronological order. So we decided to use the build number of the Jenkins job that generates the artifact as the artifact version. That solved a lot of our problems, since it is incremented by 1 for each build.
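A quick illustration of why hash-based versions can’t be sorted chronologically while build numbers can (the hashes are made up):

# Short git hashes are effectively random, so lexicographic order says nothing
# about commit order; a monotonically increasing build number does.
hashes_in_commit_order = ["9f2c1ab", "03be77d", "c41d9e0"]
print(sorted(hashes_in_commit_order))          # ['03be77d', '9f2c1ab', 'c41d9e0'] -- order lost

build_numbers_in_commit_order = [101, 102, 103]
print(sorted(build_numbers_in_commit_order))   # [101, 102, 103] -- order preserved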

Fear of change

This was the biggest hurdle we had in mind when we started thinking of migrating our repository to Git. People were worried that the change would break scripts that were hard-coded to use the repository in a certain way.

Today (6 months later)

After working with Git for almost 6 months and about 6 releases, here is our view:

Every change is a branch

Every change to the source code is made on a separate branch and merged only after it is reviewed and approved. This ensures that any code that lands on the develop branch has actually been reviewed, keeping develop clean.

Try build results

Since all of our changes are on separate branches, we run try builds against each branch to make sure the change doesn’t break the develop branch. You can read more about try builds here.

Pull requests improve code reviews

Pull requests have been one of the best tools to share knowledge and provide constructive feedback to new developers and get valuable feedback from peers.

Invisible Outages

The Cloud can sound so fancy and wonderful especially if you haven’t worked with it much.  It may even sound like a Unicorn.  You may question whether auto-scaling and auto-healing really works.  I’ve been there.  And now I reside in the promised land.  Here, I share a story with you to make it more tangible and real for those of you who haven’t done this yet.

About 18 months ago, Quick Base launched a Webhooks feature.  At that time, we had no cloud infrastructure to speak of.  Everything in production was running as a monolithic platform in a set of dedicated hosting facilities.  This feature gave us a clear opportunity to build it in a way that utilized “The Cloud.”  At that time, we’d learned enough about doing things in AWS to know we wanted to use a “bakery” model (where everything needed to run the application is embedded in the AMI) instead of a “frying” model (where the code and its dependencies are pulled down into a generic AMI at boot-time).  We’d seen that the frying model relied too heavily on external services during the boot phase and thus was unreliable and slow.

Combining the power of Jenkins, Chef, Packer, Nexus, and various AWS services, we put together our first fully-automated build-test-deploy pipeline.

[Diagram: the bakery build pipeline, orchestrated by Jenkins]

The diagram above is a simplified version of our bakery, as orchestrated by Jenkins.  Gradle is responsible for building, testing, and publishing the artifact (to Nexus) that comprises the service.  Packer, Chef, and AWS are combined to place that artifact into an OS image (AMI) that will boot that version of the service when launched.  That enabled us to deploy immutable infrastructure that was built entirely from code — 100% automated.  Servers are treated as cattle, not pets.  This buys us:

  • Traceability: since all changes must be done as code, we know who made them, when they were made, who reviewed them, that they were tested, and when the change was deployed (huge benefits to root cause analysis)
  • Predictability: the server always works as expected no matter what environment it’s in.  We no longer worry about cruft or manual, undocumented changes
  • Reliability: recovering from failures isn’t just easy, it’s automatic
  • Scalability: simply increase the server count or size.  No one needs to go build a server

Several months after the launch, the Webhooks servers in AWS began experiencing very high load and we weren’t sure why – there was nothing obvious like a spike in traffic causing it.  This high load caused servers to get slower over time (the kind of behavior usually attributed to a memory leak, fragmentation, or garbage collection issues).  Under normal circumstances, the servers would become too slow to process Webhooks for customers.

This is where the real win happened: when a server got too slow, its health checks began failing, which caused it to be replaced with a new one.  This happened hundreds of times over the course of several days – with zero customer-facing impact.  If this had been deployed in the traditional manner, we would have had numerous outages, late nights, and tired Operations and Development staff.  This was our first personal proof that “The Cloud” allows us to architect our services in a way that makes them more resilient and self-healing.

When the code was repaired, it was automatically built, tested, and deployed with zero human intervention.  This is known as Continuous Delivery (CD).  It was just a few minutes between code change and production.  We were able to solve the problem without being under searing pressure (which causes mistakes) and without any pomp and circumstance.

The nerdy part of me was thrilled to see this in action and the not-so-nerdy part of me was thrilled that we literally suffered an event that was invisible to our customers.

2 Minute Builds

Background

The core of Quick Base is a large Microsoft Visual C++ project (technically it’s a solution with multiple projects).  Our build/deploy/test/upload-artifact cycle was 90 minutes.  It was automated.  Not bad for a 20-year-old code base, right?  Nah, we can do better.  We can do it in 2 minutes!

At least, that’s how we pitched it.  You can imagine the response.  Besides the obvious “that’s impossible!” sentiments, we were asked “Why?  It takes us longer than that to fully QA the resulting artifact, so what’s the point?”  And so, our journey began.

If there’s one thing I’ve learned over the years, it’s that the hard part isn’t the technology, it’s the human equation.  Here, we’d lived with a very slow build for years (because the belief was that we’d deprecate the old stack in favor of the re-architecture that was in progress).  Once we’d re-focused our efforts to iterating on the existing stack, we knew things had to change.  We were operating using Agile methodologies (both scrum and Kanban are in use) but the tools weren’t properly supporting us.  A few engineers close to the build knew there was low-hanging fruit; what better way to demonstrate “yes, we can!” and gather excitement than to make significant progress with relatively little effort.

Organizationally, we were now better-suited to support these kinds of improvements.  We have a Site Reliability Engineering team that consists of both Ops and Dev.  Together, we started to break down the problem.  We deconstructed the long Jenkins job into this diagram:

[Diagram: the original 90-minute Jenkins job broken down into its stages and timings]

Now we knew where to focus for the biggest gains.

Our First Big Win

The “Tools Nexus Deploy” step was literally just Maven uploading a 250-MB zip file from servers in our Cambridge, MA office to our Nexus server in AWS (Oregon).  It definitely shouldn’t take that long to upload; we have a very fat Internet pipe in the office.  We did packet traces using Wireshark and other network tests to try to determine the cause.  We didn’t uncover anything.

So, let’s break down the problem and isolate the issue.  Is the network in the office OK?  AWS?  Is the Nexus server slow?  Here’s some of what we did:

  • Download data directly from Nexus using wget (remove Maven from the equation)
  • Upload directly to Nexus using wget (ditto)
  • Do the above from the office servers (is it the server network?)
  • Do the above from office workstations (is it the entire Cambridge network?)
  • Do the above from EC2 instances in AWS (Oregon) (remove Cambridge from the equation)
  • Try a (much) newer version of Windows that hasn’t been hardened (maybe issues with TCP windowing and other high-latency improvements)
  • Do the above from linux instead of Windows (remove Windows from the equation)

When we switched from Windows to linux, we stood back in disbelief.  The upload was now taking 90 seconds instead of 22 minutes.  We found that Maven on Windows has extremely poor network performance.  We temporarily switched to Maven on linux by splitting up the build job to have a separate upload job that was tied to the Jenkins master node (running linux).

Our Second Big Win

The next thing we tackled was the “PD CI-Test” group.  These are just TestNG Java tests that hit the Quick Base API to do some automated testing.  We found one simple area to improve: add test data using bulk import instead of per-record inserts.  Since this was in setup code, the several-second difference added up to … drum roll … 18 minutes!
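The change amounts to replacing a per-record loop in the test setup with a single bulk call. In pseudo-Python (our tests are actually TestNG Java, and the helper names below are hypothetical):

# Before: one API round trip per record during test setup.
def seed_records_slow(api, table_id, records):
    for record in records:
        api.add_record(table_id, record)           # hypothetical helper; 1 call per record

# After: a single bulk-import call for the whole data set.
def seed_records_fast(api, table_id, records):
    csv_rows = "\n".join(",".join(str(v) for v in r.values()) for r in records)
    api.import_from_csv(table_id, csv_rows)        # hypothetical helper; 1 call total

A few seconds saved per setup, multiplied across the whole suite, is where the 18 minutes came from.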

Number Three

There was still lots of room for improvement in the “PD CI-Test” group, so we found one other quick win.  After we’d encountered the Maven slowness, we started to question the speed of Ant on Windows.  The server was only running at 20% CPU when the tests were running, so we suspected something wasn’t going as fast as it could.  Switching our tests to be called via Gradle instead of Ant saved us another 12 minutes!

Assessing Where We Are Now

In 2 months, our diagram looked like this:

[Diagram: the same build stages two months later, after the three wins above]

You can bet that was exciting!  Now we have the momentum and people believe it can be done.

We’ve continued to make further improvements, such as moving from the aging hardware in the Cambridge server room to AWS using the Jenkins EC2 plugin, and then taking advantage of the C5 instance types (which boot our Windows AMI in 4 minutes instead of 10) by building our own version of the plugin and submitting a PR for it here.  Build times are currently averaging 26 minutes, and we’ve got items on the roadmap (including moving to Jenkins Pipeline so we can easily take advantage of parallelism) that should get us closer to 15.  After that, we run into limitations of the MSVC++ linker, which does a few things single-threaded; one of our projects is quite large and produces a single binary.  The next step there is breaking that project up (e.g., into libraries).  That will take more effort, so we’ve left it for last.

Will we ever get to 2 minutes?  Who’s to say?  The purpose of setting the goal that low was to fire up people’s imaginations.  And it has.

Once Upon a Time …

Quick Base is the platform that businesses use to quickly turn ideas about better ways to work into apps that make them more efficient, informed, and productive.  It has been around for nearly 20 years.  It’s a successful SaaS offering serving billions of requests per month.  It’s primarily written in MSVC++ running on Windows.  If you’ve been in the software industry long enough, you can imagine some of the tech debt acquired over its lifetime.  It makes very efficient use of server hardware but it’s grown past the point where it needs to let go of the old ways of doing things (which were appropriate “back then”).  Namely, there are monoliths to break down, automated test harnesses to build, code to rewrite to be testable, and build systems to re-think.

This story begins about 5 years ago when we started having the re-architecture discussions that most software companies do as they start having “success-based problems.”  At that point, Quick Base was essentially 100% C++ on Windows with an ever-increasing success rate with companies that wanted to store more data in their apps, have more users accessing their apps, and create more complicated apps than we could ever imagine.  Internally we constantly refer to the performance characteristics of an application ecosystem as the combination of those 3 things: size, concurrency/activity, and complexity.  That means there’s no single lever we can pull to increase how apps scale and succeed on our platform.

As a way to better meet these challenges, we became hyper-focused on solving for developer productivity, which included taking a look at what languages the talent pool was most familiar with, languages that had strong testability characteristics and support, as well as what languages would support an evolution – we wanted both code bases (C++ and the candidate) to, e.g., share a connection with the SQL server without having to manage what flags mean what in two places and risk getting that wrong.  C#/.NET was an obvious choice, and became the winner … at least for a short while.  We did build some stuff in .NET (and continue to do so today; you’ll read more of that in later posts), but this approach didn’t last long.

The urge to consolidate technologies in order to gain better economies of scale (software contracts, support, staffing, you name it) was overwhelming, and it ultimately sent us down the wrong path.  We started building on technologies that had integration challenges with the existing platform, and we couldn’t take advantage of our existing SDLC (think: build/test/deploy as well as the IDE).

And then, we fell into the trap that many software companies do as our approach evolved into a complete re-architecture.  We believed the only viable way to go from old to new was to start over and migrate our customers.  We believed that incrementally breaking down the monolith was not possible.  So, we spawned a small scrum team to do a PoC, which turned into 2 teams, then 3, and then a business decision to put most of our engineering effort into the re-architecture in order to focus and just get it done and behind us.

All along the way there was that little voice inside that kept telling us this was wrong.  It occasionally came out during moments of frustration, or over lunch, or over a drink down the street.  But we succumbed to the inertia of the high-speed train.  We further exacerbated the issue by materially changing execution strategies at least 3-4 times because we discovered how difficult it was to recreate even an MVP of Quick Base.  After 4 years of producing something that ultimately didn’t deliver value to our customers, we drew up the courage to have a heart-to-heart with ourselves and canceled the project.  Why?  We obviously weren’t delivering value yet, and (once we were honest with ourselves) we knew we wouldn’t for a while – too long.  It’s excruciatingly hard to abandon something you’ve poured years of your heart into and feels “so close to shipping” (but in reality, it’s not).  It feels like you’re abandoning a child.  We found that belligerently asking “will this deliver customer value (in a timely fashion)?” gave us the strength and clarity to make the hard decision.

Did we all come to work one day and just stop working on the re-architecture and begin working on the existing platform?  For many reasons, no.  We needed to go back to the drawing board with our roadmap, and we had to somehow shift our development organization from 10% C++ / 90% Java/NodeJs to 75% C++ / 25% Java/NodeJs.  That’s right, we are continuing to work with the newer technologies.  We didn’t throw everything away.  In fact, we actually kept a lot of it.  As it turns out, we discovered through our journey that the fastest and most sustainable way to deliver more value to our customers was to iterate on the technology we have and tactically augment it with the new services and paradigms we’d originally built to serve the re-architecture.

Just like mistakes are our biggest teachers, so was the re-architecture.  We didn’t completely waste 4 years of our lives and money.  In fact, we know more about our customers, ourselves, the market, the technology, our own “secret sauce,” how to build & test & deploy software, and much more.  We have a new strategy which allows us to deliver value to our customers on an ongoing basis (starting right yesterday) while making meaningful progress on our software architecture as well as our build/test/deploy systems.  For me personally, I learned at the highest rate I’ve ever learned during the last several years.  And that learning (and some of the systems that were built during the re-architecture) are serving our new approach well.  Many of the upcoming posts will discuss things we built and learned over the last 4 years.