Invisible Outages

The Cloud can sound so fancy and wonderful, especially if you haven’t worked with it much.  It may even sound like a unicorn.  You may question whether auto-scaling and auto-healing really work.  I’ve been there, and now I reside in the promised land.  Here is a story to make it more tangible and real for those of you who haven’t done this yet.

About 18 months ago, Quick Base launched a Webhooks feature.  At that time, we had no cloud infrastructure to speak of.  Everything in production was running as a monolithic platform in a set of dedicated hosting facilities.  The new feature gave us a clear opportunity to build something in a way that utilized “The Cloud.”  By then, we’d learned enough about doing things in AWS to know we wanted to use a “bakery” model (where everything needed to run the application is embedded in the AMI) instead of a “frying” model (where the code and its dependencies are pulled down into a generic AMI at boot-time).  We’d seen that the frying model relied too heavily on external services during the boot phase and thus was unreliable and slow.

Combining the power of Jenkins, Chef, Packer, Nexus, and various AWS services, we put together our first fully-automated build-test-deploy pipeline.

[Diagram: simplified view of our bakery pipeline, as orchestrated by Jenkins]

The diagram above is a simplified version of our bakery, as orchestrated by Jenkins.  Gradle is responsible for building, testing, and publishing the artifact (to Nexus) that comprises the service.  Packer, Chef, and AWS are combined to place that artifact into an OS image (AMI) that will boot that version of the service when launched.  That enabled us to deploy immutable infrastructure that was built entirely from code — 100% automated.  Servers are treated as cattle, not pets.  This buys us:

  • Traceability: since all changes must be done as code, we know who made them, when they were made, who reviewed them, that they were tested, and when the change was deployed (huge benefits to root cause analysis)
  • Predictability: the server always works as expected no matter what environment it’s in.  We no longer worry about cruft or manual, undocumented changes
  • Reliability: recovering from failures isn’t just easy, it’s automatic
  • Scalability: simply increase the server count or size.  No one needs to go build a server
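The flow through the bakery can be sketched as a fail-fast sequence of stages.  This is a minimal illustration in Python, not our actual Jenkins configuration: the stage names and the `run_pipeline` and `example_stages` helpers are hypothetical, and each stage is a stub standing in for the real tool (Gradle, Nexus, Packer with Chef, AWS).

```python
from typing import Callable, List, Tuple

# Hypothetical stage stubs. In a real pipeline each one would invoke a tool
# (Gradle, Nexus, Packer + Chef, AWS); here they only return True or False.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, action in stages:
        if not action():
            break  # fail fast: nothing after a failed stage runs
        completed.append(name)
    return completed


def example_stages(tests_pass: bool) -> List[Stage]:
    """Build an illustrative stage list; only the test stage can fail here."""
    return [
        ("build", lambda: True),
        ("test", lambda: tests_pass),
        ("publish-artifact", lambda: True),
        ("bake-ami", lambda: True),
        ("deploy", lambda: True),
    ]


if __name__ == "__main__":
    print(run_pipeline(example_stages(tests_pass=True)))
    print(run_pipeline(example_stages(tests_pass=False)))
```

The point of the ordering is that an image is only baked from an artifact that has already passed its tests, so a failing change never reaches the deploy stage.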

Several months after the launch, the Webhooks servers in AWS began experiencing very high load and we weren’t sure why – there was nothing obvious like a spike in traffic causing it.  This high load caused servers to get slower over time (the kind of behavior usually attributed to a memory leak, fragmentation, or garbage collection issues).  Left alone, the servers would eventually have become too slow to process Webhooks for customers.

This is where the real win happened: when a server got too slow, its health checks began failing, which caused it to be replaced with a new one.  This happened hundreds of times over the course of several days – with zero customer-facing impact.  If this had been deployed in the traditional manner, we would have had numerous outages, late nights, and tired Operations and Development staff.  This was our first personal proof that “The Cloud” allows us to architect our services in a way that makes them more resilient and self-healing.
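The replacement loop described above can be sketched as a small simulation.  This is an illustrative toy model in Python, not our production setup or any AWS API: the `Server` class, the `simulate` function, the latency numbers, and the health threshold are all assumptions chosen to show the mechanism.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Tuple


@dataclass
class Server:
    server_id: int
    latency_ms: float = 50.0  # assumed response time of a freshly booted server

    def tick(self, degradation_ms: float) -> None:
        """Simulate the slow-down: every tick the server gets a bit slower."""
        self.latency_ms += degradation_ms

    def is_healthy(self, max_latency_ms: float) -> bool:
        """Health check: pass while latency stays at or under the threshold."""
        return self.latency_ms <= max_latency_ms


def simulate(ticks: int, fleet_size: int = 3, degradation_ms: float = 25.0,
             max_latency_ms: float = 200.0) -> Tuple[List[Server], int]:
    """Degrade every server each tick and replace any that fails its check.

    Returns the final fleet and the number of replacements performed.
    """
    ids = count(1)
    fleet = [Server(next(ids)) for _ in range(fleet_size)]
    replacements = 0
    for _ in range(ticks):
        for i, server in enumerate(fleet):
            server.tick(degradation_ms)
            if not server.is_healthy(max_latency_ms):
                # Cattle, not pets: discard the slow server, boot a fresh one.
                fleet[i] = Server(next(ids))
                replacements += 1
    return fleet, replacements


if __name__ == "__main__":
    fleet, replacements = simulate(ticks=70)
    print(f"{replacements} replacements, fleet size still {len(fleet)}")
```

In this model the bug is never fixed, yet the fleet size never drops and no server stays above the threshold for more than one check.  That is the same property that kept our customers from noticing anything while we tracked down the real cause.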

When the code was repaired, it was automatically built, tested, and deployed with zero human intervention.  This is known as Continuous Delivery (CD).  Only a few minutes passed between the code change and the fix running in production.  We were able to solve the problem without being under searing pressure (which causes mistakes) and without any pomp and circumstance.

The nerdy part of me was thrilled to see this in action and the not-so-nerdy part of me was thrilled that we literally suffered an event that was invisible to our customers.
