Have you ever built a house using Lego blocks of different sizes and colors? Have you started with the foundation only to rip it out and start again because it was too wobbly as the house went up? That’s what happened when Quick Base’s Platform Governance team started building an Application Audit capability over the past few months.
Read on to uncover what the challenges were and what we learned along the way…
Our work started with putting the main blocks together. Our technical decisions needed to meet the following parameters and architectural guidelines:
- We wanted to process and store audit data in AWS. We preferred AWS-managed service over AWS-hosted service or homegrown service.
- Any service that audit data containing user data passes through must be HIPAA compliant
- The cost of the solution must be known and acceptable
- The infrastructure needs to handle the following loads:
- Per day: ~190 million events -> 96 GB
- Per year: ~69 billion events -> 35 TB
First, we needed to capture the events related to user requests. The majority of our user requests get processed by a C++ application hosted on Windows machines in a data center.
We piggybacked on an already existing logging infrastructure. The existing mechanism logged events on disk using an asynchronous logger process. We enhanced the logger service to also capture auditable events and pipe them to an event log file.
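The production logger is a C++ service, but the core of the enhancement can be sketched in a few lines of Python. The field names and file path below are hypothetical, not our actual schema:

```python
import json
import time

def write_audit_event(log_path, event_type, user_id, payload):
    """Append one auditable event as a single JSON line to the event log file.

    One-event-per-line keeps the file easy for a log forwarder to tail.
    """
    event = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "type": event_type,   # e.g. "user.login" -- hypothetical event name
        "userId": user_id,
        "payload": payload,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
```

The asynchronous logger already owned the "write to disk" responsibility, so adding one more sink of this shape was a small change.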
Lesson #1: Estimate the load as early as possible as it needs to be factored in and tested when evaluating options and determining cost
Lesson #2: Re-use existing mechanisms as much as you can
We investigated different solutions for log forwarding and aggregation. We looked at the following options:
- Apache Flume: Eliminated as it requires significant effort to host
- Kinesis Agent: Eliminated as it doesn’t run on Windows
- nsqio: Eliminated as it requires significant developer overhead
- Native Kafka: Eliminated as it requires significant effort to host
- Logstash: Eliminated as its strength lies in collecting from a wide variety of data sources, which differs from our use case
We decided to go with FluentD because it is:
- Easy to install and configure
- Able to handle the loads we are expecting. We tested FluentD with the estimated loads.
- Lightweight. It consumes less than 1% of CPU and 40MB of RAM when processing 1000 audit events per second.
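As a rough illustration, the FluentD setup amounts to tailing the event log file and forwarding records to a stream. The paths, tags, and stream name below are hypothetical, and the output stage assumes a Kinesis output plugin such as fluent-plugin-kinesis:

```
<source>
  @type tail
  path C:/QuickBase/logs/audit_events.log
  pos_file C:/QuickBase/logs/audit_events.pos
  tag audit.events
  <parse>
    @type json
  </parse>
</source>

<match audit.**>
  @type kinesis_streams
  stream_name audit-events
  region us-east-1
</match>
```

Because the logger writes one JSON object per line, the tail source needs no custom parsing logic.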
The next decision we needed to make was where to persist the audit events and, as a corollary, how to query them. Per the architectural guidelines above, we knew we wanted to persist them on AWS. The storage options we explored were:
- Glacier: Eliminated because it didn’t fit our model of frequent access of data.
- DynamoDB: Eliminated due to its item size limit of 400 KB.
- Redshift: Eliminated due to its column size limit of 64 KB.
- Aurora RDS
- S3 / Athena
Two contenders remained:
- Aurora RDS: Store audit data in a relational database and use SQL to query it.
- S3 / Athena: Store audit data in S3 buckets and use Athena to query it.
Both solutions were comparable based on our criteria. In the end we went with S3 / Athena because it aligns with the modern best practice of separating storage from the query engine. This allows us to swap out either one later if necessary.
We went with S3 / Athena knowing that Athena was not yet HIPAA compliant. We relied on assurances from our AWS rep that Athena would be HIPAA certified before our launch. The certification landed a quarter later than we were originally told. Luckily, our timeline was not impacted, but we cut it really close.
We also added Kinesis streams to carry audit events from FluentD to S3, with Firehose as our buffering mechanism to batch the events before writing them to S3. Our audit events data lake was taking shape.
The first floor gets stood up
The first type of audit events we delivered were user actions with no user data. Examples of user actions include:
- User login / logout
- User creates a Quick Base application
- User creates a table in an application
We started with these events to avoid worrying about HIPAA compliant services and complex audit event payloads at the outset, while still delivering significant customer value.
We decided to partition the audit events in S3 hourly. This meant that for each hour during which events occurred there would be a corresponding folder in S3.
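The hourly layout boils down to a small key-building helper. The prefix layout here is illustrative rather than our exact scheme; it uses Hive-style `key=value` prefixes, which Athena can register as partitions:

```python
from datetime import datetime, timezone

def hourly_partition_key(event_time: datetime, file_name: str) -> str:
    """Build an S3 object key under an hourly 'folder' (key prefix)."""
    t = event_time.astimezone(timezone.utc)
    return (
        f"audit-events/"
        f"year={t.year:04d}/month={t.month:02d}/day={t.day:02d}/hour={t.hour:02d}/"
        f"{file_name}"
    )
```

Partitioning this way lets Athena prune whole hours from a scan when a query filters on the partition columns, instead of reading the entire data set.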
Strengthen the foundation while building more floors
After we delivered the first slice of customer value, we focused on improving the performance of querying audit events.
We knew Athena queries were slow, and we researched options to improve their performance. The one challenge we faced was that Athena query execution is a black box. Unlike traditional RDBMS systems, there is no peeking into the execution plan of a query in order to optimize it.
We looked at using different file formats and compression schemes, based on AWS recommendations for how to improve Athena performance:
- JSON vs Parquet vs ORC: Since we query almost all the columns of the event data, we didn’t see significant performance improvements by changing the file format.
- bzip2 vs gzip: We decided to go with bzip2 as it was slightly faster than gzip for our queries. Unlike gzip, bzip2 is also splittable, so Athena can read a single file in parallel.
We bucketed audit events per customer and per hour as another attempt to improve Athena performance. This meant getting rid of Firehose, as Firehose doesn’t support custom partitioning. On top of that, Firehose was still not HIPAA compliant, despite early indications from our AWS rep that it would be. We replaced Firehose with a homegrown Lambda function. This resulted in a multitude of small files.
We then decided to further aggregate the audit events, combining the small files from above into one file per customer per hour. This was accomplished by an EMR job triggered every hour by a Lambda. We went with an EMR-based solution rather than building our own aggregation script because we needed a highly performant, distributed ETL mechanism. We could have gone with Glue, which would have been the slightly more straightforward option, but decided against it for two reasons:
- When testing with Glue, we ran into scaling issues.
- At the time of this writing, Glue is still not HIPAA compliant.
After doing this work, we found that creating sub-folders per customer per hour caused additional performance problems in Athena. So we further aggregated the audit events into one file per customer per day, at the end of each day, again using an EMR ETL job.
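The merge itself is conceptually simple. This plain-Python sketch shows the planning step of collapsing many hourly objects into one object per customer per day; the production version runs as a distributed EMR job, and the key layout is illustrative:

```python
from collections import defaultdict

def plan_daily_merge(object_keys):
    """Group small hourly S3 object keys into one output key per customer per day.

    Expects keys like 'audit-events/customer=c1/hour=2019-03-05T14/part-0.json'
    and returns {daily_output_key: [input_keys, ...]}.
    """
    merged = defaultdict(list)
    for key in object_keys:
        # Pull 'customer=...' and 'hour=...' segments out of the key path.
        parts = dict(p.split("=", 1) for p in key.split("/") if "=" in p)
        day = parts["hour"][:10]  # "2019-03-05"
        out = f"audit-events/customer={parts['customer']}/day={day}/events.json"
        merged[out].append(key)
    return merged
```

The ETL job then reads each input group, concatenates the events, and writes the single daily object, leaving Athena far fewer files to open per query.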
Lesson #3: A lot of small files cause a series of performance problems in Athena and EMR.
Lesson #4: It is risky to use new-ish AWS services because they come with limitations and performance profiles that are not well understood.
Enable others to maintain and extend
The team is now focusing on enabling other teams to add audit events using the Audit infrastructure. Everything is in place to make this effort as simple as identifying what to capture, including the event payload, and ensuring the load is well understood.
Overall, we learned that it makes business sense to work on the infrastructure while also delivering customer value on top of the infrastructure – as long as it is deliberate. This eliminates long research projects with no customer delivery.