Uptime is an obsession of ours. Thanks to our use of modern distributed computing techniques such as microservices and immutable containerization, customers of our Inbound Filtering service had never experienced any significant downtime in email delivery. This week, we had our first significant downtime, caused by a programming error that led to a catastrophic slowdown in a critical system component.
On September 15th, 2020, at 6:02 AM, MailChannels staff were alerted about a sudden increase in the load on a critical database service within our Inbound Filtering service, which provides spam filtering for a large number of domains. The increased load on the database service led to a substantial slowdown in email processing that lasted for several hours. In this blog post, we discuss why this situation occurred and what we did to fix it. This post is quite technical.
Figure 1: Just after 6:00 AM Pacific Time on September 15th, the load on a critical database service increased to 100%, indicating a near total failure of the database to respond to queries.
We continually monitor hundreds of performance parameters to assess the health of our Inbound Email Filtering service. Email processing latency - the amount of time it takes for us to process a message through the entire infrastructure - tells us whether the system is handling email quickly enough. A sustained increase in this parameter indicates a severe problem.
On Tuesday morning, when latency rose to more than 60 seconds, we knew there was a serious problem, since many email clients will disconnect rather than wait longer than this. Our team received an alert from PagerDuty at 6:02 AM and began working on the issue within one minute. Our #ops channel in Slack became a hive of communication. We also started a Zoom call, allowing team members to talk in real time and share screens - something that we have found can be very helpful during an emergency. By 6:15 AM, three-quarters of the global operations and R&D staff who work on the Inbound Filtering service were in the channel.
Initial diagnosis of the cause: a slow lookup
Within a few minutes, the team discovered that a critical component was malfunctioning, leading to message processing delays. The so-called “Relay Domains Service” (shown in red in the system architecture diagram in Figure 2) failed to provide timely responses to queries from our inbound message queue. The slow responses from the Relay Domains Service caused the Inbound Queue to start refusing incoming connections with the error “451 4.3.0 <email address>: Temporary lookup failure”.
The Relay Domains Service verifies whether our service is configured to process email for a given email domain. A separate service known as the Transport Map tells us where to deliver the message once we have finished filtering tasks such as identifying and blocking spam. If the Relay Domains Service is unable to answer queries quickly, email delivery slows down, because the Inbound Queue has to wait a long time before it knows whether it can accept a message for delivery. If we didn't first check whether we can deliver a message, spammers would exploit our service to generate back-scatter spam.
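The interaction described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the in-memory `RELAY_DOMAINS` and `TRANSPORT_MAP` tables, the `check_recipient` function, the timeout value, and the 554 and 250 reply texts are all assumptions made for the example. Only the 451 reply text is taken from the incident itself.

```python
import concurrent.futures
import time

# Hypothetical stand-ins for the Relay Domains Service and the Transport Map.
RELAY_DOMAINS = {"example.com"}
TRANSPORT_MAP = {"example.com": "mx.internal.example.com:25"}

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def is_relay_domain(domain, delay=0.0):
    """Simulated lookup; `delay` models a slow database behind the service."""
    time.sleep(delay)
    return domain.lower() in RELAY_DOMAINS


def check_recipient(address, timeout=0.5, delay=0.0):
    """Return an SMTP-style reply for a recipient address."""
    domain = address.rsplit("@", 1)[-1].lower()
    future = _executor.submit(is_relay_domain, domain, delay)
    try:
        accepted = future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # The lookup did not answer in time: ask the sender to retry later.
        return f"451 4.3.0 <{address}>: Temporary lookup failure"
    if not accepted:
        # Unknown domain: refuse up front so we never generate back-scatter.
        return f"554 5.7.1 <{address}>: Relay access denied"
    return f"250 2.1.5 Ok (next hop {TRANSPORT_MAP[domain]})"


print(check_recipient("user@example.com"))
print(check_recipient("user@unknown.example"))
print(check_recipient("user@example.com", timeout=0.05, delay=0.3))
```

The third call simulates the incident: the domain is valid, but the lookup is slower than the caller is willing to wait, so the sender sees a temporary failure rather than an acceptance.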
Figure 2: The logical architecture of the MailChannels Inbound Filtering service. The component that was running too slowly is highlighted in red. Some components have been obscured for security reasons.
Determining why lookups were so slow
Because email never sleeps, we use a highly reliable database system provided and managed by Amazon Web Services (AWS) to store information about the domains for which we process email. The Relational Database Service (RDS) provides us with a clustered, replicated, and automatically backed up database against which we can make queries without having to worry about the reliability of underlying hardware and software. RDS also lets us scale up rapidly, taking care of the complicated process of replicating data to new machines as necessary.
While RDS is impressive and reliable, one of our database queries was not, as we’ll learn more about below.
But first, a note about indexes. In databases, an “index” allows us to find things more quickly by pre-computing a lookup table that tells the database where to find information. Without an index, you have to look at potentially every entry in a table of data. With an index, you can find what you are looking for by consulting the index, which is enormously more efficient.
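To make the difference concrete, here is a small sketch that uses Python's built-in SQLite module as a stand-in for our real database. The `domains` table, its columns, and the index name are hypothetical. The example asks the database how it plans to run the same lookup before and after an index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domains (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO domains (name) VALUES (?)",
    ((f"customer{i}.example",) for i in range(10_000)),
)


def query_plan(sql, params=()):
    """Return SQLite's description of how it would execute `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail


lookup = "SELECT id FROM domains WHERE name = ?"

# Without an index, the plan is a SCAN: every row may be examined.
plan_without_index = query_plan(lookup, ("customer42.example",))

conn.execute("CREATE INDEX idx_domains_name ON domains (name)")

# With an index, the plan is a SEARCH that goes through the index.
plan_with_index = query_plan(lookup, ("customer42.example",))

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan varies between SQLite versions, but the first plan reports a scan of the table and the second reports a search using `idx_domains_name`.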
In our database, domains are stored in a 4,096-byte-wide column to allow for the storage of multibyte character domains that can be extremely long (4,096 bytes is the maximum length of such a domain). Unfortunately, RDS only allows us to create an index key for the first 767 bytes of any column - a limitation of its indexing system. This limitation means we can’t fully index the domain name column. Thus, every time we processed a message, the database had to look at a substantial fraction of the entries in the domain table, rather than consulting an index to speed up the process. On Tuesday, the load caused by these inefficient table scans pushed the RDS database systems to their breaking point: complete CPU saturation.
The immediate fix
We knew that it would take a few hours for us to fully fix the indexing problem that was leading to slow database queries. In the meantime, we set one team member to work on a temporary workaround that would freeze our domain data into a text-based lookup table. We also scaled up our database capacity by about a factor of 100 by deploying a massive number of 12xlarge RDS instances. Thanks to the enormous capacity of AWS, we were able to deploy this additional capacity within 30 minutes, which allowed us to process some of the email backlog that was starting to build within the service.
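For readers curious about what a text-based lookup table might look like, here is a minimal sketch. It is an illustration of the general idea only: the file name, the function names, and the one-domain-per-line format are assumptions, not a description of what our team member actually built.

```python
import os
import tempfile


def freeze_domains(domains, path):
    """Write a sorted, de-duplicated, lower-cased snapshot, one domain per line."""
    snapshot = sorted({d.strip().lower() for d in domains if d.strip()})
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(snapshot) + "\n")
    # Swap the finished file into place so readers never see a partial snapshot.
    os.replace(tmp_path, path)
    return len(snapshot)


def load_frozen_domains(path):
    """Load the snapshot into an in-memory set for constant-time lookups."""
    with open(path, encoding="utf-8") as fh:
        return frozenset(line.strip() for line in fh if line.strip())


# Hypothetical usage: in practice the input would be a one-off database export.
snapshot_path = os.path.join(tempfile.mkdtemp(), "relay_domains.txt")
count = freeze_domains(["Example.com", "example.org", "example.com "], snapshot_path)
frozen = load_frozen_domains(snapshot_path)
print(count, "example.com" in frozen, "unknown.example" in frozen)
```

The trade-off is that a frozen table answers lookups without touching the database at all, but it goes stale the moment a customer adds or removes a domain, which is why it only makes sense as a stopgap.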
The permanent fix
To work around the 767-byte limitation on index columns, we created a new database column that stores a hash of the domain name column. The hash value takes up no more than 256 bytes and is therefore indexable by RDS. Migrating to an indexed lookup of the hashed domain name allowed us to increase query throughput by a factor of 10 while using only 1/10th of the CPU resources - a 100x improvement in query throughput per unit of CPU.
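The hashed-column technique can be sketched as follows, again using Python's built-in SQLite module as a stand-in for our real database. The choice of SHA-256, the table and column names, and the helper functions are assumptions made for the example. The key idea is that the hash has a short, fixed length no matter how long the domain is, so it can always be fully indexed.

```python
import hashlib
import sqlite3


def domain_hash(domain):
    """SHA-256 hex digest (always 64 ASCII characters) of the normalized domain."""
    return hashlib.sha256(domain.strip().lower().encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE domains ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, name_hash TEXT NOT NULL)"
)
# The index covers the short hash column, not the very wide name column.
conn.execute("CREATE INDEX idx_domains_name_hash ON domains (name_hash)")


def add_domain(name):
    normalized = name.strip().lower()
    conn.execute(
        "INSERT INTO domains (name, name_hash) VALUES (?, ?)",
        (normalized, domain_hash(normalized)),
    )


def is_known_domain(name):
    normalized = name.strip().lower()
    # The hash narrows the search via the index; comparing the full name as
    # well guards against the (very unlikely) case of a hash collision.
    row = conn.execute(
        "SELECT 1 FROM domains WHERE name_hash = ? AND name = ?",
        (domain_hash(normalized), normalized),
    ).fetchone()
    return row is not None


very_long_domain = "a" * 3000 + ".example"
add_domain("example.com")
add_domain(very_long_domain)
print(is_known_domain("Example.COM"), is_known_domain(very_long_domain))
print(is_known_domain("unknown.example"))
```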
Why didn’t we catch this previously?
Whenever an outage occurs in our service, we take pains to identify not only the fix to the problem, but also the process failure that allowed the problem to occur in the first place. By working constantly to improve our internal processes, we reduce the likelihood of future failures. Looking back, the slow query would have been identified if we had stress-tested our domain information database as part of the software verification process. While we do perform various stress tests, the kind of load test that would have revealed this query error was not part of our test suite.
What are we doing to make sure this doesn’t happen again?
Going forward, we’re going to make sure that every database column that we may filter on has some kind of index, even if the index only covers a prefix of the actual column value. We’re going to audit all of our code that queries a database to make sure there are no other queries that may trigger full table scans. We’ve already added additional logging and monitoring to the database servers, so that in the future we can identify and correct these issues more quickly.
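One way such an audit could be automated is to ask the database for its query plan for each known query and flag any plan that scans a table without an index. The sketch below shows the idea with Python's built-in SQLite module. The table, the queries, and the `find_full_table_scans` helper are hypothetical, and the plan text it inspects is specific to SQLite, so a real audit would need to parse the plan format of the database actually in use.

```python
import sqlite3


def find_full_table_scans(conn, queries):
    """Return (sql, plan detail) pairs whose plan scans a table without an index.

    `queries` is a list of (sql, params) tuples. A SCAN that walks an index is
    not flagged here, although it may still deserve a closer look.
    """
    flagged = []
    for sql, params in queries:
        for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params):
            detail = row[3]  # column 3 holds the human-readable plan step
            if detail.startswith("SCAN") and "INDEX" not in detail:
                flagged.append((sql, detail))
    return flagged


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domains (id INTEGER PRIMARY KEY, name TEXT, owner TEXT)")
conn.execute("CREATE INDEX idx_domains_name ON domains (name)")

queries = [
    ("SELECT id FROM domains WHERE name = ?", ("example.com",)),  # indexed
    ("SELECT id FROM domains WHERE owner = ?", ("acme",)),        # not indexed
]

flagged = find_full_table_scans(conn, queries)
for sql, detail in flagged:
    print(f"full table scan: {sql!r} ({detail})")
```

Running a check like this as part of a test suite would catch an unindexed filter column before it reaches production, which complements rather than replaces load testing.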
Our priority: Deliver your email reliably
Delivering email is literally our “only job”, and it’s painful and frustrating when we aren’t able to do this for our customers as reliably as we promise. We apologize to our valued customers for the inconvenience they experienced this week and want to assure them that we will continue to strive for perfect uptime by making improvements in the weeks to come. To sign up for service availability notices by email, RSS, or text message, please use the button at the top of our status page at https://status.mailchannels.net.
Photo source: https://www.proxyclick.com/