Amazon explains database glitch that impacted big customers

Amazon Web Services has posted an explanation of what happened at its big U.S. East data center that caused customers like Netflix to experience issues on Sunday.

According to the post-mortem of the “service event” (you have to love that term), a brief network disruption at 2:19 a.m. PDT affected a subset of the servers running Amazon’s (AMZN) DynamoDB database service, which stores and maintains data tables for customers. Each table is divvied up into partitions, each containing a portion of the table data, and those partitions, in turn, are parceled out to many servers to provide fast access and to allow data replication.

Per the post, which apparently went up Tuesday night:

The specific assignment of a group of partitions to a given server is called a “membership.” The membership of a set of table/partitions within a server is managed by DynamoDB’s internal metadata service. The metadata service is internally replicated and runs across multiple data centers. Storage servers hold the actual table data within a partition and need to periodically confirm that they have the correct membership. They do this by checking in with the metadata service and asking for their current membership assignment. In response, the metadata service retrieves the list of partitions and all related information from its own store, bundles this up into a message, and transmits back to the requesting storage server.

Emphasis is mine. Read the post for the full blow-by-blow, but in essence, Amazon said that with so many customers using a new DynamoDB feature called Global Secondary Indexes, the affected DynamoDB servers could not get their membership from the metadata service within the allotted time and took themselves offline.

At 5:06 a.m. PDT, Amazon decided to pause requests to the metadata service to relieve the load. With the flood of server retries cut off, it was able to bring up additional capacity and restart the service.
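
To see why a crowd of servers retrying at once can keep a service pinned down, here is a toy model in Python. It is only a sketch under loud assumptions, not Amazon's implementation: the function names, the processor-sharing formula and every number are hypothetical, and batch admission stands in for Amazon's actual pause of requests.

```python
# Toy model of the membership-check overload. All names and numbers are
# hypothetical illustrations; none of this is AWS code.

def response_time(message_size, concurrent_requests, capacity):
    """Seconds to answer one request when the service's capacity is
    shared equally among all concurrent requests (an assumed model)."""
    return message_size * concurrent_requests / capacity


def run_round(requesting, message_size, capacity, timeout):
    """Return how many servers receive their membership before the timeout.

    In this simplified model every request in a round takes the same time,
    so either all of them succeed or all of them time out."""
    if requesting == 0:
        return 0
    elapsed = response_time(message_size, requesting, capacity)
    return requesting if elapsed <= timeout else 0


def rounds_to_recover(offline, admitted_per_round, message_size,
                      capacity, timeout, max_rounds=50):
    """Count the rounds until every offline server has its membership.

    Returns None if the servers are still offline after max_rounds,
    which is the retry storm that never clears."""
    rounds = 0
    while offline > 0 and rounds < max_rounds:
        requesting = min(offline, admitted_per_round)
        offline -= run_round(requesting, message_size, capacity, timeout)
        rounds += 1
    return rounds if offline == 0 else None


# 100 servers all retry at once: each request takes 5 s against a 1 s timeout.
storm = rounds_to_recover(100, 100, message_size=5, capacity=100, timeout=1)

# Admit only 10 servers per round: each request takes 0.5 s, so all recover.
throttled = rounds_to_recover(100, 10, message_size=5, capacity=100, timeout=1)

# Add capacity instead: the full crowd now fits inside the timeout.
scaled = rounds_to_recover(100, 100, message_size=5, capacity=1000, timeout=1)

print(storm, throttled, scaled)
```

In this model the unthrottled crowd never recovers, because every retry adds to the very load that caused the timeout. Either admitting fewer servers at a time or adding capacity breaks the loop.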

Basically, this all boils down to the fact that this Amazon service was operating at near full capacity, but AWS internal monitoring apparently did not pick that up in time to avert an outage, said David Mytton, CEO of Server Density, a London-based company that keeps an eye on web server performance across providers for customers.

“A normal network issue caused enough extra load to push the system over capacity which caused the issue,” he said via email.

Amazon will likely fix the monitoring situation and adjust its processes to improve analysis so this does not happen again, he noted. Amazon is nothing if not reactive.

Events like this one reignite the debate about whether businesses should entrust so much of their critical workload to shared public cloud infrastructure that they themselves do not own or control.

Some even brought up the notion that AWS, the self-proclaimed master of distributed and redundant resources, has become the de facto single point of failure IT people dread.

This morning’s #AWS outage reminds us that we all have a single point of failure now. In some ways, we used to be more resilient than that.

— Michael Jackson (@mjackson) September 20, 2015

Coins2Day reached out to Amazon for comment, but typically the cloud giant and big customers, including Netflix (NFLX), note that if applications are designed correctly to take advantage of public cloud resources, the benefits outweigh the risks.

The Amazon post ended in an apology and a promise to do better, explaining “… we will do everything we can to learn from the event and to avoid a recurrence in the future.”

Expect to hear more about this issue at AWS re:Invent in Las Vegas next month.
