• Home
  • News
  • Coins2Day 500
  • Tech
  • Finance
  • Leadership
  • Lifestyle
  • Rankings
  • Multimedia
TechPointCloud

Amazon explains database glitch that impacted big customers

Barb Darrow
By
Barb Darrow
Barb Darrow
Down Arrow Button Icon
Barb Darrow
By
Barb Darrow
Barb Darrow
Down Arrow Button Icon
September 23, 2015, 8:24 AM ET
Photograph by David Ryder — Getty Images

Amazon Web Services has posted an explanation of what happened at its big U.S. East data center that caused customers like Netflix to experience issues on Sunday.

According to the post-mortem of the “service event” (you have to love that term) a brief network disruption at 2:19 a.m. PDT affected a subset of the servers running Amazon’s (AMZN) DynamoDB database service which stores and maintains data tables for customers. Each table is divvied up into partitions, containing a portion of the table data and those partitions, in turn, are parceled out to many servers to provide fast access and to allow data replication.

Per the post, which apparently went up Tuesday night:

The specific assignment of a group of partitions to a given server is called a “membership.” The membership of a set of table/partitions within a server is managed by DynamoDB’s internal metadata service. The metadata service is internally replicated and runs across multiple data centers. Storage servers hold the actual table data within a partition and need to periodically confirm that they have the correct membership. They do this by checking in with the metadata service and asking for their current membership assignment. In response, the metadata service retrieves the list of partitions and all related information from its own store, bundles this up into a message, and transmits back to the requesting storage server.

Emphasis is mine. Read the post for the full blow-by-blow, but in essence, Amazon said the issue Sunday was that because so many customers are using a new DynamoDB feature called Global Secondary Indexes, the affected DynamoDB servers could not query the metadata service within the allotted time and took themselves offline.

As of 5:06 a.m. PDT Amazon thus decided to pause requests to the metadata service to relieve the load. By cutting down all those server retries, it was able to bring up additional capacity and restart the service.

Basically this all boils down to the fact that this Amazon service was operating at near full-capacity but AWS internal monitoring apparently did not pick that up in time to avert an outage, said David Mytton, CEO of Server Density, a London-based company which keeps an eye on web server performance across providers for customers.

“A normal network issue caused enough extra load to push the system over capacity which caused the issue,” he said via email.

Amazon will likely fix the monitoring situation and adjust its processes to improve analysis so this does not happen again, he noted. Amazon is nothing if not reactive.

Events like this one reignite the debate about whether businesses should entrust so much of their critical workload to shared public cloud infrastructure that they themselves do not own or control.

Some even brought up the notion that AWS, the self-proclaimed master of distributed and redundant resources, has become the de facto single point of failure IT people dread.

This morning’s #AWS outage reminds us that we all have a single point of failure now. In some ways, we used to be more resilient than that.

— Michael Jackson (@mjackson) September 20, 2015

Coins2Day reached out to Amazon for comment, but typically the cloud giant and big customers— including Netflix (NFLX)—note that if applications are designed correctly to take advantage of public cloud resources the benefits outweigh the risks.

The Amazon post ended in an apology and a promise to do better, explaining “… we will do everything we can to learn from the event and to avoid a recurrence in the future.”

Expect to hear more about this issue at AWS Re:invent in Las Vegas next month.

For more on Amazon Web Services, see the video.

Subscribe to Data Sheet, Coins2Day’s daily newsletter on the business of technology.

About the Author
Barb Darrow
By Barb Darrow
See full bioRight Arrow Button Icon
Rankings
  • 100 Best Companies
  • Coins2Day 500
  • Global 500
  • Coins2Day 500 Europe
  • Most Powerful Women
  • Future 50
  • World’s Most Admired Companies
  • See All Rankings
Sections
  • Finance
  • Leadership
  • Success
  • Tech
  • Asia
  • Europe
  • Environment
  • Coins2Day Crypto
  • Health
  • Retail
  • Lifestyle
  • Politics
  • Newsletters
  • Magazine
  • Features
  • Commentary
  • Mpw
  • CEO Initiative
  • Conferences
  • Personal Finance
  • Education
Customer Support
  • Frequently Asked Questions
  • Customer Service Portal
  • Privacy Policy
  • Terms Of Use
  • Single Issues For Purchase
  • International Print
Commercial Services
  • Advertising
  • Coins2Day Brand Studio
  • Coins2Day Analytics
  • Coins2Day Conferences
  • Business Development
About Us
  • About Us
  • Editorial Calendar
  • Press Center
  • Work At Coins2Day
  • Diversity And Inclusion
  • Terms And Conditions
  • Site Map

© 2025 Coins2Day Media IP Limited. All Rights Reserved. Use of this site constitutes acceptance of our Terms of Use and Privacy Policy | CA Notice at Collection and Privacy Notice | Do Not Sell/Share My Personal Information
FORTUNE is a trademark of Coins2Day Media IP Limited, registered in the U.S. and other countries. FORTUNE may receive compensation for some links to products and services on this website. Offers may be subject to change without notice.