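Other notable steps AWS has taken include starting work to partition parts of the index subsystem into smaller cells. The company has also changed the administration console of the AWS Service Health Dashboard so that the dashboard can run across multiple AWS regions. Ironically, the typo took the dashboard itself down on Tuesday, and AWS had to rely on Twitter to keep customers informed about the progress of the issue.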
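We would like to share some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28. The Amazon Simple Storage Service (S3) team was debugging an issue that was causing the S3 billing system to process more slowly than expected. At 9:37 AM Pacific Standard Time (PST), an authorized S3 team member, following an established playbook, executed a command intended to remove a small number of servers from one of the S3 subsystems used by the S3 billing process. Unfortunately, one of the inputs to the command was mistyped, and a much larger set of servers was removed than intended. The servers removed by mistake supported two other S3 subsystems. One of them, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is essential for serving all GET, LIST, PUT, and DELETE requests. The second, the placement subsystem, manages the allocation of new storage and cannot operate correctly unless the index subsystem is working properly. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity meant that each of these systems required a full restart. While the subsystems were restarting, S3 could not service requests. While the S3 APIs were unavailable, other AWS services in the region that rely on S3 for storage were also affected, including the S3 console, new instance launches on Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS) volumes (when data was needed from an S3 snapshot), and AWS Lambda.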
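S3 subsystems are designed to tolerate the removal or failure of a significant amount of capacity with little or no impact on customers. We design our systems on the assumption that things will occasionally fail, so we rely on the ability to remove and replace capacity as one of our core operational processes. Although we have relied on this operation to maintain our systems ever since S3 launched, we had not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has grown enormously over the past several years, and restarting these services and running the safety checks needed to validate metadata integrity took longer than expected. The index subsystem was the first of the two affected subsystems that needed to be restarted. By 12:26 PM PST, the index subsystem had activated enough capacity to begin servicing S3 GET, LIST, and DELETE requests. By 1:18 PM PST, the index subsystem had fully recovered, and the GET, LIST, and DELETE APIs were functioning normally. The S3 PUT API also requires the placement subsystem. The placement subsystem began recovering once the index subsystem was functional and finished recovering at 1:54 PM PST. At that point, S3 was operating normally. Other AWS services affected by the event then began to recover. Some of them had accumulated a large backlog of work during the S3 disruption and needed additional time to recover fully.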
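From the start of the event until 11:37 AM PST, we were unable to update the status of individual services on the AWS Service Health Dashboard (SHD), because the SHD administration console depends on Amazon S3. Instead, we used the AWS Twitter account (@AWSCloud) and SHD banner text to communicate status until we were able to update the status of individual services on the SHD. We understand that the SHD gives our customers important visibility during operational events, and we have changed the SHD administration console to run across multiple AWS regions.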
Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region
We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.
S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected. The index subsystem was the first of the two affected subsystems that needed to be restarted. By 12:26PM PST, the index subsystem had activated enough capacity to begin servicing S3 GET, LIST, and DELETE requests. By 1:18PM PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally. The S3 PUT API also required the placement subsystem. The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54PM PST. At this point, S3 was operating normally. Other AWS services that were impacted by this event began recovering. Some of these services had accumulated a backlog of work during the S3 disruption and required additional time to fully recover.
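The dependencies described in the statement (every request type needs the index subsystem, PUT additionally needs the placement subsystem, and placement can only recover once index is functional) can be sketched in a few lines of Python. This is only an illustrative model, not AWS's actual tooling: the names `DEPENDS_ON`, `API_REQUIRES`, `restart_order`, and `available_apis` are hypothetical, and the mapping simply restates the relationships given in the text.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each subsystem lists the subsystems that
# must be functional before it can begin its own recovery.
DEPENDS_ON = {
    "index": set(),
    "placement": {"index"},
}

# Hypothetical map of which subsystems each S3 API needs, per the statement.
API_REQUIRES = {
    "GET": {"index"},
    "LIST": {"index"},
    "DELETE": {"index"},
    "PUT": {"index", "placement"},
}


def restart_order(depends_on):
    """Return a restart order in which dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())


def available_apis(healthy):
    """Return the APIs whose required subsystems are all healthy."""
    return sorted(api for api, needs in API_REQUIRES.items() if needs <= healthy)


if __name__ == "__main__":
    print(restart_order(DEPENDS_ON))
    print(available_apis({"index"}))
    print(available_apis({"index", "placement"}))
```

Under this model, GET, LIST, and DELETE become available as soon as the index subsystem is healthy, while PUT has to wait for the placement subsystem as well, which matches the recovery sequence in the timeline above.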
We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks. We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.
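The two tool changes described above (removing capacity more slowly, and refusing any removal that would take a subsystem below its minimum required capacity) can be illustrated with a minimal Python sketch. It is not the actual AWS tool: the `Subsystem` class, the `plan_removal` function, the `max_batch` parameter, and all numbers are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List


class CapacityRemovalError(Exception):
    """Raised when a removal request would violate a safety limit."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor needed to keep serving requests


def plan_removal(subsystem: Subsystem, requested: int, max_batch: int = 5) -> List[int]:
    """Validate a removal request and split it into small batches.

    Returns the batch sizes to remove one step at a time. Raises
    CapacityRemovalError if the request is invalid or would leave the
    subsystem below its minimum required capacity.
    """
    if requested <= 0:
        raise CapacityRemovalError("requested removal must be positive")
    if max_batch <= 0:
        raise CapacityRemovalError("max_batch must be positive")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityRemovalError(
            f"removing {requested} servers would leave {subsystem.name} with "
            f"{remaining}, below the minimum of {subsystem.min_required}"
        )

    # Slow removal: never take out more than max_batch servers per step.
    batches = []
    left = requested
    while left > 0:
        step = min(max_batch, left)
        batches.append(step)
        left -= step
    return batches


if __name__ == "__main__":
    index = Subsystem(name="index", active_servers=100, min_required=80)
    print(plan_removal(index, 12))  # allowed, split into small batches
    try:
        plan_removal(index, 40)  # an oversized input is rejected up front
    except CapacityRemovalError as err:
        print("rejected:", err)
```

The check runs before anything is removed, so an incorrect input fails immediately instead of taking effect, and the batching bounds how much capacity can disappear in any single step.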
From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD. We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.
Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.
