
1,957 Users Reported Issues! Tencent Cloud Releases Incident Analysis for April 8th: Cloud API Abnormality Lasted Approximately 87 Minutes

She Qi Sun, Apr 21 2024 07:45 PM EST

On April 14th, Tencent Cloud acknowledged that on April 8th a large number of users had reported service disruptions, including being unable to log in to the Tencent Cloud console.

Today, Tencent Cloud published a retrospective analysis of the incident along with an overview of what happened. After troubleshooting, the company found that customers' inability to log in to the console was caused by an abnormality in the cloud API. This in turn affected public cloud services that rely on the cloud API, such as cloud functions, text recognition, the microservices platform, audio content security, and captcha.

According to the official statement, the outage lasted approximately 87 minutes, during which 1,957 customers reported issues.

Tencent attributed the root cause to two factors: the new version of the cloud API service was not fully backward compatible, and the gray (canary) release mechanism for configuration data was inadequate.

During the API upgrade, the new version changed the interface protocol, so data sent by the old-version frontend was processed incorrectly and generated erroneous configuration data. Because the gray release mechanism was insufficient, the bad data quickly propagated across the entire network, breaking API usage globally.
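To make this failure mode concrete, here is a minimal, hypothetical Python sketch. It is not Tencent's actual code; all field names and formats are invented. It shows how a backward-incompatible protocol change can silently turn well-formed old-version requests into corrupt configuration data:

```python
# Hypothetical illustration of the failure mode described above.
# All names and formats are invented; Tencent has not published its code.

def parse_config_v1(request: dict) -> dict:
    """Old protocol: settings arrive as top-level string fields."""
    return {"region": request["region"], "ttl": int(request["ttl"])}

def parse_config_v2(request: dict) -> dict:
    """New protocol: settings moved under a nested 'meta' object.
    Falling back to defaults hides the incompatibility instead of
    rejecting the request, silently producing a broken config."""
    meta = request.get("meta", {})
    return {"region": meta.get("region", ""), "ttl": int(meta.get("ttl", 0))}

# An old-version frontend still sends the v1 shape:
old_request = {"region": "ap-shanghai", "ttl": "300"}

print(parse_config_v1(old_request))  # {'region': 'ap-shanghai', 'ttl': 300}
print(parse_config_v2(old_request))  # {'region': '', 'ttl': 0}  <- corrupt config

# Without a gray release stage that validates new configs on a small
# slice of traffic first, corrupt entries like this would be written
# straight to the global configuration store and reach every region.
```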

Following the outage, Tencent Cloud executed its standard rollback plan, reverting both the service backend and the configuration data to the old version. Recovery stalled, however, because of a circular dependency: the container platform hosting the API service itself relied on the API service for scheduling, so the platform could not automatically restart the failed service.

Manual intervention was required to restart the API service and complete the fault recovery process.
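The recovery deadlock can be sketched abstractly. The following hypothetical Python example (component names are invented, not Tencent's architecture) shows why a dependency cycle blocks automatic restarts until an operator breaks the cycle by hand:

```python
# Hypothetical sketch of the recovery deadlock; component names invented.

# Each component lists the components it needs in order to (re)start.
depends_on = {
    "cloud_api": ["container_platform"],   # the API runs as containers
    "container_platform": ["cloud_api"],   # scheduling calls go through the API
    "console": ["cloud_api"],
}

def restart_order(component, deps, seen=None):
    """Return a valid restart order, or raise if a cycle makes
    automatic recovery impossible."""
    if seen is None:
        seen = []
    if component in seen:
        cycle = " -> ".join(seen + [component])
        raise RuntimeError(f"circular dependency, manual intervention needed: {cycle}")
    order = []
    for dep in deps.get(component, []):
        order += restart_order(dep, deps, seen + [component])
    return order + [component]

try:
    restart_order("console", depends_on)
except RuntimeError as e:
    print(e)
    # circular dependency, manual intervention needed:
    # console -> cloud_api -> container_platform -> cloud_api
```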

Tencent's review of the incident timeline is as follows:

  1. At 15:23, the fault was detected, and immediate service recovery and root cause analysis were initiated.
  2. At 15:47, it was discovered that rolling back the version did not fully restore the service, prompting further investigation.
  3. At 15:57, the root cause of the fault was identified as erroneous configuration data, leading to the urgent design of a data repair plan.
  4. At 16:02, data repair work was initiated across all regions, with API services gradually recovering.
  5. At 16:05, it was observed that API services in all regions except Shanghai had been restored, prompting further investigation into the recovery issue in Shanghai.
  6. At 16:25, it was identified that technical components in Shanghai had a circular dependency issue, leading to the decision to redirect traffic to other regions for recovery.
  7. At 16:45, recovery of the Shanghai region was confirmed, and the API and its dependent PaaS services were fully restored. Console traffic then surged significantly, requiring a nine-fold capacity expansion.
  8. By 16:50, request volumes gradually returned to normal levels, and business operations stabilized with all console services restored.
  9. At 17:45, after one hour of continuous monitoring without issues, the fault handling process was deemed complete according to the plan.

To prevent similar incidents in the future, Tencent Cloud will take improvement measures focused on enhancing system resilience, strengthening change management and protective safeguards, and improving fault response and communication.

These measures include:

  • Regularly conducting simulated drills of change management strategies.
  • Optimizing service deployment architecture to mitigate potential circular dependency issues.
  • Providing an API service escape channel so traffic can be switched away quickly during faults.
  • Enhancing automated testing and implementing gray release strategies.
  • Introducing automatic anomaly circuit-breaking mechanisms (a minimal sketch follows this list).
  • Upgrading fault handling processes to ensure real-time updates and transparency in communication.
  • Optimizing information display logic in Tencent Cloud's StatusPage to ensure accurate and timely transmission of fault information, even during cloud service failures.
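As a rough illustration of the circuit-breaking idea in the list above, here is a minimal, hypothetical Python sketch. The thresholds and names are invented and do not describe Tencent's actual design: once anomalies cross a threshold, the breaker opens and rejects further calls, for example suspect configuration pushes, until a cool-down period passes.

```python
import time

# Minimal circuit-breaker sketch; thresholds and names are invented,
# not Tencent's actual design.
class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures  # anomalies tolerated before opening
        self.reset_after = reset_after    # seconds before a retry is allowed
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        # While open, reject calls outright instead of spreading bad data.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            self.opened_at = None  # cool-down over, allow a trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the count
        return result
```

Applied to configuration distribution, such a breaker would stop pushing a config version after repeated validation failures instead of letting it propagate to every region.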

Tencent Cloud's incident review and explanation have been commended by some netizens for their transparency and commitment to improvement.