    Azure status history | Microsoft Azure
    Preliminary Post Incident Review (PIR) – Azure Cosmos DB – North Europe (Tracking ID 3TPC-DT8)

    This is our "Preliminary" PIR, which we endeavor to publish within 3 days of incident mitigation to share what we know so far.

    After our internal retrospective is completed (generally within 14 days), we will publish a "Final" PIR with additional details and learnings.

    What happened?

    Between 09:50 UTC and 17:21 UTC on 07 Sep 2022, a subset of customers using Azure Cosmos DB in North Europe may have experienced issues accessing services. Connections to Cosmos DB accounts in this region may have resulted in an error or timeout.

    Downstream Azure services that rely on Cosmos DB also experienced impact during this window – including Azure Communication Services, Azure Data Factory, Azure Digital Twins, Azure Event Grid, Azure IoT Hub, Azure Red Hat OpenShift, Azure Remote Rendering, Azure Resource Mover, Azure Rights Management, Azure Spatial Anchors, Azure Synapse, and Microsoft Purview.

    What went wrong and why?

    Cosmos DB load balances workloads across its infrastructure, within frontend and backend clusters. Our frontend load balancing procedure had a regression: it did not factor in the reduction in available cluster capacity caused by ongoing maintenance. The regression surfaced during a platform maintenance event in one of the frontend clusters in the North Europe region, causing the availability issues described above.
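
    To illustrate the failure mode (this is a toy model, not Cosmos DB's actual load balancer), planning per-node load over the total number of nodes rather than over the nodes actually serving traffic means the remaining nodes are silently over-committed while maintenance is in progress. The hypothetical Python sketch below shows the arithmetic; all names and numbers are invented for illustration.

    # Illustrative only: a toy model of the failure mode, not the real Cosmos DB implementation.

    def plan_per_node_load(total_rps: float, total_nodes: int, nodes_in_maintenance: int,
                           account_for_maintenance: bool) -> float:
        """Planned request rate per node.

        If maintenance is ignored (the regressed behaviour in this toy model), the plan
        divides load across all nodes, so the nodes still serving traffic end up
        over-committed once some of their peers are drained for maintenance.
        """
        serving_nodes = total_nodes - nodes_in_maintenance
        divisor = serving_nodes if account_for_maintenance else total_nodes
        return total_rps / divisor

    if __name__ == "__main__":
        # 20 frontend nodes, 5 drained for maintenance, 100,000 requests/second of demand.
        planned = plan_per_node_load(100_000, 20, 5, account_for_maintenance=False)
        actual = 100_000 / (20 - 5)  # what each remaining node really receives
        print(f"planned per-node load: {planned:,.0f} rps")  # 5,000 rps
        print(f"actual per-node load:  {actual:,.0f} rps")   # ~6,667 rps, i.e. overload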

    How did we respond?

    Our monitors alerted us to the impact on this cluster. We ran two workstreams in parallel – one focused on identifying the cause of the issues, while the other focused on mitigating the customer impact. To mitigate, we load balanced off the impacted cluster by moving customer accounts to healthy clusters within the region.

    Given the volume of accounts we had to migrate, it took time to load balance them safely – we had to analyze the state of each account individually, then systematically move each one to an alternative healthy cluster in North Europe. This load balancing operation allowed the impacted cluster to recover to a healthy operating state.

    Although we have the ability to mark a Cosmos DB region as offline (which would trigger automatic failover for customers using multiple regions), we decided not to do so during this incident – as the majority of the clusters (and therefore customers) in the region were unimpacted.

    How are we making incidents like this less likely or less impactful?

    Already completed:

    • Fixed the regression in our load balancing procedure so that it safely factors in capacity fluctuations during maintenance.

    In progress:

    • Improving our monitoring and alerting to detect these issues earlier and apply pre-emptive actions. (Estimated completion: October 2022)
    • Improving our processes to reduce the impact time with a more structured manual load balancing sequence during incidents. (Estimated completion: November 2022)

    How can customers make incidents like this less impactful?

    Consider configuring your accounts to be globally distributed – enabling multi-region for your critical accounts would allow for a customer-initiated failover during regional service incidents like this one. For more details, refer to: https://docs.microsoft.com/azure/cosmos-db/distribute-data-globally
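
    As a minimal sketch of the application side of this (assuming a geo-replicated account and the azure-cosmos Python SDK; the account URI, key, and the database, container, item, and partition key names below are placeholders), a client can list its preferred regions so reads are directed to a healthy replica during a regional incident:

    # Minimal sketch: connect to a multi-region Cosmos DB account with preferred regions.
    # ACCOUNT_URI / ACCOUNT_KEY and all resource names are placeholders for your own values.
    from azure.cosmos import CosmosClient

    ACCOUNT_URI = "https://<your-account>.documents.azure.com:443/"
    ACCOUNT_KEY = "<your-account-key>"

    # preferred_locations lists the replicated regions to try, in order, so the client
    # can serve reads from another region if the first one is unavailable.
    client = CosmosClient(
        ACCOUNT_URI,
        credential=ACCOUNT_KEY,
        preferred_locations=["West Europe", "North Europe"],
    )

    container = client.get_database_client("appdb").get_container_client("orders")
    item = container.read_item(item="order-1", partition_key="customer-42")
    print(item)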

    More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency

    Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
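
    If you route these alerts to a webhook, a small receiver can fan the notification out to your own systems. The sketch below assumes the action group posts the Azure Monitor common alert schema; the port, endpoint, and notification logic are placeholders for illustration, not part of the Azure guidance.

    # Minimal sketch of a webhook receiver for Service Health alerts (common alert schema assumed).
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class ServiceHealthWebhook(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")

            # In the common alert schema, data.essentials describes the alert that fired.
            essentials = payload.get("data", {}).get("essentials", {})
            if essentials.get("monitoringService") == "ServiceHealth":
                # Placeholder: page the on-call rotation, open a ticket, etc.
                print("Service Health alert:", essentials.get("alertRule"),
                      "-", essentials.get("description"))

            self.send_response(200)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), ServiceHealthWebhook).serve_forever()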

    How can we make our incident communications more useful?

    We are piloting this "PIR" format as a potential replacement for our "RCA" (Root Cause Analysis) format.

    You can rate this PIR and provide any feedback using our quick 3-question survey: https://www.aka.ms/AzPIR/3TPC-DT8
