Resolved
eCommerce / Dashboard | Service Disruption (6/29/20)

Started
June 29, 2020 at 11:19 PM
Status
Resolved after 1 day

Impact

Major outage
Affected
Retailer Dashboard
eCommerce Applications
  • July 01, 2020 at 12:49 AM

    Yesterday (Jun 29) at 4:14 pm PT, our service experienced a long-running critical outage, during which users were unable to access any eCommerce sites powered by Olla, and the majority of customers were unable to establish a reliable connection to our dashboard. We call it “critical” because orders could not be placed or processed using Olla, the core service that we offer.

    Transparency is in our DNA: real-time updates pertaining to this outage are logged on our status page (status.olla.co), but we also wanted to share with you all the details of the outage and, more importantly, the details of our response.

    The Alert

    At approximately 4:14 pm PT, our director of engineering received an alert signaling a 2x increase in the average number of requests made through one of our customer-serving API gateways. The spike was identified as an anomaly, and triage protocols were immediately put into action.
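
    For readers curious how an alert like this one can work, here is a minimal sketch of a rolling-average check that fires when request volume reaches twice its recent baseline. It is an illustration only, not our production alerting: the class name, the window size, and the requests-per-minute unit are all assumptions made for the example.

```python
from collections import deque
from statistics import mean


class RequestRateMonitor:
    """Flags a sample that reaches `factor` times the rolling baseline average.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, window: int = 12, factor: float = 2.0) -> None:
        self.baseline = deque(maxlen=window)  # recent non-anomalous samples
        self.factor = factor

    def observe(self, requests_per_minute: float) -> bool:
        """Record one sample and return True if it should raise an alert."""
        if len(self.baseline) < self.baseline.maxlen:
            # Still warming up: collect a baseline before judging anything.
            self.baseline.append(requests_per_minute)
            return False
        average = mean(self.baseline)
        is_spike = average > 0 and requests_per_minute >= self.factor * average
        if not is_spike:
            # Only normal samples update the baseline, so a sustained spike
            # keeps alerting instead of becoming the new "normal".
            self.baseline.append(requests_per_minute)
        return is_spike
```

    Feeding the monitor three samples of 100 requests per minute and then one of 210 would raise an alert on the fourth sample, while a later sample of 120 would not.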

    We soon confirmed that all eCommerce sites were displaying a 502 (Bad Gateway) error, signaling a breakdown in the flow of data between core services. In line with our critical outage triage protocol, our customer support team was bolstered right away. Our customer-facing teams were soon overwhelmed with incoming phone calls, yet our chat-support response times remained under our 5-minute target. Our operations team confirmed the severity of the outage with our engineering team, and an incident was logged on our Status Page.

    Unfortunately, we soon determined that this outage wasn’t related to any recent code deployments, which ruled out rolling back to a previous, stable version. Instead, our development team dove into debugging the problem areas, combing through logs and ruling out any potential security threats.

    Identifying the Problem

    At approximately 7:45 pm PT, our engineering team pinpointed the primary area behind the outage and began focusing solely on rectifying the issue. In summary, this incident was directly related to our API gateway, our automatic SSL certificate renewal service, and the flow of data to our customer-facing sites.
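
    This summary does not spell out exactly how the renewal service was involved, so the following is only a generic sketch of the kind of routine systems check that can watch certificates independently of any automatic renewal. It uses Python's standard `ssl` and `socket` modules; the function names and the 14-day threshold are assumptions made for the example, not a description of our tooling.

```python
import socket
import ssl


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the 'notAfter' field of the certificate served by host:port.

    If the certificate has already expired, the TLS handshake itself raises
    ssl.SSLCertVerificationError, which is also worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (seconds since the epoch) and the expiry time."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def needs_renewal(not_after: str, now: float, threshold_days: float = 14.0) -> bool:
    """True when the certificate expires within the (assumed) threshold."""
    return days_until_expiry(not_after, now) <= threshold_days
```

    A scheduled job could call `fetch_not_after` for each customer-facing domain, pass the result to `needs_renewal` together with `time.time()`, and notify the team well before an expiry could affect live sites.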

    The good news was all core services continued to function properly on the backend, product data remained up-to-date, integrations continued to sync data, etc. The bad news was the eCommerce and Retailer Dashboard front-end applications were unable to establish a reliable connection with the appropriate API gateways in order to access and visualize any of the data necessary to build these applications.

    Implementing A Fix

    At approximately 9:36 pm PT, our engineering team deployed an update to our core infrastructure services, directly addressing the issue at hand. The update was successful, and data immediately began flowing to the frontend applications once again. We continued to monitor the application health until 10:30 pm PT, then updated the status of this incident to ‘resolved’.

    Thanks to the tireless effort of our development team, we were able to resolve this outage before EOD for many of the retailers powered by Olla, and avoided carrying this issue over into the following day. We are incredibly appreciative and thankful for the hard work our development team put in to investigate, identify, and develop a fix to resume service.

    Preventing This Situation In The Future

    One of our core values is “Embrace our Responsibility”. For us, this means reveling in the responsibility we collectively bear on our shoulders - building and maintaining the best damn eCommerce software, used to process tens of millions of dollars worth of commerce activity each month. Especially in this COVID era, where eCommerce is more important than ever, we take pride in this responsibility.

    In the event of a critical outage, the real work begins as soon as the issue has been resolved. Triage mode doesn’t end as soon as service has been resumed and an apology letter has been written - that’s when the hard work begins. We have identified a series of flaws within our processes that ultimately led to this outage & are actively taking steps, both organizational and technological, to address them. We’ll be investing in enhanced tools and technologies to support our efforts, and enhancing our SOPs to better account for routine maintenance, systems checks and oversight.

    In Conclusion

    In a perfect world, we’d make an ambitious statement written by our marketing team, full of promises of grandeur, assuring you that we’ll never have an outage again. We’d rather be honest with you.

    We’re humans writing and maintaining software - mistakes are inevitable. What we can promise you, wholeheartedly, is that we can do much, much better than we did yesterday - a 5-hour outage is simply unacceptable, and we won’t stand to encounter that again.

    We commit to improving our triage processes, enhancing our policies and technologies, and empowering our team to not only ensure a high rate of stability and uptime, but to triage problems much, much faster & with far less collateral damage than we did yesterday.

    We’re extraordinarily grateful for the patience and grace you all showed us last night & give you our word that we can, and will, do better.

    ♥️ Team Olla

  • Resolved
    June 30, 2020 at 5:31 AM

    After a period of close monitoring, this incident has been resolved.

    We will post a more detailed post-mortem tomorrow - in the meantime, this incident was directly related to our API gateway, our automatic SSL certificate renewal service, and the flow of data to our customer-facing sites.

    On behalf of the entire team, we're so incredibly sorry for this incident and the inconvenience it has inevitably caused. Rest assured, we understand the critical service Olla provides, especially during these challenging times, and will continue investing heavily in minimizing any downtime incidents in the future.

    We thank you all for your patience and understanding while confronted with this outage - we have the best customers & are grateful each and every day for you all.

    ❤️ Team Olla

  • Monitoring
    June 30, 2020 at 4:55 AM

    A fix has been implemented and we are monitoring the results. Full functionality has been restored across the platform.

  • Identified
    June 30, 2020 at 3:02 AM

    Our engineering team has identified the underlying issue and we are currently working on the fix to resume service ASAP.

  • Update
    June 30, 2020 at 1:54 AM

    We are continuing to investigate this issue and will post updates here in real-time.

    We understand how critical online ordering is currently, and are working to identify the root cause and implement a solution as quickly as we can.

  • Update
    June 30, 2020 at 12:14 AM

    We are continuing to investigate this issue - we'll continue to post updates in real-time via this status page.

    We sincerely apologize for this inconvenience & assure you that all efforts are focused on getting your online stores running ASAP!

  • Investigating
    June 29, 2020 at 11:19 PM

    We are receiving reports of customers unable to load online store sites and the dashboard. Our engineers are investigating and we will update this page as the situation develops.