Resolved
Online Store Service Degradation

Started
October 08, 2020 at 12:13 AM
Status
Resolved after 2 days

Impact

Major outage
Affected
eCommerce Applications
  • October 09, 2020 at 10:23 PM

    On October 7 starting at 5:06pm PT, the front-end online stores experienced a critical outage. During this time, users were unable to access the eCommerce sites powered by Olla or place online orders. Access to and functionality of the Olla Dashboard were unaffected during this incident. The incident ran until 9:37pm PT, totaling 4hrs and 31mins of degraded service.
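
    For readers reconciling the Pacific times in this write-up with the timestamps on the timeline entries below, the arithmetic can be sketched as follows. This is only an illustration: it assumes the timeline timestamps are shown in UTC and that PT on this date means Pacific Daylight Time (UTC-7). The variable names are ours.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "PT" on October 7, 2020 is Pacific Daylight Time (UTC-7).
PDT = timezone(timedelta(hours=-7), "PDT")

start = datetime(2020, 10, 7, 17, 6, tzinfo=PDT)   # 5:06pm PT
end = datetime(2020, 10, 7, 21, 37, tzinfo=PDT)    # 9:37pm PT

# Length of the degraded period.
duration = end - start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60
print(f"{hours}hrs {minutes}mins")  # 4hrs 31mins

# The same start time expressed in UTC (assumed timeline time zone).
print(start.astimezone(timezone.utc).isoformat())  # 2020-10-08T00:06:00+00:00
```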

    The Alert

    At 5:06pm PT our Lead Engineer received a monitoring alert that our Platform API was returning 500 errors and that, as a result, all online stores were inaccessible.

    Triage protocols were put into place immediately and our StatusPage was updated to reflect the degraded state.

    Our customer support team began fielding inbound requests from customers looking to place an order with their preferred retailer, as well as from retailers looking to report the outage. Both groups were directed to our StatusPage and informed of the current status of our investigation.

    Our initial investigation showed that our API gateway was reporting degraded health and that inbound traffic had spiked well outside the nominal range.

    Our development team dove into debugging the problem areas, combing through logs, and ruling out any potential security threats.

    Identifying the Problem

    Upon digging into our infrastructure further, our team confirmed that our AWS cloud management tool was reporting severe health issues.

    Given the traffic spike, we treated the issue as load related and scaled up the server pool while, at the same time, beginning to reboot the existing servers in the pool at 6:50pm PT. All came back online with positive health statuses, but our load balancer was still reporting failures.

    By 7:05pm PT we began a full reboot of all infrastructure servers and gained some additional log output from logging mechanisms that had previously failed.

    Based on this data, our team began auditing the SSL certificates for all areas of our core infrastructure. We had recently investigated the reverse proxy and its automatic SSL management as a cause of previous service degradation and had begun the process of identifying and procuring a new solution. The certificate for our Platform API was valid and not expired, but the API Gateway, which is the reverse proxy, was returning a 500 error for this resource. Our ongoing analysis of this part of our infrastructure pointed to the reverse proxy as the cause of this outage.
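
    The distinction drawn above, a certificate that is still valid while the proxy in front of it returns a 500, can be illustrated with a small check. This is a minimal sketch, not the tooling our team used: the function names and the status labels are hypothetical, and the `not_after` string is assumed to be in the format Python's `ssl` module reports for a certificate's `notAfter` field.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days remaining before a certificate expires.

    not_after uses the 'notAfter' format reported by Python's ssl module,
    e.g. "Oct  8 00:00:00 2021 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def classify(not_after: str, http_status: int, now: datetime) -> str:
    """Hypothetical triage label combining certificate state and HTTP status."""
    cert_valid = days_until_expiry(not_after, now) > 0
    if not cert_valid:
        return "certificate expired"
    if http_status >= 500:
        # Valid certificate but a server-side error: look at the proxy/gateway.
        return "certificate valid, proxy failing"
    return "healthy"


# Example values are illustrative, not taken from the incident.
now = datetime(2020, 10, 8, tzinfo=timezone.utc)
print(classify("Oct  8 00:00:00 2021 GMT", 500, now))
```

    A check like this separates "the certificate is the problem" from "something in front of the certificate is the problem", which is the conclusion the audit reached.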

    Implementing A Fix

    By 8:45pm PT our development team had positively identified the primary area causing the outage.

    The failure that cascaded throughout the retailer online store network originated in a daily automatic certificate renewal process running against an unexpired certificate. During this process, a failure occurred which set an invalid configuration in the API gateway and caused all traffic for our Platform API to fail.

    Our development team began issuing commands to order, but not install, new certificates, which updated the gateway configuration and resolved the issue.
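
    One general way to guard against this class of failure is to skip renewal when a certificate is not close to expiry and to apply a new configuration only after it has been validated. The sketch below illustrates that pattern under stated assumptions; it is not our gateway's actual renewal logic. The 30-day window, the dictionary-based configuration, and the `order_certificate` and `validate` callables are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable

# Assumption: renew only when fewer than 30 days of validity remain.
RENEW_WINDOW = timedelta(days=30)


def renew_if_due(
    current_config: dict,
    expires_at: datetime,
    now: datetime,
    order_certificate: Callable[[], dict],
    validate: Callable[[dict], bool],
) -> dict:
    """Return the configuration the gateway should run with.

    The current configuration is kept unless renewal is due, the order
    succeeds, and the resulting configuration passes validation.
    """
    if expires_at - now > RENEW_WINDOW:
        return current_config  # Not close to expiry: do nothing.
    try:
        candidate = order_certificate()
    except Exception:
        return current_config  # Order failed: keep serving the known-good config.
    if not validate(candidate):
        return current_config  # Invalid result: never apply it.
    return candidate


# Illustrative usage with stand-in values.
now = datetime(2020, 10, 7, tzinfo=timezone.utc)
good = {"cert": "current"}
result = renew_if_due(
    good,
    expires_at=now + timedelta(days=60),
    now=now,
    order_certificate=lambda: {"cert": "new"},
    validate=lambda cfg: "cert" in cfg,
)
print(result)  # {'cert': 'current'}
```

    The design choice is that every failure path returns the known-good configuration, so a failed renewal cannot leave the gateway in an invalid state.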

    By 9:37pm PT all retailer online stores were back up, full functionality had been restored, and new orders immediately began flowing to retailers to be fulfilled.

    During this incident, our support team fielded hundreds of support requests from end users who were attempting to place an order. Upon resolution of the incident, we followed up directly with each one and prompted them to place a new order with the original retailer.

    Preventing This Situation In The Future

    We are very sensitive to the reliance and trust placed on our shoulders during this unique period of time, when online ordering is a critical, core part of a cannabis retailer’s infrastructure.

    Since pivoting at the start of the COVID crisis toward infrastructure enhancements that can serve higher traffic volumes, we have also taken this moment to reassess and rebuild many of our foundations.

    As of last week, we are proud to announce we have secured new engineering leadership who will be overseeing our overall growth, and ensuring the highest possible level of service to both retailers and their customers.

    Part of this process is the audit, assessment, and replacement of aspects of our codebase that require greater resources to support our growing network of users. We are actively investing in more robust, agile, and scalable tools to support our growth, and significantly enhancing our monitoring, maintenance, and reporting functionality.

    We’re extraordinarily grateful for your patience, and our entire team sincerely appreciates your support as we move forward into our next chapter.

  • Resolved
    October 08, 2020 at 4:56 AM

    This incident has been resolved and full functionality has been restored. We will post a post-mortem tomorrow with more technical details and the steps we'll be taking to address this going forward.

    On behalf of the entire team, we sincerely apologize for this incident and the impact it’s had on your business. We thank you all for your patience and understanding while confronted with this outage.

  • Monitoring
    October 08, 2020 at 4:42 AM

    A fix has been implemented for this issue and full functionality for accessing online stores has been resumed.

  • Update
    October 08, 2020 at 3:53 AM

    We are continuing to investigate this issue and will keep posting updates in real time via this status page.

    We understand the impact any downtime has and appreciate your patience while we work through this.

    We sincerely apologize for this inconvenience & assure you that all efforts are focused on getting your online stores running ASAP!

  • Update
    October 08, 2020 at 3:01 AM

    Some customers may be experiencing issues accessing their online store. Our engineers are investigating the root cause of this issue, and we will provide updates as soon as possible. We are working to resolve this ASAP.

  • Update
    October 08, 2020 at 2:07 AM

    Some customers may be experiencing issues accessing their online store. Our engineers are investigating the root cause of this issue, and we will provide updates as soon as possible. We are working to resolve this ASAP.

  • Update
    October 08, 2020 at 1:23 AM

    Some customers may be experiencing issues accessing their online store. Our engineers are investigating the root cause of this issue, and we will provide updates as soon as possible. We are working to resolve this ASAP.

  • Update
    October 08, 2020 at 12:51 AM

    Some customers may be experiencing issues accessing their online store. Our engineers are investigating the root cause of this issue, and we will provide updates as soon as possible. We are working to resolve this ASAP.

  • Investigating
    October 08, 2020 at 12:13 AM

    Some customers may be experiencing issues accessing their online store. Our engineers are investigating the root cause of this issue, and we will provide updates as soon as possible. We are working to resolve this ASAP.