Namecheap downtime issues – post mortem
We would like to apologize for the partial service outage that took place in the past 24 hours, where many of our services, including our main Namecheap.com site and our Private Email service, were intermittently unavailable with periods up to a couple of hours.
We have been partially online and offline for the last 19 hours. At the epicenter of these issues was our core hardware firewall cluster. We were intermittently losing and then restoring its operation. Unfortunately, despite it being a fully redundant system, the entire cluster failed in operation due to the extreme load hitting it.
Why this was unexpected
The incident happened due to a human error during expansion work in another area of the network, which at first glance seemed unrelated. We were introducing new equipment to improve the customer experience. Unfortunately, the misconfiguration resulted in a network storm that eventually overloaded our firewall cluster.
Our initial remediation actions were incorrectly based on the assumption that the cause of the issue was either a hardware failure in the cluster or a custom-crafted attack. This was wrong. As a result, we performed actions that had limited effect on full resolution and only temporarily restored service. Nevertheless, in the process of critical incident response, we had all hands on deck for the past 19 hours to address the issues and eventually identified the actual cause and fixed it.
Systems are now fully stable and operational, and we will take the necessary steps to critically analyze and learn from our mistakes to ensure we do not let you down like this again in the future.
Thank you for your trust and continued patience.