Statement on the Namecheap Outage, November 13th
No company likes an outage. At Namecheap, we do everything in our power to maximize availability and uptime and to minimize downtime, because we understand the disruption an outage causes to you and your business.
What happened on November 13
Unfortunately, on November 13, at approximately 5 AM UTC, we experienced a major outage caused by a power failure in our primary datacenter in Phoenix, Arizona.
For our primary datacenter, we operate out of PhoenixNAP, a top-tier datacenter renowned for reliability, security, and infrastructure resilience. We are one of its major tenants, consuming around half a megawatt of power and hosting thousands of servers within the facility. We consume floor space and power from PhoenixNAP, while running our own network, hardware, and onsite operations.
During the night, electrical engineers were performing maintenance on one of the UPS battery backup systems associated with one of the datacenter's power feeds. Maintenance on power infrastructure is routine and happens on a weekly basis. Until now, we had never had an issue with any power maintenance window.
On November 13, however, routine power maintenance turned into the first power outage in PhoenixNAP's 10-year history and caused one of the power feeds in the datacenter to go down several times.
The result on Namecheap’s services
For the vast majority of Namecheap's services, a power outage should never be an issue. This is because our core network, platforms, website, hosting, and cloud are all connected to redundant power circuits, taking power from an A-feed as well as a fully separated B-feed.
Unfortunately, we did experience an issue. Some of our infrastructure, in non-critical areas, is not redundant by design; this includes some network equipment. When the primary power feed failed, that network equipment went down and then rebooted repeatedly throughout the unstable period. This caused route flapping on our core distribution network equipment, which in turn caused us to lose connectivity to the datacenter.
One of our four core routers even went into a kernel panic due to the unexpected load pattern. On top of the flapping problem, several of our legacy network switches and load balancers were not properly configured to sustain a full A-feed outage, and therefore went down as well.
The resulting situation was that many servers and services remained up, but did not have outside network connectivity to serve customers.
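One common safeguard against the kind of route flapping described above is route flap dampening, in which each flap adds a penalty to a route, the penalty decays exponentially over time, and a route whose penalty exceeds a threshold is temporarily suppressed rather than readvertised to neighbors. The sketch below simulates this mechanism in Python; the parameter values are hypothetical illustrations in the spirit of RFC 2439, not a description of Namecheap's actual configuration:

```python
import math

# Illustrative route-flap-dampening parameters (hypothetical values;
# real deployments tune these per RFC 2439 and vendor defaults).
PENALTY_PER_FLAP = 1000
SUPPRESS_THRESHOLD = 2000   # route is suppressed above this penalty
REUSE_THRESHOLD = 750       # route is readvertised below this penalty
HALF_LIFE_SECONDS = 900     # penalty halves every 15 minutes

def decayed_penalty(penalty: float, elapsed_seconds: float) -> float:
    """Exponentially decay a flap penalty over elapsed time."""
    return penalty * math.pow(0.5, elapsed_seconds / HALF_LIFE_SECONDS)

def simulate(flap_times: list[float]) -> list[tuple[float, float, bool]]:
    """Return (time, penalty, suppressed) after each flap."""
    history = []
    penalty = 0.0
    suppressed = False
    last_time = 0.0
    for t in flap_times:
        # Decay the accumulated penalty, then add the new flap's penalty.
        penalty = decayed_penalty(penalty, t - last_time) + PENALTY_PER_FLAP
        last_time = t
        if penalty > SUPPRESS_THRESHOLD:
            suppressed = True
        elif penalty < REUSE_THRESHOLD:
            suppressed = False
        history.append((t, penalty, suppressed))
    return history

# A burst of flaps in quick succession trips the suppression threshold,
# so the unstable route stops being readvertised to neighbors.
for t, p, s in simulate([0, 30, 60, 90]):
    print(f"t={t:>4.0f}s penalty={p:7.1f} suppressed={s}")
```

The key property is that an isolated flap is forgiven quickly, while a rapid burst pushes the penalty past the suppression threshold, isolating the unstable route until it has been quiet long enough for the penalty to decay below the reuse threshold.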
Resolving the issue, and guarding against future problems
Once A-side power was restored after the outage, the flapping stopped. This allowed us to bring all of our services back online and restore the full Namecheap experience.
To be transparent and open: this was an error on our part, and one that should not have happened.
For that, we apologize. For us, mistakes are an opportunity to learn and improve. And to further our commitment to full and open transparency, we are taking the following immediate steps to ensure this does not happen again:
- Introduction of network changes to ensure that, in the future, the non-redundant portion of our network equipment cannot affect the redundant portion in case of an outage, especially when power feeds are unstable.
- An audit and fix of the legacy network equipment to ensure it is properly configured and tested for full feed failures.
- Improvement of our internal policies and procedures governing how frequently power audits and live failover tests are performed.
On behalf of our Executive Team, we would again like to take this opportunity to apologize for the disruption that we caused to you, our customers.
The Namecheap Executive Team