We have an amazing network in our Phoenix datacenter at Namecheap. I’ve written about it before and I remain immensely proud of just what we’ve built and how reliable it is.
Unfortunately, in the early hours of Friday morning, we experienced an issue that caused intermittent loss of connectivity for some customers on certain segments of our converged Phoenix network.
I’m writing this post to explain both the root cause of the issue and the fixes we implemented.
Firstly, I’ll paint the picture. If my phone goes off before 6am, it normally isn’t good news. The incident on Friday lived up to that expectation: a colleague on shift, part of the team diagnosing and repairing the issue, alerted me to what was going on.
We have a 24×7 IOC (Infrastructure Ops Center) at Namecheap with network specialists, server specialists and storage specialists always on hand to mitigate and manage any issues that crop up. However, we also have alerting thresholds at which other staff members – including senior staff all the way up to our CEO – get alerted and informed when something isn’t quite right. Why? Because everyone at Namecheap is dedicated to delivering the best service to our customers, and everyone will contribute in an emergency.
So after being thoroughly woken up, onto the issue at hand.
Our core routers for our Phoenix Converged Network consist of 2 x Juniper MX104 routers. Each is fully redundant in its own right, with redundant routing engines, line cards, PSUs and more. Each has connectivity to several different upstream providers. We opted for this level of redundancy to eliminate as many single points of failure as possible and to handle everything from common network issues (upstream connectivity packet loss, congestion or complete outage) to less common ones (line card failure) to very rare ones (routing engine failure, PSU failure, fabric failure).
Unfortunately, on Friday we saw an anomaly we’d never encountered before: the routing engines in both routers were rebooting, taking physical uplinks offline. When one routing engine reboots, we fail over to the second routing engine as planned. This causes brief downtime while some routing tables reconverge, but the second router is there to pick up the slack. Unfortunately, the second router was displaying exactly the same symptoms, sometimes at the same time.
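As background, dual routing engines on Juniper MX-series platforms are typically configured for graceful switchover and nonstop routing so that the backup engine can take over with minimal disruption when the primary fails. A minimal sketch of the relevant Junos configuration (illustrative only, not our actual configuration):

```
chassis {
    redundancy {
        /* GRES: keep the backup routing engine's kernel state synchronised */
        graceful-switchover;
    }
}
routing-options {
    /* NSR: preserve routing protocol state across an engine switchover */
    nonstop-routing;
}
```

With this in place, a single routing engine reboot is largely transparent. It is only when both engines in a chassis, and then both chassis, exhibit the same fault that failover stops helping, which is exactly what we saw.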
This resulted in several short outages as routing engines and routers failed over and routes reconverged.
This behaviour is highly unusual and isn’t something we’d seen before on any of Juniper’s products — and we’ve used Juniper switches and routers for a long time.
Suspecting a software fault with the routers themselves, we engaged Juniper to show them this behaviour.
We showed them spurious packets traversing the routers that coincided with the times the routing engines rebooted. They conducted their own analysis and confirmed a software fault in the JunOS operating system. Their support team engaged senior Juniper staff, who created a patch to address a security flaw in the router operating system.
We quickly applied the patch to address the JunOS issue, and our network was restored to full health. Juniper were quick to acknowledge the fault and very quick to release a patch – something very good to see. They also indicated that this fix will be built into the next JunOS release, protecting other potentially at-risk routers.
Even with the best-laid plans and a really solid network that in every other month has delivered 100% uptime, hiccups can occur. We are thoroughly prepared for almost all eventualities and approach everything with reliability, security and performance in mind. But sometimes, things outside of our immediate control can cause an issue.
At Namecheap, we’re committed to being transparent with our customers — hence this blog post. And a thank you to the customers who were impacted but demonstrated amazing understanding in the early hours of Friday.