If you didn’t sleep in this morning, you may have noticed that the site was very slow or inaccessible for about 20 minutes. There is bad news, less bad news, and good news.
The bad news is that your blogs were down, and that sucks. Even though we provide a free service, we strive to maintain uptime that most enterprise services would envy. You guys have great blogs and you deserve nothing less.
The less bad news is that this wasn’t a systems error; it was human error. We’re working on some new features, and some of the groundwork code that worked perfectly on our test servers caused problems in production. Once we identified the problem, we were able to roll the change back quickly. We now know exactly what caused it, and we are taking steps to make sure it doesn’t happen again.
The good news is that this is our first major downtime since May of 2006, about 9 months ago. Back then, our problems came from not having the hardware and expertise to scale the system to the levels of traffic it was getting at the time. Now, 9 months later, we’re more than 10x the size and the servers are humming along fine. Human error will always be an issue, but we try not to make the same mistakes twice.
For those following along at home, we’re now powered by 152 physical processors, 511 gigabytes of memory (RAM), 174 hard disks with several terabytes of storage, and we’re adding new servers constantly.