“There he is!” both my fellow engineers said as I walked into the office this morning. I knew right away that this was not going to be the Monday I expected it to be.
“The database is down,” my boss said before I reached my chair. “Totally dead.”
“Ah, shit,” I replied, dropping my backpack and logging in. I typed a few commands so my tools were pointing at the right place, then started to check the database cluster.
It looked fine, humming away quietly to itself.
I checked the services that let our software running elsewhere connect to the database. Everything we controlled looked perfectly normal. That wasn’t a surprise; it would have taken active intervention to break any of it.
Yet, from the outside, those services were not responding. Something was definitely wrong, but it was not something our group had any control over.
As I was poking and prodding at the system, my boss was speaking behind me. “Maybe you could document some troubleshooting tips?” he asked. “I got as far as the init command, then I had no idea what to do.”
“I can write down some basics,” I said, too wrapped up in my own troubleshooting world to be polite, “but it’s going to assume you have a basic working knowledge of the tools.” I’d been hinting for a while that maybe he should get up to speed on the stuff. Fortunately my boss finds politeness to be inferior to directness. He’s an engineer.
After several minutes confirming that there was nothing I could do, I sent an urgent email to the list where the keepers of the infrastructure communicate. “All our systems are broken!” I said.
Someone else jumped in with more info, including the fact that he had detected the problem long before, while I was commuting. Apparently it had not occurred to him to notify anyone.
The problem affected a lot of people. There was much hurt. I can’t believe that while I was commuting to work and the system was going to hell, nobody bothered to mention it in the regular communication channels, on either the consumer or the provider side.
After a while, things worked again, then stopped working, and finally started working for realsies. Eight hours after the problem started, and seven hours after it was formally recognized, the “it’s fixed” message came out, but by then we had been operating normally for several hours.
By moving our stuff to systems run by others, we made an assumption: that those others are experts at running systems, and that they can run things well enough for us to turn our own efforts toward building new services. It’s an economically parsimonious idea.
But those systems have to work. When they don’t, I’m the one who gets the stink-eye in my department. Or the all-too-happy greeting.