We were about halfway through our all-hands meeting when James, our VidMob CTO, mentioned he wasn’t able to log into our product. It took us about five minutes to figure out that any of our users who had connected their VidMob accounts to Facebook were unable to log in, and about two minutes more to discover that the reason was that Facebook itself seemed no longer to exist.
This was, of course, October 4, when Facebook accidentally locked itself out of the internet for much of the day.
VidMob’s software is, as you can imagine, pretty tightly tied to the platforms we support. Perhaps too tightly, as it turns out. It took us almost exactly as long to disentangle our code from Facebook as it took Facebook to get itself back online. Their API became available again mere minutes after we deployed our last code changes to work around its absence.
The internet is a lot more fragile than we give it credit for. In principle, it’s still possible to build a website with no external dependencies beyond the webserver it sits on. But in practice, nobody has the resources to build everything from scratch: we import external libraries, we reuse existing frameworks, and we depend on third-party services for everything from design review to file storage to load balancing and DDoS protection. Any one of them might break or just disappear.
It was big news when Facebook did the technological equivalent of locking their keys in the car, but on a smaller scale things like this happen all the time. Just this week, a popular shared module used by thousands of websites, including both VidMob and (coincidentally) Facebook, was hijacked and replaced with malware. If we hadn’t been paying attention, that could easily have done us more harm than the Facebook outage.
We were unaffected by that issue and have locked our dependency versions to keep us from accidentally “updating” to one of the compromised versions.
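To make “locked” concrete, here is a minimal sketch of a check that flags any dependency declared with a version range rather than an exact version. It assumes an npm-style `package.json` manifest, which is an assumption for illustration; the package names and the `unpinned_dependencies` helper are hypothetical.

```python
import json
import re

# Exact versions only, e.g. "1.2.3" or "1.2.3-beta.1".
# Range specs such as "^1.2.3", "~1.2.3", or "*" do not match.
EXACT_VERSION = re.compile(r"^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?$")


def unpinned_dependencies(manifest_text):
    """Return {name: spec} for every dependency not pinned to an exact version."""
    manifest = json.loads(manifest_text)
    loose = {}
    for section in ("dependencies", "devDependencies"):
        for name, spec in manifest.get(section, {}).items():
            if not EXACT_VERSION.match(spec):
                loose[name] = spec
    return loose


# Hypothetical manifest: one range spec, one exact version.
example = """
{
  "dependencies": {
    "some-parser": "^0.7.28",
    "some-framework": "17.0.2"
  }
}
"""

print(unpinned_dependencies(example))
```

A check like this only covers direct dependencies. Transitive ones are usually pinned by committing a lockfile and installing strictly from it.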
The Facebook outage was a good lesson for us in VidMob engineering: it took us longer than it should have to separate our code from theirs, because our developers had implicitly assumed Facebook was a permanent part of the landscape, and let things get too tightly coupled.
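One way to picture looser coupling is to treat each login provider as interchangeable and fall through to the next when one is unreachable. The sketch below is purely illustrative: every name in it (`ProviderUnavailable`, `sign_in`, `facebook_login`, `password_login`) is hypothetical, and the Facebook provider simply simulates an outage rather than calling any real API.

```python
class ProviderUnavailable(Exception):
    """Raised when a login provider cannot be reached."""


def sign_in(credentials, providers):
    """Try each (name, authenticate) pair in order, skipping providers that are down."""
    for name, authenticate in providers:
        try:
            return name, authenticate(credentials)
        except ProviderUnavailable:
            continue  # this provider is down; try the next one
    raise ProviderUnavailable("no login provider is reachable")


def facebook_login(credentials):
    # Hypothetical stand-in that simulates the third party being offline.
    raise ProviderUnavailable("simulated outage")


def password_login(credentials):
    # Hypothetical first-party fallback that needs no external service.
    return {"user": credentials["email"]}


result = sign_in(
    {"email": "user@example.com"},
    [("facebook", facebook_login), ("password", password_login)],
)
print(result)
```

The point of the design is that the calling code never names a specific provider, so removing or replacing one is a change to a list rather than a rewrite.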
We’ll be reexamining all our third-party connections and dependencies for this sort of thing, so that when things go wrong we can carry on as well as possible. If nothing else, this outage demonstrates that no one, however big, bright, or ubiquitous, is immune to error!