Few words are more dreaded by product managers than being told by engineering: “No more new features! We need to stop and rewrite! Our code base is a mess, it can’t keep up with the number of users, it’s a house of cards, we can’t maintain it, the site is a dog!”
This situation has happened to too many companies, and continues to happen. It happened to eBay in 1999, and the company came far closer to collapsing than most people ever realized. It happened to Friendster a few years ago and opened the door for MySpace. It happened to Netscape during the browser wars with Microsoft, and everyone knows who won. The truth is that most companies never recover from this. It is a really bad situation to find yourself in.
When a company does get into this situation, everyone typically blames engineering. But in my experience, the harsh truth is that it’s usually the fault of product management. The reason is that for the past years the product managers have been pounding the engineering organization to deliver as many features as the engineering team possibly can produce. The result is that at some point, if you neglect the infrastructure, all software will reach the point where it can no longer support the functionality it needs to.
During this rewrite, you’re forced to stop forward progress as far as what the customers see. You might think that the rewrite will only take a few months (more about that below) but invariably it takes far longer, and you are forced to sit by and watch your customers leave you for your competitors.
If you haven’t yet reached this situation, here’s what you need to do to make sure you never do. You need to allocate a percentage of your engineering capacity to what at eBay we called “headroom”. The idea is to avoid slamming your head into ceilings. You do this by creating room – headroom – room for growth in the user base, growth in transactions, growth in functionality.
The deal with engineering goes like this. Product management takes 20% of the capacity right off the top and gives this to engineering to spend as they see fit – they might use it to rewrite, rearchitect or refactor problematic parts of the code base, or to swap out data base systems, improve system performance – whatever they believe is necessary to avoid ever having to come to the team and say “we need to stop and rewrite.” If you’re in really bad shape today, you might need to make this 30% or even more of the resources. I get nervous when I find teams that think they can get away with much less than 20%.
If you are currently in this situation, the truth is that your company may not survive this. But if you are to have a chance of pulling through, you’ll need to first do a realistic schedule and timeline for making the necessary changes that engineering identifies.
Most of the time, an experienced engineering team will come up with estimates that are on the slightly conservative side. The exception to this rule is this case of rewrites. Here the estimates are often wildly optimistic. You must make informed decisions in this situation, so you have to go through every line item on the schedule to make sure that the dates are realistic.
Second, if there’s any way humanly possible to break up the rewrite into chunks to be done incrementally, you should absolutely do so. Even though the rewrite might now stretch over two years instead of 9 months, if you can find a way to continue to make forward progress on user visible functionality, even if it’s only with 25% to 50% of the capacity, this is incredibly important.
Third, since you’ll only have very limited ability to deliver user visible functionality, you will need to pick the right features, and make sure you do define them right.
After eBay’s near-death experience, they made sure they wouldn’t put the company at risk again. They immediately began another rewrite, this time well in advance of issues. In fact, due to their very rapid growth, they ended up rewriting a third time, this time translating the entire site into a different programming language and architecture, and they did this massive multi-million line rewrite over several years, and most importantly, without impacting the user base and at the same time managing to deliver record amounts of new functionality. It’s the most impressive example of “rebuilding the engine mid-flight” that I know of.
But definitely the best strategy for dealing with this situation is to not get to this point. You need to pay your taxes and remember to dedicate at least 20% to headroom. If you haven’t had this discussion with your engineering counterpart, you should do so today.