How to Rescue a Codebase

Ben Sandofsky at Swift Summit San Francisco, 2016


Ben: Well, congratulations, you've joined the next billion dollar startup. It combines on-demand dog walking, with drones. That's right, you're the new lead developer or “Sky Walker”. We had to license that from George Lucas, and you're ready to dive in there and make some big changes. Just fair warning, the code's a few years old. It's been through a few developers and it was their first iPhone app, and as you dive into this pyramid of doom, going, "What is going on?" And you start going through 7000 line view controllers, trying to trace the state and whenever you make a change, there's weird side effects somewhere else and you're afraid to touch anything. You get that nice feeling in your gut that everyone gets like, "Wouldn't it just be faster if I just went up to; file, new, project?"

If you're a junior engineer, you're like, "Sounds great, what's the worst that can happen?" If you're a senior engineer, you're like, “What is the worst that can happen?" You're going to take risk assessment. Now, no doubt, you've by now read Joel Spolsky's "Things You Should Never Do, Part I" and he talks about how Netscape, after version 4.0, decided to rewrite their code from scratch and they spent three years on a rewrite in the middle of the browser war, three years of lost opportunities. You're saying, "Well, I'm not rewriting a browser, I'm just doing a silly social networking app." Then, in 2012 Facebook rewrote their HTML5 app to a fully native experience, and in an interview with ABC News, they said that it took them nine months.

You're always going to underestimate the effort. In part, you're going to have to support two codebases at once. I've never been at a company where the PMs are like, "Yeah, cool, just go rewrite that. I'm going to play some Call Of Duty for the next six months." You're always going to have pressure to continue to ship. Now, even if your dream comes true and you finally get a greenfield architecture to work with, there's no guarantees that it's going to meet the requirements you had previously. Digg with v4 made some unpopular product decisions but also they rewrote their backend which ran into trouble when they deployed it to production, and at Digg v4's launch, there was a huge user out-lash. In addition to features, the site was really unstable and ultimately their VP of Engineering stepped down.

You don't know if a lot of the gnarliness in your code is the result of crazy background URL session bugs that you had to work around a couple years ago. Even Apple isn't immune to problems with rewrites. Last year they tried to replace mDNSResponder with Discoveryd and for four months everyone was freaking out over the problems they were having with networking and ultimately they rolled back and went to mDNSResponder, and that's a 12 year old code base. You might think that's 12 years of gnarly code, but it's also 12 years of bug fixes, of seeing every possible networking device in the wild and it's really hard to synthesize that hardening in a clean room.

in lieu of that, rather than abandon code that's working, what can we do to preserve existing behavior? This is going to be the underlying problem in your code. Now, in all fairness I know that a quarter of you in the room are probably dealing with codebases that are less than a year old and tens of thousands of lines of code, but if you are dealing with a codebase that has years of experience built into it, it could be all of those bugs, it could be a signup flow that's been A/B tested to death, it could be important analytics that's built in there that your PM's expecting to show investors when you're looking for the next round of funding, and if you break this consistent behavior, or behavior that users have come to expect, then that could be tragic.

What can we do to preserve that existing behavior, reduce the risk, and continue product development without a gigantic gap? I'm going to give you two more conservative approaches. The first one is, let's say that we have a legacy platform and it's written something monolithic that you just want to abandon, it's going nowhere. You're rewriting a phone gap app, and you're saying to yourself, "I want to rewrite it and we can't get any more developers that are going to work on it." What we want to do is peel off a new piece of functionality like a new signup flow that your PMs want to implement and treat that as the greenfield project, rewrite the model layer as you dreamed it should look like, and then build an adapter layer between the new system and the old system to leverage existing behavior like data serialization or every other view control in your app.

If you're really slick, build controls in front of these two view controllers so that you can ramp up users going to the new system, starting at 1%, monitor for bugs, make sure that you're not having major regressions in behavior, and then over time continue to slice these verticals, slice this functionality, until you've rewritten the app. Martin Fowler calls this a "strangler rewrite" and that's inspired by the strangler fig, a particular type of tree that grows in tropical environments where it's really competitive for life. What it'll do is it'll find an older dying tree and it will wrap its vines around it and plant into the roots and over time, completely envelop the old tree.

Actually, if you saw it open, it's hollow inside where the old tree was. I know, it's super creepy, right? What's nice here is that you're reducing the risk but most importantly you're delivering value to stakeholders. 

The other approach looks waterfall by comparison where this is much more iterative. You could target one month releases once you've built up that infrastructure, but the cost is you do have overhead, you have to build the adapter layer and the controls in place to do this in a meticulous manner. Also, this rewrite could get canceled for reasons outside of your control. Management could change, or your lead developer could leave the company, and in a worst case scenario, now you have two different architectures that you need to juggle.

The most conservative approach, the one I prefer and the one I'm going to be talking about for the remainder of the talk, I'm going to call it the "Inline Rewrite" also known as "fixing things," and it's the very simple philosophy of going into the codebase and leaving things slightly better than you found them. Then, over time, as we do this, inch by inch, within months hopefully you have a new codebase that's closer to your ideal design. Now, if there were one North Star to look toward as you're doing this rewrite, I would say print this out and look at it. Break your app into small decoupled components with well defined responsibilities. That's all you need to think about when you're first starting out, and this is what I look at when I dive into a code base where I'm wondering, "What is going on?"

If this is your code base right now, you've got a bunch of tangled dependencies, you're reaching into other view controllers to change state, you've got goto statements masquerading as NSNotificationCenter. Step one is let the jank be as janky as it is now, but just try to shuffle it around so there are fewer dependencies and side effects as you're making changes, and you can reason through the effects of your refactoring. Then, once you've sold that, it becomes much more straightforward to piece by piece rewrite things, or if no one really accesses a particular section of code, you can just leave it there, as long as it doesn't emit warnings, and worry about it some time in the future.

Put another way, would you rather tackle one 2000 line view controller and rewrite it, or ten 200 line classes? You're already looking at that gnarly 7000 line view controller that has all sorts of business logic in it around summoning the drone, canceling it, warning you about surge pricing, and you're thinking, "What kind of tools can I use to refactor this and make it cleaner?" I have good news. You're actually in a two day conference and you're going to learn all of these techniques over these two days and you might say, "Well, this seems like a job for state machines," awesome. Note that, but what I would caution you about is a pattern I've seen which I'll call “Leap Frogging" where you go out there and you read something in a Medium post about a new architecture.

It's always the scary ideas that start with "I read it in a Medium post," and you say, "All right, this can solve all of our problems. This external thing, this magic pill." The problem is you're right now trying to preserve behavior and not break anything but at the same time you've taken on the risk of learning best practices around a new paradigm. Now, you have to train your teammates on these best practices and get them to buy-in with you. I think most of you in here have probably seen a decent iOS codebase so you have an idea of what a good looking app probably should look like internally, right?

I would caution you to try not to look toward external things when you're first starting out and treat those as productivity multipliers and better for larger architecture later. You want to take the smallest step at a time. Here is hopefully the least controversial slide you'll see in the next two days; MVC, I know you all probably disagree about what that means, but I think we all agree it's a pretty solid design pattern. Just use this as your initial guide to break up your app into three components to start out. Now, this sounds obvious, I'm going to start with a simple example, but things will get complicated a few slides later.

Now, let's take this section of code which I've taken from a popular open source iOS app with thousands of stars on GitHub so you know it's great. The names and faces have been changed to protect the innocent, but I've seen this all the time. Inside your view controller, you spit out a UICollectionViewCell, and you're reaching in there to modify the subviews in there, and if you just go back to MVC as a smoke test, well, the role of the view is presentation, the role of the controller is to be a mediator between presentation and your models, so why does it care about how we're presenting the content? Why does it care that we're using subviews versus Core Graphics versus an open GLView?

These are implementation details that should probably be encapsulated inside of the view, so it takes 30 seconds to cut and paste and throw it into that cell. This is still slightly coupled. Maybe we'd want to use a protocol instead of handing it in the model, but this is a lateral change and you know what? The compiler will probably warn you if you mistype a cell and change it to self, but it's a pretty safe, simple change that I'd feel comfortable with just to break us apart, and if I rewrite the view controller later, it's easier to drop in the old cell in there. Okay, hopefully that's not too complicated. Let's go back to that pyramid of doom.

Of course, you can just see the glance of what's going on but I'll walk you through it. This is the SignupViewController. Initially we show a progress spinner and we grab the email and password from the text fields, we try to validate them and if they're incomplete or invalid we show you an alert, otherwise we try to send them up to the API and we have networking code built in there that checks for a particular magic number as a status code, and then if it's successful we push you into the welcome on-boarding view controller, otherwise we show you an error. Okay, at first glance this is a little difficult to understand and you know that this should probably not belong in the view controller.

This belongs over in the model and one fun anti-pattern I see that comes from the right place is, "Let's just cut and paste it and throw it in a class, we're going to call it a manager class." They're always called managers. I think some of you call them view models, but it's this bucket that you just shove everything into because now we have fewer lines of code so clearly it's decoupled. Well, for starters it isn't decoupled. It still knows about the structure of the view controller, it still knows to pull out values from UITextFields. All right, that's a minor violation, I'll let that slide, but what I'm more worried about here is that we're changing state on that modal on screen. Modal, which is closer to presentation, it's not messing around with views directly but this is probably something that belongs in the view controller, keeping track of this presentation detail.

It isn't clear at the call site, when you call SignUp, how it's messing around with the modal. If you mess it up, as you try to refactor, it's very easy for the app to get into a state where you have a spinner after you tap on signup and the user then never completes the signup flow and your signups drop down and you shut down your company. This is a very dangerous section of code to change. Now, if you are going to be thinking about refactoring, and you reach that point where things are starting to go into the modal that don't seem to belong there, the one area that you should focus is abstraction and layering. Hopefully this is stuff that's like basic stuff you learned in college but just go back to the basics and think, "Well, wouldn't it be better if we modeled the signup flow as a simple return value of success or error?"

We probably actually have more lines of code in our app overall. There's more lines of code in this particular method but it's clear that inside the view controller, depending on the response, how we're going to update the view or change what view controller we're located at. You want to model your entry into your modal as a series of inputs and outputs. 

Oops, I accidentally held the left clicker. All right. This is a replay previously on presentation. All right. 

That's all easy said and done, but the problem is, and this is the conundrum, is that you're afraid of refactoring that code because a lot of that modal code probably has its tendrils deep inside of the view controller.

It's very easy for me to show you the final result, like a cooking show and I show you all the ingredients, then I pull out the turkey I made earlier in the day. How do we actually get to there? By the way, the reason you should be afraid here, I think I said there's six exit points in this block of code. Fun fact, human working memory can only juggle seven things at once, plus or minus two. It's very easy as this gets slightly more complicated for you to drop one of these states on the floor and you've got a bug, all right? How can we go about this refactoring in a safe manner? 

Well, I'm going to say something that makes a lot of people in this room feel uncomfortable. Please bear with me.

I'm going to be talking about tests? Now, if you're watching at home, please don't close your browser window yet. Please, just give me 30 seconds. I swear, I am not one of those crazy people that says that if you don't write 1:1 ratio of tests to production code, you're a lesser human being. I know that a third of you out there aren't writing any tests, and to tell you the truth, don't tell anyone, I actually fall much more into that middle bucket than the top bucket of testing everything. I think it's important to take a very pragmatic approach and I don't want to make this a big religious war, although we have Q&A at the end if we have time, but I will say that the important thing when you're trying to preserve behavior, is that tests are an invaluable tool to have in your pocket.

One; they can lock down existing behavior, as bad as it may be. Let's say that we want to refactor one class but it has weird side effects touching other parts of our app. We can pin down that behavior and then refactor that class and be sure that we're not dropping any of that on the floor. The second thing is that it will quickly shine a light on our dependencies and areas that we can continue to refactor, okay? Now, we're doing something horrible here, we’re trying to put a view controller inside of a unit test. We kind of have to because we're afraid of pulling that code outside of the view controller. We want to take small, conservative steps. This doesn't even work, one; because we have no feedback mechanism to check whether or not the HUD, the progress spinner, has completed or if it's shown an error message or what's happened, so that's the first problem.

But even before we get to that, our code's going to blow up because as soon as we try to set that text value, it's like, "View didn't load" so we're accessing an implicitly unwrapped optional and you're like, "Okay, well then to get that working we need to call loadView. Oh, and then we need to call viewDidAppear. But wait, viewDidAppear reaches up into UINavigationController, and that touches a singleton which…” And before you know it, you've instanced your entire app in order to get one unit test. This should be a big sign, okay? Now, one thing that I've seen people do is they will try to use UI Automation testing to scaffold their whole app in there, and they might write, "Tap through each form field to make sure that it handles that."

You're saying to yourself, "Whoa, whoa, whoa, whoa. Why should I test one class when I can test dozens, right? That's so much more productive." I have written these types of tests in the past, and you can use test frameworks to mock out the lowest level networking requests, and these are very valuable as smoke checks but they're very difficult to scale. Okay, very quickly, one; they take a while for each of them to execute, so let's say that we add six of these tests for each of those six cases we want to write a harness around, well we've added 30 seconds to our test suite and before you know it, if everyone is contributing tests when they're adding patches, it takes hours to run the test suite and then you lose that quick feedback before you do commits.

That's one, and two; if you have dozens of collaborators involved, it becomes very difficult to pin down when something fails, which of those collaborators broked. Unit tests are really good for getting in deep and just fine grain testing on a particular module, but full stack testing is useful as initial smoke checks to make sure a behavior exists but it isn't scalable to every variance. Cool. So, how do we go in there now and break apart this dependency so we don't have to instance the whole world? Now, if you've followed any blog posts in the last year, someone's probably brought up Dependency Injection. Simple, we create fakes for the text fields that adhere to the same signature as UITextField.

We create a fake networking model and we create a fake progress modal. I'm not a big fan of this right now, it looks kind of ugly, but maybe we can hide it behind a factory method or something, but more importantly, we're right now focusing too much on re-architecting the code because you can be afraid of changing your Objective-C initializers, makes them multiple initializers, you could forget something. This could feel too risky for you, and that's the paradox when you're in a codebase that you're terrified of. You want tests in order to change code, but you can't change code because you don't have any tests, so what do we do? You go on Amazon and purchase this book.

This is the best advice you'll hear today, and this was written in the early 2000s, "Working Effectively with Legacy Code," and it talks mostly about C++ and Java, but it will handle these difficult situations of getting dependency injection in there when a system wasn't really designed to be decoupled and it's a really, really, really awesome book and rather than show you the dozens of techniques that are in there, I'll give you a taste with one technique and hopefully send you in the right direction. We have this section of code, showProgressSpinner… all right, so we have a bunch of UIKit as dependencies, and we have all that navigation at the bottom.

We won't focus on networking for now but we'll use the same technique later. Step one is to add a level of indirection. Rather than call emailField.text, we add a private getter that returns the same type as before. We make sure that we try to match up the method signatures identical to whatever we're trying to rip out, so hopefully the compiler will catch any silly bugs, and then we do something awful. We subclass that view controller and override those injection points. In the book, this is referred to as "finding the seam" in the code. This is not good production code. It violates substitution principles, right? But this is important for bootstrapping that initial harness because next, now we can just use those to sense the inputs and outputs of that one method.

Then, we can write five more of these methods and then once we have that in place, we can extract this into a model. Then, we can write an integration test to double check the model is hooked up to our view controller, and then finally we can rip apart this code and hopefully rip apart the other code. In the book they refer to this as "scar tissue." If you need important surgery, yeah, you put up with the scar tissue. It'll eventually heal, but it can also come in the form of adding extra parameters to method signatures. It's temporary and it's okay to make code slightly worse in the name of getting it into that initial test harness so you have a feedback loop as you're trying to fix the code.

Okay. Now, to summarize the technical points; 

Don't underestimate the difficulty of preserving existing behavior. It may not apply to all of you, maybe your company recently pivoted and you don't care about supporting that old feature. That's awesome, but if you're dealing with an older system, be aware you probably have a long tail of edge cases you need to support. 

If you have to rewrite, find ways to reduce scope. When I think back to one of my few successful rewrites, behind the scenes really, we weren't really rewriting the whole thing. We were pulling off a chunk, and if you look at sometimes what people call rewrites, they've actually de-scoped and removed functionality from the original system in the process and that's the only real way that it can work.

Three, start simple with a separation of concerns. If you start with MVC you now have one third of the amount of code to worry about if you decide to re-architect it. 

Then, as you start carving out the model layer you can carve thinner and thinner slices in the form of abstractions. 

Okay, next, leave a trail of tests with every patch moving forward. If everyone on your team leaves a few tests, then pretty quickly over weeks and months you're going to have a pretty resilient test suite. 

Finally, focus on small steps. One thing at a time, okay? Always assume that you're going to make a mistake and it's possible that you can break what you're changing. It's very difficult to dive in there and test everything in the simulator, in this case six different variances. Be very slow and steady.

Just to wrap up, I want to spend a few moments to talk about the underlying problem and the reason that I warn people about rewrites. I've seen at least in one rewrite, where they locked their great engineers away in a room for a few months, they rewrote the system and it was beautiful, it was perfect. The moment they opened up that code base to regular product development and everyone else started dog-piling on it again, it just started reverting back to all the issues that got them there in the first place.

What I would urge you to do, even before you consider the first line of code that you're changing, is to take a step back and take and look at the engineering practices that got you there. Maybe there's a disparity in knowledge, like there's one junior iOS developer and they need more experience, and that could call for some pair programming for knowledge transfer. I'm not a big fan of pairing but maybe that could get you there. Maybe you need to institute code review, maybe there's a disconnect between product and engineering where product is setting too high expectations to hit deadlines and engineering is like, "Okay, I'll just cut some corners because we've got to move fast," and this and that.

I'd urge you to consider the cause and not the symptoms and instead of looking for some external thing in a rewrite or a framework, try to figure out and take agency in fixing what's there today instead of just punting it off and saying, "We can always rewrite it again later." If you can't fix that, I'd like to point out that Frontier Labs is hiring and it's a great opportunity. We're going to be building some awesome things in mobile, we've got some cool ideas, and you will never have to touch any terrible legacy code, because you're probably going to be writing a lot of it. I actually have 28 seconds before Q&A, so that's perfect. Does anyone have any questions?