How to Solve Problems in Software Systems
Sunday, Oct 15, 2017

Let’s talk about something that is near and dear to my heart: Debugging.

Debugging is so much more than the mundane, frustrating thing it seems to be on the surface. It is a pure exercise in something deeper, something extraordinarily significant: The ability to think and reason about problems.

Why am I passionate about this? Two reasons:

  1. As computer scientists, our entire purpose is to solve problems. None of the optimized code, slick platforms, or clever algorithms we create matter if they are not solving a problem for someone.
  2. The ability to think and reason about problems well is a skill, and it can be taught, practiced, and refined.

Problem-solving as a skill

So what does the problem-solving skill look like? It is a mindset, a way of thinking about any problem you encounter. It is built on two different but equally important approaches:

  • Analytical: Debugging software and software systems well is a scientific process: You must gather data, develop hypotheses, and test your hypotheses in a formal, reproducible way.
  • Creative: Debugging is first and foremost an investigative process. You must ask the right questions and use your intuition as you peel away the layers of the onion to reveal the root cause.

Four basic principles of problem-solving

I’ve painted the broad strokes of what problem-solving looks like at a high level. For the rest of this article, I’m going to detail a few basic principles that will help shape the proper mindset for approaching and solving a problem.

  1. Think before you do
  2. Ask the right questions, get the right answers
  3. Claim only what you can prove
  4. Solve the right problem

As the title of this article states, these principles are aimed at a software debugging context. If you’re interested in a general-purpose treatment of how to reason about problems, How To Solve It by George Polya is a popular book, and I highly recommend it.

Principle 1: Think before you do

Jumping from problem to solution without understanding first

When something goes wrong, the very first thing you should do is stop and assess the situation. You will be tempted to skip this part, particularly if one or more of the following is true:

  • You think you know what the problem is.
  • Production is down.
  • You’re sleepy.

When you assume that you know what the problem is, you take action blindly, without being sure it’s the right action. When the pressure is on, you make rushed decisions, and you don’t leave time to think through what you’re about to do. When you’re sleepy, you might do both.

How to do it: Understand the problem. Give yourself permission to take a breath, step back, and reason about the problem from a conceptual level:

  • What are the components involved?
  • How do they interact and depend on each other?
  • What should have happened?
  • What happened instead? Which components might be misbehaving?
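
One way to make these questions concrete is to write the components and their dependencies down and then list everything a failing component relies on. The sketch below is only an illustration: the component names (browser, nginx, web, db, cache) and the helper suspects_for are hypothetical, not taken from any real system.

```python
def suspects_for(component, depends_on):
    """Return every component that `component` relies on, directly or indirectly."""
    seen = []
    stack = list(depends_on.get(component, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            # Follow the dependencies of each dependency as well.
            stack.extend(depends_on.get(current, []))
    return sorted(seen)


# Hypothetical system: each key depends on the components in its list.
DEPENDS_ON = {
    "browser": ["nginx"],
    "nginx": ["web"],
    "web": ["db", "cache"],
    "db": [],
    "cache": [],
}

if __name__ == "__main__":
    # If nginx is returning errors, these are the components worth questioning.
    print(suspects_for("nginx", DEPENDS_ON))
```

The list is not a diagnosis. It only narrows down which components might be misbehaving, so that you know where to start asking questions.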

Don’t skip this process.

Stay calm. Remember: You will always make better decisions when you are calm and not panicking.

An inspirational figure to me in this regard is Joseph Joffre, Commander-in-Chief of French forces on the Western Front in World War I:

Always calm and always in control, he did not panic despite the seriousness of the new Allied position; he even kept to his normal daily schedule, which included an enormous lunch and a nap.

—Michael S Neiberg, The Western Front 1914–1916

Joffre refused to give in to panic. He made it a point to ensure that he was in the best possible state to be making critical decisions.

If he can do it during the most dire circumstances, in situations where many lives hung in the balance, surely we can pause and think before we act.

Principle 2: Ask the right questions

Why is it so important to ask questions before trying to fix things? Because we need information that either supports or refutes the working theories we formed under Principle 1.

We’re gathering evidence that will refine our hypothesis so that we can test it in the next step.

How to do it: Start with broad questions. A popular technique for asking the questions necessary to determine the root cause of a problem is The 5 Whys: Start with the thing that you immediately know is wrong, ask why, then answer it. Then question the resulting answer until you have asked “why” 5 times (or have arrived at the root of the problem).

For example:

Answering a question with The 5 Whys technique
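
As a rough sketch, a 5 Whys chain can be written down as a problem statement followed by a list of answers, where each answer is the subject of the next “why”. The incident below is entirely hypothetical and invented for illustration, as is the helper name format_five_whys.

```python
def format_five_whys(problem, answers):
    """Render a problem statement and its chain of answers as numbered Why steps."""
    lines = [f"Problem: {problem}"]
    for depth, answer in enumerate(answers, start=1):
        lines.append(f"Why #{depth}: {answer}")
    # The last answer is only a candidate until evidence confirms it.
    lines.append(f"Candidate root cause: {answers[-1]}")
    return "\n".join(lines)


# Hypothetical incident, made up for this example.
PROBLEM = "Checkout requests are timing out"
ANSWERS = [
    "The web workers are all busy waiting on the database",
    "One query is taking several seconds per request",
    "The query scans the whole orders table",
    "The column it filters on has no index",
    "The migration that adds the index was never run in production",
]

if __name__ == "__main__":
    print(format_five_whys(PROBLEM, ANSWERS))
```

In a real investigation, every answer in the chain should be backed by evidence rather than assumed.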

Answer questions with evidence. This is where we put on our detective hat and get to work. While this is an incredibly nuanced and intuition-driven process, here are 3 tips to orient you in your approach:

  1. Know your toolbox:
    • In order to expose and uncover evidence, you need an array of tools that you are familiar with.
    • Julia Evans has an excellent introduction to Linux Debugging Tools.
  2. Increase visibility:
    • Your goal is to tap into every available source of information. You need access to the misbehaving system.
    • Increase logging levels, turn on verbose output, etc.
    • Track down all logs and output.
  3. Divide and Conquer:
    • In larger systems, you may have many components interacting with each other, and this can obscure valuable information.
    • Learn how to insert yourself between the components to examine their interaction (e.g., using tcpdump to capture requests being reverse-proxied from Nginx to your web server).
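
To illustrate the “increase visibility” tip, here is a minimal sketch using Python’s standard logging module. The logger name payments and the charge function are hypothetical, made up for this example. The point is only that raising the level from WARNING to DEBUG exposes detail that was previously hidden.

```python
import io
import logging


def make_logger(level, stream):
    """Build a logger that writes to `stream` and emits records at `level` or above."""
    logger = logging.getLogger("payments")  # hypothetical component name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


def charge(logger, amount_cents):
    """Hypothetical operation that logs its intermediate state."""
    logger.debug("charge called with amount_cents=%r", amount_cents)
    if amount_cents <= 0:
        logger.warning("rejecting non-positive amount")
        return False
    logger.debug("charge accepted")
    return True


def captured_output(level):
    """Run the same failing call and return whatever was logged at `level`."""
    stream = io.StringIO()
    charge(make_logger(level, stream), -5)
    return stream.getvalue()


if __name__ == "__main__":
    print("--- WARNING level ---")
    print(captured_output(logging.WARNING), end="")
    print("--- DEBUG level ---")
    print(captured_output(logging.DEBUG), end="")
```

At WARNING level you only learn that the call was rejected. At DEBUG level you also see the input that caused the rejection, which is usually the clue you were missing.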

For a look at what the investigation process and the use of debugging tools look like in the real world, check out Indeed Engineering’s blog post on debugging high system load after migrating to Java 8.

Principle 3: Claim only what you can prove

The Golden Rule of Debugging: If you can’t reproduce it, you can’t fix it.

Let that sink in. If you aren’t able to reproduce the problem at will, then the best possible guess you can make is still a guess. I’ve seen too many examples where an engineer has checked the logs, drawn a conclusion, and implemented a “fix” without proving it. Don’t do that.

Remember: The only things you know at this point are:

  1. The bad behavior you’ve observed.
  2. The clues you’ve gathered.

Reproducing the problem proves that you understand it, and it will allow you to set up an environment to conclusively test and prove the fix.

How to do it: Start simple. Your environment should include all of the components necessary to reproduce the problem, and only those components. For example, you might spin up an instance of a microservice that is misbehaving, connect it to a DB running a snapshot from production, and then simulate HTTP requests to the service via curl.

If you add additional layers and components to your environment when they are not necessary for the reproduction of the bad behavior, you will increase complexity, distracting you from the real issue and slowing your progress. Your goal here is to remove as many variables in the equation as possible.

Test the thing you care about. You don’t want your environment to be so simple and basic that it is no longer representative of the real-world environment. Copying and pasting a snippet of Python into a new file and running it at the command line will likely not be useful for debugging a Django system. Your environment should be as simple as possible, and no simpler. The components should still be set up in a way which reflects the target environment that the fix will eventually go into.
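
Here is a minimal sketch of what a reproduction script might look like, assuming the bad behavior can be triggered over HTTP. Everything specific in it is hypothetical: BuggyHandler stands in for the misbehaving service, and the “bug” (a 500 on paths with a trailing slash) is invented for illustration. The script starts the service locally, sends the triggering request several times, and also sends a known-good request for comparison.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class BuggyHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the misbehaving service."""

    def do_GET(self):
        # Made-up bug: paths with a trailing slash fail instead of returning 200.
        failing = self.path.endswith("/") and self.path != "/"
        self.send_response(500 if failing else 200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the reproduction output quiet


def status_of(url):
    """Return the HTTP status code for a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def reproduce(attempts=5):
    """Start the service on a free local port and trigger the bug repeatedly."""
    server = HTTPServer(("127.0.0.1", 0), BuggyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    try:
        bad = [status_of(base + "/users/") for _ in range(attempts)]
        good = status_of(base + "/users")
    finally:
        server.shutdown()
        server.server_close()
    return bad, good


if __name__ == "__main__":
    bad, good = reproduce()
    print("trailing slash:", bad)
    print("no trailing slash:", good)
```

If the failing request fails on every attempt and the control request succeeds, you can reproduce the problem at will, and the same script later doubles as the test that proves your fix.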

Principle 4: Solve the right problem

Recently, a junior engineer was tasked with debugging Chef bootstrapping issues when bringing up our webserver instance. The engineer reached out for help, and described being stuck trying to get chef-solo to execute the cookbook locally (outside of the established Jenkins/Chef server environment).

The engineer had scaled back to an environment that was too simple, and did not accurately reflect how our target environment worked. After the engineer was advised on how to utilize the proper Jenkins/Chef pipeline and gather the resulting log data from bootstrapping, the issue was quickly tracked down. Debugging issues with running chef-solo on a laptop was a problem unrelated to the real-world environment, and a distraction from tracking down the true issue.

Don’t shave a yak. Stay focused on solving the right problem.

In practice: Figure out a tight dev/feedback loop.

When you’re developing a fix, it’s not enough to simply have good insight into how the system is behaving and the data flowing between components. You need to have an environment that allows for easy manipulation and changes, as well as rapid feedback on changes.

To cut down a tree in five minutes, spend three minutes sharpening your axe.

Definitely not Abraham Lincoln

Despite being a clichéd quote that everyone seems to remember slightly differently, the takeaway is a valid one: A little investment in your debugging toolkit will make a world of difference. The key metric we’re aiming for here is a rapid feedback loop. Minimizing the time and effort between attempting a fix and observing the effects of your changes is our version of the sharpened axe.
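
As a small sketch of what a feedback loop can look like in practice, the helper below runs a single check command and reports whether it passed and how long it took. The name run_check and the placeholder command are assumptions for illustration. In a real setup the command would be whatever reproduces your problem.

```python
import subprocess
import sys
import time


def run_check(command):
    """Run one check command and return (passed, seconds_elapsed)."""
    start = time.perf_counter()
    result = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    # A zero exit code is treated as "the check passed".
    return result.returncode == 0, elapsed


if __name__ == "__main__":
    # Placeholder check: replace with the command that reproduces your bug.
    check = [sys.executable, "-c", "assert 1 + 1 == 2"]
    passed, elapsed = run_check(check)
    print(f"{'PASS' if passed else 'FAIL'} in {elapsed:.2f}s")
```

The elapsed time is worth watching: if a single pass through the loop takes minutes, sharpening that step first will pay for itself many times over.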

Epilogue: Asking for help

Every single one of us, no matter how talented or experienced, will encounter problems that we need help with. The ability to humble yourself and ask for help is a good thing, and every organization with a healthy culture will encourage and enable it.

That said, there is a right way and a wrong way to ask for help. No one wants to do your thinking for you. You should never seek help without attempting to understand and solve the problem yourself.

At a minimum you should be able to answer the following basic questions:

  1. What specifically isn’t working?
  2. What did you attempt, and why?
  3. What did you expect to see?
  4. What did you see instead?

Struggle through the problem first. Struggling with a problem is an absolute must. If you run for help at the first sign of a block or struggle, you are short-circuiting your learning process. The ability to debug is like a muscle, and it gets better with repeated exercise. Don’t ping someone every few minutes with a live update of your troubles. Wrestle with the problem for a while, and accumulate a series of questions to ask all at once. This will also give you time to think about the problem, and possibly come up with a solution on your own.

Make sure you understand how to explain it. Rubber Duck Debugging, referenced in the book The Pragmatic Programmer, takes its name from a programmer who would carry around a rubber duck and force themselves to explain the problem and the code to it in great detail. The idea is that by forcing yourself to understand the problem well enough to explain it thoroughly to someone who doesn’t have the context, you may come to see where you’re going wrong.