On-call management: what is it and how to improve it?
Modern software engineers have many responsibilities – and one of them is dealing with incidents. However, incidents can happen anytime, which means that engineers will have to be available outside of working hours. This brings us to the concept of on-call management.
What is on-call management?
Let’s look at one definition found on the LinkedIn Engineering blog:
“On-call employees are responsible for monitoring systems, responding to alerts or notifications, and resolving issues as quickly as possible. In a typical on-call system, employees are assigned to a rotating schedule, where they take turns being available outside of their regular work hours. During their on-call shift, employees are expected to be reachable and responsive in case of any issues or emergencies.”
From my personal experience, being included in the on-call rotation is probably one of the aspects of the job that people like the least. Some people don’t like being available outside of working hours, and some are afraid of dealing with stressful situations, but in many cases the main reason for the negativity towards the on-call policy is the fear that the person handling the incident will not know what to do. And that truly is a horrible position to be in, especially in stressful situations where you must deal with the issue in a timely manner while maintaining communication with many stakeholders (which is always the case for larger incidents that have an impact on business).
But it doesn’t have to be that way – if you get your on-call procedure in order. It is usually bad on-call management that is the main cause of concern and fear. A badly designed on-call policy will also hurt the overall confidence of the engineering team, because they won’t be able to deal with incidents properly. Or, to describe it even more bluntly: “It is engineering’s responsibility to be on call and own their code. It is management’s responsibility to make sure that on-call does not suck. This is a handshake, it goes both ways, and if you do not hold up your end they should quit and leave you.”
But before we try to figure out how to improve the on-call policy and make it more manageable, let’s discuss what dealing with an incident usually involves.
OK, I got an alert! What should I do?
If there’s an incident and you are the one who needs to respond to an alert, what are the realistic expectations – what can you do? This depends on the incident – some require much more effort than others – but in general it all comes down to these three things:
- Finding out what’s going on: this phase depends on your alerting, as well as the level of logging and monitoring that you have in place. If you have a sufficient level of information, it should be relatively easy to figure out which part of the system is failing and why. Of course, if this is not the case, this will be a struggle – and you will lose precious time trying to figure out what is going on.
- Trying to recover the affected part of the system: this phase – the remediation – largely depends on what it is that you need to do to fully recover the system (or, if you can’t fully recover it, bring it back to some acceptable level of service). Sometimes this will involve fixing a bug and deploying a hotfix. It might also require restoring a database, or simply restarting the application. Whatever it is that you need to do, you will greatly benefit from having a good documentation on how to perform most of the procedures that are critical for this phase (for example: releasing and deploying the latest version of application, restoring database backup, performing failover, etc.). There will be more information on this later in this article.
- Follow-up actions: This is something that happens after the problem has been dealt with; it is a range of activities that should help in either resolving the problem completely or preventing the problem from happening again. This might require writing a patch or fixing a bug, performing post mortem analysis, updating documentation based on some specific thing you notice during the incident, etc.
Designing on-call rotations
So, now that we have covered what needs to be done to resolve an incident, let’s discuss potential ways of improving on-call management. The first thing we need to do is answer a question – who needs to be on-call? The answer is – almost everyone.
A healthy on-call process requires engineers to provide support for the application they are building. Managers should also participate in this process because it is important that they lead by example. You don’t have to involve the most junior engineers – but they are more than welcome to observe how incidents are handled (and you should always invite them to observe, if possible). It’s also not realistic to expect people who recently joined the team to be part of the rotation right away; however, it’s something they should start doing after a few months.
In general, higher seniority ranks bring new responsibilities – and being on rotation is one of them.
The next step is to define how often people will be on-call. If someone is on rotation too often, it can quickly lead to stress and burnout. In the opposite case, that person might not get the chance to build the required skills and knowledge. According to various sources, the sweet spot is between 25 and 30 percent of an engineer’s time (which translates to being on-call roughly one week every four weeks). My own experience mirrors this.
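As a rough illustration, the one-week-in-four cadence is easy to sketch as a round-robin schedule. This is a minimal sketch, assuming a simple weekly rotation; the names and start date are hypothetical, not part of any real tooling.

```python
from datetime import date, timedelta

def rotation_schedule(engineers, start, weeks):
    """Assign one engineer per week, round-robin, so each person
    is on-call roughly 1/len(engineers) of the time."""
    schedule = []
    for week in range(weeks):
        week_start = start + timedelta(weeks=week)
        schedule.append((week_start, engineers[week % len(engineers)]))
    return schedule

# With four engineers, each is on-call one week in four (~25% of the time).
team = ["Ana", "Boris", "Chen", "Dana"]
for week_start, engineer in rotation_schedule(team, date(2024, 1, 1), 8):
    print(week_start, engineer)
```

A real scheduler would also have to handle vacations, swaps, and public holidays – which is exactly why dedicated on-call tools exist – but the core cadence is just this modulo arithmetic.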
It is also important to keep in mind that an engineer should never be completely alone while on-call – there should be a secondary: a person who can take over resolving the issue in case the primary is unavailable for any reason, or if none of the alerts have been acknowledged within some predefined period. Ideally, you should also have the option of reaching out to your manager.
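The primary/secondary handoff described above is, at its core, an escalation chain: page each responder in order, and stop at the first acknowledgement. The sketch below simulates that logic; the responder names are illustrative, and in real life the `print` call would be a request to your paging provider’s API.

```python
# Hypothetical sketch of primary -> secondary -> manager escalation.
# `acked` simulates which responders actually answer their page.

def escalate(alert, chain, acked):
    """Page each responder in order; stop at the first acknowledgement."""
    for responder in chain:
        print(f"Paging {responder} about: {alert}")
        if responder in acked:
            return responder  # this person now owns the incident
    return None  # no acknowledgement: trigger a louder fallback

owner = escalate("db replication lag",
                 ["primary", "secondary", "manager"],
                 acked={"secondary"})
print("Incident owner:", owner)  # primary missed the page, secondary took over
```

In a real setup each step would also wait for a timeout (say, 15 minutes) before moving down the chain – the point is simply that the chain, not a single person, is responsible for the alert.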
Is on-call mandatory?
This brings us to the next important discussion point: should on-call be mandatory? In principle, yes – supporting your own code is part of every engineer’s job. That said, the ideal option is probably to make on-call voluntary, but this can work only if your team is big enough that there are always enough people willing to take on these responsibilities. In smaller teams, on-call will probably have to be mandatory (with extreme cases where there is no rotation at all, or where only a single person is on rotation).
I have seen cases where on-call was mandatory, but engineers treated it as voluntary – because the on-call policy was not enforced properly. You need to resolve such situations, because they can lead to problems within the team, especially if the rotation always falls on the same group of people.
And finally, there is the question of compensation: should on-call be compensated, and how? The answer is not simple. Many companies with mandatory on-call do not offer compensation, because it is considered part of the job. If on-call is voluntary, the compensation varies from company to company and usually includes bonuses for being on rotation during public holidays. You can also offer compensation in the form of extra days off, or even a combination of extra days off and monetary compensation – it is all up to you and how you incorporate compensation into your on-call management process.
So, how do you make your on-call more tolerable? Reacting to incidents is (and probably always will be) stressful, but that does not mean there isn’t anything you can do to reduce the level of stress. On the contrary, there are quite a few things you could do.
First, you need to have meaningful alerts. You must be able to quickly filter out things which are not important and get enough information about the problem you are dealing with. If you do not have enough alerts (or even worse, don’t have them at all), you will lose precious time trying to figure out what is wrong, or you might not even notice that anything is wrong. The opposite is also true – if you have too many alerts, you will lose time trying to figure out which of these alerts are really the ones you must focus on. So, the key is to find balance and focus on a smaller number of critical alerts. However, you also need to make sure that your alert descriptions include as much detail as possible about the problem itself – because that will give the person who is on-call enough information to start an investigation. Not only that, but you should also have some documentation that describes when certain alerts might be triggered, how to resolve those issues (if that is a common occurrence), etc.
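One lightweight way to keep alerts both few and informative is to attach a severity and a runbook link to every alert definition, and only page on the critical ones. The sketch below illustrates the idea; the field names, thresholds, and URLs are hypothetical, not a reference to any particular alerting tool.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str     # "critical" pages someone; "warning" goes to a dashboard
    summary: str      # what is broken, in one line
    runbook_url: str  # where the on-call engineer should start

def alerts_to_page(alerts):
    """Filter down to the small set of alerts worth waking someone for."""
    return [a for a in alerts if a.severity == "critical"]

incoming = [
    Alert("HighErrorRate", "critical",
          "5xx rate above 5% for 10 minutes on the checkout service",
          "https://wiki.example.com/runbooks/high-error-rate"),
    Alert("DiskSpace70", "warning",
          "Disk usage above 70% on the log volume",
          "https://wiki.example.com/runbooks/disk-space"),
]
for alert in alerts_to_page(incoming):
    print(f"PAGE: {alert.name} - {alert.summary} (runbook: {alert.runbook_url})")
```

The exact mechanism matters less than the discipline: every alert that can page someone should carry enough context (summary, severity, runbook) that the responder can start investigating immediately.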
And this brings us to the next important aspect of your on-call management: the presence and quality of documentation. In the ideal case, you should have documentation that covers all important procedures that might be of use during on-call. For example, this type of documentation might include instructions on how to restart application servers (or some other part of the system), redeploy an application, create a database backup, perform a failover from one environment to another, etc. Things like a full list of services with brief descriptions, diagrams that show how different parts of your system communicate with each other, or some specific configuration that must be in place for certain applications to work – all of these are extremely helpful.
However, you should not limit yourself to only technical documentation. You might also need a list of contacts to call when dealing with some specific issue. These could be members of other teams within the same company, or people who provide support for third-party components. Such a list of contacts would probably include email addresses, phone numbers, or IM usernames – the more options there are to reach someone, the better. Whatever the case, the information included in this type of documentation should be clear and precise (and, of course, properly maintained), so anyone in the team can easily find it and understand it, regardless of seniority level.
One of the worst situations you can find yourself in is that you need to deal with some production issue quickly, the pressure is high – but you do not know how to perform critical operations, nor do you have any idea where to look for such information. To make things even worse – the alert might have woken you up in the middle of the night and because of that, your concentration might be severely affected. Due to all these factors, you will be more likely to make mistakes – and this is where good documentation can help you. Sometimes you will just need to double-check if you are doing the right thing, but sometimes you will have to perform some procedure for the first time and having a good runbook could make all the difference.
One additional factor that can also have an impact on the overall quality of your on-call management and the frequency or duration of incidents is the way the system is designed. If the system is badly designed, not only will you have trouble understanding how the system works, but such a system will also be more likely to suffer from various issues (performance problems, frequent outages, high level of technical debt, etc.). Having to provide support for such a system during an incident can easily turn into a nightmare. Think about this when you design a new service or extend an existing system – aim for reducing the effort required for maintenance. This might mean that you should decouple certain features or aim for design of smaller applications in general, but it can also mean that you should consider various options provided by cloud services (so you can reap benefits of autoscaling, failover or similar features).
In this article, we have tried to provide some ideas on how to improve your on-call policy. Let’s make one thing clear – dealing with incidents is hard. And it’s not only the technical complexity that makes them hard – but also the level of stress, especially in situations where the ability to conduct business or provide service is affected. Good on-call management helps you by reducing stress and effort, providing you with access to relevant information and/or tools and making it easier for you to focus on resolving the underlying issue. By having a better on-call process, you might not decrease the number of incidents, but you will have more confidence when it comes to dealing with them.
Want to find out more about IT staffing services and how we can help you put together an awesome team of professionals with great on-call management procedures? Get in touch with us today and let’s make your business dreams come true.
Miroslav Lazovic is an experienced software engineer from Belgrade. Over the past 15 years, he has worked on many different projects of every size – from small applications to mission-critical services used by hundreds of thousands of users on a daily basis – both as a consultant and as part of a development team.
Since 2016, he has focused on building and managing high-performing teams and helping people do their best work. He likes discussing various important engineering and management topics – and that’s the main reason for this blog. Besides that, Miroslav has quite a few war stories to tell and probably far too many books and hobbies. There are also rumors of him being a proficient illustrator, but the origin of these rumors remains a mystery.
– Backend TL@Neon and Dev. Advocate @Holycode