The Elements of DevOps


The goal of a software development team is to deliver a great service to the customer. But you can’t always anticipate what will happen when your applications start running in a real environment. To extend the idea of the Agile methodology, the team needs to establish good feedback loops that tell them how things are running in production and what they need to do to improve the service.

Historically, responsibility for building and testing an application until it was ‘production ready’ belonged to the Software Development team. The code was then handed off to an Operations team, who would deploy it into production and keep it running. This separation of responsibilities often led to problems. Sometimes the Operations team didn’t completely understand the complex software they were responsible for keeping up. Other times, they felt the Software Development team hadn’t built the software to be operated in a production environment. For example, the developers didn’t include enough telemetry to tell the Operations team how the software was running, or didn’t architect it to scale up as demand increased.

The intent of DevOps is to have the Software Development and Operations teams work more closely to solve these problems. The Software Development team needs to have access to information (data) from the production environments. Once they have the information, they should be able to quickly use it to adapt the system, leading to a higher quality service that is reliable, performant, responsive, secure, and scales to demand.

Whether you have separate Operations and Development teams, or your Development team is responsible for support themselves, you need to have all of the following elements for a good DevOps process to work:

  1. Telemetry / Alerting

Telemetry can come in many forms, but operationally, you should focus on two main kinds:

  • Application logs: these tell you what kinds of data are flowing through the system and what kinds of errors you are encountering. They should help you quickly find where the problems are. Ideally, you are using a tool (e.g. Kibana) that lets you search logs across the various components of your system, so you can trace a single data event as it travels through them.
  • Metrics: these count and measure things. You can use them to track when various components are being called (ones owned by the Development team as well as ones external to the team) and how long they take to respond. Or you can count the number of times certain kinds of errors are thrown. It’s often useful to also have a “heartbeat” metric, which simply indicates that your application is actually running and not hung or crashed. You should be able to graph this data in a “system dashboard” (e.g. Grafana) and make it available to the Operations or Development teams.
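As a rough illustration of the two kinds of telemetry, here is a minimal Python sketch. It is not tied to any particular logging or metrics product: the `Metrics` class, the `handle_request` function, and the metric names such as `orders.errors` are all hypothetical stand-ins. It shows log lines that share a trace ID so one event can be followed across components, plus a counter, a timing, and a heartbeat.

```python
import json
import time
import uuid
from collections import defaultdict


class Metrics:
    """Tiny in-memory stand-in for a real metrics client (hypothetical)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)
        self.last_heartbeat = None

    def count(self, name, amount=1):
        self.counters[name] += amount

    def timing(self, name, seconds):
        self.timings[name].append(seconds)

    def heartbeat(self):
        # A real application would emit this on a regular schedule.
        self.last_heartbeat = time.time()


def log_event(trace_id, component, message, level="INFO"):
    """Build one JSON log line; the shared trace_id ties components together."""
    return json.dumps({
        "trace_id": trace_id,
        "component": component,
        "level": level,
        "message": message,
    })


def handle_request(metrics, payload):
    """Process one hypothetical request, recording logs and metrics."""
    trace_id = str(uuid.uuid4())
    lines = [log_event(trace_id, "api", "request received")]
    start = time.perf_counter()
    try:
        if "order_id" not in payload:
            raise ValueError("missing order_id")
        lines.append(log_event(trace_id, "orders", "order accepted"))
        metrics.count("orders.accepted")
    except ValueError as exc:
        lines.append(log_event(trace_id, "orders", str(exc), level="ERROR"))
        metrics.count("orders.errors")
    finally:
        metrics.timing("orders.duration", time.perf_counter() - start)
    return lines
```

In a real system the log lines would be shipped to a central store and the counters sent to a metrics database, but the shape of the data is the same.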

Once you have metrics established, you can implement alerts based on events or thresholds. This frees you from having to check logs or reports all the time. An alert may tell you that an application has encountered too many errors, or that data seems to be outside a “normal” range: for example, traffic to your website is too low or too high, or your server response times are too long. You can use a database such as InfluxDB to track the metrics, and an alerting component like Kapacitor to send a message to your Operations or Development team via email or a service such as PagerDuty or Slack. The alert should contain enough contextual information to help the team quickly track down and diagnose the problem: for example, which server environment, which component, and what type of data is causing it. The team can configure the alerting system to notify on a broad range of possible problems, and then “tune” the alerts over time to reduce the number of false alarms.
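The threshold idea can be sketched in a few lines of Python. This is only an illustration of the logic, not how Kapacitor or any other alerting tool is configured; the `Threshold` class, the `check` function, and the example metric name are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    """Acceptable range for one metric (hypothetical structure)."""
    metric: str
    low: Optional[float] = None
    high: Optional[float] = None


def check(threshold, value, environment, component):
    """Return an alert dict if value is outside the range, else None."""
    if threshold.low is not None and value < threshold.low:
        problem, limit = "below", threshold.low
    elif threshold.high is not None and value > threshold.high:
        problem, limit = "above", threshold.high
    else:
        return None
    # Include context so whoever receives the alert knows where to look.
    return {
        "metric": threshold.metric,
        "environment": environment,
        "component": component,
        "summary": f"{threshold.metric}={value} is {problem} limit {limit}",
    }


traffic = Threshold("requests_per_minute", low=50, high=5000)
alert = check(traffic, 10, environment="production", component="web")
```

Tuning the alerts over time then amounts to adjusting the `low` and `high` values as you learn what “normal” looks like.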

  2. An Agile Framework

As your Development team works on new features, they should be prepared to take on unplanned tasks based on information gathered from telemetry. This might include fixing applications, adjusting configurations, or adding capacity. An Agile Framework such as Scrum or Kanban gives you the flexibility to change plans quickly. If using a framework such as Scrum, then you should allocate time in every Sprint to take on production changes.

  3. Automated Testing

In cases where a configuration or code change is necessary, having good test automation allows the team to know that their change did not inadvertently break something else. There can sometimes be hundreds of tests that can be run over a complex piece of code. If manual testing takes days, then the team cannot move quickly to fix problems. Automated testing can often determine if a fix is okay within minutes.
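To make this concrete, here is a small sketch using Python’s built-in `unittest` module. The `parse_timeout` function is a hypothetical example of a configuration helper; the tests pin down its behavior so that a later change which breaks it is caught in seconds rather than in production.

```python
import unittest


def parse_timeout(value):
    """Parse a timeout setting in seconds; reject non-positive values."""
    seconds = int(value)  # raises ValueError for non-numeric text
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return seconds


class ParseTimeoutTest(unittest.TestCase):
    def test_valid_value(self):
        self.assertEqual(parse_timeout("30"), 30)

    def test_rejects_zero(self):
        with self.assertRaises(ValueError):
            parse_timeout("0")

    def test_rejects_garbage(self):
        with self.assertRaises(ValueError):
            parse_timeout("soon")
```

Saved in a test file, these can be run with `python -m unittest`, either by hand or as a step in an automated build.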

  4. Continuous Integration / Continuous Deployment

The Development team should be using a code repository with a good branching process to support Continuous Integration and Continuous Deployment best practices. The team should be able to make the needed change, run the automated unit tests, merge the change into a production branch, run the automated system tests, and release the change into a production environment within a relatively short amount of time. And, in the worst case, if the change does not solve the problem (or creates other ones), the team should be able to quickly roll it back.
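The flow described above can be sketched as a single Python function. This is a simplified model, not the configuration of any real CI/CD tool: each stage (`unit_tests`, `system_tests`, `deploy`, `health_check`, `rollback`) is a hypothetical callable that your own pipeline would supply.

```python
def release(change, unit_tests, system_tests, deploy, health_check, rollback):
    """Run the stages in order and return a short description of the outcome."""
    if not unit_tests(change):
        return "rejected: unit tests failed"
    if not system_tests(change):
        return "rejected: system tests failed"
    deploy(change)
    # After deploying, confirm the service is healthy; undo the change if not.
    if not health_check():
        rollback(change)
        return "rolled back"
    return "released"


events = []
outcome = release(
    "fix-timeout",
    unit_tests=lambda change: True,
    system_tests=lambda change: True,
    deploy=lambda change: events.append("deploy"),
    health_check=lambda: False,  # simulate a bad release
    rollback=lambda change: events.append("rollback"),
)
```

Here the simulated health check fails, so the change is deployed and then rolled back. The point of the sketch is that the rollback path is part of the pipeline from the start, not something improvised during an incident.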

Making small, incremental changes continuously can lower the risk and impact of the change, as opposed to waiting and gathering a large number of changes and releasing them all at once. When releasing a large number of changes all at once, you often run the risk of having to roll back all of the changes if something goes wrong with one change.

DevOps is a journey and not a destination. Beginning to adopt these practices will help your team down the path of DevOps. DevOps skills and tools are continuously evolving. Similar to embracing Agile methodologies, DevOps is part of a complete culture change within an organization. The overall solution will include many other elements, such as on-call and escalation procedures, incident tracking systems, and so on. However, if you are trying to put a DevOps process in place, you will need to focus on making these elements work well.
