DevOps Documentation

In a previous entry I wrote that the key to removing the wall between developers and operations is communication. I wrote about becoming involved in your co-workers’ meetings and understanding your co-workers’ needs, and this is of course very important, but sometimes it is best to have things written down, because your co-workers aren’t always around to answer your questions (ops people who have tried calling developers at 4am know this all too well).  Good documentation is often toted as the answer to all of life’s problems, but we never make time for it. I’ve found a few ways to reduce the barriers to creating documentation, which I discuss below.

First, documentation needs to be revision controlled and live with the code, as doing so will reduce the cost for the developers to create documentation and help keep track of changes to documentation. This also means you don’t use binary formats for your documentation–no OpenOffice Doc, no MS Word, no format that you can’t store in git, svn, or whichever revision control system your company uses. You must be able diff revisions. Documentation stored primarily in plain text is ideal because plain text is a universal format; plain text can be emailed to anyone, edited by anyone, read by anyone and easily manipulated with tools readily available to both developer and operations.  My current choice for documentation is ReStructured Text, which you already know how to write even if you haven’t seen it before. ReST is a plain text format that evolved out of the docutils package and focuses on simplicity and clarity of meaning.

Furthermore, documentation that will be used by the developers and ops people should live with or near the code that the documentation is for so that the docs are easily accessible to both groups. Documentation stored in revision control also means that you can set up commit hooks that trigger alerts when the documentation changes. If you use a web-based system for browsing source code that system probably offers an RSS feed that can notify you of changes to files, so you can see when your documentation has changed without leaving your RSS client. The automation possibilities are endless–if a specific source file got updated but the docs didn’t? Send an email to the person who made the commit asking why.

Keeping the documentation within the source tree has the advantage of the documentation always being checked out with the code, and thus the most recent revision is available to all the developers, and the ops people keep up with the source tree as they work with the documentation.

The DevOps Dialogue Document

One of the first documents that an ops person looking to bridge the development divide should create is a dialogue document. This is a document that the developer and the ops person work on together that tracks all of the information needed to handle the care and feeding of the software. What should this document look like?

Administrative Details

  • A description of the software role and purpose.
  • A contact in development, a contact in ops, and a contact in QA. These are the people that can be bugged about issues with the software, and these are the technical, not managerial, contacts.

Operating Requirements

  • Disk space required for installation, disk space used in normal operations, disk space used by logs, disk space used by data, and estimates of growth rates.
  • Memory requirements.
  • Network requirements, including ports the software listens on, connections the software makes to other services, protocols used (at all layers), if SSL is required, and so on.
  • OS requirements, such as platform, distribution, release of distribution, kernel versions, patch versions, word size (32 or 64 bit), other unique OS requirements.
  • Environment requirements, such as libraries required, variables that need to be set, shells that are required, users and groups that need to be on the system, and more.
  • Details on how to start and stop the software.

Configuration Details

  • Details on how the software is configured; is it via a config file? Is the configuration in a database? Is there a way to change the running configuration on the fly?

Monitoring & Debugging

  • How does the software get monitored? Is there a way to have the software report status? What about performance monitoring?
  • What process do the developers use to debug issues they encounter in the software? For example, is there a mechanism to get a stack trace, such as Java’s handling of SIGQUIT?
  • How does the software respond to common error situations like out of memory or out of disk space, or being unable to bind to a port? What about handing of bad input?

Backup & Restore

  • How does the software store state, and can you easily backup and restore that state?
  • Can you do a backup while the software is running and get a restorable backup? How can this be tested?

Security

  • What privileges are needed to run the software?
  • How is input validated?
  • How are any authentication tokens (passwords, certificates, etc) stored?

There is a lot more that should go in each section and many more sections that could be added; the above is an example that you can use to start your document.

At the last Boston DevOps Meetup, @hercynium mentioned that he uses the FCAPS model to figure out what he needs to know about software he is going to deploy and manage. I hadn’t heard about this model but after some quick research it looks like an excellent reference for helping you build your dialogue document.  From the Wikipedia entry:

FCAPS is an acronym for Fault, Configuration, Accounting, Performance, Security, the management categories into which the ISO model defines network management tasks. In non-billing organizations Accounting is sometimes replaced with Administration.

These topics map well to the items that developers and ops people both need to understand about the software they are working on together.

No software will have the correct “answers” for the dialogue questions, and that’s not the point. You’re trying to start the dialogue that gets everyone thinking about software outside of their silo. I’ve written the questions above from the perspective of an ops person, but the developers should add their own set of questions — maybe a section on what the production environment is like and how software will be pushed to production. Since you are storing this document in plain text along side your source code under revision control you can easily check in new answers or questions and have those changes be seen by the developers on next update.

Developers and operations people lament the perceived lack of documentation and everyone agrees we need more. The purpose of the dialogue document is to create the documentation that is needed in the lowest cost way, so don’t try to make the perfect document from the start. Embrace the iterative and evolutionary nature of Agile and grow the document as your understanding of the software grows.

Recent Readings

Some interesting articles & tools:

Getting Started With DevOps

DevOps has been defined in this article by Stephen Nelson-Smith, and the executive summary is that operations and development should no longer be separate functions (and never should have been) and need to start working closely together.

Why? Without working together, failures inevitably occur. For example, at the last Boston DevOps Meetup, one of the attendees, a developer, was commenting on the disconnect between him and his sysadmin and how their relationship was unlike the devops model.  

“That all sounds nice for you guys, but my sysadmin at work doesn’t seem to care about any of this. He’s not engaged.” 

The developer went on to give examples of times when the production servers broke the code because the production servers were configured incorrectly, or the ops person didn’t assist in debugging a problem because the ops person felt the problem was the developer’s to deal with.

We all talked about this for a while, when I realized that in another bar in another town, that developer’s sysadmin was saying to his friends, probably over a beer, “That all sounds nice for you guys, but my developers don’t care about any of this. They’re not engaged.” The sysadmin probably went on to talk about how the developers don’t keep their configurations sane and how they never debug the problems they create.

This is why we need DevOps. (And probably, really, DevOpsQASales, but that’s another post).

How do you get started with DevOps at your work?

  • If you are developer, invite your ops guys to your scrum or weekly meeting. Make sure they come and always ask them if what you are talking about has an impact on their work.
  • If you are an ops guy, invite your developers to your scrum or weekly meeting. Tell the developers about upcoming changes in each environment. Ask the developers what is going in their world.

If you’re a small team, bring everyone, if you are large, bring one developer from each project, but don’t invite the development managers, or the ops managers. Invite the people who do the work. You need to have the developers who will say, “We’re implementing a feature that uses these extra libraries, can they be installed in production?” and the ops people who will say, “Oh, if you use v2.8 of that library it won’t work on the older machines because of x, can you guys use v2.9?”. You want this to happen before you go to production.

Adding meetings to your calendar always sucks, but you’ll save headache later by talking to each other, and more importantly, you’ll buy into the projects that everyone is working on. You’ll believe in the work others at your company are doing and want to help if there are issues.

The developer above should be talking to his ops guys on a daily basis. They should go for a beer and talk about technical problems at work. I guarantee one of them will say, “Oh for that I just do x y and z, and it works great.” and this will be the solution to a nagging problem.

Technology, of course, can help a lot. In the example above, the ops guy should:

  • Create virtual machine images of production on a regular schedule that the developers must run their code on as part of checkin cycle.
  • Use a configuration management tool such as puppet or chef to keep staging and QA environments matching production environments.
  • Talk to the developers about the network, hardware and software that make up the production environment, including details of the resource limitations of those components.

While the developer should:

  • Write up details of the software’s operating requirements in terms of resource usage, environment configuration, and other dependancies.
  • Package the software properly, so the ops people can review package manifests on upgrades automatically to track changes in QA and staging.
  • Codify operational issues in unit tests (lossy network unit tests, disk full unit tests, out of memory unit tests, blocked port unit tests).

Working together:

  • Use build tools such as Hudson to trigger jobs on checkin – jobs which run the full set of unit tests on a production-like environment/virtual machine.
  • Define hand off procedures for new features, which require checkoffs from ops, QA and development.
  • Push the ops deployment scripts and methods to the developers’ workstations so the developers can use them.

Also, whenever possible you should automated that which can be automated. That’s what computers are really good at.

These are just a few examples. There’s a lot to talk about in later posts, such as more detail on the heavy use of automation, Agile practices in operations, proper storage of documentation (version control! plain text formats!) and more.

I hope this is useful introduction to some DevOps concepts as I’ve understood them. Please comment below about your own experiences.