Failure Mode And Effect Analysis For Software Operations

A while back I was working at a company during and after the purchase of a competitor. The competitor also made database-backed web sites, but ran them on Oracle, while our company was an MS SQL Server shop. This was my first experience administering Oracle, and the Operations team wasn’t given much time to learn the ins and outs of running an Oracle server. Furthermore, we were told that the competitor’s software would be migrated off Oracle in a month or two, so there wasn’t any money to spend on Oracle training.

Our backup provider had an MS SQL Server plug-in that did over-the-wire replication as well as the usual full backups and transaction log backups; we could restore our MS SQL Server instances to almost any point in time if needed. The provider didn’t have an Oracle plug-in, so instead we used Oracle’s exp/imp utility to make full exports, which we then backed up with the provider’s OS-level tools. We knew about RMAN (Oracle’s backup utility), but we only used it for cleaning up archive logs. We tested restores regularly and in general had very good backup practices.

One of the competitor’s customers, now our customer, started calling the help desk to complain about failures in the software. The errors were all over the place, seemingly random events that could crop up at any time. Worse, the customer couldn’t get to some of its data. The developers and some of the Operations team got together to investigate the errors while other Operations people started pulling backups and restoring them to staging servers; the restores appeared to import without problems, but we’d still see errors and couldn’t find all of the data.

After much consternation we figured out that the event that had snuck past our backup protection was block corruption in one of the table datafiles in Oracle. We resolved it with the assistance of expensive conversations with Oracle, who showed us how to recreate the lost data and how to set up RMAN to do backups alongside our exports. It turns out that RMAN can restore individual blocks when a corrupt block is detected, and RMAN itself can find corrupt blocks while backups are taken. Oracle also offers dbv, a command-line tool that verifies datafiles.
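
In hindsight, that kind of verification is easy to automate. Here is a minimal sketch, in Python, of a job that runs dbv over a set of datafiles and fails loudly when any failing pages are reported; the datafile paths are hypothetical, and the output parsing assumes dbv’s usual “Total Pages Failing” summary lines.

    #!/usr/bin/env python
    # Sketch: run Oracle's dbv (DBVERIFY) utility against a set of
    # datafiles and report any that contain failing (corrupt) blocks.
    # The datafile paths and block size below are hypothetical.
    import subprocess
    import sys

    DATAFILES = [
        "/u01/oradata/PROD/users01.dbf",    # hypothetical paths
        "/u01/oradata/PROD/system01.dbf",
    ]
    BLOCK_SIZE = 8192  # must match the tablespace block size

    def verify(datafile):
        """Return True if dbv reports zero failing pages for datafile."""
        out = subprocess.run(
            ["dbv", "file=%s" % datafile, "blocksize=%d" % BLOCK_SIZE],
            capture_output=True, text=True, check=True,
        ).stdout
        clean = True
        for line in out.splitlines():
            # dbv prints summary lines like "Total Pages Failing (Data) : 0"
            if "Failing" in line and int(line.split(":")[-1]) > 0:
                clean = False
                print("%s: %s" % (datafile, line.strip()))
        return clean

    if __name__ == "__main__":
        bad = [f for f in DATAFILES if not verify(f)]
        sys.exit(1 if bad else 0)  # nonzero exit lets cron or Nagios alert

A job like this, run between RMAN backups, turns a once-undetectable failure mode into a routinely detected one.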

The incident analysis of this event was educational but also frustrating. The frustration came from the feeling that no matter what changes we made to our backups, no matter what monitoring we implemented, no matter how much we learned about Oracle, we could easily miss another of the many complex failure modes in this complex system.

Another common frustration in Operations is that we have only so much time and money to spend on failure mitigation. And even if we had unlimited resources to monitor everything, we might still miss the important events because of the poor signal-to-noise ratio that comes with over-monitoring.

How do we choose what to worry about?

One of my favorite ways to figure out where operations time and money are best spent is to do a Failure Mode and Effect Analysis (FMEA). FMEA is a technique I learned about while studying Six Sigma at PowerSteering Software. The purpose of FMEA is to understand failure modes and their risks. It is a great tool for Software Operations because it shows the Operations team where best to assign limited resources for concerns such as monitoring, high availability, performance, backup, and disaster recovery.

FMEA assigns severity, occurrence (frequency), and detectability values, each from 1 to 10, to every possible failure mode of a specific system function. For severity, 1 means no impact and 10 means total system meltdown; for occurrence, 1 means the event never happens and 10 means it occurs continuously; for detectability, 1 means the event is always detectable and 10 means it is never detectable. The product of the three values is the Risk Priority Number (RPN), which is used as a sorting key: failure modes with a higher RPN carry more risk than modes with a lower RPN. In other words, you should be working to mitigate failure modes with high RPNs.
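
To make the arithmetic concrete, here is a small Python sketch that computes RPNs and sorts failure modes by risk, using two entries that also appear in the example table below:

    # Sketch: compute Risk Priority Numbers (RPN = S x O x D) and sort
    # failure modes so the riskiest appear first.
    from dataclasses import dataclass

    @dataclass
    class FailureMode:
        name: str
        severity: int     # 1 (no impact) .. 10 (total meltdown)
        occurrence: int   # 1 (never happens) .. 10 (continuous)
        detection: int    # 1 (always detectable) .. 10 (never detectable)

        @property
        def rpn(self):
            return self.severity * self.occurrence * self.detection

    modes = [
        FailureMode("Data corruption", severity=8, occurrence=2, detection=7),
        FailureMode("Disk full", severity=7, occurrence=3, detection=1),
    ]

    for m in sorted(modes, key=lambda m: m.rpn, reverse=True):
        print("%-20s RPN=%d" % (m.name, m.rpn))
    # Data corruption      RPN=112
    # Disk full            RPN=21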

The initial list of failure modes is brainstormed by all of the groups involved in the software, and then the same groups assign the scores for severity, occurrence, and detectability.

Here’s an example FMEA for a hypothetical database storage subsystem. An FMEA is normally accompanied by a key that explains what each severity, occurrence, and detectability value means, so the reader can understand the assigned ratings; for the sake of space I’ve not included a key in the example below.

FMEA - Database Storage

Failure Mode: Data Corruption
  Effect of Failure: User data incorrect or lost, application errors
  Severity (S): 8
  Potential Cause of Failure: Damaged cables, memory errors, firmware bugs
  Occurrence (O): 2
  Possible Means of Detection: Custom alerting scripts, datafile consistency checkers
  Detection (D): 7
  RPN (S x O x D): 112
  Preventative Actions to be Taken: Regular cable examination, clean up cable runs in the hosting center, subscribe to vendor notifications of storage firmware issues

Failure Mode: High Read/Write Latency
  Effect of Failure: Degraded user experience, failed transactions due to timeouts
  Severity (S): 4
  Potential Cause of Failure: Storage block contention, too few spindles to sustain read/write rates, disk cache too small
  Occurrence (O): 6
  Possible Means of Detection: Latency monitors, OS tools, vendor storage network tools
  Detection (D): 3
  RPN (S x O x D): 72
  Preventative Actions to be Taken: Change RAID level to better distribute load on the storage array, spread datafiles across LUNs, use table partitioning

Failure Mode: Disk Full
  Effect of Failure: Writes fail, database possibly unavailable
  Severity (S): 7
  Potential Cause of Failure: Sudden growth in writes, admin error
  Occurrence (O): 3
  Possible Means of Detection: Nagios alerts, OEM alerts
  Detection (D): 1
  RPN (S x O x D): 21
  Preventative Actions to be Taken: Switch from filesystem-based datafile storage to ASM to avoid user access to datafiles, QoS write rates

Failure Mode: Storage Unavailable
  Effect of Failure: Database unavailable
  Severity (S): 10
  Potential Cause of Failure: SAN switch down, cable unplugged
  Occurrence (O): 2
  Possible Means of Detection: OEM alerts, various Nagios alerts, user notification
  Detection (D): 1
  RPN (S x O x D): 20
  Preventative Actions to be Taken: Fully redundant SAN fabric
An example of an FMEA for database storage failures.

This example is illustrative of what you may discover when you create an FMEA. In the table above we can see that latency and data corruption are important areas of concern, either because they occur often (the high-latency case) or because they are severe and very difficult to detect (the corruption case). These two top-risk items warrant an investment in failure mitigation and detection.

That’s not to say the items with the highest RPN are the only things you should spend time on. An FMEA doesn’t capture the economics of mitigation, nor does it always reflect the true cost of loss of service. In the example above, “disk full” may appear to the customer as the same failure as “storage unavailable”. Also, there may be close to zero cost in mitigating disk space issues (monitors for disk space are available in all monitoring packages) but a significant cost in mitigating latency problems (you may have to re-architect the application). FMEA should be used alongside common sense and experience when choosing risk mitigation strategies.

DevOps Documentation

In a previous entry I wrote that the key to removing the wall between developers and operations is communication. I wrote about becoming involved in your co-workers’ meetings and understanding your co-workers’ needs, and this is of course very important, but sometimes it is best to have things written down, because your co-workers aren’t always around to answer your questions (ops people who have tried calling developers at 4am know this all too well). Good documentation is often touted as the answer to all of life’s problems, but we never make time for it. I’ve found a few ways to reduce the barriers to creating documentation, which I discuss below.

First, documentation needs to be revision controlled and live with the code; doing so reduces the cost for developers to create documentation and helps keep track of changes to it. This also means you don’t use binary formats for your documentation: no OpenOffice doc, no MS Word, no format that can’t be usefully diffed in git, svn, or whichever revision control system your company uses. You must be able to diff revisions. Documentation stored primarily in plain text is ideal because plain text is a universal format; it can be emailed to anyone, edited by anyone, read by anyone, and easily manipulated with tools readily available to both developers and operations. My current choice for documentation is reStructuredText (ReST), which you already know how to write even if you haven’t seen it before. ReST is a plain text format, developed as part of the docutils project, that focuses on simplicity and clarity of meaning.
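
One nice side effect of choosing ReST is that rendering can be scripted. Here is a minimal Python sketch using the docutils package; the file names are hypothetical.

    # Sketch: render a reStructuredText document to HTML with docutils.
    # Requires "pip install docutils"; README.rst is a hypothetical file.
    from docutils.core import publish_string

    with open("README.rst") as f:
        source = f.read()

    html = publish_string(source, writer_name="html")
    with open("README.html", "wb") as f:
        f.write(html)  # publish_string returns bytes for the html writer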

Furthermore, documentation that will be used by developers and ops people should live with or near the code it describes so that it is easily accessible to both groups. Documentation stored in revision control also means that you can set up commit hooks that trigger alerts when the documentation changes. If you use a web-based system for browsing source code, that system probably offers an RSS feed that can notify you of changes to files, so you can see when your documentation has changed without leaving your RSS client. The automation possibilities are endless: did a specific source file get updated without a corresponding change to the docs? Send an email to the person who made the commit asking why.
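
As one illustration, here is a minimal Python sketch of such a hook, assuming Git, a hypothetical src/ and docs/ layout, and a local mail server; it is a starting point, not a finished implementation.

    #!/usr/bin/env python
    # Sketch of a Git post-commit hook: if the latest commit touches src/
    # but nothing under docs/, email the committer and ask why.
    # The repo layout (src/, docs/) and SMTP details are hypothetical.
    import smtplib
    import subprocess
    from email.message import EmailMessage

    def git(*args):
        return subprocess.check_output(("git",) + args, text=True).strip()

    changed = git("diff-tree", "--no-commit-id", "--name-only",
                  "-r", "HEAD").splitlines()
    touched_src = any(path.startswith("src/") for path in changed)
    touched_docs = any(path.startswith("docs/") for path in changed)

    if touched_src and not touched_docs:
        author = git("log", "-1", "--format=%ae")  # author's email address
        msg = EmailMessage()
        msg["To"] = author
        msg["From"] = "hooks@example.com"
        msg["Subject"] = ("Commit %s changed src/ but not docs/"
                          % git("log", "-1", "--format=%h"))
        msg.set_content("Does this change need a documentation update? "
                        "If not, ignore me.")
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)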

Keeping the documentation within the source tree has the advantage that it is always checked out with the code: the most recent revision is available to all the developers, and the ops people keep up with the source tree as they work with the documentation.

The DevOps Dialogue Document

One of the first documents that an ops person looking to bridge the development divide should create is a dialogue document: a document that the developer and the ops person work on together, tracking all of the information needed to handle the care and feeding of the software. What should this document look like?

Administrative Details

  • A description of the software role and purpose.
  • A contact in development, a contact in ops, and a contact in QA. These are the people who can be bugged about issues with the software, and they are the technical, not managerial, contacts.

Operating Requirements

  • Disk space required for installation, disk space used in normal operations, disk space used by logs, disk space used by data, and estimates of growth rates (requirements like these can be encoded in a preflight check; see the sketch after this list).
  • Memory requirements.
  • Network requirements, including ports the software listens on, connections the software makes to other services, protocols used (at all layers), if SSL is required, and so on.
  • OS requirements, such as platform, distribution, release of distribution, kernel versions, patch versions, word size (32 or 64 bit), other unique OS requirements.
  • Environment requirements, such as libraries required, variables that need to be set, shells that are required, users and groups that need to be on the system, and more.
  • Details on how to start and stop the software.
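
Many of these requirements can be turned into an executable preflight check that runs before installation or startup. Here is a minimal Python sketch, where the threshold, paths, port, and environment variables are all hypothetical:

    # Sketch: preflight check that verifies a few operating requirements
    # before installing or starting the software. All values hypothetical.
    import os
    import shutil
    import socket
    import sys

    MIN_FREE_BYTES = 10 * 1024**3    # 10 GB free on the data volume
    DATA_PATH = "/var/lib/myapp"     # hypothetical data directory
    LISTEN_PORT = 8080               # port the software needs to bind
    REQUIRED_ENV = ["JAVA_HOME"]     # hypothetical required variables

    failures = []

    free = shutil.disk_usage(DATA_PATH).free
    if free < MIN_FREE_BYTES:
        failures.append("only %d bytes free on %s" % (free, DATA_PATH))

    # Check that the listen port is not already taken.
    with socket.socket() as s:
        try:
            s.bind(("", LISTEN_PORT))
        except OSError:
            failures.append("port %d is already in use" % LISTEN_PORT)

    for var in REQUIRED_ENV:
        if var not in os.environ:
            failures.append("environment variable %s is not set" % var)

    if failures:
        print("preflight failed:\n  " + "\n  ".join(failures))
        sys.exit(1)
    print("preflight OK")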

Configuration Details

  • Details on how the software is configured; is it via a config file? Is the configuration in a database? Is there a way to change the running configuration on the fly?

Monitoring & Debugging

  • How does the software get monitored? Is there a way to have the software report status? What about performance monitoring?
  • What process do the developers use to debug issues they encounter in the software? For example, is there a mechanism to get a stack trace, such as Java’s handling of SIGQUIT? (See the sketch after this list.)
  • How does the software respond to common error situations like running out of memory or disk space, or being unable to bind to a port? What about handling of bad input?
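
For software written in Python, a rough analogue of Java’s SIGQUIT thread dump can be wired up with the standard faulthandler module; a minimal sketch, where the choice of SIGUSR1 is an assumption:

    # Sketch: dump stack traces of all threads on SIGUSR1, roughly
    # analogous to the JVM's SIGQUIT thread dump. Unix-only.
    import faulthandler
    import signal

    faulthandler.register(signal.SIGUSR1, all_threads=True)

    # ... the application runs as normal; from a shell an operator runs
    #     kill -USR1 <pid>
    # and the process prints every thread's stack to stderr without dying.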

Backup & Restore

  • How does the software store state, and can you easily backup and restore that state?
  • Can you do a backup while the software is running and get a restorable backup? How can this be tested? (See the sketch after this list.)
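
How such a test looks depends on how the software stores state. As one illustration, here is a minimal Python sketch for software that keeps its state in SQLite, using the standard library’s online backup API; the file names are hypothetical.

    # Sketch: take a backup of a live SQLite database and prove the copy
    # is restorable by opening it and running an integrity check.
    import sqlite3

    # Back up while the source connection is live (writes may be ongoing).
    src = sqlite3.connect("app_state.db")
    dst = sqlite3.connect("app_state.backup.db")
    src.backup(dst)   # SQLite online backup API, Python 3.7+
    dst.close()
    src.close()

    # Restore test: open the backup and verify it is consistent.
    check = sqlite3.connect("app_state.backup.db")
    result = check.execute("PRAGMA integrity_check").fetchone()[0]
    assert result == "ok", "backup failed integrity check: %s" % result
    print("backup is restorable and consistent")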

Security

  • What privileges are needed to run the software?
  • How is input validated?
  • How are any authentication tokens (passwords, certificates, etc.) stored?

There is a lot more that should go in each section and many more sections that could be added; the above is an example that you can use to start your document.

At the last Boston DevOps Meetup, @hercynium mentioned that he uses the FCAPS model to figure out what he needs to know about software he is going to deploy and manage. I hadn’t heard of this model, but after some quick research it looks like an excellent reference for helping you build your dialogue document. From the Wikipedia entry:

FCAPS is an acronym for Fault, Configuration, Accounting, Performance, Security, the management categories into which the ISO model defines network management tasks. In non-billing organizations Accounting is sometimes replaced with Administration.

These topics map well to the items that developers and ops people both need to understand about the software they are working on together.

No software will have the correct “answers” to the dialogue questions, and that’s not the point. You’re trying to start the dialogue that gets everyone thinking about the software outside of their silo. I’ve written the questions above from the perspective of an ops person, but the developers should add their own set of questions; maybe a section on what the production environment is like and how software will be pushed to production. Since you are storing this document in plain text alongside your source code under revision control, you can easily check in new answers or questions and have those changes be seen by the developers on their next update.

Developers and operations people lament the perceived lack of documentation, and everyone agrees we need more. The purpose of the dialogue document is to create the documentation that is needed in the lowest-cost way, so don’t try to make the perfect document from the start. Embrace the iterative and evolutionary nature of Agile and grow the document as your understanding of the software grows.