Failure Mode And Effect Analysis For Software Operations

A while back I was working at a company during and after the purchase of a competitor. The competitor also made database-backed web sites, but ran them on Oracle, while our company was an MS SQL Server shop. This was my first experience administering Oracle, and the Operations team wasn’t given much time to learn the ins and outs of running an Oracle server. Furthermore, we were told that the competitor’s software would be migrated off Oracle in a month or two, so there wasn’t any money to spend on Oracle training.

Our backup provider had an MS SQL Server plug-in that did over-the-wire replication as well as the usual full backups and transaction log backups. We could restore our MS SQL Server instances to almost any point in time if needed. This provider didn’t have an Oracle plug-in, so instead we used Oracle’s exp/imp utility to make full exports which we then used the backup provider’s OS tools to backup. We knew about RMAN (Oracle’s backup utility), but we only used it for cleaning up archive logs. We tested restores regularly, and in general had very good backup practices.

http://www.flickr.com/photos/scoobay/3163954667/

One of the competitor’s customers, who was now our customer, started calling the help desk complaining about failures in the software. The errors were all over the place, seemingly random events that could crop up at any time. Worse, the customer couldn’t get to some of its data. The developers and some of the Operations team got together to investigate the errors while other Operations people started pulling backups and restoring to staging servers; the restores appeared to import without problems but we’d still see errors and couldn’t find all of the data.

After much consternation we figured out that the event that ended up sneaking past our backup protection was block corruption on one of the table datafiles in Oracle. We resolved this with the assistance of expensive conversations with Oracle, who showed us how to recreate the lost data and how to setup RMAN to do backups as well as our exports. It turns out that RMAN allows you to restore individual blocks in the event that a corrupt block is detected, and RMAN itself can find corrupt blocks when backups are taken. Oracle also offers a command line tool, dbv, which will verify datafiles from the command line.

The incident analysis of this event was very educational but also frustrating. The frustration was because we felt that no matter the changes we made to our backups, no matter the monitoring implemented, no matter how much we learned about Oracle, we could easily miss one of the many complex failure modes in this complex system.

Another common frustration in Operations is that we only have so much time and money to spend on failure mitigation. And, even if we had unlimited resources to monitor everything, we may still miss the important events because of the poor signal to noise ratio that comes with over-monitoring.

How do we choose what to worry about?

One of my favorite ways to figure out what is important to spend operations time and money is to do a Failure Mode and Effect Analysis (FMEA). FMEA is a technique I learned about while studying six sigma while working at PowerSteering Software. The purpose of FMEA is to understand failure modes and their risks. FMEA is a great tool for Software Operations because it will allow the Operations team to understand where best to assign limited resources for topics such as monitoring, high availability, performance, backup, and disaster recovery.

FMEA assigns severity, detectability, and frequency values (from 1 to 10) to each possible failure mode of a specific system function. For severity, a value of 1 means no impact, while a 10 means total system meltdown, for detectability a value of 1 means that the event is always detectable while a 10 means the event is never detectable, and for frequency, a value of 1 means that the event never happens while a 10 means the event occurs continuously. The product of those three values is a Risk Priority Number (RPN), which is used as a sorting key. Failure modes with a higher RPN have more serious impact than modes with a lower RPN. In other words, you should be working to mitigate failure modes with high RPNs.

The initial list of failure modes is brainstormed by the all of the groups involved in the software, and then the same groups assign the scores for severity, detectability, and frequency.

Here’s an example FMEA for a hypothetical database storage subsystem. There would typically be a key that goes along with an FMEA that explains what each severity, detectability and frequency value means so the user can understand the assigned ratings, but for the sake of space I’ve not included a key in the example below.

FMEA - Database Storage

Failure Mode Effect of Failure Severity Rating (S) Potential Cause of Failure Occurrence Rating (O) Possible Means of Detection Detection Rating (D) RPN (S x O x D)Preventative Actions to be Taken
Data CorruptionUser data incorrect or lost, application errors8Damaged cables, memory errors, firmware bugs2Custom alerting scripts, datafile consistency checkers7112Regular cable examination, clean up cable runs in hosting center, subscribe to vendor notification of storage firmware issues
High read/write latencydegraded user experience, failed transactions due to timeouts4Storage block contention, too few spindles to sustain read/write rates, disk cache too small6latency monitors, OS tools, vendor storage network tools372Change RAID level to better distribute load on the storage array, spread datafiles across LUNs, use table partitioning
Disk FullWrites fail, possible database unavailable7Sudden growth in writes, admin error3Nagios alerts, OEM alerts121Switch from filesystem-based datafile storage to ASM to avoid user access to datafiles, QOS write rates
Storage UnavailableDatabase unavailable10SAN switch down, cable unplugged2OEM alerts, various Nagios alerts, user notification120Fully redundant SAN fabric
An example of an FMEA for database storage failures.

This is an example that is illustrative of what may be discovered when you create an FMEA. In the table above we can see that latency and data corruption are important areas of concerns because they occur often (the high latency case) or because they are very difficult to detect and are very severe (the corruption case). These two top-risk items warrant an investment in failure mitigation and detection.

That’s not say the items with the highest RPN are the only things you should spend time on. The FMEA doesn’t capture the economics of mitigation nor do they always reflect the true cost of loss of service. In the above example, “disk full” may appear to the customer as the same failure as “storage unavailable”. Also, there may be close to zero cost for mitigating disk space issues (monitors for disk space are available in all monitoring packages), but a significant amount of cost in mitigating latency problems (you may have to re-architect the application). FMEA should be used alongside common sense and experience when choosing risk mitigation strategies.