Devops Homebrew – Vladimir Vuksan is a regular at the Boston DevOps Meetups and I was happy to see this post on his previous job’s release process. The post is an excellent case study in applying DevOps to deployment.
An Agile Architectural Epic Kanban System – Part 1 – There’s a lot of room for Kanban and Agile in DevOps initiatives, and I think many people are already headed in that direction (I’ve started doing Kanban with the operations teams at ITA; there will be a post on how this is working in a few months). Having the developers and ops people use the same process management technique helps improve communication all around, and Kanban gives excellent visibility into what is happening now in an organization. The article above discusses using Kanban to give visibility into the process of architectural decision making, a process which is often invisible to developers or ops people.
The Visible Ops Handbook – Tom Zauli from rPath brought me a copy of this at the last Boston DevOps Meetup, and I’m about halfway through. I think the practical steps recommended in Visible Ops would be very effective to gain control of an operations organization that is underwater, and after control is regained you can start automating as much as possible.
The Checklist Manifesto – If you haven’t read Complications and Better you should stop reading this and pick up those two books right now. Dr. Gawande’s analytical look at process improvement in medicine (or lack thereof) is readable and it is easy to find parallels between his observations about medicine and any other industry. Both books are highly recommended for people who care about honest self reflection and evolutionary improvement.
The thesis of Dr. Gawande’s new book couldn’t be simpler: checklists prevent errors. He backs this up with examples from many fields and the argument really is compelling; I can think of many cases at work where a checklist has saved the day. I think the DevOps trend of automating as much as possible, especially around deployment, is a way of encoding checklists. At ITA our deployment process went from a checklist that took a day or more to complete manually to code that performs the same checklist in under 45 minutes – that’s 45 minutes for an entire airline reservation system.
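The idea of “encoding a checklist” can be illustrated with a toy sketch in Python. The step names below are hypothetical, not ITA’s actual deployment process:

```python
# A toy sketch of a checklist encoded as code; the step names are
# hypothetical examples, not a real deployment process.
def run_checklist(steps):
    """Run each (description, check) pair in order, failing fast the
    way a person stops when a checklist line cannot be ticked off."""
    completed = []
    for description, check in steps:
        if not check():
            raise RuntimeError(f"checklist step failed: {description}")
        completed.append(description)
        print(f"[ok] {description}")
    return completed

steps = [
    ("database schema migrated", lambda: True),
    ("application servers healthy", lambda: True),
    ("smoke tests pass", lambda: True),
]
run_checklist(steps)
```

The payoff over a paper checklist is that the checks themselves are executable, so the list can’t silently drift out of date and a failed item stops the deployment immediately.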
A while back I was working at a company during and after the purchase of a competitor. The competitor also made database-backed web sites, but ran them on Oracle, while our company was an MS SQL Server shop. This was my first experience administering Oracle, and the Operations team wasn’t given much time to learn the ins and outs of running an Oracle server. Furthermore, we were told that the competitor’s software would be migrated off Oracle in a month or two, so there wasn’t any money to spend on Oracle training.
Our backup provider had an MS SQL Server plug-in that did over-the-wire replication as well as the usual full backups and transaction log backups. We could restore our MS SQL Server instances to almost any point in time if needed. This provider didn’t have an Oracle plug-in, so instead we used Oracle’s exp/imp utility to make full exports, which we then backed up with the provider’s OS-level tools. We knew about RMAN (Oracle’s backup utility), but we only used it for cleaning up archive logs. We tested restores regularly, and in general had very good backup practices.
One of the competitor’s customers, who was now our customer, started calling the help desk complaining about failures in the software. The errors were all over the place, seemingly random events that could crop up at any time. Worse, the customer couldn’t get to some of its data. The developers and some of the Operations team got together to investigate the errors while other Operations people started pulling backups and restoring to staging servers; the restores appeared to import without problems but we’d still see errors and couldn’t find all of the data.
After much consternation we figured out that the event that ended up sneaking past our backup protection was block corruption on one of the table datafiles in Oracle. We resolved this with the assistance of expensive conversations with Oracle, who showed us how to recreate the lost data and how to set up RMAN to do backups as well as our exports. It turns out that RMAN allows you to restore individual blocks in the event that a corrupt block is detected, and RMAN itself can find corrupt blocks when backups are taken. Oracle also offers a command line tool, dbv, which will verify datafiles from the command line.
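For readers who haven’t used these tools, the checks look roughly like this. Exact syntax varies by Oracle version, and the file names and block numbers here are hypothetical:

```
# Verify a datafile from the command line with DBVERIFY
dbv file=/u01/oradata/PROD/users01.dbf blocksize=8192

# In RMAN, scan the database for corrupt blocks without
# taking an actual backup
RMAN> BACKUP VALIDATE CHECK LOGICAL DATABASE;

# Repair an individual corrupt block from backup
# (BLOCKRECOVER in older releases; RECOVER ... BLOCK in 11g+)
RMAN> BLOCKRECOVER DATAFILE 7 BLOCK 121;
```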
The incident analysis of this event was very educational but also frustrating. The frustration came from the feeling that no matter the changes we made to our backups, no matter the monitoring implemented, no matter how much we learned about Oracle, we could easily miss another of the many failure modes in this complex system.
Another common frustration in Operations is that we only have so much time and money to spend on failure mitigation. And, even if we had unlimited resources to monitor everything, we may still miss the important events because of the poor signal to noise ratio that comes with over-monitoring.
How do we choose what to worry about?
One of my favorite ways to figure out where to spend operations time and money is to do a Failure Mode and Effects Analysis (FMEA). FMEA is a technique I learned while studying Six Sigma at PowerSteering Software. The purpose of FMEA is to understand failure modes and their risks. FMEA is a great tool for Software Operations because it allows the Operations team to understand where best to assign limited resources for topics such as monitoring, high availability, performance, backup, and disaster recovery.
FMEA assigns severity, detectability, and frequency values (from 1 to 10) to each possible failure mode of a specific system function. For severity, 1 means no impact and 10 means total system meltdown; for detectability, 1 means the event is always detectable and 10 means it is never detectable; for frequency, 1 means the event never happens and 10 means it occurs continuously. The product of those three values is the Risk Priority Number (RPN), which is used as a sorting key: failure modes with a higher RPN carry more risk than modes with a lower RPN. In other words, you should be working to mitigate the failure modes with the highest RPNs.
The initial list of failure modes is brainstormed by all of the groups involved in the software, and then the same groups assign the scores for severity, detectability, and frequency.
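To make the arithmetic concrete, here is a minimal Python sketch of RPN scoring and ranking. The failure modes and ratings are hypothetical, loosely mirroring the database storage example below:

```python
# A minimal sketch of RPN scoring; the failure modes and their 1-10
# ratings below are hypothetical illustrations, not real measurements.
def rpn(severity, detectability, frequency):
    """Risk Priority Number: the product of the three 1-10 ratings."""
    for value in (severity, detectability, frequency):
        if not 1 <= value <= 10:
            raise ValueError("ratings must be between 1 and 10")
    return severity * detectability * frequency

failure_modes = [
    # (name, severity, detectability, frequency)
    ("Block corruption", 9, 9, 2),
    ("High read/write latency", 5, 3, 8),
    ("Disk full", 7, 1, 3),
    ("Storage unavailable", 9, 2, 1),
]

# Sort by RPN, highest risk first -- the modes to mitigate first.
ranked = sorted(failure_modes, key=lambda fm: rpn(*fm[1:]), reverse=True)
for name, s, d, f in ranked:
    print(f"{name}: RPN={rpn(s, d, f)}")
```

Note how a hard-to-detect, severe but rare event (corruption, 9 × 9 × 2 = 162) can outrank a frequent but easily detected one, which is exactly the kind of insight the sorting is meant to surface.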
Here’s an example FMEA for a hypothetical database storage subsystem. There would typically be a key that goes along with an FMEA that explains what each severity, detectability and frequency value means so the user can understand the assigned ratings, but for the sake of space I’ve not included a key in the example below.
| Failure mode | Effects | Causes | Detection | Mitigation |
|---|---|---|---|---|
| Block corruption in a datafile | Seemingly random application errors, inaccessible data | Faulty cabling, storage firmware bugs | RMAN block checking during backups, dbv datafile verification | Regular cable examination, clean up cable runs in hosting center, subscribe to vendor notification of storage firmware issues |
| High read/write latency | Degraded user experience, failed transactions due to timeouts | Storage block contention, too few spindles to sustain read/write rates, disk cache too small | Latency monitors, OS tools, vendor storage network tools | Change RAID level to better distribute load on the storage array, spread datafiles across LUNs, use table partitioning |
| Disk full | Writes fail, possibly database unavailable | Sudden growth in writes, admin error | Nagios alerts, OEM alerts | Switch from filesystem-based datafile storage to ASM to avoid user access to datafiles, QOS write rates |
| Storage unavailable | Database unavailable | SAN switch down, cable unplugged | OEM alerts, various Nagios alerts, user notification | Fully redundant SAN fabric |

An example of an FMEA for database storage failures.
This example is illustrative of what may be discovered when you create an FMEA. In the table above we can see that latency and data corruption are important areas of concern, either because they occur often (the high-latency case) or because they are very difficult to detect and very severe (the corruption case). These two top-risk items warrant an investment in failure mitigation and detection.
That’s not to say the items with the highest RPNs are the only things you should spend time on. The FMEA doesn’t capture the economics of mitigation, nor does it always reflect the true cost of a loss of service. In the example above, “disk full” may appear to the customer as the same failure as “storage unavailable”. Also, there may be close to zero cost to mitigating disk space issues (disk space monitors are available in all monitoring packages), but a significant cost to mitigating latency problems (you may have to re-architect the application). FMEA should be used alongside common sense and experience when choosing risk mitigation strategies.