Recent Readings

The cloud is great. Stop the hype. – This is an excellent article on what cloud computing is and isn’t, and on when the cloud is the correct technical or architectural choice. I had a long post planned on the overloaded term “cloud computing”, but OmniTI covers all the important points in this article. As with any new approach to infrastructure deployment that promises quick provisioning of services, people often forget that all of that infrastructure still needs to be managed. There are a lot of good tools coming out to help with that management, but none of them make it zero cost.

Dissecting Today’s Internet Traffic Spikes – With the above article on cloud computing and this article on the sudden nature of internet traffic spikes, I’m becoming an OmniTI fanboy. Part of my job is to worry about designing and provisioning correctly for sudden changes in traffic patterns, and Theo is correct that you have to design for spikes, not react to them.

Kanban For Sysadmins – I’ve started doing Kanban at work for one of our Operations teams and have been really pleased with the results so far — so much so that we’re rolling it out for another team this week and hopefully to the rest of the department over the next few weeks. We track our work in Request Tracker, but it is hard to know (1) what is being worked on right now, and (2) how much throughput a team has. Kanban tells us both, and it also lets us avoid the entire topic of prioritizing future work: we only prioritize when we are ready to start new work. I’ll post a follow-up once we’re further along in our Kanban experiment.

Hello From A libc-free World! – Have you ever wondered what, exactly, your “Hello, world!” program does? Jessica at Ksplice dives into what happens when you build a super-simple C program (it’s more complicated than you think!).

Data-Intensive Text Processing With MapReduce – A freely available PDF draft of an upcoming book on using MapReduce to process large text datasets. One of the cool things we’ve done at ITA is add tracking data to every request that passes through our reservation system, and we output this tracking information in each log entry in every component we’ve written. The structure of this tracking data is such that if you aggregate the logs from all of the components, you can easily construct a graph of a request’s path through the reservation system (including the asynchronous calls). The problem now is searching all of that log data, and I’ve been curious about MapReduce as it applies to this sort of data mining.
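
To make the idea concrete, here is a toy sketch of the kind of MapReduce job I have in mind: the map phase keys each log entry by its tracking ID, and the reduce phase rebuilds the request’s path in time order. The log format here is hypothetical, and the in-memory runner just stands in for a real framework like Hadoop.

```python
# Toy MapReduce pass: group log entries by tracking ID and rebuild each
# request's path through the system. The log format (tracking_id,
# timestamp, component) is hypothetical; a real job would run on
# something like Hadoop rather than this in-memory runner.
from collections import defaultdict

log_lines = [
    "req-42 1 web-frontend",
    "req-42 2 availability-service",
    "req-7  1 web-frontend",
    "req-42 3 pricing-engine",
]

def map_phase(line):
    """Emit (tracking_id, (timestamp, component)) for one log entry."""
    tracking_id, timestamp, component = line.split()
    yield tracking_id, (int(timestamp), component)

def reduce_phase(tracking_id, entries):
    """Assemble one request's time-ordered path through the components."""
    path = [component for _, component in sorted(entries)]
    return tracking_id, " -> ".join(path)

# Shuffle step: collect all mapped values under their key.
grouped = defaultdict(list)
for line in log_lines:
    for key, value in map_phase(line):
        grouped[key].append(value)

for key in sorted(grouped):
    print("%s: %s" % reduce_phase(key, grouped[key]))
# req-42: web-frontend -> availability-service -> pricing-engine
# req-7: web-frontend
```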

Failure Mode And Effect Analysis For Software Operations

A while back I was working at a company during and after the purchase of a competitor. The competitor also made database-backed web sites, but ran them on Oracle, while our company was an MS SQL Server shop. This was my first experience administering Oracle, and the Operations team wasn’t given much time to learn the ins and outs of running an Oracle server. Furthermore, we were told that the competitor’s software would be migrated off Oracle in a month or two, so there wasn’t any money to spend on Oracle training.

Our backup provider had an MS SQL Server plug-in that did over-the-wire replication as well as the usual full backups and transaction log backups. We could restore our MS SQL Server instances to almost any point in time if needed. This provider didn’t have an Oracle plug-in, so instead we used Oracle’s exp/imp utility to make full exports, which we then backed up with the provider’s OS-level tools. We knew about RMAN (Oracle’s backup utility), but we only used it for cleaning up archive logs. We tested restores regularly, and in general had very good backup practices.
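
For a sense of what that export job looked like in spirit, here is a minimal sketch; the paths and credentials are placeholders, not details of our actual setup.

```python
#!/usr/bin/env python
"""Sketch of a nightly full-export job: take a full export with
Oracle's exp utility and leave the dump file where the backup
provider's OS tools will pick it up. DUMP_DIR and the credentials
are placeholders."""
import datetime
import subprocess

DUMP_DIR = "/backups/oracle"  # hypothetical dump location

def full_export():
    stamp = datetime.date.today().isoformat()
    dump = "%s/full-%s.dmp" % (DUMP_DIR, stamp)
    log = "%s/full-%s.log" % (DUMP_DIR, stamp)
    # exp's FULL/FILE/LOG parameters; credentials would normally come
    # from a locked-down config file, not the command line.
    subprocess.check_call([
        "exp", "system/CHANGE_ME",
        "FULL=y", "FILE=%s" % dump, "LOG=%s" % log,
    ])

if __name__ == "__main__":
    full_export()
```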

One of the competitor’s customers, who was now our customer, started calling the help desk complaining about failures in the software. The errors were all over the place, seemingly random events that could crop up at any time. Worse, the customer couldn’t get to some of its data. The developers and some of the Operations team got together to investigate the errors while other Operations people started pulling backups and restoring to staging servers; the restores appeared to import without problems, but we’d still see the errors and couldn’t find all of the data.

After much consternation we figured out that the event that had snuck past our backup protection was block corruption in one of the table datafiles in Oracle. We resolved this with the assistance of expensive conversations with Oracle, who showed us how to recreate the lost data and how to set up RMAN to do backups alongside our exports. It turns out that RMAN allows you to restore individual blocks when a corrupt block is detected, and RMAN itself can find corrupt blocks while taking backups. Oracle also offers a separate tool, dbv, which verifies datafiles from the command line.
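
As an illustration, here is a minimal sketch (not our actual script) of the kind of custom alerting check that could have caught this: it runs dbv across a set of datafiles and flags any with corrupt blocks. The datafile glob, block size, and the exact summary line being parsed are assumptions for the example, and dbv needs a working Oracle environment to run.

```python
#!/usr/bin/env python
"""Sketch: run Oracle's dbv (DBVERIFY) against a set of datafiles and
report any that contain corrupt blocks. Paths and block size are
assumptions, not details from our setup."""
import glob
import subprocess

DATAFILE_GLOB = "/u01/oradata/*.dbf"  # hypothetical datafile location
BLOCK_SIZE = 8192                     # assumed db_block_size

def corrupt_pages(datafile):
    """Run dbv on one datafile and return the corrupt-page count."""
    output = subprocess.check_output(
        ["dbv", "FILE=%s" % datafile, "BLOCKSIZE=%d" % BLOCK_SIZE],
        text=True,
    )
    for line in output.splitlines():
        # dbv's summary includes a line like:
        #   Total Pages Marked Corrupt   : 0
        if "Total Pages Marked Corrupt" in line:
            return int(line.split(":")[1])
    raise RuntimeError("unexpected dbv output for %s" % datafile)

if __name__ == "__main__":
    for path in sorted(glob.glob(DATAFILE_GLOB)):
        count = corrupt_pages(path)
        if count:
            print("CORRUPTION: %s has %d corrupt pages" % (path, count))
```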

The incident analysis of this event was very educational but also frustrating. The frustration came from the feeling that no matter what changes we made to our backups, no matter what monitoring we implemented, no matter how much we learned about Oracle, we could easily miss another of the many failure modes in this complex system.

Another common frustration in Operations is that we only have so much time and money to spend on failure mitigation. And even if we had unlimited resources to monitor everything, we might still miss the important events because of the poor signal-to-noise ratio that comes with over-monitoring.

How do we choose what to worry about?

One of my favorite ways to figure out where operations time and money are best spent is to do a Failure Mode and Effect Analysis (FMEA). FMEA is a technique I learned about while studying Six Sigma at PowerSteering Software. The purpose of FMEA is to understand failure modes and their risks, which makes it a great tool for Software Operations: it lets the Operations team see where best to assign limited resources for topics such as monitoring, high availability, performance, backup, and disaster recovery.

FMEA assigns severity, occurrence (frequency), and detectability values, each from 1 to 10, to every possible failure mode of a specific system function. For severity, 1 means no impact and 10 means total system meltdown; for detectability, 1 means the event is always detectable and 10 means it is never detectable; for occurrence, 1 means the event never happens and 10 means it occurs continuously. The product of the three values is the Risk Priority Number (RPN), which is used as a sorting key: failure modes with a higher RPN pose greater risk than modes with a lower RPN. In other words, you should work first on mitigating the failure modes with the highest RPNs.
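
To make the arithmetic concrete, here is a minimal Python sketch that computes RPNs and sorts failure modes from highest risk to lowest, using two of the rows from the example table below.

```python
# Minimal sketch: compute Risk Priority Numbers (RPN = S x O x D) and
# sort failure modes from highest risk to lowest. The two entries are
# taken from the example FMEA table below.
failure_modes = [
    # (name, severity, occurrence, detectability)
    ("Data Corruption", 8, 2, 7),
    ("Disk Full", 7, 3, 1),
]

def rpn(mode):
    """RPN is the product of the three 1-10 ratings."""
    name, s, o, d = mode
    return s * o * d

for mode in sorted(failure_modes, key=rpn, reverse=True):
    print("%-20s RPN = %d" % (mode[0], rpn(mode)))
# Data Corruption      RPN = 112
# Disk Full            RPN = 21
```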

The initial list of failure modes is brainstormed by all of the groups involved in the software, and then those same groups assign the scores for severity, occurrence, and detectability.

Here’s an example FMEA for a hypothetical database storage subsystem. An FMEA would typically come with a key that explains what each severity, occurrence, and detectability value means, so the reader can understand the assigned ratings; for the sake of space I’ve not included a key in the example below.

FMEA - Database Storage

| Failure Mode | Effect of Failure | Severity (S) | Potential Cause of Failure | Occurrence (O) | Possible Means of Detection | Detection (D) | RPN (S x O x D) | Preventative Actions to be Taken |
|---|---|---|---|---|---|---|---|---|
| Data Corruption | User data incorrect or lost, application errors | 8 | Damaged cables, memory errors, firmware bugs | 2 | Custom alerting scripts, datafile consistency checkers | 7 | 112 | Regular cable examination, clean up cable runs in hosting center, subscribe to vendor notification of storage firmware issues |
| High read/write latency | Degraded user experience, failed transactions due to timeouts | 4 | Storage block contention, too few spindles to sustain read/write rates, disk cache too small | 6 | Latency monitors, OS tools, vendor storage network tools | 3 | 72 | Change RAID level to better distribute load on the storage array, spread datafiles across LUNs, use table partitioning |
| Disk Full | Writes fail, possibly database unavailable | 7 | Sudden growth in writes, admin error | 3 | Nagios alerts, OEM alerts | 1 | 21 | Switch from filesystem-based datafile storage to ASM to avoid user access to datafiles, QoS write rates |
| Storage Unavailable | Database unavailable | 10 | SAN switch down, cable unplugged | 2 | OEM alerts, various Nagios alerts, user notification | 1 | 20 | Fully redundant SAN fabric |
An example of an FMEA for database storage failures.

This example is illustrative of what you may discover when you create an FMEA. In the table above we can see that latency and data corruption are the important areas of concern, either because the failure occurs often (the high-latency case) or because it is very difficult to detect and very severe (the corruption case). These two top-risk items warrant an investment in failure mitigation and detection.

That’s not to say the items with the highest RPN are the only things you should spend time on. The FMEA doesn’t capture the economics of mitigation, nor does it always reflect the true cost of a loss of service. In the above example, “disk full” may appear to the customer as the same failure as “storage unavailable”. Also, there may be close to zero cost in mitigating disk space issues (disk space monitors are available in every monitoring package) but significant cost in mitigating latency problems (you may have to re-architect the application). FMEA should be used alongside common sense and experience when choosing risk mitigation strategies.