What Does A Sysadmin Look Like In 10 Years?

Crystal ball image courtesy of Daniel De Jager

At Boston DevOpsDays 2011 last week I hosted an open spaces discussion during which we prognosticated on what the everyday sysadmin would look like in 10 years’ time.

A lively discussion followed and out of it we came up with a few key predictions that we all loosely agreed on; the future sysadmin will:

Write code.

We all agreed that there’s little place for a future sysadmin who can’t, at a minimum, write scripts, and ideally write and understand code in a non-shell programming language.

Do a lot of data analytics.

We thought that any future sysadmin will be much more of a data-driven engineer; they’ll build systems based on engineering not gut feelings or “because it worked last time”. The future sysadmin can do math because the future sysadmin does more science.

Work on a higher level of abstraction.

The future sysadmin needs to build complex systems by treating what we now think of as systems as building blocks. They’ll not think as much about network ports, IP addresses and machines but instead think about the interactions of applications and instances of those applications.

Focus on service delivery.

The future sysadmin has monitors that say whether the service is providing the business function, not whether the host is pingable. The future sysadmin understands what the service provides and how to make sure that service is being delivered reliably.
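To make the difference concrete, here is a minimal sketch of a check that asks "did the service do its job?" rather than "is the host up?". Everything named here is an assumption for illustration: the `check_booking_service` function, the idea of a fare search, and the `fares` field in the response are hypothetical, and the `fetch` argument stands in for whatever real request your monitoring would make.

```python
import json
from typing import Callable, Tuple

# Hypothetical example: the function name, the "fares" field and the fake
# responses below are illustrative assumptions, not a real monitoring API.


def check_booking_service(fetch: Callable[[], Tuple[int, str]]) -> Tuple[bool, str]:
    """Return (healthy, reason) based on whether the service did useful work."""
    try:
        status, body = fetch()
    except Exception as exc:  # connection refused, timeout, and so on
        return False, f"request failed: {exc}"
    if status != 200:
        return False, f"unexpected HTTP status {status}"
    try:
        payload = json.loads(body)
    except ValueError:
        return False, "response was not valid JSON"
    fares = payload.get("fares") if isinstance(payload, dict) else None
    if not fares:
        # The host is up and answering, but the business function failed.
        return False, "service answered but returned no fares"
    return True, f"{len(fares)} fares returned"


# Stand-ins for real HTTP calls so the sketch runs without a network.
def fake_ok() -> Tuple[int, str]:
    return 200, json.dumps({"fares": [{"price": 120}, {"price": 95}]})


def fake_empty() -> Tuple[int, str]:
    return 200, json.dumps({"fares": []})


print(check_booking_service(fake_ok))
print(check_booking_service(fake_empty))
```

The second case is the interesting one: a ping check (and even a plain HTTP 200 check) would call that service healthy, while this check reports that it is not delivering anything.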

Be on call (with her developer friends) and own the problem.

She’s the crisis manager and may have the best understanding of the data as they relate to the system as a whole. She’s got the full stack view that gives her the credibility to be the first line of defense in a tough situation.

Lead the root cause analysis.

The future sysadmin, if anything, has a more complex job because they need to bring together not just computing resources but data and human resources. The future systems administrator is more about the system and less about the administration.

What do you predict?

With the rise of PaaS and IaaS, we see movement away from the traditional role of the sysadmin toward a more app-focused, development-aware admin: someone who understands the business needs and the full technology stack. I feel like the leading edge of administration is already at this point; the people you see speaking at all the conferences and writing all the books already do these “future sysadmin” tasks. What do you think about these predictions? What do you think the future sysadmin will look like?

Recent Readings

Web

Fixing UNIX Filenames – An interesting discussion of the non-standard handling of the “standards” in UNIX filesystem naming, such as newlines in names.

An ‘Accordion’ of Wood and Glass – A look at where all the money you spent on your calculus textbook went.

Strace – the Sysadmin’s Microscope – An excellent article on using strace(1), the best way to find out what your process is really doing in Linux.

Print

Garner’s Modern American Usage – Okay, I’ve not read the whole thing, but I keep it on my desk at work to figure out if my word choice is correct, or if it makes me sound like an asshole (and, all of the usage examples are from recent media, including great references to Superbad when needed).

Web Operations – Allspaw is listed as the author but this book is written by many people in the DevOps community (and hey, I reviewed Patrick DeBois’ chapter on monitoring). This book offers excellent practical advice to people doing web operations.

Rework – The 37signals guys wrote this book about the lessons they’ve learned running a successful startup, and while the book is pretentious, it does have solid advice (for example: don’t hire fast; don’t worry about being “professional”, build software you want to use).

Boston DevOps @ MS NERD

Boston DevOps meeting regular Vladimir Vuksan has gone and done a great thing – he’s set up the next meeting at Microsoft NERD, which is at One Memorial Drive in Cambridge, MA (about a mile from where we usually meet).

Vladimir has set up a registration link here.

This promises to be a great show; Vlad and Jeff Buchbinder will be giving a presentation on lessons learned while reengineering deployments at their companies. You can read more about what is on the agenda on this post on Vlad’s blog.

See you there – Tuesday, August 3rd, 2010, from 6pm until 8pm.

Learning From My Hiring Mistakes

The last 8 or 9 years of my career have been spent managing operations groups at various organizations. I think one of the most difficult tasks for any manager is hiring the right people. Many blog posts have been written about how to hire developers and operations people, and I won’t rehash what they’ve said here, but I will add to the conversations with some of my own experiences and some of the lessons I’ve learned from my mistakes.

I’ve mostly hired “DevOps” people

Almost all of my hires have been for teams that are either too small to have defined boundaries or for teams that have charters that involve problems that don’t fit into traditional operations roles (“System Engineer II”). I’m much more interested in building a team of people that will own problems and find solutions regardless of their job description and have therefore tried to avoid creating jobs that rely on a specific skill. I want people who will go outside their role boundaries and marshal the technical resources needed to solve a problem correctly.

I want to hire people to do people work and let computers do computer work. People are good at creative problem solving; they’re good at understanding subtlety and context and intention. Computers are very good at doing the same thing over and over again very quickly. In other words, if it can be automated, I want to hire the people who will automate it and move on to harder, more human problems. I want to hire smart people who can learn a particular tool if necessary, not hire people just because they’ve used that tool before.

That is not to say I like to hire purely creative people with no thought to their technical ability, but rather that I don’t hire based on knowledge of a specific technology. For example, I may run postfix on my mail servers but I’d still hire someone with sendmail experience if that candidate understood internet mail routing really well.

On the other hand, there’s a baseline level of technical ability you need to do your job well. I’m not hiring people to teach them Linux. You need to be experienced with the basics so that you can get to work building better tools and solving hard problems for the company that hired you. We’ve got work to do, after all.

I’ve made a lot of mistakes in hiring

Here are just a few of the many mistakes I’ve made, and some of the lessons I’ve learned from those mistakes:

Rushing a hire because of a current need (a specific project, a needed skill on a team, a work overload).

Your best admin is quitting work to become a piano tuner and he’s taking his storage management expertise with him. You need a replacement stat. What you should not do is rush to hire the first admin with storage experience on his or her resume, because it is inevitable that the hire will work on a diverse set of technologies in their career at your company. Spend the time to hire an admin who you’ll later think of as your best admin.

Some tips for speeding up hiring without sacrificing quality:

  • Fast track your internal references, especially from your most valued employees. Smart people are friends with other smart people.
  • Use phone screens to test candidates for the requisite technical ability; phone screens should have questions with specific answers on topics the candidate simply must know. You’ll weed out a lot of unqualified people in a 30 minute phone call.
  • If they pass the first phone screen, do a 45 minute phone interview; discuss an interesting topic from their resume.
  • Don’t be afraid to walk someone out early; it’ll save you time and they probably already know the interview isn’t going well.
  • ITA is well known for our pre-interview programming puzzles, and there’s legitimate debate about this process, but if you don’t require a puzzle at least ask for a code sample or equivalent. Like the technical phone screen, this process can help identify candidates for in-person interviews.

Not interviewing thoroughly enough.

This is especially important in a small company where the success or failure of a hire can have a drastic effect on the health of the company. Have the candidate talk to at least 5 people, and make sure most of them are technical people. It’s also important to discuss who will talk about what in each interview before the interview. Otherwise, you’ll all talk to the candidate about the same interesting bit on his or her resume. Don’t hesitate to bring the candidate back for another round of interviews if you think it is warranted.

Compromising on ability.

I made this mistake at a previous job after being unable to find the correct person for the job for months. I eventually hired a person simply to get a body in, and I knew the person wasn’t good enough for the job. Sure enough we had to part ways after a few months. Firing someone is expensive and time consuming, and you’re left back where you started. Part of the problem with this hire was that my budget wasn’t correct for the skill set I wanted, but in the end it was me compromising that led to the failure.

Some lessons from this mistake:

  • Fight for the correct budget for the skills you need.
  • Make sure everyone agrees the person is correct for the job.
  • If you haven’t hired for a position you’ve had open for a long time, perhaps you need to re-evaluate what you are looking for and if you really need the hire.

Thinking I’ll have the time to train someone in skills fundamental to the job.

I feel worst about this mistake because I should have known better; I was already overloaded at work, so what made me think I could train someone in my zero free time? Even if you think you have the free time it is inevitable that work will take back that free time.

What didn’t work for me may work for you…

… but it also may not, so be careful. Hopefully you’ll learn from my mistakes in any future hiring you do.

OpsCamp Boston 2010

This past Thursday I attended OpsCamp Boston 2010, an unconference themed around topics in systems operations. I was interested in meeting several of the people I know only from Twitter and also making new friends from the greater Boston Operations community. Microsoft NERD generously provided the use of their space for the conference.

Conference Structure

The structure of OpsCamp was unlike any of my other unconference experiences (almost all BarCamps) in that, before the group decided on what topics to discuss, the sponsors had an opportunity to give lightning talks. I’ve got no issue with sponsors giving lightning talks, but the organizers arranged it such that all attendees of OpsCamp were in the room and essentially required to watch the sponsor talks. The unconference Rule of Two Feet (reason #2 for attending unconferences on this page) was never explained to them.

Another interesting difference between OpsCamp and other unconferences was that directly after the lightning talks, but again before the community had a chance to choose topics for sessions, the organizers had an unpanel answer seven or eight questions from the audience. The questions covered a variety of concerns and issues in systems operations today, and they ended up leading to suggested sessions when we were finally able to decide what the unconference sessions would cover.

Unpanel

The questions for the unpanel, as my notes recall them:

  1. What happens when all the ops jobs move to India?
  2. How would cloud adoption affect the outsourcing of ops?
  3. What are the costs of ops and IT? What is the trend for that? What is the correct ratio of IT assets to people administrating them?
  4. Why won’t my ops people let me self-provision like I can in the cloud?
  5. What is the connection between the talks we just heard and the cloud?
  6. Should dedicated infrastructure and public cloud resources be centrally managed?
  7. Patch or Rebuild?

Questions 1, 2 and 3 became a single breakout session (more on that in a moment), and 4 became another breakout session. Question 5 was addressed by Cory from Dyn, who said that he felt that the policy and process haven’t changed but the method has, in that we now deploy via APIs instead of physical hardware. For question 6, many people felt that you should mix your infrastructure between physical hardware, virtual hardware and the cloud, and manage it centrally, and it was pointed out that rPath, Opscode and others have technology to help do this. Question 7 was the last breakout session of the camp.

Breakout: Will Ops Get Outsourced?

This breakout came out of the heated discussions on whether or not Ops people are going to be outsourced and the cost of Operations in general (questions 1, 2 & 3, above), if new Ops people are less skilled, and other ideas that mirror the offshore development discussions of ten years ago. There was a pretty obvious age divide, which is a hot topic in technology in general, and a lot of discussion on the evolving nature of what an Ops person is. There was a lot of respectful arguing in this breakout session, and I think anyone who attended this session left thinking a lot about their own future in Operations.

In parallel with this session was a session on what tools can be used to build a cloud.

Breakout: Why can’t I just deploy to the cloud?

I ended up moderating this breakout panel because the gentleman who asked question 4 (above) had left OpsCamp early.

Why can’t a developer simply pull out her credit card and put her product in production? Perhaps even in a large company with an established IT department? I’m a big fan of everyone in the organization working towards delivering the service, not bickering over domain, so I’m in support of questions along these lines. In that spirit I renamed this breakout “Why are developers trying to ruin the business? ~or~ Why are Ops people assholes?” in the hopes of bolstering attendance. Everybody at OpsCamp ended up going to this panel, so score one point for inflammatory panel titles.

There’s no short answer, and we lost track of the original question several times, but the overall idea is that process and repeatability increase the chance of successful service delivery, and often developers overlook these issues when creating software. That said, I think the Operations department should do everything it can to bring Ops processes to the developer (and make sure that Operations is built into the product, not bolted on later). Work together despite often having seemingly conflicting goals.

Breakout: Patch Or Rebuild?

The last session of the evening was a discussion of when it is okay to simply rebuild from an image instead of patching the running software. There were a lot of opinions on this, but I didn’t have too much to add because I think the right answer largely depends on the situation.

Networking

After the sessions, many of us retired to a bar in Kendall Square to have drinks (graciously provided by the folks from Dyn) and chat.

Final Thoughts

While I had some issues with the structure of OpsCamp, I enjoyed the people and the discussions that we had. I do wish that the organizers had encouraged people to post possible presentation topics on a wiki ahead of the camp (as was done for BarCamp 5) because I think that encourages people to prepare presentations on topics and helps avoid every session being a discussion.

I’ve also uploaded the raw notes for your viewing pleasure.

If you attended OpsCamp Boston I encourage you to come out to the Boston DevOps Meetups, the next of which is Tuesday, May 4th.

Recent Readings

Web

Devops Homebrew – Vladimir Vuksan is a regular at the Boston DevOps Meetups and I was happy to see this post on his previous job’s release process. The post is an excellent case study in DevOps in deployment.

An Agile Architectural Epic Kanban System – Part 1 – There’s a lot of room for Kanban and Agile in DevOps initiatives, and I think many people are already headed in that direction (I’ve started doing Kanban with the operations teams at ITA; there’ll be a post on how this is working in a few months). Having the developers and ops people use the same process management technique helps improve communication all around, and Kanban gives excellent visibility into what is happening now in an organization. The article above discusses using Kanban to give visibility into the process of architectural decision making, a process which is often invisible to developers or ops people.

Print

The Visible Ops Handbook – Tom Zauli from rPath brought me a copy of this at the last Boston DevOps Meetup, and I’m about halfway through. I think the practical steps recommended in Visible Ops would be very effective to gain control of an operations organization that is underwater, and after control is regained you can start automating as much as possible.

The Checklist Manifesto – If you haven’t read Complications and Better you should stop reading this and pick up those two books right now. Dr. Gawande’s analytical look at process improvement in medicine (or lack thereof) is readable and it is easy to find parallels between his observations about medicine and any other industry. Both books are highly recommended for people who care about honest self reflection and evolutionary improvement.

The thesis of Dr. Gawande’s new book couldn’t be simpler: checklists prevent errors. He backs this up with examples from many fields and the argument really is compelling; I can think of many cases at work where a checklist has saved the day. I think the DevOps trend of automating as much as possible, especially around deployment, is a way of encoding checklists. At ITA our deployment process went from a checklist that took a day or more to complete manually to code that performs the same checklist in under 45 minutes – that’s 45 minutes for an entire airline reservation system.
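As a rough illustration of what "encoding a checklist" can mean, here is a minimal sketch in Python. It is not ITA’s deployment code; the `run_checklist` function and the step names are hypothetical, and the lambdas stand in for real actions such as taking a backup or running a smoke test. The idea is simply that each checklist item becomes a named step, steps run in order, and the run stops at the first item that fails.

```python
from typing import Callable, List, Tuple

# Hypothetical sketch: step names and actions are placeholders, not a real
# deployment tool. Each step is (description, action returning True on success).
Step = Tuple[str, Callable[[], bool]]


def run_checklist(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first failure.

    Returns (all_passed, names_of_completed_steps).
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            ok = action()
        except Exception:
            ok = False  # treat an exception as a failed checklist item
        if not ok:
            return False, completed
        completed.append(name)
    return True, completed


good_deploy: List[Step] = [
    ("back up database", lambda: True),
    ("push new build", lambda: True),
    ("run smoke test", lambda: True),
]

bad_deploy: List[Step] = [
    ("back up database", lambda: True),
    ("push new build", lambda: False),  # simulated failure
    ("run smoke test", lambda: True),   # never reached
]

print(run_checklist(good_deploy))
print(run_checklist(bad_deploy))
```

Unlike a paper checklist, the code can’t skip an item or run items out of order, and the list of completed steps doubles as a record of how far a failed deployment got.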