Changing Culture To Enable DevOps
---------------------------------
John Allspaw, Israel Gat, Lee Thompson, Lloyd Taylor; Andrew Shafer moderating

Shifting cultures to embrace devops; moving Agile into operations as the next agile frontier

* Andrew asks how you get management to buy in to devops (the management believes in the siloed structure)
I: Agile structures let you measure consistently across development, but this is not happening in IT; the first thing to do is align the governance
Lee: nobody addresses non-functional aspects of the system
J: Domain expertise needs to be connected together; the groups need to be joined at the top (C level, VP level, etc)
Lly: At the highest level you want to make the most money possible; how do you optimize that? You need lots of trust between groups. If you don't see a lot of cross-team socialization, you are probably inefficient
Andrew: Conway's law: "software is a reflection of the communication patterns of the people who made it"
- [Andrew asks the panel] How can we build trust and communication?
Lly: Look at incentive conflict (ops: site up, dev: new features; these are at odds). Also product lines fight amongst each other; Google uses machine allocation to determine this - so you need control like this to concretely value products
I: Ask "is the culture in IT similar to the culture in revenue recognition?"
Audience question: "QA is being removed from the discussion, where does QA fit in?"
J: Just because there is no QA department doesn't mean QA isn't happening. There are many ways to assure quality. Some orgs push QA towards dev, product, ops, support. There is no QA department at Flickr or Etsy; you establish a culture of quality.
Lee: Facebook doesn't have a formal QA department
I: Look at an org's release engineering practices; often this is in compensation for lack of quality, so it takes a long time to release a problem. Look at this for a quick acid test of whether quality is a culture. Many, many companies lack unit tests.
Quality starts with development, and requires measurement.
Lly: 1) Who suffers if the code is bad? If the answer is ops, you need QA; if the dev suffers, the code will be good. "Pain driven development" 2) How hard is it to make a mistake? Poor APIs, poor docs, easy to make a mistake == lower quality 3) How quickly do you detect a screw up?
Andrew: What I'd like to talk about is a global community of practice; how can we leverage each other's tools?
J: There's two communities,
Audience: Why devops now?
Lee: More applications like LAMP/Rails, and cloud computing. If you only do functionality you don't understand non-functional requirements
Audience: Companies are running really lean, so as an ops person I don't have time; how do you carve out the time to address that culture question?
Lly: personalize your environment
I: Look at the problem in two ways: get more non-functional concerns into the standup meeting, but this can fall apart if ops has to go to 18 standups. The other approach is Kanban, using small pieces of flow [what does he mean by this?]
Audience: We're facing distributed organizations, fighting timezone and distance; how do we do this?
J: That's hard. Bring the people into the business conversations, make incidents/deployments/etc part of the business
Lly: The only way to build relationships is physical presence (right now), but the Facebook generation may be different
Audience: 5 years ago we asked how to implement agile in companies, now we are trying this next step. We use patterns to introduce new ideas (book by Linda Rising)
Audience: How do you teach this?
Audience: The ops team is outnumbered, how do we help this? "Come to our demos/scrums/etc" - I can't go to all of them or that will be all I do. So ops asks dev to sit with them; how do we get trust?
Andrew: If you can get a culture where there is only us, you win
I: Look at culture challenges in the context of scaling (you scale up, etc) - look at where you are on scale up, scale out, ??
assess the constraints and address them
Lly: You have a personal choice: you can be the victim ("they did this") or the owner ("I can make a change"). If you are categorizing them you are playing the victim. Stop doing this. Find the person you have the most trouble working with and go out to lunch with them, but don't talk business.
J: Answer the trust problem. Get some perspective on what pain everyone has. Lead by example. Show them that collaborating is the norm. This is your culture.
Andrew: There's a lot of patterns we can use from others, no reason to stop learning.

Infrastructure as code
----------------------
Adam Jacob, Luke Kanies, Erik Troan, Theo Schlossnagle, moderator: Patrick Debois

T: Manages around 10k machines, writing kernel, javascript, BGP, etc
A: CTO of Opscode
L: Puppet founder, Puppet Labs, talking about this for 5 years or so
E: All about using version control for sysadmins, wants to bring development process to engineering
P: What is the difference between this and scripts?
E: Now you have 10k machines instead of 10
L: Scale is very different; you can call Amazon and get 1000 servers, so now the admins are the limit
A: We're better at what we do now; we all had ~scripts/doit5.pl, now we use SVN
T: 5 years ago we did similar things. Instead of "why now": I can't do CDN cache configs, the cloud took away a lot, but I'm left with what I can control. Tools for people who did this 5 years ago are awesome, but not shared
L: We're stealing the tools developers had; sysadmins didn't really have a culture of toolsmithing; now we are building a philosophy of toolsmithing
P: Developers have cool things like unit tests; I don't see that coming into infra as code
L: That is the immaturity of the code; developers have had 5 decades, we have had 1 decade. We need a new way to do unit tests.
A:
T: End to end testing matters, unit tests can shove it; in the cloud things come up and down all the time.
A: half-configured systems are busted
E: We really mean declarative languages as code; if you aren't using puppet/chef/a declarative language you're doing it wrong
L: I disagree; unit/integration/end-to-end, the distinction is important. If you have extremely complicated code you need tests. Do things fail in intelligent ways? It takes time to make these tests in the ops world
T: DevOps is about pushing software engineering paradigms into the operations stack. Ops people need to have the stuff up all the time; this means you cannot exceed expectations. I'm not hearing enough about pushing ops practices into engineering orgs. "Why wasn't the software operable?"
Audience: You're trying to build a programming language, or you are building reusable components, and then you can compose these. Is there a divergent way to look at this?
A: I think this is the same thing. Infra as code is thinking about the whole stack, and thinking about it as an engineer.
T: Orchestration is puppet/chef; the deployment mechanics are reusable, the recipes are not. I don't want to write the scaffolding.
L: If that is true I quit. That would be failure. We're in an early phase, like when you wrote asm, and our infra as code is like asm. We can work through this failure. In Puppet we've been working to get through this for a long time.
A: Monocultures die; there are unique things we do. I don't know your system.
T: Had a conversation with net eng, talking about peering, anycast, BGP, and all groups did it differently and liked it. They'd be happy to use a deployment tool but not the same recipe. It is the same with application code.
L: If you are thinking about it that way there is a flaw. You install Firefox, but you can use different URLs with it. You can pull these assumptions out of your code; if people find this they help patch/make it more reusable.
A: We agree: there's a layer you can share, and there's a layer above this, but no magical unicorn.
E: I agree with Luke, it is a rising tide.
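[Editor's sketch of the point Luke and Erik are making about declarative code with the site-specific assumptions pulled out. This is plain Python, not Puppet or Chef syntax; `ConfigFile`, `ntp_recipe`, and the example hostnames are hypothetical names chosen for illustration. The resource declares a desired state and only acts when reality differs; the recipe is the shared layer, and the server list is the part each site supplies.]

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass(frozen=True)
class ConfigFile:
    """Hypothetical declarative resource: 'this file should have this content'."""
    path: Path
    content: str

    def converge(self) -> bool:
        """Make the file match the declaration. Returns True if a change was made."""
        if self.path.exists() and self.path.read_text() == self.content:
            return False  # already in the desired state, so do nothing
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(self.content)
        return True


def ntp_recipe(root: Path, servers: list[str]) -> ConfigFile:
    """Shared recipe: the structure is reusable, the server list is site-specific."""
    lines = [f"server {name} iburst" for name in servers]
    return ConfigFile(root / "etc" / "ntp.conf", "\n".join(lines) + "\n")


if __name__ == "__main__":
    # A temporary directory stands in for the machine's filesystem root.
    with tempfile.TemporaryDirectory() as tmp:
        resource = ntp_recipe(Path(tmp), ["ntp1.example.com", "ntp2.example.com"])
        print("changed:", resource.converge())  # first run writes the file
        print("changed:", resource.converge())  # second run finds nothing to do
```

[Running the same declaration twice is safe, which is the property that separates this style from a one-shot script like ~scripts/doit5.pl.]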
L: Your infrastructure is a unique snowman, but what makes it up is not so unique. You need to reuse the 80% we all share, and try to up that number.
T: It's 97/3, and the 3% is where I spend my time.
Audience: sudo for everyone?
T: PCI DSS loves that... some orgs are okay with it, some not. People get fired when you make errors. Some places have SOX, HIPAA, PCI, which make this hard.
A: I don't want CI for heart/lung machines.
Audience: We're trying to push TDD into infra deployment; deploy to make nagios green. Is this not common?
L: You are sadly too leading edge.
A: Rails Machine writes the monitoring first, and they set goals based on turning those monitors green (so as software starts working, more gets green).
T: If it isn't monitored, it isn't in production.
Audience: Next layer up, taking a linux distro, now we accept config management
T: You can't move the bar to the end; with config management you have to review, commit, alert the NOC, write rules, CRs, rollbacks, etc. That takes 16 hours; puppet/chef is only saving 20 minutes.
L: In general tools can't solve political problems, but many problems/auditing come from having bad tools. Puppet helps change audit/change control.
A: The auditing process gets built into the no-op puppet run.
Audience: sculpting versus coding
L: Infrastructure as code, like all great sayings, is a lie - how many want to log in again? [applause]
A: No no no, you need to log in.
T: There's a reason Sun implemented dtrace: you need to see what is going on. I want dtrace everywhere. I want distributed observability because things go wrong. You need to instrument your app.
A: I'm a tool builder, I love tools, they make me better as an engineer. Chef is smart because you're smart
Audience: We tend to forget why we do 16 hour change control processes. The reason we go through that is because we suck! I believe we can use puppet/chef to win back those 16 hours.
T: Limit liability by making small changes all the time, fail fast, fail never.
You have to use different policies at different risk levels.
Audience: Code as a boundary object; everyone can see the same thing and get a sense of what will happen because you have code.
Audience: Places have "good driver status"; the change control process exists because they didn't have a good history of success. [you need to measure your success, but how do you prove that the process isn't what makes the success?]

Ignite Talks
------------
Slides autoadvance every 15 seconds

Adam Rosen from kaChing: CD at kaChing
* connect people with investment managers
* Think of code like running a lean warehouse
* Deploying is the only way to know the code works
* Anyone can push a good build live
* Everyone sees all layers
* deploy "canaries", then direct traffic slowly to new services
* You can push to production from a commit message!
* use zookeeper all over the place for service discovery
* built splunk light in half a day; they enqueue stack traces to a rollup [queueing all the errors together - smart!]

Alex Honor from DTO Solutions: Simple release and deploy tool chain in action
* Devs need to specify packages, ops perform release and deploy
* release manager controls the release
* self service for everyone
* meta package - coupled set of packages
* allocate repos to various teams
* Runbook jobs
* lots of auditing via SCM revision mapping
* Example of a devops toolchain is on the web

Bitnami - Erica Brescia
Cloud is a trendy mainframe
* costs have dropped
* backups have evolved
* I/O systems have also evolved
* CPUs evolved
* distribution has evolved
* hardware has evolved

Clint Byrum, Canonical
API contract is dead
* Things move fast today; APIs used to be stable
* how do we deal with API change?
  test coverage, continuous integration
* Reality is not everyone has test coverage, or dependency resolution, etc
* Good thing about fast change is new things tend to be better
* Bad things: stuff breaks all the time
* libmemcached as a case study
* mongodb - old versions in 10.04, what is up with that?
* cassandra - moving so fast to deploy
* why do we package? For predictability
* what can authors do - stop breaking your APIs
* what can distributors do - try to support everything, nothing, everything sucks.
* how about what admins do - they build from source
* rather than build, personal package archive, allows you to make a personal repo
* apt-get install will work with these
* also, derivatives

Mark Lin
Metrics Simplified
* Why? if you can't measure you can't improve
* previously, it was hard to get metrics
* this is bottleneck - typically single server, big bottleneck
* Graphite, RabbitMQ, Rocksteady [graphite looks completely badass]
* grab a stream and it will be graphed
* sweet graphs with releases as vertical lines, latency on the horizontal
* too many graphs!
* Rocksteady - a metric as an event [also looks badass]
* auto threshold prediction!
* correlations!
* timing per request, per component

Your Mileage May Vary
---------------------
Kevin Rays, Stefan Apitz, Burzin Engineer, Ernest Muller, Dan Nemec, moderator: Andrew Shafer

Andrew: Let's talk about the bad, what didn't work well:
D: Since I have gone to Dev the Ops people think I am the bad guy. Sysadmin 1.0 is about silos, what's up with that?
K: I put puppet in place, and people didn't understand having to check in to svn, etc.
E: The biggest hindrance is other ops people. Just like when agile came out, we need to make the culture change.
B: Asked who runs agile shops, agile ops. [Only a few agile ops shops.] Success rate around 80%; success is limited by personalities in ops who just want to knock off a ticket and move on to the next ticket. Fan of Kanban in Ops. Wants to hear Kanban stories.
[need to ping Burzin on this]
S: We don't have the tools in place; working on getting automation in the org
Andrew: Sprints 2/week iterations can be demoralizing, so make this transparent.
Andrew: Does anyone have a story of spectacular technology failure?
E: Culture is tremendously important; we have to work to maintain this culture. You need to treat the culture as a first class thing, with people whose job is to make sure you are doing this.
Audience: ??
B: We've broken up the teams to
Audience: Scrum worked for us by considering the product standalone; we pulled point people into the scrums, and let others put items on the sprint list. Releasing was part of the sprint; this led to a slip of only 11 days!
E: It takes a while to be okay with pulling stuff out of the iteration,
Audience: ITIL can help agile; don't alienate people, you need a career path to get to devops.
K: Your sysadmins are great developers, but don't know the development tools; your education job as a devops is teaching them development practices
Audience: Where do the seeds in the org come from that brought you devops? With engineers, or with managers? Bottom up, top down?
K: It is just me in my org, I'm evangelising
D: Bottom up, and limited in hiring, so it was a survival mechanism [this is a good metric on if a system really works]
S: Started from the product side, so we could release more often.
Audience: I love that devops is not just a piece of technology, it is a culture issue. It is easy to change culture in a little company, but very hard to tell the culture is bad. In a big company it is the reverse, and that is a problem. I've never heard of a major company changing its culture. What leaders bring to the table is culture. Be sure to make the culture work when you are small!
E: Good point. One of the flaws we have is that we hate management; as a result we don't build culture/people.
Audience: With a lot of the issues with growth, you can automate to help preserve culture, but sometimes you don't know when to reorg.
A: You can have culture debt; what do you do about this?
K: The unicorn answer: it has to come from the executives; if the leadership pay isn't tied to culture it won't change.
B: Start with the problem. Fix the downtime, iterate on that, understand without blame.
Audience: Read "Good to Great"; it talks about observing successful organizations.
Audience: Change the behaviour, not the culture - you can't change the culture quickly, but you can change behaviour [the "boot camp makes army soldiers out of anyone" philosophy]
E: Use the new tools in new ways.
B: Don't be afraid of change.
S: Change the behaviour

Automata - puppet/chef for networks
* complex network configs - a lot of this is accidental/cruft
* Highly interdependent
* Automated networks are better; 95% right six times in a row is roughly 75% success (0.95^6 ≈ 0.74)
* more reliable
* people are bad at doing the same thing over and over again
* easier to maintain (test, deploy, rollback)
* config generators

Tim Larsen - petascale storage
* data is expensive to store when you have a ton of it
* open storage pod
* uses laptop disks - saves on power, by a lot, and space, and produces less heat [good recognition of the real cost of hosting]
* 120 TB, 4U, 600W
* 9.6PB per 8 racks

Rolf Russell - Botworks
Systems thinking and value stream mapping
* thesis on GAs, about beer game
* MIT Sloan beer game
* parts cannot be viewed in isolation
* software pipeline is the beer game
* lean value stream mapping
* model current state of the system - not the simple model of the software development lifecycle; cross boundaries
* grab metrics around time in state
* ID root causes
* then make changes
* view the pipeline as a single system, get all people in a room, and figure out what to change

Paul Kerner, Cloudkick
Apache Libcloud
* no more racking
* now I want a server right now, libcloud is about this
* Not SaaS, PaaS,
  Cloud Storage
* Why? all the providers have different APIs
* 16 providers supported
* simple API (list_nodes(), reboot_, destroy_, boot_)
* include data, location, cost per hour - "get me 4GB machine, cheapest one, in europe"
* fabric is an example tool
* silverlining - deploy django in cloud

John Treadway
A simple story about how a company spent $10 million and saved nothing
* Funny story about failing involving characters that seemed oddly familiar

Erik Sowa, Lyris
Latent Code: Lessons Learned Implementing Feature Bits
* code that doesn't run until you flip a bit
* DigitalRiver and ExactTarget
* Flickr
* Twitter
* 80 active bits at Lyris at one point, now 48
* lessons learned:
  - design pressures are good (bits require loose coupling)
  - bits have a short lifetime and must be retired aggressively
  - maintain production quality despite the bits
  - naming convention matters; the product manager must be able to turn on the right stuff (a verb and noun!)
  - features are for features, not for control
  - feature bits are independent
  - product wants to use this to make big changes
  - use this to do beta and split testing if you build them correctly
* @eriksowa

Making The Business Case
------------------------
Jay Lyman, Kurt Milne, Jody Mulkey, Rolf Russell, moderator Damon Edwards

Damon: "I've got a great idea, the business folks won't listen" - what are the antipatterns in selling devops to the biz folks?
Jo: focus on the squeaky wheel
R: Antipattern: not understanding your audience. Don't start with words they don't understand; figure out what they want that you can give them
Ja: Talk about the improvements and how to bring those to more people
K: Business people think in terms of outcomes, so tie your justification to that outcome
R: Cost is only a small part of what is driving the CTO/CEO
Damon: What should we hang the devops around?
Jo: "Make fucking graphs"; measure expected revenue, watch this graph
R: Double-Double - he needs to speed up releases, and release twice as much stuff; how can devops enable this?
Ja: Timing is a part of this. Monitor, measure, track.
K: The low cost probe idea; you can marry this with devops
Audience: Observes that people all know what they are doing, they just have different words and priorities. Business+devops, Step 1: talk to each other
Ja: A big part of this is the business requirements folks; they are probably most capable of knowing where a process needs to go
Jo: Second guessing is really expensive - you need to trust those you work with
K: One of the best companies I've seen has product, ops, dev, etc all compensated on the same metric
Damon: Poll: how many are compensated on a business metric? [not too many hands]
Audience: "Leading Geeks" by Paul Glen is recommended as a book - you need to teach your management the things that are in this book. For geeks, 90% of a job is figuring out how to do it. That's not the way it is for most process.
Damon: What should an IT organization measure that a business will consistently care about?
R: It is very dependent on the org; it is cool to be able to tie a system to $$ as a metric. Start by showing how much work gets done in what period of time.
K: "time to capability" - biz wants X, how long until X is in place?
Jo: Scaffolding at Shopzilla makes developers work on business problems first
Audience: On getting APIs on performance of the system: sometimes you have no control over this yet your compensation is tied to this.
Jo: We had an uptime goal, but were blowing it, so everyone took $0 on this. That led to lots of blame. [but oddly he still supports this, saying it still is good]
Ja: Who gets the credit? Comp based on performance might not always be the right model.
Damon: All parts of the business share the risk. If IT says they lose money 'cause of some other guy, IT needs to step it up.
Damon: Don't you want to be part of a place where you get a part of the profitability?
Audience: Works with a CTO who reports 18 metrics; asked how many executives think these are meaningful - not many. Focus on 3 metrics: 1) $s brought in by your output 2) the quality of your output 3) cost
Audience: You shouldn't have to guess what KPIs are important; you should just sit down and ask the business guys.
K: Boil the story down to a cocktail party story; devops can give you faster, cheaper AND more.
Audience: You sell managers a story/vision of what the world should look like, and what it will look like.
R: You don't need stats; great if you have them, but it is all about the stories (depends on the audience)
Ja: This is a circular thing, not a linear process

DevOps Requires Visibility
--------------------------
Gareth Bowles, Javier Soltero, Jyoti Bansal, Matt Ray, Eishay Smith, Moderator: Damon Edwards

Damon: Business immune system, all testing is automated; what do we need to do to get "situational awareness"?
E: You need many lines of defense to make sure the system works,
M: You need to instrument things from the bare metal all the way to the application, and need a tool to plug this into later
Jy: The goals of dev and ops should be aligned with the business goals,
J: Too often you throw tools at the problem (bought or built), but you need to take into account the consumer of this information
G: There need to be people watching the metrics and bubbling up the understandable view
Damon: Data, data everywhere, not a lot of knowledge. Better to ask the business folks if they have the data they need. What makes devops different?
G:
J: In web shops, ops people aren't a cost center, so it is easier to understand the business culture; you need to teach the product managers how to use the metrics, and build an engineering process
Jy: Getting information out of the data is the problem, not getting the data
M: We have a customer whose devs/eng work hand in hand, ask what needs to be instrumented where and push req's back and forth; devs can't mark done without ops sign-off
E: First hire the best engineers, then give them the root password. Trust them! Enable the engineers to do what they want, build the tools for them.
Damon: [for audience] How many feel like you are actually answering the questions the business wants answered? [for the panel] What are some best practices for a good culture of visibility?
E: A good culture of quality; engineers care about covering all the scenarios; every commit passes a test. Lots of alerts when a build is broken.
M: The best pattern is cooperation and sharing of knowledge.
Jy: Information sharing is part of cultures that have been successful. Information needs to be distributed. The first thing you see in the Netflix lobby is the user satisfaction score.
G: Share everything! Let the whole company see the metrics. People will come ask if they don't understand the info.
Audience: We're going through the SLA conversations now, but nothing about the real business metric - this demoralizes me. [long story about the sad state of his work] Monitoring is a racket if you're not doing custom scripts. [need to check the actual service, not just crap like ping times]
Damon: How many people have a permanent review panel for looking at metrics, where the panel is cross functional?
Audience: We had to assess loss, and track outages; what is "down"? If they knew how much money the companies were losing they would pay for the analysis. Regarding "allowing root", what about PCI and auditing, how do you resolve these issues?
E: We have monitoring and reporting, and this forwards it on,
J: VMware's culture is flexible on access control, but there is no getting around regulatory requirements; these inspire some of the features/requirements.
Audience: 1) introduce new metrics at the feature level 2) do postmortems
Panel: start top down, business metrics first, cpu stuff last

Does the cloud need devops? [and vice versa]
--------------------------------------------
Joe Arnold, Adrian Cole, Justin Dean, Ben Black, moderator: John Willis

B: People should be asking "can integration of what I do with everyone else in the business make the business work better?" It seems like this is a win for secops, not sure why they wouldn't do that. The notion of automation in netops is alien, and the tools are not up to snuff
Jus: Interested in including sec/net ops in our work, working on this now. Suffering the "kicked a server very fast, but then waiting on the network team's old tools". 1) need automated reclaiming of old servers; 7 days of idleness and we reclaim your servers
A: using lisp to provision cloud, with clojure
Joe: CTOs are not calling us; it is the product organizations who want cloud computing because their IT organization is too slow.
Audience: Sourceforge "idle server" to help identify unused resources
Audience: Have people used dev/qa clouds as production environments?
B: People have NFS mounted their desktop to Amazon and put Amazon in production. Nothing good comes of you making people work around you. The way to deal with the inadvertent is to do fire drills, continuous deployment, continuous failover testing.
Audience: Using the cloud to make servers is good, but developers need infrastructure; what can be done to help out?
Jus: Yes, we automated the deployment of the apps, and are working on automating the load balancers.
Audience: You talk about enabling developers, but it also makes the developers take ownership of uptime/etc, and this is an area of weakness for developers.
The question is what people are doing to help developers get better at ops?
A: It is a collaboration
Audience: When you have developers accessing servers, people lose trust due to "chmod 777" and so on; what do you do to stop this?
B: Education by feedback about why policies and procedures are in place; you do this for ops people already
A: Developers will work around boundaries.
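
[Editor's sketch of what "education by feedback" could look like for the "chmod 777" complaint. This is an assumption about one possible implementation, not something the panel described: a small Python check that finds world-writable files and explains the policy rather than just rejecting the change. The function names and the wording of the message are hypothetical; the `os` and `stat` calls are from the standard library and assume a POSIX host.]

```python
import os
import stat
import tempfile
from pathlib import Path


def world_writable_files(root: Path) -> list[Path]:
    """Return regular files under root that any local user may write to."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            mode = path.lstat().st_mode
            # S_IWOTH is the "other users may write" bit that chmod 777 sets.
            if stat.S_ISREG(mode) and mode & stat.S_IWOTH:
                found.append(path)
    return sorted(found)


def feedback(path: Path) -> str:
    """Explain why the permission is a problem instead of only flagging it."""
    mode = stat.S_IMODE(path.lstat().st_mode)
    return (
        f"{path}: mode {mode:o} lets every account on the host modify this file; "
        "grant write access to the owning user or group instead."
    )


if __name__ == "__main__":
    # A temporary directory stands in for a deployed application tree.
    with tempfile.TemporaryDirectory() as tmp:
        risky = Path(tmp) / "deploy.sh"
        risky.write_text("#!/bin/sh\n")
        os.chmod(risky, 0o777)
        for path in world_writable_files(Path(tmp)):
            print(feedback(path))
```

[A check like this could run as part of a deploy, so the developer who set the permission sees the reason for the policy at the moment it matters.]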