The State of DevOps Report 2020 – A Summary

Every year for the past decade, Puppet have carried out their “State of DevOps” report, apart from 2019, when it was produced and released by DORA through Google.

This year, Puppet took the reins again and despite 2020 being the year from Hell, they managed to survey 2,400 technology professionals and released their report on 12th November.

The State of DevOps report attempts to gather, aggregate and analyse progress across the technology industry, backed by data and statistical analysis.

Here are the key takeaways from the 2020 State of DevOps Report:

DevOps continues to evolve.

One of the things I like about the Puppet approach is that they see DevOps as a continual evolution towards improved delivery, quality and security, and steer away from a more traditional “maturity model” that implies a possibly fictional end state where DevOps is “done”.

From personal experience and what we’ve seen over the past ten years of data, we need to recognise that technical practices are important, but practices that are isolated to a few teams simply aren’t enough to help organisations achieve widespread DevOps success. DevOps is not a CI/CD pipeline, it’s not technology, public cloud, or automation. DevOps is people, culture, mindset, technology, constraints, experience and expertise.*

As the 2019 report by DORA showed, a culture of psychological safety is crucial to both team & organisational performance, and productivity.


Internal platform teams

One major evident transition is the shift to internal platform teams. Unlike product teams, which are responsible for the end-to-end delivery of a product, internal platform teams provide the infrastructure, environments, deployment pipelines and other internal services that enable internal customers (such as those product teams) to build, deploy and run their applications.

The platform model can make product teams far more efficient by allowing them to focus on their primary goals and their core competencies: building and delivering products. A platform team can improve governance, compliance and cost efficiency through providing a standardised toolset that can be easily understood and consumed by value stream-oriented teams.

The 2020 State of DevOps report shows that high performing organisations are six times more likely to report the use of internal platforms as compared to low performing organisations.


Shared internal platforms provide a balance between standardisation and team autonomy. Finding where to place this balance and draw the line can be challenging, but the important thing is to start.

A really useful resource is Manuel Pais and Matthew Skelton’s book “Team Topologies”, which will help you understand what team structures will contribute to building high performing products and services, and how internal platform teams could work in your organisation.

Product over project

More organisations are transitioning away from a traditional project mindset towards value-stream-aligned, product-oriented approaches. Organisations that retain a traditional “project mindset” may suffer from the proliferation of temporary teams that form and disperse as projects begin and end, impacting team cohesion and performance.

A project mindset encourages teams to focus on the next shiny thing, and throw things over the wall for ops to support, rather than own a product or service longer term and ensure that it’s not only fit for purpose, but constantly improving.

Adopting a product-oriented approach and tying work to value streams improves the delivery of features, reduces defects, increases security, and lowers technical debt. Mik Kersten’s Project To Product is an excellent book to learn more about how to adopt a product approach.

The 2020 State of DevOps report shows that a product mindset is a key enabler of performance in the technology space, and accelerates DevOps adoption and evolution.


Change management

Ever since Gene Kim wrote The Phoenix Project, we’ve known that fast and lean change management is a precursor of technology performance. Nicole Forsgren describes in her book Accelerate how lead time for changes is an essential metric for high performing teams.

The 2020 State of DevOps report revealed four different approaches to change management based on approval processes (orthodox “gatekeeping” approaches versus adaptive and collaborative), automated testing and deployment, and advanced risk mitigation techniques.

The four approaches described by Puppet are:

  • Operationally mature: High levels of both process and automation.
  • Engineering driven: High emphasis on automation.
  • Governance focused: High emphasis on manual approvals and low emphasis on automation.
  • Ad hoc: Low emphasis on both process and automation.

Puppet also showed that organisations that trust in their change management processes are more likely to adopt automation, which further improves performance.  Additionally, organisations that encourage high engagement with employees in the change management process are five times more likely to have effective change management processes.


It is interesting to note that ITIL, originally intended to improve the quality and performance of technology, has been adopted by many organisations (90% of Fortune 500 firms use it), where it has often produced cumbersome bureaucratic processes that actually slow change and increase risk. Fortunately, the latest version, ITIL v4, departs from this heavyweight approach and instead encourages change enablement and collaboration.

To put it simply:

  • Orthodox approvals damage performance
  • Automation gives teams confidence in change management
  • Giving people agency over the process results in higher performance

Challenges to improving change management practices include incomplete test coverage, organisational mindsets of fear and compliance instead of trust and value, and tightly coupled and/or monolithic architectures.

As with any DevOps transformation, improve change management processes but primarily focus on people and culture. Break down silos and build empathy across people and teams: enable and encourage engineers to understand and empathise with the concerns of compliance and risk teams, whilst working with governance to create a culture of shifting security and compliance left.

TL;DR:

  1. The industry still has a long way to go and there remain significant areas for improvement across all sectors.
  2. Internal platforms and platform teams are a key enabler of performance, and more organisations are adopting this approach.
  3. Adopting a product approach over project-oriented improves performance and facilitates improved adoption of DevOps cultures and practices.
  4. Lean, automated, and people-oriented change management processes improve velocity and performance.

 

Thanks to the team at Puppet and DORA for carrying out the State of DevOps reports every year, including the team behind this year’s report: Alanna Brown (@alannapb), Michael Stahnke (@stahnma), and Nigel Kersten (@nigelkersten).

 

If you’re here looking for a summary of the 2021 State of DevOps Report by Puppet, it’s located here.

*Thanks to Tom Hoyland for the articulate description of DevOps.

Remote Working – What Have We Learned From 2020?

Remote working improves productivity.

Even way back in 2014, evidence showed that remote working enables employees to be more productive and take fewer sick days, and saves money for the organisation.  The rabbit is out of the hat: remote working works, and it has obvious benefits.

Source: Forbes Global Workplace Analytics 2020

More and more organisations are adopting remote-first or fully remote practices, such as Zapier:

“It’s a better way to work. It allows us to hire smart people no matter where in the world, and it gives those people hours back in their day to spend with friends and family. We save money on office space and all the hassles that comes with that. A lot of people are more productive in remote setting, though it does require some more discipline too.”

We know, through empirical studies and longitudinal evidence such as Google’s Project Aristotle, that colocation of teams is not a factor in driving performance. Remote teams perform as well as, if not better than, colocated teams when provided with appropriate tools and leadership.

Teams that are already used to more flexible, lightweight or agile approaches adapt to a high performing and fully remote model even more easily than traditional teams.

The opportunity to work remotely, more flexibly, and save on time spent commuting helps to improve the lives of people with caring, parenting or other commitments too. Whilst some parents are undoubtedly keen to get into the office and away from the distractions of home schooling, the ability to choose remote and more flexible work patterns is a game changer for some, and many are actually considering refusing to go back to the old ways.

What works for some, doesn’t work for others, and it will change for all of us over time, as our circumstances change. But having that choice is critical.

However, remote working is still (even now in 2020 with the effects of Covid and lockdowns) something that is “allowed” by an organisation and provided to the people that work there as a benefit.

Remote working is now an expectation.

What we are seeing now is that, for employees at least, particularly in technology, design, and other knowledge-economy roles, remote working is no longer a treat or a benefit: just like holiday pay and lunch breaks, it’s an expectation.

Organisations that adopt and encourage remote working are able to recruit across a wider catchment area, unimpeded by geography, though still somewhat limited by timezones – because we also know that synchronous communication is important.

Remote work is also good for the economy, and for equality across geographies. Remote work is closing the wage gap between areas of the US and will likely have the same effect on the North-South divide in the UK. This means London firms can recruit top talent outside the South-East, and people in typically less affluent areas can find well paying work without moving away.

But that view isn’t shared by many organisations.

However, whilst employees increasingly see remote working as an expectation rather than a benefit, many organisations still want to bring employees back into the office, where they can see them. That pressure comes from command-and-control managers, difficulties in onboarding, process-oriented HR teams, or simply the most dangerous phrase in the English language: because “we’ve always done it this way”.

Indeed, managers in those organisations may see remote working as an exclusive benefit, or as an opportunity to slack off. The Taylorist approach to management is still going strong, it appears.

People are adopting remote faster than organisations.

In 1962, Everett Rogers came up with the principle he called the “Diffusion of Innovations”.

It describes the adoption of new ideas and products over time as a bell curve, and categorises groups of people along its length as innovators, early adopters, early majority, late majority, and laggards. Spawned in the days of rapidly advancing agricultural technology, it was easy (and interesting) to study the adoption of new technologies such as hybrid seeds, equipment and methods.

It seems that many people are further along Rogers’ adoption curve than the organisations they work for.

Some organisations are even suggesting that remote workers could be paid less, since they no longer pay for their commute (in terms of costs and time), but I believe the converse may become true: firms that request regular attendance at the office will need to pay more to make up for it. As an employee, how much do you value your free time?

There are benefits of being in the office.

Of course, it’s important to recognise that there are benefits to being colocated in an office environment. Some types of work simply don’t suit remote working. Some people don’t have a suitable home environment to work from. Sometimes people need to work on a physical product, or collaborate and use tools and equipment in person. Much of the time, people just want to be in the same room as their colleagues – what Tom Cheesewright calls “the unbeatable bandwidth of being there”.

But is that benefit worth the cost? An average commute is 59 minutes each way, which totals nearly 40 hours per month, per employee. For a team of twenty people, is 800 hours per month worth the benefit of being colocated? What would you pay to obtain an extra 800 hours of time for your team in a single month?
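As a back-of-the-envelope check (assuming a 59-minute commute each way and roughly 20 working days a month), the figures work out like this:

```python
# Rough commute-time arithmetic; the 20-working-day month is an assumption.
minutes_each_way = 59
working_days_per_month = 20
team_size = 20

hours_per_person = minutes_each_way * 2 * working_days_per_month / 60
team_hours = hours_per_person * team_size

print(f"{hours_per_person:.1f} hours per person")  # ~39.3 hours
print(f"{team_hours:.0f} hours per team")          # ~787 hours, i.e. nearly 800
```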

The question is one of motivation: are we empowering our team members to choose where they want to work and how they best provide value, or are we to revert to the Taylorist principle that “the manager knows best”? In Taylor’s words: “All we want of them is to obey the orders we give them, do what we say, and do it quick.”

We must use this as a learning opportunity.

Whilst 2020 has been a massive challenge for all of us, it’s also taught us a great deal, about change, about people and about the future of work. The worst thing that companies can do is ignore what they have learned about their workforce and how they like to operate. We must not mindlessly drift back to the old ways.

We know that remote working is more productive, but there are many shades of remoteness, and it takes strong leadership, management effort, good tools, and effective, high-cadence communication to really do it well.

There is no need for a binary choice: there is no one-size-fits-all for office-based or remote work. There are infinite operating models available to us, and the best we can do to prepare for the future of work is simply to be endlessly adaptable.

“Root” Cause Analysis using Rothmans Causal Pies


Context: It sometimes seems to me that in the tech industry, perhaps because we’re often playing with new technologies and innovating in our organisation, or even our field (when we’re not trying to pay down tech debt and keep legacy systems running), we’re guilty of not looking outside our own sphere for better practices and new (or even old) ideas.

Rothman’s Causal Pies

Whilst studying for my Master’s degree in Global Health, I discovered the concept of “Rothman’s Causal Pies”.

The Epidemiological Triad

Epidemiology is the study of why and how diseases (including non-communicable diseases) occur. As a field, it encompasses the entire realm of human existence, from environmental and biological aspects to heuristics and even economics. It’s a real exercise in Systems Thinking, which is kinda why I love it.

In epidemiology, there is a concept known as the “Epidemiological Triad”, which describes the necessary relationship between agent, host, and environment. When all three are present, the disease can occur. Without one or more of those three factors, the disease cannot occur. It’s a very simplistic but useful model. As we know, all models are wrong, but some are useful.

This concept is useful because through understanding this triad, it’s possible to identify an intervention to reduce the incidence of, or even eradicate, a disease, such as by changing something in the environment (say, by providing clean drinking water) or a vaccination programme (changing something about the host).

What the triad doesn’t provide, however, is a description of the various factors necessary for the disease to occur, and this is especially relevant to non-communicable diseases (NCDs), such as back pain, coronary heart disease, or a mental health problem. In these cases, there may be many different components, or causal factors. Some of these may be “necessary”, whilst some may contribute. There may be many different combinations of causes that result in the disease.

To use heart disease as an example, the component causes, or “risk factors” could include poor diet, little or no exercise, genetic predisposition, smoking, alcohol, and many more. No single component is sufficient to cause the disease, and one (genetic predisposition, for example) may be necessary in all cases.

Rothman, in 1976, came up with a model that demonstrates the multifactorial nature of causation.

Rothman’s Causal Pies

An individual factor that contributes to cause disease is shown as a piece of a pie, like the triangles in the game Trivial Pursuit. After all the pieces of a pie fall into place, the pie is complete, and disease occurs.

The individual factors are called component causes. The complete pie, which is termed a causal pathway, is called a sufficient cause. A disease may have more than one sufficient cause, with each sufficient cause being composed of several component causes that may or may not overlap. A component that appears in every single pie or pathway is called a necessary cause, because without it, disease does not occur. An example of this is the role that genetic factors play in haemophilia in humans – haemophilia will not occur without a specific gene defect, but the gene defect is not believed to be sufficient in isolation to cause the disease.

An example: note in the image below that component cause A is a necessary cause, because it appears in every pie. But this does not mean it is the “root cause”, because it is not sufficient on its own.

Root Cause Analysis

I’m a huge proponent of holding regular retrospectives (for incidents, failures, successes, and simply at regular intervals), but it seems that in technology, particularly when we’re carrying out a Root Cause Analysis due to an incident, there’s a tendency to assume one single “root cause” – the smoking gun that caused the problem.

We may tend towards assuming that once we’ve found this necessary cause, we’re finished. And whilst that’s certainly a useful exercise, it’s important to recognise that there are other component causes and there may be more than one sufficient cause.

The Five Whys model is a great example of this: it fails to probe into other component factors, and only looks for a single root cause. As any resilience engineer will tell you: There is no Single Root Cause.

The Five Whys takes the team down a single linear path, and will certainly find a root cause, but it leaves the team blind to other potential component or sufficient causes – and even worse, it leads the team to believe they’ve identified the problem. In the worst case, a team may identify “human error” as the root cause, which can reaffirm a faulty, overly simplistic world view and not only identify the wrong cause, but harm the team’s ability to carry out RCAs in the future.

Read more about the flaws in the “five whys” model in John Allspaw’s “Infinite Hows”. Allspaw has recently published another great piece about “root causes” in this blog article.

In reality, we’re dealing with complex, maybe even chaotic states, alongside human interactions. There exist multiple causal factors, some necessary for the “incident” to have occurred, and some simply component causes that together become sufficient – the completed pie!

Take Away: There is usually more than one causal pie.

An improved approach could be to use Ishikawa diagrams, but in my experience, particularly when dealing with complex systems, these diagrams very quickly become visibly cluttered and complex, which makes them hard to use. Additionally, because each “fish bone” is treated as a separate pathway, interrelationships between causes may not be identified.

Instead of a complex fishbone diagram, try identifying all the component causes, and visually complete (on a whiteboard for example) all the pies that could (or did) result in the outcome. You almost certainly won’t identify all of them, but that doesn’t matter very much.

If we adopt Rothman’s causal pie model instead of approaches such as the Five Whys or Ishikawa diagrams, it provides us with an easy to use and easy to visualise tool that can model not only “what caused this incident?”, but “what factors, if present, could cause this incident to occur again?”.

In order to prevent the incident (the disease, in epidemiological terms), the key factor we’re looking for is the “necessary cause” – component A in the pies diagram. But we’re also looking for the other component causes.

Application: The prevention of future incidents.

Suppose we can’t easily solve component A – maybe it’s a third party system that’s outside our control – but we can control causal components B and C which occur in every causal pie. If we control for those instead, it’s clear that we don’t need to worry about component A anyway!
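This “coverage” reasoning is easy to model. Here is a minimal sketch, using hypothetical component causes A to E (not from any real incident), where each sufficient cause is a set of components:

```python
# Each "pie" (sufficient cause) is a set of component causes.
# A to E are hypothetical components for illustration only.
pies = [
    {"A", "B", "C"},
    {"A", "B", "D"},
    {"A", "C", "E"},
]

def necessary_causes(pies):
    """Components appearing in every pie: without one, no incident."""
    return set.intersection(*pies)

def prevents(controlled, pies):
    """Controlling these components blocks every causal pathway
    if each pie contains at least one controlled component."""
    return all(pie & controlled for pie in pies)

print(necessary_causes(pies))      # A is the necessary cause
print(prevents({"A"}, pies))       # controlling A blocks every pathway
print(prevents({"B", "C"}, pies))  # so does controlling B and C together
```

Controlling {B, C} works here because every pie contains B or C, even though neither is necessary on its own.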

Next time you’re carrying out a Root Cause Analysis or retrospective, try using Rothman’s Causal Pies.

Addendum: “Post-Mortem” exercises.

Even though the term “post-mortem” is ubiquitously used in the technology industry as a descriptor for analysis into root causes, I don’t like it.

Firstly, in the vast majority of tech incidents, nobody died – post-mortem literally means “after death”. It implies that a Very Bad Thing happened, but if we’re trying to hold constructive, open exercises where everyone present possesses enough psychological safety in order to contribute honestly and without fear, we should phrase the exercise in less morbid terms. The incident has already happened – we should treat it as a learning opportunity, not a punitive sounding exercise.

Secondly, we should run these root cause analysis exercises for successes, not just for failures. You don’t learn the secrets of a great marriage by studying divorce. The term “post-mortem” isn’t particularly appropriate for studying the root causes of successes.

 

I should probably highlight something about Safety I vs Safety II approaches here. I’ll add that when I have time!

 

Simpson’s Paradox and the Ecological Fallacy


I’m currently studying for a Master’s Degree in Global Health at The University of Manchester, and I’m absolutely loving it. Right now, we’re studying epidemiology and research study design, which also involves a great deal of statistical analysis and data science work.

Some data was presented to us from an ecological study (a type of scientific study that looks at large-scale, population-level data) called The WHO MONICA Project, which showed mean cholesterol vs mean height, grouped by population centre (e.g. China-Beijing or UK-Glasgow).

Plotted this way, the data showed a positive correlation between height and cholesterol, with a coefficient of 0.36, suggesting that height may be a risk factor for higher cholesterol.

However, when the analysis was re-run using raw data (not averaged for each of the population centres), the correlation coefficient was -0.11.

So, when using mean measures of each population centre, it appears that height could be a risk factor for higher cholesterol, whilst the raw data actually shows the opposite is slightly more likely to be true!
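To see how averaging by group can flip the sign, here is a sketch with synthetic numbers (invented for illustration, not the MONICA data): within each “centre” the height-cholesterol trend is negative, but the centre means line up positively.

```python
# Synthetic (height cm, cholesterol mmol/L) pairs for three invented
# population centres; within each centre the trend is negative.
centres = {
    "A": [(150, 6.0), (170, 5.0)],
    "B": [(160, 6.2), (180, 5.2)],
    "C": [(170, 6.4), (190, 5.4)],
}

def pearson(points):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    xs, ys = zip(*points)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in points)
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

raw = [p for pts in centres.values() for p in pts]
means = [(sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
         for pts in centres.values()]

print(f"raw individuals: r = {pearson(raw):.2f}")    # negative
print(f"centre means:    r = {pearson(means):.2f}")  # strongly positive
```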

This is known as an “ecological fallacy” – because it takes population level data and makes erroneous assumptions about individual effects.

This is also a great example of “Simpson’s Paradox”.

Simpson’s paradox occurs when a trend appears in several different groups of data but disappears or reverses when the groups are combined.

Table 1 in Wang (2018) is a relatively easy example. (This is fictional test score data for two schools.)

(Also, please ignore for a moment the author’s possible bias in scoring male students higher – maybe this is a test about ability to grow facial hair.)

School      male n   male average   female n   female average
Alpha (1)   80       84             20         80
Beta (2)    20       85             80         81

It’s clear if you look at the numbers that the Beta school has higher average scores (85 and 81 for male and female students respectively).

However, if you calculate the overall average scores for all pupils in each school, Alpha school has an average score of 83.2 and Beta just 81.8.

So whilst Beta school *looks* like the higher performing school when broken down by gender, it is actually Alpha school that has the higher average score.

In this case, it’s quite clear why: if you only look at the average scores by gender, it’s easy to assume that the proportion of male and female pupils for each school is roughly the same, when in fact 80 pupils at Alpha school are male (and 20 female), but only 20 are male at the Beta school, with 80 female.
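The reversal can be reproduced directly from the table above by weighting each gender’s average by its number of pupils:

```python
# Scores by school and gender as (number of pupils, average score),
# taken from the table above.
schools = {
    "Alpha": {"male": (80, 84), "female": (20, 80)},
    "Beta":  {"male": (20, 85), "female": (80, 81)},
}

overall = {}
for school, groups in schools.items():
    pupils = sum(n for n, _ in groups.values())
    total = sum(n * avg for n, avg in groups.values())
    overall[school] = total / pupils

# Beta leads within each gender (85 > 84, 81 > 80) but trails overall.
print(overall)  # {'Alpha': 83.2, 'Beta': 81.8}
```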

Using gender to segment the data hides this disproportion of gender between the schools. This may be appropriate to show in some cases, but can lead to false assumptions being made.

The same issue can be seen in Covid-19 Case Fatality Rate (CFR) data when comparing Italy and China. Von Kügelgen et al. (2020) found that CFRs were lower in Italy for every age group, but higher overall (see table (a) in the paper).

The reason, when you see table (b), is clear. The CFR for the 70-79 and 80+ groups are far higher than for all other age groups, and these age groups are significantly over-represented in Italy’s confirmed cases of Covid-19. This means that Italy’s overall CFR is higher than China’s only by dint of recording a “much higher proportion of confirmed cases in older patients compared to China.” China simply didn’t report as many Covid-19 cases in older individuals, and the fatality rate is far higher in older individuals. Italy has a more elderly population (median age of 45.4 opposed to China’s 38.4), but other factors such as testing strategies and social dynamics may also be playing a part.
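The same weighted-average mechanics apply here. A sketch with hypothetical figures (illustrative only, not the paper’s actual numbers) shows how a country can have the lower CFR in every age band yet the higher CFR overall:

```python
# Hypothetical CFRs by age band, and each country's share of confirmed
# cases falling in that band (invented numbers for illustration).
cfr = {"Italy": {"under 70": 0.01, "70 plus": 0.15},
       "China": {"under 70": 0.02, "70 plus": 0.20}}
case_mix = {"Italy": {"under 70": 0.40, "70 plus": 0.60},
            "China": {"under 70": 0.90, "70 plus": 0.10}}

# Overall CFR is the case-mix-weighted average of the per-band CFRs.
overall = {
    country: sum(cfr[country][band] * case_mix[country][band]
                 for band in cfr[country])
    for country in cfr
}

# Italy is lower in every band, yet higher overall, because far more of
# its confirmed cases fall in the high-CFR older band.
print(overall)
```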

Another example of Simpson’s Paradox appears in gender bias among graduate admissions to the University of California, Berkeley, this time in reverse. In 1973, the admission figures appeared to show that men were more likely to be admitted than women, and the difference was significant enough that it was unlikely to be due to chance alone. However, the data for the individual departments showed a “small but statistically significant bias in favour of women” (Bickel et al., 1975). Bickel et al.’s conclusion was that women tended to apply to more competitive departments, such as English, whilst men applied to departments such as engineering and chemistry, which typically had higher rates of admission.

(Whether this still constitutes bias is the subject of a different debate.)

The crux of Simpson’s Paradox is this: if you pool data without regard to the underlying causality, you can draw the wrong conclusions.

References:

Wang, B. (2018) “Simpson’s Paradox: Examples”, Shanghai Archives of Psychiatry, 30(2), p. 139. Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5936043/ (Accessed: 21 October 2020).

von Kügelgen, J., Gresele, L. and Schölkopf, B. (2020) “Simpson’s paradox in Covid-19 case fatality rates: a mediation analysis of age-related causal effects”, arXiv. Available at: https://arxiv.org/pdf/2005.07180.pdf (Accessed: 21 October 2020).

Bickel, P.J., Hammel, E.A. and O’Connell, J.W. (1975) “Sex Bias in Graduate Admissions: Data From Berkeley”, Science, 187(4175), pp. 398–404. doi:10.1126/science.187.4175.398. Available at: https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf

WHO MONICA Project Principal Investigators (1988) “The World Health Organization MONICA Project (monitoring trends and determinants in cardiovascular disease): a major international collaboration”, Journal of Clinical Epidemiology, 41(2), pp. 105–114. doi:10.1016/0895-4356(88)90084-4.