Category Archives: risk management

Managing Risk – Part 1

This is the start of a 5-part series on managing risk. Risk management is second only to communications as a core skill for project managers, and this series offers a quick refresher on some of the important concepts.

What is Risk?

The pyramid below lists stages of understanding ranging from low (at the bottom of the pyramid) to high (at the top).

The first stage is pure uncertainty. I’ll use a coin flip to illustrate these different stages: you’re going to flip a coin, but you have absolutely no idea what the outcomes could possibly be. You have no sense that heads is any more likely than the coin disappearing in a puff of smoke.

The second stage is to understand at least some (but not all) of the potential outcomes; for example, if I flip a coin, it could be heads.

The third stage is to know the complete set of outcomes: if I flip a coin it could be heads or tails, and no other outcomes are possible.

The fourth stage is to know the probabilities associated with each outcome, for example 50% heads and 50% tails.

The final stage is to know the outcome. This doesn’t work so well with the coin example, because a coin flip is designed to maintain a balance of uncertainty between two outcomes. But if you were a physics genius and knew exactly how the coin would be flipped, how clean the coin was on each side, and the impact of wind direction, then you would know in advance how the coin would land.
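The gap between the fourth and fifth stages can be seen in a quick simulation; this is just an illustrative sketch, not something from the original series:

```python
import random

def flip_frequencies(n_flips: int, seed: int = 42) -> dict:
    """Simulate n_flips fair coin flips and return the observed frequencies."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return {"heads": heads / n_flips, "tails": (n_flips - heads) / n_flips}

# With enough flips the observed frequencies converge on the known 50/50
# probabilities (stage four), yet the outcome of any single flip remains
# unknown until it happens (stage five).
print(flip_frequencies(100_000))
```

Knowing the probabilities lets you manage the risk; it doesn’t let you skip the coin flip.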

But, hang on, you ask: in this final example risk management isn’t needed?

Exactly, that’s the point. Risk management is a way of dealing with our imperfect understanding of the world, and it can be done away with once we reach such a total understanding of the situation that we can predict the outcome, just as a perfect driver wouldn’t need a seat belt. In most cases, though, we’re not this good and risk management is still needed.

Project Failure – Wembley Stadium

Wembley Stadium (photo: Martin Pettitte via Flickr)

Wembley Stadium is the home of English football (or English soccer if you’re American) and was rebuilt in the 2000s, replacing the original structure from 1923. The project took 5 years longer than first estimated, and costs were more than double initial estimates. The stadium uses an innovative steel arch that adds aesthetic appeal but is also load bearing, minimizing the need for internal supports that could have obstructed views within the stadium. As a result, the arch improves the quality of the seating. The design wasn’t quite as novel as the Sydney Opera House or the Guggenheim Bilbao, but it nonetheless included an unprecedented design element in the arch, making best-practice techniques such as reference class forecasting impossible because there were no useful historical estimates to draw on – it had not been done before. This lack of historical precedent is often a red flag for accurate project planning.

photo: Kol Tregaskes via Flickr

There appear to be several reasons for delay in the case of Wembley stadium:

Bidding Process and Winner’s Curse

The contract was bid out and awarded to one of the lowest-cost bids. This creates a winner’s curse situation, in which the winning bid is likely to be too aggressive in estimating the actual costs of the project. The cost of the project rose 36% between the bid being accepted and the contract being signed.
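The winner’s curse can be demonstrated with a small simulation; the bidder count and noise level below are hypothetical, chosen purely for illustration:

```python
import random

def average_winning_bid(true_cost: float, n_bidders: int, noise_sd: float,
                        n_auctions: int = 10_000, seed: int = 1) -> float:
    """Mean winning (lowest) bid when each bidder estimates the true cost
    with unbiased Gaussian noise and the lowest bid wins."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_auctions):
        bids = [true_cost + rng.gauss(0, noise_sd) for _ in range(n_bidders)]
        total += min(bids)  # the lowest-cost bid wins the contract
    return total / n_auctions

# Every individual bid is unbiased, but the *winning* bid is systematically
# below the true cost of the work: the winner's curse.
print(average_winning_bid(true_cost=100.0, n_bidders=6, noise_sd=15.0))
```

The more bidders and the more estimation noise, the further below the true cost the winning bid tends to fall.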

Implementation Of An Unprecedented Design

The arch implementation was problematic: the sub-contractor for the arch was ultimately replaced midway through the project, and the delay caused further problems. It appears that the fundamental issue was attempting a stadium design using a load bearing arch that was novel and untested in previous stadium designs. This is typical of projects that are too innovative, and is one of the reasons that the Denver Airport Baggage System failed. Projects with formal budgets and timelines are not the place to be prototyping unproven techniques and processes – at least not if you’re hoping for a credible initial estimate of how long the project will take.

Source: Martin Pettitt (via Flickr)

Information Flow and Incentives

Information flow around the project was never straightforward, and incentives were not well aligned. The contractor was conscious of disclosure to its shareholders, and its relationship with the sponsor of the project became so tense as to ultimately end in legal action. In part, this appears to be related to the fixed-price nature of the contract – any delay had immediate implications for profitability. This may have led to two interesting situations in which more junior employees appear to have been better informed about the project than senior management, perhaps because the implications of delay were so serious for profitability that information was not eagerly shared. Note that around this time senior management was making statements that the project was on track:

  • Firstly, a whistleblower within the accounting department claimed to know of project delays months before they were disclosed.
  • Secondly, in the UK it was possible to place bets on potential delays on the project. These bets were stopped after the observation of “men in hard hats placing big bets in the Wembley area”.

It is also interesting that after the delays were disclosed, management instituted a “peer review” process to better assess the performance of in-flight projects.

Trust, Drugs and Scope Changes

As with any project, there are many factors at play.

After the first delays, the sponsor and contractor became less willing to conduct work in parallel due to mistrust of completion dates. This may have added a few months to completion but, in the context of years of delay, doesn’t appear to be a primary factor. It is interesting, though, that on delayed projects further delays can be self-fulfilling as trust in the critical path diminishes.

There was press speculation that workers on site were using drugs. This claim is hard to substantiate and was never proven.

There were some scope changes, though again it appears that the construction of the arch (part of the initial design) was the key factor in the delay. This stands in contrast to other projects, such as the FBI’s Virtual Case File, where scope change was a key contributor to delay.


Fundamentally, when attempting a unique work item, such as a novel load bearing steel arch as a fundamental part of a stadium, it is very hard to estimate cost and duration with precision. Awarding the work via a bidding process with a fixed-price contract exacerbates this problem, because the winning bid is more likely to underestimate the required work due to the winner’s curse. In addition, it appears information flow could have been improved on this project – if junior employees were aware of potential delays and senior management was not, information was clearly not being shared effectively.

You can see more of these sorts of case studies at or follow me on Twitter here for blog updates, or consider reading my book.

Arch Detail - Jesse Loughborough (via Flickr)

Bad Metrics

The Department of Homeland Security’s Threat Advisory System (definitions here) is a great example of a bad metric.

No information is conveyed, the method of construction is opaque, and the required action isn’t clear. For example, yellow includes “Continue to be alert for suspicious activity and report it to authorities,” and orange cites “Exercise caution when traveling.” Aren’t these things we should be doing anyway?

Despite these failings, at least the scale is vivid and simple. That doesn’t compensate for its other failings, though; this is a case where some information is worse than no information at all.

The Value of Iterative Prototyping

Nice video on the value of iterative prototyping and the limitations of an MBA.

The Value of Burndown Charts

Burndown gives you a robust way to figure out when your project will actually finish. It’s a core part of Scrum, but plenty of project managers outside Scrum can get value from the technique.

Burndown is valuable because we know people are generally bad at estimating completion dates, so any technique that contributes an objective estimate should be useful. If you already have a Gantt-based schedule, then burndown is likely a complement rather than a substitute, because the two methods are fundamentally different but both help answer the same question – when will we finish?

Burndown hinges on a very simple equation:

Rate of change in work = New work added – work completed

There are two basic outcomes of this equation. If you’re adding more work than you’re completing, then the completion of the project is still some way off. If you’re completing more work than you’re adding, then you are on the way to being done, and burndown charting can give you an estimate of how close the endpoint is.

As with anything apparently simple, the real challenges are hidden in the details. Here the main challenges are:

1. Ensuring that all work is defined at a similar level of granularity (for example, work in a software project might be measured in bugs)

2. Ensuring that you actually have a way to measure when new work is added and existing work is completed.

And of the 2 items above, (1) is actually much harder than (2). With any decent tracking system, (2) should be straightforward. For (1), work items will never be truly equal: I mentioned software bugs as a common unit of measure, but of course not all bugs are equal – some can be fixed in 10 minutes, others may take weeks or longer.

Then, generically, any project has 3 phases:

  • Phase 1 – planning: work items flat or increasing
  • Phase 2 – core execution: work items increasing
  • Phase 3 – approaching finish: work items decreasing

The diagram above gives an indication of what this looks like over the life of a project, though of course no project is ever quite this pure.

Burndown is most interesting/useful in the final phase (the downward slope) because that’s when the estimates have most forecasting value.

Example 1

50 work items outstanding

1 work item being added each day (on average)

11 work items being done each day (on average)

So, net 10 items are being done each day and the project should be done in 5 working days (i.e. 50/10).

Example 2

200 work items outstanding

10 work items being added each day (on average)

15 work items being done each day (on average)

So, net 5 items are being done each day and the project should be done in 40 working days (i.e. 200/5).
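The arithmetic in the two examples above can be sketched as a small helper function; this is an illustrative sketch, not a standard tool:

```python
def days_to_complete(outstanding: float, added_per_day: float,
                     done_per_day: float) -> float:
    """Estimate working days until the stock of outstanding work hits zero."""
    net_rate = done_per_day - added_per_day
    if net_rate <= 0:
        raise ValueError("work is arriving at least as fast as it is "
                         "completed, so no finish date can be projected")
    return outstanding / net_rate

print(days_to_complete(50, 1, 11))    # Example 1: 5.0 days
print(days_to_complete(200, 10, 15))  # Example 2: 40.0 days
```

Note that when the net rate is zero or negative, no finish date can be projected at all – which is exactly the point made below.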

The examples above also imply that burndown only tells you when the current stock of work will hit zero. Obviously, if work continues to be added each day, the project is never truly done in this sense – even once the stock reaches zero, more work will arrive tomorrow.

So that, in essence, is burndown charting. If you’re not currently using it and have a large number of similar work items on your project (or sub-project) then it’s a useful technique to experiment with.

Building a Project Plan – Key Activities Checklist

In the appendix of a recent report regarding the Department of Energy, the Government Accountability Office uses the following checklist for assessing project plans, to which I’ve added two broad groupings:
1. Build an accurate plan that reflects the project

  • Capturing key activities
  • Sequencing key activities
  • Establishing the duration of key activities
  • Assigning resources to key activities
  • Integrating key activities horizontally and vertically

2. Manage project risks

  • Establishing a critical path for key activities
  • Identifying the float time between key activities
  • Performing a schedule risk analysis
  • Distributing reserves to high risk activities

Of course, this is a very schedule-centric checklist. There is no mention of talking with your stakeholders, managing partner relationships, or assessing the feasibility of the work to be undertaken. However, as a checklist for building a project plan, I think it’s a good list, and the risk management section is particularly useful because it demonstrates what can be achieved once a solid plan is in place.
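The first two risk-management items on the checklist – establishing a critical path and identifying float – can be sketched with a forward and backward pass over the activity network. The activities, durations, and dependencies below are hypothetical illustrations, not from the GAO report:

```python
from typing import Dict, List, Tuple

def critical_path(durations: Dict[str, int],
                  deps: Dict[str, List[str]]) -> Tuple[int, Dict[str, int]]:
    """Return (project duration, float per activity) for an activity network."""
    # Forward pass: earliest finish of each activity.
    earliest: Dict[str, int] = {}
    def ef(a: str) -> int:
        if a not in earliest:
            earliest[a] = durations[a] + max(
                (ef(d) for d in deps.get(a, [])), default=0)
        return earliest[a]
    for a in durations:
        ef(a)
    total = max(earliest.values())
    # Backward pass: latest finish that doesn't delay the project.
    latest = {a: total for a in durations}
    for a in sorted(durations, key=earliest.get, reverse=True):
        for d in deps.get(a, []):
            latest[d] = min(latest[d], latest[a] - durations[a])
    # Float (slack) = latest finish minus earliest finish.
    return total, {a: latest[a] - earliest[a] for a in durations}

# Hypothetical activities (durations in weeks) and their dependencies:
durations = {"design": 3, "procure": 2, "build": 5, "test": 2}
deps = {"procure": ["design"], "build": ["design"], "test": ["procure", "build"]}
total, slack = critical_path(durations, deps)
print(total)   # 10: design -> build -> test
print(slack)   # zero-float activities lie on the critical path
```

Activities with zero float form the critical path; the reserves on the checklist would go to the high-risk activities on or near that path.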

How Good Is NASA At Project Management?

Source: NASA Ares Project Office

NASA, too, experiences schedule and cost overruns. The 10 NASA projects that have been in the implementation phase for several years have experienced cost overruns of 18.7% and launch delays of 15 months. Congress only required NASA to provide cost and schedule baselines in 2005, so no longer-term data is available. NASA’s projects are consistently one-of-a-kind and pioneering, so uncertainty is likely to be higher than for other sorts of projects.

These cost and schedule overruns are largely due to the following factors:

External dependencies

The primary external dependencies that cause problems for NASA are weather-related launch delays and issues with project partners. NASA projects with partners experienced delays averaging 18 months, versus 11 months for projects without partners.

Technological feasibility

As the Government Accountability Office assessment states: “Commitments were made to deliver capability without knowing whether the technologies needed could really work as intended.” This is so often a cause of project failure; see my articles on the Sydney Opera House, the Denver Airport Baggage System and many others for examples of how common this cause of failure is.

Failure to achieve stable designs at Critical Design Reviews

90% design stability at the Critical Design Review is cited by the Government Accountability Office as a goal for successful projects, which is consistent with NASA’s Systems Engineering Handbook. Without this, designs are not sufficiently robust to execute against. It’s clear that NASA takes Critical Design Reviews seriously, but it doesn’t always achieve 90% stability. The exact value varies across projects, but appears to be in the 70-90% range for most NASA projects. Raising the stability level at these Critical Design Reviews would reduce project risk.

Though long, the Government Accountability Office report contains lots of interesting further detail and can be found here.

Better Forecasting

Projects fail often – most studies find failure rates above 30%, depending on the exact definition of failure – and since budget overruns often cause failure, it seems obvious that better cost forecasts would reduce project failure. Bent Flyvbjerg demonstrates a sound method for improving cost forecasts here in the 2006 Project Management Journal; it’s a fairly long and academic article, so I’ll provide a brief summary.

The approach Flyvbjerg suggests is called reference class forecasting. It’s a simple and elegant solution to the problem of overconfidence in forecasting. In essence, for any project, you first estimate the cost using normal methods; this still requires all the effort and process you would normally invest to develop a sound cost forecast. You then find a set of comparable projects (a “reference class”) – large enough to be statistically significant, but small enough that the projects are similar to the one you’re undertaking. In practice, getting such data is relatively hard unless your organization conducts rigorous post-mortems consistently, but the article cites distributions for certain project classes; for example, rail projects exceed budget by 40% on average. You then gross up your cost estimate by this number, and your estimate will be much more reliable.
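The adjustment itself is just a gross-up; a minimal sketch, where the 40% rail overrun comes from the article but the project and its base estimate are hypothetical:

```python
def reference_class_adjust(inside_view_estimate: float,
                           average_overrun: float) -> float:
    """Gross up an 'inside view' estimate by the average overrun observed
    in a reference class of comparable projects."""
    return inside_view_estimate * (1 + average_overrun)

# A hypothetical rail project estimated at $500m using normal methods,
# adjusted by the 40% average overrun cited for rail projects:
print(reference_class_adjust(500.0, 0.40))
```

Flyvbjerg’s full method uses the whole overrun distribution (picking an uplift for your acceptable level of risk), not just the mean, but the principle is the same.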

Reference class forecasting may seem fatalistic or too simple, but it is far more reliable than existing methods. Just as most car owners believe they are better drivers than average, project managers have excessive confidence in their own estimates, as the literature on project failure rates shows. Flyvbjerg describes this approach as taking the “outside view”, looking across projects, rather than dwelling on the “inside view”, the details of the specific project. It is more useful to compare a project with a broad set of similar projects than to obsess over the details of your own.

Of course, there are caveats. If you believe overruns result from poor forecasting, then this is an effective solution; but if overruns stem from deliberately low-balling costs in order to get a project up and running, then reference class forecasting won’t solve that problem.

BP’s Project Management of the Deepwater Disaster

Deepwater Horizon Oil Rig - source: SkyTruth (via Flickr)

The disaster is probably the second largest oil spill in history (after the Lakeview Gusher of 1910-1911 in California). The efforts to address the leak can be treated as a project. BP has been widely criticized for its management of the disaster; this post analyzes BP’s sequence of media statements to determine what went wrong at the project level, rather than just in its PR efforts. Several factors are apparent.

Firstly, BP initially underestimated the scale of the disaster and overestimated its ability to address it. There was no initial burst of action resembling crisis management. Two days after the explosion BP had mobilized 32 vessels and 4 aircraft; three days later the number of aircraft had increased by one and the number of vessels was unchanged. BP ultimately needed 205 times the number of vessels and 32 times the number of aircraft it initially deployed – the scale of the final response relative to BP’s initial reaction is stark. BP went with 32 boats initially and ultimately needed 24 times the number of ships in the entire US Navy.

This understated reaction appears to have been driven by the belief that the well was leaking 5,000 barrels a day, when in reality the leak was roughly 10 times that. Of course, estimation is hard at the best of times, but a public underwater feed and a panel of experts to analyze the flow rate weren’t in place until day 31, and then only through the actions of the government rather than BP.

The apparent focus on repairing the failed blowout preventer for the first ten days after the explosion delayed innovative ideas for a back-up solution. Secondly, alternative solutions appear to have been explored sequentially rather than in parallel, which caused further delay. The exception was the digging of relief wells, which did commence early in the process but was always known to take months to complete.

Overall, one is left with the impression that BP didn’t understand (or didn’t want to understand) the scale of the project it was involved in, and that this continually colored its reaction to the disaster.

Vessels deployed in oil clean up (y axis) vs. time (x axis)

Below are excerpts from various press releases issued by BP during the crisis; the comments in bold are mine.

Full Timeline

The first press release came on April 20 (though the rig didn’t sink until 2 days later):

“Transocean Ltd. (NYSE: RIG) (SIX: RIGN) today reported a fire onboard its semisubmersible drilling rig Deepwater Horizon. The incident occurred April 20, 2010 at approximately 10:00 p.m. central time in the United States Gulf of Mexico. The rig was located approximately 41 miles offshore Louisiana on Mississippi Canyon block 252.”

Day 2 – BP mobilizes initial response

BP has mobilized a flotilla of vessels and resources that includes:

  • significant mechanical recovery capacity;
  • 32 spill response vessels including a large storage barge;
  • skimming capacity of more than 171,000 barrels per day, with more available if needed;
  • offshore storage capacity of 122,000 barrels and additional 175,000 barrels available and on standby;
  • supplies of more than 100,000 gallons of dispersants and four aircraft ready to spray dispersant to the spill, and the pre-approval of the US Coast Guard to use them;
  • 500,000 feet of boom increasing  to 1,000,000 feet of boom by day’s end;
  • pre-planned forecasting of 48-hour spill trajectory which indicates spilled oil will remain well offshore during that period;
  • pre-planned staging of resources for protection of environmentally sensitive areas.

Day 5 – BP had added one additional aircraft to the effort:

Equipment available for the effort includes:

  • 5 aircraft (helicopters and fixed wing including a large payload capacity C-130 (Hercules) for dispersant deployment).

Day 6 –  plans for a relief well and undersea investigation were in place:

BP, as lease operator of MC252, also continues to work below the surface on Transocean’s subsea equipment using remotely operated vehicles to monitor the Macondo/MC252 exploration well, and is working to activate the blow-out preventer.

The Transocean drilling rig Development Driller III will arrive on location today to drill the first of two relief wells to permanently secure the well. A second drilling rig, Transocean’s Discoverer Enterprise, is en route.

Day 7 – work starts on using non-traditional means to stop the leak, and the number of surface response vessels has doubled. Interestingly, the amount of boom available has fallen relative to the estimate given two days after the spill, perhaps due to coordination and communication problems in the immediate aftermath of the incident:

In parallel with these offshore efforts, advanced engineering design and fabrication of a subsea oil collection system has started onshore. This will be the first time this proven shallow water technology has been adapted for the deepwater. It is expected to be ready for deployment within the next four weeks.

  • More than 100,000 feet of boom (barrier) has been assigned to contain the spill. An additional 286,760 feet is available and 320,460 feet has been ordered.
  • 69 response vessels are being used including skimmers, tugs, barges and recovery vessels.
  • 76,104 gallons of dispersant have been deployed and an additional 89,746 gallons are available.

Day 10 – the problem with the blowout preventer is becoming apparent and it is realized oil may reach the shore:

BP has called on expertise from other companies including Exxon, Shell, Chevron and Anadarko to help it activate the blow out preventer, and to offer technical support on other aspects of the response.

BP announced today it has launched the next phase of its effort to contain and clean up the Gulf of Mexico oil spill, with a significant expansion of onshore preparations in case spilled oil should reach the coast.

And the estimate of the oil leak was very low (about 8-14% of the actual figure the technical group subsequently estimated)

Efforts to stem the flow of oil from the well, currently estimated at up to 5,000 barrels a day, are continuing with six remotely-operated vehicles (ROVs) continuing to attempt to activate the blow out preventer (BOP) on the sea bed.

Day 14 – work on the relief well starts and plans to cap the well are in place:

BP today announced that work has begun to drill a relief well to intercept and isolate the oil well that is spilling oil in the US Gulf of Mexico. The drilling began at 15:00CDT (21:00BST) on Sunday May 2.

Rapid progress is also being made in constructing a coffer dam, or containment canopy. A 14 x 24 x 40 foot steel canopy has already been fabricated and other-sized canopies are under construction and being sourced. Once lowered over the leak site and connected by pipe, the canopy is designed to channel the flow of oil from the subsea to the surface where it could be processed and stored safely on board a specialist vessel.

Day 15 – one of the three leaks is blocked:

BP today announced that it has stopped the flow of oil from one of the three existing leak points on the damaged MC252 oil well and riser in the Gulf of Mexico. While this is not expected to affect the overall rate of flow from the well, it is expected to reduce the complexity of the situation being dealt with on the seabed.

Day 20 – the large containment dome has failed, a smaller containment dome and “top kill” are cited as the next tactics

The containment dome that was deployed last week has been parked away from the spill area on the sea bed. Efforts to place it over the main leak point were suspended at the weekend as a build up of hydrates prevented a successful placement of the dome over the spill area.

In addition, further work on the blow out preventer has positioned us to attempt a “top kill” option aimed at stopping the flow of oil from the well. This option will be pursued in parallel with the smaller containment dome over the next two weeks.

Day 23 – 530 surface vessels are now working on the spill (a 16x increase over the initial response)

Work continues to collect and disperse oil that has reached the surface of the sea. Over 530 vessels are involved in the response effort, including skimmers, tugs, barges and recovery vessels.

Day 27 – 650 surface vessels are now in use

Work continues to collect and disperse oil that has reached the surface of the sea. Over 650 vessels are involved in the response effort, including skimmers, tugs, barges and recovery vessels.

Day 28 – 750 surface vessels are now in use

Work continues to collect and disperse oil that has reached the surface of the sea. Over 750 vessels are involved in the response effort, including skimmers, tugs, barges and recovery vessels.

Day 30 – 930 surface vessels are now in use

Work continues to collect and disperse oil that has reached the surface of the sea. Over 930 vessels are involved in the response effort, including skimmers, tugs, barges and recovery vessels.

Day 31 – underwater live feed added and government appoints a team to measure flow rate

BP has been providing a live feed to government entities over the last two weeks – including the US Department of the Interior, US Coast Guard, Minerals Management Service (MMS) through the Unified Area Command center in Louisiana – as well as to BP and industry scientists and engineers involved in the effort to stop the spill.

The US government has created a Flow Rate Technical Team (FRTT) to develop a more precise estimate. The FRTT includes the US Coast Guard, NOAA, MMS, Department of Energy (DOE) and the US Geological Survey. The FRTT is mandated to produce a report by close of business on Saturday, May 22.

Day 34 – 1,100 surface vessels are now in use

Work continues to collect and disperse oil that has reached the surface of the sea. Over 1,100 vessels are involved in the response effort, including skimmers, tugs, barges and recovery vessels.

Day 35 – alternatives to the top kill method are considered

Being progressed in parallel with plans for the top kill is development of a lower marine riser package (or LMRP) cap containment option. This would first involve removing the damaged riser from the top of the BOP, leaving a cleanly-cut pipe at the top of the BOP’s LMRP. The LMRP cap, an engineered containment device with a sealing grommet, would be connected to a riser from the Discoverer Enterprise drillship and then placed over the LMRP with the intention of capturing most of the oil and gas flowing from the well and transporting it to the drillship on the surface. The LMRP cap is already on site and it is anticipated that this option will be available for deployment by the end of May.

Additional options also continue to be progressed, including the option of lowering a second blow-out preventer, or a valve, on top of the MC 252 BOP

Day 36 – “Top kill” starts

BP started the “top kill” operations today to stop the flow of oil from the MC252 well in the Gulf of Mexico.

Day 38 – 1,300 surface vessels are now in use

Almost 1,300 vessels are now involved in the response effort, including skimmers, tugs, barges and recovery vessels.

Day 39 – “Top kill” fails

Despite successfully pumping a total of over 30,000 barrels of heavy mud, in three attempts at rates of up to 80 barrels a minute, and deploying a wide range of different bridging materials, the operation did not overcome the flow from the well.

Day 41 – Lower Marine Riser Extended

BP announced today that, after extensive consultation with National Incident Commander Admiral Thad Allen and other members of the Federal government, it plans to further enhance the lower marine riser package (“LMRP”) containment system currently scheduled to be deployed this week with further measures that are expected to keep additional oil out of the Gulf of Mexico.

Day 42 – 1,600 vessels are now in use

Over 1,600 vessels are now involved in the response effort, including skimmers, tugs, barges and recovery vessels.

Day 42 – 2,600 vessels are now in use

More than 2,600 vessels are now involved in the response effort, including skimmers, tugs, barges and recovery vessels.

Day 50 – 3,600 vessels are now in use

Almost 3,600 vessels are now involved in the response effort, including skimmers, tugs, barges and recovery vessels.

Day 52 – additional containment cap in place

BP announced today that oil and gas is flowing through a second containment system attached to the Deepwater Horizon rig’s failed blow out preventer (BOP).

Day 59 – 4,500 vessels are now in use, 1,330 miles of containment boom

Approximately 37,000 personnel, more than 4,500 vessels and some 100 aircraft are now engaged in the response effort.

The total length of containment boom deployed as part of efforts to prevent oil from reaching the coast is now almost 2.8 million feet (530 miles), and about 4.2 million feet (800 miles) of sorbent boom also has been deployed.

Day 68 – 5,000 vessels in use

Over 39,000 personnel, almost 5,000 vessels and some 110 aircraft are now engaged in the response effort.

Day 75 – 6,563 vessels in use

Approximately 44,500 personnel, more than 6,563 vessels and some 113 aircraft are now engaged in the response effort.

Day 89 – 6,470 vessels in use

Approximately 43,100 personnel, more than 6,470 vessels and dozens of aircraft are engaged in the response effort.

Day 91 – sealing cap in place

Following approval from the National Incident Commander, BP began replacing the existing lower marine riser package (LMRP) containment cap over the Deepwater Horizon’s failed blow-out preventer with a new sealing cap assembly.

Day 97 – CEO steps down

BP today announced that, by mutual agreement with the BP board, Tony Hayward is to step down as group chief executive with effect from October 1, 2010. He will be succeeded as of that date by fellow executive director Robert Dudley.

The Deepwater Horizon Disaster and Estimation Techniques

Effective estimation of time, money and other resources is key to effective project management. For that reason it’s interesting to look at best practices, and a hotly debated estimate over the past 2 months has been how much oil is flowing from the Deepwater Horizon spill.

A Flow Rate Technical Group has been established to estimate this number, and the approach they are taking is detailed here.

Basically, the panel of experts is splitting into sub-groups, using different methods. These estimates will then be combined into one overall estimate.

The methods they are using are:

  • Plume Modelling – looking at video of the oil escaping in the water
  • Mass Balancing – looking at satellite data of the volume of oil on the surface adjusting it for any oil that hasn’t reached or left the surface
  • Reservoir Modelling – analyzing the composition of the oil reservoir under the seabed and determining oil pressure and hence flow rate
  • Nodal Analysis – examining the leak points on the seabed and calculating flow based on that
  • Woods Hole Analysis – using acoustic technologies to collect data close to the leak source

This calculation of numerous estimates using independent techniques is best practice in estimation. As you produce estimates, explore different techniques to create independent estimates; the overall estimate is likely to be more robust as a result.
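One common way to combine independent estimates is an inverse-variance weighted mean, so that more certain methods count for more. The flow-rate figures and variances below are hypothetical illustrations, not the Flow Rate Technical Group’s actual numbers or method:

```python
def combine_estimates(estimates, variances):
    """Inverse-variance weighted mean of independent estimates."""
    weights = [1.0 / v for v in variances]
    return sum(w * e for w, e in zip(weights, estimates)) / sum(weights)

# Hypothetical flow-rate estimates (thousand barrels per day) from three
# independent methods, with a rough variance for each method:
flow_estimates = [60.0, 50.0, 55.0]
flow_variances = [100.0, 25.0, 25.0]
print(combine_estimates(flow_estimates, flow_variances))
```

Even a plain average of independent estimates is usually more robust than any single method; weighting by confidence simply sharpens that effect.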