Columbia Disaster: Uncovering NASA's organisational failures

“Too often, accident investigations blame a failure only on the last step in a complex process, when a more comprehensive understanding of that process could reveal that earlier steps might be equally or even more culpable” (CAIB, 2003, p.6).

On Saturday February 1, 2003, the Space Shuttle Columbia disintegrated during re-entry into the Earth’s atmosphere after a 17-day mission in space, with the loss of its seven-member crew. Debris from the shuttle was scattered over 2000 square miles of east Texas.

This was the Space Shuttle Program’s 113th flight and Columbia’s 28th. Columbia was the first space-rated Orbiter and differed slightly from Challenger, Discovery, Atlantis, and Endeavour.

Immediate causes

On January 16, 2003, at 81.7 seconds after launch, when the shuttle was at about 65,600 feet and traveling at 1,650 mph, a large piece of hand-crafted insulating foam came off the external fuel tank, striking the leading edge of Columbia’s left wing. This event was not detected by either the crew on board or the ground support teams until the next day.

The thermal protection system on the leading edge of the wings was designed to withstand heat, not impact from debris or ice. The damage caused by this impact allowed superheated air to penetrate through the leading edge insulation during re-entry. This breach progressively melted the aluminum structure of the left wing, causing a loss of control at 10,000 mph, failure of the left wing, and finally breakup of the shuttle. The design of the shuttle meant that this was not a survivable event.

NASA response during mission

After some discussion, the impact was declared not to be significant by NASA management. No inspection of the thermal protection system was made, nor were any contingency plans made. NASA considered it implausible that foam could cause significant damage to the wing leading edge system. NASA sent a video of the take-off to the crew stating that: “thermal analyses indicate possible localised structural damage, but no burn-through and no safety of flight issue”.

Damage assessment meetings were held, but NASA could not provide the Columbia Accident Investigation Board (CAIB) with details on the flow of key information from these meetings. It also had difficulty providing information on the relationships between departments in the space program, which had become a sprawling, compartmentalized agency.

The Debris Assessment Team requested detailed photographs, but this request was not approved. Engineers continued to discuss the damage, but the discussions remained within their ‘area of expertise and level of seniority’. None of the warnings of potential danger moved up through the organisation to senior officials, or down to flight control. Engineers who voiced concerns were marginalised. Status was seen to be more important than technical expertise in NASA’s decision making; and dissenting views were often ignored. For example, when further imaging of the Columbia was requested by technical staff, managers asked “Who’s requesting the photos?” instead of assessing the merits of the request.

Nothing new to see here . . .

Damage from foam debris was seen on 79 of 113 Shuttle flights from 1981 to 2003 and yet shuttles continued to arrive home safely. It is perhaps because of this long history of earlier damage that didn’t appear to affect flight safety that NASA ‘rationalised the danger’ and operated on the principle that ‘nothing bad has happened yet’. Foam debris was seen as a maintenance or turnaround issue, rather than a flight safety issue.

“With each successful landing, it appears that NASA engineers and managers increasingly regarded the foam-shedding as inevitable, and as either unlikely to jeopardize safety or simply an acceptable risk” (CAIB, 2003, p.122).

For more information on this issue, see my post on normalisation of deviance.

The crew were informed of the damage, but were told that there was no issue, as similar damage had been seen many times before. I note that the language used within NASA – these foam impacts were often known as ‘dings’ – almost serves to play down the danger. On January 23, the Flight Director sent an email to Columbia Commander Rick Husband and Pilot William McCool about the debris strike, which included the following:

“The impact appears to be totally on the lower surface and no particles are seen to traverse over the upper surface of the wing. Experts have reviewed the high speed photography and there is no concern for RCC or tile damage. We have seen this same phenomenon on several other flights and there is absolutely no concern for entry. That is all for now. It’s a pleasure working with you every day” (CAIB, 2003, p.159).

Note that foam debris impact on the shuttle was discussed prior to the Challenger accident in 1986 and in the investigation into Challenger itself, 17 years before Columbia.

Culture

“Unfortunately, NASA’s views of its safety culture . . . did not reflect reality”
CAIB, 2003, p.177

The Columbia Accident Investigation Board (CAIB) reports that NASA’s organisational culture had as much to do with this accident as the foam did. The Board concluded that NASA’s organisation at the time did not provide effective checks and balances, did not have an independent safety program, and had not demonstrated the characteristics of a learning organization. In the Board’s view, the hole in the wing was produced not simply by debris, but by holes in organisational decision making.

I was interested to read that management required technical experts to prove that the debris strike created a safety-of-flight issue. Engineers had to produce evidence that the system was unsafe, rather than prove that it was safe.

Staffing levels, workload and organisational change

In 2001, one experienced observer of the space program described the shuttle workforce as “The Few, the Tired” (CAIB, 2003, p.118).

In the 1990s, the overall NASA workforce was reduced by 25%, and a hiring freeze was in place. This led to a skill imbalance, increased workload and stress on those staff remaining; and it was recognised that this had a potential impact on operations and safety. During the same period, many NASA personnel were transferred from government functions to the private sector. Many key technical areas were described as being ‘one deep’ (in other words, just one person in each technical area).

Missed opportunities

Several meetings were held to discuss the possible damage to Columbia from the foam strike; and the possibility of obtaining further images. However, discussions often revolved around the fact that previous strikes had never been classified as flight safety issues. Previous consensus had been reached – and it seemed that no-one wanted to challenge this position. Several email exchanges about this issue are available in the CAIB report.

Further imaging was not requested (from the Department of Defense), despite several technical experts thinking it necessary. At one point, engineers were tasked with proving that further images were needed – but they needed those images in order to make that case. . . Managers made the decisions and technical experts did not want to challenge them, or jump the chain of command. There was a ‘cultural fence’ that prevented open communications between mission managers and working engineers. The CAIB stated that there was a lack of effective leadership – particularly in displaying no interest in understanding a problem and its implications for flight safety.

“Program managers failed to fulfill the implicit contract to do whatever is possible to ensure the safety of the crew. In fact, their management techniques unknowingly imposed barriers that kept at bay both engineering concerns and dissenting views, and ultimately helped create “blind spots” that prevented them from seeing the danger the foam strike posed” (CAIB, 2003, p.170).

Uncertainty – and its impact on budgets

In the decade or so before the Columbia incident, there was considerable uncertainty as to when (and how) to replace the Space Shuttle. This uncertainty led to delaying and deferring various upgrades and investments in the shuttle program and the ground infrastructure, based on the assumption that they would be a waste of money if the shuttle were to be retired in the near future; and yet the date of a possible replacement program continued to shift to the right. Funds from the shuttle program were reallocated to pressing ground infrastructure repairs, when additional funding requests for these were refused.

There are significant similarities here to the extension of infrastructure and facilities in the oil and gas industry – as technology has enabled previously nonviable fields to become economically viable, along with the discovery of additional fields, the industry has embarked on ‘life extension’ programs. NASA was faced with having to embark on its own Shuttle Service Life Extension program, and having to deal with years of infrastructure neglect.

Schedule pressure

A key date in the shuttle program was the delivery of a section of the International Space Station called “Node 2”. This would configure the International Space Station to its “U.S. Core Complete” status. In order to regain credibility with the White House and Congress, and to continue support for subsequent Space Station growth, the delivery of Node 2 (on February 19, 2004) became a “line in the sand”. This resulted in pressure to meet an increasingly ambitious launch schedule.

This date became key in many communications; and any delay in a shuttle launch would impact on the ability to meet this deadline. NASA Headquarters distributed a computer screensaver, counting down to February 19, 2004, to all NASA employees. NASA is known for its can-do attitude, and clearly no-one wanted to be the one to stand up and say, “We can’t make that date”.

The February 19, 2004 deadline (which was increasingly being seen as one which NASA wouldn’t be able to make, as any padding in the schedules disappeared) most probably had an effect on the decision-making following knowledge of the debris strike on Columbia. If the debris strike was seen as anything more than a simple maintenance issue, it would impact on future launches – and the February 19, 2004 deadline. The investigation indicates that any discussion about the damage to Columbia was not in relation to any threat to Columbia, but whether this would threaten the overall shuttle schedule. There was no slack or schedule margin that could accommodate such unforeseen problems.

What we are seeing here is a gradual degradation in safety that is not uncommon in organisations; something that has become known as “organisational drift”:

“Little by little, NASA was accepting more and more risk in order to stay on schedule”
CAIB, 2003, p.101

History repeats itself. . .

On January 28, 1986, Shuttle Challenger was lost on the program’s 25th mission, breaking apart 73 seconds after launch. In this event, an O-ring in a solid rocket booster failed, allowing hot gases to escape. A high-school teacher was among the seven crew killed in this disaster, providing further evidence that the shuttle program was seen as ‘operational’, despite many seeing the shuttle as a developmental program.

The Rogers Commission into Challenger showed that senior NASA managers were unaware of the O-ring debates before launch. The investigation revealed a NASA culture that had gradually begun to accept escalating risk, and a NASA safety program that was largely silent and ineffective.

“By the eve of the Columbia accident, institutional practices that were in effect at the time of the Challenger accident – such as inadequate concern over deviations from expected performance, a silent safety program, and schedule pressure – had returned to NASA” (CAIB, 2003, p.101).

Key conclusions of the Rogers Commission included:

an accumulation of organisational problems;
management complacency;
bureaucratic interactions;
flaws in the decision-making process;
inadequate trend analysis;
misrepresentation of criticality;
lack of adequate resources devoted to safety;
lack of safety personnel involvement in important discussions and decisions; and
inadequate independence of the safety organisation.

Although the immediate cause in Columbia was very different to Challenger, the disaster seventeen years later resulted from the same engineering, organisational and cultural deficiencies. A great deal of both the Rogers Commission report and the CAIB report discusses these flaws in the NASA organisational culture.

High Reliability Organisations

The CAIB compared the NASA organisation with a set of principles obtained from various theories, such as Normal Accidents and High Reliability Organisations (HROs). I consider that a key finding from the comparison with other organisations that successfully operate high-risk technologies is that they give technical and safety functions the same priority as schedule or cost functions.

Finally – beware ‘hindsight bias’

We must be careful not to conclude that NASA is a ‘bad’ organisation that is negligent of obvious risks. If the cause is seen as a ‘bad organisation’, the same difficulties that led to Challenger and Columbia will go unrecognised in other organisations, in the same way that BP should not be seen as different to any other oil company (following Grangemouth, Texas City, Macondo etc.).

Investigation reports

Columbia Accident Investigation Board (August 2003). This report was in my opinion the most comprehensive discussion of organisational and management issues until the publication of the Nimrod Inquiry Report. It discusses political and commercial pressures, organisational structures, safety culture, decision-making and leadership.

Rogers Commission (1986). Report of the Presidential Commission on the Space Shuttle Challenger Accident, William P. Rogers (Chair), U.S. Government Printing Office, Washington, D.C.
