Tuesday, October 29, 2013

The seven deadly sins of HealthCare.gov

A data center failure is the latest in Obamacare site's trail of tears.

Everyone—even the CTO of President Obama's successful second presidential campaign—seems to have something to say about why HealthCare.gov experiences so much trouble. Today's news that the Affordable Care Act website and supporting IT infrastructure suffered from a data center outage piled more pain upon a project that members of the "tech surge" team now say will take at least another month to put in order.
The data center, operated by Verizon's Terremark unit, went down on Sunday when an equipment failure cut its Internet connection. Connectivity was restored Monday morning, and services were gradually brought back online.
Data center outages happen to almost everyone in the cloud business, as Amazon, Google, and Microsoft can attest. But HealthCare.gov's deployment is particularly vulnerable to outages because it runs out of a single Verizon data center. That's just one piece of a larger problem, however: rather than turning to private industry for best practices in running a high-volume e-commerce website, the government's team embraced the opposite approach.
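To illustrate the single-data-center point: a service spread across independent facilities can route around an outage, while a single-site deployment cannot. The sketch below is hypothetical (the endpoint URLs and health-check path are invented, and nothing here reflects HealthCare.gov's actual architecture); it only shows the shape of a client-side failover check.

```python
import urllib.request
from urllib.error import URLError

# Hypothetical endpoints; the real site's topology is not public.
ENDPOINTS = [
    "https://dc-east.example.gov/healthz",  # primary data center
    "https://dc-west.example.gov/healthz",  # independent secondary
]

def first_healthy(endpoints, timeout=2.0):
    """Return the first endpoint that answers its health check, or None.

    With only one entry in the list, any outage is total: that is the
    single-data-center problem described above.
    """
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (URLError, OSError):
            continue  # this data center is unreachable; try the next one
    return None
```

With a second, independently connected facility in the list, Sunday's equipment failure would have degraded capacity rather than taken the site offline.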
Of course, they were following the same approach that big businesses have followed for decades with their big IT projects. Having watched a fair number of corporate IT projects go awry, both as a journalist and as an unwilling participant during my days as a system integrator and a corporate IT project manager, I find plenty of things in the HealthCare.gov debacle all too familiar.

Worst practices

Government IT, as Ars previously reported, is no stranger to albatross IT projects. The federal government, and the US Chief Information Officer and Office of Management and Budget in particular, have tried to fix the chronic ills of big, bad IT by applying metrics, dashboards, and reviews. For a brief moment, the HealthCare.gov project even showed up on the radar as a risky proposition. But the metrics that put it there were only tangentially related to the actual problems with the project itself. They focused specifically on cost and scheduling, not on the actual functionality of the system.
The real problems with HealthCare.gov are related to the "worst practices" that went into the project nearly from the beginning. Each of these missteps, combined with the generally hostile atmosphere in Washington surrounding the Affordable Care Act, nearly guaranteed HealthCare.gov would be late, broken, or both:
1) Hyper-Complexity. The HealthCare.gov project was an amalgam of three major contracts, each with its own contractor and set of deliverables: a new e-commerce site, a new information middleware infrastructure, and a hosted data center integration project.
2) Dependency issues. The whole system depended on identity data provided by Experian, a data source that neither the government nor the other contractors could do any data-quality work on. Without a way to handle exceptions in Experian's data, such as a street-address mismatch caused by a misspelling or simply stale records, the site experienced many early headaches.
3) All new construction. Many government IT projects, particularly ones that are created as the result of specific legislation, require the construction of an entirely new infrastructure. HealthCare.gov had the complexity multiplier of being based in software and systems—the "data hub" middleware that tied the site to the systems of insurance providers in particular—that had never been used live before.
4) Rolling requirements. The project's specifications were delayed repeatedly, then changed frequently, right up to within a month of the target release date. The continual tweaking forced late design changes.
5) Anti-testing.  Since the requirements kept changing up until the last minute, there was no way to do full site testing until mere weeks before the release date. It's not clear if any real-world data was used in testing the customer validation piece of the site since that would have required hitting actual Experian credit data. There was no limited release of the system for "beta testing" among a select audience, aside from the demo done by President Obama. All this despite the fact that nearly every component of the site was brand new and unproven against any real-world load.
6) Release late and once. Instead of rolling out features gradually, starting with information on what subsidies people could expect, the government committed to an all-or-nothing release date. As a result, there was no way to test the site's performance under full load, and the feds couldn't gradually scale up infrastructure based on experience and testing.
7) Anti-bugfixing. There was, based on statements from the government, no effective way to manage bug tracking across the multiple components of the site. This meant no way to identify root causes of issues and prioritize fixes at the time of launch. Instead, the contractors implemented all this after the launch. And while the data center provider had been certified as complying with government security requirements, it's not clear that there was ever any realistic capacity planning done because everything was so new.
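The address mismatches in point 2 are a classic data-quality failure: exact string comparison treats any misspelling or abbreviation as a non-match. One standard mitigation is to normalize both strings and then accept near-matches above a similarity threshold. A minimal sketch of that technique, assuming nothing about the real system (the function names, abbreviation table, and 0.85 threshold are all illustrative):

```python
import difflib

def normalize(address: str) -> str:
    """Collapse case, punctuation, and common abbreviations before comparing."""
    abbrevs = {"street": "st", "avenue": "ave", "road": "rd", "north": "n"}
    cleaned = "".join(ch for ch in address.lower() if ch.isalnum() or ch.isspace())
    return " ".join(abbrevs.get(word, word) for word in cleaned.split())

def addresses_match(applicant: str, reference: str, threshold: float = 0.85) -> bool:
    """Treat near-identical strings as a match instead of a hard failure."""
    a, b = normalize(applicant), normalize(reference)
    if a == b:
        return True
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

# A misspelling or abbreviation no longer causes a hard rejection:
addresses_match("123 Main Street", "123 Main St.")     # True (identical after normalizing)
addresses_match("123 Mian Street", "123 Main Street")  # True (similarity above threshold)
addresses_match("123 Main Street", "456 Oak Avenue")   # False
```

Exception handling like this can't be bolted on by the site's contractors alone, though, which is the point of the sin: without any leverage over Experian's data, every near-miss became a support case instead of a soft match.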
This is by no means an exhaustive list of the missteps that led to HealthCare.gov's failure to launch. But each of these is recognizable in any number of other government technology programs and resurfaced here.
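Point 6 (release late and once) contrasts with the staged rollouts common at large consumer sites, where a feature is switched on for a small, stable slice of users and the percentage is raised as confidence grows. A minimal sketch of deterministic percentage bucketing, the usual mechanism behind such rollouts (the names and feature key here are illustrative, not anything from the actual project):

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically bucket users so a feature can launch to 1%, 10%, ...

    Hashing (feature, user) gives each user a stable bucket from 0 to 99;
    raising `percent` gradually widens the audience without a big-bang launch,
    and the same user always gets the same answer for the same percentage.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# At 0% nobody sees the feature; at 100% everyone does.
```

A rollout like this would have let the team observe real load at 1 percent of traffic and fix capacity problems before the whole country arrived at once.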

Failure is not just an option—it comes standard

Mismanagement of major programs is something of a modern government tradition. In a 2008 report, the Government Accountability Office reported that 48 percent of federal IT projects had to be "rebaselined" (restructured because of cost overruns or changes in project goals), and more than half of those had to be reset in such a way two or more times. And as of 2008, 43 percent of the Department of Health and Human Services' major projects were on the Office of Management and Budget's "watch list" because of poor performance and other management concerns.
Mind you, these measurements are only based on GAO's criteria. They don't address whether projects that weren't rebaselined were actually successful in achieving their goals, or if they were ever fully implemented. Even the most successful federal IT projects are often obsolete by the time they come close to completion, meaning they get rolled into the next big improvement program to come along.
Big-bet IT projects have a history of causing trauma in the private sector as well.  A 2011 survey of business and IT executives found that 75 percent believed their projects were either usually or always "doomed from the start." Those expectations are firmly rooted in reality—research by Standish Group International found that in 2012, only 10 percent of projects with a value of over $10 million were successfully completed on time and within budget.
"Anyone who has written a line of code or built a system from the ground-up cannot be surprised or even mildly concerned that Healthcare.gov did not work out of the gate,” Standish Group International Chairman Jim Johnson said in a recent podcast.  “The real news would have been if it actually did work. The very fact that most of it did work at all is a success in itself."
In other words, HealthCare.gov's rollout was almost identical to any big Web-based ERP or CRM rollout at a major corporation. The one difference was that it happened in public, with a hostile audience waiting to crow about its failures.
