Today, Jeff Zients offered an operational progress report on our work to improve HealthCare.gov over the past five weeks.
In Jeff’s own words: “The bottom line: HealthCare.gov on December 1st is night and day from where it was on October 1st.”
We also released a Progress and Performance Report, which provides data on how the system is performing and can be viewed here.
We’ve provided weekly reports up until now, and today we want to highlight our work as we begin December, detailing the many measurable improvements we’ve made to the site while acknowledging that more work remains to be done.
As we’ve said, with any web project there is not a magic moment but a process of continual improvement over time and we will continue to work to make enhancements in the days, weeks and months ahead.
As we begin December with a vastly improved web experience, we are mindful of the work still to do. We need to make sure that consumers who experienced frustration over the past several weeks are able to resolve their issues and complete their enrollment, and to confirm that those who have enrolled know their next steps to get coverage. While our door is open for new consumers and we invite them in, we will pay particular attention to those who still need questions answered in order to complete their enrollment. We are two months into a sustained six-month outreach and education campaign that will continue through the end of March, so that consumers have time to access and enroll in affordable health care options that best meet their needs and budget.
Before diving into the data, it’s important to provide some context for the Progress and Performance Report.
In mid-October, the President, Secretary Kathleen Sebelius, and Administrator Marilyn Tavenner asked Jeff Zients to provide short-term management assistance on HealthCare.gov.
We started by bringing in technology experts from across government agencies and from the private sector to conduct an assessment of HealthCare.gov.
The assessment highlighted a number of significant problems, most notably an unacceptable user experience marked by very slow response times, inexplicable user error messages and frequent website crashes and system outages.
At the same time, the team identified the root cause problems that needed to be addressed to fix the site.
These root causes included: hundreds of software bugs, inadequate hardware and infrastructure, and a general lack of system monitoring and incident response capabilities.
The assessment also identified weaknesses in how the project was being managed, with slow decision making and diffuse or unclear accountability.
With these root causes identified, the conclusion was that HealthCare.gov was fixable, if significant changes were made to the management approach and if we executed against the lengthy punch list of software and hardware fixes with relentless focus and discipline.
In short, we needed to get the team working with the speed and urgency of a high-performing private sector tech company.
The first key change made was appointing QSSI as the General Contractor and Systems Integrator.
QSSI has provided project management expertise, and coordinates the work with CMS and other contractors. They have also provided fresh eyes, talent, and dedicated teams of experts focused on system monitoring, software fixes, and hardware upgrades.
Working with QSSI, we instituted a new management structure to have clear accountability and rapid decision making. This management structure is centered on a Command Center that includes senior leaders from CMS and each contractor and vendor involved in HealthCare.gov.
The Command Center leadership monitors the site’s performance in real time, evaluating key metrics and dashboards. There are examples of those 24 hour monitoring dashboards on page 4 of the report.
The Command Center team focuses on site monitoring and incident response around the clock. Twice a day, the Command Center hosts standup war room meetings for real-time, data-driven decision making and prioritization of key hardware and software fixes. There is an open line -- or bridge -- connecting the Command Center with all the key programmers and managers working on the system, so that 24 hours a day we have rapid, effective response to any issues or problems the instant they appear.
This clear accountability, prioritization and quick decision making is central to the progress we’ve made in improving the site’s performance.
In addition to implementing this new management structure and getting the team working with the velocity and discipline of a high-performing private sector company, we developed a prioritized punchlist of software fixes, hardware upgrades and user enhancements. Prioritization is based on what has the biggest impact on system stability, capacity and speed and user experience.
Over the last five weeks, we’ve made substantial progress working through the punchlist.
We’ve executed hundreds of software fixes and hardware upgrades, and the site is now stable and operating at its intended capacity, with greatly improved performance.
The report outlines some data on the fixes and upgrades and how the progress can be seen in the key operating metrics.
The top graphic on page 5 shows how the team has knocked more than 400 bugs and software improvements off the punchlist over the past two months. The pace accelerated greatly once we got the new management structure and discipline in place.
After the team cleared fewer than 100 bugs across the entire month of October, the pace has more than tripled, with over 400 bugs now fixed. These fixes have eliminated critical glitches and improved the consumer experience throughout the site. They include more than 50 bug fixes that were installed just last night, many of which improved the back end of the system.
So a total of more than 400 software fixes have combined to improve the user’s experience as they look for information, fill out applications, shop, and enroll.
At the same time that we’ve been working through the software items on the punchlist, the team has made substantial improvements to the underlying hardware infrastructure.
On that front, the team has executed a series of upgrades to key components of the system that have increased redundancy, reliability and scale (bottom of page 5).
There are four components that needed a lot of work. First, there’s the front end of the system, the Registration Database. This is where we were experiencing a large bottleneck when the site first launched.
As consumers attempted to create accounts and log on to the site, they ran into error messages and website crashes.
In the last few weeks, the team has re-architected the design of the registration system and installed new, dedicated hardware, including a major new upgrade this past Friday night.
Altogether, this has more than quadrupled the throughput of the registration database, so that many more users can successfully create new accounts and log on.
In effect, we’ve widened the system’s on-ramp: it now has four lanes instead of one or two.
Beyond Registration, the entire site rests on a Core Database that enables consumers to shop, compare plans, and enroll.
Here, we’ve made two significant changes. We’ve deployed 12 large, dedicated servers. And we’ve significantly upgraded a storage – or memory – unit to improve response time. As a result, we’ve increased the system’s database throughput by more than 3 times.
Third, we’ve brought additional application environments online, more than doubling the website’s capacity.
And last, we’ve upgraded the firewall that protects the system. The team identified that the firewall was a constraining factor on the system’s capacity and throughput, so we upgraded and reconfigured it to allow more than 5 times the network throughput.
The cumulative effect of these hardware changes, along with others, is that the underlying infrastructure of HealthCare.gov is much stronger today than it was a few weeks ago.
The system is now able to handle its intended volume of consumer visits, and it has redundancy built in to avoid the type of instability that we saw in October.
The result of these efforts can be seen in the improvement in the site’s operating metrics, starting with response times (page 6).
Response time is the measure of how quickly a page responds to a user request. In late October, the average response time on HealthCare.gov was running around 8 seconds, which was clearly unacceptable and very frustrating for consumers. Driven by the software and hardware fixes, we now have much faster response times. Over the last three weeks, the average response time has been well under 1 second. This means that consumers are having a much faster, smoother experience on the site.
Page 6 also shows system error rate, another key operating metric.
This is the measure of how often, on a per page basis, the system times out or presents an error message. The team has made progress. In late October, the error rate was approximately 6%.
We got that down to about 2% by November 9th, to 1% by November 16th, and this past Friday, the average error rate was approximately 0.75%, or three quarters of one percent.
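To make the error-rate metric concrete, here is a minimal sketch of the per-page calculation described above. The function name and the page counts are hypothetical, chosen only to reproduce the rates quoted in this post; they are not figures from the report.

```python
def error_rate_percent(error_pages: int, total_pages: int) -> float:
    """Percent of page requests that timed out or returned an error message."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return 100.0 * error_pages / total_pages


# Hypothetical counts per 100,000 page requests, picked to match the quoted rates.
samples = {
    "late October": (6_000, 100_000),
    "November 9th": (2_000, 100_000),
    "November 16th": (1_000, 100_000),
    "this past Friday": (750, 100_000),
}

for label, (errors, pages) in samples.items():
    print(f"{label}: {error_rate_percent(errors, pages):.2f}%")
```

Measuring on a per-page basis, rather than per visit, means a single consumer who loads many pages contributes many data points, so the figure reflects the experience across the whole site.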
In addition to improving system speed and reducing the error rate, we’ve also made measurable progress increasing the system’s stability (page 7).
System stability, which is typically referred to as system uptime, is measured by the percentage of time the site is available on a given day, excluding planned downtime for scheduled maintenance.
HealthCare.gov is now seeing uptime consistently above 90%. For the week ending November 2nd, uptime was only 42.9%. In fact, that’s what we think the system averaged through most of October as well.
The uptime improved to 71.9% by November 9th, and has been consistently above 90% since then, including 95% uptime this past week. Again, this improvement in stability is driven by the hardware and software fixes, and we expect to see further improvements given the redundancy and capacity we’ve added to the system.
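As a rough illustration of the uptime definition above, the sketch below computes the share of scheduled time the site was available, with planned maintenance excluded from the total. The function name and the sample outage minutes are assumptions made for illustration, not data from the report.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percent(unplanned_down_min: float,
                   planned_down_min: float = 0.0,
                   period_min: float = MINUTES_PER_DAY) -> float:
    """Percent of scheduled time the site was available.

    Planned maintenance is removed from the period before computing
    the percentage, so it does not count against uptime.
    """
    scheduled_min = period_min - planned_down_min
    if scheduled_min <= 0:
        raise ValueError("planned downtime must be shorter than the period")
    if not 0 <= unplanned_down_min <= scheduled_min:
        raise ValueError("unplanned downtime must fit within the scheduled time")
    return 100.0 * (scheduled_min - unplanned_down_min) / scheduled_min


# Hypothetical day: 72 minutes of unplanned outage, no maintenance window.
print(f"{uptime_percent(72):.1f}%")

# Hypothetical day: 4-hour maintenance window plus 60 minutes of outage.
print(f"{uptime_percent(60, planned_down_min=240):.1f}%")
```

Under this definition, a day with 72 minutes of unplanned outage and no maintenance works out to 95% uptime, the same level quoted for this past week.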
And just as importantly, when we do experience system glitches or slowdowns, we can resolve issues much more quickly, due to the continuous monitoring and rapid response teams. Back in October, a typical system outage lasted several hours or more. Now, the team can generally diagnose root cause problems and make the necessary fixes within 60 minutes.
So we have a much more stable system that’s reliably open for business.
That’s important, because at the end of the day, we need high system uptime so consumers are able to use the system to seek information, fill out applications, shop and enroll. Critically, the result of all the improvements we’ve made is that we’ve doubled the system’s capacity, and HealthCare.gov can now support its intended volumes.
The chart on the bottom of page 7 outlines the simple math.
The site now has the capacity to handle 50,000 concurrent, or simultaneous, users. And we know that each visitor spends, on average, 20 to 30 minutes on the site per visit. So the site will support more than 800,000 consumer visits a day.
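The arithmetic behind a figure like this can be sketched as follows. The 50,000 concurrent users and the 20 to 30 minute visit length come from the numbers above; the `busy_hours` parameter is an assumption introduced here to show how the daily total depends on how many hours the site runs at full load. The chart in the report may rest on different assumptions.

```python
def max_daily_visits(concurrent_users: int,
                     avg_visit_min: float,
                     busy_hours: float = 24.0) -> float:
    """Upper bound on daily visits if every slot is occupied for busy_hours hours."""
    if avg_visit_min <= 0:
        raise ValueError("avg_visit_min must be positive")
    # Each concurrent slot can serve one visit per avg_visit_min minutes.
    visits_per_slot = (busy_hours * 60) / avg_visit_min
    return concurrent_users * visits_per_slot


CONCURRENT_USERS = 50_000  # from the text above

# Theoretical ceiling: fully loaded around the clock.
print(f"{max_daily_visits(CONCURRENT_USERS, 30):,.0f}")  # 30-minute visits
print(f"{max_daily_visits(CONCURRENT_USERS, 20):,.0f}")  # 20-minute visits

# Assumed scenario: full load for only 8 hours a day, 30-minute visits.
print(f"{max_daily_visits(CONCURRENT_USERS, 30, busy_hours=8):,.0f}")
```

Under these assumptions the around-the-clock ceiling is 2.4 million visits at 30 minutes each, and 800,000 visits corresponds to roughly 8 hours a day at full load. Since traffic is never spread evenly across the day, a figure well below the theoretical ceiling is the more realistic one to plan around.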
Now, to be clear, there will likely be times when even this increased capacity is insufficient to handle peaks in simultaneous demand. So to prepare for those times when spikes in user volume outstrip the system’s expanded capacity, we will deploy a new queuing system to serve consumers in an orderly fashion. It will allow consumers to request email notifications when it’s a better time to come back to the site.
So, stepping back, we’ve made significant progress in improving HealthCare.gov, and achieving a system that runs smoothly for the vast majority of consumers.
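The general idea of such a queue can be sketched in a few lines. This is a hypothetical illustration only, not the design of the actual HealthCare.gov queuing system: the class name, its methods, and the use of an in-memory list as a stand-in for sending email are all assumptions.

```python
from collections import deque


class WaitingRoom:
    """Hypothetical admission queue: admit up to `capacity` users, queue the rest."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.active = set()    # users currently on the site
        self.waiting = deque() # users who asked to be notified, oldest first
        self.notified = []     # stand-in for emails that would be sent

    def arrive(self, email: str) -> bool:
        """Admit the user if there is room; otherwise queue them for a notification."""
        if len(self.active) < self.capacity:
            self.active.add(email)
            return True
        self.waiting.append(email)
        return False

    def leave(self, email: str) -> None:
        """Free a slot and notify the longest-waiting user that it is a better time."""
        self.active.discard(email)
        if self.waiting and len(self.active) < self.capacity:
            self.notified.append(self.waiting.popleft())


room = WaitingRoom(capacity=2)
room.arrive("a@example.com")
room.arrive("b@example.com")
room.arrive("c@example.com")  # site is full, so this user is queued
room.leave("a@example.com")   # a slot opens and the queued user is notified
print(room.notified)
```

In this sketch a notified user is not admitted automatically; they come back and call `arrive` again, which mirrors the "come back at a better time" behavior described above.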
This progress is summarized on page 8 of the report.
Response times are under 1 second.
Error rates are down well under 1%.
And the system is stable, with uptimes exceeding 90%.
We now have a rapid response team and continuous monitoring in place to ensure optimal system performance and to respond quickly to glitches or other issues that crop up.
All of which means the site has the ability to serve 50,000 concurrent users and support 800,000 consumer visits a day as consumers seek information, fill out applications, shop and enroll.
As with any website, the team will continue to address additional bugs and glitches and will continuously evaluate emerging infrastructure needs.
The general contractor and rapid response team have served us well, enabling us to execute with private sector speed and focus, both now and for the long term.
While we still have work to do, we’ve made significant progress with HealthCare.gov working smoothly for the vast majority of consumers.