Thursday, February 21, 2008

Deployment: February 21, 2008—Backout!

After a [mostly] successful deployment, some of the back-end systems that we depend on went crazy, and the system was down for most of the morning. The decision was made to roll back everything that had gone in during the night. (That included not just our application, but numerous other applications that had also been deployed.)

12:00 (noon)–12:30 (30 minutes): We opened the bridge and waited for people to join. Since this was an ad-hoc bridge, opened at the last minute, it took longer than usual to round everyone up and get them on. (Some of them may still have been sleeping, since we were exactly twelve hours after the original deployment.) Unfortunately, we weren’t able to get the DBA who had done the original deployment, but we had two other DBAs helping us.

12:30–12:45 (15 minutes): We shut down the application servers for both applications, and rolled back some of the configuration changes that had been made for Application 1. (See the description of the original deployment for what we were deploying, and for a mention of “Application 1” vs. “Application 2.”)

12:45: At the last minute, the DBA who had done our original deployment showed up. This gave me a sense of relief, because it’s always easier for someone to back out her own work than for someone else to do it—even with good, detailed instructions.

12:45–1:20 (35 minutes): The DBAs rolled back the database changes. This is why the backups taken during the deployment were so crucial; the DBAs simply deleted the appropriate schemas, and re-imported the dump file taken during the backup. The DBA who had done our original deployment restored one of the databases, and, to shorten the outage, another of the DBAs restored the other one at the same time. (All this time, the application was down, and users were unable to access it, so every second mattered.)

1:20–1:45 (25 minutes): The old version of Application 1 was re-deployed.

1:45–3:00 (1 hour and 15 minutes): Sanity Testing commenced for Application 1, and failed; the application didn’t come up correctly. We believed it to be a problem with the deployment procedure, so the application was un-deployed and re-deployed. Unfortunately, the problem remained. It was eventually determined that the application had been deployed successfully, but that we had a problem with a tool called Java Web Start. The Admin Tool for the application uses Java Web Start, and because we had been running the new version of the Admin Tool, JWS didn’t re-download the old one.

As of 3:00, we were rolled back to our previous state—with no idea when we would next be attempting the deployment.
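
As a rough check on the timeline above, the phases can be tallied in a few lines of Python. This is only a sketch: the phase labels are my own informal shorthand, and the times are the ones listed above, written on a 24-hour clock.

```python
from datetime import datetime

# Phase boundaries as given in the timeline above (24-hour clock).
# The labels are informal shorthand, not official names.
PHASES = [
    ("open bridge", "12:00", "12:30"),
    ("shut down servers, roll back config", "12:30", "12:45"),
    ("restore databases", "12:45", "13:20"),
    ("re-deploy old Application 1", "13:20", "13:45"),
    ("sanity testing and troubleshooting", "13:45", "15:00"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Duration of each phase, in the order the phases occurred.
durations = {name: minutes_between(s, e) for name, s, e in PHASES}
total = sum(durations.values())

for name, mins in durations.items():
    print(f"{name:40s} {mins:3d} min")
print(f"{'total':40s} {total:3d} min")
```

The per-phase figures match the ones quoted in each entry, and the total comes to 180 minutes: three hours on the bridge, with testing and troubleshooting taking the largest share.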
