Thursday, February 21, 2008

Deployment: February 21, 2008—Backout!

After a [mostly] successful deployment, some of the back-end systems that we depend on went crazy. The system was down for most of the morning. It was decided to roll back everything that went in during the night. (That included not just our application, but numerous other applications that had also deployed.)

12:00 (noon)–12:30 (30 minutes): We opened the bridge, and waited for people to join. Since this was an ad-hoc bridge, being opened at the last minute, it took longer than usual to round everyone up, and get them on the bridge. (Some of them may still have been sleeping, since we were exactly twelve hours after the original deployment.) Unfortunately, we weren’t able to get the DBA who had done the original deployment, but we had two other DBAs helping us.

12:30–12:45 (15 minutes): We shut down the application servers for both applications, and rolled back some of the configuration changes necessary for Application 1. (See the description of the original deployment, for what we were deploying, and a mention of “Application 1” vs. “Application 2.”)

12:45: At the last minute, the DBA who had done our original deployment showed up. This gave me a sense of relief, because it’s always easier for someone to back out her own work than for someone else to do it—even with good, detailed instructions.

12:45–1:20 (35 minutes): The DBAs rolled back the database changes. This is why the backups taken during the deployment were so crucial; they simply deleted the appropriate schemas, and re-imported the dump file taken during the backup. The DBA who had done our original deployment restored one of the databases, and, to shorten the outage time, another of the DBAs restored the other one. (All this time, the application was down, and users were unable to access it, so every second mattered.)
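
(For the curious: the restore was basically a schema-level import of the dump file from the backup step. Here's a rough sketch of the idea, wrapped in a small Java helper so it's runnable as-is; it assumes Oracle's classic imp utility, and the connect string, schema name, and dump file name are made-up placeholders, not our real ones. The DBAs had, of course, already cleaned out the existing schema objects before running the import.)

    import java.io.IOException;

    // Sketch only: placeholder credentials, schema, and file name, and Oracle's
    // classic "imp" utility assumed. The existing schema objects are presumed to
    // have been dropped already (that was the DBA's first step).
    public class RestoreSchema {
        public static void main(String[] args) throws IOException, InterruptedException {
            ProcessBuilder imp = new ProcessBuilder(
                    "imp", "system/********@PROD",   // DBA connect string (placeholder)
                    "file=app1_predeploy.dmp",       // the dump taken during the deployment backup
                    "fromuser=APP1_OWNER",           // schema as it was exported
                    "touser=APP1_OWNER",             // schema to import back into
                    "ignore=y");                     // don't abort on objects that already exist
            imp.inheritIO();                         // show imp's output on our console
            int rc = imp.start().waitFor();
            System.out.println("imp finished with exit code " + rc);
        }
    }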

1:20–1:45 (25 minutes): The old version of Application 1 was re-deployed.

1:45–3:00 (1 hour and 15 minutes): Sanity Testing commenced for Application 1, and failed; the application didn’t come up correctly. It was believed to be a problem with the deployment procedure, so the application was un-deployed and re-deployed. Unfortunately, the problem remained. It was eventually determined that the application had actually been deployed successfully; the real problem was with a tool called Java Web Start. The Admin Tool for the application uses Java Web Start, and because clients had already cached the new version of the Admin Tool, JWS didn’t re-download the old version when we rolled back.

As of 3:00, we were rolled back to our previous state—with no idea when we would next be attempting the deployment.

Deployment: February 21, 2008

This was a problematic deployment before we even got started. First of all, it was a full moon—and not just any full moon, a blood-red full moon. (Isn’t it a bad omen to be doing a deployment when the moon is turning to blood?)

But that wasn’t what I was referring to. It was a problematic deployment because we were dependent on three separate systems for this deployment, and two of those systems were themselves dependent on a number of other systems being deployed.


There was a lot of confusion about which system had to be deployed before which other system, or which systems had to be installed on the same night, etc. Unfortunately, one of the systems (one of the ones two or three steps removed from my own) had issues, so they weren’t going to be able to deploy before us. It was decided that we could deploy anyway, though; our system would be able to gracefully handle the missing back-end system, simply displaying an error message at the appropriate spot, and when they were able to deploy the other system, our system would just magically start working. (Unfortunately, it meant that I’d still have to take part in the other deployment, to do some testing, which meant that I’d have another sleepless night coming up.)

Also, we ourselves were deploying not one, but two applications. I’ll call them Application 1 and Application 2.

Once we finally got to the point where we were ready to deploy, here’s how it went down.

12:00 midnight: We all logged onto the conference bridge, and confirmed that we were ready to go.

12:05–12:10 (5 minutes): We shut down the application servers for both Application 1 and Application 2. To save time, we also made some necessary configuration changes for Application 1 during this window. (It was a last-minute change to the procedure; we realized that if we’d waited until the point where it was scheduled, it would have meant an extra reboot in the process. As much planning as you try to do ahead of time, you just can’t control when you’ll have a good idea.)

12:10–12:30 (20 minutes): We backed up the back-end databases that we depend on. Since we had to make some changes to the schemas, it’s always a good idea to back them up first, in case you have to roll back the deployment.
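
(Concretely, “backing up” here meant taking a dump of each schema we were about to modify. Here’s a rough sketch of that step, assuming Oracle’s classic exp utility; the credentials, schema name, and file name are placeholders, not our real ones.)

    import java.io.IOException;

    // Sketch only: schema-level export to a dump file, using placeholder names
    // and Oracle's classic "exp" utility.
    public class BackupSchema {
        public static void main(String[] args) throws IOException, InterruptedException {
            ProcessBuilder exp = new ProcessBuilder(
                    "exp", "system/********@PROD",   // DBA connect string (placeholder)
                    "owner=APP1_OWNER",              // export everything owned by this schema
                    "file=app1_predeploy.dmp");      // the dump file we'd restore from, if needed
            exp.inheritIO();
            System.out.println("exp finished with exit code " + exp.start().waitFor());
        }
    }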

12:25–1:50 (1 hour 25 minutes): We made the appropriate modifications to the database for Application 1. (The backup of this database finished before the other one, which is why this step starts before the end of the previous step.) Unfortunately, while the database scripts were executing, the database ran out of space in the “temp” tablespace. So the DBA had to page someone who does more low-level support of the database, to get the issue resolved, before the script could be run.
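
(In hindsight, a quick pre-flight check on the temp tablespace might have saved us that page-out. Something along these lines, as a sketch only: it assumes an Oracle database and its v$temp_space_header view, the JDBC URL and credentials are placeholders, and you’d need the Oracle JDBC driver on the classpath.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Sketch of a pre-flight check: how much temp space is used/free right now?
    // Placeholder JDBC URL and credentials; assumes Oracle's v$temp_space_header view.
    public class CheckTempSpace {
        public static void main(String[] args) throws SQLException {
            String url = "jdbc:oracle:thin:@dbhost:1521:PROD";
            try (Connection conn = DriverManager.getConnection(url, "monitor", "********");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT tablespace_name, "
                       + "       SUM(bytes_used) / 1024 / 1024 AS used_mb, "
                       + "       SUM(bytes_free) / 1024 / 1024 AS free_mb "
                       + "FROM v$temp_space_header "
                       + "GROUP BY tablespace_name")) {
                while (rs.next()) {
                    System.out.printf("%s: %.0f MB used, %.0f MB free%n",
                            rs.getString("tablespace_name"),
                            rs.getDouble("used_mb"),
                            rs.getDouble("free_mb"));
                }
            }
        }
    }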

12:30–12:50 (20 minutes): We made the modifications to the database for Application 2. Again, you’ll notice that this step overlaps with the previous one; because the first change involved a long-running DB script, the DBA was able to make the modifications on the second database, while she waited for the long-running script to finish. There were a number of things happening at the same time, but I trusted the DBA to know how much she could do at once, and how much had to be done sequentially.

1:50–2:10 (20 minutes): We brought back up one of the application servers for Application 1, and deployed the new version of the application.

2:10–2:20 (10 minutes): One more reboot, to be safe, after the deployment of Application 1.

2:20–2:25 (5 minutes): We did our Sanity Test of Application 1, and everything looked good.

2:25–2:35 (10 minutes): We brought back up the application servers for Application 2, and deployed the new version of the application.

2:35–3:10 (35 minutes): We did our first round of Sanity Tests for Application 2. Unfortunately, one of the back-end systems (one of the ones that was supposed to be up and running) was down, because of some emergency maintenance. (Replacement of a network card, or something.) Nobody bothered to inform us that this was happening, so we were taken by surprise when we ran our test and it didn’t work. It was decided to Sanity Test as much as we could, ignoring that part of the application, and then do the Landing Tests (still ignoring that part of the application). When it came back up, we’d revisit Sanity Testing, to verify the functionality.

3:10–5:30 (2 hours and 20 minutes): We did our Landing Testing, for the business to verify that the functionality was up and running. Because of the back-end system’s outage, they ended up Landing Testing some functionality before the Sanity Test was done. Issues discovered:
  • Character encoding issues with one back-end system that we connect to; French characters were not showing up properly in the UI.
  • A web app that we link to was down during the deployment. (Not a big deal.)
  • We deployed the wrong version of Application 1, and had to re-deploy it. When it was redeployed, it was fine.
At the end of the day, we were only left with the issue regarding French characters. It was decided that the application could be left in production as-is, and we’d have to look into a fix later on, when we had a chance to investigate more closely. (It seems to be related to character encoding issues with some XML being sent between applications. If you’re interested in character encoding issues in XML, I can recommend a great book, with some very handsome faces on the cover…)
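
(To give a flavour of the class of bug, without claiming this is our actual fix: if the sending application writes its XML in the JVM’s default encoding while the prolog declares UTF-8, accented French characters get mangled on the receiving side. A minimal, purely hypothetical sketch of doing it right:)

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;

    // Hypothetical illustration, not our actual code: write the XML with an
    // explicit charset, and make sure the encoding declared in the prolog
    // matches the bytes actually produced.
    public class WriteXmlUtf8 {
        public static void main(String[] args) throws Exception {
            String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                       + "<client><nom>Beno\u00eet C\u00f4t\u00e9</nom></client>\n";

            try (OutputStream out = new FileOutputStream("client.xml");
                 Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
                // With plain "new OutputStreamWriter(out)" the bytes would be in the
                // JVM's default encoding, which may not match the UTF-8 declared in
                // the prolog; that mismatch is how French characters end up garbled
                // on the receiving side.
                writer.write(xml);
            }
        }
    }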