Thursday, February 21, 2008

Deployment: February 21, 2008

This was a problematic deployment before we even got started. First of all, it was a full moon—and not just any full moon, a blood-red full moon. (Isn’t it a bad omen to be doing a deployment when the moon is turning to blood?)

But that wasn’t what I was referring to. It was a problematic deployment because we depended on three separate systems, and two of those systems in turn depended on a number of other systems being deployed.


There was a lot of confusion about which system had to be deployed before which other system, which systems had to be installed on the same night, and so on. Unfortunately, one of the systems (one of the ones two or three steps removed from my system) had issues, so they weren’t going to be able to deploy before us. It was decided that we could deploy anyway, though; our system would be able to gracefully handle the missing back-end system, simply displaying an error message at the appropriate spot, and when they were able to deploy the other system, our system would just magically start working. (Unfortunately, it meant that I’d still have to take part in the other deployment, to do some testing, which meant that I’d have another sleepless night coming up.)
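For the curious, here’s roughly what “gracefully handle the missing back-end system” means in practice. This is only a minimal sketch in Python, not our actual code: the names `BackendUnavailable`, `fetch_account_summary`, and `render_page` are all made up for illustration, and the `backend_up` flag simply stands in for whether the other team has deployed yet.

```python
class BackendUnavailable(Exception):
    """Raised when the (hypothetical) back-end system cannot be reached."""


def fetch_account_summary(backend_up):
    # Stand-in for a real call to the back-end system; the flag simulates
    # whether that system has been deployed yet.
    if not backend_up:
        raise BackendUnavailable("back-end system not deployed yet")
    return "Balance: $100.00"


def render_page(backend_up):
    # Build the page section by section. Only the section that needs the
    # back-end is replaced by an error message; the rest renders normally.
    sections = ["Welcome back!"]
    try:
        sections.append(fetch_account_summary(backend_up))
    except BackendUnavailable:
        sections.append("This information is temporarily unavailable.")
    sections.append("Contact us")
    return sections
```

With the back-end down, `render_page(False)` still returns the full page, with the error message in the one spot that needs the missing system. Once the back-end is up, the same code returns the real data, with no redeployment on our side.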

Also, we ourselves were deploying not one, but two applications. I’ll call them Application 1 and Application 2.

Once we finally got to the point where we were ready to deploy, here’s how it went down.

12:00 midnight: We all logged onto the conference bridge, and confirmed that we were ready to go.

12:05–12:10 (5 minutes): We shut down the application servers for both Application 1 and Application 2. To save time, we also used this window to make some necessary configuration changes for Application 1. (It was a last-minute change to the procedure; we realized that if we’d waited until the step was originally scheduled, it would have meant an extra reboot in the process. As much planning as you try to do ahead of time, you just can’t control when you’ll have a good idea.)

12:10–12:30 (20 minutes): We backed up the back-end databases that we depend on. We had to make some changes to the schemas, and it’s always a good idea to back the databases up first, in case you have to roll back the deployment.

12:25–1:50 (1 hour 25 minutes): We made the appropriate modifications to the database for Application 1. (The backup of this database finished before the other one, which is why this step starts before the end of the previous step.) Unfortunately, while the database scripts were executing, the database ran out of space in the “temp” tablespace. So the DBA had to page someone who does more low-level support of the database to get the issue resolved before the script could run to completion.

12:30–12:50 (20 minutes): We made the modifications to the database for Application 2. Again, you’ll notice that this step overlaps with the previous one; because the first change involved a long-running DB script, the DBA was able to make the modifications on the second database, while she waited for the long-running script to finish. There were a number of things happening at the same time, but I trusted the DBA to know how much she could do at once, and how much had to be done sequentially.

1:50–2:10 (20 minutes): We brought back one of the application servers for Application 1, and deployed the new version of the application.

2:10–2:20 (10 minutes): One more reboot, to be safe, after the deployment of Application 1.

2:20–2:25 (5 minutes): We did our Sanity Test of Application 1, and everything looked good.

2:25–2:35 (10 minutes): We brought back up the application servers for Application 2, and deployed the new version of the application.

2:35–3:10 (35 minutes): We did our first round of Sanity Tests for Application 2. Unfortunately, one of the back-end systems, one that was supposed to be up and running, was down because of some emergency maintenance. (Replacement of a network card, or something.) Nobody bothered to inform us that this was happening, so we were taken by surprise when we ran our test and it didn’t work. It was decided to Sanity Test as much as we could, ignoring that part of the application, and then do the Landing Tests (still ignoring that part of the application). When it came back up, we’d revisit Sanity Testing, to verify the functionality.

3:10–5:30 (2 hours and 20 minutes): We did our Landing Testing, in which the business verifies that the functionality is up and running. Because of the back-end system’s outage, they ended up Landing Testing some functionality before the Sanity Test was done. Issues discovered:
  • Character encoding issues with one back-end system that we connect to; French characters were not showing up properly in the UI.
  • A web app that we link to was down during the deployment. (Not a big deal.)
  • We deployed the wrong version of Application 1, and had to re-deploy it. When it was redeployed, it was fine.
At the end of the day, we were left with only the issue regarding French characters. It was decided that the application could be left in production as-is, and we’d look into a fix later on, when we could examine it more closely. (It seems to be related to character encoding issues with some XML being sent between applications. If you’re interested in character encoding issues in XML, I can recommend a great book, with some very handsome faces on the cover…)
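We hadn’t tracked down the actual cause at this point, so take this as an illustration only. One common way French characters get mangled in XML is that one side sends UTF-8 bytes and the other side decodes them as Latin-1. Here’s a small sketch of that assumed scenario in Python; the sample document and the function names are made up for the example.

```python
import xml.etree.ElementTree as ET

# A made-up XML message containing French characters, as sent on the wire.
XML_TEXT = '<?xml version="1.0" encoding="UTF-8"?><name>Côté</name>'
WIRE_BYTES = XML_TEXT.encode("utf-8")


def decode_wrongly(data):
    # The assumed bug: treating UTF-8 bytes as Latin-1. Each accented
    # character was sent as two bytes, so it turns into two characters.
    return data.decode("latin-1")


def parse_correctly(data):
    # Handing the raw bytes to the parser lets it honor the encoding
    # named in the XML declaration.
    return ET.fromstring(data).text
```

Here `decode_wrongly(WIRE_BYTES)` contains `CÃ´tÃ©` where the name should be, while `parse_correctly(WIRE_BYTES)` gives back `Côté`.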
