Sunday, March 16, 2008

Deployment: March 16, 2008

This should have been a pretty normal deployment. We were introducing a new system, along with a re-deployment of the usual application I work on, and we weren’t expecting it to take too long.

01:30–01:35 (5 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:35–01:40 (5 minutes): We shut down the application servers.

01:40: We decided not to bother with a database backup, because even if we had to back out the deployment, we wouldn’t be restoring the database in this case, so we saved ourselves 30–40 minutes by not backing it up. (This is normally a dangerous decision to make; however, the database changes we were implementing could have gone in standalone, so even if we backed out the application, we wanted those changes to stay in.)

01:40–02:00 (20 minutes): We executed our database scripts.

02:00–02:10 (10 minutes): We restarted the application servers.

02:10–02:15 (5 minutes): We deployed the “new” application, and did a quick sanity test. The test worked, but we realized we had a minor configuration problem. (I’d made a typo in the URL for a web service. Small mistakes can make for big problems…)

02:15–02:40 (25 minutes): We paged the person who could fix the problem, and waited for him to respond. (It took a while because his pager was on vibrate mode, and it took a few pages before it was able to wake him up. That’s a common occurrence when you’re trying to get hold of someone at 2:30 in the morning.)

02:40–02:55 (15 minutes): The person logged on and fixed our problem for us. Still slightly ahead of schedule at this point.

02:55–03:20 (25 minutes): We had another configuration problem, and had to troubleshoot that. (Someone else had made another typo! Apparently it was contagious.) This one required us to re-compile the application. (It was only a configuration change, but still required a re-build.)

03:20–03:45 (25 minutes): I couldn’t believe it, but we had yet another problem. Our new Oracle package seemed to be having issues, even though it had compiled correctly in the database. (Apologies if you didn’t understand that sentence; I couldn’t muster up the motivation to type out a post explaining what an “Oracle package” is, when I wrote this.) We had to get the DBA back on the bridge, to help us fix it. By the time we’d fixed this problem, we were 5 minutes behind schedule.

03:45–03:50 (5 minutes): We deployed the “main” application, which depends on the “new” application. This brought us back ahead of schedule (by 5 minutes).

03:50–04:05 (15 minutes): We performed our sanity testing. Still 5 minutes ahead of schedule.

04:05–05:30 (1 hour and 25 minutes): Landing tests were performed. At this point, I stopped paying attention to the schedule; landing tests always go long with this team, because they do a lot of testing. But since it’s the last thing we do before sign-off, it’s not an issue if this part goes long. That being said, we were only about 10 minutes behind schedule, which is not too shabby.
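
For anyone who wants to check my arithmetic, here is a rough Python sketch that adds up the timeline above. The start and end times come straight from the entries; the step labels and the decision about which steps count as "problem" time (the paging wait and the three fixes) are my own assumptions for illustration, not anything official from the deployment plan.

```python
from datetime import datetime

# (start, end, label, is_problem) for each timed entry in the log above.
# The is_problem flags are an assumption: they mark the steps spent
# waiting on, or fixing, something that went wrong.
STEPS = [
    ("01:30", "01:35", "conference bridge check-in", False),
    ("01:35", "01:40", "shut down application servers", False),
    ("01:40", "02:00", "run database scripts", False),
    ("02:00", "02:10", "restart application servers", False),
    ("02:10", "02:15", "deploy new application, quick sanity test", False),
    ("02:15", "02:40", "page the fixer and wait", True),
    ("02:40", "02:55", "fix first configuration typo", True),
    ("02:55", "03:20", "fix second configuration typo, re-build", True),
    ("03:20", "03:45", "fix Oracle package problem", True),
    ("03:45", "03:50", "deploy main application", False),
    ("03:50", "04:05", "sanity testing", False),
    ("04:05", "05:30", "landing tests", False),
]


def minutes(start, end):
    """Return the whole minutes between two same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def summarize(steps):
    """Return (total minutes, minutes spent on problem steps)."""
    total = sum(minutes(s, e) for s, e, _, _ in steps)
    lost = sum(minutes(s, e) for s, e, _, problem in steps if problem)
    return total, lost


total, lost = summarize(STEPS)
print(f"total: {total} min, problem time: {lost} min")
```

By this count, the whole deployment ran 240 minutes (four hours, 01:30 to 05:30), and 90 of those minutes went to the paging wait and the three fixes.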

At the end of the day, we deployed with one minor bug. (Can we never have a bug-free release?!?) However, it seemed to be a data integrity issue, and the only accounts that exhibited the behaviour were accounts that we’ve used heavily in the past for testing. So it’s quite possible that we won’t actually see the issue “in the wild.” (Famous last words…)
