Sunday, March 16, 2008

Deployment: March 16, 2008

This should have been a pretty normal deployment. We were introducing a new system, along with a re-deployment of the usual application I work on, and we weren’t expecting it to take too long.

01:30–01:35 (5 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:35–01:40 (5 minutes): We shut down the application servers.

01:40: We decided not to bother with a database backup: even if we had to back out the deployment, we wouldn’t be restoring the database in this case, so we saved ourselves 30–40 minutes by not backing it up. (This is normally a dangerous decision to make; however, the database changes we were implementing could have gone in standalone, so even if we backed out the application, we wanted the database changes to stay in.)

01:40–02:00 (20 minutes): We executed our database scripts.

02:00–02:10 (10 minutes): We restarted the application servers.

02:10–02:15 (5 minutes): We deployed the “new” application, and did a quick sanity test. The test worked, but we realized we had a minor configuration problem. (I’d made a typo in the URL for a web service. Small mistakes can make for big problems…)
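A typo in a configured URL is exactly the kind of thing a quick pre-deployment check can catch. The post doesn’t describe any tooling we had, but here’s a minimal sketch, assuming the configuration is a simple key-to-URL mapping (the keys and URLs below are made up):

```python
from urllib.parse import urlparse

def validate_service_urls(config):
    """Return a list of (key, problem) pairs for URLs that fail basic checks."""
    problems = []
    for key, url in config.items():
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            problems.append((key, "unexpected scheme: %r" % parsed.scheme))
        elif not parsed.netloc:
            problems.append((key, "missing host"))
    return problems

# A typo like "htp://" gets flagged before deployment night, not at 02:15:
config = {
    "billing.service.url": "https://billing.example.com/ws",
    "lookup.service.url": "htp://lookup.example.com/ws",  # typo in the scheme
}
print(validate_service_urls(config))
# → [('lookup.service.url', "unexpected scheme: 'htp'")]
```

It wouldn’t catch a URL that’s well-formed but simply wrong, of course, but it’s a cheap first line of defence.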

02:15–02:40 (25 minutes): We paged the person who could fix the problem, and waited for him to respond. (It took a while because his pager was on vibrate mode, and it took a few pages to wake him up. That’s a common occurrence when you’re trying to get hold of someone at 2:30 in the morning.)

02:40–02:55 (15 minutes): The person logged on, and fixed our problem for us. Still slightly ahead of schedule, at this point.

02:55–03:20 (25 minutes): We had another configuration problem, and had to troubleshoot that. (Someone else had made a typo! Apparently it was contagious.) This one required us to re-compile the application, even though it was only a configuration change.

03:20–03:45 (25 minutes): I couldn’t believe it, but we had yet another problem. Our new Oracle package seemed to be having issues, even though it had compiled correctly in the database. (Apologies if you didn’t understand that sentence; I couldn’t muster up the motivation to type out a post explaining what an “Oracle package” is, when I wrote this.) We had to get the DBA back on the bridge, to help us fix it. By the time we’d fixed this problem, we were 5 minutes behind schedule.
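(For the curious: an Oracle package is stored PL/SQL code living in the database, and it can compile cleanly yet later show up as INVALID when something it depends on changes underneath it. I don’t know what checks our DBA actually ran, but a common one is querying the data dictionary for invalid objects. Here’s a sketch with the database interaction stubbed out as plain tuples; the package name is made up.)

```python
# A package can show STATUS = 'VALID' right after compilation, yet flip to
# INVALID when a dependency changes. The usual data-dictionary check is:
#   SELECT object_name, object_type, status
#   FROM   user_objects
#   WHERE  object_type IN ('PACKAGE', 'PACKAGE BODY');
# Given rows shaped like that query's results, this helper flags anything
# not VALID, so you know what needs recompiling (ALTER PACKAGE ... COMPILE).

def invalid_packages(rows):
    """rows: iterable of (object_name, object_type, status) tuples."""
    return [(name, obj_type) for name, obj_type, status in rows
            if status != "VALID"]

rows = [
    ("ACCT_PKG", "PACKAGE", "VALID"),
    ("ACCT_PKG", "PACKAGE BODY", "INVALID"),  # the body needs recompiling
]
print(invalid_packages(rows))
# → [('ACCT_PKG', 'PACKAGE BODY')]
```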

03:45–03:50 (5 minutes): We deployed the “main” application, which depends on the “new” application. This brought us back ahead of schedule (by 5 minutes).

03:50–04:05 (15 minutes): We performed our Sanity Testing. Still 5 minutes ahead of schedule.

04:05–05:30 (1 hour and 25 minutes): Landing Tests were performed. At this point, I stopped paying attention to the schedule; Landing Tests always go long, with this team. They do a lot of testing. But since it’s the last thing we do, before sign-off, it’s not an issue if this part goes long. That being said, though, we were only about 10 minutes behind schedule, which is not too shabby.

At the end of the day, we deployed with one minor bug. (Can we never have a bug-free release?!?) However, it seemed to be a data integrity issue, and the only accounts that exhibited the behaviour are accounts that we’ve used heavily in the past, for testing. So it’s quite possible that we won’t actually see the issue “in the wild.” (Famous last words…)

Monday, March 3, 2008

Deployment: March 3, 2008

We had the conference call, regarding the previous deployment, and it was decided that we were confident enough in the original database scripts to proceed with the deployment, at 20:00. We would be using a different DBA.

20:00: We all logged onto the conference bridge, and confirmed that we were ready to go.

20:00–20:20 (20 minutes): We shut down the application servers for both apps.

20:20–20:30 (10 minutes): We backed up the back-end databases for both applications.

20:30–20:50 (20 minutes): We executed the database scripts for Application 1.

20:50–21:20 (30 minutes): We deployed Application 1.

20:50–21:25 (35 minutes): We executed the database scripts for Application 2, in parallel with the deployment of Application 1.

21:20–21:25 (5 minutes): We did sanity testing of Application 1, which turned out fine. (Only 5 minutes because sanity testing for this particular application is pretty quick.)

21:25–23:20 (1 hour and 55 minutes): The client did their Landing Test of Application 1, and everything looked good.

21:30–21:45 (15 minutes): We deployed Application 2.

21:45–22:00 (15 minutes): We did sanity testing of Application 2, and everything looked good.

22:00–00:40 (2 hours and 40 minutes): The client did their Landing Test of Application 2. There were some minor issues discovered, but none were show-stoppers. They signed off on the deployment, and we all went to bed.

The last time we’d deployed this, one of the back-end systems we depend on went crazy, and we had to back out. Many of us were thinking negative thoughts, worrying about having to do the same thing this time, as our heads hit our pillows…

Sunday, March 2, 2008

Deployment: March 2, 2008

This was a retry of the previous deployment, which was backed out.

For clarity, I’ve decided to start using military notation for the times, so that there’s no confusion between, e.g., 12:00 midnight and 12:00 noon. (I’m still rounding times to the nearest 5 minutes, though.)

00:00–00:10 (10 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

00:10–00:30 (20 minutes): We made some configuration changes to the Application 1 servers, and shut down the application servers for both Application 1 and Application 2.

00:30–01:45 (1 hour and 15 minutes): We backed up the databases, just in case. This took longer than usual, because the DBA had to back up one of the databases to his own machine, instead of doing it on the server. (It’s normally done directly on the server, which avoids the backup having to travel across the network, but he didn’t plan ahead to get the Unix passwords for the boxen in question.)

01:25–03:20 (1 hour and 55 minutes): We made the modifications to the database for Application 1. (This overlapped with the task above, as the database for Application 2 continued to be backed up.) Again, this took much longer than it should have—it took just under an hour and a half, last time—because the DBA had to run the scripts over the network, instead of from the server.

03:20–03:25 (5 minutes): There were issues with the database scripts for Application 1. It turned out that the DBA had modified the scripts before running them, without telling anyone, and the modified scripts ran into issues. Because we were not confident that the database had been left in a valid state, it was decided to roll back, and restore the database from the backup.

03:20–04:30 (1 hour and 10 minutes): During this time, the team had a roundtable conversation to discuss whether we could still put in Application 2, even though Application 1 was rolled back. The problem is that the new version of Application 2 had never been tested against the old version of Application 1. It was decided that the risk was low—the nature of the changes indicated that the two versions should work together—but risk is risk, and untested combinations can leave errors undiscovered. It was finally decided to leave both applications as-is—that is, not to try to deploy Application 2 without Application 1—and revisit the deployment in a day or two.

04:30–05:10 (40 minutes): We brought the applications back up. Luckily, based on the nature of the database changes for Application 1, it was decided that we didn’t need to restore the database; all we had to do was undo some configuration changes, on the application servers, and then bring those servers back online.

04:40–04:50 (10 minutes): The business came onto the bridge, and said that they’d prefer to try the deployment again, rather than roll back. Just to be safe, it was decided to continue with the rollback, and do a quick sanity test, so that there would be some version of the application online, while the decision was made. We also needed to chase the DBA team, and find a new DBA to perform the work.

05:10–05:20 (10 minutes): We all did a sanity/regression test, to verify that the applications were back up and running. They were fine.

05:20–05:30 (10 minutes): Dead air, as we waited for the DBA team to get back to us.

05:30–06:30 (1 hour): Another DBA joined the conference call, and was walked through the existing dilemma. He asked for some time to look over the situation—including what was changed, and why. Presumably, the results of his investigation would give the rest of us the warm-and-fuzzy feeling that we could proceed to try again.

The result of this discussion was that the DBA(s) would try to recreate the errors we got, to prove that it was the modified scripts that caused the problems. We would reconvene at noon, to go over the findings, and talk about rescheduling the deployment.