Saturday, September 20, 2008

[Mini] Deployment: September 20, 2008

This was actually just a simple database script, to update one of our data tables. No structural changes, just data. But someone decided to treat it like a full deployment anyway, complete with the conference call and everything. So…

01:30–01:37 (7 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go. It took a bit longer than we’d expected, because we weren’t sure if one particular person was going to join or not.

01:37–01:37 (about 30 seconds): We backed up the database.

01:37–01:41 (4 minutes): We executed the DB scripts.

01:41–01:48 (7 minutes): We did our Sanity Test—mostly just checking the logs to confirm that the scripts had worked, as well as a quick glance at the application, to verify that the data was showing up correctly.

01:43–01:53 (10 minutes): The client performed their Landing Test. (Further tests that the data was showing up correctly in the application…) We let them start before the Sanity Test was done, in this case, since all indications were that things had worked fine, and we were just waiting on an email to show up in my Inbox to confirm that the scripts had run.

Overall, we were done ahead of schedule, and everything worked well. Total deployment: 01:30–01:53 (23 minutes).

Thursday, September 11, 2008

Deployment: September 11, 2008

This was a bug fix deployment, to fix the issue discovered with the previous deployment.

01:30–01:34 (4 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:34–02:01 (27 minutes): We backed up the database, and executed the database scripts. This took longer than usual, because the DBA accidentally ran the scripts on the wrong database, the first time.

01:34–01:56 (22 minutes): We un-deployed the old version of the application, and shut down the application servers.

02:06–02:19 (13 minutes): We deployed the new version(s) of the application(s). (Some were actually still deployed from the last time, and just had to be re-activated, while others—the ones with bug fixes—actually had to be redeployed.)

02:19–03:04 (45 minutes): We conducted the Sanity Test. During Sanity, we discovered an issue unrelated to our deployment—one that had been part of the application for the previous four months.

03:00–05:23 (2 hours and 23 minutes): The client did their Landing Tests. (This was overlapped with the Sanity Test a bit.) Some additional issues were discovered, but they were all deemed minor enough that we didn’t need to back out. (The most serious one turned out to not be a bug with our application, but with a back-end system that we depend on. So the bug fix will be for them, not for us.)

Overall deployment: 01:30–05:23 (3 hours and 53 minutes).

Deployment: September 8, 2008

No timeline with this one.

There had been an issue with the previous deployment. Every once in a while, CPU usage on the application servers was climbing close to 100%, and the Support Team had to reboot them to keep the application alive.

Unfortunately, rolling back wasn’t working; we tried un-deploying the application, and re-deploying the old version, but the new version seemed to be cached on the application servers. So we had to call the application server vendor, and get help. (It turned out we had to undeploy, remove the temp directory, and then re-deploy the old version.)
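For anyone curious what that fix actually involves, here's a rough sketch of the undeploy / clear-temp-directory / redeploy sequence. The admin command, paths, and file names below are invented for illustration (every application server's tooling is different), so treat this as the shape of the procedure rather than the actual commands we ran.

    import shutil
    import subprocess

    # Illustrative names only: the real admin CLI, temp directory, and
    # archive locations depend entirely on the application server product.
    APP_NAME = "ourapp"
    SERVER_TEMP_DIR = "/opt/appserver/temp"
    OLD_VERSION_ARCHIVE = "/releases/ourapp-previous.ear"

    def redeploy_old_version():
        # 1. Undeploy the new (misbehaving) version.
        subprocess.run(["appserver-admin", "undeploy", APP_NAME], check=True)

        # 2. Remove the server's temp directory, so the cached copy of
        #    the new version can't be picked up on restart.
        shutil.rmtree(SERVER_TEMP_DIR, ignore_errors=True)

        # 3. Deploy the old version again.
        subprocess.run(["appserver-admin", "deploy", OLD_VERSION_ARCHIVE],
                       check=True)

    if __name__ == "__main__":
        redeploy_old_version()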

Sunday, September 7, 2008

Deployment: September 7, 2008

This was a slightly more intricate deployment than usual, because there were a few moving parts. We had not one but two sets of database changes to go through, plus three separate applications to deploy and/or re-deploy on the application servers. Plus, because of our new “global delivery model,” we had a DBA in India making our database changes for the first time. So I was hoping it wouldn’t be a bad deployment, but I was a bit worried.

00:00: As I was getting ready for the deployment, I logged into my computer, and noticed an email from the DBA in India, dated twenty-four hours prior to the deployment, saying that he wasn’t able to join the bridge, but would I please call his cell phone when we were ready for him to start. Not a good sign. I looked in the database, and found that the changes had already been executed (presumably twenty-four hours earlier than they were supposed to be done). This could have been very bad, but we got lucky: the database changes didn’t break the existing application, and the old version of the app had been running with them for the last twenty-four hours.

01:30–01:50 (20 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go. This took much longer than usual, because we had to confirm with the DBA what he had actually done, and what he hadn’t. It turned out to be even more confusing than I’d thought: out of the two DB changes we needed, one was partially done, and one was not done at all.

01:50–01:52 (2 minutes): We shut down the application servers.

01:52–02:21 (29 minutes): We had the DBA finish the database changes.

02:21–02:24 (3 minutes): We brought the application servers back up.

02:24–02:42 (18 minutes): We deployed the updated and new applications.

02:42–05:00 (2 hours 18 minutes): We did our Sanity Testing. We found an issue with one piece of functionality, and determined that it might have been due to a back-end system, not ours. We spent some time troubleshooting it, and ended up not being sure which system was the culprit.

03:40–05:00 (1 hour 20 minutes): The client did their Landing Tests. This was overlapped with the troubleshooting we did, for the functionality that wasn’t working.

05:00–05:30 (30 minutes): One of the back-end systems that we depend on went down for scheduled maintenance, and we had to wait for it to come back up before we could go back to our testing.

05:30–06:20 (50 minutes): We finished testing, and then called it a wrap. There was still an outstanding issue—the one we weren’t able to troubleshoot—but it wasn’t a show-stopper.

Overall deployment: 01:30–06:20 (4 hours and 50 minutes).

Sunday, July 20, 2008

Deployment: July 20, 2008

If all went well, this would be the last deployment for a long time. We were pretty sure that we’d fixed all of our issues, and this release was simply to “turn on” all of the functionality for all users—now that it was working, we’d let them all use it.

01:30–01:35 (5 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:35–01:44 (9 minutes): We deployed the application.

01:44–02:19 (35 minutes): We Sanity Tested the application.

02:12–02:31 (19 minutes): The users did their Landing Tests. This was overlapped a bit with the Sanity Testing, because the person doing Sanity had to reboot her computer halfway through the process. However, based on the nature of the changes—i.e. that there were no code changes, just a configuration change—we saw very little danger of Sanity failing, so we felt safe in letting the Landing Test start before Sanity was completely finished.

Everything worked fine.

Overall deployment: 01:30–02:31 (1 hour and 1 minute).

Friday, July 11, 2008

Deployment: July 11, 2008

This was sort of a continuation of the previous deployment. If all went well, we’d be able to kiss our recent troubles goodbye, and get back on to a regular schedule, without so many deployments all the time.

01:30–01:33 (3 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:33–01:40 (7 minutes): We made some configuration changes to some of the servers. (There was no “deployment” this time, per se, just configuration changes. But it still caused an outage, and so had to be controlled.)

01:40–01:55 (15 minutes): We performed our Sanity Testing. Everything was fine. Normally Landing Tests would follow, but in this case, since there were no changes to the application itself, it wasn’t deemed necessary.

01:55–01:58 (3 minutes): We verified with the user base that everything was back up and running.

Overall deployment: 01:30–01:58 (28 minutes).

Tuesday, July 8, 2008

Deployment: July 7, 2008

This was partially a bug-fix deployment, but it was also a configuration deployment. We had a number of configuration changes to make to our web servers, in preparation for some future changes to our network/physical architecture.

22:00–22:01 (1 minute): We all logged onto the conference bridge, and confirmed that we were ready to go.

22:01–22:28 (27 minutes): We made the appropriate configuration changes to our web servers.

22:28–22:38 (10 minutes): We downgraded the logging level on our application servers. Because of the recent troubleshooting we’d had to do, we’d increased the logging greatly, but we now felt it was time to decrease it, and stop chewing up so much disk space with our logs.
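The real change was just a setting in the application servers’ logging configuration, but the idea is the same as raising a log threshold anywhere else. A tiny Python analogy (not our actual stack) of what “downgrading the logging level” means:

    import logging

    logging.basicConfig(level=logging.DEBUG)   # troubleshooting mode: very chatty
    logger = logging.getLogger("app")

    # Back to normal: raise the threshold so DEBUG chatter no longer
    # reaches the (disk-hungry) log files.
    logger.setLevel(logging.WARNING)

    logger.debug("No longer written to the logs.")
    logger.warning("Still written to the logs.")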

22:38–22:50 (12 minutes): We deployed the application.

22:50–23:11 (21 minutes): We performed our Sanity Testing. Everything was fine.

23:11–23:18 (7 minutes): The users did their Landing Test. Everything was fine. (It’s not usual for the Sanity Test to take longer than the Landing Test, but I guess the users had their testing down to a science, after so many deployments lately—many of which I missed, since I was on vacation.)

23:18–23:22 (4 minutes): We switched back on everything that had to be switched on, and verified with the user base that everything was completely back up and running.

23:22–23:55 (33 minutes): Just as we had all signed off on the deployment, and were about to hang up, one of the users claimed that there was an issue. We spent this time re-testing the functionality, but there was no issue. (Error between keyboard and chair.)

Overall deployment: 22:00–23:55 (1 hour and 55 minutes).

Wednesday, June 4, 2008

Deployment: June 3, 2008

Apparently the bug fix deployments spilled over out of May, and into June, because here I was doing another one. All preliminary indications were that this would be another quick one, though.

22:00–22:06 (6 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

22:06–22:57 (51 minutes): We deployed the application, and made some configuration changes required by the vendor of the application server software we use. (To help them debug a problem we’re having.)

22:57–23:23 (26 minutes): We Sanity Tested the application. Everything was fine.

23:23–00:05 (42 minutes): The users did their Landing Tests. Everything was still fine.

Overall deployment: 22:00–00:05 (2 hours and 5 minutes).

Tuesday, May 27, 2008

Deployment: May 27, 2008

Yet another bug fix deployment. (This is why I’d been telling people that May was going to suck—deployments left, right, and centre, for bug fixes. There’s a reason that you should never, ever, ever, rewrite your code base from scratch. But ours is not to question why…) Luckily, we could do this one without any kind of outage, so we got special permission to start earlier than usual: 10:00 PM, instead of the usual 1:30 AM.

22:00–22:05 (5 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

22:05–22:47 (42 minutes): We deployed the application.

22:47–23:13 (26 minutes): We Sanity Tested the application.

23:13–23:50 (37 minutes): The users did their Landing Tests.

Everything worked fine. There were some issues, but minor ones.

Overall deployment: 22:00–23:50 (1 hour and 50 minutes).

Friday, May 23, 2008

Deployment: May 23, 2008

Another bug fix deployment. We weren’t happy about it, because there were some big bugs that we knew still existed, but unfortunately we were having trouble reproducing some of the issues, so it was decided to deploy it into Production, and see if we could get some logs to help us diagnose the problem. That being said, we had fixed a whole bunch of other bugs, so it wasn’t necessarily a complete waste of time. It just irked us, because we knew that we’d be back in a couple of days, doing it again.

01:30–01:34 (4 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:34–01:36 (2 minutes): We shut down the application servers.

01:34–01:43 (9 minutes): We executed our database changes.

01:36–01:40 (4 minutes): We made some configuration changes to our web servers.

01:43–01:55 (12 minutes): We brought the application servers back up, and deployed the application.

01:55–02:12 (17 minutes): We Sanity Tested the application.

02:12–04:06 (1 hour and 54 minutes): The users did their Landing Tests.

Everything worked fine. At least, everything that we expected to work fine worked fine—we still had the issues that we were expecting to have.

Overall deployment: 01:30–04:06 (2 hours and 36 minutes).

Friday, May 16, 2008

Deployment: May 16, 2008

This was a simple “bug fix” or “service pack” deployment, to fix some of the issues discovered during the last deployment. Because there was no DBR this time, we expected the deployment to be quicker than usual, and because there was no new functionality, just bug fixes, we also expected the Landing Tests to be quicker than usual.

Quite a number of steps in this deployment are overlapped with each other. We were doing a lot of things at once, and we had to get some log files while we did the deployment, to troubleshoot some issues that we’d been having. This had nothing to do with the deployment itself, but we wanted to do it while we were there. (There were a lot of log files, so this step is overlapped with much of the rest of the deployment; the log files were gzipped and tarred while everything else was going on.)
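If you're wondering what “gzipped and tarred” amounts to in practice, it's the kind of long-running, I/O-heavy job sketched below, which you can happily leave churning in the background while the rest of the deployment goes on. (The paths and file names are invented for the example.)

    import tarfile
    from pathlib import Path

    # Invented locations; the real servers and log directories differ.
    LOG_DIR = Path("/opt/appserver/logs")
    ARCHIVE = Path("/tmp/prod-logs.tar.gz")

    # Bundle every log file into a single gzipped tar for later troubleshooting.
    with tarfile.open(ARCHIVE, "w:gz") as tar:
        for log_file in sorted(LOG_DIR.glob("*.log*")):
            tar.add(log_file, arcname=log_file.name)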

01:30–01:33 (3 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:33–01:34 (1 minute): We performed a minor configuration change, to turn off the redirection functionality. (See the last deployment for information on that.)

01:34–02:40 (1 hour and 6 minutes): We gathered the log files from the production servers.

01:42–01:52 (10 minutes): We redeployed the old version of the application.

01:52–02:05 (13 minutes): We did our Sanity Testing of the old version of the application.

01:52–02:40 (48 minutes): We adjusted the logging level on the new version of the application. (This wouldn’t normally take this long—it’s a simple configuration change—but again, the person making the change was also busy tarring and gzipping the old log files at the same time. Also, we had to have some discussions about what settings to use.)

02:10–02:15 (5 minutes): The users did their Landing Testing for the old application. (Everything was successful.)

01:52–02:45 (53 minutes): We redeployed the new version of the application. (Again, this took longer than normal because of the overlapping steps above—mostly the gathering of the log files.)

02:45–03:07 (22 minutes): We did our Sanity Testing of the new version of the application.

03:07–03:21 (14 minutes): The users did their Landing Testing for the new version of the application. (Everything was successful.)

Sunday, May 11, 2008

Deployment: May 11, 2008

This was our biggest deployment. Maybe not the most important—release 1 would have qualified as the most important—but with this release, we were taking our application and migrating it over to a completely new technology. That meant a new platform, new servers, new software… new everything. It also meant lots of testing.

00:00–00:11 (11 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

00:11–00:55 (44 minutes): We began the backup of our database, before executing the DB scripts.

00:11–00:33 (22 minutes): Concurrently with the database backup, we began configuring our web servers.

00:33–01:00 (27 minutes): We began some configuration of our new application servers.

00:55–01:30 (35 minutes): We executed the DB scripts.

01:30–01:50 (20 minutes): We deployed the application to the application servers.

01:50–02:20 (30 minutes): We did our Sanity Testing.

02:20–06:40 (4 hours and 20 minutes): The users performed Landing Testing. There were issues, but they were all deemed minor enough that we could leave the deployment in place.

06:40–07:10 (30 minutes): We turned on the first “phase” of users. For this rollout, since it was so big, we actually decided to roll the users out in phases; it’s a web-based application, and all users go through the old application, which decides what “phase” the user is supposed to be on; if the user is in a “current” phase, s/he gets redirected to the new application’s URL. So we had to verify not only that the right users were on the right phase, but also that users on “non active” phases were still getting the old version of the application (i.e. that they were not getting redirected).
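The redirect decision itself lived in the old application, and I'm simplifying, but it boils down to something like the following sketch. The phase data, user names, and URL are all made up for illustration.

    # Which rollout phases have been switched over to the new application.
    ACTIVE_PHASES = {1}
    NEW_APP_URL = "https://newapp.example.com/"

    # Hypothetical lookup of which rollout phase each user belongs to.
    USER_PHASE = {"alice": 1, "bob": 2}

    def route_user(username: str) -> str:
        """Decide whether the old application serves the user or redirects them."""
        phase = USER_PHASE.get(username)
        if phase in ACTIVE_PHASES:
            # User's phase is live: send them to the new application.
            return "redirect to " + NEW_APP_URL
        # Everyone else keeps getting the old application, unredirected.
        return "serve old application"

    print(route_user("alice"))   # redirect to https://newapp.example.com/
    print(route_user("bob"))     # serve old application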

All in all, the deployment went about as I’d expected it to. I knew that there would be a lot of testing to do, and I’d expected us to find some issues—you can’t rewrite your entire code base in a new technology and not expect to find issues, even late in the game—and I was relieved that none of the issues were deemed serious enough to roll back the deployment.

At this point, we had to monitor the application, and see how it handled the load we were throwing at it.

Sunday, May 4, 2008

Deployment: May 5, 2008

May is going to be a rough month—I’m doing deployments three out of four weekends this month. This one was sort of a preparatory deployment; the real functionality would be put into production the next week, but this one was getting ready for it.

01:30–01:33 (3 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:33–01:34 (1 minute): We shut down the application servers.

01:34: We decided that we didn’t need to back up the database, before executing the database scripts, since a back-out of the deployment wouldn’t require backing out the database changes. (The scripts were only creating new objects, which could sit in the database unused in the event of a back-out; no modifications to existing objects were being made.)

01:34–01:38 (4 minutes): We executed the database scripts.

01:35–01:40 (5 minutes): I checked the output logs from the DB scripts, to verify that everything had run successfully.

01:40–01:43 (3 minutes): We restarted the application servers, and deployed one of the two applications we needed to deploy.

01:43–01:44 (1 minute): We deployed the other application.

01:44–02:26 (42 minutes): We did our Sanity Testing. We actually could have finished our Sanity Testing much earlier, but all of the people who needed to do Landing Tests were in their cars on their way to the office, and not ready to do their tests yet, so we just kept testing, until they got in.

02:15–02:36 (21 minutes): We did our Landing Tests. (This actually overlapped with the Sanity Test; the main users who we needed didn’t get into the office until after others did. So we did some Landing Tests, but the really important tests didn’t happen until later in this block of time.) Everything tested fine, which made the deployment a success.

In the end, it actually turned out to be a good thing that the important testing was delayed. It allowed us to get some of the minor testing—which would have been the bulk of time spent—out of the way, and we didn’t have to waste extra time at the end of the deployment.

Sunday, March 16, 2008

Deployment: March 16, 2008

This should have been a pretty normal deployment. We were introducing a new system, along with a re-deployment of the usual application I work on, and we weren’t expecting it to take too long.

01:30–01:35 (5 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:35–01:40 (5 minutes): We shut down the application servers.

01:40: We decided not to bother with a database backup, because even if we had to back out the deployment, we wouldn’t be restoring the database, in this case, so we saved ourselves 30–40 minutes by not backing it up. (This is normally a dangerous decision to make; however, the changes we were implementing in this case could have gone in standalone, so even if we backed out the application, we wanted the database changes to stay in.)

01:40–02:00 (20 minutes): We executed our database scripts.

02:00–02:10 (10 minutes): We restarted the application servers.

02:10–02:15 (5 minutes): We deployed the “new” application, and did a quick sanity test. The test worked, but we realized we had a minor configuration problem. (I’d made a typo in the URL for a web service. Small mistakes can make for big problems…)

02:15–02:40 (25 minutes): We paged the person who could fix the problem, and waited for him to respond. (It took a while because his pager was on vibrate mode, and it took a few pages before it was able to wake him up. That’s a common occurrence when you’re trying to get hold of someone at 2:30 in the morning.)

02:40–02:55 (15 minutes): The person logged on, and fixed our problem for us. Still slightly ahead of schedule, at this point.

02:55–03:20 (25 minutes): We had another configuration problem, and had to troubleshoot that. (Someone else had made another typo! Apparently it was contagious.) This one required us to re-compile the application. (It was only a configuration change, but still required a re-build.)

03:20–03:45 (25 minutes): I couldn’t believe it, but we had yet another problem. Our new Oracle package seemed to be having issues, even though it had compiled correctly in the database. (Apologies if you didn’t understand that sentence; I couldn’t muster up the motivation to type out a post explaining what an “Oracle package” is, when I wrote this.) We had to get the DBA back on the bridge, to help us fix it. By the time we’d fixed this problem, we were 5 minutes behind schedule.

03:45–03:50 (5 minutes): We deployed the “main” application, which depends on the “new” application. This brought us back ahead of schedule (by 5 minutes).

03:50–04:05 (15 minutes): We performed our Sanity Testing. Still 5 minutes ahead of schedule.

04:05–05:30 (1 hour and 25 minutes): Landing tests were performed. At this point, I stopped paying attention to the schedule; Landing Tests always go long, with this team. They do a lot of testing. But since it’s the last thing we do, before sign-off, it’s not an issue if this part goes long. That being said, though, we were only about 10 minutes behind schedule, which is not too shabby.

At the end of the day, we deployed with one minor bug. (Can we never have a bug-free release?!?) However, it seemed to be a data integrity issue, and the only accounts that exhibited the behaviour are accounts that we’ve used heavily in the past, for testing. So it’s quite possible that we won’t actually see the issue “in the wild.” (Famous last words…)

Monday, March 3, 2008

Deployment: March 3, 2008

We had the conference call, regarding the previous deployment, and it was decided that we were confident enough in the original database scripts to proceed with the deployment, at 20:00. We would be using a different DBA.

20:00: We all logged onto the conference bridge, and confirmed that we were ready to go.

20:00–20:20 (20 minutes): We shut down the application servers for both apps.

20:20–20:30 (10 minutes): We backed up the back-end databases for both applications.

20:30–20:50 (20 minutes): We executed the database scripts for Application 1.

20:50–21:20 (30 minutes): We deployed Application 1.

20:50–21:25 (35 minutes): We executed the database scripts for Application 2, in parallel with the deployment of Application 1.

21:20–21:25 (5 minutes): We did sanity testing of Application 1, which turned out fine. (Only 5 minutes because sanity testing for this particular application is pretty quick.)

21:25–23:20 (1 hour and 55 minutes): The client did their Landing Test of Application 1, and everything looked good.

21:30–21:45 (15 minutes): We deployed Application 2.

21:45–22:00 (15 minutes): We did sanity testing of Application 2, and everything looked good.

22:00–00:40 (2 hours and 40 minutes): The client did their Landing Test of Application 2. There were some minor issues discovered, but none were show-stoppers. They signed off on the deployment, and we all went to bed.

The last time we’d deployed this, one of the back-end systems we depend on went crazy, and we had to back out. Many of us were thinking negative thoughts, worrying about having to do the same thing this time, as our heads hit our pillows…

Sunday, March 2, 2008

Deployment: March 2, 2008

This was a retry of the previous deployment, which was backed out.

For clarity, I’ve decided to start using military notation for the times, so that there’s no confusion between, e.g., 12:00 midnight and 12:00 noon. (I’m still rounding times to the nearest 5 minutes, though, for simplicity.)

00:00–00:10 (10 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

00:10–00:30 (20 minutes): We made some configuration changes to the Application 1 servers, and shut down the application servers for both Application 1 and Application 2.

00:30–01:45 (1 hour and 15 minutes): We backed up the databases, just in case. This took longer than usual, because the DBA had to back up one of the databases to his own machine, instead of doing it on the server. (It’s normally done directly on the server, which avoids the backup having to travel across the network, but he didn’t plan ahead to get the Unix passwords for the boxen in question.)

01:25–03:20 (1 hour and 55 minutes): We made the modifications to the database for Application 1. (This overlapped with the task above, as the database for Application 2 continued to be backed up.) Again, this took much longer than it should have—it took just under an hour and a half, last time—because the DBA had to run the scripts over the network, instead of from the server.

03:20–03:25 (5 minutes): There were issues with the database scripts, for Application 1. It turned out that the DBA had modified the scripts, without telling anyone, before running them, and the modified scripts ran into issues. Because we were not confident that the database was left in a valid state, it was decided to roll back, and restore the database from the backup.

03:20–04:30 (1 hour and 10 minutes): During this time, the team had a roundtable conversation to discuss whether we could still put in Application 2, even though Application 1 was rolled back. The problem was that the new version of Application 2 had never been tested against the old version of Application 1. It was decided that the risk was low—the nature of the changes being made indicated that the different versions of the applications should work together—but risk is risk, and if they weren’t tested together, there was a chance of undiscovered errors. It was finally decided to leave both applications as-is—that is, not try to deploy Application 2 without Application 1—and revisit the deployment in a day or two.

04:30–05:10 (40 minutes): We brought the applications back up. Luckily, based on the nature of the database changes for Application 1, it was decided that we didn’t need to restore the database; all we had to do was undo some configuration changes, on the application servers, and then bring those servers back online.

04:40–04:50 (10 minutes): The business came onto the bridge, and said that they’d prefer to try the deployment again, rather than roll back. Just to be safe, it was decided to continue with the rollback, and do a quick sanity test, so that there would be some version of the application online, while the decision was made. We also needed to chase the DBA team, and find a new DBA to perform the work.

05:10–05:20 (10 minutes): We all did a sanity/regression test, to verify that the applications were back up and running. They were fine.

05:20–05:30 (10 minutes): Dead air, as we waited for the DBA team to get back to us.

05:30–06:30 (1 hour): Another DBA joined the conference call, and was walked through the existing dilemma. He asked for some time to look over the situation—including what was changed, and why. Presumably, the results of his investigation would give the rest of us the warm and fuzzy feeling that we could proceed to try again.

The result of this discussion was that the DBA(s) would try to recreate the errors we got, to prove that it was the modified scripts that caused the problems. We would reconvene at noon, to go over the findings, and talk about rescheduling the deployment.

Thursday, February 21, 2008

Deployment: February 21, 2008—Backout!

After a [mostly] successful deployment, some of the back-end systems that we depend on went crazy. The system was down for most of the morning. It was decided to roll back everything that went in during the night. (That included not just our application, but numerous other applications that had also deployed.)

12:00 (noon)–12:30 (30 minutes): We opened the bridge, and waited for people to join. Since this was an ad-hoc bridge, being opened at the last minute, it took longer than usual to round everyone up, and get them on the bridge. (Some of them may still have been sleeping, since we were exactly twelve hours after the original deployment.) Unfortunately, we weren’t able to get the DBA who had done the original deployment, but we had two other DBAs helping us.

12:30–12:45 (15 minutes): We shut down the application servers for both applications, and rolled back some of the configuration changes necessary for Application 1. (See the description of the original deployment, for what we were deploying, and a mention of “Application 1” vs. “Application 2.”)

12:45: At the last minute, the DBA who had done our original deployment showed up. This gave me a sense of relief, because it’s always easier for someone to back out her own work than for someone else to do it—even with good, detailed instructions.

12:45–1:20 (35 minutes): The DBAs rolled back the database changes. This is why the backups taken during the deployment were so crucial; they simply deleted the appropriate schemas, and re-imported the dump file taken during the backup. The DBA who had done our original deployment restored one of the databases, and, to shorten the outage time, another of the DBAs restored the other one. (All this time, the application was down, and users were unable to access it, so every second mattered.)
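For the curious, the restore pattern is roughly what's sketched below: drop and recreate the schema, then re-import the pre-deployment dump. Every name in the sketch (connect string, schema, password, dump file) is invented, and the exact commands depend on the Oracle version and on how the export was taken, so it's only the shape of the procedure.

    import subprocess

    # Invented names; adjust for the real database, schema, and backup file.
    DBA_CONNECT = "system/secret@PRODDB"
    SCHEMA = "APPUSER"
    DUMP_FILE = "/backups/appuser_predeploy.dmp"

    RESET_SCHEMA_SQL = f"""
    DROP USER {SCHEMA} CASCADE;
    CREATE USER {SCHEMA} IDENTIFIED BY changeme DEFAULT TABLESPACE users;
    GRANT CONNECT, RESOURCE TO {SCHEMA};
    EXIT
    """

    # 1. Wipe out whatever the deployment scripts changed by dropping and
    #    recreating the schema.
    subprocess.run(["sqlplus", "-S", DBA_CONNECT],
                   input=RESET_SCHEMA_SQL, text=True, check=True)

    # 2. Re-import the pre-deployment dump into the fresh schema.
    subprocess.run(["imp", DBA_CONNECT, f"file={DUMP_FILE}",
                    f"fromuser={SCHEMA}", f"touser={SCHEMA}"], check=True)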

1:20–1:45 (25 minutes): The old version of Application 1 was re-deployed.

1:45–3:00 (1 hour and 15 minutes): Sanity Testing commenced for Application 1, and failed; the application didn’t come up correctly. It was believed to be a problem with the deployment procedure, so it was un-deployed, and re-deployed again. Unfortunately, the problem still remained. It was eventually determined that the application was deployed successfully, but we had a problem with a tool called Java Web Start—the Admin Tool for the application uses Java Web Start, and because we were using a new version of the Admin Tool, JWS didn’t re-download the old version.

As of 3:00, we were rolled back to our previous state—with no idea when we would next be attempting the deployment.

Deployment: February 21, 2008

This was a problematic deployment before we even got started. First of all, it was a full moon—and not just any full moon, a blood-red full moon. (Isn’t it a bad omen to be doing a deployment when the moon is turning to blood?)

But that wasn’t what I was referring to. It was a problematic deployment because we were dependent on three separate systems, for the deployment, and two of those systems were themselves dependent on a number of other systems being deployed.

There was a lot of confusion about which system had to be deployed before which other system, or which systems had to be installed on the same night, etc. Unfortunately, one of the systems—one of the ones that were two or three steps removed from my system—had issues, so they weren’t going to be able to deploy before us. It was decided that we could deploy anyway, though; our system would be able to gracefully handle the missing back-end system, simply displaying an error message at the appropriate spot, and when they were able to deploy the other system, our system would just magically start working. (Unfortunately, it meant that I’d still have to take part in the other deployment, to do some testing, which meant that I’d have another sleepless night coming up.)

Also, we ourselves were deploying not one, but two applications. I’ll call them Application 1 and Application 2.

Once we finally got to the point where we were ready to deploy, here’s how it went down.

12:00 midnight: We all logged onto the conference bridge, and confirmed that we were ready to go.

12:05–12:10 (5 minutes): We shut down the application servers for both Application 1 and Application 2. To save time, we also made some necessary configuration changes, during this time, for Application 1. (It was a last-minute change to the procedure; we realized that if we’d waited until it was scheduled, it would have meant an extra reboot in the process. As much planning as you try to do ahead of time, you just can’t control when you’ll have a good idea.)

12:10–12:30 (20 minutes): We backed up the back-end databases that we depend on. Since we had to make some changes to the schemas, it’s always a good idea to back them up, first, in case you have to roll back the deployment.

12:25–1:50 (1 hour 25 minutes): We made the appropriate modifications to the database for Application 1. (The backup of this database finished before the other one, which is why this step starts before the end of the previous step.) Unfortunately, while the database scripts were executing, the database ran out of space in the “temp” tablespace. So the DBA had to page someone who does more low-level support of the database, to get the issue resolved, before the script could be run.

12:30–12:50 (20 minutes): We made the modifications to the database for Application 2. Again, you’ll notice that this step overlaps with the previous one; because the first change involved a long-running DB script, the DBA was able to make the modifications on the second database, while she waited for the long-running script to finish. There were a number of things happening at the same time, but I trusted the DBA to know how much she could do at once, and how much had to be done sequentially.

1:50–2:10 (20 minutes): We brought back up one of the application servers for Application 1, and deployed the new version of the application.

2:10–2:20 (10 minutes): One more reboot, to be safe, after the deployment of Application 1.

2:20–2:25 (5 minutes): We did our Sanity Test of Application 1, and everything looked good.

2:25–2:35 (10 minutes): We brought back up the application servers for Application 2, and deployed the new version of the application.

2:35–3:10 (35 minutes): We did our first round of Sanity Tests for Application 2. Unfortunately, one of the back-end systems—one of the ones that was supposed to be up and running—was down, because of some emergency maintenance. (Replacement of a network card, or something.) Nobody bothered to inform us that this was happening, so we were taken by surprise, when we ran our test and it didn’t work. It was decided to Sanity Test as much as we could, ignoring that part of the application, and then do the Landing Tests (still ignoring that part of the application.) When it came back up, we’d revisit Sanity Testing, to verify the functionality.

3:10–5:30 (2 hours and 20 minutes): We did our Landing Testing, for the business to verify that the functionality was up and running. Because of the back-end system’s outage, they ended up Landing Testing some functionality before the Sanity Test was done. Issues discovered:
  • Character encoding issues with one back-end system that we connect to; French characters were not showing up properly in the UI.
  • A web app that we link to was down, during the deployment. (Not a big deal.)
  • We deployed the wrong version of Application 1, and had to re-deploy it. When it was redeployed, it was fine.
At the end of the day, we were only left with the issue regarding French characters. It was decided that the application could be left in production as-is, and we’d have to look into a fix later on, when we could look more closely at it. (It seems to be related to character encoding issues with some XML being sent between applications. If you’re interested in character encoding issues in XML, I can recommend a great book, with some very handsome faces on the cover…)
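In case the encoding issue sounds abstract: the usual failure mode is that one application serializes XML in one encoding while the other reads it assuming a different one. A minimal Python illustration (not taken from our systems) of serializing with an explicit encoding declaration, so the receiver doesn't have to guess:

    import xml.etree.ElementTree as ET

    # A tiny document containing a French accented character.
    root = ET.Element("client")
    ET.SubElement(root, "nom").text = "Béatrice"

    # Serializing with an explicit encoding includes an XML declaration
    # (<?xml version='1.0' encoding='utf-8'?>), so the receiving application
    # knows how to decode the bytes instead of guessing and mangling the "é".
    payload = ET.tostring(root, encoding="utf-8")
    print(payload.decode("utf-8"))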