Tuesday, May 27, 2008

Deployment: May 27, 2008

Yet another bug fix deployment. (This is why I’d been telling people that May was going to suck: deployments left, right, and centre, for bug fixes. There’s a reason that you should never, ever, ever rewrite your code base from scratch. But ours is not to question why…) Luckily, we could do this one without any kind of outage, so we got special permission to start earlier than usual: 10:00 PM, instead of the usual 1:30 AM.

22:00–22:05 (5 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

22:05–22:47 (42 minutes): We deployed the application.

22:47–23:13 (26 minutes): We Sanity Tested the application.

23:13–23:50 (37 minutes): The users did their Landing Tests.

Everything worked fine. There were some issues, but minor ones.

Overall deployment: 22:00–23:50 (1 hour and 50 minutes).

Friday, May 23, 2008

Deployment: May 23, 2008

Another bug fix deployment. We weren’t happy about it, because there were some big bugs that we knew still existed, but unfortunately we were having trouble reproducing some of the issues, so it was decided to deploy it into Production, and see if we could get some logs to help us diagnose the problem. That being said, we had fixed a whole bunch of other bugs, so it wasn’t necessarily a complete waste of time. It just irked us, because we knew that we’d be back in a couple of days, doing it again.

01:30–01:34 (4 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:34–01:36 (2 minutes): We shut down the application servers.

01:34–01:43 (9 minutes): We executed our database changes.

01:36–01:40 (4 minutes): We made some configuration changes to our web servers.

01:43–01:55 (12 minutes): We brought the application servers back up, and deployed the application.

01:55–02:12 (17 minutes): We Sanity Tested the application.

02:12–04:06 (1 hour and 54 minutes): The users did their Landing Tests.

Everything worked fine. At least, everything that we expected to work fine worked fine—we still had the issues that we were expecting to have.

Overall deployment: 01:30–04:06 (2 hours and 36 minutes).

Friday, May 16, 2008

Deployment: May 16, 2008

This was a simple “bug fix” or “service pack” deployment, to fix some of the issues discovered during the last deployment. Because there was no DBR this time, we expected the deployment to be quicker than usual, and because there was no new functionality, just bug fixes, we also expected the Landing Tests to be quicker than usual.

Quite a few steps in this deployment overlapped. We were doing a lot of things at once, and we had to gather some log files during the deployment, to troubleshoot some issues that we’d been having. This had nothing to do with the deployment itself, but we wanted to do it while we were there. (There were a lot of log files, so this step overlaps with much of the rest of the deployment; the log files were tarred and gzipped while everything else was going on.)

01:30–01:33 (3 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:33–01:34 (1 minute): We performed a minor configuration change, to turn off the redirection functionality. (See the last deployment for information on that.)

01:34–02:40 (1 hour and 6 minutes): We gathered the log files from the production servers.

01:42–01:52 (10 minutes): We redeployed the old version of the application.

01:52–02:05 (13 minutes): We did our Sanity Testing of the old version of the application.

01:52–02:40 (48 minutes): We adjusted the logging level on the new version of the application. (This wouldn’t normally take this long—it’s a simple configuration change—but again, the person making the change was also busy tarring and gzipping the old log files at the same time. Also, we had to have some discussions about what settings to use.)

02:10–02:15 (5 minutes): The users did their Landing Testing for the old application. (Everything was successful.)

01:52–02:45 (53 minutes): We redeployed the new version of the application. (Again, this took longer than normal because of the overlapping steps above—mostly the gathering of the log files.)

02:45–03:07 (22 minutes): We did our Sanity Testing of the new version of the application.

03:07–03:21 (14 minutes): The users did their Landing Testing for the new version of the application. (Everything was successful.)
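Because so many of the steps above overlap, adding up the individual durations overstates how long the deployment actually took. Here is a minimal Python sketch of the arithmetic, which merges overlapping time ranges before totalling them. The function names are my own, the ranges are typed in with a plain hyphen, and it assumes every step falls on the same calendar day (true here, since everything happened after midnight).

```python
from datetime import datetime, timedelta


def parse_range(text):
    """Parse 'HH:MM-HH:MM' into a (start, end) pair of datetimes.

    Assumes both times fall on the same calendar day.
    """
    start_text, end_text = text.split("-")
    start = datetime.strptime(start_text, "%H:%M")
    end = datetime.strptime(end_text, "%H:%M")
    return start, end


def elapsed(ranges):
    """Wall-clock time covered by the ranges, counting overlaps only once."""
    intervals = sorted(parse_range(r) for r in ranges)
    total = timedelta()
    current_start, current_end = intervals[0]
    for start, end in intervals[1:]:
        if start <= current_end:
            # Overlaps (or touches) the current block: extend it.
            current_end = max(current_end, end)
        else:
            # Gap found: close the current block and start a new one.
            total += current_end - current_start
            current_start, current_end = start, end
    total += current_end - current_start
    return total


# The step ranges from this deployment, in the order listed above.
STEPS = [
    "01:30-01:33", "01:33-01:34", "01:34-02:40", "01:42-01:52",
    "01:52-02:05", "01:52-02:40", "02:10-02:15", "01:52-02:45",
    "02:45-03:07", "03:07-03:21",
]

# Naive total: every step counted in full, overlaps included.
summed = sum((end - start for start, end in map(parse_range, STEPS)),
             timedelta())

print("Sum of step durations:", summed)
print("Elapsed wall-clock:   ", elapsed(STEPS))
```

For the ranges above, the step durations add up to 3 hours and 55 minutes, but the elapsed time is only 1 hour and 51 minutes (01:30 to 03:21).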

Sunday, May 11, 2008

Deployment: May 11, 2008

This was our biggest deployment. Maybe not the most important—release 1 would have qualified as the most important—but with this release, we were taking our application and migrating it over to a completely new technology. That meant a new platform, new servers, new software… new everything. It also meant lots of testing.

00:00–00:11 (11 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

00:11–00:55 (44 minutes): We began the backup of our database, before executing the DB scripts.

00:11–00:33 (22 minutes): Concurrently with the database backup, we began configuring our web servers.

00:33–01:00 (27 minutes): We began some configuration of our new application servers.

00:55–01:30 (35 minutes): We executed the DB scripts.

01:30–01:50 (20 minutes): We deployed the application to the application servers.

01:50–02:20 (30 minutes): We did our Sanity Testing.

02:20–06:40 (4 hours and 20 minutes): The users performed Landing Testing. There were issues, but they were all deemed minor enough that we could leave the deployment in place.

06:40–07:10 (30 minutes): We turned on the first “phase” of users. Because this rollout was so big, we decided to roll the users out in phases. It’s a web-based application, and all users go through the old application, which decides what “phase” each user is supposed to be on; a user in a “current” phase gets redirected to the new application’s URL. So we had to verify not only that the right users were on the right phase, but also that users on “non-active” phases were still getting the old version of the application (i.e. that they were not getting redirected).
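To make the phased redirect a bit more concrete, here is a rough Python sketch of the idea. It is not our actual code: the user names, phase numbers, URL, and function names are all made up for illustration. It shows the two halves of the check described above, that users in an active phase get redirected and that everyone else stays on the old application.

```python
# Hypothetical data: which phase each user belongs to, and which phases are live.
USER_PHASE = {"alice": 1, "bob": 2, "carol": 3}
ACTIVE_PHASES = {1}
NEW_APP_URL = "https://new-app.example.com/"  # placeholder, not a real URL


def route(user, user_phase=USER_PHASE, active_phases=ACTIVE_PHASES):
    """Return the new application's URL for users in an active phase.

    Returns None when the user should stay on the old application,
    including users who have no phase assigned at all.
    """
    if user_phase.get(user) in active_phases:
        return NEW_APP_URL
    return None


def verify_rollout(route_fn, user_phase, active_phases):
    """Return the users whose routing does not match their phase.

    Checks both directions: active-phase users must be redirected,
    and users on non-active phases must not be.
    """
    mismatched = []
    for user, phase in user_phase.items():
        should_redirect = phase in active_phases
        was_redirected = route_fn(user) is not None
        if should_redirect != was_redirected:
            mismatched.append(user)
    return mismatched


print(route("alice"))  # in phase 1, which is active
print(route("bob"))    # in phase 2, not active yet
print(verify_rollout(route, USER_PHASE, ACTIVE_PHASES))
```

Passing the routing function into verify_rollout means the same check could be pointed at whatever actually makes the redirect decision, rather than only at this toy version.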

All in all, the deployment went about as I’d expected it to. I knew that there would be a lot of testing to do, and I’d expected us to find some issues—you can’t rewrite your entire code base in a new technology and not expect to find issues, even late in the game—and I was relieved that none of the issues were deemed serious enough to roll back the deployment.

At this point, we had to monitor the application, and see how it handled the load we were throwing at it.

Sunday, May 4, 2008

Deployment: May 4, 2008

May is going to be a rough month—I’m doing deployments three out of four weekends this month. This one was sort of a preparatory deployment; the real functionality would be put into production the next week, but this one was getting ready for it.

01:30–01:33 (3 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:33–01:34 (1 minute): We shut down the application servers.

01:34: We decided that we didn’t need to back up the database before executing the database scripts, since a back-out of the deployment wouldn’t require backing out the database changes. (The scripts were only creating new objects, which could sit in the database unused in the event of a back-out; no modifications to existing objects were being made.)

01:34–01:38 (4 minutes): We executed the database scripts.

01:35–01:40 (5 minutes): I checked the output logs from the DB scripts, to verify that everything ran successfully.

01:40–01:43 (3 minutes): We restarted the application servers, and deployed one of the two applications we needed to deploy.

01:43–01:44 (1 minute): We deployed the other application.

01:44–02:26 (42 minutes): We did our Sanity Testing. We actually could have finished our Sanity Testing much earlier, but all of the people who needed to do Landing Tests were in their cars on their way to the office, and not ready to do their tests yet, so we just kept testing, until they got in.

02:15–02:36 (21 minutes): We did our Landing Tests. (This actually overlapped with the Sanity Test; the main users who we needed didn’t get into the office until after others did. So we did some Landing Tests, but the really important tests didn’t happen until later in this block of time.) Everything tested fine, which made the deployment a success.

In the end, it actually turned out to be a good thing that the important testing was delayed. It allowed us to get some of the minor testing—which would have been the bulk of time spent—out of the way, and we didn’t have to waste extra time at the end of the deployment.