Wednesday, September 19, 2007

Deployment: September 2007

This was a more complex deployment than usual. There were three major systems involved:

  • One of the back-end databases we talk to was being split into two databases, for performance reasons
  • The “main” application, which is web-based, was being upgraded
  • A second back-end application was also being upgraded
Something like this:

[Diagram: September 2007 deployment]

(where, as always, I’ve greatly over-simplified this diagram)

I was worried about the database split; it’s a very complex thing to do, and we’d had a lot of problems with it in our test environments. We kept encountering permissions that needed to be re-created, and that type of thing, so I was concerned there would be additional permissions we hadn’t thought of, which we wouldn’t discover until we deployed in production. In addition, the work for this split was being done by a separate team, whereas my own team was responsible for the upgrades to “Application 1” and “Application 2”. Whenever more than one team is involved in a deployment, coordination becomes an issue. Or rather, a potential issue; I shouldn’t be so pessimistic.

Because of the database split, which takes a long time to complete, we had special permission from the call centre to extend our outage, and start the deployment earlier than usual (12:00AM instead of 1:30AM).

Here’s how it went down:

10:00PM: I logged onto the conference bridge. The “database split team” was beginning their work at this time, to do some backups (in case of failure, later on). I simply logged on the bridge, verified they were good to go, and logged back off again.

10:00–12:00: To kill time until my part of the deployment was scheduled to start, I watched a movie on my laptop. The Man With the Golden Gun, in case you’re interested—I’ve been on a James Bond kick, lately.

12:00: I logged back on the bridge. Everything was still going smoothly; the database backups had gone more quickly than they’d anticipated, and they were just waiting for us to shut down our application so that they could move on to their next piece.

12:00–12:05: We shut down our application.

12:05: The “database split team” began their next phase of the deployment, which was the actual work of splitting the databases into two. (You’ll note that I’m purposely not giving any details about this…) I wasn’t involved in this piece, so I had another hour or so to kill.

12:05–1:30: I finished watching my movie, and farted around doing some other stuff. Probably playing with Ubuntu.

1:30AM: I logged back onto the bridge. Again, everything was going smoothly; they’d completed their database work, and were ready for us to move forward.

1:30–2:00: We completed the deployment for “Application 2”, including the Sanity Test, with some minor issues. Or so we thought. Testing wouldn’t be complete, however, until “Application 1” could be tested, since “Application 1” is dependent on “Application 2”.

2:00: The power went off, in the building. (Let me repeat that: The power went off, in the building.) My laptop kept going, but my external monitor went dark. I wasn’t overly worried, since they have backup power generators; I figured that the lights would probably come back on soon. And the networking infrastructure must have been powered by the generators, because my network connections were still fine.

2:15: Someone went down to the security desk, and found out that this was a scheduled power outage. (“We communicated it to all of the appropriate channels…”) The power was scheduled to stay off until 5:00.

2:30–3:00: I knew that my laptop batteries wouldn’t last until 5:00, so I decided to drive home and continue the deployment from there. (Assuming that VPN connectivity would be up and running…) During the 30-minute drive home, I stayed on the conference bridge, on my cell phone.

2:30–3:00: While I drove home, we completed the deployment for “Application 1”. (When I say “we”, I mean the people who did the actual work; luckily, I personally am not the one doing the work of a deployment.) Sanity Testing indicated that everything was working—except for connectivity to “Application 2”.

3:00–6:00: We—the technical team—continued troubleshooting the connectivity issue between “Application 1” and “Application 2”, while we let the client begin their Landing Test. (That is, they could test everything except the pieces of “Application 1” that require connectivity to “Application 2”.) Their testing confirmed our Sanity Test results; everything worked except for the connectivity between the two systems.

6:00: We had to make a decision: we needed something up by 7:00AM, so we either had to be confident that we could fix the system(s) within the next hour, or decide to roll back, since a backout takes about an hour. But then the client stepped in and granted us another hour (meaning that we could stay down until 8:00) to continue testing.

6:00–6:50: We continued troubleshooting, until… we eventually ran out of ideas. Around this time, there was talk of extending our window for another hour, but we decided not to bother; we just didn’t have anything else we could think of to test. So we reluctantly made the decision to roll back.

7:00–8:15AM: We rolled each of the three systems back. (That is, we “re-joined” the two databases together, and rolled “Application 1” and “Application 2” back to their pre-deployment states.) It took a little longer than we’d anticipated, so we were 15 minutes late getting back up and running, but we were so disappointed at having had to roll back in the first place that the 15-minute delay was the least of our worries.

So, as it turns out, the database split, which I’d been worried about, went very smoothly. Ironically, it was “Application 2”, the piece I hadn’t been expecting any problems with, that ended up causing them.

As I write this, the deployment has been re-scheduled to be re-attempted this weekend (a week after the original attempt). We’ve fixed the issue which prevented “Application 1” and “Application 2” from talking to each other, and tested the changes, and are now confident that we’ll be able to get it in, this time.

There was talk of splitting the release into two deployments; one for the “database split”, and one for “Application 1” and “Application 2”. But it was eventually decided that
  1. It would be too much work to re-jig “Application 1” to work with the database split, without incorporating all of the other changes that were supposed to be released as part of this deployment; and
  2. We were confident enough that we’d solved our issues that we figured it was worth the risk to do it all in one shot again.
So there will be another post here, probably next week, outlining how the second attempt goes.
