Monday, September 24, 2007

Deployment: September 2007 (Take Two)

This was a second attempt at the deployment written about earlier. Since we were convinced we’d fixed the problems discovered during the first attempt, we decided to tackle this one exactly the same way.

Of course, there was one major difference: this time, I decided to work from home. On the previous attempt, they’d shut down the power in the building, forcing me to go home partway through; this time, they were going to shut down the phones and the network. Luckily, I knew about it ahead of time, so I was able to plan on working from home from the start, instead of being forced to relocate halfway through.

10:00PM: We’re scheduled to start at 10:00PM, but there’s an accident on the 401, which prevents me from getting home in time for the beginning of the deployment. Luckily, I really only have to log on for a couple of minutes, verify that everything is going smoothly, and then log back off until midnight. So I log onto the conference bridge from my car, on my cell. The “database split team” begins their work—at this point, they’re backing up the database, in case of rollback—and I log back off again. I have just walked into my house, at this point.

10:00–12:00: I start watching another movie, to kill the time. The Spy Who Loved Me, this time. (Still on my 007 kick.) I also make some tapioca pudding, since I’m home, and spend some time preparing my “work environment”—setting up a phone with a long cable (since I don’t know how long my cordless phone batteries will last), getting a couple of cordless phones handy (since I don’t like my “corded” phone), and preparing some things to drink.

12:00AM: I log back on. The backup was completed successfully. We shut down the Application Servers for “Application 1”, so that they can proceed with the database split. And then I log back off the bridge, since there won’t be any further activity until 12:45. (For a reminder of what “Application 1” and “Application 2” are, see the post from the first attempt.)

12:05–12:45: More movie.

12:45: I log back onto the bridge.

12:45–12:55: We sit around on the bridge, wondering where in the world everyone is. We then decide to proceed without them, for the time being—the clock is ticking, after all.

12:55: We shut down the App Servers for “Application 2”.

1:00–1:01: We back up the database, for “Application 2”.

1:01: After the quickest backup in history, we begin the database changes for “Application 2”. (We did double check, of course, to make sure the backup was successful; when a backup is that quick, you have to wonder if it really backed up at all…)

1:05: We’re informed that half of the members of the client team won’t be showing up. Yes, you read that right: They’re just not coming. (Since we’d already done it once, I guess they got bored with the whole thing…) We’re told, of course, that we can call them, if anything goes wrong. (How magnanimous.)

1:10: The database changes are finished. We bring the App Servers for “Application 2” back online, so that the deployment of that application can begin.

1:10–1:35: We deploy “Application 2”.

1:35–1:40: We conduct our Sanity Test. Our testing is positive, which means that we’ve fixed the first issue we’d run into on the last deployment. (Phew!) The client also jumps on the application to start testing, but I give him a verbal slap on the wrist for it—I’d prefer that we finish our own testing before handing it over to the client. After my testing is done—which only takes a couple of minutes anyway—I give the client the go-ahead to do his testing.

1:40: At this point, we’re ready to begin the deployment for “Application 1”, but, fortunately and unfortunately, we’re about an hour ahead of schedule. It’s fortunate because it means there is a chance of getting to bed early; it’s unfortunate because the people we need for the next phase of the deployment aren’t on the bridge—in fact, they’re probably sleeping, since they aren’t expecting to be needed, yet.

1:40–1:45: We call them on their home numbers, but they sleep through the calls. This is a fairly normal event, for late-night deployments; the human body is used to being asleep, at this time. So we take it in stride, and simply keep trying, maintaining our good humour. We finally get hold of them, and they join the bridge.

1:45–2:05: We deploy “Application 1”.

2:05–2:25: We do our Sanity Testing. This is a bit longer than we usually take, for this particular application—we can usually run through it in 10–15 minutes—but I think some of us are being extra careful. The good news, though, is that everything is working fine—meaning that the second problem we’d had, last week, is also fixed.

2:25–3:40: The client does their Landing Testing, and everything goes well. This is always the nerve-wracking part for me; there is so much chatter on the bridge, and every time someone asks a question—“Is it supposed to work like this?” or “How come this page is taking so long to load?”—I get nervous that we’ll have a bug that’s a “show stopper”. (In other words, a bug that’s serious enough to force us to back out the application.) I guess this is where I earn part of my salary, though; being able to think on my feet in the wee hours of the morning, answering questions, and deciding when a problem is serious, and when it is just a transient blip. (e.g. a link in the application that appears not to work, but it turns out it’s because the link is pointing to another application that’s also deploying right now.) You don’t want to be wrong, and assume something is transient if it’s not, or else you’ll spend the next week fielding calls from the Help Desk, with your cell phone glued to your ear. On this morning, though, we don’t have any bugs, serious or otherwise.

And, at 3:45AM, we have another successful deployment under our belts.

Wednesday, September 19, 2007

Deployment: September 2007

This was a more complex deployment than usual. There were three major systems involved:

  • One of the back-end databases we talk to was being split into two databases, for performance reasons.
  • The “main” application, which is web-based, was being upgraded.
  • A second back-end application was also being upgraded.
Something like this:

[Diagram: sep 2007 deployment]

(where, as always, I’ve greatly over-simplified this diagram)

I was worried about the database split. In our test environments, we’d had a lot of problems with it, because it’s a very complex thing to do; we kept encountering various permissions that needed to be re-created, and that type of thing. So I was worried that there would be additional permissions we hadn’t thought of, which we wouldn’t discover until we deployed in production. In addition, the work for this split was being done by a separate team, whereas my own team was responsible for the upgrades to “Application 1” and “Application 2”. Whenever more than one team is involved in a deployment, coordination becomes an issue. Or rather, a potential issue—I shouldn’t be so pessimistic.
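
As an aside: if I were scripting a check for this sort of thing, it would look something like the sketch below. It’s purely illustrative, not something we actually ran; the grantees, privileges, and object names are made up, and the view you query for grants depends entirely on which database product you’re using (the query below assumes an Oracle-style USER_TAB_PRIVS view).

```python
# Hypothetical list of grants we expect to exist on the newly split-off
# database; the grantees, privileges, and object names are invented.
EXPECTED_GRANTS = {
    ("APP1_USER", "SELECT", "CUSTOMER"),
    ("APP1_USER", "EXECUTE", "BILLING_PKG"),
    ("APP2_USER", "SELECT", "ACCOUNT"),
}

def missing_grants(conn, expected=EXPECTED_GRANTS):
    """Return the expected grants that aren't actually present.

    `conn` can be any DB-API 2.0 connection; the query assumes an
    Oracle-style USER_TAB_PRIVS view and would need adjusting for
    whatever database you actually run against.
    """
    cur = conn.cursor()
    cur.execute("SELECT grantee, privilege, table_name FROM user_tab_privs")
    actual = {(g.upper(), p.upper(), t.upper()) for g, p, t in cur.fetchall()}
    return expected - actual

# Usage, with whatever driver applies:
#   conn = some_driver.connect(...)
#   for grant in sorted(missing_grants(conn)):
#       print("MISSING:", grant)
```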

Because of the database split, which takes a long time to complete, we had special permission from the call centre to extend our outage, and start the deployment earlier than usual (12:00AM instead of 1:30).

Here’s how it went down:

10:00PM: I logged onto the conference bridge. The “database split team” was beginning their work at this time, to do some backups (in case of failure, later on). I simply logged on the bridge, verified they were good to go, and logged back off again.

10:00–12:00: To kill time until my part of the deployment was scheduled to start, I watched a movie on my laptop. The Man With the Golden Gun, in case you’re interested—I’ve been on a James Bond kick, lately.

12:00: I logged back on the bridge. Everything was still going smoothly; the database backups went quicker than they’d been anticipating, so they were just waiting for us to shut down our application, so that they could move on to their next piece.

12:00–12:05: We shut down our application.

12:05: The “database split team” began their next phase of the deployment, which was the actual work of splitting the databases into two. (You’ll note that I’m purposely not giving any details about this…) I wasn’t involved in this piece, so I had another hour or so to kill.

12:05–1:30: I finished watching my movie, and farted around doing some other stuff. Probably playing with Ubuntu.

1:30AM: I logged back onto the bridge. Again, everything was going smoothly; they’d completed their database work, and were ready for us to move forward.

1:30–2:00: We completed the deployment for “Application 2”, including the Sanity Test, with some minor issues. Or so we thought. Testing wouldn’t be complete, however, until “Application 1” could be tested, since “Application 1” is dependent on “Application 2”.

2:00: The power went off, in the building. (Let me repeat that: The power went off, in the building.) My laptop kept going, but my external monitor went off. I wasn’t overly worried, since they have backup power generators; I figured that the lights would probably come back on soon. And the networking infrastructure must be powered by the generators, because my network connections were still fine.

2:15: Someone went down to the security desk, and found out that this was a scheduled power outage. (“We communicated it to all of the appropriate channels…”) The power was scheduled to stay off until 5:00.

2:30–3:00: I knew that my laptop batteries wouldn’t last until 5:00, so I made the decision to drive home, and continue the deployment from there. (Assuming that VPN connectivity would be up and running…) During the 30-minute drive home, I stayed on the conference bridge, on my cell phone.

2:30–3:00: While I drove home, we completed the deployment for “Application 1”. (When I say “we”, I mean the people who did the actual work; luckily, I personally am not the one doing the work of a deployment.) Sanity Testing indicated that everything was working—except for connectivity to “Application 2”.

3:00–6:00: We—the technical team—continued troubleshooting the connectivity issue between “Application 1” and “Application 2”, while we let the client begin their Landing Test. (That is, they could test everything except the pieces of “Application 1” that require connectivity to “Application 2”.) Their testing confirmed our Sanity Test results; everything worked except for the connectivity between the two systems.

6:00: Time to make a decision: we needed something up by 7:00AM, so we either had to be confident that we could fix the system(s) within the next hour, or decide to roll back, since a backout takes about an hour. But then the client stepped in and granted us another hour—meaning that we could stay down until 8:00—to continue testing.

6:00–6:50: We continued troubleshooting, until… we eventually ran out of ideas. Around this time, there was talk of extending our window for another hour, but we decided not to bother; we just didn’t have anything else we could think of to test. So we reluctantly made the decision to roll back.

7:00–8:15AM: We rolled each of the three systems back. (That is, we “re-joined” the two databases together, and rolled “Application 1” and “Application 2” back to their pre-deployment states.) It took a little longer than we’d anticipated, so we were 15 minutes late getting back up and running, but we were so disappointed at having had to roll back in the first place that the 15-minute delay was the least of our worries.

So, as it turns out, the database split, which I’d been worried about, went very smoothly. Frankly, I hadn’t been expecting any problems with “Application 2”, and that’s exactly where our problem turned out to be.

As I write this, the deployment has been re-scheduled to be re-attempted this weekend (a week after the original attempt). We’ve fixed the issue which prevented “Application 1” and “Application 2” from talking to each other, and tested the changes, and are now confident that we’ll be able to get it in, this time.

There was talk of splitting the release into two deployments: one for the “database split”, and one for “Application 1” and “Application 2”. But it was eventually decided that
  1. It would be too much work to re-jig “Application 1” to work with the database split, without incorporating all of the other changes that were supposed to be released as part of this deployment; and
  2. We were confident enough that we’d solved our issues that we figured it was worth the risk to do it all in one shot, again.
So there will be another post here, probably next week, to outline how the second attempt goes, for this deployment.

Tuesday, September 18, 2007

“Deployment” Defined/Explained

I guess I should define the term deployment. (Unfortunately, I couldn’t find a definition in the Jargon File, so I’ll have to tackle it myself.) I’m a software developer, and I deal mainly in web-based business systems. (In other words, web sites used by a company, internally, to do its work. Most of the stuff that I do isn’t “public” or customer-facing.)

For example, I’m currently working on a system for a call centre; it’s a website that brings up customer information, and allows you to perform transactions on behalf of that customer—billing transactions, technical troubleshooting, etc. etc.

Because I use web-based technologies, for my applications, the structure is usually something like this:

[Diagram: simple app]

The users’ browsers connect to the web server, where the application resides, and the application, in turn, accesses other back-end systems or databases. Actually, that diagram is simplified; the structure is usually a bit more like this:

[Diagram: less simple app]

(Even this diagram is simplified, but of course, I can’t give out technical details about the actual systems I’m working on.)

Usually, servers are clustered together, to help handle the volume of calls to the application, and to provide redundancy—that is, if one server goes down, the other server handles the traffic until the broken server can be repaired.
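
To make that idea a little more concrete, here’s a toy sketch of what a load balancer is conceptually doing: hand each request to the next server in the cluster that still looks healthy. Real load balancers are dedicated hardware or software, not a few lines of Python, and the server names here are made up.

```python
import itertools

SERVERS = ["appserver-01", "appserver-02"]   # hypothetical cluster members
DOWN = {"appserver-02"}                      # pretend this one has failed

_rotation = itertools.cycle(SERVERS)

def healthy(server):
    # In reality this would be a ping, a TCP connect, or a request to a
    # health-check page; here we just consult a hard-coded set.
    return server not in DOWN

def route(request):
    """Hand the request to the next healthy server in the rotation."""
    for _ in range(len(SERVERS)):
        server = next(_rotation)
        if healthy(server):
            return f"{request} -> {server}"
    raise RuntimeError("no healthy servers left in the cluster")

if __name__ == "__main__":
    for i in range(4):
        print(route(f"request {i}"))
```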

When there is a new version of the application, to introduce new features, or fix bugs, it has to be installed on the Application Servers. (Sometimes changes have to be made to the Web Servers, too.) There is also usually work to be done on one or more of the back-end databases that the application uses for its data. Since the systems I work on are web-based, there is typically nothing to install on any users’ workstations; they use their web browsers to access the application, so as soon as the application is updated, they start seeing it immediately.

A deployment is when we uninstall the current version of the application on those servers and install the new version. Typically, for these types of applications, a deployment is a tricky process involving many steps that have to be performed in the proper sequence. Certain parts of the deployment can’t begin until other things have completed successfully. It needs to be planned out in advance, in detail, and, where I work, we typically create a deployment plan, in which we detail these steps.

And, because a deployment is so complex, a deployment plan needs to include a backout, or rollback, strategy. If you start a deployment, and for whatever reason it doesn’t work, you need to be able to go back to the version of the application that had been working, before you started the deployment.
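
To give a feel for what a deployment plan with a backout strategy boils down to, here’s a minimal sketch. The step names are placeholders, not our actual plan; the point is simply that every step that changes something is paired with an action that undoes it, and if any step fails, the steps already completed are undone in reverse order.

```python
# Each step is (description, do, undo). The functions here just print, as
# stand-ins for the real work (shutting down servers, running database
# scripts, installing the new code, and so on).
def make_step(description):
    do = lambda: print(f"doing:    {description}")
    undo = lambda: print(f"undoing:  {description}")
    return (description, do, undo)

PLAN = [
    make_step("shut down application servers"),
    make_step("apply database changes"),
    make_step("install new application version"),
    make_step("restart application servers"),
]

def run(plan):
    completed = []
    try:
        for description, do, undo in plan:
            do()
            completed.append((description, undo))
        print("deployment finished; on to the sanity test")
    except Exception as exc:
        print(f"step failed ({exc}); backing out")
        # Undo whatever we managed to complete, most recent step first.
        for description, undo in reversed(completed):
            undo()
        print("backout complete; previous version restored")

if __name__ == "__main__":
    run(PLAN)
```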

As an example, here’s what a typical deployment is like for me, these days:
  1. Because the application I’m currently working on is for a call centre, which is up and running 24/7, our deployments have to be scheduled for times when the call centre is the least busy; for us, this means deployments happen Saturday nights/Sunday mornings, beginning at 1:30AM. During these wee hours on a Sunday morning, the call centre isn’t too busy, so the few people who are working can get by without our application for a while.
    • But we do have to get the application back up and running by 7:00AM Sunday morning, when they start to get a bit more busy. So our deployment window is 1:30–7:00 AM.
  2. Our first step, when doing a deployment, is to shut the application down, so that the users can’t use it while we’re working.
    • There are times, during the deployment, when the application will be in an “inconsistent state”—meaning that some pieces are up and running, and others are not. It can cause problems when people are using the application in such a state.
    • In our specific case, shutting down the application means shutting down the Application Servers. Depending on the technologies used, this can be quick, or it can take a few minutes.
    • For other applications, this might be accomplished by shutting down the load balancers.

      Or, better yet, one might redirect the load balancer to a stripped-down stand-in for the application, which simply displays a message saying that the application is down.
  3. Once we have shut the application down, we would deploy any appropriate changes to back-end systems; in many cases, this might be changes to a database, but in more complex deployments, we may have to sit and wait, at this point, while another back-end system performs its own, complex, deployment.
  4. Once all of the back-end systems are back up and running—and tested, if necessary—we bring the Application Servers back up, and begin deploying the “main” application code.
  5. Once the application is back up, we do a sanity test. This means that my team, who developed the application, does a quick test to make sure that it’s properly back up and running. (A rough sketch of what such a check might look like appears after this list.)
  6. Once the sanity test is done, we would hand the application over to our client, who would do a landing test. This is a more thorough test, where they specifically test each new piece of functionality in the application, as well as regression testing existing parts of the application, to make sure it still works.
    • Typically, of course, I would also be performing these tests on my own. If there are issues, I want to know as soon as possible.
  7. If everything goes well, this is the end of the deployment. Once all of the testing has passed successfully, we can go home and go to bed.
  8. If things go wrong, however—if some of the testing fails—we then have to decide how to proceed. Normally, we will do some troubleshooting on the spot; if it’s an issue we can fix immediately, we do so, and the deployment still ends successfully.
    • In some cases, even when we troubleshoot the issue, we’re not able to solve the problem. There can be numerous reasons for this; perhaps it is a complex problem that requires research, or perhaps we just run out of time, before we have to have the application back up and running for the users.
    • In cases where we’re not able to solve the problem, we eventually have to make a decision to back our changes out, and put the old version of the application back in place.
And this is how a deployment works for the application I’m currently working on. Deployments for other applications could be much simpler, or much more complex.
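
For what it’s worth, here’s the bare-bones sketch of a sanity test promised back at step 5. The URLs are hypothetical, and the real thing also involves clicking through a few transactions by hand, rather than just checking that pages load.

```python
import urllib.request
import urllib.error

# Hypothetical pages that have to load for the application to count as "up".
PAGES = [
    "http://app1.internal.example.com/login",
    "http://app1.internal.example.com/customer/search",
    "http://app1.internal.example.com/billing/summary",
]

def sanity_test(pages=PAGES, timeout=30):
    """Fetch each page and collect any that fail to come back with HTTP 200."""
    failures = []
    for url in pages:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status != 200:
                    failures.append(f"{url}: HTTP {resp.status}")
        except (urllib.error.URLError, OSError) as exc:
            failures.append(f"{url}: {exc}")
    return failures

if __name__ == "__main__":
    problems = sanity_test()
    if problems:
        print("SANITY TEST FAILED:")
        for problem in problems:
            print("  " + problem)
    else:
        print("Sanity test passed; hand the application over for the landing test.")
```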

Deployment Blog

I’ve threatened to do it, in the past, and I’m finally making good on my threat: I’ve started a “Deployment Blog”. (It’s… er… it’s the blog you’re reading right now.) I do a lot of deployments—about one a month, these days—so it’s time I started blogging about them. As with the many other blogs I’ve started, I’m sure nobody will find this blog interesting but me. Only a fellow nerd/geek could possibly find this interesting, and even then, why would they want to read about my deployments? They’ve got their own to worry about.

Still, I’ve started it, and I’ll stick with it. For a while.

For those who don’t read my main blog, here are some links to posts from the past, when I mentioned previous deployments:

So from now on, when I have a deployment, I’ll write about it here. (Also, I will put up a post which describes what a deployment is, for those who aren’t familiar with the term.)