Sunday, December 23, 2007
Deployment: December 23, 2007
The last deployment was a success. Well… sort of. We added in some new functionality which was pulling additional information from a back-end system. Unfortunately, that system wasn’t able to handle the extra load. And, because of the Christmas season, all of the systems were being taxed more heavily than usual, so there was a danger that this extra load would be enough to start causing crashes.
So it was decided to do an “emergency deployment,” and remove the functionality. The back-end system would be going through a change, the month after, which would make it able to handle this load, so the next version of my app would have the functionality reinstated.
It sucks that we had to do a deployment on the weekend before Christmas—when I was supposed to be on holidays—but that’s the way it goes, sometimes.
1:30AM: We all logged onto the conference bridge, and confirmed we were ready to go.
1:35–1:40: There were no database changes, this time around, so all we had to do was redeploy the application itself. We did so.
1:40–1:50: The application was back up, and we did a quick Sanity Test, to ensure it was working. The Sanity Tests passed.
1:50–2:05: The client did their Landing Tests. Again, the tests passed.
Sunday, December 2, 2007
Deployment: December 2, 2007
Finally. We finally got this thing deployed. After all of the false starts, and the number of times the release was deferred, it seems anti-climactic to have such a short post for this release, but the fact is, when we finally got a chance to deploy this thing, it went without a hitch.
1:30AM: We all logged onto the conference bridge, and confirmed we were ready to go.
1:35: We shut down the two applications that we had to deploy, for this release—the “front-end” app and the “back-end” app.
1:40: We backed up the database for the front-end app, and began the deployment of the back-end app. I can’t stress enough the importance of having a good, solid deployment plan, so that you can execute tasks in parallel like this, and not worry about losing track of who’s doing what!
1:45: The backup finished for the database, so our Database Analyst (DBA) began executing the new database scripts.
1:45: As the DBA executed the DB scripts, the back-end app was taking a bit longer than expected to come back up.
1:50: The back-end app came back up, and the DBA finished executing the scripts. We did our Sanity Test for the back-end app.
2:00: Sanity tests for the back-end app passed. We now began the deployment of the front-end app. (Because it depends on the back-end app, we had to ensure that the back-end app was up and running properly, before bothering to deploy the new version of the front-end app.)
2:05: The front-end app finished deploying, and we began our Sanity testing. At this point, we were about an hour ahead of schedule.
2:20: Sanity testing finished. We now got the clients to begin their Landing tests. We actually had to call some people, and get them to join the bridge early, since we were still ahead of schedule.
2:20–5:00: We performed Landing tests. We turned up two defects, but they were deemed minor enough that we could leave the release in, until a fix could be found.
Saturday, December 1, 2007
Deployment: December 1, 2007
The investigations into the back-end system have completed, and they believe the crash was caused by a problem with the hardware load balancer for the back-end system. They’re making the change on the morning of December 1st, which means that we’re being deferred yet again.
Assuming that all goes well with the changes to the load balancer, we’ll go in Saturday night/Sunday morning, meaning December 2nd. We’ll have a go/no-go call at 5:00 PM Saturday afternoon, to make the decision.
And just to make everything even more fun, the email servers were down all day Friday, so updates couldn’t be sent via email. We were all waiting around to see what would happen, but nobody was able to send updates.
I’m almost afraid to ask what else can go wrong with this release.
“Load Balancer” Defined/Explained
For high-availability systems, we usually want to cluster our servers. That is, instead of having one, very powerful server, we might want to split the processing between two or more servers. Requests can be processed by any of the servers in the cluster. This way, if any of the servers crashes, the other servers can handle the load, until the broken server is fixed.
However, most client applications can’t deal with a cluster; they need one place to go to, to get requests processed. So in order to enable clustering, there usually needs to be a load balancer put in place. The client applications only know the address/location of the load balancer, and the load balancer takes care of forwarding those requests to the servers in the cluster.
Depending on your needs, you may use a software load balancer, or a hardware load balancer. A software load balancer is simply a program running on an existing server, whereas a hardware load balancer is a dedicated networking device, which does nothing but balance traffic between different servers.
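To make the idea a bit more concrete, here is a minimal sketch of what a load balancer does, written in Python. It assumes the simplest possible strategy (round-robin, skipping any server that has been marked as down); real load balancers are far more sophisticated, and the class name and server names here are made up for illustration.

```python
class RoundRobinBalancer:
    """Forwards each request to the next healthy server in the cluster."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the server to try next

    def mark_down(self, server):
        """Stop sending requests to a crashed server."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Resume sending requests to a repaired server."""
        if server in self.servers:
            self.healthy.add(server)

    def route(self, request):
        """Return the server chosen to handle this request."""
        # Try each server at most once, starting where we left off.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers in the cluster")


# Hypothetical two-server cluster; clients only ever talk to the balancer.
balancer = RoundRobinBalancer(["app-server-1", "app-server-2"])
before_crash = [balancer.route(f"request-{i}") for i in range(4)]

balancer.mark_down("app-server-1")
after_crash = [balancer.route(f"request-{i}") for i in range(3)]
```

Before the simulated crash, requests alternate between the two servers; afterwards, everything goes to the surviving server, which is the whole point of clustering.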
Deployment: November 30, 2007
The investigations into the back-end system’s crash were inconclusive. Still a no-go for our deployment. Again, maybe it’ll happen Friday night/Saturday morning, but otherwise, it’ll be Saturday night/Sunday morning.
Deployment: November 29, 2007
The back-end system we were depending on deployed successfully on Tuesday morning, so we were scheduled to go in Wednesday night/Thursday morning. Everything was set, and we’d had all of our go/no-go meetings.
Unfortunately, the same back-end system crashed Wednesday afternoon. We had to cancel, pending investigation into what caused the crash.
Assuming that all went well, we’d go in the next day.
Friday, November 23, 2007
“Source Code Repository” Defined/Explained
When developing software, the source code is your most valuable asset. A Source Code Repository is a special database where the source code is stored.
Source Code Repositories—also called version control software—usually have the following types of features:
- The ability to check out and check in code. When you check code out, it means that you’re locking that piece of code, so that other developers know not to work on it (or can’t work on it) until you’re done. When you’re finished with it, you check it back in. (Sometimes repositories are set to allow multiple people to check a file out at the same time; in this case, the changes have to be merged together, as explained in the bullet below.)
- The ability to compare different versions of a file, and merge versions together. Comparing allows you to see what has changed, between the code you have on your computer and the code that’s checked into the repository. Merging allows you to take changes from multiple versions of a file, and combine them all together into one final version of the file.
- The ability to label a particular version of the code. (Different repository products have different naming conventions for this; they don’t all call it a “label”.) Normally, the software you’re developing will go through multiple iterations; you’ll have version 1.0 of the software, and then 1.1, or maybe 2.0, etc. With each version of the software the code base changes. However, if you label your code, from time to time, you can always go back and get the code as it was at any particular point in time. e.g. “get me the code as it was when I labelled it ‘1.0’.”
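As a rough illustration of the features above, here is a toy, in-memory repository in Python. It is only a sketch: it assumes the strict "locking" style of check-out, leaves merging out entirely, and every name in it (ToyRepository, billing.py, alice, bob) is invented for the example rather than taken from any real product.

```python
import difflib


class ToyRepository:
    """A toy repository: locking check-out/check-in, compare, and labels."""

    def __init__(self):
        self._versions = {}  # path -> list of file contents, oldest first
        self._locks = {}     # path -> developer holding the check-out
        self._labels = {}    # label -> {path: index into _versions[path]}

    def check_out(self, path, developer):
        """Lock the file for one developer and return its latest content."""
        holder = self._locks.get(path)
        if holder is not None and holder != developer:
            raise RuntimeError(f"{path} is checked out by {holder}")
        self._locks[path] = developer
        history = self._versions.get(path, [])
        return history[-1] if history else ""

    def check_in(self, path, developer, content):
        """Store a new version of the file and release the lock."""
        if self._locks.get(path) != developer:
            raise RuntimeError(f"{developer} has not checked out {path}")
        self._versions.setdefault(path, []).append(content)
        del self._locks[path]

    def label(self, name):
        """Remember the current version of every file under this label."""
        self._labels[name] = {p: len(v) - 1 for p, v in self._versions.items()}

    def get_labelled(self, name, path):
        """Return the file as it was when the label was applied."""
        return self._versions[path][self._labels[name][path]]

    def compare(self, path, local_content):
        """Show what differs between the repository copy and a local copy."""
        latest = self._versions[path][-1]
        return list(difflib.unified_diff(
            latest.splitlines(), local_content.splitlines(),
            fromfile="repository", tofile="local", lineterm=""))


repo = ToyRepository()
repo.check_out("billing.py", "alice")
repo.check_in("billing.py", "alice", "rate = 1\n")
repo.label("1.0")

repo.check_out("billing.py", "alice")
repo.check_in("billing.py", "alice", "rate = 2\n")

as_of_1_0 = repo.get_labelled("1.0", "billing.py")   # code as labelled "1.0"
changes = repo.compare("billing.py", "rate = 3\n")   # local edit vs. repository
```

Note that this version simply refuses a second check-out while someone else holds the lock; a repository that allows simultaneous check-outs would need a merge step instead.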
Deployment: November 2007 (almost)
This was going to be an unusual deployment, because it was scheduled for a Wednesday night, instead of a Saturday night. This worried me, because I would have to work on Thursday, and I was coming down with a cold.
It was also going to be a long deployment; there is another system we depend on, that had to deploy along with us, but their deployments are much longer than ours. So we would have had to shut down our system, and then wait for a couple of hours, until they were done, and then continue on with ours.
As luck would have it, though, their deployment was deferred. Their source code repository crashed, and they spent days trying to recover from it. So we’re rescheduling for this coming week. And the good news keeps getting better: we’re hoping to deploy the other application on Monday night, and my application on Wednesday night—so we won’t have such a long deployment.
Monday, September 24, 2007
Deployment: September 2007 (Take Two)
This was a second attempt on the deployment written about earlier. Since we were convinced we’d fixed the problems discovered during the first deployment, we decided to tackle this attempt exactly the same way.
Of course, there was one major difference, this time: I decided to take this one from home. On the previous attempt, they’d shut down the power in the building, forcing me to come home; this time, they were going to shut down the phones and the network. But luckily, I knew ahead of time, this time, so I was able to plan ahead to come home, instead of being forced halfway through.
10:00PM: We’re scheduled to start at 10:00PM, but there’s an accident on the 401, which prevents me from getting home in time for the beginning of the deployment. Luckily, I really only have to log on for a couple of minutes, verify that everything is going smoothly, and then log back off until midnight. So I log onto the conference bridge from my car, on my cell. The “database split team” begins their work—at this point, they’re backing up the database, in case of rollback—and I log back off again. I have just walked into my house, at this point.
10:00–12:00: I start watching another movie, to kill the time. The Spy Who Loved Me, this time. (Still on my 007 kick.) I also make some tapioca pudding, since I’m home, and spend some time preparing my “work environment”—setting up a phone with a long cable (since I don’t know how long my cordless phone batteries will last), getting a couple of cordless phones handy (since I don’t like my “corded” phone), and preparing some things to drink.
12:00AM: I log back on. The backup was completed successfully. We shut down the Application Servers for “Application 1”, so that they can proceed with the database split. And then I log back off the bridge, since there won’t be any further activity until 12:45. (For a reminder of what “Application 1” and “Application 2” are, see the post from the first attempt.)
12:05–12:45: More movie.
12:45: I log back onto the bridge.
12:45–12:55: We sit around on the bridge, wondering where in the world everyone is. We then decide to proceed without them, for the time being—the clock is ticking, after all.
12:55: We shut down the App Servers for “Application 2”.
1:00–1:01: We back up the database, for “Application 2”.
1:01: After the quickest backup in history, we begin the database changes for “Application 2”. (We did double check, of course, to make sure the backup was successful; when a backup is that quick, you have to wonder if it really backed up at all…)
1:05: We’re informed that half of the members of the client team won’t be showing up. Yes, you read that right: They’re just not coming. (Since we’d already done it once, I guess they got bored with the whole thing…) We’re told, of course, that we can call them, if anything goes wrong. (How magnanimous.)
1:10: The database changes are finished. We bring the App Servers for “Application 2” back online, so that the deployment of that application can begin.
1:10–1:35: We deploy “Application 2”.
1:35–1:40: We conduct our Sanity Test. Our testing is positive, which means that we’ve fixed the first issue we had problems with, on the last deployment. (Phew!) The client also jumps on the application, to start testing, but I give him a verbal slap on the wrist for it—I’d prefer us to finish our testing, before handing it over to the client. After my testing is done—which only takes a couple of minutes anyway—I give the client the go ahead to do his testing.
1:40: At this point, we’re ready to begin the deployment for “Application 1”, but, fortunately and unfortunately, we’re about an hour ahead of schedule. It’s fortunate because it means there is a chance of getting to bed early; it’s unfortunate because the people we need for the next phase of the deployment aren’t on the bridge—in fact, they’re probably sleeping, since they aren’t expecting to be needed, yet.
1:40–1:45: We call them on their home numbers, but they sleep through the calls. This is a fairly normal event, for late-night deployments; the human body is used to being asleep, at this time. So we take it in stride, and simply keep trying, maintaining our good humour. We finally get hold of them, and they join the bridge.
1:45–2:05: We deploy “Application 1”.
2:05–2:25: We do our Sanity Testing. This is a bit longer than we usually take, for this particular application—we can usually run through it in 10–15 minutes—but I think some of us are being extra careful. The good news, though, is that everything is working fine—meaning that the second problem we’d had, last week, is also fixed.
2:25–3:40: The client does their Landing Testing, and everything goes well. This is always the nerve-wracking part for me; there is so much chatter on the bridge, and every time someone asks a question—“Is it supposed to work like this?” or “How come this page is taking so long to load?”—I get nervous that we’ll have a bug that’s a “show stopper”. (In other words, a bug that’s serious enough to force us to back out the application.) I guess this is where I earn part of my salary, though; being able to think on my feet in the wee hours of the morning, answering questions, and deciding when a problem is serious, and when it is just a transient blip. (e.g. a link in the application that appears not to work, but it turns out it’s because the link is pointing to another application that’s also deploying right now.) You don’t want to be wrong, and assume something is transient if it’s not, or else you’ll spend the next week fielding calls from the Help Desk, with your cell phone glued to your ear. On this morning, though, we don’t have any bugs, serious or otherwise.
And, at 3:45AM, we have another successful deployment under our belts.
Wednesday, September 19, 2007
Deployment: September 2007
This was a more complex deployment than usual. There were three major systems involved:
- One of the back-end databases we talk to was being split into two databases, for performance reasons.
- The “main” application, which is web-based, was being upgraded.
- A second back-end application was also being upgraded.
(where, as always, I’ve greatly over-simplified this diagram)
I was worried about the database split; in our test environments, we had a lot of problems with it, because it’s a very complex thing to do; we kept encountering various permissions that needed to be re-created, and that type of thing, so I was worried that there would be additional permissions we hadn’t thought of before, that we wouldn’t discover until we deployed in production. In addition, the work for this split was being done by a separate team, whereas my own team was responsible for the upgrades to “Application 1” and “Application 2”. Whenever more than one team is involved in a deployment, coordination becomes an issue. Or rather, a potential issue—I shouldn’t be so pessimistic.
Because of the database split, which takes a long time to complete, we had special permission from the call centre to extend our outage, and start the deployment earlier than usual (12:00AM instead of 1:30).
Here’s how it went down:
10:00PM: I logged onto the conference bridge. The “database split team” was beginning their work at this time, to do some backups (in case of failure, later on). I simply logged on the bridge, verified they were good to go, and logged back off again.
10:00–12:00: To kill time until my part of the deployment was scheduled to start, I watched a movie on my laptop. The Man With the Golden Gun, in case you’re interested—I’ve been on a James Bond kick, lately.
12:00: I logged back on the bridge. Everything was still going smoothly; the database backups went quicker than they’d been anticipating, so they were just waiting for us to shut down our application, so that they could move on to their next piece.
12:00–12:05: We shut down our application.
12:05: The “database split team” began their next phase of the deployment, which was the actual work of splitting the databases into two. (You’ll note that I’m purposely not giving any details about this…) I wasn’t involved in this piece, so I had another hour or so to kill.
12:05–1:30: I finished watching my movie, and farted around doing some other stuff. Probably playing with Ubuntu.
1:30AM: I logged back onto the bridge. Again, everything was going smoothly; they’d completed their database work, and were ready for us to move forward.
1:30–2:00: We completed the deployment for “Application 2”, including the Sanity Test, with some minor issues. Or so we thought. Testing wouldn’t be complete, however, until “Application 1” could be tested, since “Application 1” is dependent on “Application 2”.
2:00: The power went off, in the building. (Let me repeat that: The power went off, in the building.) My laptop kept going, but my external monitor went off. I wasn’t overly worried, since they have backup power generators; I figured that the lights would probably come back on soon. And the networking infrastructure must be powered by the generators, because my network connections were still fine.
2:15: Someone went down to the security desk, and found out that this was a scheduled power outage. (“We communicated it to all of the appropriate channels…”) The power was scheduled to stay off until 5:00.
2:30–3:00: I knew that my laptop batteries wouldn’t last until 5:00, so I made the decision to drive home, and continue the deployment from there. (Assuming that VPN connectivity would be up and running…) During the 30 minute drive home, I stayed on the conference bridge, on my cell phone.
2:30–3:00: While I drove home, we completed the deployment for “Application 1”. (When I say “we”, I mean the people who did the actual work; luckily, I personally am not the one doing the work of a deployment.) Sanity Testing indicated that everything was working—except for connectivity to “Application 2”.
3:00–6:00: We—the technical team—continued troubleshooting the connectivity issue between “Application 1” and “Application 2”, while we let the client begin their Landing Test. (That is, they could test everything except the pieces of “Application 1” that require connectivity to “Application 2”.) Their testing confirmed our Sanity Test results; everything worked except for the connectivity between the two systems.
6:00: At 6:00, we had to make a decision: We needed something up, by 7:00AM, so we either had to be confident that we could fix the system(s) within the next hour, or make the decision to roll back, since a backout takes about an hour. But then the client stepped in and granted us another hour—meaning that we could stay down until 8:00—to continue testing.
6:00–6:50: We continued troubleshooting, until… we eventually ran out of ideas. Around this time, there was talk of extending our window for another hour, but we decided not to bother; we just didn’t have anything else we could think of to test. So we reluctantly made the decision to roll back.
7:00–8:15AM: We rolled each of the three systems back. (That is, we “re-joined” the two databases together, and rolled “Application 1” and “Application 2” back to their pre-deployment states.) It took a little longer than we’d anticipated, so we were 15 minutes late getting back up and running, but we were so disappointed at having had to roll back in the first place that the 15 minute delay was the least of our worries.
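The 6:00 decision point in the timeline above is just arithmetic: take the time the system has to be back up, and subtract how long a backout takes. Here is a small sketch of that calculation in Python. The function name is my own, the calendar date is a placeholder, and the one-hour backout is the estimate quoted above (which, as it turned out, was about 15 minutes too optimistic).

```python
from datetime import datetime, timedelta


def latest_rollback_decision(window_end, backout_duration):
    """Latest moment at which a rollback can start and still finish in time."""
    return window_end - backout_duration


# The date is a placeholder; only the times of day matter here.
window_end = datetime(2007, 1, 1, 7, 0)   # system must be back up by 7:00AM
backout_duration = timedelta(hours=1)     # estimated time for a backout

decide_by = latest_rollback_decision(window_end, backout_duration)

# The client granted one more hour, which moves the decision point too.
extended_decide_by = latest_rollback_decision(
    window_end + timedelta(hours=1), backout_duration)

print(decide_by.strftime("%H:%M"))           # 06:00
print(extended_decide_by.strftime("%H:%M"))  # 07:00
```

In other words, once the window was extended to 8:00, the rollback had to start by 7:00, which is exactly when we started it.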
So, as it turns out, the database split, which I’d been worried about, went very smoothly. Frankly, I hadn’t been expecting any problems with “Application 2”, and yet that was the piece that turned out to be a problem for us.
As I write this, the deployment has been re-scheduled to be re-attempted this weekend (a week after the original attempt). We’ve fixed the issue which prevented “Application 1” and “Application 2” from talking to each other, and tested the changes, and are now confident that we’ll be able to get it in, this time.
There was talk of splitting the release into two deployments; one for the “database split”, and one for “Application 1” and “Application 2”. But it was eventually decided that:
- It would be too much work to re-jig “Application 1” to work with the database split, without incorporating all of the other changes that were supposed to be released as part of this deployment
- We were confident enough that we’d solved our issues that we figured it was worth the risk to do it all in one shot, again
Tuesday, September 18, 2007
“Deployment” Defined/Explained
I guess I should define the term deployment. (Unfortunately, I couldn’t find a definition in the Jargon File, so I’ll have to tackle it myself.) I’m a software developer, and I deal mainly in web-based business systems. (In other words, web sites used by a company, internally, to do its work. Most of the stuff that I do isn’t “public” or customer-facing.)
For example, I’m currently working on a system for a call centre; it’s a website that brings up customer information, and allows you to perform transactions on behalf of that customer—perform billing transactions, do technical troubleshooting, etc. etc.
Because I use web-based technologies, for my applications, the structure is usually something like this:
The users’ browsers connect to the web server, where the application resides, and the application, in turn, accesses other back-end systems or databases. Actually, that diagram is simplified; the structure is usually a bit more like this:
(Even this diagram is simplified, but of course, I can’t give out technical details about the actual systems I’m working on.)
Usually, servers are clustered together, to help handle the volume of calls to the application, and to provide redundancy—that is, if one server goes down, the other server handles the traffic until the broken server can be repaired.
When there is a new version of the application, to introduce new features, or fix bugs, it has to be installed on the Application Servers. (Sometimes changes have to be made to the Web Servers, too.) There is also usually work to be done on one or more of the back-end databases that the application uses for its data. Since the systems I work on are web-based, there is typically nothing to install on any users’ workstations; they use their web browsers to access the application, so as soon as the application is updated, they start seeing it immediately.
A deployment is when we un-install the current version of the application, on those servers, and re-install a new version. Typically, for these types of applications, a deployment is a tricky process, involving many steps, that have to be performed in a proper sequence. Certain parts of the deployment can’t be completed until other things have completed successfully. It needs to be planned out in advance, in detail, and, where I work, we typically create a deployment plan, where we detail these steps.
And, because a deployment is so complex, a deployment plan needs to include a backout, or rollback, strategy. If you start a deployment, and for whatever reason it doesn’t work, you need to be able to go back to the version of the application that had been working, before you started the deployment.
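One way to picture the "proper sequence" part of a deployment plan is as a set of steps, each listing the steps that must finish before it can start. The sketch below uses Python's standard graphlib module (Python 3.9 or later) to work out a valid order. The step names are invented for illustration, and this is not how our actual plans are written; those are documents, not code.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must be completed before it.
# Step names are hypothetical, loosely based on a typical deployment.
plan = {
    "shut down application": set(),
    "back up database": {"shut down application"},
    "apply database changes": {"back up database"},
    "deploy back-end app": {"shut down application"},
    "deploy front-end app": {"apply database changes", "deploy back-end app"},
    "sanity test": {"deploy front-end app"},
    "landing test": {"sanity test"},
}

# static_order() yields every step after all of its prerequisites.
order = list(TopologicalSorter(plan).static_order())

for number, step in enumerate(order, start=1):
    print(f"{number}. {step}")
```

Steps that don't depend on each other (here, the database work and the back-end app) can be done in parallel by different people, which is one of the reasons it pays to write the dependencies down.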
As an example, here’s what a typical deployment is like for me, these days:
- Because the application I’m currently working on is for a call centre, which is up and running 24/7, our deployments have to be scheduled for times when the call centre is the least busy; for us, this means deployments happen Saturday nights/Sunday mornings, beginning at 1:30AM. During these wee hours on a Sunday morning, the call centre isn’t too busy, so the few people who are working can get by without our application for a while.
- But we do have to get the application back up and running by 7:00AM Sunday morning, when they start to get a bit more busy. So our deployment window is 1:30–7:00 AM.
- Our first step, when doing a deployment, is to shut the application down, so that the users can’t use it while we’re working.
- There are times, during the deployment, when the application will be in an “inconsistent state”—meaning that some pieces are up and running, and others are not. It can cause problems when people are using the application in such a state.
- In our specific case, shutting down the application means shutting down the Application Servers. Depending on the technologies used, this can be quick, or it can take a few minutes.
- For other applications, this might be accomplished by shutting down the load balancers. Or, better yet, one might redirect the load balancer to another version of the application, which simply says that the application is down.
- Once we have shut the application down, we would deploy any appropriate changes to back-end systems; in many cases, this might be changes to a database, but in more complex deployments, we may have to sit and wait, at this point, while another back-end system performs their own, complex, deployment.
- Once all of the back-end systems are back up and running—and tested, if necessary—we bring the Application Servers back up, and begin deploying the “main” application code.
- Once the application is back up, we do a sanity test. This means that my team, who developed the application, does a quick test, to make sure that it’s properly back up and running.
- Once the sanity test is done, we would hand the application over to our client, who would do a landing test. This is a more thorough test, where they specifically test each new piece of functionality in the application, as well as regression testing existing parts of the application, to make sure they still work.
- Typically, of course, I would also be performing these tests on my own. If there are issues, I want to know as soon as possible.
- If everything goes well, this is the end of the deployment. Once all of the testing has passed successfully, we can go home and go to bed.
- If things go wrong, however—if some of the testing fails—we then have to decide how to proceed. Normally, we will do some troubleshooting on the spot; if it’s an issue we can fix immediately, we do so. Sometimes we can, and the deployment still ends successfully.
- In some cases, even when we troubleshoot the issue, we’re not able to solve the problem. There can be numerous reasons for this; perhaps it is a complex problem, that requires research, or perhaps we just run out of time, before we have to have the application back up and running for the users.
- In cases where we’re not able to solve the problem, we eventually have to make a decision to back our changes out, and put the old version of the application back in place.
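The overall shape of the process described above (do the steps in order, test, and back everything out if the testing fails) can be sketched in a few lines of Python. This is a simplification under some loud assumptions: every step has a matching "undo" action, the on-the-spot troubleshooting is left out, and the step and test names are hypothetical. In real life the "steps" are people doing work on a conference bridge, not function calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    apply: Callable[[], None]  # does the work for this step
    undo: Callable[[], None]   # backs this step out again


def run_deployment(steps, tests):
    """Apply steps in order, then run the tests; back out on any failure."""
    done = []
    try:
        for step in steps:
            step.apply()
            done.append(step)
        for name, test in tests:
            if not test():
                raise RuntimeError(f"{name} failed")
    except Exception as problem:
        # Back out in reverse order, newest change first.
        for step in reversed(done):
            step.undo()
        return f"rolled back: {problem}"
    return "deployed"


log = []  # records what "happened", in place of real work


def make_step(name, undo_name):
    return Step(name, lambda: log.append(name), lambda: log.append(undo_name))


steps = [
    make_step("stop app servers", "restart app servers"),
    make_step("apply database changes", "restore database backup"),
    make_step("install new version", "reinstall old version"),
    make_step("start app servers", "stop app servers"),
]

# Pretend the sanity test passes but the client's landing test does not.
tests = [("sanity test", lambda: True), ("landing test", lambda: False)]

outcome = run_deployment(steps, tests)
```

Undoing the steps in reverse order matters: the old version of the application has to be put back before the servers are restarted, just as it would be in a real backout.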
Deployment Blog
I’ve threatened to do it, in the past, and I’m finally making good on my threat: I’ve started a “Deployment Blog”. (It’s… er… it’s the blog you’re reading right now.) I do a lot of deployments—about one a month, these days—so it’s time I started blogging about them. As with the many other blogs I’ve started, I’m sure nobody will find this blog interesting but me. Only a fellow nerd/geek could possibly find this interesting, and even then, why would they want to read about my deployments? They’ve got their own to worry about.
Still, I’ve started it, and I’ll stick with it. For a while.
For those who don’t read my main blog, here are some links to posts from the past, when I mentioned previous deployments:
- An intellgent, well-thought-out post covering the state of serna’s life at 11:30 on a Sunday night—one of those posts where the title sucks you in, only to be disappointed by the post itself
- The Long, Long Weekend—the first deployment I really wrote about
- My Deployment—a deployment post which is more detailed than usual
- Deployment—Let’s try that again, shall we? and Deployment was successful. You can relax now.—this was a second try at the previous deployment
- Deployments—a general post about what my Saturday night deployments are like
- Another Deployment—a post where I had the idea for starting a blog about deployments, and then came to the conclusion that I never would. Shows how much I know.