serna Deployments

Sunday, March 16, 2008

Deployment: March 16, 2008

This should have been a pretty normal deployment. We were introducing a new system, along with a re-deployment of the usual application I work on, and we weren’t expecting it to take too long.

01:30–01:35 (5 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:35–01:40 (5 minutes): We shut down the application servers.

01:40: We decided not to bother with a database backup, because even if we had to back out the deployment, we wouldn’t be restoring the database in this case; skipping it saved us 30–40 minutes. (This is normally a dangerous decision to make; however, the database changes we were implementing could have gone in standalone, so even if we backed out the application, we wanted the database changes to stay in.)

01:40–02:00 (20 minutes): We executed our database scripts.

02:00–02:10 (10 minutes): We restarted the application servers.

02:10–02:15 (5 minutes): We deployed the “new” application, and did a quick sanity test. The test worked, but we realized we had a minor configuration problem. (I’d made a typo in the URL for a web service. Small mistakes can make for big problems…)

02:15–02:40 (25 minutes): We paged the person who could fix the problem, and waited for him to respond. (It took a while because his pager was on vibrate mode, and it took a few pages before it was able to wake him up. That’s a common occurrence when you’re trying to get hold of someone at 2:30 in the morning.)

02:40–02:55 (15 minutes): The person logged on, and fixed our problem for us. We were still slightly ahead of schedule at this point.

02:55–03:20 (25 minutes): We had another configuration problem, and had to troubleshoot that. (Someone else had made another typo! Apparently it was contagious.) This one required us to re-compile the application. (It was only a configuration change, but still required a re-build.)

03:20–03:45 (25 minutes): I couldn’t believe it, but we had yet another problem. Our new Oracle package seemed to be having issues, even though it had compiled correctly in the database. (Apologies if you didn’t understand that sentence; I couldn’t muster up the motivation to type out a post explaining what an “Oracle package” is, when I wrote this.) We had to get the DBA back on the bridge, to help us fix it. By the time we’d fixed this problem, we were 5 minutes behind schedule.

03:45–03:50 (5 minutes): We deployed the “main” application, which depends on the “new” application. This brought us back ahead of schedule (by 5 minutes).

03:50–04:05 (15 minutes): We performed our Sanity Testing. Still 5 minutes ahead of schedule.

04:05–05:30 (1 hour and 25 minutes): Landing tests were performed. At this point, I stopped paying attention to the schedule; Landing Tests always go long, with this team. They do a lot of testing. But since it’s the last thing we do, before sign-off, it’s not an issue if this part goes long. That being said, though, we were only about 10 minutes behind schedule, which is not too shabby.

At the end of the day, we deployed with one minor bug. (Can we never have a bug-free release?!?) However, it seemed to be a data integrity issue, and the only accounts that exhibited the behaviour are accounts that we’ve used heavily in the past, for testing. So it’s quite possible that we won’t actually see the issue “in the wild.” (Famous last words…)

Monday, March 3, 2008

Deployment: March 3, 2008

We had the conference call, regarding the previous deployment, and it was decided that we were confident enough in the original database scripts to proceed with the deployment, at 20:00. We would be using a different DBA.

20:00: We all logged onto the conference bridge, and confirmed that we were ready to go.

20:00–20:20 (20 minutes): We shut down the application servers for both apps.

20:20–20:30 (10 minutes): We backed up the back-end databases for both applications.

20:30–20:50 (20 minutes): We executed the database scripts for Application 1.

20:50–21:20 (30 minutes): We deployed Application 1.

20:50–21:25 (35 minutes): We executed the database scripts for Application 2, in parallel with the deployment of Application 1.

21:20–21:25 (5 minutes): We did sanity testing of Application 1, which turned out fine. (Only 5 minutes because sanity testing for this particular application is pretty quick.)

21:25–23:20 (1 hour and 55 minutes): The client did their Landing Test of Application 1, and everything looked good.

21:30–21:45 (15 minutes): We deployed Application 2.

21:45–22:00 (15 minutes): We did sanity testing of Application 2, and everything looked good.

22:00–00:40 (2 hours and 40 minutes): The client did their Landing Test of Application 2. There were some minor issues discovered, but none were show-stoppers. They signed off on the deployment, and we all went to bed.

The last time we’d deployed this, one of the back-end systems we depend on went crazy, and we had to back out. Many of us were thinking negative thoughts, worrying about having to do the same thing this time, as our heads hit our pillows…

Sunday, March 2, 2008

Deployment: March 2, 2008

This was a retry of the previous deployment, which was backed out.

For clarity, I’ve decided to start using military (24-hour) notation for the times, so that there’s no confusion between, e.g., 12:00 midnight and 12:00 noon. (I’m still rounding times to the nearest 5 minutes, though, to keep things simple.)
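As a rough illustration of the arithmetic behind the timings in these posts, here is a small Python sketch. The function names are my own, made up for this example; it assumes times are written as HH:MM in 24-hour notation, and that a step whose end time is earlier than its start time crossed midnight (e.g., a step that runs from 22:00 to 00:40).

```python
from datetime import datetime, timedelta


def step_duration(start: str, end: str) -> timedelta:
    """Elapsed time between two HH:MM clock times in 24-hour notation.

    Assumption: if the end time is earlier than the start time,
    the step crossed midnight, so one day is added to the end time.
    """
    fmt = "%H:%M"
    begin = datetime.strptime(start, fmt)
    finish = datetime.strptime(end, fmt)
    if finish < begin:
        finish += timedelta(days=1)
    return finish - begin


def round_to_five(minutes: int) -> int:
    """Round a whole number of minutes to the nearest multiple of 5."""
    return 5 * ((minutes + 2) // 5)


def format_duration(delta: timedelta) -> str:
    """Render a duration the way the timeline entries do."""
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    parts = []
    if hours:
        parts.append(f"{hours} hour" + ("" if hours == 1 else "s"))
    if minutes or not parts:
        parts.append(f"{minutes} minute" + ("" if minutes == 1 else "s"))
    return " and ".join(parts)


if __name__ == "__main__":
    print(format_duration(step_duration("22:00", "00:40")))
    print(format_duration(step_duration("00:30", "01:45")))
    print(round_to_five(13))
```

With 12-hour times, the same calculation would also need an AM/PM marker to be unambiguous, which is the confusion the 24-hour notation avoids.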

00:00–00:10 (10 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

00:10–00:30 (20 minutes): We made some configuration changes to the Application 1 servers, and shut down the application servers for both Application 1 and Application 2.

00:30–01:45 (1 hour and 15 minutes): We backed up the databases, just in case. This took longer than usual, because the DBA had to back up one of the databases to his own machine, instead of doing it on the server. (It’s normally done directly on the server, which avoids the backup having to travel across the network, but he didn’t plan ahead to get the Unix passwords for the boxen in question.)

01:25–03:20 (1 hour and 55 minutes): We made the modifications to the database for Application 1. (This overlapped with the task above, as the database for Application 2 continued to be backed up.) Again, this took much longer than it should have (it took just under an hour and a half, last time), because the DBA had to run the scripts over the network, instead of from the server.

03:20–03:25 (5 minutes): There were issues with the database scripts for Application 1. It turned out that the DBA had modified the scripts before running them, without telling anyone, and the modified scripts ran into issues. Because we were not confident that the database had been left in a valid state, it was decided to roll back, and restore the database from the backup.

03:20–04:30 (1 hour and 10 minutes): During this time, the team had a roundtable conversation to discuss whether we could still put in Application 2, even though Application 1 was rolled back. The problem was that the new version of Application 2 had never been tested against the old version of Application 1. It was decided that the risk was low (the nature of the changes being made indicated that the different versions of the applications should work together), but risk is risk, and since the combination hadn’t been tested, there was a chance of errors going undiscovered. It was finally decided to leave both applications as-is (that is, not try to deploy Application 2 without Application 1) and revisit the deployment in a day or two.

04:30–05:10 (40 minutes): We brought the applications back up. Luckily, based on the nature of the database changes for Application 1, it was decided that we didn’t need to restore the database; all we had to do was undo some configuration changes, on the application servers, and then bring those servers back online.

04:40–04:50 (10 minutes): The business came onto the bridge, and said that they’d prefer to try the deployment again, rather than roll back. Just to be safe, it was decided to continue with the rollback, and do a quick sanity test, so that there would be some version of the application online, while the decision was made. We also needed to chase the DBA team, and find a new DBA to perform the work.

05:10–05:20 (10 minutes): We all did a sanity/regression test, to verify that the applications were back up and running. They were fine.

05:20–05:30 (10 minutes): Dead air, as we waited for the DBA team to get back to us.

05:30–06:30 (1 hour): Another DBA joined the conference call, and was walked through the existing dilemma. He asked for some time to look over the situation—including what was changed, and why. Presumably, the results of his investigation would give the rest of us the warm and fuzzy that we can proceed to try again.

The result of this discussion was that the DBA(s) would try to recreate the errors we got, to prove that it was the modified scripts that caused the problems. We would reconvene at noon, to go over the findings, and talk about rescheduling the deployment.

Thursday, February 21, 2008

Deployment: February 21, 2008—Backout!

After a [mostly] successful deployment, some of the back-end systems that we depend on went crazy. The system was down for most of the morning. It was decided to roll back everything that went in during the night. (That included not just our application, but numerous other applications that had also deployed.)

12:00 (noon)–12:30 (30 minutes): We opened the bridge, and waited for people to join. Since this was an ad-hoc bridge, being opened at the last minute, it took longer than usual to round everyone up, and get them on the bridge. (Some of them may still have been sleeping, since we were exactly twelve hours after the original deployment.) Unfortunately, we weren’t able to get the DBA who had done the original deployment, but we had two other DBAs helping us.

12:30–12:45 (15 minutes): We shut down the application servers for both applications, and rolled back some of the configuration changes necessary for Application 1. (See the description of the original deployment, for what we were deploying, and a mention of “Application 1” vs. “Application 2.”)

12:45: At the last minute, the DBA who had done our original deployment showed up. This gave me a sense of relief, because it’s always easier for someone to back out her own work than for someone else to do it—even with good, detailed instructions.

12:45–1:20 (35 minutes): The DBAs rolled back the database changes. This is why the backups taken during the deployment were so crucial; they simply deleted the appropriate schemas, and re-imported the dump file taken during the backup. The DBA who had done our original deployment restored one of the databases, and, to shorten the outage time, another of the DBAs restored the other one. (All this time, the application was down, and users were unable to access it, so every second mattered.)

1:20–1:45 (25 minutes): The old version of Application 1 was re-deployed.

1:45–3:00 (1 hour and 15 minutes): Sanity Testing commenced for Application 1, and failed; the application didn’t come up correctly. It was believed to be a problem with the deployment procedure, so it was un-deployed, and re-deployed again. Unfortunately, the problem still remained. It was eventually determined that the application was deployed successfully, but we had a problem with a tool called Java Web Start—the Admin Tool for the application uses Java Web Start, and because we were using a new version of the Admin Tool, JWS didn’t re-download the old version.

As of 3:00, we were rolled back to our previous state—with no idea when we would next be attempting the deployment.

Deployment: February 21, 2008

This was a problematic deployment before we even got started. First of all, it was a full moon—and not just any full moon, a blood-red full moon. (Isn’t it a bad omen to be doing a deployment when the moon is turning to blood?)

But that wasn’t what I was referring to. It was a problematic deployment because we were dependent on three separate systems, for the deployment, and two of those systems were themselves dependent on a number of other systems being deployed.

There was a lot of confusion about which system had to be deployed before which other system, or which systems had to be installed on the same night, etc. Unfortunately, one of the systems—one of the ones that were two or three removed from my system—had issues, so they weren’t going to be able to deploy before us. It was decided that we could deploy anyway, though; our system would be able to gracefully handle the missing back-end system, simply displaying an error message at the appropriate spot, and when they were able to deploy the other system, our system would just magically start working. (Unfortunately, it meant that I’d still have to take part in the other deployment, to do some testing, which means that I’d have another sleepless night coming up.)

Also, we ourselves were deploying not one, but two applications. I’ll call them Application 1 and Application 2.

Once we finally got to the point where we were ready to deploy, here’s how it went down.

12:00 midnight: We all logged onto the conference bridge, and confirmed that we were ready to go.

12:05–12:10 (5 minutes): We shut down the application servers for both Application 1 and Application 2. To save time, we also made some necessary configuration changes, during this time, for Application 1. (It was a last-minute change to the procedure; we realized that if we’d waited until it was scheduled, it would have meant an extra reboot in the process. As much planning as you try to do ahead of time, you just can’t control when you’ll have a good idea.)

12:10–12:30 (20 minutes): We backed up the back-end databases that we depend on. Since we had to make some changes to the schemas, we backed them up first; it’s always a good idea, in case you have to roll back the deployment.

12:25–1:50 (1 hour and 25 minutes): We made the appropriate modifications to the database for Application 1. (The backup of this database finished before the other one, which is why this step starts before the end of the previous step.) Unfortunately, while the database scripts were executing, the database ran out of space in the “temp” tablespace. So the DBA had to page someone who does more low-level support of the database, to get the issue resolved, before the script could be run.

12:30–12:50 (20 minutes): We made the modifications to the database for Application 2. Again, you’ll notice that this step overlaps with the previous one; because the first change involved a long-running DB script, the DBA was able to make the modifications on the second database, while she waited for the long-running script to finish. There were a number of things happening at the same time, but I trusted the DBA to know how much she could do at once, and how much had to be done sequentially.

1:50–2:10 (20 minutes): We brought back one of the application servers for Application 1, and deployed the new version of the application.

2:10–2:20 (10 minutes): One more reboot, to be safe, after the deployment of Application 1.

2:20–2:25 (5 minutes): We did our Sanity Test of Application 1, and everything looked good.

2:25–2:35 (10 minutes): We brought back up the application servers for Application 2, and deployed the new version of the application.

2:35–3:10 (35 minutes): We did our first round of Sanity Tests for Application 2. Unfortunately, one of the back-end systems, one that was supposed to be up and running, was down because of some emergency maintenance. (Replacement of a network card, or something.) Nobody bothered to inform us that this was happening, so we were taken by surprise when we ran our test and it didn’t work. It was decided to Sanity Test as much as we could, ignoring that part of the application, and then do the Landing Tests (still ignoring that part of the application). When it came back up, we’d revisit Sanity Testing, to verify the functionality.

3:10–5:30 (2 hours and 20 minutes): We did our Landing Testing, for the business to verify that the functionality is up and running. Because of the back-end system’s outage, they ended up Landing Testing some functionality before the Sanity Test was done. Issues discovered:

Character encoding issues with one back-end system that we connect to; French characters were not showing up properly in the UI.
A web app that we link to was down, during the deployment. (Not a big deal.)
We deployed the wrong version of Application 1, and had to re-deploy it. When it was redeployed, it was fine.

At the end of the day, we were only left with the issue regarding French characters. It was decided that the application could be left in production as-is, and we’d have to look into a fix later on, when we could look more closely at it. (It seems to be related to character encoding issues with some XML being sent between applications. If you’re interested in character encoding issues in XML, I can recommend a great book, with some very handsome faces on the cover…)
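The root cause of the French-character issue hadn’t been pinned down at this point, so the following is only a guess at one common way this kind of bug shows up: one application writes its XML as UTF-8 bytes, and the other decodes those bytes as ISO-8859-1. The element name and the sample text in this Python sketch are made up for illustration.

```python
import xml.etree.ElementTree as ET

original = "Montréal"

# Hypothetical sender: serializes the XML document as UTF-8 bytes.
payload = f"<city>{original}</city>".encode("utf-8")

# Hypothetical receiver with the bug: it assumes the bytes are ISO-8859-1
# before parsing, so the two-byte UTF-8 sequence for "é" becomes two
# separate characters.
garbled = ET.fromstring(payload.decode("iso-8859-1")).text

# Receiver without the bug: hand the raw bytes to the parser, which treats
# a document with no encoding declaration as UTF-8.
correct = ET.fromstring(payload).text

print(garbled)  # MontrÃ©al
print(correct)  # Montréal
```

If the real problem looks like this, the fix is usually to make both sides agree on the encoding (and declare it in the XML), rather than to patch the text after the fact.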

Sunday, December 23, 2007

Deployment: December 23, 2007

The last deployment was a success. Well… sort of. We added in some new functionality which was pulling additional information from a back-end system. Unfortunately, that system wasn’t able to handle the extra load. And, because of the Christmas season, all of the systems were being taxed more heavily than usual, so there was a danger that this extra load would be enough to start causing crashes.

So it was decided to do an “emergency deployment,” and remove the functionality. The back-end system would be going through a change, the month after, which would make it able to handle this load, so the next version of my app would have the functionality reinstated.

It sucks that we had to do a deployment on the weekend before Christmas—when I was supposed to be on holidays—but that’s the way it goes, sometimes.

1:30AM: We all logged onto the conference bridge, and confirmed we were ready to go.

1:35–1:40: There were no database changes, this time around, so all we had to do was redeploy the application itself. We did so.

1:40–1:50: The application was back up, and we did a quick Sanity Test, to ensure it was working. The Sanity Tests passed.

1:50–2:05: The client did their Landing Tests. Again, the tests passed.

Sunday, December 2, 2007

Deployment: December 2, 2007

Finally. We finally got this thing deployed. After all of the false starts and number of times the release was deferred, it seems anti-climactic to have such a short post for this release, but the fact is, when we finally got a chance to deploy this thing, it went without a hitch.

1:30AM: We all logged onto the conference bridge, and confirmed we were ready to go.

1:35: We shut down the two applications that we had to deploy, for this release—the “front-end” app and the “back-end” app.

1:40: We backed up the database for the front-end app, and began the deployment of the back-end app. I can’t stress enough the importance of having a good, solid deployment plan, so that you can execute tasks in parallel like this, and not worry about losing track of who’s doing what!

1:45: The backup finished for the database, so our Database Administrator (DBA) began executing the new database scripts.

1:45: As the DBA executed the DB scripts, the back-end app was taking a bit longer than expected to come back up.

1:50: The back-end app came back up, and the DBA finished executing the scripts. We did our Sanity Test for the back-end app.

2:00: Sanity tests for the back-end app passed. We now began the deployment of the front-end app. (Because it depends on the back-end app, we had to ensure that the back-end app was up and running properly, before bothering to deploy the new version of the front-end app.)

2:05: The front-end app finished deploying, and we began our Sanity testing. At this point, we were about an hour ahead of schedule.

2:20: Sanity testing finished. We now got the clients to begin their Landing tests. We actually had to call some people, and get them to join the bridge early, since we were still ahead of schedule.

2:20–5:00: We performed Landing tests. We turned up two defects, but they were deemed minor enough that we could leave the release in, until a fix could be found.
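To show why a solid deployment plan makes it safe to run tasks in parallel, here is a minimal Python sketch that treats the steps of this deployment as a dependency graph. The step names are paraphrased from the timeline above, and the exact dependencies (for example, that the front-end deployment waits for the database scripts) are my own assumptions for the example, not the team’s actual plan. It uses `graphlib.TopologicalSorter` from the standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it can start.
# Step names and dependencies are illustrative assumptions.
plan = {
    "shut down both apps": set(),
    "back up front-end database": {"shut down both apps"},
    "deploy back-end app": {"shut down both apps"},
    "run database scripts": {"back up front-end database"},
    "sanity test back-end app": {"deploy back-end app"},
    "deploy front-end app": {"sanity test back-end app", "run database scripts"},
    "sanity test front-end app": {"deploy front-end app"},
    "landing tests": {"sanity test front-end app"},
}


def parallel_batches(steps):
    """Group steps into batches; steps within a batch can run in parallel."""
    sorter = TopologicalSorter(steps)
    sorter.prepare()  # raises graphlib.CycleError if the plan is circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(plan)

if __name__ == "__main__":
    for number, batch in enumerate(batches, start=1):
        print(number, ", ".join(batch))
```

Under these assumptions, the backup and the back-end deployment land in the same batch, which matches what happened at 1:40, while the front-end deployment has to wait until the back-end app has passed its Sanity Test.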

Saturday, December 1, 2007

Deployment: December 1, 2007

The investigations into the back-end system have completed, and they believe the problems were caused by a fault in the hardware load balancer for the back-end system. They’re making the change on the morning of December 1st, which means that we’re being deferred yet again.

Assuming that all goes well with the changes to the load balancer, we’ll go in Saturday night/Sunday morning, meaning December 2nd. We’ll have a go/no-go call at 5:00 PM on Saturday, to make the decision.

And just to make everything even more fun, the email servers were down all day Friday, so updates couldn’t be sent via email. We were all waiting around to see what would happen, but nobody was able to send updates.

I’m almost afraid to ask what else can go wrong with this release.

“Load Balancer” Defined/Explained

For high-availability systems, we usually want to cluster our servers. That is, instead of having one, very powerful server, we might want to split the processing between two or more servers. Requests can be processed by any of the servers in the cluster. This way, if any of the servers crashes, the other servers can handle the load, until the broken server is fixed.

However, most client applications can’t deal with a cluster; they need one place to go to, to get requests processed. So in order to enable clustering, there usually needs to be a load balancer put in place. The client applications only know the address/location of the load balancer, and the load balancer takes care of forwarding those requests to the servers in the cluster.

Depending on your needs, you may use a software load balancer, or a hardware load balancer. A software load balancer is simply a program running on an existing server, whereas a hardware load balancer is a dedicated networking device, which does nothing but balance traffic between different servers.
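To make the idea concrete, here is a toy software load balancer in Python. It is only a sketch under simple assumptions: it hands out servers in round-robin order and skips any server that has been marked as down. The class name, method names, and server names are all made up for this example; real load balancers also do health checks, session handling, and much more.

```python
class RoundRobinBalancer:
    """Toy load balancer: clients call route(); it picks a server for them."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the next server to try

    def mark_down(self, server):
        """Stop sending requests to a crashed server."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Resume sending requests to a repaired server."""
        if server in self.servers:
            self.healthy.add(server)

    def route(self):
        """Return the server that should handle the next request."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers left in the cluster")


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["app1", "app2", "app3"])
    print([balancer.route() for _ in range(4)])
    balancer.mark_down("app2")  # simulate a crash
    print([balancer.route() for _ in range(3)])
```

The client only ever talks to the balancer, so when one server crashes, the remaining servers quietly pick up its share of the requests.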

Deployment: November 30, 2007

The investigations into the back-end system’s crash were inconclusive. Still a no-go for our deployment. Again, maybe it’ll happen Friday night/Saturday morning, but otherwise, it’ll be Saturday night/Sunday morning.