Sunday, March 22, 2009

Deployment: March 21, 2009

This should have been a fairly typical deployment. We were putting in a new J2EE application, with a database change. The minor difference between this one and my normal deployments is that this wasn’t an update, it was a brand new J2EE application; so if we screwed up, and it wasn’t properly deployed on time, we wouldn’t have to roll back to a previous version.

22:00–22:01 (1 minute): We all logged onto the conference bridge, and confirmed that we were ready to go.

22:01–23:33 (1 hour and 32 minutes): The DBA ran the database scripts. There were some errors, so this took a lot longer than expected, but she managed to work through the problems and get everything done.

22:01–22:34 (33 minutes): We configured the J2EE cluster (connection pools, security, etc.). We then had to wait for the database scripts to be finished, to move on to the next part of the deployment.

23:33–02:30 (2 hours, 57 minutes): We deployed the J2EE application. Again, there were difficulties, which were hard to troubleshoot, because they had to do with some internal workings of our vendor’s software, rather than our own code. We were getting errors between the vendor’s software and their database objects. After troubleshooting as much as we could, we called it a night.

Overall deployment: 22:00–02:30 (4 hours and 30 minutes). We left the bridge knowing that there was a good chance we’d need to rebuild the environment from scratch; it was quite possibly a configuration issue, but the likelihood of being able to track it down was slim. We tentatively planned to retry the next night, schedules (and permissions) pending.
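Because this timeline crosses midnight, the durations can't be worked out by plain subtraction. Here is a minimal sketch of the arithmetic, assuming 24-hour HH:MM times and that no single step runs longer than 24 hours. The function names are mine, purely for illustration.

```python
def duration_minutes(start: str, end: str) -> int:
    """Minutes from start to end (HH:MM, 24-hour), rolling over midnight if needed."""
    start_h, start_m = map(int, start.split(":"))
    end_h, end_m = map(int, end.split(":"))
    # Taking the difference modulo one day handles an end time after midnight.
    return ((end_h * 60 + end_m) - (start_h * 60 + start_m)) % (24 * 60)


def describe(minutes: int) -> str:
    """Format a minute count as hours and minutes."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}m"


print(describe(duration_minutes("22:00", "02:30")))  # overall deployment: 4h 30m
print(describe(duration_minutes("23:33", "02:30")))  # application deployment: 2h 57m
```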

Sanity/Landing Tests Defined/Explained

I use the terms Sanity Test and Landing Test a lot, so I should define what I mean by them—others might use these terms in different ways, but this is how I and my team use them.

When we’re doing a deployment, there is the technical team, which is doing the actual work of the deployment, and troubleshooting any technical issues, and there are the business users, who want to make sure that the application meets their standards before they hand it over to the end users.

The Sanity Tests are done by the technical team. Once we’ve deployed the application, and everything is up and running, we quickly run through it to make sure that it really is up and running. As the name implies, this really is just a quick sanity check, to make sure that everything is up; it’s not meant to be exhaustive.
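To give a feel for how lightweight a sanity pass is, here is a rough sketch of one as a script. It is not what my team actually runs; the check names and the stand-in lambdas are hypothetical, and real checks would do things like load the home page or log in with a test account.

```python
from typing import Callable, Dict, List


def run_sanity_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each quick check and return the names of the ones that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that raises counts as a failure; it should not abort the run.
            ok = False
        if not ok:
            failed.append(name)
    return failed


# Hypothetical stand-ins for real checks against the deployed application.
checks = {
    "home page loads": lambda: True,
    "user can log in": lambda: True,
    "database reachable": lambda: True,
}

failed = run_sanity_checks(checks)
print("Sanity passed" if not failed else f"Sanity failed: {failed}")
```

The point is the shape: a short list of yes/no checks, not an exhaustive test suite.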

The Landing Tests are done by the business users. Once the technical team has given the go ahead that the application is up and running (as verified by sanity testing), the business users go ahead and verify that the application is working according to their specifications. As with the sanity tests, landing tests are not meant to be exhaustive; the exhaustive testing should have been done long before the deployment! (We have a phase of testing called User Acceptance Testing, or UAT, which is where the exhaustive testing should have been done.) The purpose of the landing tests is for the business to ensure that everything is running, and that’s it. If a bug is discovered, it must be one of two things:

  1. An environmental issue, where the production environment is different from the other environments and it’s causing a problem, or
  2. A gap in the UAT, where something wasn’t tested properly.

We do our best to plan for the former. We get very annoyed by the latter, but to err is human.

If either the sanity tests or the landing tests fail, we do our best to try and fix the problems right then and there, but that’s not always possible—especially at 2:00 in the morning. So when there are issues, it’s up to the business to decide if these issues are “show stoppers,” meaning we have to abort the deployment and roll back to the previous version of the application, or whether the issues are minor enough that they can live with them.

Thursday, March 19, 2009

Deployment: March 19, 2009

Not a complex deployment, this time. A simple J2EE EAR file to be updated, with no database scripts to worry about or configuration changes to make.

01:30–01:43 (13 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go. It took longer than it was supposed to; some people joined late.

01:43–02:02 (19 minutes): We un-deployed the old version of the EAR file, and deployed the new version.

02:02–02:08 (6 minutes): We did our Sanity Testing, and ensured that the app came back online successfully.

02:08–02:16 (8 minutes): We waited for the business to join the conference call, to do their Landing Tests; we were a bit ahead of schedule, so they hadn’t been expecting to join the call this early.

02:16–02:25 (9 minutes): The client did their Landing Testing, and signed off that everything was working as it was supposed to.

Overall deployment: 01:30–02:25 (55 minutes).

Saturday, September 20, 2008

[Mini] Deployment: September 20, 2008

This was actually just a simple database script, to update one of our data tables. No structural changes, just data. But someone decided to treat it like a full deployment anyway, complete with the conference call and everything. So…

01:30–01:37 (7 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go. It took a bit longer than we’d expected, because we weren’t sure if one particular person was going to join or not.

01:37–01:37 (about 30 seconds): We backed up the database.

01:37–01:41 (4 minutes): We executed the DB scripts.

01:41–01:48 (7 minutes): We did our Sanity Test—mostly just checking the logs that the scripts worked fine, as well as a quick glance at the application, to verify that the data was showing up correctly.

01:43–01:53 (10 minutes): The client performed their Landing Test. (Further tests that the data was showing up correctly in the application…) We let them start before the Sanity was done, in this case, since all indications were that things had worked fine, and we were just waiting on an email to show up in my Inbox to verify the scripts.

Overall, we were done ahead of schedule, and everything worked well. Total deployment: 01:30–01:53 (23 minutes).

Thursday, September 11, 2008

Deployment: September 11, 2008

This was a bug fix deployment, to fix the issue discovered with the previous deployment.

01:30–01:34 (4 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:34–02:01 (27 minutes): We backed up the database, and executed the database scripts. This took longer than usual, because the DBA accidentally ran the scripts on the wrong database, the first time.

01:34–01:56 (22 minutes): We un-deployed the old version of the application, and shut down the application servers.

02:06–02:19 (13 minutes): We deployed the new version(s) of the application(s). (Some were actually still deployed from the last time, and just had to be re-activated, while others—the ones with bug fixes—actually had to be redeployed.)

02:19–03:04 (45 minutes): We conducted the Sanity Test. During Sanity, we discovered an issue unrelated to our deployment—one that had been part of the application for the previous four months.

03:00–05:23 (2 hours and 23 minutes): The client did their Landing Tests. (This was overlapped with the Sanity Test a bit.) Some additional issues were discovered, but they were all deemed minor enough that we didn’t need to back out. (The most serious one turned out to not be a bug with our application, but with a back-end system that we depend on. So the bug fix will be for them, not for us.)
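For the curious, the amount of overlap between the Sanity Test and the Landing Tests can be read off the two time windows. This is a small sketch, assuming both windows fall on the same side of midnight (true for this timeline); the helper names are only illustrative.

```python
def to_minutes(hhmm: str) -> int:
    """Convert an HH:MM string to minutes since midnight."""
    hours, minutes = map(int, hhmm.split(":"))
    return hours * 60 + minutes


def overlap_minutes(a_start: str, a_end: str, b_start: str, b_end: str) -> int:
    """Minutes shared by two same-day windows; 0 if they don't overlap."""
    start = max(to_minutes(a_start), to_minutes(b_start))
    end = min(to_minutes(a_end), to_minutes(b_end))
    return max(0, end - start)


# Sanity Test (02:19 to 03:04) vs. Landing Tests (03:00 to 05:23)
print(overlap_minutes("02:19", "03:04", "03:00", "05:23"))  # 4
```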

Overall deployment: 01:30–05:23 (3 hours and 53 minutes).

Deployment: September 8, 2008

No timeline with this one.

There had been an issue with the previous deployment. Every once in a while, the CPU on the application servers was getting up close to 100%, and the Support Team had to reboot them to keep the application alive.

Unfortunately, rolling back wasn’t working; we tried un-deploying the application, and re-deploying the old version, but the new version seemed to be cached on the application servers. So we had to call the application server vendor, and get help. (It turned out we had to undeploy, remove the temp directory, and then re-deploy the old version.)
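The order of those three steps is what mattered, so here it is as a sketch. The vendor's actual commands aren't something I can reproduce here, so `undeploy` and `deploy_previous` are hypothetical callables standing in for them, and `temp_dir` is whatever temp directory your application server uses.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def roll_back(undeploy: Callable[[], None],
              deploy_previous: Callable[[], None],
              temp_dir: Path) -> None:
    """Undeploy, remove the server's temp directory, then redeploy the old version."""
    undeploy()
    # Remove cached copies of the new version; a directory that is already gone is fine.
    shutil.rmtree(temp_dir, ignore_errors=True)
    deploy_previous()


# Demo with a throwaway directory and print statements in place of real server commands.
demo_temp = Path(tempfile.mkdtemp()) / "tmp"
demo_temp.mkdir()
(demo_temp / "cached.ear").write_text("stale copy of the new version")

roll_back(
    undeploy=lambda: print("undeploying current version"),
    deploy_previous=lambda: print("deploying previous version"),
    temp_dir=demo_temp,
)
print(demo_temp.exists())  # False
```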

Sunday, September 7, 2008

Deployment: September 7, 2008

This was a slightly more intricate deployment than usual, because there were a few moving parts. We had not one but two sets of database changes to go through, plus three separate applications to deploy and/or re-deploy on the application servers. Plus, because of our new “global delivery model,” we had a DBA in India who would be making our database changes, which was new. So I was hoping it wouldn’t be a bad deployment, but I was a bit worried.

00:00: As I was getting ready for the deployment, I logged into my computer, and noticed an email from the DBA in India, dated twenty-four hours prior to the deployment, saying that he wasn’t able to join the bridge, but would I please call his cell phone when we were ready for him to start. Not a good sign. I looked in the database, and found that the changes had already been executed (presumably twenty-four hours earlier than they were supposed to be done). This could have been very bad, but we got lucky, and it turns out that the database changes didn’t break the existing application; the old version of the app was able to run for the last twenty-four hours with the database changes.

01:30–01:50 (20 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go. This took much longer than usual, because we had to confirm with the DBA what had actually been done, and what hadn’t. It turned out to be even more confusing than I’d thought: out of the two DB changes we needed, one was partially done, and one was not done at all.

01:50–01:52 (2 minutes): We shut down the application servers.

01:52–02:21 (29 minutes): We had the DBA finish the database changes.

02:21–02:24 (3 minutes): We brought the application servers back up.

02:24–02:42 (18 minutes): We deployed the updated and new applications.

02:42–05:00 (2 hours 18 minutes): We did our Sanity Testing. We found an issue with one piece of functionality, and determined that it might have been due to a back-end system, not ours. We spent some time troubleshooting it, and ended up not being sure which system was the culprit.

03:40–05:00 (1 hour 20 minutes): The client did their Landing Tests. This was overlapped with the troubleshooting we did, for the functionality that wasn’t working.

05:00–05:30 (30 minutes): One of the back-end systems that we depend on went down for scheduled maintenance, and we had to wait for it to come back up before we could go back to our testing.

05:30–06:20 (50 minutes): We finished testing, and then called it a wrap. There was still an outstanding issue (the one we weren’t able to troubleshoot), but it wasn’t a show-stopper.

Overall deployment: 01:30–06:20 (4 hours and 50 minutes).

Sunday, July 20, 2008

Deployment: July 20, 2008

If all went well, this would be the last deployment for a long time. We were pretty sure that we’d fixed all of our issues, and this release was simply to “turn on” all of the functionality for all users—now that it was working, we’d let them all use it.

01:30–01:35 (5 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:35–01:44 (9 minutes): We deployed the application.

01:44–02:19 (35 minutes): We Sanity Tested the application.

02:12–02:31 (19 minutes): The users did their Landing Tests. This was overlapped a bit with the Sanity Testing, because the person doing Sanity had to reboot her computer halfway through the process. However, based on the nature of the release (there were no code changes, just a configuration change), we saw very little danger of Sanity failing, so we felt safe in letting the Landing Test start before Sanity was completely finished.

Everything worked fine.

Overall deployment: 01:30–02:31 (1 hour and 1 minute).

Friday, July 11, 2008

Deployment: July 11, 2008

This was sort of a continuation of the previous deployment. If all went well, we’d be able to kiss our recent troubles goodbye, and get back on to a regular schedule, without so many deployments all the time.

01:30–01:33 (3 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

01:33–01:40 (7 minutes): We made some configuration changes to some of the servers. (There was no “deployment” this time, per se, just configuration changes. But it still caused an outage, and so had to be controlled.)

01:40–01:55 (15 minutes): We performed our Sanity Testing. Everything was fine. Normally Landing Tests would follow, but in this case, since there were no changes to the application itself, it wasn’t deemed necessary.

01:55–01:58 (3 minutes): We verified with the user base that everything was back up and running.

Overall deployment: 01:30–01:58 (28 minutes).

Tuesday, July 8, 2008

Deployment: July 7, 2008

This was partially a bug-fix deployment, but it was also a configuration deployment. We had a number of configuration changes to make to our web servers, in preparation for some future changes to our network/physical architecture.

22:00–22:01 (1 minute): We all logged onto the conference bridge, and confirmed that we were ready to go.

22:01–22:28 (27 minutes): We made the appropriate configuration changes to our web servers.

22:28–22:38 (10 minutes): We lowered the logging level on our application servers. Because of the recent troubleshooting we’d had to do, we’d increased the logging greatly, but we now felt it was time to decrease it, and stop chewing up so much disk space with our logs.

22:38–22:50 (12 minutes): We deployed the application.

22:50–23:11 (21 minutes): We performed our Sanity Testing. Everything was fine.

23:11–23:18 (7 minutes): The users did their Landing Test. Everything was fine. (It’s not usual that Sanity Test takes longer than Landing Test, but I guess the users had their testing down to a science, after so many deployments lately—many of which I missed, since I was on vacation.)

23:18–23:22 (4 minutes): We switched back on everything that had to be switched on, and verified with the user base that everything was completely back up and running.

23:22–23:55 (33 minutes): Just as we had all signed off on the deployment, and were about to hang up, one of the users claimed that there was an issue. We spent this time re-testing the functionality, but there was no issue. (Error between keyboard and chair.)

Overall deployment: 22:00–23:55 (1 hour and 55 minutes).