Sunday, March 2, 2008

Deployment: March 2, 2008

This was a retry of the previous deployment, which was backed out.

For clarity, I’ve decided to start using 24-hour notation for the times, so that there’s no confusion between, e.g., 12:00 midnight and 12:00 noon. (I’m still rounding times to the nearest 5 minutes, though, to keep things simple.)

00:00–00:10 (10 minutes): We all logged onto the conference bridge, and confirmed that we were ready to go.

00:10–00:30 (20 minutes): We made some configuration changes to the Application 1 servers, and shut down the application servers for both Application 1 and Application 2.

00:30–01:45 (1 hour and 15 minutes): We backed up the databases, just in case. This took longer than usual, because the DBA had to back up one of the databases to his own machine, instead of doing it on the server. (It’s normally done directly on the server, which avoids the backup having to travel across the network, but he didn’t plan ahead to get the Unix passwords for the boxen in question.)

01:25–03:20 (1 hour and 55 minutes): We made the modifications to the database for Application 1. (This overlapped with the task above, as the database for Application 2 continued to be backed up.) Again, this took much longer than it should have (it took just under an hour and a half last time), because the DBA had to run the scripts over the network, instead of from the server.

03:20–03:25 (5 minutes): There were issues with the database scripts for Application 1. It turned out that the DBA had modified the scripts before running them, without telling anyone, and the modified scripts ran into issues. Because we were not confident that the database had been left in a valid state, it was decided to roll back, and restore the database from the backup.

03:20–04:30 (1 hour and 10 minutes): During this time, the team had a roundtable conversation to discuss whether we could still put in Application 2, even though Application 1 was rolled back. The problem is that the new version of Application 2 had never been tested against the old version of Application 1. It was decided that the risk was low, since the nature of the changes being made indicated that the different versions of the applications should work together. But risk is risk, and because the combination hadn’t been tested, errors could have gone undiscovered. It was finally decided to leave both applications as-is (that is, not try to deploy Application 2 without Application 1) and revisit the deployment in a day or two.

04:30–05:10 (40 minutes): We brought the applications back up. Luckily, based on the nature of the database changes for Application 1, it was decided that we didn’t need to restore the database; all we had to do was undo some configuration changes, on the application servers, and then bring those servers back online.

04:40–04:50 (10 minutes): The business came onto the bridge, and said that they’d prefer to try the deployment again, rather than roll back. Just to be safe, it was decided to continue with the rollback, and do a quick sanity test, so that there would be some version of the application online, while the decision was made. We also needed to chase the DBA team, and find a new DBA to perform the work.

05:10–05:20 (10 minutes): We all did a sanity/regression test, to verify that the applications were back up and running. They were fine.

05:20–05:30 (10 minutes): Dead air, as we waited for the DBA team to get back to us.

05:30–06:30 (1 hour): Another DBA joined the conference call, and was walked through the existing dilemma. He asked for some time to look over the situation, including what was changed, and why. Presumably, the results of his investigation would give the rest of us the warm and fuzzy that we could proceed to try again.

The result of this discussion was that the DBA(s) would try to recreate the errors we got, to prove that it was the modified scripts that caused the problems. We would reconvene at noon, to go over the findings, and talk about rescheduling the deployment.
