Preparing for a Bad Day – How to write Disaster Recovery documentation
IT disasters are unpleasant, and can take many forms. However overwhelming the idea of a possible disaster may be, it is crucial to have a well-formed plan in place. Many IT professionals don’t have a good grasp of how to write disaster recovery documentation, which can lead to confusion and problems when disaster strikes.
To show what that looks like in practice, I’d like to share a story of how good documentation not only saved the day, but also saved my vacation.
A few years ago, I was riding the airport shuttle on my way to a cruise ship vacation. While en route, I got a call from my workplace. The caller said that they had a power outage at our main datacenter. I was gone and my alternate was stuck in traffic, so what should they do? I said “Can you find the Mac server rack?” Yes, they found it. “Do you see the packet marked Emergency Server Startup and Shutdown Procedures?” Yes, they did. “OK, open that and start reading. It’ll walk you through the process.” I talked with them for a few more minutes to make sure that they were OK, then I said goodbye, ended the call and prepared to board my plane.
That packet was attached to the front of the server rack, and I had made sure the day before that it was updated with the latest information. Without it, I might have been trying to talk someone through the shutdown procedure for about fifteen servers and twelve RAID arrays over the phone, right up until the moment the flight attendant yanked my phone out of my hand because the plane needed to take off.
There are a few lessons to take from this story.
Q: Where was I when the disaster occurred?
A: Off-site and without access to either a computer or a way to connect back to the work network.
Q: Where was the other person who had been trained on our disaster recovery process?
A: Off-site and unavailable.
Q: Who was the person on the phone?
A: Someone who wasn’t trained in our disaster recovery process.
Q: What allowed the person on the phone to successfully bring down my servers?
A: Accurate and easily-understood documentation that was placed for ready access.
Q: What was not affected by this disaster?
A: My vacation.
With these lessons in mind, here is my advice on how to write disaster recovery documentation.
Who is the audience for your disaster recovery documentation? You should be writing it for you or other IT professionals, right?
No. In fact, the person reading your documentation may be the nice lady from Facilities who handles the HVAC for the server room, that sharply-dressed gentleman from HR who stopped by the IT office on an errand, or the boss’s niece who stopped by the office to sell some cookies. They will also be under pressure and doing an unfamiliar task.
You never know who will be standing in front of your server rack when the hammer comes down, so your disaster recovery documentation should be both comprehensive and written so that anybody can understand what needs to be done. The janitorial, accounting or HR staff should be able to follow it and use it to start up or shut down your servers.
You can’t count on the person who needs the documentation being able to access it in electronic format, so having it available in printed form is a good idea. Remember, you have no idea ahead of time what form the disaster could take. Whoever is reading your documentation may be working in a dark room and reading with a flashlight.
Don’t hide this documentation. If you can, post a printed copy to the front of the server rack with a cover page or sign clearly indicating what it is and why it’s there. If that’s not an option, store it somewhere else where the documentation is both easily visible and clearly marked.
Now that we’ve covered who may be reading it and how the actual documentation should be handled, let’s talk about what should be in it.
Order Of Operations
Something that’s crucial to document is your order of operations. Pressing the power button will start up your server from being powered off, but what all needs to happen before you can hit the button?
For those not familiar with the diagram above, it depicts the psychological theory known as Maslow’s hierarchy of needs. It’s often portrayed in the shape of a pyramid with the largest, most fundamental levels of needs at the bottom and the need to be fully alive, or self-actualized, up at the top. You have to meet your most fundamental needs first before you can be fully alive.
Maslow’s hierarchy can also be applied to the order of operations in your disaster recovery documentation. What fundamental needs must be met before you hit the power button on your server? You need to have power. You need to have the room’s cooling system online and working properly. You need to have networking available. You need to have DNS available. A storage appliance your server uses to access crucial data needs to be powered on and functioning properly. And so on. Make sure to include this information in your documentation and specify that all of it needs to be available before your servers should be brought back up.
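One way to keep that ordering consistent is to write the prerequisites down as data and let a small script work out a valid startup sequence. The following is only a sketch: the item names and their dependencies are assumptions made up for illustration, not a description of any particular server room, and it relies on Python 3.9 or later for the standard library's graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical prerequisites: each item can only be brought up after
# everything in its set is online. Replace with your own environment.
PREREQUISITES = {
    "power": set(),
    "cooling": {"power"},
    "networking": {"power"},
    "dns": {"networking"},
    "storage appliance": {"power", "cooling", "networking"},
    "server": {"power", "cooling", "networking", "dns", "storage appliance"},
}


def startup_order(prerequisites):
    """Return a list in which every item appears after its prerequisites."""
    return list(TopologicalSorter(prerequisites).static_order())


def print_checklist(prerequisites):
    """Print a numbered checklist that could go into the printed packet."""
    for step, item in enumerate(startup_order(prerequisites), start=1):
        needs = ", ".join(sorted(prerequisites.get(item, set()))) or "nothing"
        print(f"{step}. Confirm '{item}' is working (requires: {needs})")


if __name__ == "__main__":
    print_checklist(PREREQUISITES)
```

If two items are accidentally listed as depending on each other, TopologicalSorter raises a CycleError, which is a useful signal that the documented order of operations contradicts itself.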
Another thing to document is how to recover from hardware disasters in addition to software ones. Document where your replacement parts are and how to replace them. As part of this, make sure the parts are labeled and readily accessible.
For those servers that may not have a physical keyboard and mouse permanently hooked up, make sure you have a crash cart available with monitor, keyboard, and relevant adapters, then include how to use the cart as part of the documentation.
Show Which Buttons To Hit
Don’t assume that the person in front of the server is going to know which are the correct buttons to hit and in which order. Include this information in your disaster recovery documentation along with graphics indicating which buttons do what and how they should be handled.
Verifying Normal Operation
Wherever possible, provide simple, easily checked ways in your documentation to verify that your devices are working properly. One way to do this is to take a picture of the front of individual servers when they’re operating normally and use that to create a diagram with information like “All the lights in the indicated area should be lit up green. The third light from the left should be blinking, but all others should not blink.”
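The same "what normal looks like" information can also be kept as a small table of expected states, which makes it easy to regenerate the diagram text or to check a report read out over the phone. This is a minimal sketch; the light names and states below are invented for illustration and would need to match your own hardware.

```python
# Hypothetical expected front-panel states for one server, recorded
# while the server was known to be healthy.
EXPECTED = {
    "power": "solid green",
    "drive 1": "solid green",
    "drive 2": "solid green",
    "network": "blinking green",
}


def find_problems(expected, observed):
    """Compare observed indicator states with the documented ones.

    Returns a list of plain-language problem descriptions. An empty
    list means everything reported matches the documented state.
    """
    problems = []
    for light, want in expected.items():
        saw = observed.get(light, "not reported")
        if saw != want:
            problems.append(f"{light}: expected {want}, saw {saw}")
    return problems
```

For example, if the person at the rack reports that the network light is off while everything else looks normal, find_problems(EXPECTED, dict(EXPECTED, network="off")) returns a single entry describing the network light.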
One common event in disasters is loss of data. Oftentimes, the loss is due to file system corruption, hardware failures, or deleted or corrupt files. The good news is that regular backups of your data can turn these problems from catastrophes into inconveniences.
Never forget that the real reason your server needs to come back online is the applications and data it contains. If practical, include the procedures for restoring data from your backup system as part of the disaster recovery documentation.
That said, this is one part of your disaster recovery process where an IT professional may be required to handle the task. Assuming that is the case, make sure to clearly indicate this in the documentation so that non-IT folks know that they should stop at this point and get assistance.
Make sure your disaster recovery documentation is as up to date as possible. But how do you know for certain what’s now obsolete or wrong? The best way to find out is to test your documented disaster recovery procedures, including data restoration. How often? At least annually, though it’s even better to run these tests on a quarterly basis.
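A simple way to keep that schedule honest is to record the date each procedure was last tested and flag anything that has drifted past the quarterly or annual mark. The sketch below assumes a quarter is roughly 91 days; the function name and the status labels are made up for illustration.

```python
from datetime import date, timedelta

QUARTERLY = timedelta(days=91)  # assumed length of a quarter
ANNUAL = timedelta(days=365)    # the minimum testing frequency


def review_status(last_tested, today, preferred=QUARTERLY, maximum=ANNUAL):
    """Classify a procedure by how long ago it was last tested.

    "current" : tested within the preferred (quarterly) interval
    "due"     : past the quarterly mark but still within a year
    "overdue" : not tested for more than a year
    """
    age = today - last_tested
    if age > maximum:
        return "overdue"
    if age > preferred:
        return "due"
    return "current"


if __name__ == "__main__":
    # Hypothetical example: a procedure last tested on 1 January 2024.
    print(review_status(date(2024, 1, 1), date.today()))
```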
Now that we’ve discussed what should be in your documentation and how to find out if it’s good or not, what comes next? Improvements.
Audit Your Documentation
Use the results of your testing to improve your disaster recovery documentation. Once you’ve identified what’s obsolete, either toss it or move it to a legacy archive system. Make sure it does not remain in your disaster recovery documentation, as obsolete information may make recovery efforts harder.
As part of testing your disaster recovery documentation, identify which human-driven processes can be handled by automation and invest the time and effort to automate them. For example, if part of your current process includes “open a Terminal window and run this command,” figure out how that process can be automated. Always keep in mind that a human being other than you may be the one in the hot seat for bringing your systems down safely.
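As a rough illustration of turning “open a Terminal window and run this command” into something a non-specialist can start with one action, the sketch below runs a fixed list of steps in order and stops at the first failure. The step descriptions and the echo commands are placeholders I invented, not real shutdown commands, and the dry-run default is a design assumption so that the script can be rehearsed safely during testing.

```python
import shlex
import subprocess

# Hypothetical shutdown steps, in the order they must run. The echo
# commands are placeholders standing in for your real commands.
SHUTDOWN_STEPS = [
    ("Stop the file-sharing service", "echo stopping file sharing"),
    ("Unmount the RAID volume", "echo unmounting RAID volume"),
    ("Power off the server", "echo powering off"),
]


def run_steps(steps, dry_run=True):
    """Run each step in order, stopping at the first failure.

    With dry_run=True nothing is executed; the function only prints
    what it would do. Returns the list of commands it handled.
    """
    handled = []
    for description, command in steps:
        print(f"{description}: {command}")
        if not dry_run:
            # check=True raises CalledProcessError if a command fails,
            # so later steps never run after a failed earlier step.
            subprocess.run(shlex.split(command), check=True)
        handled.append(command)
    return handled


if __name__ == "__main__":
    run_steps(SHUTDOWN_STEPS)  # rehearsal only; pass dry_run=False to execute
```

Printing each description before running its command means the person at the keyboard can see how far the process got if something goes wrong, which is also what they will need to tell you over the phone.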
Identifying And Fixing Documentation Gaps
Another thing you may find as part of testing is that something was missed in the disaster recovery documentation. A common cause is that a new system was added since the last round of testing and it wasn’t added to the disaster recovery docs. Wherever gaps are found, address them as part of updating the disaster recovery documentation.
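If you keep a list of the systems you run and a list of the systems your documentation covers, finding gaps becomes a set comparison. This is a sketch under that assumption; the system names in the example are hypothetical, and where the two lists come from (an inventory tool, a spreadsheet, the table of contents of the packet) is up to you.

```python
def documentation_gaps(inventory, documented):
    """Compare the systems you run against the systems you documented.

    Returns a dict with two sorted lists:
      missing_from_docs : systems in the inventory with no documentation
      obsolete_in_docs  : documented systems no longer in the inventory
    """
    inventory, documented = set(inventory), set(documented)
    return {
        "missing_from_docs": sorted(inventory - documented),
        "obsolete_in_docs": sorted(documented - inventory),
    }


if __name__ == "__main__":
    # Hypothetical system names for illustration only.
    current_inventory = ["file server", "mail server", "new build server"]
    covered_by_docs = ["file server", "mail server", "old print server"]
    print(documentation_gaps(current_inventory, covered_by_docs))
```

The second list is a convenient starting point for the audit step, since it names the entries that should be removed from the packet or moved to an archive.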
Outsourcing Services To Reduce Disaster Impact
One other thing to keep in mind as you go through the process of documenting your disaster recovery procedures is whether particular services can be moved to other systems in your organization or to an outside cloud service. If you’ve transitioned a service to be handled by other systems, you’ve also outsourced the disaster recovery (and associated disaster recovery documentation) for that service to that other system. That’s less work for you and potentially less risk of having a disaster, assuming that the other system is built on a high availability model.
That said, there are crucial questions to ask before moving a particular service:
- Who is managing my data?
- How is my data backed up?
- What happens if you lose my data?
If you don’t get back answers that satisfy your needs, don’t move the service.
Writing good disaster recovery documentation with the model I’ve laid out above is challenging. It requires in-depth knowledge of how your systems work, but also requires constructing step-by-step directions that nearly anyone in your work environment could complete successfully. IT disaster planning is a never-completed task, as your IT environment is always changing and the documentation will need to reflect those changes.
That said, the reward for all the hard work may be like the one in the story I began with: being able to get on a plane and take off for that non-ruined vacation. When disaster strikes, your physical presence is unnecessary because someone else’s hands and eyes can follow the documentation and handle the job.