Posted by: devinmoore | January 24, 2006

Why do systems fail at the end of the day?

I have had many jobs where things always seemed to fail at the end of the day. I wasn’t paid anything extra for staying to fix them, so they became the enemy. I came up with the following strategy to prevent end-of-day failures:
1. If you have procedures that must run nightly, push them as late as possible, so they’re effectively running early in the morning. That way, when one fails, nobody has already put in a full day by the time they see it.
2. Thus, anything else that fails at night will not be critical until the morning, when work must resume.
3. If you come in early, you get the best head start on fixing any issues with those components, and you have a fresh day to fix them, as opposed to having already worked all day and then facing “critical” failures.

Using this method, there is only one type of “critical” failure that can happen at any time, and that’s a production transaction-based system failing. If it’s a production-critical system like that, there must be an established process for getting it running again whose first step isn’t “call (you)”. What if you weren’t available? Is the production system down forever?
If you are in charge of a “critical” system like this, it is your responsibility to set up a recovery procedure that can run without you physically present, to appoint people besides yourself who can take charge of it, and finally to get management to approve it. If they will not, then your place of business isn’t taking production failure and recovery seriously, so why should you? Recovery should effectively be autonomic, meaning that even if a person has to do it, it could really be anyone. Only if the various recovery steps fail should you be getting a call. Thus, the recovery should be as simple and failure-resistant as possible. Granted, this is more expensive (e.g. “power on the backup server” requires purchasing a backup server), but the end result will be a system that doesn’t rely on any particular person in order to recover from the first-line failure.
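
To make the "only call me if recovery fails" idea concrete, here is a minimal Python sketch. It is an illustration under my own assumptions, not a real tool: the names run_recovery, restart_service, power_on_backup_server and page_owner are all hypothetical, and the step functions are stand-ins for whatever your own runbook actually does.

```python
from typing import Callable, List, Tuple

# A recovery step is a (description, action) pair.
# The action returns True if the system is back up afterwards.
RecoveryStep = Tuple[str, Callable[[], bool]]


def run_recovery(
    steps: List[RecoveryStep],
    escalate: Callable[[List[str]], None],
) -> Tuple[bool, List[str]]:
    """Try each step in order and stop at the first one that succeeds.

    The escalate callback (the "call you" step) runs only if every
    step has failed, and it receives the log of what was attempted.
    """
    attempted: List[str] = []
    for description, action in steps:
        try:
            ok = action()
        except Exception as exc:
            # A step that blows up counts as a failed step, not a crash.
            ok = False
            attempted.append(f"{description}: error ({exc})")
        else:
            attempted.append(f"{description}: {'ok' if ok else 'failed'}")
        if ok:
            return True, attempted
    escalate(attempted)
    return False, attempted


if __name__ == "__main__":
    # Hypothetical stand-ins for real recovery actions.
    def restart_service() -> bool:
        return False  # pretend the restart did not help

    def power_on_backup_server() -> bool:
        return True  # pretend the backup server came up

    def page_owner(log: List[str]) -> None:
        print("Paging the system owner; steps tried:", log)

    recovered, log = run_recovery(
        [
            ("restart the service", restart_service),
            ("power on the backup server", power_on_backup_server),
        ],
        page_owner,
    )
    print(recovered, log)
```

The point of the shape is that the person on call sits at the end of the list, not the start, and when the call does come it arrives with a record of everything that was already tried.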

Lesson: If the only recovery process requires you, then YOU are the critical component to the business! Neither you nor your management should be comfortable with that.



  1. Wow, man, this hit me right here (points at chest)

    You are awesome.
