True IT confessions

Supergeeks fess up to some of the dumbest things they've ever done -- and the lessons they learned as a result

It's one of the unwritten laws of physics: At some time or another, everybody screws up.

But when IT pros make mistakes, they don't mess around. Entire buildings go dark. Web sites disappear. Companies grind to a halt. Because if you're going to mess up, you might as well make it count.

"I always tell my guys, hey, you're gonna do stupid stuff," says Rich Casselberry, director of IT operations at Enterasys, a networking systems vendor. "It's OK to do something stupid if you have the wrong information. But if you do something stupid because you're stupid, that's a problem. The trick is to not flip out, which only makes it worse, or try to hide it. You need to figure out how to keep it from happening again."


We've gathered up some of the more egregious examples from IT pros brave enough to share their screwups with us: backups gone bad, people with admin privileges who probably shouldn't have them, and what can go south when you unplug the wrong equipment. In some cases, we've obscured their identities to spare them embarrassment; other geeks, however, are perfectly willing to own up to their youthful mistakes.

Sure, some of these mishaps are amusing in retrospect. But don't laugh too hard. We know you've probably done worse.

True IT confession No. 1: The case of the mysterious invisible backup

Our first tale of misadventure involves a longtime IT pro who doesn't want his real name used, so we'll just call him Hard Luck Harry.

Harry had his share of mishaps when he started out a decade ago at a major networking equipment maker in the Northeast. There was the time he changed an environment variable that broke everything in his company's financial apps, earning an e-mail from his boss ordering him to "never hack on this system again." Or the time he crashed the company's core ERP system by overwriting /dev/tty. Harry says after he accidentally ripped the company's T1 lines out of the wall with his pager, he was banned from ever reentering the telecom closet.

But the worst one happened after Harry installed an Emerald tape backup system. Did he bother to read the manual? Please. This was child's play. Just load install.exe and let the software do its thing.

It seemed to work perfectly. Four hours later, the first backup completed and everything looked fine.

Fast-forward six months. Harry gets a call late one night at home from one of his work pals. That night's backup tape is completely blank, the friend tells him. Worse, the last four weeks of backups are also blank.

As Harry soon discovered, that particular backup program installed in demo mode by default. Demo mode looked exactly like real mode and even took the same amount of time as an actual backup, but nothing ever got written to tape -- a fact that was noted in the manual, which Harry might have seen had he read it.

Fortunately, the company used ADP for payroll processing. ADP shipped back historical payroll records, so the firm lost only a week's worth of data. The bad news? Harry was up until 3 a.m. manually stuffing payroll envelopes, along with his boss, the VP of finance, the entire payroll department, and the company's brand-new CIO, whom he met for the first time that night.

"I got to say, I was pretty popular," he jokes. "I think the only reason they didn't fire me was by that point they had gotten so used to me screwing up, they realized I couldn't do anything right."

Lessons learned? 1. Test the restores, not the backups, says Harry. "No one cares if the backup works; they care if the restore does." 2. Think before you type. 3. Remove your pager (or BlackBerry) before entering the telecom closet, just to be safe.
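Harry's first lesson can be sketched in a few lines of Python. This is only an illustration, not anything Harry's shop ran: it assumes the "backup" is a tar archive on disk rather than a tape, and the function names (checksums, backup, verify_restore) are made up for the example. The idea is to restore into a scratch directory and compare file checksums against the source, so a blank or partial archive fails the check even when the backup job itself reported success.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def backup(source: Path, archive: Path) -> None:
    """Write the contents of source into a gzip-compressed tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")


def verify_restore(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it with source.

    An archive that restores to nothing (the "demo mode" case) always fails.
    """
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        restored = checksums(Path(scratch))
    return bool(restored) and restored == checksums(source)
```

Running verify_restore after the very first backup, rather than six months later, is the whole point: the check exercises the restore path, which is the part people actually depend on.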

True IT confession No. 2: Sometimes it takes a janitor to clean up an IT mess

Late one night in 1997, Josh Stephens was working all alone at his console at a large Midwestern telecom company. Stephens was making changes to the Cisco Catalyst switches at the telco's main customer call center, which was located several states away. That's when the spanning tree protocols hit the fan.

"I'm still not sure exactly how I did it, but I caused some sort of broadcast storm and STP freak-out that locked up not only the switch I was working on but every single switch in that facility," he says. That broadcast storm brought down hundreds of call center users, stranding many of them in the middle of customer calls.


Worse, the switches were "locked hard," requiring a physical power-off and a slow methodical plan to bring them back online, one at a time. The datacenter was hundreds of miles away and had no on-site IT staff, so Stephens did the next best thing: He called maintenance.

"I ended up finding a janitor that had keys to all of my LAN closets and I talked him through (a) which devices were the Catalyst switches, and (b) how to power them off," he says. "I also promised him he wouldn't get fired for helping me."

Though the call center was down for more than an hour, nobody ever found out why or who was behind the glitch, says Stephens, who is now VP of technology and Head Geek (yes, that's the actual title) for SolarWinds, a maker of network management software.

Lessons learned? 1. Don't make changes without scheduling a window for them, even if the changes seem minor, says Stephens. 2. Never conduct a change control event without IT resources near the gear you're changing. 3. Be nice to the janitors. One day they might save your assets.

True IT confession No. 3: Put your hands up and step away from the terminal

One of the unavoidable facts of tech life is that when managers are given administrative rights to complex systems, bad things tend to happen.

Back in the late '80s, Johanna Rothman was director of development for a small, distributed process systems maker in the Boston area. Company management insisted on mandatory overtime for everyone, Rothman included. After three months of this, Rothman and her team were cranky and exhausted -- a recipe for disaster.


"One night at 9 p.m., I realize we have a bunch of files to be deleted," she says. "I'm on a Unix system, and the system won't let me delete them -- I'm not root. Well, I'm the Director. I have the root password. I log in as root. I start rm -r -- the recursive delete -- from the directory I know is the right directory. I know this."

After a few minutes, the rm command stops working. Rothman, still busy deleting all the applications, kills the job, calls the IT manager, and explains what she's done.

"He says, 'Move away from the keyboard. I'm coming in to start the restore.' I say, 'I can help. Where are the tapes?' He says, 'Go away. Just leave. I don't need more of your help.'"

The restore takes two days. Rothman says she slept in late on both days and told everyone else on her team to do the same. She also left voicemail apologies to all the developers.

"I think the only reason I didn't get fired is because management was too busy with the crisis to realize what a mess I'd made," says Rothman, who now runs her own IT consulting group and keeps a safe distance from Unix root directories.

Lessons learned? 1. There is no reason for anyone higher than the level of manager to have the root password, says Rothman. 2. Too much overtime makes people tired and stupid. The more tired they are, the stupider they get.
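Beyond keeping the root password away from directors, a recursive delete can be made to prove it is in "the right directory" before it runs. The Python sketch below is a hypothetical guard, not something from Rothman's shop: safe_rmtree and its allowed_root parameter are invented for this example. It resolves the target path, refuses anything outside an agreed scratch area, and defaults to a dry run that only lists what would be removed.

```python
import shutil
from pathlib import Path


def safe_rmtree(target: str, allowed_root: str, dry_run: bool = True) -> list:
    """Recursively delete target only if it lies strictly inside allowed_root.

    Returns the paths that were (or, in a dry run, would be) removed.
    """
    root = Path(allowed_root).resolve()
    path = Path(target).resolve()
    # Refuse the root itself and anything outside it, such as "/" or "..".
    if path == root or root not in path.parents:
        raise ValueError(f"refusing to delete {path}: not inside {root}")
    doomed = [path] + sorted(path.rglob("*"))
    if not dry_run:
        shutil.rmtree(path)
    return doomed
```

Making the dry run the default means a tired operator at 9 p.m. has to ask for the destructive behavior explicitly, after seeing the list of what is about to disappear.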

True IT confession No. 4: What can Brown do for you?

Here's one of those rare backup mishaps in which data did in fact get backed up. But what it got backed up to is where things go sour.

Twenty-seven years ago, David Guggenheim had just gotten his first "real job" as biological data manager at an environmental consulting firm in Southern California. At that time, the firm's hardware consisted of a PDP-11 and a time-share IBM 360 mainframe in Los Angeles, accessed via dial-up.

"It was time to archive an important project from the IBM mainframe, so I cracked my knuckles and began pounding out the JCL [Job Control Language] necessary to write our data to tapes that would then be shipped to our office," he says. "I submitted the job, satisfied that our data would be safely backed up."


A few days later a UPS driver poked his head in the door at the firm's office and shouted, "Is there a David Guggenheim here?"

The UPS truck was filled floor to ceiling with boxes, all of them addressed to Guggenheim. He opened the first one. It was full of punch cards. And so were all the rest of them.

"It was our data from the IBM mainframe," he says. "To my horror, I realized that instead of specifying output to magnetic tape, I specified output to punch cards. I can't remember my JCL very well any more, but as I recall, it was the difference between specifying '=0' versus '=1.' I was absolutely humiliated."

It gets worse. A few days after the entire staff got involved clearing enough floor space for the mountain of boxes, the bill arrived. The cost of a punch-card backup job was nearly $1,000 (and remember, we're talking about 1982 dollars here).

"I had blown our budget out of the water, killed a forest, and still failed to back up our data onto tape," says Guggenheim, who is now David Guggenheim, Ph.D., president of 1planet 1ocean, and a senior fellow at The Ocean Foundation. "I've spent my career since then doing environmental work, so hopefully I paid penance for the dead trees."

Lessons learned? 1. Little mistakes can cause huge problems, so keep checking until it hurts. 2. Immediately own up to your errors; humility is a great teacher. 3. Take the time to appreciate the humor of a colossal screw-up, says Guggenheim. "It does wonders for the sting."

True IT confession No. 5: Unplug at your own risk

Back in the mid-'90s, Jan Aleman was interim IT manager for a major telecom company in the Netherlands. He was called in to replace a CTO who'd left under less-than-voluntary circumstances. Before the ex-CTO got canned, though, he'd ordered a $300,000 IBM failover system for the company's mission-critical billing engine.

"A very good IBM salesman had sold them this overpriced hardware, assuring them that if the primary system failed it would roll over seamlessly to the secondary one," says Aleman. "He said it was completely redundant, that nothing could go wrong. I said, 'All right, let's see if it actually works.'"


So Aleman yanked the power plug for the primary system out of the wall, right in front of the IBM salesman. All the company's core systems went dark. The critical billing engine was down for the rest of the afternoon. The phone switches still worked, but nobody in the back office could get anything done.

Though the failover system was installed and running, nobody had bothered to test it. So the next thing Aleman did was institute biweekly tests of the system on weekends.

"I unplugged the company," says Aleman, who is now CEO of Servoy, a developer of hybrid (SaaS and on-premises) software. "Needless to say, they were not very happy, but nothing bad ever happened to me. I'm still not sure how I managed to pull that off."

Lessons learned? 1. Always test systems before you bet the company on them (repeat as needed). 2. Think twice before you yank that power cord.

True IT confession No. 6: Never let another be the master of your domains

Back around 2003 or so, "Fred" (not his real name) was the IT manager for a regional cable company in the Midwest. At the time, the company had about 35,000 subscribers. To boost its business services, it decided to become a domain name reseller for Network Solutions.

As part of the transition to domain name sales, the company redirected all domain renewal notifications to a person in its business support unit. "We assumed only our customers' domain notifications would go there, and not the company's own domains," Fred says. But as the saying goes, assuming makes asses out of everyone.


Sure enough, one night around 10 p.m. everything at the ISP stopped working: DNS, e-mail, the company's own Web sites, and the sites it hosted for its business customers -- all simply went poof.

The problem? The ISP had neglected to renew its own domains. The person in business support assumed Fred was also getting notified about the renewals (he wasn't) and Fred assumed that since he wasn't being notified, everything was hunky dory (it wasn't).

"By the time we diagnosed the problem -- because you rarely think to check whether your own domain has expired -- we had fallen out of the root servers and it took a full 24 hours before everything was restored," says Fred. He adds a variant on the old MasterCard commercials:

Lessons learned? 1. Always have multiple people receiving important alerts. 2. Register your domains for 10 years and it will most likely be the next guy's problem.
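Fred's first lesson is easy to automate. The Python sketch below is a rough illustration under stated assumptions: the expiry dates are supplied by hand in a dictionary (no registrar or WHOIS lookup is performed), and the function names, the 60-day warning window, and the example.com addresses are all hypothetical. It flags domains nearing expiry and refuses to build alerts for fewer than two recipients, so no single inbox can quietly swallow a renewal notice.

```python
from datetime import date, timedelta


def expiring_domains(expiry_dates: dict, today: date, warn_days: int = 60) -> list:
    """Return (domain, days_left) pairs due within warn_days, soonest first."""
    cutoff = today + timedelta(days=warn_days)
    due = [(name, (expires - today).days)
           for name, expires in expiry_dates.items()
           if expires <= cutoff]
    return sorted(due, key=lambda item: item[1])


def describe(name: str, days: int) -> str:
    """Format a human-readable alert line for one domain."""
    if days < 0:
        return f"{name} expired {-days} day(s) ago"
    return f"{name} expires in {days} day(s)"


def build_alerts(due: list, recipients: list) -> list:
    """Pair every due domain with every recipient; require at least two people."""
    if len(recipients) < 2:
        raise ValueError("alerts need at least two recipients")
    return [(person, describe(name, days))
            for name, days in due
            for person in recipients]
```

A check like this would typically run on a daily schedule, with the company's own domains listed first in the input rather than assumed to be someone else's job.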
