This research group uses a cluster of servers for high-performance computing -- but occasionally even that isn't enough, reports a pilot fish who keeps an eye on it all.
"Sometimes the compute jobs oversubscribe the resources, and the monitoring system lights up like a Christmas tree," fish says. "We know for a fact that the alerts are short-lived. As soon as the heavy job finishes -- usually it's in R or Java -- the status returns to normal.
"But we can't stop the compute job just to bring status back to normal, and sometimes the jobs run for days or weeks.
"On several occasions, one particular network admin opened a ticket to tell the on-call that the load was high and we should investigate it. Every time, I replied explaining the reason and saying the only thing to do is wait for the jobs to complete, so please don't open this type of ticket again.
"Then, in the middle of the night on the weekend, I got a call from that net admin: 'Hey, the HPC cluster is alerting again. I tried calling the primary on-call, but there's no answer, so I called you instead...'"
You can send Sharky your story anytime. In fact, now would be a very good time to send me your true tale of IT life at email@example.com. You'll get a stylish Shark shirt if I use it.