This term describes the general operational practice of restarting backend services to overcome system failures. While popular with Java deployments, it can be observed with other systems as well.
I took the liberty of collecting a few gems from the Internet:
- Restart One MongoDB Deployment (or also the manager app)
- DataStax OpsCenter
- IBM Tivoli Storage Manager
With all the above documentation, this is obviously a widespread industry practice that everyone should absolutely adopt right away! RestartOps allows you to save the precious time you would otherwise spend debugging applications and understanding the actual problem you are seeing.
Who needs root cause analysis when the problem simply goes away after a juicy shutdown -r now is issued? Why bother looking deeper if you can set up that restart command in a crontab to restart your entire production stack weekly, no, daily? Hmm, why not do it hourly!
This is obviously a joke!
Please never do RestartOps if you are serious about your production environment. The only time restarting should solve your issues is when you have your parents on the phone and can’t take a closer look at the underlying problem. (They probably switched the keyboard to Japanese again.) I’ve seen all of the above practices in place at several companies. It almost never paid off in the long run, as the problems kept coming back.
As in other professions, I think we have an obligation as engineers to say no in such cases.
I refuse to stay ignorant of the issue; I want to understand every little detail of why it was encountered. I do admit that sometimes a few parts are beyond what we can comprehend at the time. Isn’t that, however, the best opportunity to learn something new?
This is also why open source is great: if you want to dig deep, you absolutely can. I have colleagues who spent a week looking through every commit of the preceding weeks to find the single change that could have caused the side effects.
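Reading every commit works, but when the history is ordered and you have a reliable way to tell a good build from a bad one, the search can be cut in half at each step. Here is a minimal sketch of that idea, not a description of what my colleagues actually did. The function name `first_bad_commit` and the `is_bad` check are hypothetical; in practice `is_bad` would check out the commit and run whatever reproduces the problem. It assumes that once the problem appears, every later commit shows it too. This is the same idea that `git bisect` automates.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is true.

    Assumptions (hypothetical setup): commits are ordered oldest to newest,
    the newest commit is bad, and every commit after the first bad one is
    bad as well.
    """
    if not commits:
        raise ValueError("need at least one commit")
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid       # the first bad commit is at mid or earlier
        else:
            lo = mid + 1   # the first bad commit is after mid
    return commits[lo]
```

With a few hundred commits this needs fewer than ten checks instead of a week of reading, provided the problem can be reproduced on demand. When it cannot, reading the commits one by one may still be the only option.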
Being an SRE or a software engineer / developer should mean having an active interest in understanding the systems we develop, operate and use.