Downtime is a serious issue in Pubnixes
I have been maintaining some public-facing services on Exozyme, a pubnix (shared Linux server), for a little under a year.
I kept running into outages where one of the services or containers went down, and every time I had to fix and restart it manually.
This blog post documents the strategies I applied to solve this problem and get consistent uptime from these services.
I have deployed many services, and checking by hand whether each one is up is a tedious task. Thankfully, iacore and Anthony Wang created https://status.exozy.me, which monitors the status of every service deployed on the server.
Services I deployed
I have deployed:
cyberchef and Mysite are deployed statically, so no daemon has to keep running in the background to keep them up. The rest are podman containers, and when a system-wide outage occurs they sometimes go down. This is where the problem arises, because I may not be available in time to fix the issue and bring them back up.
Automation to fix service downtime
That's why I created a script that runs an hourly checkup on each service and restarts any that are found to be down.
This is revive_pods.sh, a POSIX shell script.
It is deployed as a systemd user service, so it runs constantly in the background.
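As a rough illustration of the loop described above, here is a minimal sketch of what a script like revive_pods.sh could look like. It is not the original script: the service names, the `HEALTHCHECK` path, and the function names `check_all` and `run_forever` are placeholders assumed for this example, and the hourly interval is implemented with a plain `sleep`.

```shell
#!/bin/sh
# Hypothetical sketch of revive_pods.sh; every name below is an assumption.

# Space-separated list of services to watch (placeholder names).
SERVICES="${SERVICES:-service-one service-two}"
# Location of the health-check script (assumed path).
HEALTHCHECK="${HEALTHCHECK:-$HOME/bin/healthcheck.sh}"
# Seconds between rounds; 3600 gives the hourly checkup.
INTERVAL="${INTERVAL:-3600}"

# Run the health check once for every listed service.
check_all() {
    for name in $SERVICES; do
        "$HEALTHCHECK" "$name" || echo "healthcheck failed for $name" >&2
    done
}

# Check everything, sleep, repeat; meant to be kept alive by systemd.
run_forever() {
    while :; do
        check_all
        sleep "$INTERVAL"
    done
}

# Start the loop only when this file is executed as revive_pods.sh.
case "${0##*/}" in
    revive_pods.sh) run_forever ;;
esac
```

Because the script never exits on its own, a systemd user service only needs to start it once and restart it if it dies.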
It uses the following healthcheck.sh script to perform a health check on each container or service.
Since the services are exposed on a public port through a Unix socket, they need to run with the unlink-early option, so that a stale socket left behind when a service goes down is removed before the socket is created again. So it is ideal to run your service like this:
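A minimal sketch of such a command, assuming the socket is served by socat (whose `UNIX-LISTEN` address accepts the `unlink-early` option). The socket directory, the `.sock` suffix, the port, and the helper names `socket_address` and `expose_service` are placeholders for this example, not values from the actual setup.

```shell
#!/bin/sh
# Sketch only: assumes socat does the forwarding; paths and ports are placeholders.

# Directory holding the service sockets (assumed location).
SOCKET_DIR="${SOCKET_DIR:-$HOME/sockets}"

# Build the socat listening address for a service's Unix socket.
# unlink-early removes a leftover socket file before socat binds to it.
socket_address() {
    printf 'UNIX-LISTEN:%s/%s.sock,fork,unlink-early' "$SOCKET_DIR" "$1"
}

# Forward connections arriving on the socket to a local TCP port.
# $1 = service name (also used as the socket name), $2 = local port.
expose_service() {
    socat "$(socket_address "$1")" "TCP4:127.0.0.1:$2"
}
```

For example, `expose_service myservice 8080` would listen on `$SOCKET_DIR/myservice.sock` and forward each connection to port 8080 on localhost.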
healthcheck.sh takes the container or service name as its first argument, so the Unix socket name and the container name should be the same. It uses podman inspect to check the state of the service, then starts the container and its respective user service if it is down, or just prints the running status if the service is still up.
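The logic described above can be sketched roughly as follows. This is an illustration rather than the original healthcheck.sh: the function names, the output messages, and the assumption that the systemd user unit is called `<name>.service` are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of healthcheck.sh; unit naming and messages are assumptions.

# Print the container's state (for example "running" or "exited").
# Prints nothing if the container does not exist.
container_state() {
    podman inspect --format '{{.State.Status}}' "$1" 2>/dev/null
}

# Report a running container, or start it together with its user service.
healthcheck() {
    name=$1
    state=$(container_state "$name")
    if [ "$state" = "running" ]; then
        echo "$name: running"
    else
        echo "$name: ${state:-not found}, restarting"
        podman start "$name"
        # Assumes the user unit shares the container's name.
        systemctl --user restart "$name.service"
    fi
}

# Run the check only when this file is executed as healthcheck.sh.
case "${0##*/}" in
    healthcheck.sh) healthcheck "$1" ;;
esac
```

Calling it as `healthcheck.sh myservice` checks a single container, which is what lets a wrapper script loop over many names.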
I have been running this setup for several weeks now, and even though we have had multiple outages, I am glad to say I have not seen any of the services reported as down for a while.