Downtime is a serious issue in Pubnixes

I have been maintaining some public-facing services on Exozyme, a pubnix (a shared Linux server), for a little less than a year.

I ran into outages where one of the services or containers would go down, and every time I had to fix and restart it manually.

This blog post documents the strategies I applied to solve this problem and get consistent uptime from these services.

I have deployed many services, and checking manually whether each one is up is a tedious task. Thanks to iacore and Anthony Wang, who created https://status.exozy.me, which monitors the status of every service deployed on the server.

Services I deployed

I have deployed:

cyberchef and Mysite are deployed statically, so no daemon has to keep running in the background to keep them up. The rest are podman containers. When a system-wide outage occurs, they sometimes go down, and this is where the problem arises: I may not be available in time to fix the issue and bring them back up.

Automation to fix service downtime

That's why I created this script, which runs an hourly checkup on each service and restarts it if it is found to be down.

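#!/bin/sh
# revive_pods.sh: check every container once an hour, forever.

while true; do
    # healthcheck.sh restarts the named container only if it has exited
    /home/nvpie/.local/bin/healthcheck.sh yt-local
    /home/nvpie/.local/bin/healthcheck.sh umami
    /home/nvpie/.local/bin/healthcheck.sh spdf
    # calibre-web is restarted unconditionally on every pass
    systemctl --user restart calibre-web.service
    echo "health check finished for all containers"
    sleep 3600  # one hour; POSIX sleep only accepts seconds
done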
 

This is the revive_pods.sh script, a POSIX shell script.

It is deployed as a systemd user service, so it runs constantly in the background.
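The unit file itself is not shown in this post, so here is a minimal sketch of what such a user service could look like. The unit name revive-pods.service, the script location, and the Restart policy are all assumptions for illustration, not the actual setup.

```shell
#!/bin/sh
# Print a minimal systemd user unit that runs the given script.
# The unit contents are an illustrative assumption, not the author's real unit.
print_unit() {
    script_path=$1
    cat <<EOF
[Unit]
Description=Hourly health check for podman containers

[Service]
ExecStart=$script_path
Restart=on-failure

[Install]
WantedBy=default.target
EOF
}

# Typical installation steps (hypothetical unit name, not run here):
#   mkdir -p ~/.config/systemd/user
#   print_unit "$HOME/.local/bin/revive_pods.sh" > ~/.config/systemd/user/revive-pods.service
#   systemctl --user daemon-reload
#   systemctl --user enable --now revive-pods.service
```

On a shared server, user services usually stop when you log out unless lingering is enabled for your account (loginctl enable-linger), which may need the help of an admin.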

It uses the following healthcheck.sh script to perform a health check on each container or service.

#!/bin/sh
# healthcheck.sh: start each named container (and its user service) if it has exited.
for container in "$@"; do
    # An empty status means podman could not inspect the container
    status=$(podman inspect -f '{{.State.Status}}' "$container")
    if [ "$status" = "exited" ]; then
        podman start "$container" && echo "Container $container started" || echo "failed to start container $container"
        systemctl --user restart "$container.service" && echo "service restarted" || echo "failed to restart service"
    else
        echo "Container $container is $status"
    fi
done
 

Since the services are exposed publicly through Unix sockets, socat needs to run with the unlink-early option, which removes any leftover socket file before binding. Otherwise a stale socket left behind by a crashed service can block the restart. So it is ideal to run your service like this:

socat UNIX-LISTEN:/srv/http/yt-local,mode=660,fork,unlink-early TCP:localhost:5052
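A container can report as running while its socket is missing or dead, so it can help to probe the socket directly. The following is only a sketch under assumptions: the helper name socket_alive is made up for this example, and it assumes the service behind the socket speaks HTTP and that curl is installed.

```shell
#!/bin/sh
# socket_alive PATH: succeed if PATH is a Unix socket that answers an HTTP request.
# Hypothetical helper, not part of the scripts described in this post.
socket_alive() {
    sock=$1
    # Fail early if the path is missing or is not a socket
    [ -S "$sock" ] || return 1
    # Any HTTP response counts as alive, including error pages;
    # only connection failures and timeouts make curl exit non-zero here
    curl --silent --output /dev/null --max-time 5 \
        --unix-socket "$sock" http://localhost/
}

# Example use:
#   socket_alive /srv/http/yt-local || echo "yt-local socket is not responding"
```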

healthcheck.sh takes a container or service name as its argument, so the Unix socket, the container and its user service should all share the same name. It uses podman inspect to check the state of the container, then starts the container and its respective user service if the container has exited, or just prints the current status if it is still up.
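The script only reacts to the exited status. If you also want to revive containers that are in other non-running states, the decision can be pulled into a small function. This is a sketch: the function name needs_restart is hypothetical, and treating stopped and created as restartable is my assumption about which podman states are worth acting on.

```shell
#!/bin/sh
# needs_restart STATUS: succeed if a container in this state should be started.
# The list of states is an assumption; adjust it to what podman reports on your system.
needs_restart() {
    case "$1" in
        exited|stopped|created) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use inside the loop of healthcheck.sh:
#   status=$(podman inspect -f '{{.State.Status}}' "$container")
#   if needs_restart "$status"; then
#       podman start "$container"
#   fi
```

Keeping the list of states in one place makes it easy to tweak without touching the restart logic.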

I have been running this setup for weeks now, and even though we have had multiple outages, I am glad to say I haven't seen any of my services down for a while.