Building a Self-Healing Systemd Service on AlmaLinux 8 (Without External Monitoring)

Overview

In many production environments, service availability depends on external monitoring systems such as Prometheus, Zabbix, or cloud-provider health checks. While effective, these systems introduce additional complexity and dependencies.

A lesser-known but highly effective approach is to leverage systemd’s native watchdog and restart capabilities to create self-healing services that automatically detect failures and recover without external tools.

This document explains how to design and deploy a self-healing systemd service on AlmaLinux 8 using built-in systemd features only.


Use Cases


Prerequisites


Step 1: Create a Sample Service Script

Create a service script that simulates a long-running process.

nano /usr/local/bin/example-service.sh

Add the following content:

#!/bin/bash

while true; do
    echo "$(date) - Service heartbeat"
    sleep 5
done

Make it executable:

chmod +x /usr/local/bin/example-service.sh

Step 2: Create systemd Service Unit

nano /etc/systemd/system/example-selfhealing.service

Add:

[Unit]
Description=Example Self-Healing Service
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/example-service.sh
Restart=always
RestartSec=3
WatchdogSec=10
NotifyAccess=main

[Install]
WantedBy=multi-user.target

Step 3: Enable systemd Watchdog

Modify the script to notify systemd.

nano /usr/local/bin/example-service.sh

Replace content with:

#!/bin/bash

while true; do
    systemd-notify WATCHDOG=1
    sleep 5
done

Step 4: Reload systemd and Start Service

systemctl daemon-reexec
systemctl daemon-reload
systemctl enable example-selfhealing.service
systemctl start example-selfhealing.service

Step 5: Verify Service Health

Check service status:

systemctl status example-selfhealing.service

Verify watchdog activity:

journalctl -u example-selfhealing.service

Step 6: Simulate Failure

Forcefully terminate the service:

pkill -9 -f example-service.sh

systemd automatically restarts it within seconds.


Step 7: Verify Automatic Recovery

systemctl status example-selfhealing.service

Restart count increases, confirming recovery.


Advanced Hardening Options

Limit Restart Storms

StartLimitIntervalSec=60
StartLimitBurst=3

Resource Control

MemoryMax=256M
CPUQuota=50%

Failure Actions

OnFailure=emergency.target

Logging and Auditing

journalctl -u example-selfhealing.service --since "10 minutes ago"

Why This Approach Is Valuable

This technique is widely used internally by enterprise Linux teams but is rarely documented publicly.


Conclusion

systemd provides powerful native mechanisms for building self-healing services without relying on third-party monitoring tools.

By correctly configuring restart policies and watchdogs, AlmaLinux 8 systems can automatically recover from many service failures with minimal effort.