Building a Self-Healing Systemd Service on AlmaLinux 8 (Without External Monitoring)

Overview

In many production environments, service availability depends on external monitoring systems such as Prometheus, Zabbix, or cloud-provider health checks. While effective, these systems introduce additional complexity and dependencies.

A lesser-known but highly effective approach is to leverage systemd’s native watchdog and restart capabilities to create self-healing services that automatically detect failures and recover without external tools.

This document explains how to design and deploy a self-healing systemd service on AlmaLinux 8 using built-in systemd features only.

Use Cases

Critical internal services without monitoring agents
Edge or isolated systems
Minimal installations
Bootstrap environments
Fallback protection when monitoring systems fail

Prerequisites

AlmaLinux 8
systemd (default)
Root or sudo access
A long-running service or script

Step 1: Create a Sample Service Script

Create a service script that simulates a long-running process.

nano /usr/local/bin/example-service.sh

Add the following content:

#!/bin/bash

while true; do
    echo "$(date) - Service heartbeat"
    sleep 5
done

Make it executable:

chmod +x /usr/local/bin/example-service.sh

Step 2: Create systemd Service Unit

nano /etc/systemd/system/example-selfhealing.service

Add:

[Unit]
Description=Example Self-Healing Service
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/example-service.sh
Restart=always
RestartSec=3
WatchdogSec=10
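# systemd-notify runs as a child of the script, not as the main PID, so
# notifications from all processes in the service must be accepted.
NotifyAccess=all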

[Install]
WantedBy=multi-user.target

Step 3: Enable systemd Watchdog

Modify the script to notify systemd.

nano /usr/local/bin/example-service.sh

Replace content with:

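#!/bin/bash

# Ping the systemd watchdog every 5 seconds, half of WatchdogSec=10.
# If the pings stop, systemd treats the service as hung and restarts it.
while true; do
    systemd-notify WATCHDOG=1
    sleep 5
done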
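
systemd exports the configured timeout to the service as WATCHDOG_USEC (in microseconds), so the ping interval does not have to be hard-coded. The sketch below derives an interval of half the timeout. The function names watchdog_interval and heartbeat_loop and the 5-second fallback are illustrative assumptions, not part of systemd.

```shell
#!/bin/bash
# Sketch: derive the watchdog ping interval from WATCHDOG_USEC instead of
# hard-coding it. Function names and the fallback value are illustrative.

# Print half the watchdog timeout in whole seconds (minimum 1).
# Falls back to 5 when WATCHDOG_USEC is unset or not a number,
# e.g. when the script is run by hand outside systemd.
watchdog_interval() {
    case "${WATCHDOG_USEC:-}" in
        ''|*[!0-9]*) echo 5; return ;;
    esac
    secs=$(( $WATCHDOG_USEC / 2000000 ))
    [ "$secs" -ge 1 ] || secs=1
    echo "$secs"
}

# Ping the watchdog forever at the derived interval.
heartbeat_loop() {
    interval="$(watchdog_interval)"
    while true; do
        systemd-notify WATCHDOG=1
        sleep "$interval"
    done
}

# Only enter the loop when systemd provided a notification socket.
if [ -n "${NOTIFY_SOCKET:-}" ]; then
    heartbeat_loop
fi
```

With WatchdogSec=10 this yields a 5-second interval, the same as the fixed sleep in the script above, but it keeps following the unit file if the timeout is changed later.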

Step 4: Reload systemd and Start Service

systemctl daemon-reload
systemctl enable example-selfhealing.service
systemctl start example-selfhealing.service

Step 5: Verify Service Health

Check service status:

systemctl status example-selfhealing.service

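Check the journal. Successful watchdog pings are not logged, so expect to see only start, stop, and watchdog-timeout messages: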

journalctl -u example-selfhealing.service

Step 6: Simulate Failure

Forcefully terminate the service:

pkill -9 -f example-service.sh

systemd restarts it automatically after the 3-second RestartSec delay.
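
Killing the process exercises the restart policy but not the watchdog, because the watchdog only matters when the process is alive yet unresponsive. One way to sketch that case is to suspend the main process with SIGSTOP so that it stops sending pings. simulate_hang is a hypothetical helper name, and how quickly systemd recovers depends on the unit's watchdog and stop-timeout settings.

```shell
#!/bin/bash
# Sketch: simulate a hung (not crashed) service by suspending its main
# process so that watchdog pings stop. simulate_hang is a hypothetical helper.

simulate_hang() {
    unit="$1"
    pid="$(systemctl show --property=MainPID --value "$unit")"
    # MainPID is 0 when the service is not running. Never signal PID 0:
    # that would address the caller's own process group.
    case "$pid" in
        ''|0|*[!0-9]*)
            echo "no running main process for $unit" >&2
            return 1
            ;;
    esac
    kill -STOP "$pid"
}
```

Running simulate_hang example-selfhealing.service as root and then following the journal with journalctl -f -u example-selfhealing.service should show the watchdog expiring and the service being started again.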

Step 7: Verify Automatic Recovery

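systemctl status example-selfhealing.service
systemctl show -p NRestarts example-selfhealing.service

The service is active again with a new main PID, and the NRestarts counter has increased, confirming recovery.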
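
systemd tracks automatic restarts in the unit's NRestarts property, so recovery can also be checked from a script. A minimal sketch, assuming the hypothetical helper names restart_count and restarted_since:

```shell
#!/bin/bash
# Sketch: check from a script whether systemd has restarted a unit.
# restart_count and restarted_since are hypothetical helper names.

# Print how many times systemd has automatically restarted the unit.
restart_count() {
    systemctl show --property=NRestarts --value "$1"
}

# Succeed if the restart counter is now higher than a recorded value.
# Usage: restarted_since UNIT PREVIOUS_COUNT
restarted_since() {
    now="$(restart_count "$1")"
    [ "${now:-0}" -gt "$2" ]
}
```

A typical sequence would be to store restart_count example-selfhealing.service in a variable before killing the process, wait a few seconds, and then call restarted_since with that stored value.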

Advanced Hardening Options

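Limit Restart Storms

Add to the [Unit] section to stop systemd from restarting a crash-looping service indefinitely. With these values, systemd allows at most 3 starts within any 60-second window and then leaves the unit in the failed state:

StartLimitIntervalSec=60
StartLimitBurst=3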

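Resource Control

Add to the [Service] section to cap memory and CPU usage:

MemoryMax=256M
CPUQuota=50%

MemoryMax= is a cgroup v2 setting, while AlmaLinux 8 boots with cgroup v1 by default. If the memory cap does not take effect, MemoryLimit=256M is the cgroup v1 counterpart.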
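
Rather than editing the main unit file, the start-limit and resource-control settings can be kept in a drop-in, with each directive under its proper section. The sketch below writes such a file. The function name write_hardening_dropin and the file name 10-hardening.conf are illustrative assumptions, and the target directory is a parameter so the function can be tried in a scratch directory first.

```shell
#!/bin/bash
# Sketch: write the hardening directives into a systemd drop-in file.
# The helper name and the drop-in file name are illustrative.

# Usage: write_hardening_dropin DROPIN_DIR
write_hardening_dropin() {
    dir="$1"
    mkdir -p "$dir" || return 1
    # Start limits are [Unit] settings; resource limits are [Service] settings.
    cat > "$dir/10-hardening.conf" <<'EOF'
[Unit]
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
MemoryMax=256M
CPUQuota=50%
EOF
}
```

For the service in this document the directory would be /etc/systemd/system/example-selfhealing.service.d, followed by systemctl daemon-reload and a restart of the service so the new limits apply.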

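Failure Actions

Add to the [Unit] section to activate another unit once the service enters the failed state, for example after the start limit is exhausted:

OnFailure=emergency.target

emergency.target brings up the emergency shell, which is a drastic response to a single failed service. A dedicated notification or cleanup unit is often a better target.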

Logging and Auditing

journalctl -u example-selfhealing.service --since "10 minutes ago"
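
The journal can also be summarized instead of read line by line. The sketch below counts watchdog expirations for a unit over a time window. count_watchdog_timeouts is a hypothetical helper, and the matched text "Watchdog timeout" is an assumption about how systemd words the message, so check it against the actual journal output on the host.

```shell
#!/bin/bash
# Sketch: count watchdog expirations logged for a unit.
# The helper name and the matched message text are assumptions.

# Usage: count_watchdog_timeouts UNIT [SINCE]
# SINCE defaults to "1 hour ago" and accepts any journalctl time expression.
count_watchdog_timeouts() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so the exit status is deliberately ignored.
    journalctl --unit="$1" --since="${2:-1 hour ago}" --no-pager \
        | grep -c 'Watchdog timeout' || true
}
```

For example, count_watchdog_timeouts example-selfhealing.service "10 minutes ago" reports how often the watchdog fired in the same window used in the journalctl command above.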

Why This Approach Is Valuable

No external agents required
Faster recovery than monitoring alerts
Reduced system complexity
Built into systemd
Highly reliable

These features are described in the systemd.service(5) and systemd-notify(1) man pages, but they are often overlooked in favor of external monitoring.

Conclusion

systemd provides powerful native mechanisms for building self-healing services without relying on third-party monitoring tools.

By correctly configuring restart policies and watchdogs, AlmaLinux 8 systems can automatically recover from many service failures with minimal effort.