RFE: A system monitoring and alerting role #47

@myllynen

I would like to see a role that configures essential system monitoring and automatically alerts the administrator when something goes wrong. The areas to monitor, the alert thresholds, and the alerting methods should all be configurable. Once the role is configured, the administrator should not need to monitor or read anything manually to confirm that a system is behaving as expected, and should receive a notification when an issue occurs.

In practice, at least the following alerting methods could be considered:

  • D-Bus
  • email
  • HTTP POST (this would probably also cover chat)
  • SNMP
  • SMS
  • syslog

The following areas could be monitored, each with a configurable threshold (e.g., a default limit of 90% for the disk-full case):

  • CPU usage - e.g., detect CPU hogs on non-dedicated systems where no process should use the CPU heavily for a long time
  • memory usage - e.g., monitor how much memory and swap is in use and how much swap-in/swap-out activity there is
  • disk usage - e.g., monitor that no partition is getting full
  • network connectivity - e.g., monitor that the gateway, DNS, and NTP servers are pingable and that no packet loss is detected
  • application issues - e.g., generic cases like a process segfaulting repeatedly or a service failing to start
  • security violations - e.g., a high number of failed SSH login attempts, SELinux AVCs, DDoS, or sudo failures
  • hardware failures - e.g., IO errors from storage or current hardware not matching a predefined configuration
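
To make the threshold idea concrete, here is a minimal Python sketch of the disk-full case with the 90% default mentioned above. It is only an illustration of the check, not a proposed implementation for the role: the function name `disk_alert`, the `usage_fn` parameter, and the message wording are all hypothetical.

```python
import shutil
from collections import namedtuple
from typing import Callable, Optional

# Same field layout as the value returned by shutil.disk_usage();
# handy for feeding the check with made-up numbers.
Usage = namedtuple("Usage", "total used free")


def disk_alert(
    path: str,
    threshold_percent: float = 90.0,
    usage_fn: Callable = shutil.disk_usage,
) -> Optional[str]:
    """Return an alert message if `path` is at or above the threshold, else None.

    `usage_fn` is a hypothetical hook so the check can be exercised without
    touching a real filesystem; by default the real usage is read.
    """
    usage = usage_fn(path)
    if usage.total == 0:
        # Nothing meaningful to report for a zero-sized filesystem.
        return None
    percent = 100.0 * usage.used / usage.total
    if percent >= threshold_percent:
        return (
            f"disk usage on {path} is {percent:.1f}% "
            f"(threshold {threshold_percent:.1f}%)"
        )
    return None


if __name__ == "__main__":
    message = disk_alert("/")
    print(message or "disk usage OK")
```

The same shape (read a value, compare it against a configurable limit, produce a message) would presumably apply to the other areas as well, with only the data source changing.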

The user could select one or more alerting methods; local syslog could be the default since it is probably the easiest to set up correctly. The default set of monitored areas and the default thresholds could be determined after consulting people and organizations that maintain and support production systems.
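
As a rough sketch of selectable alerting methods with local syslog as the default, something along these lines could work. Everything here is an assumption made for illustration: the names `send_alert`, `alert_syslog`, `make_http_post_alert`, and `DEFAULT_METHODS`, the JSON payload shape for HTTP POST, and the choice of the `LOG_ALERT` priority.

```python
import json
import syslog
import urllib.request
from typing import Callable, Dict, Iterable, List, Optional


def alert_syslog(message: str) -> None:
    """Write the alert to the local syslog (Unix only)."""
    syslog.syslog(syslog.LOG_ALERT, message)


def make_http_post_alert(url: str) -> Callable[[str], None]:
    """Build an alert function that POSTs the message as JSON to `url`.

    The {"text": ...} payload is an assumed format; a real endpoint
    (e.g. a chat webhook) may expect something different.
    """
    def alert_http_post(message: str) -> None:
        body = json.dumps({"text": message}).encode("utf-8")
        request = urllib.request.Request(
            url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=10):
            pass

    return alert_http_post


# Only syslog is registered out of the box, mirroring the suggested default.
DEFAULT_METHODS: Dict[str, Callable[[str], None]] = {"syslog": alert_syslog}


def send_alert(
    message: str,
    methods: Iterable[str] = ("syslog",),
    registry: Optional[Dict[str, Callable[[str], None]]] = None,
) -> List[str]:
    """Send `message` through each selected method; return the names used."""
    registry = DEFAULT_METHODS if registry is None else registry
    methods = list(methods)
    unknown = [name for name in methods if name not in registry]
    if unknown:
        raise ValueError(f"unknown alerting method(s): {', '.join(unknown)}")
    delivered = []
    for name in methods:
        registry[name](message)
        delivered.append(name)
    return delivered
```

Calling `send_alert("disk usage on / is 95.0%")` with no further arguments would log to syslog only. Additional methods could be enabled by adding entries to the registry, for example `{"syslog": alert_syslog, "http": make_http_post_alert(url)}` with a URL supplied by the administrator.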

Implementation-wise, one potential candidate would be PCP/pmie, at least for the CPU, memory, storage, and network areas. PCP/pmie uses the same PCP infrastructure as the existing metrics role to detect anomalies, is fully configurable, allows calling external scripts on events, and is nowadays a standard component in most distributions. However, it should be tested how PCP/pmie itself behaves under the conditions that should raise an alert, e.g., when a disk is full.

Later on, it could be considered whether optional remediation scripts would be helpful or feasible.

Thanks.
