Tighten infrastructure outage dispatch rules

Carlo Costanzo
2026-07-03 17:27:08 -04:00
parent 0a70f4fd5e
commit 356e2b5f64
4 changed files with 110 additions and 7 deletions
+2 -2
@@ -49,10 +49,10 @@ Live collection of plug-and-play Home Assistant packages. Each YAML file in this
| [![YAML source: logbook_activity_feed](https://img.shields.io/static/v1?label=YAML&message=logbook_activity_feed&color=lightgrey&logo=github&logoColor=181717)](logbook_activity_feed.yaml) | Dummy `sensor.activity_feed` + helper to write clean Activity entries (Issue #1550). | `sensor.activity_feed`, `script.send_to_logbook` |
| [![YAML source: mariadb_monitoring](https://img.shields.io/static/v1?label=YAML&message=mariadb_monitoring&color=lightgrey&logo=github&logoColor=181717)](mariadb_monitoring.yaml) | MariaDB health sensors and Lovelace dashboard snippet for recorder stats. | `sensor.mariadb_status`, `sensor.database_size` |
| [![YAML source: llmvision](https://img.shields.io/static/v1?label=YAML&message=llmvision&color=lightgrey&logo=github&logoColor=181717)](llmvision.yaml) | Vision-backed garage-can and front-door package checks with rate-limited, downscaled OpenAI calls for package detection. [![Watch on YouTube](https://img.shields.io/badge/Watch-YouTube-FF0000?logo=youtube&logoColor=white)](https://youtu.be/nAhCezFetvI) | `input_button.llmvision_*`, `binary_sensor.front_door_packages_present`, `llmvision.stream_analyzer` |
-| [![YAML source: docker_infrastructure](https://img.shields.io/static/v1?label=YAML&message=docker_infrastructure&color=lightgrey&logo=github&logoColor=181717)](docker_infrastructure.yaml) | Docker host patching telemetry, container/stack Repairs automation, retired Portainer repair cleanup, 20-minute Joanna escalation for persistent container outages using stable configured monitor membership, and weekly scheduled prune actions across docker_10/14/17/69; the dedicated codex_appliance VM is monitored through BearClaw status telemetry. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `repairs.remove`, `script.joanna_dispatch` |
+| [![YAML source: docker_infrastructure](https://img.shields.io/static/v1?label=YAML&message=docker_infrastructure&color=lightgrey&logo=github&logoColor=181717)](docker_infrastructure.yaml) | Docker host patching telemetry, container/stack Repairs automation, retired Portainer repair cleanup, 20-minute Joanna escalation for persistent container outages including stuck `restarting`/`created` states, and weekly scheduled prune actions across docker_10/14/17/69; the dedicated codex_appliance VM is monitored through BearClaw status telemetry. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `repairs.remove`, `script.joanna_dispatch` |
| [![YAML source: proxmox](https://img.shields.io/static/v1?label=YAML&message=proxmox&color=lightgrey&logo=github&logoColor=181717)](proxmox.yaml) | Proxmox update detection with Repairs, 02:15 Joanna patch orchestration, final per-host HA success notifications, kernel-refresh handoff hints, runtime and disk pressure monitoring, plus nightly Frigate reboot. | `binary_sensor.node_proxmox*_updates_packages`, `sensor.node_proxmox*_total_updates`, `persistent_notification.create`, `script.joanna_dispatch`, `binary_sensor.proxmox*_runtime_healthy`, `sensor.proxmox*_disk_used_percentage`, `button.qemu_docker2_101_reboot` |
| [![YAML source: synology_dsm](https://img.shields.io/static/v1?label=YAML&message=synology_dsm&color=lightgrey&logo=github&logoColor=181717)](synology_dsm.yaml) | Synology DSM integration health normalization for Carlo-NAS01 and Carlo-NVR, with outage-aware Joanna-first handling for lone post-outage volume warnings and Repairs escalation for persistent or non-outage problems. | `binary_sensor.carlo_*_synology_problem`, `sensor.carlo_*_synology_problem_summary`, `binary_sensor.powerwall_grid_status`, `repairs.create`, `script.joanna_dispatch` |
-| [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health, Nebula Sync and promoted IoT primary/backup Pi-hole consistency monitoring with Joanna dispatch, Glances-backed Docker host disk pressure with Joanna-only warning cleanup and critical Repairs, and website uptime/latency SLO signals for Infrastructure dashboards, plus nightly backup verification and monthly Joanna HA log hygiene review with public-safe GitHub issue follow-up. | `sensor.infra_nebula_sync_dns_consistency`, `sensor.infra_pihole_iot_dns_consistency`, `binary_sensor.infra_nebula_sync_degraded`, `binary_sensor.infra_pihole_iot_dns_degraded`, `sensor.docker_*_disk_used_percentage`, `automation.infra_nebula_sync_health_dispatch`, `automation.infra_pihole_iot_dns_drift_dispatch`, `automation.docker_host_disk_pressure_monitor`, `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `script.joanna_dispatch` |
+| [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health, Nebula Sync and promoted IoT primary/backup Pi-hole consistency monitoring with Joanna dispatch, Glances-backed Docker host disk pressure with Joanna-only warning cleanup and critical Repairs, immediate website-down Repairs/Joanna dispatch plus uptime/latency SLO signals, nightly backup verification, and monthly Joanna HA log hygiene review with public-safe GitHub issue follow-up. | `sensor.infra_nebula_sync_dns_consistency`, `sensor.infra_pihole_iot_dns_consistency`, `binary_sensor.infra_nebula_sync_degraded`, `binary_sensor.infra_pihole_iot_dns_degraded`, `sensor.docker_*_disk_used_percentage`, `automation.infra_nebula_sync_health_dispatch`, `automation.infra_pihole_iot_dns_drift_dispatch`, `automation.docker_host_disk_pressure_monitor`, `automation.infra_website_down_repair_and_dispatch`, `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `script.joanna_dispatch` |
| [![YAML source: onenote_indexer](https://img.shields.io/static/v1?label=YAML&message=onenote_indexer&color=lightgrey&logo=github&logoColor=181717)](onenote_indexer.yaml) | Dedicated-appliance OneNote indexer health/status monitoring for Joanna, explicit index-health confirmation, failure-repair automation, and a daily duplicate-delete maintenance request. | `sensor.onenote_indexer_last_job_status`, `binary_sensor.onenote_indexer_last_job_successful`, `binary_sensor.onenote_indexer_index_healthy` |
| [![YAML source: mqtt_status](https://img.shields.io/static/v1?label=YAML&message=mqtt_status&color=lightgrey&logo=github&logoColor=181717)](mqtt_status.yaml) | Command-line MQTT broker reachability probe with Spook Repairs escalation and Joanna troubleshooting dispatch on outage. | `binary_sensor.mqtt_status_raw`, `binary_sensor.mqtt_broker_problem`, `repairs.create`, `rest_command.bearclaw_command` |
| [![YAML source: mariadb](https://img.shields.io/static/v1?label=YAML&message=mariadb&color=lightgrey&logo=github&logoColor=181717)](mariadb.yaml) | MariaDB recorder health and capacity snapshots with hourly live metrics, weekly admin/recorder polling, and stats-ready numeric sensors. | `sensor.mariadb_status`, `sensor.database_size` |
+6 -5
@@ -18,6 +18,7 @@
# Notes: Tapple is now served by `games_hub` on `/tapple/`; do not keep a standalone `tapple` container switch in the monitored group.
# Notes: Teslamate and crystalsoftwashsolutions are live services and should remain in the monitored group when their discovery switches are present.
# Notes: Treat telemetry reconnects from unavailable/unknown to a concrete stopped state as actionable outages.
# Notes: Treat stuck `restarting` and `created` states as down so monitored containers dispatch remediation.
# Notes: Infra Info was removed; BearClaw Admin is the planning snapshot surface.
# Notes: codex_appliance moved to a dedicated VM; keep the standard codex_appliance switches and retire the legacy hashed discovery entity when it disappears.
# Notes: Paige's Bookshelf is a live monitored service and should remain in the group when its discovery switch is present.
@@ -471,7 +472,7 @@ template:
{% endfor %}
{% endif %}
{% set effective_state = resolver.state %}
-{% if effective_state in ['off', 'stopped'] %}
+{% if effective_state in ['off', 'stopped', 'exited', 'dead', 'restarting', 'created'] %}
{% set ns.down = ns.down + [key] %}
{% elif not telemetry_degraded and effective_state in ['unknown', 'unavailable'] %}
{% set ns.down = ns.down + [key] %}
@@ -515,7 +516,7 @@ template:
{% endfor %}
{% endif %}
{% set effective_state = resolver.state %}
-{% if effective_state in ['off', 'stopped'] %}
+{% if effective_state in ['off', 'stopped', 'exited', 'dead', 'restarting', 'created'] %}
{% set ns.down = ns.down + [key] %}
{% elif not telemetry_degraded and effective_state in ['unknown', 'unavailable'] %}
{% set ns.down = ns.down + [key] %}
@@ -596,7 +597,7 @@ script:
example: true
sequence:
- variables:
-down_states: ['off', 'stopped', 'exited', 'dead', 'unknown', 'unavailable']
+down_states: ['off', 'stopped', 'exited', 'dead', 'restarting', 'created', 'unknown', 'unavailable']
src_entity: "{{ entity_id | default('', true) }}"
op: "{{ operation | default('create', true) | lower }}"
wait_minutes: "{{ delay_minutes | default(0) | int(0) }}"
@@ -1046,13 +1047,13 @@ automation:
value_template: "{{ is_monitored_container_event }}"
sequence:
- variables:
-down_states: ['off', 'stopped', 'exited', 'dead', 'unknown', 'unavailable']
+down_states: ['off', 'stopped', 'exited', 'dead', 'restarting', 'created', 'unknown', 'unavailable']
- choose:
- conditions: >-
{{ new_state in down_states and
(old_state not in down_states or
(old_state in ['unknown', 'unavailable'] and
-new_state in ['off', 'stopped', 'exited', 'dead'])) and
+new_state in ['off', 'stopped', 'exited', 'dead', 'restarting', 'created'])) and
not (is_state('binary_sensor.docker_container_telemetry_degraded', 'on') and
new_state in ['unknown', 'unavailable']) }}
sequence:
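
The dispatch condition in the hunk above can be easier to reason about outside Jinja. The following is a minimal Python sketch of that condition as it reads in this diff, not the automation itself. The names `should_dispatch`, `CONCRETE_DOWN`, and `TELEMETRY_GAP` are hypothetical, and `telemetry_degraded` stands in for `is_state('binary_sensor.docker_container_telemetry_degraded', 'on')`.

```python
# Hypothetical model of the Jinja dispatch condition; names are illustrative only.
CONCRETE_DOWN = {"off", "stopped", "exited", "dead", "restarting", "created"}
TELEMETRY_GAP = {"unknown", "unavailable"}
DOWN_STATES = CONCRETE_DOWN | TELEMETRY_GAP


def should_dispatch(old_state: str, new_state: str, telemetry_degraded: bool) -> bool:
    """Return True when a state change counts as a new actionable outage."""
    if new_state not in DOWN_STATES:
        return False
    # Case 1: the container was healthy (or in any non-down state) before.
    entered_down = old_state not in DOWN_STATES
    # Case 2: telemetry reconnected and revealed a concrete down state.
    reconnected_to_down = old_state in TELEMETRY_GAP and new_state in CONCRETE_DOWN
    # Telemetry gaps are ignored while the telemetry itself is degraded.
    suppressed = telemetry_degraded and new_state in TELEMETRY_GAP
    return (entered_down or reconnected_to_down) and not suppressed
```

Under this reading, a move from `stopped` to `restarting` does not dispatch again, because both states are already concrete down states, while `unavailable` to `created` does.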
+101
@@ -19,6 +19,7 @@
# Notes: Warning-level Docker host disk pressure is Joanna-only; Repairs are reserved for critical pressure.
# Notes: Nebula Sync DNS consistency compares primary/backup Pi-hole answers and dispatches Joanna on sustained drift or container loss.
# Notes: Promoted IoT DNS consistency compares primary/backup Pi-hole answers for reserved IoT host records.
# Notes: Immediate website-down states create Repairs and dispatch Joanna; SLO/latency automations cover longer-term UptimeRobot trends.
######################################################################
input_text:
@@ -227,6 +228,31 @@ template:
{% endif %}
{% endfor %}
{{ ns.count }}
attributes:
monitored_entities: >-
{{ [
'binary_sensor.vcloudinfo_com',
'binary_sensor.ipmer_com',
'binary_sensor.fordst_com',
'binary_sensor.www_kingcrafthomes_com'
] }}
down_entities: >-
{% set ids = [
'binary_sensor.vcloudinfo_com',
'binary_sensor.ipmer_com',
'binary_sensor.fordst_com',
'binary_sensor.www_kingcrafthomes_com'
] %}
{% set ns = namespace(items=[]) %}
{% for id in ids %}
{% if expand(id) | count > 0 %}
{% set st = states(id) %}
{% if st in ['off', 'unknown', 'unavailable'] %}
{% set ns.items = ns.items + [id ~ '=' ~ st] %}
{% endif %}
{% endif %}
{% endfor %}
{{ ns.items }}
- binary_sensor:
- name: "Infra WAN Quality Degraded"
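
The `down_entities` attribute added above can be sketched in Python to show what it is expected to produce. This is an illustration under assumptions: `states` is a hypothetical dict of entity id to state string, and a missing key stands in for the `expand(id) | count > 0` existence check in the template.

```python
# Hypothetical model of the down_entities attribute template.
MONITORED = [
    "binary_sensor.vcloudinfo_com",
    "binary_sensor.ipmer_com",
    "binary_sensor.fordst_com",
    "binary_sensor.www_kingcrafthomes_com",
]
DOWN = {"off", "unknown", "unavailable"}


def down_entities(states: dict) -> list:
    """List 'entity_id=state' for each existing monitored entity that is down."""
    items = []
    for entity_id in MONITORED:
        if entity_id not in states:
            # Entities that do not exist are skipped rather than reported as down.
            continue
        st = states[entity_id]
        if st in DOWN:
            items.append(f"{entity_id}={st}")
    return items
```

Keeping the state next to each id lets a later consumer tell a hard `off` apart from a telemetry gap such as `unavailable`.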
@@ -417,6 +443,81 @@ automation:
message: >-
External IP changed from {{ trigger.from_state.state }} to {{ trigger.to_state.state }}.
- alias: "Infrastructure - Website Down Repair And Dispatch"
id: infra_website_down_repair_and_dispatch
description: "Create/clear Repairs and dispatch Joanna when monitored websites are immediately down."
mode: queued
trigger:
- platform: state
entity_id: binary_sensor.infra_website_degraded
to: "on"
for: "00:05:00"
id: degraded
- platform: state
entity_id: binary_sensor.infra_website_degraded
to: "off"
id: recovered
variables:
down_count: "{{ states('sensor.infra_website_down_count') | int(0) }}"
down_entities: "{{ state_attr('sensor.infra_website_down_count', 'down_entities') | default([], true) | list }}"
down_summary: "{{ down_entities | join(', ') if (down_entities | count > 0) else 'none' }}"
action:
- choose:
- conditions:
- condition: template
value_template: "{{ trigger.id == 'degraded' and down_count > 0 }}"
sequence:
- service: repairs.create
data:
issue_id: infra_website_down
title: "Website availability degraded"
description: >-
{{ down_count }} monitored website
{{ 'entity is' if down_count == 1 else 'entities are' }} down:
{{ down_summary }}.
severity: error
persistent: true
- service: script.joanna_dispatch
data:
trigger_context: >-
HA automation infra_website_down_repair_and_dispatch
(Infrastructure - Website Down Repair And Dispatch)
source: "home_assistant_automation.infra_website_down_repair_and_dispatch"
summary: "Monitored website availability degraded ({{ down_count }} down)"
entity_ids:
- binary_sensor.infra_website_degraded
- sensor.infra_website_down_count
- binary_sensor.vcloudinfo_com
- sensor.vcloudinfo_com
- sensor.wordpress_wp_state_2
- switch.wordpress_wp_container_2
diagnostics: >-
down_entities={{ down_summary }};
vcloudinfo_sensor={{ states('sensor.vcloudinfo_com') }};
vcloudinfo_binary={{ states('binary_sensor.vcloudinfo_com') }};
wordpress_state={{ states('sensor.wordpress_wp_state_2') }};
wordpress_switch={{ states('switch.wordpress_wp_container_2') }};
cloudflared_wp={{ states('switch.cloudflared_wp_container_2') }}.
request: >-
Investigate and resolve the monitored website outage. For
vcloudinfo.com, start with public HTTPS reachability,
wordpress_wp, wordpress_db, and cloudflared_wp telemetry.
Verify public HTTP 200 recovery before closing out. Do not
power-cycle unrelated infrastructure.
- service: script.send_to_logbook
data:
topic: "INTERNET"
message: "Website availability dispatch requested ({{ down_summary }})."
default:
- service: repairs.remove
continue_on_error: true
data:
issue_id: infra_website_down
- service: script.send_to_logbook
data:
topic: "INTERNET"
message: "Website availability recovered."
- alias: "Infrastructure - Website Uptime SLO Repair"
id: infra_website_uptime_slo_repair
description: "Create/clear Repairs issue when website 1-day uptime breaches SLO."
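
The branching of the website-down automation added in this hunk can be summarized with a small Python sketch. It is a simplified model, not the automation: `plan_actions` and `repair_description` are hypothetical helpers, and the count is derived from the list here, whereas the YAML reads it from `sensor.infra_website_down_count`.

```python
# Hypothetical model of the choose/default branches and the Repairs description.
def plan_actions(trigger_id: str, down_count: int) -> list:
    """Return the services called, in order, for a given trigger."""
    if trigger_id == "degraded" and down_count > 0:
        return ["repairs.create", "script.joanna_dispatch", "script.send_to_logbook"]
    # Default branch: covers 'recovered' and a 'degraded' trigger with nothing down.
    return ["repairs.remove", "script.send_to_logbook"]


def repair_description(down_entities: list) -> str:
    """Build the folded description text used for the Repairs issue."""
    count = len(down_entities)  # assumption: list length matches the sensor state
    summary = ", ".join(down_entities) if down_entities else "none"
    noun = "entity is" if count == 1 else "entities are"
    return f"{count} monitored website {noun} down: {summary}."
```

One consequence of the default branch is that a `degraded` trigger arriving while the down count reads 0 clears the Repairs issue instead of creating one.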
+1
@@ -61,6 +61,7 @@ Current automations that kick off automated resolutions (via `script.joanna_disp
| `infra_monthly_log_hygiene_review` | Infrastructure - Monthly HA Log Hygiene Review | [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/infrastructure.yaml) |
| `infra_nebula_sync_health_dispatch` | Infrastructure - Nebula Sync Health Dispatch | [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/infrastructure.yaml) |
| `infra_pihole_iot_dns_drift_dispatch` | Infrastructure - Pi-hole IoT DNS Drift Dispatch | [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/infrastructure.yaml) |
| `infra_website_down_repair_and_dispatch` | Infrastructure - Website Down Repair And Dispatch | [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/infrastructure.yaml) |
| `docker_state_sync_repairs_dynamic` | Docker State Sync - Repairs (Dynamic) | [![YAML source: docker_infrastructure](https://img.shields.io/static/v1?label=YAML&message=docker_infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/docker_infrastructure.yaml) |
| `docker_group_reconcile_weekly_joanna_review` | Docker Group Reconcile - Weekly Joanna Review | [![YAML source: docker_infrastructure](https://img.shields.io/static/v1?label=YAML&message=docker_infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/docker_infrastructure.yaml) |
| `docker_host_disk_pressure_monitor` | Docker Host Disk Pressure Monitor | [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/infrastructure.yaml) |