Tighten infrastructure outage dispatch rules

2026-07-04 05:15:31 -07:00 · 2026-07-03 17:27:08 -04:00
parent 0a70f4fd5e
commit 356e2b5f64
4 changed files with 110 additions and 7 deletions
@@ -49,10 +49,10 @@ Live collection of plug-and-play Home Assistant packages. Each YAML file in this
 | [![YAML source: logbook_activity_feed](https://img.shields.io/static/v1?label=YAML&message=logbook_activity_feed&color=lightgrey&logo=github&logoColor=181717)](logbook_activity_feed.yaml) | Dummy `sensor.activity_feed` + helper to write clean Activity entries (Issue #1550). | `sensor.activity_feed`, `script.send_to_logbook` |
 | [![YAML source: mariadb_monitoring](https://img.shields.io/static/v1?label=YAML&message=mariadb_monitoring&color=lightgrey&logo=github&logoColor=181717)](mariadb_monitoring.yaml) | MariaDB health sensors and Lovelace dashboard snippet for recorder stats. | `sensor.mariadb_status`, `sensor.database_size` |
 | [![YAML source: llmvision](https://img.shields.io/static/v1?label=YAML&message=llmvision&color=lightgrey&logo=github&logoColor=181717)](llmvision.yaml) | Vision-backed garage-can and front-door package checks with rate-limited, downscaled OpenAI calls for package detection. [![Watch on YouTube](https://img.shields.io/badge/Watch-YouTube-FF0000?logo=youtube&logoColor=white)](https://youtu.be/nAhCezFetvI) | `input_button.llmvision_*`, `binary_sensor.front_door_packages_present`, `llmvision.stream_analyzer` |
-| [![YAML source: docker_infrastructure](https://img.shields.io/static/v1?label=YAML&message=docker_infrastructure&color=lightgrey&logo=github&logoColor=181717)](docker_infrastructure.yaml) | Docker host patching telemetry, container/stack Repairs automation, retired Portainer repair cleanup, 20-minute Joanna escalation for persistent container outages using stable configured monitor membership, and weekly scheduled prune actions across docker_10/14/17/69; the dedicated codex_appliance VM is monitored through BearClaw status telemetry. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `repairs.remove`, `script.joanna_dispatch` |
+| [![YAML source: docker_infrastructure](https://img.shields.io/static/v1?label=YAML&message=docker_infrastructure&color=lightgrey&logo=github&logoColor=181717)](docker_infrastructure.yaml) | Docker host patching telemetry, container/stack Repairs automation, retired Portainer repair cleanup, 20-minute Joanna escalation for persistent container outages including stuck `restarting`/`created` states, and weekly scheduled prune actions across docker_10/14/17/69; the dedicated codex_appliance VM is monitored through BearClaw status telemetry. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `repairs.remove`, `script.joanna_dispatch` |
 | [![YAML source: proxmox](https://img.shields.io/static/v1?label=YAML&message=proxmox&color=lightgrey&logo=github&logoColor=181717)](proxmox.yaml) | Proxmox update detection with Repairs, 02:15 Joanna patch orchestration, final per-host HA success notifications, kernel-refresh handoff hints, runtime and disk pressure monitoring, plus nightly Frigate reboot. | `binary_sensor.node_proxmox*_updates_packages`, `sensor.node_proxmox*_total_updates`, `persistent_notification.create`, `script.joanna_dispatch`, `binary_sensor.proxmox*_runtime_healthy`, `sensor.proxmox*_disk_used_percentage`, `button.qemu_docker2_101_reboot` |
 | [![YAML source: synology_dsm](https://img.shields.io/static/v1?label=YAML&message=synology_dsm&color=lightgrey&logo=github&logoColor=181717)](synology_dsm.yaml) | Synology DSM integration health normalization for Carlo-NAS01 and Carlo-NVR, with outage-aware Joanna-first handling for lone post-outage volume warnings and Repairs escalation for persistent or non-outage problems. | `binary_sensor.carlo_*_synology_problem`, `sensor.carlo_*_synology_problem_summary`, `binary_sensor.powerwall_grid_status`, `repairs.create`, `script.joanna_dispatch` |
-| [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health, Nebula Sync and promoted IoT primary/backup Pi-hole consistency monitoring with Joanna dispatch, Glances-backed Docker host disk pressure with Joanna-only warning cleanup and critical Repairs, and website uptime/latency SLO signals for Infrastructure dashboards, plus nightly backup verification and monthly Joanna HA log hygiene review with public-safe GitHub issue follow-up. | `sensor.infra_nebula_sync_dns_consistency`, `sensor.infra_pihole_iot_dns_consistency`, `binary_sensor.infra_nebula_sync_degraded`, `binary_sensor.infra_pihole_iot_dns_degraded`, `sensor.docker_*_disk_used_percentage`, `automation.infra_nebula_sync_health_dispatch`, `automation.infra_pihole_iot_dns_drift_dispatch`, `automation.docker_host_disk_pressure_monitor`, `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `script.joanna_dispatch` |
+| [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health, Nebula Sync and promoted IoT primary/backup Pi-hole consistency monitoring with Joanna dispatch, Glances-backed Docker host disk pressure with Joanna-only warning cleanup and critical Repairs, immediate website-down Repairs/Joanna dispatch plus uptime/latency SLO signals, nightly backup verification, and monthly Joanna HA log hygiene review with public-safe GitHub issue follow-up. | `sensor.infra_nebula_sync_dns_consistency`, `sensor.infra_pihole_iot_dns_consistency`, `binary_sensor.infra_nebula_sync_degraded`, `binary_sensor.infra_pihole_iot_dns_degraded`, `sensor.docker_*_disk_used_percentage`, `automation.infra_nebula_sync_health_dispatch`, `automation.infra_pihole_iot_dns_drift_dispatch`, `automation.docker_host_disk_pressure_monitor`, `automation.infra_website_down_repair_and_dispatch`, `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `script.joanna_dispatch` |
 | [![YAML source: onenote_indexer](https://img.shields.io/static/v1?label=YAML&message=onenote_indexer&color=lightgrey&logo=github&logoColor=181717)](onenote_indexer.yaml) | Dedicated-appliance OneNote indexer health/status monitoring for Joanna, explicit index-health confirmation, failure-repair automation, and a daily duplicate-delete maintenance request. | `sensor.onenote_indexer_last_job_status`, `binary_sensor.onenote_indexer_last_job_successful`, `binary_sensor.onenote_indexer_index_healthy` |
 | [![YAML source: mqtt_status](https://img.shields.io/static/v1?label=YAML&message=mqtt_status&color=lightgrey&logo=github&logoColor=181717)](mqtt_status.yaml) | Command-line MQTT broker reachability probe with Spook Repairs escalation and Joanna troubleshooting dispatch on outage. | `binary_sensor.mqtt_status_raw`, `binary_sensor.mqtt_broker_problem`, `repairs.create`, `rest_command.bearclaw_command` |
 | [![YAML source: mariadb](https://img.shields.io/static/v1?label=YAML&message=mariadb&color=lightgrey&logo=github&logoColor=181717)](mariadb.yaml) | MariaDB recorder health and capacity snapshots with hourly live metrics, weekly admin/recorder polling, and stats-ready numeric sensors. | `sensor.mariadb_status`, `sensor.database_size` |
@@ -18,6 +18,7 @@
 # Notes: Tapple is now served by `games_hub` on `/tapple/`; do not keep a standalone `tapple` container switch in the monitored group.
 # Notes: Teslamate and crystalsoftwashsolutions are live services and should remain in the monitored group when their discovery switches are present.
 # Notes: Treat telemetry reconnects from unavailable/unknown to a concrete stopped state as actionable outages.
+# Notes: Treat stuck `restarting` and `created` states as down so monitored containers dispatch remediation.
 # Notes: Infra Info was removed; BearClaw Admin is the planning snapshot surface.
 # Notes: codex_appliance moved to a dedicated VM; keep the standard codex_appliance switches and retire the legacy hashed discovery entity when it disappears.
 # Notes: Paige's Bookshelf is a live monitored service and should remain in the group when its discovery switch is present.
@@ -471,7 +472,7 @@ template:
              {% endfor %}
            {% endif %}
            {% set effective_state = resolver.state %}
-            {% if effective_state in ['off', 'stopped'] %}
+            {% if effective_state in ['off', 'stopped', 'exited', 'dead', 'restarting', 'created'] %}
              {% set ns.down = ns.down + [key] %}
            {% elif not telemetry_degraded and effective_state in ['unknown', 'unavailable'] %}
              {% set ns.down = ns.down + [key] %}
@@ -515,7 +516,7 @@ template:
                {% endfor %}
              {% endif %}
              {% set effective_state = resolver.state %}
-              {% if effective_state in ['off', 'stopped'] %}
+              {% if effective_state in ['off', 'stopped', 'exited', 'dead', 'restarting', 'created'] %}
                {% set ns.down = ns.down + [key] %}
              {% elif not telemetry_degraded and effective_state in ['unknown', 'unavailable'] %}
                {% set ns.down = ns.down + [key] %}
@@ -596,7 +597,7 @@ script:
        example: true
    sequence:
      - variables:
-          down_states: ['off', 'stopped', 'exited', 'dead', 'unknown', 'unavailable']
+          down_states: ['off', 'stopped', 'exited', 'dead', 'restarting', 'created', 'unknown', 'unavailable']
          src_entity: "{{ entity_id | default('', true) }}"
          op: "{{ operation | default('create', true) | lower }}"
          wait_minutes: "{{ delay_minutes | default(0) | int(0) }}"
@@ -1046,13 +1047,13 @@ automation:
                value_template: "{{ is_monitored_container_event }}"
            sequence:
              - variables:
-                  down_states: ['off', 'stopped', 'exited', 'dead', 'unknown', 'unavailable']
+                  down_states: ['off', 'stopped', 'exited', 'dead', 'restarting', 'created', 'unknown', 'unavailable']
              - choose:
                  - conditions: >-
                      {{ new_state in down_states and
                         (old_state not in down_states or
                          (old_state in ['unknown', 'unavailable'] and
-                           new_state in ['off', 'stopped', 'exited', 'dead'])) and
+                           new_state in ['off', 'stopped', 'exited', 'dead', 'restarting', 'created'])) and
                         not (is_state('binary_sensor.docker_container_telemetry_degraded', 'on') and
                              new_state in ['unknown', 'unavailable']) }}
                    sequence:
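
Review note: the transition condition in the hunk above is easier to check as a plain predicate. The Python below is a minimal sketch that models the Jinja expression and is not code from this commit. The function name `should_dispatch`, the constant names, and the healthy state `running` used in the usage note are assumptions made for illustration.

```python
# Hypothetical model of the dispatch condition; names are illustrative only.
DOWN_STATES = {
    "off", "stopped", "exited", "dead",
    "restarting", "created", "unknown", "unavailable",
}
# States that describe a concrete container condition rather than missing telemetry.
CONCRETE_DOWN = {"off", "stopped", "exited", "dead", "restarting", "created"}
TELEMETRY_GAP = {"unknown", "unavailable"}


def should_dispatch(old_state: str, new_state: str, telemetry_degraded: bool) -> bool:
    """Return True when a state change counts as a new actionable outage."""
    entered_down = new_state in DOWN_STATES and (
        # Fresh outage: the container was not already in a down state.
        old_state not in DOWN_STATES
        # Telemetry reconnect: unknown/unavailable resolved to a concrete down state.
        or (old_state in TELEMETRY_GAP and new_state in CONCRETE_DOWN)
    )
    # While telemetry is degraded, unknown/unavailable alone is not actionable.
    suppressed = telemetry_degraded and new_state in TELEMETRY_GAP
    return entered_down and not suppressed
```

Under this model, a change from `running` to `restarting` dispatches, and so does `unavailable` to `created`. A move between two concrete down states (for example `stopped` to `exited`) does not dispatch again.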
@@ -19,6 +19,7 @@
 # Notes: Warning-level Docker host disk pressure is Joanna-only; Repairs are reserved for critical pressure.
 # Notes: Nebula Sync DNS consistency compares primary/backup Pi-hole answers and dispatches Joanna on sustained drift or container loss.
 # Notes: Promoted IoT DNS consistency compares primary/backup Pi-hole answers for reserved IoT host records.
+# Notes: Immediate website-down states create Repairs and dispatch Joanna; SLO/latency automations cover longer-term UptimeRobot trends.
 ######################################################################

 input_text:
@@ -227,6 +228,31 @@ template:
            {% endif %}
          {% endfor %}
          {{ ns.count }}
+        attributes:
+          monitored_entities: >-
+            {{ [
+              'binary_sensor.vcloudinfo_com',
+              'binary_sensor.ipmer_com',
+              'binary_sensor.fordst_com',
+              'binary_sensor.www_kingcrafthomes_com'
+            ] }}
+          down_entities: >-
+            {% set ids = [
+              'binary_sensor.vcloudinfo_com',
+              'binary_sensor.ipmer_com',
+              'binary_sensor.fordst_com',
+              'binary_sensor.www_kingcrafthomes_com'
+            ] %}
+            {% set ns = namespace(items=[]) %}
+            {% for id in ids %}
+              {% if expand(id) | count > 0 %}
+                {% set st = states(id) %}
+                {% if st in ['off', 'unknown', 'unavailable'] %}
+                  {% set ns.items = ns.items + [id ~ '=' ~ st] %}
+                {% endif %}
+              {% endif %}
+            {% endfor %}
+            {{ ns.items }}

  - binary_sensor:
      - name: "Infra WAN Quality Degraded"
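
Review note: the new `down_entities` attribute collects `entity_id=state` pairs for monitored website sensors that exist and report `off`, `unknown`, or `unavailable`. The Python below is a rough sketch of that loop, not the template itself. The `states` dictionary stands in for the Home Assistant state machine, and treating a missing key as equivalent to `expand(id) | count == 0` is an assumption.

```python
# Hypothetical model of the down_entities attribute template.
MONITORED = [
    "binary_sensor.vcloudinfo_com",
    "binary_sensor.ipmer_com",
    "binary_sensor.fordst_com",
    "binary_sensor.www_kingcrafthomes_com",
]
DOWN = ("off", "unknown", "unavailable")


def down_entities(monitored: list[str], states: dict[str, str]) -> list[str]:
    """Build 'entity_id=state' entries for monitored entities that are down."""
    items = []
    for entity_id in monitored:
        # Entities that are not present at all are skipped rather than counted as down.
        if entity_id not in states:
            continue
        state = states[entity_id]
        if state in DOWN:
            items.append(f"{entity_id}={state}")
    return items


example_states = {
    "binary_sensor.vcloudinfo_com": "off",
    "binary_sensor.ipmer_com": "on",
    "binary_sensor.fordst_com": "unavailable",
}
result = down_entities(MONITORED, example_states)
```

With the example states, `result` holds the entries for `vcloudinfo_com` and `fordst_com`. `www_kingcrafthomes_com` is omitted because it has no state in the example.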
@@ -417,6 +443,81 @@ automation:
          message: >-
            External IP changed from {{ trigger.from_state.state }} to {{ trigger.to_state.state }}.

+  - alias: "Infrastructure - Website Down Repair And Dispatch"
+    id: infra_website_down_repair_and_dispatch
+    description: "Create/clear Repairs and dispatch Joanna when monitored websites are immediately down."
+    mode: queued
+    trigger:
+      - platform: state
+        entity_id: binary_sensor.infra_website_degraded
+        to: "on"
+        for: "00:05:00"
+        id: degraded
+      - platform: state
+        entity_id: binary_sensor.infra_website_degraded
+        to: "off"
+        id: recovered
+    variables:
+      down_count: "{{ states('sensor.infra_website_down_count') | int(0) }}"
+      down_entities: "{{ state_attr('sensor.infra_website_down_count', 'down_entities') | default([], true) | list }}"
+      down_summary: "{{ down_entities | join(', ') if (down_entities | count > 0) else 'none' }}"
+    action:
+      - choose:
+          - conditions:
+              - condition: template
+                value_template: "{{ trigger.id == 'degraded' and down_count > 0 }}"
+            sequence:
+              - service: repairs.create
+                data:
+                  issue_id: infra_website_down
+                  title: "Website availability degraded"
+                  description: >-
+                    {{ down_count }} monitored website
+                    {{ 'entity is' if down_count == 1 else 'entities are' }} down:
+                    {{ down_summary }}.
+                  severity: error
+                  persistent: true
+              - service: script.joanna_dispatch
+                data:
+                  trigger_context: >-
+                    HA automation infra_website_down_repair_and_dispatch
+                    (Infrastructure - Website Down Repair And Dispatch)
+                  source: "home_assistant_automation.infra_website_down_repair_and_dispatch"
+                  summary: "Monitored website availability degraded ({{ down_count }} down)"
+                  entity_ids:
+                    - binary_sensor.infra_website_degraded
+                    - sensor.infra_website_down_count
+                    - binary_sensor.vcloudinfo_com
+                    - sensor.vcloudinfo_com
+                    - sensor.wordpress_wp_state_2
+                    - switch.wordpress_wp_container_2
+                  diagnostics: >-
+                    down_entities={{ down_summary }};
+                    vcloudinfo_sensor={{ states('sensor.vcloudinfo_com') }};
+                    vcloudinfo_binary={{ states('binary_sensor.vcloudinfo_com') }};
+                    wordpress_state={{ states('sensor.wordpress_wp_state_2') }};
+                    wordpress_switch={{ states('switch.wordpress_wp_container_2') }};
+                    cloudflared_wp={{ states('switch.cloudflared_wp_container_2') }}.
+                  request: >-
+                    Investigate and resolve the monitored website outage. For
+                    vcloudinfo.com, start with public HTTPS reachability,
+                    wordpress_wp, wordpress_db, and cloudflared_wp telemetry.
+                    Verify public HTTP 200 recovery before closing out. Do not
+                    power-cycle unrelated infrastructure.
+              - service: script.send_to_logbook
+                data:
+                  topic: "INTERNET"
+                  message: "Website availability dispatch requested ({{ down_summary }})."
+        default:
+          - service: repairs.remove
+            continue_on_error: true
+            data:
+              issue_id: infra_website_down
+          - service: script.send_to_logbook
+            data:
+              topic: "INTERNET"
+              message: "Website availability recovered."
+
  - alias: "Infrastructure - Website Uptime SLO Repair"
    id: infra_website_uptime_slo_repair
    description: "Create/clear Repairs issue when website 1-day uptime breaches SLO."
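
Review note: the `choose`/`default` structure of `infra_website_down_repair_and_dispatch` can be summarized as a two-way branch. The Python below is a minimal sketch of that control flow under the assumption that `default` runs whenever the single `choose` condition is false. The function name `plan_actions` and the returned strings are illustrative, not Home Assistant APIs.

```python
# Hypothetical model of the automation's branching; returns the ordered steps it would run.
def plan_actions(trigger_id: str, down_count: int, down_entities: list[str]) -> list[str]:
    # Mirrors down_summary: join the entries, or fall back to 'none'.
    summary = ", ".join(down_entities) if down_entities else "none"
    if trigger_id == "degraded" and down_count > 0:
        return [
            "repairs.create",
            "script.joanna_dispatch",
            f"logbook: Website availability dispatch requested ({summary}).",
        ]
    # Default branch: reached on 'recovered', and also on 'degraded' with a zero count.
    return [
        "repairs.remove",
        "logbook: Website availability recovered.",
    ]
```

One consequence of this shape: a `degraded` trigger that arrives while `sensor.infra_website_down_count` reads 0 falls through to the default branch and logs a recovery message. If that is not intended, a second explicit `recovered` condition would be one way to separate the cases.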
@@ -61,6 +61,7 @@ Current automations that kick off automated resolutions (via `script.joanna_disp
 | `infra_monthly_log_hygiene_review` | Infrastructure - Monthly HA Log Hygiene Review | [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/infrastructure.yaml) |
 | `infra_nebula_sync_health_dispatch` | Infrastructure - Nebula Sync Health Dispatch | [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/infrastructure.yaml) |
 | `infra_pihole_iot_dns_drift_dispatch` | Infrastructure - Pi-hole IoT DNS Drift Dispatch | [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/infrastructure.yaml) |
+| `infra_website_down_repair_and_dispatch` | Infrastructure - Website Down Repair And Dispatch | [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/infrastructure.yaml) |
 | `docker_state_sync_repairs_dynamic` | Docker State Sync - Repairs (Dynamic) | [![YAML source: docker_infrastructure](https://img.shields.io/static/v1?label=YAML&message=docker_infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/docker_infrastructure.yaml) |
 | `docker_group_reconcile_weekly_joanna_review` | Docker Group Reconcile - Weekly Joanna Review | [![YAML source: docker_infrastructure](https://img.shields.io/static/v1?label=YAML&message=docker_infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/docker_infrastructure.yaml) |
 | `docker_host_disk_pressure_monitor` | Docker Host Disk Pressure Monitor | [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/infrastructure.yaml) |