diff --git a/config/packages/README.md b/config/packages/README.md
index 4b3d2e86..e536c1af 100755
--- a/config/packages/README.md
+++ b/config/packages/README.md
@@ -49,10 +49,10 @@ Live collection of plug-and-play Home Assistant packages. Each YAML file in this
 | [![YAML source: logbook_activity_feed](https://img.shields.io/static/v1?label=YAML&message=logbook_activity_feed&color=lightgrey&logo=github&logoColor=181717)](logbook_activity_feed.yaml) | Dummy `sensor.activity_feed` + helper to write clean Activity entries (Issue #1550). | `sensor.activity_feed`, `script.send_to_logbook` |
 | [![YAML source: mariadb_monitoring](https://img.shields.io/static/v1?label=YAML&message=mariadb_monitoring&color=lightgrey&logo=github&logoColor=181717)](mariadb_monitoring.yaml) | MariaDB health sensors and Lovelace dashboard snippet for recorder stats. | `sensor.mariadb_status`, `sensor.database_size` |
 | [![YAML source: llmvision](https://img.shields.io/static/v1?label=YAML&message=llmvision&color=lightgrey&logo=github&logoColor=181717)](llmvision.yaml) | Vision-backed garage-can and front-door package checks with rate-limited, downscaled OpenAI calls for package detection. [![Watch on YouTube](https://img.shields.io/badge/Watch-YouTube-FF0000?logo=youtube&logoColor=white)](https://youtu.be/nAhCezFetvI) | `input_button.llmvision_*`, `binary_sensor.front_door_packages_present`, `llmvision.stream_analyzer` |
-| [![YAML source: docker_infrastructure](https://img.shields.io/static/v1?label=YAML&message=docker_infrastructure&color=lightgrey&logo=github&logoColor=181717)](docker_infrastructure.yaml) | Docker host patching telemetry, container/stack Repairs automation, retired Portainer repair cleanup, 20-minute Joanna escalation for persistent container outages using stable configured monitor membership, and weekly scheduled prune actions across docker_10/14/17/69; the dedicated codex_appliance VM is monitored through BearClaw status telemetry. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `repairs.remove`, `script.joanna_dispatch` |
+| [![YAML source: docker_infrastructure](https://img.shields.io/static/v1?label=YAML&message=docker_infrastructure&color=lightgrey&logo=github&logoColor=181717)](docker_infrastructure.yaml) | Docker host patching telemetry, container/stack Repairs automation, retired Portainer repair cleanup, 20-minute Joanna escalation for persistent container outages including stuck `restarting`/`created` states, and weekly scheduled prune actions across docker_10/14/17/69; the dedicated codex_appliance VM is monitored through BearClaw status telemetry. | `sensor.docker_*_apt_status`, `binary_sensor.*_stack_status`, `sensor.docker_stacks_down_count`, `repairs.create`, `repairs.remove`, `script.joanna_dispatch` |
 | [![YAML source: proxmox](https://img.shields.io/static/v1?label=YAML&message=proxmox&color=lightgrey&logo=github&logoColor=181717)](proxmox.yaml) | Proxmox update detection with Repairs, 02:15 Joanna patch orchestration, final per-host HA success notifications, kernel-refresh handoff hints, runtime and disk pressure monitoring, plus nightly Frigate reboot. | `binary_sensor.node_proxmox*_updates_packages`, `sensor.node_proxmox*_total_updates`, `persistent_notification.create`, `script.joanna_dispatch`, `binary_sensor.proxmox*_runtime_healthy`, `sensor.proxmox*_disk_used_percentage`, `button.qemu_docker2_101_reboot` |
 | [![YAML source: synology_dsm](https://img.shields.io/static/v1?label=YAML&message=synology_dsm&color=lightgrey&logo=github&logoColor=181717)](synology_dsm.yaml) | Synology DSM integration health normalization for Carlo-NAS01 and Carlo-NVR, with outage-aware Joanna-first handling for lone post-outage volume warnings and Repairs escalation for persistent or non-outage problems. | `binary_sensor.carlo_*_synology_problem`, `sensor.carlo_*_synology_problem_summary`, `binary_sensor.powerwall_grid_status`, `repairs.create`, `script.joanna_dispatch` |
-| [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health, Nebula Sync and promoted IoT primary/backup Pi-hole consistency monitoring with Joanna dispatch, Glances-backed Docker host disk pressure with Joanna-only warning cleanup and critical Repairs, and website uptime/latency SLO signals for Infrastructure dashboards, plus nightly backup verification and monthly Joanna HA log hygiene review with public-safe GitHub issue follow-up. | `sensor.infra_nebula_sync_dns_consistency`, `sensor.infra_pihole_iot_dns_consistency`, `binary_sensor.infra_nebula_sync_degraded`, `binary_sensor.infra_pihole_iot_dns_degraded`, `sensor.docker_*_disk_used_percentage`, `automation.infra_nebula_sync_health_dispatch`, `automation.infra_pihole_iot_dns_drift_dispatch`, `automation.docker_host_disk_pressure_monitor`, `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `script.joanna_dispatch` |
+| [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](infrastructure.yaml) | Normalized WAN/DNS/backup/domain/cert health, Nebula Sync and promoted IoT primary/backup Pi-hole consistency monitoring with Joanna dispatch, Glances-backed Docker host disk pressure with Joanna-only warning cleanup and critical Repairs, immediate website-down Repairs/Joanna dispatch plus uptime/latency SLO signals, nightly backup verification, and monthly Joanna HA log hygiene review with public-safe GitHub issue follow-up. | `sensor.infra_nebula_sync_dns_consistency`, `sensor.infra_pihole_iot_dns_consistency`, `binary_sensor.infra_nebula_sync_degraded`, `binary_sensor.infra_pihole_iot_dns_degraded`, `sensor.docker_*_disk_used_percentage`, `automation.infra_nebula_sync_health_dispatch`, `automation.infra_pihole_iot_dns_drift_dispatch`, `automation.docker_host_disk_pressure_monitor`, `automation.infra_website_down_repair_and_dispatch`, `binary_sensor.infra_website_uptime_slo_breach`, `binary_sensor.infra_website_latency_degraded`, `automation.infra_backup_nightly_verification`, `script.joanna_dispatch` |
 | [![YAML source: onenote_indexer](https://img.shields.io/static/v1?label=YAML&message=onenote_indexer&color=lightgrey&logo=github&logoColor=181717)](onenote_indexer.yaml) | Dedicated-appliance OneNote indexer health/status monitoring for Joanna, explicit index-health confirmation, failure-repair automation, and a daily duplicate-delete maintenance request. | `sensor.onenote_indexer_last_job_status`, `binary_sensor.onenote_indexer_last_job_successful`, `binary_sensor.onenote_indexer_index_healthy` |
 | [![YAML source: mqtt_status](https://img.shields.io/static/v1?label=YAML&message=mqtt_status&color=lightgrey&logo=github&logoColor=181717)](mqtt_status.yaml) | Command-line MQTT broker reachability probe with Spook Repairs escalation and Joanna troubleshooting dispatch on outage. | `binary_sensor.mqtt_status_raw`, `binary_sensor.mqtt_broker_problem`, `repairs.create`, `rest_command.bearclaw_command` |
 | [![YAML source: mariadb](https://img.shields.io/static/v1?label=YAML&message=mariadb&color=lightgrey&logo=github&logoColor=181717)](mariadb.yaml) | MariaDB recorder health and capacity snapshots with hourly live metrics, weekly admin/recorder polling, and stats-ready numeric sensors. | `sensor.mariadb_status`, `sensor.database_size` |
diff --git a/config/packages/docker_infrastructure.yaml b/config/packages/docker_infrastructure.yaml
index aa34ee4b..112e7835 100644
--- a/config/packages/docker_infrastructure.yaml
+++ b/config/packages/docker_infrastructure.yaml
@@ -18,6 +18,7 @@
 # Notes: Tapple is now served by `games_hub` on `/tapple/`; do not keep a standalone `tapple` container switch in the monitored group.
 # Notes: Teslamate and crystalsoftwashsolutions are live services and should remain in the monitored group when their discovery switches are present.
 # Notes: Treat telemetry reconnects from unavailable/unknown to a concrete stopped state as actionable outages.
+# Notes: Treat stuck `restarting` and `created` states as down so monitored containers dispatch remediation.
 # Notes: Infra Info was removed; BearClaw Admin is the planning snapshot surface.
 # Notes: codex_appliance moved to a dedicated VM; keep the standard codex_appliance switches and retire the legacy hashed discovery entity when it disappears.
 # Notes: Paige's Bookshelf is a live monitored service and should remain in the group when its discovery switch is present.
@@ -471,7 +472,7 @@ template:
               {% endfor %}
             {% endif %}
             {% set effective_state = resolver.state %}
-            {% if effective_state in ['off', 'stopped'] %}
+            {% if effective_state in ['off', 'stopped', 'exited', 'dead', 'restarting', 'created'] %}
               {% set ns.down = ns.down + [key] %}
             {% elif not telemetry_degraded and effective_state in ['unknown', 'unavailable'] %}
               {% set ns.down = ns.down + [key] %}
@@ -515,7 +516,7 @@ template:
               {% endfor %}
             {% endif %}
             {% set effective_state = resolver.state %}
-            {% if effective_state in ['off', 'stopped'] %}
+            {% if effective_state in ['off', 'stopped', 'exited', 'dead', 'restarting', 'created'] %}
               {% set ns.down = ns.down + [key] %}
             {% elif not telemetry_degraded and effective_state in ['unknown', 'unavailable'] %}
               {% set ns.down = ns.down + [key] %}
@@ -596,7 +597,7 @@ script:
         example: true
     sequence:
       - variables:
-          down_states: ['off', 'stopped', 'exited', 'dead', 'unknown', 'unavailable']
+          down_states: ['off', 'stopped', 'exited', 'dead', 'restarting', 'created', 'unknown', 'unavailable']
           src_entity: "{{ entity_id | default('', true) }}"
           op: "{{ operation | default('create', true) | lower }}"
           wait_minutes: "{{ delay_minutes | default(0) | int(0) }}"
@@ -1046,13 +1047,13 @@ automation:
               value_template: "{{ is_monitored_container_event }}"
           sequence:
             - variables:
-                down_states: ['off', 'stopped', 'exited', 'dead', 'unknown', 'unavailable']
+                down_states: ['off', 'stopped', 'exited', 'dead', 'restarting', 'created', 'unknown', 'unavailable']
             - choose:
                 - conditions: >-
                     {{ new_state in down_states and
                        (old_state not in down_states or
                         (old_state in ['unknown', 'unavailable'] and
-                         new_state in ['off', 'stopped', 'exited', 'dead'])) and
+                         new_state in ['off', 'stopped', 'exited', 'dead', 'restarting', 'created'])) and
                        not (is_state('binary_sensor.docker_container_telemetry_degraded', 'on') and
                             new_state in ['unknown', 'unavailable']) }}
                   sequence:
diff --git a/config/packages/infrastructure.yaml b/config/packages/infrastructure.yaml
index 75b26f7d..f1c98377 100644
--- a/config/packages/infrastructure.yaml
+++ b/config/packages/infrastructure.yaml
@@ -19,6 +19,7 @@
 # Notes: Warning-level Docker host disk pressure is Joanna-only; Repairs are reserved for critical pressure.
 # Notes: Nebula Sync DNS consistency compares primary/backup Pi-hole answers and dispatches Joanna on sustained drift or container loss.
 # Notes: Promoted IoT DNS consistency compares primary/backup Pi-hole answers for reserved IoT host records.
+# Notes: Immediate website-down states create Repairs and dispatch Joanna; SLO/latency automations cover longer-term UptimeRobot trends.
 ######################################################################
 
 input_text:
@@ -227,6 +228,31 @@ template:
             {% endif %}
           {% endfor %}
           {{ ns.count }}
+        attributes:
+          monitored_entities: >-
+            {{ [
+              'binary_sensor.vcloudinfo_com',
+              'binary_sensor.ipmer_com',
+              'binary_sensor.fordst_com',
+              'binary_sensor.www_kingcrafthomes_com'
+            ] }}
+          down_entities: >-
+            {% set ids = [
+              'binary_sensor.vcloudinfo_com',
+              'binary_sensor.ipmer_com',
+              'binary_sensor.fordst_com',
+              'binary_sensor.www_kingcrafthomes_com'
+            ] %}
+            {% set ns = namespace(items=[]) %}
+            {% for id in ids %}
+              {% if expand(id) | count > 0 %}
+                {% set st = states(id) %}
+                {% if st in ['off', 'unknown', 'unavailable'] %}
+                  {% set ns.items = ns.items + [id ~ '=' ~ st] %}
+                {% endif %}
+              {% endif %}
+            {% endfor %}
+            {{ ns.items }}
 
   - binary_sensor:
       - name: "Infra WAN Quality Degraded"
@@ -417,6 +443,81 @@ automation:
           message: >-
             External IP changed from {{ trigger.from_state.state }} to {{ trigger.to_state.state }}.
 
+  - alias: "Infrastructure - Website Down Repair And Dispatch"
+    id: infra_website_down_repair_and_dispatch
+    description: "Create/clear Repairs and dispatch Joanna when monitored websites are immediately down."
+    mode: queued
+    trigger:
+      - platform: state
+        entity_id: binary_sensor.infra_website_degraded
+        to: "on"
+        for: "00:05:00"
+        id: degraded
+      - platform: state
+        entity_id: binary_sensor.infra_website_degraded
+        to: "off"
+        id: recovered
+    variables:
+      down_count: "{{ states('sensor.infra_website_down_count') | int(0) }}"
+      down_entities: "{{ state_attr('sensor.infra_website_down_count', 'down_entities') | default([], true) | list }}"
+      down_summary: "{{ down_entities | join(', ') if (down_entities | count > 0) else 'none' }}"
+    action:
+      - choose:
+          - conditions:
+              - condition: template
+                value_template: "{{ trigger.id == 'degraded' and down_count > 0 }}"
+            sequence:
+              - service: repairs.create
+                data:
+                  issue_id: infra_website_down
+                  title: "Website availability degraded"
+                  description: >-
+                    {{ down_count }} monitored website
+                    {{ 'entity is' if down_count == 1 else 'entities are' }} down:
+                    {{ down_summary }}.
+                  severity: error
+                  persistent: true
+              - service: script.joanna_dispatch
+                data:
+                  trigger_context: >-
+                    HA automation infra_website_down_repair_and_dispatch
+                    (Infrastructure - Website Down Repair And Dispatch)
+                  source: "home_assistant_automation.infra_website_down_repair_and_dispatch"
+                  summary: "Monitored website availability degraded ({{ down_count }} down)"
+                  entity_ids:
+                    - binary_sensor.infra_website_degraded
+                    - sensor.infra_website_down_count
+                    - binary_sensor.vcloudinfo_com
+                    - sensor.vcloudinfo_com
+                    - sensor.wordpress_wp_state_2
+                    - switch.wordpress_wp_container_2
+                  diagnostics: >-
+                    down_entities={{ down_summary }};
+                    vcloudinfo_sensor={{ states('sensor.vcloudinfo_com') }};
+                    vcloudinfo_binary={{ states('binary_sensor.vcloudinfo_com') }};
+                    wordpress_state={{ states('sensor.wordpress_wp_state_2') }};
+                    wordpress_switch={{ states('switch.wordpress_wp_container_2') }};
+                    cloudflared_wp={{ states('switch.cloudflared_wp_container_2') }}.
+                  request: >-
+                    Investigate and resolve the monitored website outage. For
+                    vcloudinfo.com, start with public HTTPS reachability,
+                    wordpress_wp, wordpress_db, and cloudflared_wp telemetry.
+                    Verify public HTTP 200 recovery before closing out. Do not
+                    power-cycle unrelated infrastructure.
+              - service: script.send_to_logbook
+                data:
+                  topic: "INTERNET"
+                  message: "Website availability dispatch requested ({{ down_summary }})."
+        default:
+          - service: repairs.remove
+            continue_on_error: true
+            data:
+              issue_id: infra_website_down
+          - service: script.send_to_logbook
+            data:
+              topic: "INTERNET"
+              message: "Website availability recovered."
+
   - alias: "Infrastructure - Website Uptime SLO Repair"
     id: infra_website_uptime_slo_repair
     description: "Create/clear Repairs issue when website 1-day uptime breaches SLO."
diff --git a/config/script/README.md b/config/script/README.md
index 51384c02..5d8d113d 100755
--- a/config/script/README.md
+++ b/config/script/README.md
@@ -61,6 +61,7 @@ Current automations that kick off automated resolutions (via `script.joanna_disp
 | `infra_monthly_log_hygiene_review` | Infrastructure - Monthly HA Log Hygiene Review | [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/infrastructure.yaml) |
 | `infra_nebula_sync_health_dispatch` | Infrastructure - Nebula Sync Health Dispatch | [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/infrastructure.yaml) |
 | `infra_pihole_iot_dns_drift_dispatch` | Infrastructure - Pi-hole IoT DNS Drift Dispatch | [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/infrastructure.yaml) |
+| `infra_website_down_repair_and_dispatch` | Infrastructure - Website Down Repair And Dispatch | [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/infrastructure.yaml) |
 | `docker_state_sync_repairs_dynamic` | Docker State Sync - Repairs (Dynamic) | [![YAML source: docker_infrastructure](https://img.shields.io/static/v1?label=YAML&message=docker_infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/docker_infrastructure.yaml) |
 | `docker_group_reconcile_weekly_joanna_review` | Docker Group Reconcile - Weekly Joanna Review | [![YAML source: docker_infrastructure](https://img.shields.io/static/v1?label=YAML&message=docker_infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/docker_infrastructure.yaml) |
 | `docker_host_disk_pressure_monitor` | Docker Host Disk Pressure Monitor | [![YAML source: infrastructure](https://img.shields.io/static/v1?label=YAML&message=infrastructure&color=lightgrey&logo=github&logoColor=181717)](../packages/infrastructure.yaml) |