Architecture¶

Glint is a single Go binary that polls Proxmox APIs, caches results in memory, persists history to SQLite, and serves a server-rendered dashboard via htmx.

System Overview¶

                      Browser (htmx polling every 15s)
                              |
                        HTTPS (reverse proxy)
                              |
                  +-----------+-----------+
                  |         Glint         |
                  |   Go binary (Docker)  |
                  |                       |
                  |  templ HTML handlers  |
                  |         |             |
                  |  Collector Engine     |
                  |  (worker pool)        |
                  |         |             |
                  |  In-Memory Cache      |
                  |  (sync.RWMutex)       |
                  |         |             |
                  |  SQLite               |
                  |  (metadata + history) |
                  +-----------+-----------+
                   /          |          \
        PVE Instance 1   PVE Instance 2   PBS Instance 1
        ├─ Node A        ├─ Node C        ├─ Datastore X
        │  ├─ CTs/VMs    │  ├─ CTs/VMs    └─ Datastore Y
        │  └─ Disks      │  └─ Disks
        └─ Node B        └─ Node D
                                              ntfy (optional)

Data Flow¶

Collector goroutines (one per PVE/PBS instance) submit API calls to a bounded worker pool
Each collector updates the in-memory cache and appends snapshots to SQLite
HTTP handlers read from cache snapshots (lock-free for consumers)
htmx polls fragment endpoints every 15s, swapping HTML in-place
Alerter goroutine evaluates cache state against rules, sends to ntfy with deduplication

Key Design Decisions¶

From Pulse¶

Instance + Node hierarchy --- each PVE config entry is an "instance" (standalone or cluster). Nodes are auto-discovered via GET /nodes. Guests are scoped by cluster name to prevent duplicates.
Per-instance collectors with worker pool --- all API calls go through a bounded worker pool (chan struct{} semaphore, default 4 workers) to cap memory and concurrent HTTP connections.
Snapshot-based cache --- the cache exposes deep-copy snapshots so HTTP handlers never hold locks. Uses sync.RWMutex (not sync.Map) for atomic cross-key consistency.

From Scrutiny¶

WWN-based disk identity --- disks are identified by World Wide Name (globally unique, survives reboots and cable changes), not /dev/sdX.
Backblaze-derived SMART thresholds --- individual attributes are evaluated against real-world failure rate data from Backblaze's public hard drive statistics.
Protocol-aware SMART parsing --- ATA, NVMe, and SCSI each have different data models, parsed separately.

Project Structure¶

cmd/glint/main.go              Entry point, wiring, signal handling
internal/
  api/                         HTTP handlers + htmx fragments
    handlers.go                Route registration + fragment handlers
    middleware.go              Logging, recovery
  collector/                   Data collection from Proxmox APIs
    collector.go               Collector interface + runner loop
    backup.go                  BackupCollector interface
    errors.go                  RetryableError, behavior-based error types
    pve.go                     PVE client (nodes, guests, disks, SMART)
    pbs.go                     PBS client (datastores, snapshots, tasks)
    temperature.go             Optional SSH-based temp polling
  smart/                       S.M.A.R.T. health assessment
    evaluate.go                Attribute status evaluation
    thresholds.go              Backblaze failure rate lookup tables
    ata.go                     ATA attribute parsing
    nvme.go                    NVMe text field parsing
  store/                       SQLite persistence
    store.go                   Repository (insert, query, migrate)
    migrations.go              Embedded schema
    pruner.go                  Retention cleanup
  cache/                       Thread-safe in-memory state
    cache.go                   Multi-instance cache with snapshots
  alerter/                     Alert rule engine
    alerter.go                 Rule evaluation + deduplication
    templates.go               Default message templates
  notify/                      Notification providers
    provider.go                Provider interface + Notification struct
    ntfy.go                    ntfy provider
    webhook.go                 Generic webhook provider
  config/                      Configuration loading
    config.go                  YAML + env var parsing
  model/                       Shared domain types
    model.go                   Node, Guest, Disk, Backup, etc.
templates/                     templ HTML templates
  layout.templ                 Base HTML shell
  dashboard.templ              Main page
  nodes.templ                  Node cards
  guests.templ                 Guest table
  backups.templ                PBS backup panel
  disks.templ                  Disk health table
  disk_detail.templ            Expanded SMART attributes
  components/                  Reusable UI components
static/                        CSS + htmx.min.js

Collector Engine¶

Each PVE/PBS instance gets its own collector goroutine. All collectors share a single worker pool that bounds concurrent API calls.

Worker Pool¶

type WorkerPool struct {
    sem chan struct{}  // buffered channel as semaphore
}

Default: 4 workers. Configurable via worker_pool_size. When polling 4 nodes in parallel, only 4 HTTP requests fly at once --- prevents unbounded goroutine spawning.

PVE Poll Cycle¶

1. GET /nodes → discover/update node list
2. For each online node (fan out via worker pool):
   a. GET /nodes/{node}/status → host metrics
   b. GET /nodes/{node}/lxc → containers
   c. GET /nodes/{node}/qemu → VMs
   d. If disk poll due (>1h since last):
      - GET /nodes/{node}/disks/list → disk inventory
      - For each disk: GET /nodes/{node}/disks/smart → SMART data
3. Merge results, dedup guests by cluster_id
4. Update cache + write to SQLite

PBS Poll Cycle¶

1. GET /status/datastore-usage → all datastores
2. For each monitored datastore:
   a. GET /admin/datastore/{store}/snapshots → backup snapshots
3. GET /nodes/localhost/tasks → recent tasks
4. Update cache + write to SQLite

Polling Intervals¶

Data	Interval	Notes
Node + guest metrics	15s	Fan out across nodes in parallel
S.M.A.R.T. disk data	1h	Slow operation (1-5s per disk)
PBS backups + tasks	5m
Node discovery	5m	Within PVE collector
SSH temperatures	60s	Graceful fallback if unavailable

In-Memory Cache¶

The cache stores the latest state from all collectors. HTTP handlers read snapshots (deep copies) so they never hold locks during template rendering.

type Cache struct {
    mu sync.RWMutex

    Nodes      map[string]map[string]*Node        // [instance][node]
    Guests     map[string]map[int]*Guest           // [cluster_id][vmid]
    Disks      map[string]*Disk                    // [wwn]
    Datastores map[string]map[string]*DatastoreStatus
    Backups    map[string]map[string]*Backup
    Tasks      map[string][]*PBSTask
    LastPoll   map[string]time.Time
}

Why sync.RWMutex over sync.Map: rendering a dashboard page requires consistent data across nodes, guests, and disks simultaneously. sync.Map provides no cross-key consistency guarantees.

SQLite Schema¶

Time-Series Tables¶

All time-series tables use WITHOUT ROWID with ts-leading composite primary keys. This makes them clustered B-trees ordered by time --- ideal for range queries and pruning.

Table	Retention	Primary Key
`node_snapshots`	48h	`(ts, instance, node)`
`guest_snapshots`	48h	`(ts, instance, vmid)`
`smart_snapshots`	30d	`(ts, wwn)`
`backup_snapshots`	7d	`(ts, pbs_instance, backup_id, backup_time)`
`datastore_snapshots`	7d	`(ts, pbs_instance, store_name)`
`alert_log`	30d	`(id)` autoincrement

Pruner¶

An hourly goroutine deletes rows older than the retention period. The ts-leading PK means DELETE FROM X WHERE ts < ? is a fast range scan on the clustered index.

S.M.A.R.T. Health Assessment¶

Disk health uses a multi-level evaluation inspired by Scrutiny:

Status Bitfield¶

Bit	Value	Meaning
0	`StatusPassed`	All checks passed
1	`StatusFailedSmart`	Manufacturer SMART says `FAILING_NOW`
2	`StatusWarnScrutiny`	Backblaze data suggests elevated risk
4	`StatusFailedScrutiny`	Backblaze data suggests high failure probability
8	`StatusUnknown`	Disk disappeared / unreachable
16	`StatusInternalError`	Parse failure, API timeout

Evaluation Order¶

Manufacturer SMART says FAILING_NOW → StatusFailedSmart
IN_THE_PAST → StatusWarnScrutiny
Look up raw value in Backblaze thresholds:
- Critical attribute (IDs 5, 10, 187, 188, 196, 197, 198) + failure rate >= 10% → StatusFailedScrutiny
- Non-critical + failure rate >= 20% → StatusFailedScrutiny
- Non-critical + failure rate >= 10% → StatusWarnScrutiny
Device status = bitwise OR of all attribute statuses

NVMe Handling¶

NVMe drives don't return ATA-style attributes. Glint parses the raw smartctl text output for NVMe-specific fields (critical_warning, available_spare, percentage_used, media_errors, etc.) and applies NVMe-specific thresholds.

HTTP Routes¶

Route	Type	Refresh	Description
`GET /`	Full page	---	Dashboard shell
`GET /fragments/nodes`	htmx	15s	All node cards with sparklines
`GET /fragments/guests`	htmx	15s	Guest table (all instances)
`GET /fragments/backups`	htmx	60s	PBS backup status + tasks
`GET /fragments/disks`	htmx	300s	S.M.A.R.T. health (all nodes)
`GET /fragments/disk/{wwn}`	htmx	on-click	Expanded attributes for one disk
`GET /api/sparkline/node/{instance}/{node}`	JSON	on-demand	Node sparkline data
`GET /api/sparkline/guest/{instance}/{vmid}`	JSON	on-demand	Guest sparkline data
`GET /healthz`	JSON	---	Health check

Alerting¶

The alerter goroutine evaluates the cache state against configured rules on each poll cycle. Alerts are deduplicated with per-rule cooldowns to prevent notification storms.

Built-in Rules¶

Alert	Default Condition	Cooldown
Node CPU high	> 90% for 5min	1h
Node offline	status != "online"	30min
Guest down	not running for 2min	30min
Backup stale	last backup > 36h	6h
Backup failed	PBS task error	1h
Disk SMART failed	manufacturer failure	6h
Datastore full	> 85% used	6h

Notification Providers¶

Providers implement a common interface:

type Provider interface {
    Name() string
    Send(ctx context.Context, n Notification) error
}

Built-in providers: ntfy (with priority mapping and tags) and webhook (generic JSON POST to any URL).

Lifecycle Management¶

All goroutines accept a context.Context and exit when cancelled. The main function uses errgroup to manage the lifecycle:

g, ctx := errgroup.WithContext(ctx)
g.Go(func() error { return pveCollector.Run(ctx) })
g.Go(func() error { return pbsCollector.Run(ctx) })
g.Go(func() error { return pruner.Run(ctx) })
g.Go(func() error { return server.Run(ctx) })

Graceful shutdown on SIGINT/SIGTERM: collectors stop polling, in-flight requests complete, SQLite is closed cleanly.