# Messaging Update: Queue-Mediated Agent Protocol
## Problem Statement
How might we ensure GitHub events dispatched to agent containers are processed reliably, even when agents are temporarily unavailable?
## Recommended Direction
The harness owns a task queue (SQLite by default, pluggable interface).
Events are enqueued before any dispatch attempt — the queue is the source of truth.
POST /task → 202 Accepted becomes a nudge ("check your queue now"), not a delivery mechanism.
Agents poll the queue at startup, on a background interval, and when nudged.
Results flow symmetrically: the agent writes its DecisionMessage back to the queue,
then calls POST /harness/result → 202 Accepted to nudge the harness.
The harness also has a background task that periodically checks for completed tasks.
HTTP nudges are optimizations that degrade gracefully — the queue always wins.
This preserves the core constraint: the harness owns all infrastructure.
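A minimal sketch of that enqueue-then-nudge flow on the harness side, assuming httpx as the HTTP client and the `task_queue` columns listed in the MVP scope below; the status value and endpoint path are illustrative. The point is the ordering: the row is committed before the nudge is attempted, so a failed POST costs nothing.

```python
import json
import sqlite3
import uuid

import httpx  # assumption: any async-capable HTTP client works here


async def dispatch(db: sqlite3.Connection, agent_url: str, payload: dict) -> str:
    """Enqueue first (the queue is the source of truth), then send a best-effort nudge."""
    task_id = str(uuid.uuid4())
    db.execute(
        "INSERT INTO task_queue (task_id, agent_url, status, payload) "
        "VALUES (?, ?, 'pending', ?)",
        (task_id, agent_url, json.dumps(payload)),
    )
    db.commit()

    # The nudge is an optimization only: if the agent is down, the task stays
    # 'pending' and the agent picks it up at startup or on its next poll.
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            await client.post(f"{agent_url}/task", json={"task_id": task_id})
    except httpx.HTTPError:
        pass  # the queue wins; no retry logic needed here

    return task_id
```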
Agents embed a thin foreman-client library that handles queue I/O.
Agent authors call client.next_task() and client.complete_task(task_id, decision).
They don't implement queue management.
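A sketch of what agent-author code could look like against that client; `next_task()` and `complete_task()` are the calls named above, while the `ForemanClient` constructor, the returned task object's attributes, and the polling-loop shape are assumptions for illustration (a real agent would also let the HTTP nudge interrupt the sleep).

```python
import time

from foreman_client import ForemanClient  # hypothetical module name for the foreman-client lib

client = ForemanClient(harness_url="http://harness:8080", agent_id="review-agent")


def handle(payload: dict) -> dict:
    # Placeholder for the agent's actual decision logic.
    return {"action": "comment", "body": f"processed {payload.get('event')}"}


def run(poll_interval: float = 30.0) -> None:
    """Poll the queue at startup and on a fixed interval; all queue I/O stays in the client."""
    while True:
        task = client.next_task()  # claims the oldest pending task, or returns None
        if task is None:
            time.sleep(poll_interval)
            continue
        decision = handle(task.payload)
        client.complete_task(task.task_id, decision)


if __name__ == "__main__":
    run()
```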
## Key Assumptions to Validate
- SQLite with WAL mode handles concurrent harness writes + agent reads without contention — benchmark before committing the schema (a rough smoke test is sketched after this list)
- Agents are Python (or can embed a Python client) — validate the agent container build process supports a shared library dependency
- 202 nudge + background poll provides acceptable end-to-end latency — define "acceptable" explicitly (target: < 30s for MVP)
- One agent per queue is sufficient for MVP — the queue abstraction must not bake in single-consumer assumptions that block future fan-out
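One rough way to exercise the first assumption before the schema is final: enable WAL on a throwaway database and hammer it with one writer and several readers on separate connections. The table and iteration counts below are placeholders, not the real workload.

```python
import sqlite3
import threading

DB = "wal_smoke_test.db"

# One-time setup: WAL mode lets readers proceed while the writer commits.
setup = sqlite3.connect(DB)
setup.execute("PRAGMA journal_mode=WAL")
setup.execute("CREATE TABLE IF NOT EXISTS t (n INTEGER)")
setup.commit()
setup.close()


def writer() -> None:
    conn = sqlite3.connect(DB, timeout=30)
    for i in range(1000):
        conn.execute("INSERT INTO t (n) VALUES (?)", (i,))
        conn.commit()
    conn.close()


def reader() -> None:
    conn = sqlite3.connect(DB, timeout=30)
    for _ in range(1000):
        conn.execute("SELECT COUNT(*) FROM t").fetchone()
    conn.close()


threads = [threading.Thread(target=writer)] + [threading.Thread(target=reader) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("finished without 'database is locked' errors")
```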
## MVP Scope
In:
- `task_queue` table in existing `memory.db`: task_id, agent_url, status, payload, created_at, claimed_at, completed_at, result, retry_count (a schema sketch follows this list)
- Harness writes: enqueue on poll event; `POST /task → 202` nudge to agent; `POST /result` endpoint for agent callback; background drain loop for completed tasks; re-enqueue tasks claimed but not completed within timeout
- Harness reads: poll queue for completed tasks on callback + interval
- `foreman-client` lib: `next_task()`, `complete_task(task_id, decision)`, `heartbeat(task_id)` — heartbeat resets the claim timeout clock
- Agent protocol: `POST /task → 202` (nudge only); startup queue poll; configurable background poll interval
- Delivery guarantee: at-least-once; task_id is the idempotency key
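One possible shape for the `task_queue` table and the queries behind claiming and re-enqueueing, using only the columns listed above. The status values, the `RETURNING` claim query (SQLite 3.35+), and the 10-minute timeout are placeholders pending the claim-timeout question under Open Questions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS task_queue (
    task_id      TEXT PRIMARY KEY,                        -- idempotency key
    agent_url    TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'pending',         -- e.g. pending | claimed | completed
    payload      TEXT NOT NULL,                           -- JSON-encoded GitHub event
    created_at   TEXT NOT NULL DEFAULT (datetime('now')),
    claimed_at   TEXT,
    completed_at TEXT,
    result       TEXT,                                    -- JSON-encoded DecisionMessage
    retry_count  INTEGER NOT NULL DEFAULT 0
);
"""

# What next_task() might run under the hood: claim the oldest pending task.
CLAIM = """
UPDATE task_queue
SET status = 'claimed', claimed_at = datetime('now')
WHERE task_id = (
    SELECT task_id FROM task_queue WHERE status = 'pending' ORDER BY created_at LIMIT 1
)
RETURNING task_id, payload;
"""

# Background sweep: return claimed-but-unfinished tasks to pending after a timeout.
REENQUEUE = """
UPDATE task_queue
SET status = 'pending', claimed_at = NULL, retry_count = retry_count + 1
WHERE status = 'claimed' AND claimed_at < datetime('now', '-10 minutes');
"""

conn = sqlite3.connect("memory.db")
conn.executescript(SCHEMA)
conn.close()
```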
Out:
- Multiple agent containers per queue (no consumer groups in MVP)
- External queue backends (Redis, NATS) — define the pluggable interface (sketched after this list), implement SQLite only
- Task prioritization or ordering beyond FIFO
- Monitoring UI — structured log output only
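To keep the backend pluggable without baking in single-consumer assumptions, the queue boundary could be as small as the Protocol below; the method names and signatures are illustrative, not decided.

```python
from typing import Any, Optional, Protocol


class TaskQueue(Protocol):
    """Backend-agnostic queue interface. Nothing here assumes a single consumer:
    claim() is atomic per task, so a future fan-out to multiple agent containers
    only needs a backend (SQLite now, Redis/NATS later) that honors that atomicity."""

    def enqueue(self, task_id: str, agent_url: str, payload: dict[str, Any]) -> None:
        """Insert a pending task; task_id doubles as the idempotency key."""
        ...

    def claim(self, agent_url: str) -> Optional[dict[str, Any]]:
        """Atomically claim and return the oldest pending task for agent_url, or None."""
        ...

    def complete(self, task_id: str, result: dict[str, Any]) -> None:
        """Store the DecisionMessage and mark the task completed."""
        ...

    def requeue_stale(self, timeout_seconds: int) -> int:
        """Return claimed-but-unfinished tasks to pending; report how many were reset."""
        ...
```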
## Not Doing (and Why)
- Agent-owned queues — every agent author would reimplement queue logic; harness owns infrastructure
- Exactly-once delivery — requires distributed coordination; at-least-once + idempotency is sufficient and far simpler
- File-system queuing — ephemeral in containers; shared volumes add deployment surface for no real gain over SQLite
- Keeping synchronous dispatch as a fallback — two delivery paths mean neither is authoritative; commit fully to queue-first
## Open Questions
- What is the claim timeout? If an agent pulls a task and crashes before completing, the harness must detect and re-enqueue it — define the TTL and re-enqueue logic before writing the schema.
- Is `foreman-client` a separate PyPI package, part of the `foreman` package, or vendored into each agent at build time?
- Should `GET /queue/status` be exposed on the harness for operator visibility, or is structured logging sufficient for MVP?