Messaging Update: Queue-Mediated Agent Protocol

Problem Statement

How might we ensure GitHub events dispatched to agent containers are processed reliably, even when agents are temporarily unavailable?

The harness owns a task queue (SQLite by default, pluggable interface). Events are enqueued before any dispatch attempt — the queue is the source of truth. POST /task → 202 Accepted becomes a nudge ("check your queue now"), not a delivery mechanism. Agents poll the queue at startup, on a background interval, and when nudged.
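The three poll triggers named above (startup, background interval, nudge) can share a single loop. The following is a minimal sketch only: PollTrigger and the drain callable are hypothetical names, not part of foreman-client, and the HTTP handler that would call nudge() before returning 202 is left out.

```python
import threading


class PollTrigger:
    """Hypothetical helper: runs drain() at startup, on an interval, and on nudge."""

    def __init__(self, interval_s: float) -> None:
        self.interval_s = interval_s
        self._nudge = threading.Event()
        self._stop = threading.Event()

    def nudge(self) -> None:
        # Would be called by the POST /task handler; it only wakes the loop.
        self._nudge.set()

    def stop(self) -> None:
        self._stop.set()
        self._nudge.set()  # wake the loop so it can exit promptly

    def run(self, drain) -> None:
        drain()  # startup poll
        while not self._stop.is_set():
            # Returns early on a nudge, otherwise after interval_s seconds.
            self._nudge.wait(timeout=self.interval_s)
            self._nudge.clear()
            if self._stop.is_set():
                break
            drain()
```

Because the queue is the source of truth, a lost nudge costs at most one poll interval of latency: the next interval tick picks the task up anyway.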

Results flow symmetrically: the agent writes its DecisionMessage back to the queue, then sends POST /harness/result (→ 202 Accepted) to nudge the harness. The harness also runs a background task that periodically checks for completed tasks. HTTP nudges are optimizations that degrade gracefully: the queue always wins.

This preserves the core constraint: the harness owns all infrastructure. Agents embed a thin foreman-client library that handles queue I/O. Agent authors call client.next_task() and client.complete_task(task_id, decision). They don't implement queue management.
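From the agent author's side, the loop might look like the sketch below. It assumes only the two calls named above, next_task() and complete_task(task_id, decision). The in-memory client, the task dictionary shape ("task_id", "payload"), and decide() are hypothetical stand-ins so the example runs on its own; the real library would perform queue I/O.

```python
from collections import deque
from typing import Optional


class InMemoryForemanClient:
    """Hypothetical stand-in for foreman-client, backed by a deque."""

    def __init__(self, tasks) -> None:
        self._pending = deque(tasks)
        self.completed = {}

    def next_task(self) -> Optional[dict]:
        # Returns the oldest pending task, or None when the queue is empty.
        return self._pending.popleft() if self._pending else None

    def complete_task(self, task_id: str, decision: dict) -> None:
        self.completed[task_id] = decision


def decide(payload: dict) -> dict:
    """Placeholder agent logic; a real agent would inspect the GitHub event."""
    return {"action": "noop", "event": payload.get("event")}


def drain(client) -> int:
    """Process tasks until the queue is empty; returns the number handled."""
    handled = 0
    while (task := client.next_task()) is not None:
        client.complete_task(task["task_id"], decide(task["payload"]))
        handled += 1
    return handled
```

The agent code never touches SQL, HTTP retries, or claim bookkeeping; everything queue-related stays behind the two client calls.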

Key Assumptions to Validate

  • SQLite with WAL mode handles concurrent harness writes + agent reads without contention — benchmark before committing the schema
  • Agents are Python (or can embed a Python client) — validate the agent container build process supports a shared library dependency
  • 202 nudge + background poll provides acceptable end-to-end latency — define "acceptable" explicitly (target: < 30s for MVP)
  • One agent per queue is sufficient for MVP — the queue abstraction must not bake in single-consumer assumptions that block future fan-out
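As a starting point for the first assumption, a smoke check (not a benchmark) can confirm that WAL mode is actually enabled and that a reader sees the last committed state while a write is in flight. This is a sketch using Python's standard sqlite3 module; the table t, the function names, and the 5-second busy timeout are illustrative assumptions.

```python
import os
import sqlite3
import tempfile


def wal_smoke_check(path):
    """Return (journal_mode, rows seen before commit, rows seen after commit)."""
    writer = sqlite3.connect(path, timeout=5.0)
    reader = sqlite3.connect(path, timeout=5.0)
    # WAL needs a file-backed database; the pragma returns the mode in effect.
    mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    writer.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
    writer.commit()
    writer.execute("INSERT INTO t VALUES (1)")  # write transaction left open
    before = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    writer.commit()
    after = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    writer.close()
    reader.close()
    return mode, before, after


def run_check():
    # Uses a throwaway directory so the -wal and -shm side files are cleaned up.
    with tempfile.TemporaryDirectory() as d:
        return wal_smoke_check(os.path.join(d, "queue.db"))
```

A real benchmark would add concurrent processes, realistic payload sizes, and timing around claim and complete operations.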

MVP Scope

In:

  • task_queue table in existing memory.db: task_id, agent_url, status, payload, created_at, claimed_at, completed_at, result, retry_count
  • Harness writes: enqueue on poll event; POST /task → 202 nudge to agent; POST /result endpoint for agent callback; background drain loop for completed tasks; re-enqueue tasks claimed but not completed within timeout
  • Harness reads: poll queue for completed tasks on callback + interval
  • foreman-client lib: next_task(), complete_task(task_id, decision), heartbeat(task_id) — heartbeat resets the claim timeout clock
  • Agent protocol: POST /task → 202 (nudge only); startup queue poll; configurable background poll interval
  • Delivery guarantee: at-least-once; task_id is the idempotency key
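The task_queue table and its basic lifecycle could be sketched as follows. The column names come from the list above; the column types, the status values ('pending', 'claimed', 'completed'), the UUID task IDs, and the function names are assumptions made for illustration.

```python
import json
import sqlite3
import time
import uuid

SCHEMA = """
CREATE TABLE IF NOT EXISTS task_queue (
    task_id      TEXT PRIMARY KEY,
    agent_url    TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'pending',
    payload      TEXT NOT NULL,
    created_at   REAL NOT NULL,
    claimed_at   REAL,
    completed_at REAL,
    result       TEXT,
    retry_count  INTEGER NOT NULL DEFAULT 0
)
"""


def open_queue(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, agent_url, payload, now=None):
    """Harness side: persist the event before any dispatch attempt."""
    task_id = str(uuid.uuid4())
    created = time.time() if now is None else now
    with conn:
        conn.execute(
            "INSERT INTO task_queue (task_id, agent_url, payload, created_at) "
            "VALUES (?, ?, ?, ?)",
            (task_id, agent_url, json.dumps(payload), created),
        )
    return task_id


def claim_next(conn, agent_url, now=None):
    """Agent side: claim the oldest pending task (FIFO), or return None."""
    claimed = time.time() if now is None else now
    with conn:
        row = conn.execute(
            "SELECT task_id, payload FROM task_queue "
            "WHERE agent_url = ? AND status = 'pending' "
            "ORDER BY created_at, rowid LIMIT 1",
            (agent_url,),
        ).fetchone()
        if row is None:
            return None
        # The status guard makes the claim safe if another reader raced us.
        cur = conn.execute(
            "UPDATE task_queue SET status = 'claimed', claimed_at = ? "
            "WHERE task_id = ? AND status = 'pending'",
            (claimed, row["task_id"]),
        )
        if cur.rowcount != 1:
            return None
    return {"task_id": row["task_id"], "payload": json.loads(row["payload"])}


def complete(conn, task_id, decision, now=None):
    """Agent side: store the result; returns False if the task is not claimed."""
    done = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "UPDATE task_queue SET status = 'completed', completed_at = ?, "
            "result = ? WHERE task_id = ? AND status = 'claimed'",
            (done, json.dumps(decision), task_id),
        )
    return cur.rowcount == 1
```

The guarded UPDATE statements keep each state transition conditional on the current status, so a repeated complete() call for the same task_id is a harmless no-op.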

Out:

  • Multiple agent containers per queue (no consumer groups in MVP)
  • External queue backends (Redis, NATS) — define pluggable interface, implement SQLite only
  • Task prioritization or ordering beyond FIFO
  • Monitoring UI — structured log output only
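The pluggable interface mentioned above could be expressed as a structural type, so that a later Redis or NATS backend only needs to provide the same methods. This is one possible shape, not a settled API: the TaskQueue name, the method names, and the signatures are all assumptions.

```python
from typing import Any, Dict, Optional, Protocol, runtime_checkable


@runtime_checkable
class TaskQueue(Protocol):
    """Hypothetical backend interface; SQLite would be the only MVP implementation."""

    def enqueue(self, agent_url: str, payload: Dict[str, Any]) -> str:
        """Persist a task and return its task_id."""
        ...

    def claim_next(self, agent_url: str) -> Optional[Dict[str, Any]]:
        """Claim the oldest pending task for this agent, or return None."""
        ...

    def complete(self, task_id: str, result: Dict[str, Any]) -> bool:
        """Store the result for a claimed task."""
        ...

    def heartbeat(self, task_id: str) -> bool:
        """Reset the claim timeout clock for a claimed task."""
        ...

    def requeue_stale(self, timeout_s: float) -> int:
        """Return expired claims to pending; returns how many were requeued."""
        ...
```

Keying claim_next() on the consumer rather than on the queue itself is one way to avoid baking in a single-consumer assumption.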

Not Doing (and Why)

  • Agent-owned queues — every agent author would reimplement queue logic; harness owns infrastructure
  • Exactly-once delivery — requires distributed coordination; at-least-once + idempotency is sufficient and far simpler
  • File-system queuing — ephemeral in containers; shared volumes add deployment surface for no real gain over SQLite
  • Keep synchronous dispatch as fallback — two delivery paths means neither is authoritative; commit to queue-first fully
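To illustrate why at-least-once delivery plus idempotency is enough, the sketch below applies a decision's side effect only the first time a given task_id is seen. IdempotentApplier is a hypothetical name, and the in-memory dictionary is a simplification: a real consumer would need to record applied task IDs durably.

```python
from typing import Any, Callable, Dict


class IdempotentApplier:
    """Applies the effect for each task_id at most once, even on redelivery."""

    def __init__(self) -> None:
        self._seen: Dict[str, Any] = {}

    def apply(self, task_id: str, decision: Any,
              effect: Callable[[Any], None]) -> bool:
        if task_id in self._seen:
            return False  # duplicate delivery: skip the side effect
        effect(decision)
        # Recorded only after the effect succeeds, so a failure can be retried.
        self._seen[task_id] = decision
        return True
```

With this in place, a re-enqueued task that was in fact already processed produces no second side effect.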

Open Questions

  • What is the claim timeout? If an agent pulls a task and crashes before completing, the harness must detect and re-enqueue it — define the TTL and re-enqueue logic before writing the schema.
  • Is foreman-client a separate PyPI package, part of the foreman package, or vendored into each agent at build time?
  • Should GET /queue/status be exposed on the harness for operator visibility, or is structured logging sufficient for MVP?
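For the claim-timeout question, the re-enqueue and heartbeat logic could take roughly the following shape once a TTL is chosen. The 300-second value is a placeholder, not a proposal; the reduced table and the function names are assumptions for the demo.

```python
import sqlite3

# Placeholder only: the real claim TTL is still an open question.
CLAIM_TIMEOUT_S = 300.0

# Reduced table for the demo: only the columns the timeout logic touches.
DEMO_SCHEMA = """
CREATE TABLE task_queue (
    task_id     TEXT PRIMARY KEY,
    status      TEXT NOT NULL,
    claimed_at  REAL,
    retry_count INTEGER NOT NULL DEFAULT 0
)
"""


def heartbeat(conn, task_id, now):
    """Reset the claim clock for a task that is still being worked on."""
    with conn:
        cur = conn.execute(
            "UPDATE task_queue SET claimed_at = ? "
            "WHERE task_id = ? AND status = 'claimed'",
            (now, task_id),
        )
    return cur.rowcount == 1


def requeue_stale(conn, now, timeout_s=CLAIM_TIMEOUT_S):
    """Return claimed tasks whose claim has expired to 'pending'."""
    with conn:
        cur = conn.execute(
            "UPDATE task_queue "
            "SET status = 'pending', claimed_at = NULL, "
            "    retry_count = retry_count + 1 "
            "WHERE status = 'claimed' AND claimed_at < ?",
            (now - timeout_s,),
        )
    return cur.rowcount
```

The harness background loop would call requeue_stale() on each tick; retry_count then gives a natural hook for a maximum-retries policy if one is wanted.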