Architecture

Hot path & latency

The hot path is the critical pipeline from a visitor's message arriving at the server to the first token streaming back. The target is 1 second at p95. Past that point the conversation feels broken, so the architecture is built around pushing everything non-essential off the hot path.

Overview

The entire pipeline runs within a single PHP request, kept alive by Octane workers so cold-start cost is paid once at process boot, not per turn. The 1-second budget covers HTTP receive, retrieval, prompt assembly, LLM time-to-first-token, and the first SSE flush.

Latency budget

The 865 ms working target is spread across nine phases. LLM time-to-first-token is by far the largest single slice.

Phase	Budget
HTTP receive + auth	30 ms
Workspace + agent resolve	10 ms
Curated match check	5 ms
Query embedding	80 ms
Vector search	120 ms
Rerank	80 ms
Prompt assembly	10 ms
LLM time-to-first-token	500 ms
SSE flush	30 ms

That leaves roughly 135 ms of contingency before the 1-second wall.
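The arithmetic behind the budget can be checked with a short sketch. It is written in Python for brevity rather than the PHP the pipeline runs in. The phase keys and the over_budget helper are hypothetical names chosen for illustration; only the millisecond values come from the table above.

```python
# Phase budgets in milliseconds, copied from the table above.
# The dictionary keys are illustrative names, not identifiers from the codebase.
BUDGET_MS = {
    "http_receive_auth": 30,
    "workspace_agent_resolve": 10,
    "curated_match_check": 5,
    "query_embedding": 80,
    "vector_search": 120,
    "rerank": 80,
    "prompt_assembly": 10,
    "llm_first_token": 500,
    "sse_flush": 30,
}

TARGET_MS = 1000  # the 1-second p95 wall


def over_budget(measured_ms):
    """Return {phase: overshoot_ms} for every phase that exceeded its budget.

    Phases missing from measured_ms are treated as not measured and skipped.
    """
    return {
        phase: measured_ms[phase] - budget
        for phase, budget in BUDGET_MS.items()
        if phase in measured_ms and measured_ms[phase] > budget
    }


working_target = sum(BUDGET_MS.values())   # 865
contingency = TARGET_MS - working_target   # 135
llm_share = BUDGET_MS["llm_first_token"] / working_target
```

A helper like over_budget could be fed per-phase timings from a trace to show which slice ate into the contingency on a slow turn.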

Architectural rules

Five hard rules are enforced through code review and CI tests:

Asynchronous persistence — every database write happens after the response stream completes, not before.
Queue-based webhooks — outgoing integrations dispatch as background jobs. Webhooks never block a turn.
Single LLM call per turn — no multi-step reasoning chains in the request path.
Batched conversation history — recent turns are read from Redis, not Postgres, to avoid round trips.
Graceful degradation — when a provider fails, the system falls through to the next provider or to an honest low-confidence answer; it never loops on the hot path.
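The graceful degradation rule can be sketched as a single pass over an ordered provider list. This is a minimal illustration in Python, not the production PHP code: ProviderError, first_available, and the fallback wording are all assumptions. The point it shows is that each provider is tried at most once and nothing is retried in the request path.

```python
from typing import Callable, Iterable, Optional, Tuple


class ProviderError(Exception):
    """Hypothetical error raised when a provider cannot serve the request."""


# Illustrative wording for the honest low-confidence answer.
FALLBACK_ANSWER = "I'm not certain about this one."


def first_available(
    providers: Iterable[Tuple[str, Callable[[str], str]]],
    prompt: str,
) -> Tuple[Optional[str], str, bool]:
    """Try each provider exactly once, in order.

    Returns (provider_name, answer, low_confidence). If every provider
    fails, returns the fallback answer with low_confidence set to True.
    """
    for name, call in providers:
        try:
            return name, call(prompt), False
        except ProviderError:
            # Fall through to the next provider; never retry the same one.
            continue
    return None, FALLBACK_ANSWER, True
```

Because the loop visits each provider once, the worst-case added latency is bounded by the number of providers rather than by a retry policy.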

Deferred operations

Nine job types run after the stream closes:

Persist the user message and the assistant message.
Extract and persist citations.
Increment usage_events for billing.
Detect knowledge gaps for low-confidence turns.
Dispatch outgoing webhooks.
Queue page auto-indexing if the visited page is unseen.
Update conversation activity timestamps.
Notify human agents of unclaimed conversations matching alert rules.
Refresh per-agent analytics counters.

None of these can block the visitor experience.
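The ordering constraint (stream first, enqueue afterwards) can be sketched with a generator. This is an illustrative Python sketch, not the actual PHP job dispatch: stream_turn is a hypothetical name, and a plain list stands in for the real job queue.

```python
from typing import Iterable, Iterator, List


def stream_turn(tokens: Iterable[str], queue: List[str], jobs: List[str]) -> Iterator[str]:
    """Yield every token to the client, then enqueue the deferred jobs.

    Nothing is added to the queue until the token stream is exhausted,
    so deferred work cannot delay the first token or any later one.
    """
    for token in tokens:
        yield token
    # The stream has completed: only now hand work to the background queue.
    queue.extend(jobs)


queue: List[str] = []
turn = stream_turn(
    ["Hel", "lo"],
    queue,
    ["persist_messages", "increment_usage_events", "dispatch_webhooks"],
)
first = next(turn)            # first token is out, queue is still empty
queued_mid_stream = list(queue)
rest = list(turn)             # drain the stream; jobs are enqueued at the end
```

In this simplified form, a client that disconnects mid-stream would leave the jobs unqueued; a real implementation would have to decide how to handle that case.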

Caching strategy

Redis carries two hot-path caches:

Retrieval cache — 30-minute TTL, keyed by (agent_id, query, page_url). Repeated phrasings of the same question on the same page reuse the same chunks.
Conversation history cache — 2-hour TTL, capped at the last 12 messages. Lets the LLM see recent turns without a Postgres read.
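A minimal sketch of both caches, using an in-memory dictionary in place of Redis. Everything here is illustrative: the key format, the whitespace and case normalization of the query, and the TTLCache class are assumptions, not the production scheme. Only the TTLs, the 12-message cap, and the (agent_id, query, page_url) key components come from the text above.

```python
import hashlib
import time
from collections import deque

RETRIEVAL_TTL_S = 30 * 60        # 30 minutes
HISTORY_TTL_S = 2 * 60 * 60      # 2 hours
HISTORY_MAX_MESSAGES = 12


def retrieval_key(agent_id, query, page_url):
    """Build a cache key from (agent_id, query, page_url).

    Assumption: the query is lowercased and whitespace-collapsed first,
    so trivially different spellings of the same query share a key.
    """
    normalized = " ".join(query.lower().split())
    raw = f"{agent_id}\n{normalized}\n{page_url}".encode("utf-8")
    return "retrieval:" + hashlib.sha256(raw).hexdigest()


class TTLCache:
    """In-memory stand-in for Redis SET with an expiry, plus GET."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}

    def set(self, key, value, ttl_s):
        self._items[key] = (self._clock() + ttl_s, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return None
        return value


def append_history(history, message):
    """Return the history with message appended, capped at the last 12."""
    trimmed = deque(history, maxlen=HISTORY_MAX_MESSAGES)
    trimmed.append(message)
    return list(trimmed)
```

Hashing the key keeps it a fixed length regardless of how long the query or page URL is, which is one common reason to prefer a digest over the raw tuple.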

Failure philosophy

The visitor always gets a response, even if it is an uncertain one. When the vector store is unavailable, the system degrades to an ungrounded answer with a low_confidence flag rather than returning an HTTP 500. Silent failure is never the default.
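The vector-store case can be sketched as follows, in Python for illustration. VectorStoreUnavailable, answer_turn, the logger name, and the response shape are assumptions; only the low_confidence flag comes from the text. The failure is logged rather than swallowed, which is one way to keep the degradation from being silent.

```python
import logging

logger = logging.getLogger("rag")  # hypothetical logger name


class VectorStoreUnavailable(Exception):
    """Hypothetical error raised when the vector store cannot be reached."""


def answer_turn(query, search, generate):
    """Answer a turn, degrading to an ungrounded answer if retrieval fails.

    search(query) returns a list of chunks or raises VectorStoreUnavailable.
    generate(query, chunks) returns the answer text.
    """
    try:
        chunks = search(query)
        low_confidence = False
    except VectorStoreUnavailable:
        # Record the failure, then continue without grounding.
        logger.warning("vector store unavailable; answering ungrounded")
        chunks = []
        low_confidence = True
    return {
        "answer": generate(query, chunks),
        "low_confidence": low_confidence,
        "chunks": chunks,
    }
```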

Note. The headline production metric is p95 of rag.llm.first_token. Watch it in your tracing tool — see Observability.
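For readers unfamiliar with the metric, p95 is the value that 95% of samples fall at or below. A minimal nearest-rank sketch in Python follows; the p95 function is illustrative, and a tracing tool may use a different interpolation method and so report slightly different numbers.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies in ms."""
    if not samples_ms:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

With 20 samples the nearest rank is 19, so a single slow outlier does not move p95, but two do. That is why the metric reflects what a typical slow turn feels like rather than the single worst one.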