Architecture
Hot path & latency
The hot path is the critical pipeline from a visitor's message arriving at the server to the first token streaming back. The target is 1 second p95. Past that point, the conversation feels broken — so the architecture is built around pushing everything non-essential off the hot path.
Overview
The entire pipeline runs within a single PHP request, kept alive by Octane workers so cold-start cost is paid once at process boot, not per turn. The 1-second budget covers HTTP receive, retrieval, prompt assembly, LLM time-to-first-token, and the first SSE flush.
Latency budget
The 865 ms working target spreads across nine phases. The LLM time-to-first-token is by far the largest single slice.
| Phase | Budget |
|---|---|
| HTTP receive + auth | 30 ms |
| Workspace + agent resolve | 10 ms |
| Curated match check | 5 ms |
| Query embedding | 80 ms |
| Vector search | 120 ms |
| Rerank | 80 ms |
| Prompt assembly | 10 ms |
| LLM time-to-first-token | 500 ms |
| SSE flush | 30 ms |
That leaves roughly 135 ms of contingency before the 1-second wall.
Architectural rules
Five hard rules are enforced through code review and CI tests:
- Asynchronous persistence — every database write happens after the response stream completes, not before.
- Queue-based webhooks — outgoing integrations dispatch as background jobs. Webhooks never block a turn.
- Single LLM call per turn — no multi-step reasoning chains in the request path.
- Batched conversation history — recent turns are read from Redis, not Postgres, to avoid round trips.
- Graceful degradation — when a provider fails, the system falls through to the next provider or to an honest low-confidence answer; it never loops on the hot path.
Deferred operations
Nine job types run after the stream closes:
- Persist the user message and the assistant message.
- Extract and persist citations.
- Increment
usage_eventsfor billing. - Detect knowledge gaps for low-confidence turns.
- Dispatch outgoing webhooks.
- Queue page auto-indexing if the visited page is unseen.
- Update conversation activity timestamps.
- Notify human agents of unclaimed conversations matching alert rules.
- Refresh per-agent analytics counters.
None of these can block the visitor experience.
Caching strategy
Redis carries two hot-path caches:
- Retrieval cache — 30-minute TTL, keyed by
(agent_id, query, page_url). Repeated phrasings of the same question on the same page reuse the same chunks. - Conversation history cache — 2-hour TTL, capped at the last 12 messages. Lets the LLM see recent turns without a Postgres read.
Failure philosophy
The visitor always gets a response, even if uncertain. When the vector store is unavailable, the system degrades to an ungrounded answer with a low_confidence flag rather than 500-ing. Silent failure is never the default.
Note. The headline production metric is p95 of rag.llm.first_token. Watch it in your tracing tool — see Observability.