Glozr docs

Architecture

Hot path & latency

The hot path is the critical pipeline from a visitor's message arriving at the server to the first token streaming back. The target is 1 second p95. Past that point, the conversation feels broken — so the architecture is built around pushing everything non-essential off the hot path.

Overview

The entire pipeline runs within a single PHP request, kept alive by Octane workers so cold-start cost is paid once at process boot, not per turn. The 1-second budget covers HTTP receive, retrieval, prompt assembly, LLM time-to-first-token, and the first SSE flush.

Latency budget

The 865 ms working target spreads across nine phases. The LLM time-to-first-token is by far the largest single slice.

PhaseBudget
HTTP receive + auth30 ms
Workspace + agent resolve10 ms
Curated match check5 ms
Query embedding80 ms
Vector search120 ms
Rerank80 ms
Prompt assembly10 ms
LLM time-to-first-token500 ms
SSE flush30 ms

That leaves roughly 135 ms of contingency before the 1-second wall.

Architectural rules

Five hard rules are enforced through code review and CI tests:

  1. Asynchronous persistence — every database write happens after the response stream completes, not before.
  2. Queue-based webhooks — outgoing integrations dispatch as background jobs. Webhooks never block a turn.
  3. Single LLM call per turn — no multi-step reasoning chains in the request path.
  4. Batched conversation history — recent turns are read from Redis, not Postgres, to avoid round trips.
  5. Graceful degradation — when a provider fails, the system falls through to the next provider or to an honest low-confidence answer; it never loops on the hot path.

Deferred operations

Nine job types run after the stream closes:

  • Persist the user message and the assistant message.
  • Extract and persist citations.
  • Increment usage_events for billing.
  • Detect knowledge gaps for low-confidence turns.
  • Dispatch outgoing webhooks.
  • Queue page auto-indexing if the visited page is unseen.
  • Update conversation activity timestamps.
  • Notify human agents of unclaimed conversations matching alert rules.
  • Refresh per-agent analytics counters.

None of these can block the visitor experience.

Caching strategy

Redis carries two hot-path caches:

  • Retrieval cache — 30-minute TTL, keyed by (agent_id, query, page_url). Repeated phrasings of the same question on the same page reuse the same chunks.
  • Conversation history cache — 2-hour TTL, capped at the last 12 messages. Lets the LLM see recent turns without a Postgres read.

Failure philosophy

The visitor always gets a response, even if uncertain. When the vector store is unavailable, the system degrades to an ungrounded answer with a low_confidence flag rather than 500-ing. Silent failure is never the default.

Note. The headline production metric is p95 of rag.llm.first_token. Watch it in your tracing tool — see Observability.