Architecture

RAG pipeline

The RAG pipeline converts a visitor question into an answer grounded in the workspace's knowledge base, streamed in real time. It runs in five ordered stages: curated short-circuit, retrieval, prompt assembly, LLM streaming, and asynchronous persistence.

Overview

Every visitor turn passes through the same pipeline. The first stage can bypass the rest entirely for curated answers; the remaining stages always run in order.

1. Curated short-circuit

If the question matches a curated trigger, the system streams the prepared answer and skips retrieval, prompt assembly, and the LLM call. Matching is delegated to CuratedAnswerMatcher.php.

2. Retrieval

The Retriever::retrieve() method runs a two-stage search:

Embed the query using the configured LLM client.
Vector search with metadata filtering — default topK=6, fanOut=3.
Rerank results with a cross-encoder model.
Boost chunks from the current page by +0.15.
Apply the confidence threshold; require at least 2 chunks to survive.

Results are cached in Redis for 30 minutes, keyed by query + page URL. The cache is purged automatically on source changes.

3. Prompt assembly

PromptBuilder::build() assembles the system prompt from:

The agent persona (name, tone).
Core instructions that restrict answers to retrieved source material.
Prompt-injection defences — retrieved content is wrapped in source tags and labelled as data, not instructions.
Guardrails and topic avoid-lists.
Current page context hints (title, headings, JSON-LD).
The optional system_prompt override from the agent.
Language directives (pinned by language_default or visitor locale).

4. LLM streaming

The LLM client returns a token generator. Each token fires a TokenStreamed event that is forwarded to the widget as Server-Sent Events. The pipeline never buffers the full response server-side before flushing.

5. Async persistence

Once streaming completes, the system runs the post-stream tasks: extract citations, save messages, detect knowledge gaps (low_confidence turns), increment usage metrics, and dispatch any webhook integrations. None of this blocks the visitor — see Hot path.

Confidence scoring

The maximum rerank score across surviving chunks becomes the turn's confidence. When the current page context is available and grounded, the score is boosted to at least 0.85. With no grounding at all, confidence reports as 0.3 and the agent answers honestly that it doesn't know.

Page context

The widget extracts structured signals from the page the visitor is on — title, meta description, JSON-LD, headings — and submits them as source[0]. This lets the agent answer questions about pages that haven't been indexed yet, like product detail pages with dynamic content.

Provider abstraction

LLM, embedding, vector store, and crawler implementations all sit behind interfaces. Tests inject fake clients; production picks providers based on env-var credentials. Switching providers is a config change, not a code change.

Note. If you change the embedding model after a workspace has ingested content, run php artisan vector:rebuild-index. Embeddings are not portable across models.