Glozr docs

Knowledge sources

Everything the agent knows about your business comes from sources you add to its knowledge base. Sources flow through a single ingestion pipeline — extraction, chunking, embedding — and then become retrievable context for every visitor message.

Source types

Glozr supports the following source types:

  • URL / Sitemap / Feed — web content crawled and extracted.
  • Text — direct paste of FAQs, snippets, or scripts.
  • Notion / Google Docs / Google Sheets — API-based document ingestion.
  • SQL — read-only SELECT queries run over PDO against MySQL or PostgreSQL databases.
  • Files — PDF, DOCX, and XLSX uploads, parsed via Cloudflare Workers AI.
  • WooCommerce — synced via the WordPress companion plugin.
  • Auto — pages visitors land on, queued by auto-index.

SQL database sources

SQL sources are powerful, so the runtime enforces strict guardrails:

  • Hosts must pass SSRF validation — internal IPs (RFC1918, loopback, link-local) are blocked.
  • Queries are limited to SELECT statements. Anything else is rejected before execution.
  • Connections run inside read-only transactions.
  • A 5,000-row cap per sync prevents resource exhaustion on accidental wide queries.
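
To make the guardrails above concrete, here is a minimal Python sketch of the kinds of checks they describe. It is not Glozr's implementation: every function name is hypothetical, and the SELECT check is a deliberately conservative keyword test, not a SQL parser.

```python
import ipaddress
import itertools
import re
import socket

MAX_ROWS = 5000  # per-sync row cap quoted in the list above


def is_internal_ip(ip: str) -> bool:
    """True for RFC 1918, loopback, and link-local addresses.

    Python's is_private also covers a few other reserved ranges.
    """
    addr = ipaddress.ip_address(ip)
    return addr.is_private or addr.is_loopback or addr.is_link_local


def host_passes_ssrf_check(host: str) -> bool:
    """Resolve the host and reject it if any resolved address is internal."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    # Drop any IPv6 zone suffix ("%eth0") before parsing the address.
    addresses = {info[4][0].split("%")[0] for info in infos}
    return bool(addresses) and not any(is_internal_ip(a) for a in addresses)


def is_single_select(sql: str) -> bool:
    """Accept exactly one statement that starts with SELECT.

    Assumption: any remaining semicolon means stacked statements, so the
    query is rejected even if the semicolon sits inside a string literal.
    """
    stripped = sql.strip().rstrip(";").strip()
    if ";" in stripped:
        return False
    return re.match(r"(?i)select\b", stripped) is not None


def cap_rows(rows, limit: int = MAX_ROWS) -> list:
    """Read at most `limit` rows from any row iterator."""
    return list(itertools.islice(rows, limit))
```

A keyword test like is_single_select is easy to get wrong on its own, so it makes sense that the list pairs it with read-only transactions as a second line of defense.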

The host, port, database name, username, and password are stored encrypted via Laravel's encrypted:array cast (AES-256-GCM). Credentials are never returned to the dashboard after creation — you re-enter them to rotate.

Processing pipeline

Every source goes through the same three stages:

  1. Extraction — fetch / parse the raw content into clean text plus metadata (title, URL, locale).
  2. Chunking — a recursive splitter that prefers semantic boundaries: it splits on headings and blank lines first, then packs paragraphs up to ~2000 characters per chunk.
  3. Embedding — chunks are vectorized and upserted into the configured vector store (Cloudflare Vectorize or OpenAI / Postgres pgvector depending on the provider).
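
The chunking stage can be sketched roughly as follows. This is an illustration under stated assumptions, not the splitter Glozr ships: it assumes extracted text uses Markdown-style "#" headings, treats each heading as a hard boundary, and falls back to line breaks, spaces, and finally a hard cut when a paragraph exceeds the budget. The names chunk_document, _pack, and SEPARATORS are hypothetical.

```python
import re

MAX_CHARS = 2000  # approximate chunk budget from the pipeline description

# Separators tried in order, from most to least semantic (assumed list).
SEPARATORS = ["\n\n", "\n", " "]


def _pack(text: str, max_chars: int, level: int = 0) -> list[str]:
    """Split on the current separator, then greedily repack the parts."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if level == len(SEPARATORS):
        # No separator left: hard-cut as a last resort.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep = SEPARATORS[level]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part.strip():
            continue
        if len(part) > max_chars:
            # Flush what we have, then split the oversized part more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(_pack(part, max_chars, level + 1))
            continue
        candidate = current + sep + part if current else part
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


def chunk_document(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split at headings first, then pack each section's paragraphs."""
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        chunks.extend(_pack(section.strip(), max_chars))
    return chunks
```

With this sketch, a section holding a heading and two 1,500-character paragraphs becomes two chunks: the heading stays with the first paragraph, and the second paragraph starts a new chunk because adding it would exceed the budget.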

Limitations

  • Cloudflare Vectorize is eventually consistent: expect a 30–60 second propagation delay after a sync completes before chunks become retrievable.
  • Spreadsheet (XLSX) uploads require Cloudflare Workers AI. There is no local fallback parser today.
  • Resyncing is currently manual. URL and database sources don't poll for changes, so you trigger a resync from the Sources index whenever content changes.
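
If you script checks against a freshly synced source, the Vectorize propagation delay means an immediate query can come back empty. One way to allow for that is a small polling helper like the sketch below. The probe callable is hypothetical (these docs do not describe an API for it), and the 90-second timeout is an assumed margin over the 30–60 second figure.

```python
import time
from typing import Callable


def wait_until_retrievable(
    probe: Callable[[], bool],
    timeout: float = 90.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() until it returns True or `timeout` seconds have passed.

    probe is any zero-argument function that reports whether a known chunk
    can be retrieved yet. sleep and clock are injectable so the helper can
    be tested without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A False result after the timeout suggests the problem is not propagation delay and that the sync itself is worth inspecting.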

Note. Sources can be previewed, reindexed, or deleted at any time from /app/agents/{id}/sources. Deleting a source removes its chunks from the vector store on the next sync sweep.