DocuMind Architecture Mini-Book
Authored by Artem Moshnin, Lead Architect and Engineer. This guide documents the technical blueprint for a vendor-agnostic RAG system built for enterprise reliability and speed.

DocuMind is implemented as a highly decoupled RAG platform. Retrieval, embeddings, generation routing, security boundaries, and recovery logic are separated intentionally so the platform can evolve across model vendors without replatforming.

  • Ground every answer in retrievable evidence.
  • Prefer explicit contracts over hidden coupling in every layer.
  • Decouple retrieval, embeddings, and generation to preserve runtime flexibility.
  • Prevent lock-in through provider abstraction and stateless security boundaries.
  • Optimize for explainability, resilience, and operational clarity.
  • Measure quality through repeatable evaluation loops before relying on intuition.
Runtime Orchestration

Summary: DocuMind orchestrates ingestion, retrieval, reformulation, generation, and citation rendering as one deterministic runtime pipeline.

What Was Built: The implementation combines strict endpoint contracts, streaming SSE semantics, typed payloads, and cross-pane citation linking so every answer remains traceable.

  • Backend stream events and frontend parser state machines are synchronized.
  • Chat history persists in session flow while rendering remains low-latency.
  • Error semantics stay explicit so recovery paths remain deterministic.
Retrieval Intelligence

Summary: BM25 and dense retrieval are fused because enterprise queries fail across both lexical and semantic axes.

What Was Built: Hybrid retrieval, rank fusion, persistent vectors, and startup hydration of lexical memory were implemented as one coherent retrieval system.

  • Retrieval continuity survives restarts by rebuilding BM25 from disk-backed data.
  • Chunk topology is tuned for both citation fidelity and answer coherence.
  • Embedding infrastructure is isolated from generation infrastructure.
Vendor-Agnostic Security

Summary: Provider routing remains abstract while user credential control stays local through BYOK.

What Was Built: The security model includes provider headers, conditional key injection, automatic 401 key invalidation, and settings-driven remediation UX.

  • Server-side logic remains stateless with respect to user secrets.
  • Provider failure handling is explicit for both OpenAI and Groq paths.
  • Operating modes remain flexible without divergent product surfaces.
Chapter 1: End-to-End Runtime Flow
How a request moves from UI input to grounded answer output

1.1 Ingestion entrypoint and request contracts

PDF ingestion starts at `/api/v1/documents/upload`, where provider context is validated before file bytes are processed.

Every request carries `X-Model-Provider`, and FastAPI dependency resolution enforces provider constraints early in the lifecycle.

Missing filenames and empty payloads are rejected with explicit HTTP errors to keep client behavior deterministic.

  • Provider values are validated in `get_model_provider` before endpoint logic executes.
  • OpenAI key format checks are centralized in `get_provider_context`.
  • Groq server-key checks fail fast to surface misconfiguration predictably.

1.2 Conversational execution as a streaming pipeline

Chat requests are handled through `/api/v1/chat/stream` with a stable `session_id` and a `query` payload.

Follow-up turns are reformulated against existing chat history before retrieval, then passed into generation.

Responses are emitted as SSE events so token rendering begins immediately instead of waiting for full completion.

  • `token` events stream incremental assistant text.
  • A `final` event carries the completed answer and typed citations.
  • `[DONE]` terminates the stream for clean client state transitions.
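A minimal sketch of the emit side, assuming JSON payloads inside data-only SSE frames (the exact wire framing and payload field names are assumptions; the `token`, `final`, and `[DONE]` event semantics come from this guide):

```python
import json
from typing import Iterator

def sse_frame(payload: str) -> str:
    # One SSE frame: one or more "data:" lines terminated by a blank line.
    return "".join(f"data: {line}\n" for line in payload.splitlines()) + "\n"

def stream_answer(tokens: list[str], citations: list[dict]) -> Iterator[str]:
    answer = ""
    for tok in tokens:
        answer += tok
        yield sse_frame(json.dumps({"type": "token", "content": tok}))
    yield sse_frame(json.dumps({"type": "final", "answer": answer,
                                "citations": citations}))
    yield sse_frame("[DONE]")  # clean terminal sentinel for client state transitions
```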

1.3 Deterministic stream finalization

The backend tracks the finalized assistant answer during streaming and appends user/assistant turns to `session_store` after completion.

This ordering ensures next-turn reformulation sees stable finalized content rather than partial token buffers.

Stream lifecycle behavior is treated as a strict backend/frontend contract to avoid hidden side effects.
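The ordering described above can be sketched with a plain dict standing in for `session_store` (the record shape is an assumption of this sketch):

```python
session_store: dict[str, list[dict]] = {}

def finalize_turn(session_id: str, user_query: str, assistant_answer: str) -> None:
    # Called only after the stream completes, so next-turn reformulation
    # never sees partial token buffers in history.
    history = session_store.setdefault(session_id, [])
    history.append({"role": "user", "content": user_query})
    history.append({"role": "assistant", "content": assistant_answer})
```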

Chapter 2: Hybrid Retrieval Logic
Combining lexical and semantic search with rank fusion

2.1 Retrieval responsibilities by failure mode

BM25 handles strict lexical anchors such as exact clauses, IDs, and policy terms that dense search may miss.

Dense vector retrieval captures conceptual similarity when query phrasing diverges from source wording.

Combining both branches improves recall while preserving precision under enterprise document variability.

2.2 Candidate merge with Reciprocal Rank Fusion

Direct score mixing is avoided because BM25 and dense scores operate on different numeric scales.

`EnsembleRetriever` with balanced weights provides position-aware ranking that is less fragile than raw-score blending.

Rank-based fusion remains stable as corpora, tokenization behavior, and model characteristics evolve.

  • Dense retrieval runs at `k=3` from Chroma for semantic candidates.
  • BM25 retrieval runs at `k=3` from in-memory documents for lexical precision.
  • Fusion weights `[0.5, 0.5]` keep behavior balanced and interpretable.
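The position-aware ranking that `EnsembleRetriever` provides can be illustrated with a plain Reciprocal Rank Fusion sketch. The constant `k=60` is the common RRF default, an assumption here rather than a documented DocuMind value:

```python
def reciprocal_rank_fusion(rankings: list[list[str]],
                           weights: list[float],
                           k: int = 60) -> list[str]:
    # Score by rank position, not raw retriever scores, so BM25 and dense
    # branches combine without sharing a numeric scale.
    scores: dict[str, float] = {}
    for ranked, weight in zip(rankings, weights):
        for position, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + position + 1)
    return sorted(scores, key=scores.__getitem__, reverse=True)
```

A document that ranks well in both branches rises above one that ranks first in only a single branch, which is the behavior that makes fusion robust to scale drift.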

2.3 Startup hydration and retrieval continuity

Vectors are persisted to Chroma on disk, and BM25 memory is rebuilt from persisted records during app lifespan startup.

Because BM25 is memory-resident in this design, hydration is required to preserve retrieval continuity after restarts.

Hybrid retrieval quality depends on both branches being warm and synchronized.
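Startup hydration can be sketched as rebuilding the lexical corpus from persisted chunk records. The record fields and naive tokenization are assumptions; in the real system the documents come from the Chroma-persisted store and feed a BM25 retriever:

```python
def hydrate_bm25_corpus(persisted_records: list[dict]) -> list[dict]:
    # Rebuild in-memory lexical state from disk-backed chunks after a restart,
    # keeping BM25 aligned with the persisted vector corpus.
    corpus = []
    for record in persisted_records:
        corpus.append({
            "chunk_id": record["chunk_id"],
            "text": record["text"],
            "tokens": record["text"].lower().split(),  # naive tokenization for the sketch
        })
    return corpus
```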

Chapter 3: PDF ETL and Chunk Engineering
Converting raw PDFs into deterministic retrieval memory

3.1 PDF extraction and page-level metadata

PDF files are parsed with PyMuPDF (`fitz`) page-by-page so extracted content remains tied to page boundaries.

Empty pages are skipped, and metadata fields such as `source` and `page` are normalized for downstream consistency.

Each page becomes a typed LangChain `Document`, which keeps later pipeline stages schema-safe.

3.2 Recursive chunking as a quality control lever

Chunking uses `RecursiveCharacterTextSplitter` with configurable `CHUNK_SIZE` and fixed overlap `150`.

Overlap preserves clause continuity across boundaries and reduces edge-loss for legal or specification-heavy documents.

Deterministic chunk topology is essential because unstable chunking degrades both retrieval quality and citation trust.
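The effect of overlap can be seen in a simplified fixed-window sketch. The real splitter is `RecursiveCharacterTextSplitter`, which also respects separator boundaries; `chunk_size=1000` is a placeholder here, and only the overlap of 150 comes from this guide:

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    # Fixed-window illustration: each chunk repeats the last `overlap`
    # characters of its predecessor so clauses are not cut at boundaries.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```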

3.3 Storage loading and chunk identity

Each chunk receives a durable `chunk_id`, then is written into Chroma with explicit IDs for repeatable ingestion behavior.

Vector persistence remains configurable via `CHROMA_PERSIST_DIR`.

The same chunk corpus is kept in BM25 memory so lexical and semantic retrieval operate on aligned evidence units.
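This guide does not specify the `chunk_id` scheme; one deterministic option, sketched here, hashes source identity plus chunk index so re-ingesting the same file yields the same IDs:

```python
import hashlib

def make_chunk_id(filename: str, page: int, chunk_index: int) -> str:
    # Deterministic: the same (file, page, index) always maps to the same ID,
    # which keeps repeated ingestion idempotent when explicit IDs are written
    # to the vector store.
    raw = f"{filename}:{page}:{chunk_index}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]
```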

3.4 Context formatting for grounded generation

Retrieved chunks are formatted into numbered context blocks containing `document_id`, `filename`, and `page_number`.

Metadata is made explicit so citation tool output can match source attributes exactly.

This formatted context is the core bridge between retrieval truth and generation behavior.
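A minimal sketch of the numbered-context format. The field names `document_id`, `filename`, and `page_number` come from this guide; the exact block layout is an assumption:

```python
def format_context(chunks: list[dict]) -> str:
    # Metadata is spelled out per block so the model's citation output
    # can match source attributes exactly.
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        header = (f"[{i}] document_id={chunk['document_id']} "
                  f"filename={chunk['filename']} page_number={chunk['page_number']}")
        blocks.append(f"{header}\n{chunk['text']}")
    return "\n\n".join(blocks)
```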

Chapter 4: Prompt and Generation Orchestration
Grounded answer generation through strict prompt contracts

4.1 Retrieval intent separated from answer intent

Multi-turn input is reformulated into a standalone retrieval query when history exists.

Retrieval uses the reformulated query, while generation uses the original user question plus chat history.

This separation keeps retrieval precision and conversational coherence aligned without coupling the two concerns.

4.2 Citation integrity through structured outputs

System prompts prohibit outside knowledge and require citation metadata to match retrieved context exactly.

Pydantic response models enforce typed outputs for non-streaming mode, including citation schema validation.

`source_text` remains verbatim so claims can be audited directly against the uploaded document.
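The citation shape can be sketched with stdlib dataclasses (the production models are Pydantic; field names mirror the metadata this guide names, and `ChatResponse` is the serialized final payload):

```python
from dataclasses import dataclass, field

@dataclass
class Citation:
    document_id: str
    filename: str
    page_number: int
    source_text: str  # verbatim quote, auditable against the uploaded document

@dataclass
class ChatResponse:
    answer: str
    citations: list[Citation] = field(default_factory=list)
```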

4.3 Single-pass streaming for text and citations

A `CitationTool` binding is attached during streaming generation while token text and tool-call fragments are accumulated concurrently.

Tool-call JSON buffers are parsed at stream completion and emitted in a typed final payload.

Single-pass streaming reduces latency and token overhead while preserving structured provenance.

  • Token events append into the live assistant message in real time.
  • Tool-call chunks are parsed defensively so partial JSON does not break the stream.
  • Final payloads are serialized through `ChatResponse` for deterministic client parsing.

4.4 Predictable behavior under insufficient context

Prompt rules require explicit context-insufficiency statements when evidence is missing.

Citation arrays must be empty in that state to prevent fabricated provenance.

This policy favors reliability over speculative completion in enterprise usage.

Chapter 5: Vendor-Agnostic LLM Router
One product surface with interchangeable model backends

5.1 Provider routing by explicit API contract

Provider selection is routed through `X-Model-Provider` with allowed values `groq` and `openai`.

Unsupported values are rejected with `400`, and checks remain centralized in dependency logic.

This keeps control-plane behavior consistent across endpoints.

5.2 Model instantiation through a factory boundary

`get_llm` constructs provider-specific model clients while endpoint logic remains provider-neutral.

Current runtime defaults are `gpt-4o-mini` for OpenAI and `llama3-8b-8192` for Groq-compatible execution at temperature `0`.

Constructor isolation reduces blast radius when adding or changing providers.

5.3 Onboarding and governance as separate operating modes

Groq is configured as the default for low-friction onboarding and low-latency first-token behavior.

OpenAI remains available through BYOK for teams that require quota ownership and governance controls.

This dual-mode architecture is a deliberate anti-lock-in strategy, not a temporary compatibility layer.

| Dimension | Groq (Default) | OpenAI (BYOK) |
| --- | --- | --- |
| Primary Advantage | Fast onboarding and very low latency. | User-controlled quota and policy governance. |
| Operational Risk | Strict free-tier rate limits. | Invalid or expired user credentials. |
| UX Recovery Path | Retry guidance or provider switch after `429`. | Key reset and settings reopen after `401`. |
Chapter 6: Decoupled Embeddings and Index Stability
Retrieval infrastructure independent from generation engines

6.1 Provider-independent embedding strategy

Indexing uses `HuggingFaceEmbeddings` (`all-MiniLM-L6-v2`) with `lru_cache` to reduce repeated initialization overhead.

Embeddings remain decoupled from generation providers so model switching does not trigger re-indexing.

This separation treats document memory as stable infrastructure while generation providers remain replaceable.
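The caching pattern, sketched with a stand-in loader (the real loader constructs `HuggingFaceEmbeddings` with `all-MiniLM-L6-v2`):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_embeddings() -> object:
    # Stand-in for constructing HuggingFaceEmbeddings("all-MiniLM-L6-v2");
    # lru_cache guarantees the model is initialized once per process,
    # so hot paths skip repeated load cost.
    return object()
```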

6.2 Retrieval stability across provider switches

Vector and BM25 corpora remain stable while providers are toggled, preserving retrieval consistency on the same document set.

Expensive re-embedding loops are avoided during active sessions, keeping latency focused on retrieval plus generation.

Provider switching does not reset document intelligence, reducing both cost and user friction.

  • Chunk identity remains stable through durable `chunk_id` values.
  • Vector persistence remains durable through Chroma disk storage.
  • Lexical and dense branches stay aligned on the same chunk corpus.
Chapter 7: Stateless Security and BYOK
Credential control in the browser without server-side secret storage

7.1 Browser-owned API key model

User OpenAI keys remain in browser local storage and are not persisted in backend databases.

Keys are injected only when OpenAI is selected and transmitted through `X-OpenAI-Key` over HTTPS.

The server remains stateless with respect to user secrets, reducing compromise blast radius.

7.2 Auth recovery in the API client layer

Header injection is centralized in `apiFetch`, which always attaches `X-Model-Provider` for routing consistency.

`401` responses trigger immediate local key clearing and dispatch a settings recovery event.

Settings reopen with explicit error copy so remediation happens without losing broader chat context.

  • Stale credentials are cleared automatically after unauthorized responses.
  • Groq mode remains key-free for frictionless onboarding.
  • Provider preference state and key state are separated to avoid accidental lockouts.
Chapter 8: Frontend Runtime and Interaction Design
Streaming-first UI state machine for evidence-driven chat

8.1 Readiness gates and workflow safety

Message submission remains disabled until prerequisites are satisfied: provider state, key state, and document readiness.

Blocked reasons are rendered in the composer so users can resolve constraints without ambiguity.

`New Session` remains hidden until the current session is used, preventing redundant state-reset actions.

8.2 Manual SSE parsing for deterministic rendering

The chat component reads `ReadableStream` directly and decodes incremental bytes with `TextDecoder`.

SSE blocks are parsed on `\n\n` boundaries with support for multi-line `data:` payloads.

Incoming token chunks append to the active assistant message for low-latency typing behavior without polling.

  • Typing indicators switch off after first-byte token receipt.
  • Assistant messages finalize on the `final` event payload.
  • Citations stay attached to the same completed assistant turn for traceability.
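The frontend performs this parsing in TypeScript; the parsing rule itself is sketched language-neutrally in Python below. Per the SSE convention, multiple `data:` lines within one block join with a newline (the simple `lstrip` is a simplification of the spec's single-space strip):

```python
def parse_sse_events(raw: str) -> list[str]:
    # Blocks are delimited by a blank line; each block may carry
    # one or more "data:" lines that form a single event payload.
    events = []
    for block in raw.split("\n\n"):
        data_lines = [line[5:].lstrip()
                      for line in block.split("\n")
                      if line.startswith("data:")]
        if data_lines:
            events.append("\n".join(data_lines))
    return events
```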

8.3 Cross-pane citation synchronization

`activeCitation` is lifted to the dashboard container and injected into both chat and document panes.

Citation badges under assistant messages map directly to source chunk cards in the right panel.

This split-pane interaction keeps provenance visible at the exact point of user intent.

Chapter 9: Resilience, Performance, and Operational Discipline
Hardening failure paths while keeping latency and quality predictable

9.1 Vendor-specific failure mapping

Provider-specific exceptions are treated as first-class runtime states: Groq pressure as `429` and OpenAI auth issues as `401`.

Backend semantics are paired with frontend recovery: invalid OpenAI credentials clear and reopen settings, while Groq limits preserve context and suggest retry or provider switch.

Explicit failure semantics are favored over silent degradation to preserve operator and user trust.

9.2 Throughput optimization with asynchronous boundaries

CPU-heavy operations such as PDF extraction and retrieval writes are offloaded to `asyncio.to_thread` to preserve API responsiveness.

The first stream event is pre-read before continuous yielding so immediate provider failures surface deterministically.

Settings and embedding models are cached where safe to reduce repeated hot-path initialization overhead.

  • Provider-aware fallback paths reduce operational dead-ends.
  • Session continuity is preserved across provider switches and retries.
  • Latency remains stable through chunked streaming and incremental rendering.
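The offloading pattern in minimal form (`extract_text_sync` is a hypothetical stand-in for the PyMuPDF extraction step):

```python
import asyncio

def extract_text_sync(pdf_bytes: bytes) -> str:
    # Stand-in for CPU-heavy PyMuPDF page extraction.
    return f"{len(pdf_bytes)} bytes extracted"

async def handle_upload(pdf_bytes: bytes) -> str:
    # Blocking work runs in a worker thread via asyncio.to_thread, so the
    # event loop keeps serving other requests during heavy ingestion.
    return await asyncio.to_thread(extract_text_sync, pdf_bytes)
```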

9.3 Repeatable quality evaluation loops

A scripted evaluation loop runs over a golden dataset and scores faithfulness plus answer relevance with structured judge outputs.

Quality is measured end-to-end rather than by isolated component metrics because retrieval and generation are tightly coupled in production.

This evaluation loop acts as release discipline so architecture changes improve business outcomes, not just internal code aesthetics.
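The loop shape, sketched with hypothetical callables: `answer_fn` runs the full RAG pipeline and `judge_fn` returns structured scores. Both names and the score fields are placeholders, not DocuMind APIs:

```python
def run_eval(golden_dataset: list[dict], answer_fn, judge_fn) -> list[dict]:
    # End-to-end scoring: each case exercises retrieval plus generation
    # together, matching how quality is experienced in production.
    results = []
    for case in golden_dataset:
        answer = answer_fn(case["question"])
        scores = judge_fn(case["question"], answer, case["reference"])
        results.append({"question": case["question"],
                        "faithfulness": scores["faithfulness"],
                        "answer_relevance": scores["answer_relevance"]})
    return results
```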

9.4 Durable product-system framing

DocuMind is designed to convert general LLM capability into reliable document-intelligence throughput for real teams.

Retrieval rigor, provider abstraction, stateless security, and streaming UX are combined to maintain trust under enterprise load.

The system is intentionally built as a platform architecture rather than a single-model demo so providers, prompts, and retrieval policies can evolve without replatforming.