DocuMind is implemented as a highly decoupled RAG platform. Retrieval, embeddings, generation routing, security boundaries, and recovery logic are separated intentionally so the platform can evolve across model vendors without replatforming.
Summary: DocuMind orchestrates ingestion, retrieval, reformulation, generation, and citation rendering as one deterministic runtime pipeline.
What Was Built: The implementation combines strict endpoint contracts, streaming SSE semantics, typed payloads, and cross-pane citation linking so every answer remains traceable.
Summary: BM25 and dense retrieval are fused because enterprise queries fail across both lexical and semantic axes.
What Was Built: Hybrid retrieval, rank fusion, persistent vectors, and startup hydration of lexical memory were implemented as one coherent retrieval system.
Summary: Provider routing remains abstract while user credential control stays local through BYOK.
What Was Built: The security model includes provider headers, conditional key injection, automatic 401 key invalidation, and settings-driven remediation UX.
PDF ingestion starts at `/api/v1/documents/upload`, where provider context is validated before file bytes are processed.
Every request carries `X-Model-Provider`, and FastAPI dependency resolution enforces provider constraints early in the lifecycle.
Missing filenames and empty payloads are rejected with explicit HTTP errors to keep client behavior deterministic.
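A minimal sketch of this contract, assuming FastAPI's standard `Header`/`Depends` machinery; the handler shape and error copy are illustrative, not the actual implementation:

```python
from fastapi import APIRouter, Depends, File, Header, HTTPException, UploadFile

router = APIRouter(prefix="/api/v1/documents")

def get_model_provider(x_model_provider: str = Header(...)) -> str:
    # Stubbed here; the full dependency is sketched in the provider-routing section.
    if x_model_provider not in {"groq", "openai"}:
        raise HTTPException(status_code=400, detail="Unsupported provider.")
    return x_model_provider

@router.post("/upload")
async def upload_document(
    file: UploadFile = File(...),
    provider: str = Depends(get_model_provider),  # header check runs before body work
):
    if not file.filename:
        raise HTTPException(status_code=400, detail="A filename is required.")
    payload = await file.read()
    if not payload:
        raise HTTPException(status_code=400, detail="Uploaded file is empty.")
    # ... hand payload to the ingestion pipeline ...
    return {"status": "accepted", "filename": file.filename, "provider": provider}
```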
Chat requests are handled through `/api/v1/chat/stream` with a stable `session_id` and a `query` payload.
Follow-up turns are reformulated against existing chat history before retrieval, then passed into generation.
Responses are emitted as SSE events so token rendering begins immediately instead of waiting for full completion.
The backend tracks the finalized assistant answer during streaming and appends user/assistant turns to `session_store` after completion.
This ordering ensures next-turn reformulation sees stable finalized content rather than partial token buffers.
Stream lifecycle behavior is treated as a strict backend/frontend contract to avoid hidden side effects.
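A hypothetical sketch of that contract: the `generate_tokens` stub stands in for the real reformulation-retrieval-generation chain, and real `data:` payloads would likely be JSON-encoded rather than raw text:

```python
from fastapi import APIRouter
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

router = APIRouter(prefix="/api/v1")
session_store: dict[str, list[dict]] = {}  # session_id -> finalized turns

class ChatRequest(BaseModel):
    session_id: str
    query: str

async def generate_tokens(req: ChatRequest):
    # Stand-in for reformulation + retrieval + generation.
    for token in ("Per ", "clause ", "12.3 ..."):
        yield token

@router.post("/chat/stream")
async def chat_stream(req: ChatRequest):
    async def event_source():
        answer_parts: list[str] = []
        async for token in generate_tokens(req):
            answer_parts.append(token)
            yield f"data: {token}\n\n"  # one SSE frame per token
        # Persist turns only after completion so the next turn's
        # reformulation sees finalized content, never partial buffers.
        history = session_store.setdefault(req.session_id, [])
        history.append({"role": "user", "content": req.query})
        history.append({"role": "assistant", "content": "".join(answer_parts)})

    return StreamingResponse(event_source(), media_type="text/event-stream")
```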
BM25 handles strict lexical anchors such as exact clauses, IDs, and policy terms that dense search may miss.
Dense vector retrieval captures conceptual similarity when query phrasing diverges from source wording.
Combining both branches improves recall while preserving precision under enterprise document variability.
Direct score mixing is avoided because BM25 and dense scores operate on different numeric scales.
`EnsembleRetriever` with balanced weights provides position-aware ranking that is less fragile than raw-score blending.
Rank-based fusion remains stable as corpora, tokenization behavior, and model characteristics evolve.
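A sketch of this fusion with LangChain's `EnsembleRetriever`; the sample chunk, `k` value, and import paths (current partner packages) are assumptions, while the balanced weights follow the text:

```python
from langchain.retrievers import EnsembleRetriever
from langchain_chroma import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings

chunks = [Document(
    page_content="Clause 12.3: either party may terminate with 30 days notice.",
    metadata={"source": "msa.pdf", "page": 4},
)]

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
dense = Chroma.from_documents(chunks, embeddings).as_retriever(search_kwargs={"k": 5})
bm25 = BM25Retriever.from_documents(chunks)  # lexical branch, memory-resident

# Rank fusion rather than raw-score blending: positions, not scales, decide.
hybrid = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.5, 0.5])
docs = hybrid.invoke("termination notice period clause 12.3")
```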
Vectors are persisted to Chroma on disk, and BM25 memory is rebuilt from persisted records during app lifespan startup.
Because BM25 is memory-resident in this design, hydration is required to preserve retrieval continuity after restarts.
Hybrid retrieval quality depends on both branches being warm and synchronized.
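One plausible shape for the hydration step, assuming a FastAPI lifespan hook and current LangChain partner packages; the `CHROMA_PERSIST_DIR` default is illustrative:

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI
from langchain_chroma import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings

CHROMA_PERSIST_DIR = "./chroma_db"  # assumed default; configurable in settings

@asynccontextmanager
async def lifespan(app: FastAPI):
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    vector_store = Chroma(persist_directory=CHROMA_PERSIST_DIR,
                          embedding_function=embeddings)
    records = vector_store.get()  # read every persisted chunk back out
    docs = [Document(page_content=text, metadata=meta or {})
            for text, meta in zip(records["documents"], records["metadatas"])]
    app.state.vector_store = vector_store
    app.state.bm25 = BM25Retriever.from_documents(docs) if docs else None
    yield  # serve requests with both branches warm and aligned

app = FastAPI(lifespan=lifespan)
```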
PDF files are parsed with PyMuPDF (`fitz`) page-by-page so extracted content remains tied to page boundaries.
Empty pages are skipped, and metadata fields such as `source` and `page` are normalized for downstream consistency.
Each page becomes a typed LangChain `Document`, which keeps later pipeline stages schema-safe.
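A minimal sketch of the parsing step with PyMuPDF; the function name is illustrative, and the metadata keys follow the text:

```python
import fitz  # PyMuPDF
from langchain_core.documents import Document

def parse_pdf(path: str, filename: str) -> list[Document]:
    pages: list[Document] = []
    with fitz.open(path) as pdf:
        for page_number, page in enumerate(pdf, start=1):
            text = page.get_text().strip()
            if not text:  # skip empty pages
                continue
            pages.append(Document(
                page_content=text,
                metadata={"source": filename, "page": page_number},
            ))
    return pages
```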
Chunking uses `RecursiveCharacterTextSplitter` with a configurable `CHUNK_SIZE` and a fixed overlap of `150` characters.
Overlap preserves clause continuity across boundaries and reduces edge-loss for legal or specification-heavy documents.
Deterministic chunk topology is essential because unstable chunking degrades both retrieval quality and citation trust.
Each chunk receives a durable `chunk_id`, then is written into Chroma with explicit IDs for repeatable ingestion behavior.
Vector persistence remains configurable via `CHROMA_PERSIST_DIR`.
The same chunk corpus is kept in BM25 memory so lexical and semantic retrieval operate on aligned evidence units.
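Continuing the sketches above (`pages` from the parser, `vector_store` as hydrated at startup), a plausible chunk-and-index step; the `chunk_id` scheme and `CHUNK_SIZE` default are assumptions:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

CHUNK_SIZE = 1000  # assumed default; the real value comes from settings

splitter = RecursiveCharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=150)
chunks = splitter.split_documents(pages)

# Durable, deterministic IDs make re-ingestion idempotent in Chroma.
for i, chunk in enumerate(chunks):
    chunk.metadata["chunk_id"] = (
        f"{chunk.metadata['source']}-p{chunk.metadata['page']}-c{i}"
    )

ids = [c.metadata["chunk_id"] for c in chunks]
vector_store.add_documents(chunks, ids=ids)  # explicit IDs: repeatable ingestion
```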
Retrieved chunks are formatted into numbered context blocks containing `document_id`, `filename`, and `page_number`.
Metadata is made explicit so citation tool output can match source attributes exactly.
This formatted context is the core bridge between retrieval truth and generation behavior.
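An illustrative formatter; the exact block layout is an assumption, but the field names (`document_id`, `filename`, `page_number`) come from the text:

```python
from langchain_core.documents import Document

def format_context(chunks: list[Document]) -> str:
    """Render retrieved chunks as numbered blocks with explicit metadata."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        m = chunk.metadata
        blocks.append(
            f"[{i}] document_id={m.get('document_id')} "
            f"filename={m.get('source')} page_number={m.get('page')}\n"
            f"{chunk.page_content}"
        )
    return "\n\n".join(blocks)
```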
Multi-turn input is reformulated into a standalone retrieval query when history exists.
Retrieval uses the reformulated query, while generation uses the original user question plus chat history.
This separation keeps retrieval precision and conversational coherence aligned without coupling the two concerns.
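A sketch of that split, assuming a LangChain prompt-pipe pattern; the prompt wording and helper name are illustrative:

```python
from langchain_core.prompts import ChatPromptTemplate

REFORMULATE = ChatPromptTemplate.from_messages([
    ("system",
     "Rewrite the latest user question as a standalone search query. "
     "Resolve pronouns and references using the chat history. "
     "Return only the rewritten query."),
    ("placeholder", "{history}"),
    ("human", "{question}"),
])

async def retrieval_query(llm, history: list, question: str) -> str:
    if not history:  # first turn needs no rewrite
        return question
    result = await (REFORMULATE | llm).ainvoke(
        {"history": history, "question": question}
    )
    return result.content  # feeds retrieval; generation still sees the original
```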
System prompts prohibit outside knowledge and require citation metadata to match retrieved context exactly.
Pydantic response models enforce typed outputs for non-streaming mode, including citation schema validation.
`source_text` remains verbatim so claims can be audited directly against the uploaded document.
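A plausible shape for those models; field names beyond `document_id`, `filename`, `page_number`, and `source_text` mentioned in the text are assumptions:

```python
from pydantic import BaseModel, Field

class Citation(BaseModel):
    document_id: str
    filename: str
    page_number: int
    source_text: str = Field(description="Verbatim excerpt from the cited chunk")

class ChatAnswer(BaseModel):
    answer: str
    # Must stay empty when context is insufficient (see the policy below).
    citations: list[Citation] = Field(default_factory=list)
```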
During streaming generation, a `CitationTool` binding is attached to the model, and token text and tool-call fragments are accumulated concurrently.
Tool-call JSON buffers are parsed at stream completion and emitted in a typed final payload.
Single-pass streaming reduces latency and token overhead while preserving structured provenance.
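A sketch of that single pass, assuming LangChain's streamed `AIMessageChunk`s expose `tool_call_chunks`; the event dict shapes and the tool-argument envelope are illustrative:

```python
import json

async def stream_with_citations(llm_with_tools, messages):
    # Stream visible tokens while buffering CitationTool argument
    # fragments, then parse the buffer once at stream end.
    text_parts: list[str] = []
    tool_buffer = ""
    async for chunk in llm_with_tools.astream(messages):
        if chunk.content:
            text_parts.append(chunk.content)
            yield {"type": "token", "data": chunk.content}
        for fragment in chunk.tool_call_chunks:
            tool_buffer += fragment.get("args") or ""
    payload = json.loads(tool_buffer) if tool_buffer else {}
    yield {
        "type": "final",
        "answer": "".join(text_parts),
        "citations": payload.get("citations", []),
    }
```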
Prompt rules require explicit context-insufficiency statements when evidence is missing.
Citation arrays must be empty in that state to prevent fabricated provenance.
This policy favors reliability over speculative completion in enterprise usage.
Provider selection is routed through `X-Model-Provider` with allowed values `groq` and `openai`.
Unsupported values are rejected with `400`, and checks remain centralized in dependency logic.
This keeps control-plane behavior consistent across endpoints.
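A minimal version of that dependency; pairing it with the conditional `X-OpenAI-Key` check is an assumption based on the BYOK flow described below:

```python
from fastapi import Header, HTTPException

ALLOWED_PROVIDERS = {"groq", "openai"}

def get_model_provider(
    x_model_provider: str = Header(...),
    x_openai_key: str | None = Header(default=None),
) -> tuple[str, str | None]:
    """Single control-plane check shared by every endpoint via Depends()."""
    if x_model_provider not in ALLOWED_PROVIDERS:
        raise HTTPException(
            status_code=400,
            detail=f"Unsupported provider '{x_model_provider}'.",
        )
    if x_model_provider == "openai" and not x_openai_key:
        raise HTTPException(status_code=400, detail="X-OpenAI-Key header required.")
    return x_model_provider, x_openai_key
```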
`get_llm` constructs provider-specific model clients while endpoint logic remains provider-neutral.
Current runtime defaults are `gpt-4o-mini` for OpenAI and `llama3-8b-8192` for Groq, both run at temperature `0`.
Constructor isolation reduces blast radius when adding or changing providers.
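A sketch of such a factory; the constructor calls use the model names and temperature from the text, while the key-handling details are illustrative:

```python
from langchain_groq import ChatGroq
from langchain_openai import ChatOpenAI

def get_llm(provider: str, openai_key: str | None = None):
    if provider == "openai":
        return ChatOpenAI(model="gpt-4o-mini", temperature=0, api_key=openai_key)
    if provider == "groq":
        # Server-side Groq key is read from the environment, not from users.
        return ChatGroq(model="llama3-8b-8192", temperature=0)
    raise ValueError(f"Unknown provider: {provider}")
```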
Groq is configured as the default for low-friction onboarding and low-latency first-token behavior.
OpenAI remains available through BYOK for teams that require quota ownership and governance controls.
This dual-mode architecture is a deliberate anti-lock-in strategy, not a temporary compatibility layer.
| Dimension | Groq (Default) | OpenAI (BYOK) |
|---|---|---|
| Primary Advantage | Fast onboarding and very low latency. | User-controlled quota and policy governance. |
| Operational Risk | Strict free-tier rate limits. | Invalid or expired user credentials. |
| UX Recovery Path | Retry guidance or provider switch after `429`. | Key reset and settings reopen after `401`. |
Indexing uses `HuggingFaceEmbeddings` (`all-MiniLM-L6-v2`) with `lru_cache` to reduce repeated initialization overhead.
Embeddings remain decoupled from generation providers so model switching does not trigger re-indexing.
This separation treats document memory as stable infrastructure while generation providers remain replaceable.
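A minimal sketch of the cached factory, assuming the `langchain_huggingface` integration package:

```python
from functools import lru_cache
from langchain_huggingface import HuggingFaceEmbeddings

@lru_cache(maxsize=1)
def get_embeddings() -> HuggingFaceEmbeddings:
    # One model instance per process; repeated ingestion and retrieval
    # calls skip re-initialization entirely.
    return HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
```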
Vector and BM25 corpora remain stable while providers are toggled, preserving retrieval consistency on the same document set.
Expensive re-embedding loops are avoided during active sessions, keeping latency focused on retrieval plus generation.
Provider switching does not reset document intelligence, reducing both cost and user friction.
User OpenAI keys remain in browser local storage and are not persisted in backend databases.
Keys are injected only when OpenAI is selected and transmitted through `X-OpenAI-Key` over HTTPS.
The server remains stateless with respect to user secrets, reducing compromise blast radius.
Header injection is centralized in `apiFetch`, which always attaches `X-Model-Provider` for routing consistency.
`401` responses trigger immediate local key clearing and dispatch a settings recovery event.
Settings reopen with explicit error copy so remediation happens without losing broader chat context.
Message submission remains disabled until prerequisites are satisfied: provider state, key state, and document readiness.
Blocked reasons are rendered in the composer so users can resolve constraints without ambiguity.
`New Session` remains hidden until the current session is used, preventing redundant state-reset actions.
The chat component reads `ReadableStream` directly and decodes incremental bytes with `TextDecoder`.
SSE blocks are parsed on `\n\n` boundaries with support for multi-line `data:` payloads.
Incoming token chunks append to the active assistant message for low-latency typing behavior without polling.
`activeCitation` is lifted to the dashboard container and injected into both chat and document panes.
Citation badges under assistant messages map directly to source chunk cards in the right panel.
This split-pane interaction keeps provenance visible at the exact point of user intent.
Provider-specific exceptions are treated as first-class runtime states: Groq rate-limit pressure surfaces as `429` and OpenAI authentication failures as `401`.
Backend semantics are paired with frontend recovery: invalid OpenAI credentials clear and reopen settings, while Groq limits preserve context and suggest retry or provider switch.
Explicit failure semantics are favored over silent degradation to preserve operator and user trust.
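An illustrative mapping from provider exceptions to those HTTP semantics; the status-code introspection and message copy are assumptions, not the actual handler:

```python
from fastapi import HTTPException

def map_provider_error(exc: Exception, provider: str) -> HTTPException:
    status = getattr(exc, "status_code", None)  # most SDK errors carry one
    if provider == "groq" and status == 429:
        return HTTPException(429, "Groq rate limit hit. Retry shortly or switch provider.")
    if provider == "openai" and status == 401:
        return HTTPException(401, "OpenAI key rejected. Reset your key in Settings.")
    return HTTPException(502, f"Upstream provider error: {exc}")
```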
CPU-heavy operations such as PDF extraction and retrieval writes are offloaded to `asyncio.to_thread` to preserve API responsiveness.
The first stream event is pre-read before continuous yielding so immediate provider failures surface deterministically.
Settings and embedding models are cached where safe to reduce repeated hot-path initialization overhead.
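A sketch of both patterns; `parse_pdf` and `index_chunks` are hypothetical helpers mirroring the ingestion sketches above, and the pre-read wrapper is illustrative:

```python
import asyncio

async def ingest(path: str, filename: str) -> None:
    # CPU-bound extraction and index writes run off the event loop.
    pages = await asyncio.to_thread(parse_pdf, path, filename)
    await asyncio.to_thread(index_chunks, pages)

async def safe_stream(token_stream):
    # Pre-read one event: an immediate provider failure raises here,
    # where it can still become a clean HTTP error response.
    iterator = token_stream.__aiter__()
    first = await iterator.__anext__()  # StopAsyncIteration if stream is empty

    async def rest():
        yield first
        async for token in iterator:
            yield token

    return rest()
```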
A scripted evaluation loop runs over a golden dataset and scores faithfulness plus answer relevance with structured judge outputs.
Quality is measured end-to-end rather than by isolated component metrics because retrieval and generation are tightly coupled in production.
This evaluation loop acts as release discipline so architecture changes improve business outcomes, not just internal code aesthetics.
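A hedged sketch of such a loop; the JSONL golden-dataset format and the injected `ask`/`judge` callables are assumptions about how the script is wired:

```python
import json

def evaluate(golden_path: str, ask, judge) -> dict:
    """Run the full RAG pipeline over a golden set and average judge scores."""
    with open(golden_path) as f:
        cases = [json.loads(line) for line in f]  # {"query": ..., "reference": ...}
    scores = []
    for case in cases:
        answer, context = ask(case["query"])  # end-to-end: retrieval + generation
        verdict = judge(answer=answer, context=context, reference=case["reference"])
        scores.append(verdict)  # e.g. {"faithfulness": 0.9, "relevance": 0.8}
    n = len(scores)
    return {k: sum(s[k] for s in scores) / n for k in ("faithfulness", "relevance")}
```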
DocuMind is designed to convert general LLM capability into reliable document-intelligence throughput for real teams.
Retrieval rigor, provider abstraction, stateless security, and streaming UX are combined to maintain trust under enterprise load.
The system is intentionally built as a platform architecture rather than a single-model demo so providers, prompts, and retrieval policies can evolve without replatforming.