How the RAG Pipeline Works
This page documents the full RAG pipeline — from document ingestion through vector search to citation extraction. Each stage includes the relevant code paths and architectural decisions.
Ingestion Pipeline
The ingestion pipeline converts raw documents into searchable, embedded chunks stored in PostgreSQL with pgvector.
Step 1: Scan Documents
The ingestion entry point (src/lib/rag/ingest.ts) recursively scans the RAG/Data/ directory for .pdf and .docx files. Each file is hashed (SHA-256) to detect changes — if a document’s hash matches an existing RagDocument record, it is skipped.
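The change-detection step can be sketched as follows. This is a minimal illustration, not the actual `ingest.ts` code; `shouldIngest` and the `existingHashes` map are hypothetical names for the lookup against stored `RagDocument.contentHash` values.

```typescript
import { createHash } from "crypto";

// SHA-256 over the file's raw bytes; if this matches the stored
// RagDocument.contentHash, the document is unchanged.
function contentHash(bytes: Buffer): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Hypothetical helper: existingHashes maps file path -> stored hash.
// The caller is assumed to have read the file into a Buffer already.
function shouldIngest(
  filePath: string,
  bytes: Buffer,
  existingHashes: Map<string, string>,
): boolean {
  return existingHashes.get(filePath) !== contentHash(bytes);
}
```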
Step 2: Extract Text
Text extraction (src/lib/rag/extract.ts) uses format-specific libraries:
- PDF: `pdfjs-dist` for text extraction with page boundary detection.
- DOCX: `mammoth` for converting Word documents to plain text while preserving paragraph structure.
Step 3: Detect Category and Language
The directory structure determines the document’s category:
- `RAG/Data/TF-CBT/` → `TF_CBT` category
- `RAG/Data/TF-VAWG/` → `TF_VAWG` category
- `RAG/Data/General/` → `GENERAL` category
Language is detected from filename conventions or defaults to English.
Step 4: Structure-Aware Chunking
The chunker (src/lib/rag/chunk.ts) splits extracted text into segments of 500–800 tokens with 100-token overlap. Unlike naive fixed-size splitting, it is structure-aware:
- Section header detection: Recognizes markdown-style headings, numbered sections, and all-caps headers. Chunk boundaries prefer to align with section breaks.
- Paragraph preservation: Avoids splitting mid-paragraph where possible.
- Overlap: 100-token overlap between consecutive chunks ensures that no information is lost at boundaries.
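The overlap mechanics can be sketched as below. This is a simplified illustration, not the structure-aware chunker itself: tokens are approximated by pre-split words, and the section/paragraph boundary preferences are omitted.

```typescript
// Simplified sliding-window chunker: emit windows of up to `maxTokens`
// tokens, stepping back `overlap` tokens so consecutive chunks share
// content at their boundary. The real chunker additionally prefers to
// cut at section headers and paragraph breaks within the 500-800 range.
function chunkTokens(tokens: string[], maxTokens = 800, overlap = 100): string[][] {
  const chunks: string[][] = [];
  let start = 0;
  while (start < tokens.length) {
    const end = Math.min(start + maxTokens, tokens.length);
    chunks.push(tokens.slice(start, end));
    if (end === tokens.length) break;
    start = end - overlap; // step back so adjacent chunks overlap by 100 tokens
  }
  return chunks;
}
```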
Step 5: Generate Embeddings
The embedder (src/lib/rag/embed.ts) converts each chunk’s text content into a 1,536-dimensional vector using OpenAI’s text-embedding-3-small model. Embeddings are generated in batches of 50 to stay within API rate limits.
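The batching logic amounts to slicing the chunk list into groups of 50 and issuing one embeddings request per group. The sketch below is illustrative: `embedBatch` stands in for the actual call to OpenAI's embeddings endpoint, which `src/lib/rag/embed.ts` wraps.

```typescript
const BATCH_SIZE = 50;

// Split a list into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size = BATCH_SIZE): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// One API call per batch; `embedBatch` is a placeholder for the real
// text-embedding-3-small request, returning one 1536-d vector per input.
async function embedAll(
  texts: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>,
): Promise<number[][]> {
  const out: number[][] = [];
  for (const batch of toBatches(texts)) {
    out.push(...(await embedBatch(batch)));
  }
  return out;
}
```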
Step 6: Store and Index
The vector store (src/lib/rag/vector-store.ts) performs the following:
- Creates the `pgvector` extension if not present.
- Adds the `embedding` column (`vector(1536)`) to the `RagChunk` table via raw SQL (Prisma does not natively support vector types).
- Upserts `RagDocument` and `RagChunk` records.
- Builds an IVFFlat index for fast approximate nearest-neighbor search.
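The setup steps above correspond to raw SQL along these lines. This is a hedged sketch of what `initializeVectorStore()` issues, not the verbatim statements; the index's `lists` parameter and the use of cosine distance are assumptions.

```sql
-- Enable pgvector (no-op if already installed).
CREATE EXTENSION IF NOT EXISTS vector;

-- Add the embedding column Prisma cannot model natively.
ALTER TABLE "RagChunk" ADD COLUMN IF NOT EXISTS embedding vector(1536);

-- Approximate nearest-neighbor index; `lists` is tuned to corpus size.
CREATE INDEX IF NOT EXISTS rag_chunk_embedding_idx
  ON "RagChunk" USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
```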
Query-Time Retrieval
When a teacher requests story generation, the RAG system retrieves relevant evidence to inject into the LLM prompt.
Step 1: Query Construction
The teacher’s topic selection and learning objectives are combined into a search query. For example, a course about “cyberbullying on social media” with objectives about “recognizing warning signs” and “reporting mechanisms” produces a query that captures both the threat type and the desired educational outcomes.
Step 2: Embed the Query
The same embedding model (text-embedding-3-small) converts the query into a 1,536-dimensional vector, ensuring it lives in the same vector space as the document chunks.
Step 3: Hybrid Search
The retrieval module (src/lib/rag/retrieve.ts) runs two parallel searches:
| Search Type | Method | Strengths |
|---|---|---|
| Semantic | Cosine similarity between query embedding and chunk embeddings | Finds conceptually related content even with different wording |
| Keyword | PostgreSQL full-text search on chunk content | Catches exact terms, acronyms, and proper nouns that embeddings may miss |
Reciprocal Rank Fusion (RRF) merges the two ranked result lists into a single ranking. RRF assigns each result a score based on its rank position in each list — score = 1 / (k + rank) — and sums the scores across lists. Results that rank highly in both lists receive the highest combined scores.
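A minimal RRF merge looks like this. The function is a sketch rather than the `retrieve.ts` implementation; the default k = 60 is the constant commonly used with RRF and is an assumption here.

```typescript
// Merge two ranked lists of chunk IDs with Reciprocal Rank Fusion:
// each appearance contributes 1 / (k + rank), and contributions are
// summed before re-sorting by combined score.
function rrfMerge(semantic: string[], keyword: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of [semantic, keyword]) {
    list.forEach((id, i) => {
      const rank = i + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A result that is rank 2 in one list and rank 1 in the other outscores a result that is rank 1 in only one list, which is exactly the "highly in both lists" behavior described above.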
Step 4: Quality Boost
The quality scorer (src/lib/rag/quality.ts) adjusts chunk relevance scores based on content quality signals:
| Signal | Boost | Rationale |
|---|---|---|
| Content density | Higher score for information-rich text | Filters out tables of contents, headers-only chunks |
| Statistics presence | Boost for chunks containing numbers and data | Prioritizes evidence with concrete data points |
| Content type | Category-specific weighting | TF-CBT chunks boosted for therapy-related queries |
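The signals in the table can be combined multiplicatively, roughly as below. The thresholds and weights are hypothetical illustrations; the actual heuristics live in `src/lib/rag/quality.ts`.

```typescript
interface ScoredChunk {
  content: string;
  category: "TF_CBT" | "TF_VAWG" | "GENERAL";
  score: number; // relevance score from hybrid search
}

// Hypothetical quality booster: penalize sparse chunks, reward chunks
// with concrete numbers, and up-weight category matches. All constants
// here are illustrative assumptions.
function qualityBoost(chunk: ScoredChunk, queryCategory: string): number {
  let boost = 1.0;
  const words = chunk.content.split(/\s+/).length;
  if (words < 20) boost *= 0.5;               // content density: demote ToC / bare headers
  if (/\d/.test(chunk.content)) boost *= 1.1; // statistics presence: favor data points
  if (chunk.category === queryCategory) boost *= 1.2; // content type: category match
  return chunk.score * boost;
}
```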
Step 5: Format as Evidence Context
The context formatter (src/lib/rag/format-context.ts) takes the top-K ranked chunks and produces the EVIDENCE_CONTEXT block injected into the LLM prompt:
- Group by document: Chunks from the same source document are grouped together.
- Assign citation keys: Each document group gets a sequential key — `[S1]`, `[S2]`, `[S3]`, etc.
- Include metadata: Document title, organization, category, and source URL.
- Token budget: The total context is capped at approximately 3,000 tokens to leave room for the story generation prompt.
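The grouping, key assignment, and budget cap can be sketched together. This is an illustrative simplification of `format-context.ts`: the `Chunk` shape is reduced to two fields, and tokens are crudely approximated as whitespace-separated words.

```typescript
interface Chunk { docTitle: string; content: string; }

// Group chunks by source document, assign [S1], [S2], ... keys in order,
// and stop adding groups once the token budget would be exceeded.
function formatEvidence(chunks: Chunk[], tokenBudget = 3000): string {
  const byDoc = new Map<string, string[]>();
  for (const c of chunks) {
    if (!byDoc.has(c.docTitle)) byDoc.set(c.docTitle, []);
    byDoc.get(c.docTitle)!.push(c.content);
  }
  let used = 0;
  let key = 1;
  const blocks: string[] = [];
  for (const [title, texts] of byDoc) {
    const block = `[S${key}] ${title}\n${texts.join("\n")}`;
    const tokens = block.split(/\s+/).length; // crude token estimate
    if (used + tokens > tokenBudget) break;   // respect the ~3,000-token cap
    used += tokens;
    blocks.push(block);
    key++;
  }
  return blocks.join("\n\n");
}
```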
Citation Flow
After the LLM generates a story, citations are extracted and stored for display.
Step 1: LLM Generates with Source Markers
The story generation prompt instructs the LLM to tag factual claims with [SOURCE:S1], [SOURCE:S2], etc. — matching the citation keys from the evidence context. For example:
Cyberbullying affects 1 in 3 young people worldwide [SOURCE:S1].
The 5-4-3-2-1 grounding technique can help manage anxiety [SOURCE:S3].

Step 2: Extract and Map Citations
The citation extractor (src/lib/rag/citations.ts) parses the generated Twee source for [SOURCE:Sn] patterns, then maps each key back to the corresponding RagChunk and RagDocument records. This produces a structured citation object with:
- Document title and source organization
- Category (TF-CBT, TF-VAWG, General)
- Original chunk text (the specific evidence used)
- Source URL (link to the original document)
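The parsing pass is essentially a regex scan. The sketch below shows only the key-extraction half (deduplicated in order of first appearance); mapping each key back to its `RagChunk` and `RagDocument` records is a database lookup omitted here.

```typescript
// Pull [SOURCE:Sn] citation keys out of generated Twee source,
// keeping each key once, in order of first appearance.
function extractCitationKeys(twee: string): string[] {
  const keys: string[] = [];
  for (const match of twee.matchAll(/\[SOURCE:(S\d+)\]/g)) {
    if (!keys.includes(match[1])) keys.push(match[1]);
  }
  return keys;
}
```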
Step 3: Store Citations
Each citation is saved as a RagCitation record in the database, linking the course to the specific chunks that informed its content. This creates an auditable trail from generated content back to source evidence.
Database Models
The RAG system uses five Prisma models:
| Model | Purpose | Key Fields |
|---|---|---|
| RagDocument | Source document metadata | title, sourceUrl, category, status, contentHash |
| RagChunk | Individual text segments with embeddings | content, embedding (vector 1536), tokenCount, sectionTitle |
| KnowledgeConcept | Hierarchical concept taxonomy | name, nameKo, category, parentId |
| RagConceptTag | Many-to-many chunk-to-concept links | chunkId, conceptId, confidence |
| RagCitation | Course-to-chunk citation records | courseId, chunkId, citationKey, context |
The embedding column on RagChunk is managed via raw SQL (vector(1536)) because Prisma does not natively support pgvector types. The initializeVectorStore() function in src/lib/rag/vector-store.ts handles creating the extension and column programmatically.
Key Architecture Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Embedding model | OpenAI text-embedding-3-small | Best cost/quality ratio for 1536-d; consistent with LLM provider abstraction |
| Vector index | IVFFlat (pgvector) | Good recall at low latency for < 1M vectors; no separate vector DB needed |
| Chunk size | 500–800 tokens, 100 overlap | Balances context completeness with embedding quality; overlap prevents boundary information loss |
| Search strategy | Hybrid (cosine + keyword + RRF) | Semantic search alone misses exact terms; keyword alone misses paraphrased content |
| Citation format | [SOURCE:Sn] markers in LLM output | Simple regex extraction; LLMs follow this pattern reliably |
| Storage | PostgreSQL with pgvector | Single database for all data (relational + vector); no infrastructure overhead of a separate vector DB |
| Batch size | 50 embeddings per API call | Stays within OpenAI rate limits while maintaining reasonable ingestion speed |