How the RAG Pipeline Works

This page documents the full RAG pipeline — from document ingestion through vector search to citation extraction. Each stage includes the relevant code paths and architectural decisions.

Ingestion Pipeline

The ingestion pipeline converts raw documents into searchable, embedded chunks stored in PostgreSQL with pgvector.

Step 1: Scan Documents

The ingestion entry point (src/lib/rag/ingest.ts) recursively scans the RAG/Data/ directory for .pdf and .docx files. Each file is hashed (SHA-256) to detect changes — if a document’s hash matches an existing RagDocument record, it is skipped.
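
The hash-based skip check can be sketched as follows. The `shouldIngest` helper and the in-memory hash set are illustrative assumptions; the real code looks up `RagDocument.contentHash` via Prisma.

```typescript
import { createHash } from "node:crypto";

// Compute the content hash used for change detection.
function sha256(content: Buffer | string): string {
  return createHash("sha256").update(content).digest("hex");
}

// `existingHashes` stands in for the database lookup against
// RagDocument.contentHash; an unchanged file is skipped.
function shouldIngest(content: Buffer, existingHashes: Set<string>): boolean {
  return !existingHashes.has(sha256(content));
}
```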

Step 2: Extract Text

Text extraction (src/lib/rag/extract.ts) uses format-specific libraries:

  • PDF: pdfjs-dist for text extraction with page boundary detection.
  • DOCX: mammoth for converting Word documents to plain text while preserving paragraph structure.

Step 3: Detect Category and Language

The directory structure determines the document’s category:

  • RAG/Data/TF-CBT/ → TF_CBT category
  • RAG/Data/TF-VAWG/ → TF_VAWG category
  • RAG/Data/General/ → GENERAL category

Language is detected from filename conventions or defaults to English.
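
A hypothetical sketch of the filename-based detection, assuming a `_ko` suffix convention (the actual conventions live in the ingestion code and may differ):

```typescript
// ASSUMPTION: a "_ko" filename suffix marks Korean documents; anything
// else falls back to the English default described above.
function detectLanguage(filename: string): string {
  const stem = filename.replace(/\.(pdf|docx)$/i, "");
  if (/_ko$/i.test(stem)) return "ko"; // e.g. guidelines_ko.pdf
  return "en"; // default to English
}
```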

Step 4: Structure-Aware Chunking

The chunker (src/lib/rag/chunk.ts) splits extracted text into segments of 500–800 tokens with 100-token overlap. Unlike naive fixed-size splitting, it is structure-aware:

  • Section header detection: Recognizes markdown-style headings, numbered sections, and all-caps headers. Chunk boundaries prefer to align with section breaks.
  • Paragraph preservation: Avoids splitting mid-paragraph where possible.
  • Overlap: 100-token overlap between consecutive chunks ensures that no information is lost at boundaries.
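
The sliding-window mechanics with overlap can be sketched as below. The real chunker is structure-aware; this shows only the window arithmetic, using a pre-tokenized array as a stand-in for the actual tokenizer.

```typescript
// Split a token array into windows of at most `maxTokens`, with each
// consecutive pair of windows sharing `overlap` tokens.
function chunkTokens(
  tokens: string[],
  maxTokens = 800,
  overlap = 100,
): string[][] {
  const chunks: string[][] = [];
  const step = maxTokens - overlap; // advancing by `step` leaves `overlap` shared
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + maxTokens));
    if (start + maxTokens >= tokens.length) break; // last window reached the end
  }
  return chunks;
}
```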

Step 5: Generate Embeddings

The embedder (src/lib/rag/embed.ts) converts each chunk’s text content into a 1,536-dimensional vector using OpenAI’s text-embedding-3-small model. Embeddings are generated in batches of 50 to stay within API rate limits.
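
The batching loop looks roughly like this. `embedBatch` stands in for the OpenAI embeddings call (model `text-embedding-3-small`) and is an assumed wrapper, not the project's actual signature.

```typescript
// Embed all texts in batches of `batchSize`, one API call per batch,
// to keep request volume within rate limits.
async function embedAll(
  texts: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>,
  batchSize = 50,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    vectors.push(...(await embedBatch(texts.slice(i, i + batchSize))));
  }
  return vectors;
}
```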

Step 6: Store and Index

The vector store (src/lib/rag/vector-store.ts) performs the following:

  1. Creates the pgvector extension if not present.
  2. Adds the embedding column (vector(1536)) to the RagChunk table via raw SQL (Prisma does not natively support vector types).
  3. Upserts RagDocument and RagChunk records.
  4. Builds an IVFFlat index for fast approximate nearest-neighbor search.
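
The setup steps above correspond to raw SQL of roughly this shape (illustrative statement text, not copied from `initializeVectorStore()`):

```typescript
// Approximate SQL run programmatically by the vector store; in the real
// code these would be executed via Prisma's raw query API.
const setupStatements = [
  `CREATE EXTENSION IF NOT EXISTS vector`,
  // Prisma cannot model pgvector columns, so the column is added by hand.
  `ALTER TABLE "RagChunk" ADD COLUMN IF NOT EXISTS embedding vector(1536)`,
  // IVFFlat trades exact recall for fast approximate nearest-neighbor search.
  `CREATE INDEX IF NOT EXISTS rag_chunk_embedding_idx
     ON "RagChunk" USING ivfflat (embedding vector_cosine_ops)`,
];
```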

Query-Time Retrieval

When a teacher requests story generation, the RAG system retrieves relevant evidence to inject into the LLM prompt.

Step 1: Query Construction

The teacher’s topic selection and learning objectives are combined into a search query. For example, a course about “cyberbullying on social media” with objectives about “recognizing warning signs” and “reporting mechanisms” produces a query that captures both the threat type and the desired educational outcomes.
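
In its simplest form, that combination is string concatenation (the function name and joining scheme here are illustrative, not the project's actual API):

```typescript
// Combine the topic and learning objectives into one search string so a
// single retrieval pass captures both the threat type and the outcomes.
function buildSearchQuery(topic: string, objectives: string[]): string {
  return [topic, ...objectives].join(". ");
}
```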

Step 2: Embed the Query

The same embedding model (text-embedding-3-small) converts the query into a 1,536-dimensional vector, ensuring it lives in the same vector space as the document chunks.

Step 3: Hybrid Search

The retrieval module (src/lib/rag/retrieve.ts) runs two parallel searches:

| Search Type | Method | Strengths |
| --- | --- | --- |
| Semantic | Cosine similarity between query embedding and chunk embeddings | Finds conceptually related content even with different wording |
| Keyword | PostgreSQL full-text search on chunk content | Catches exact terms, acronyms, and proper nouns that embeddings may miss |

Reciprocal Rank Fusion (RRF) merges the two ranked result lists into a single ranking. RRF works by assigning each result a score based on its rank position in each list: score = 1 / (k + rank). Results that appear highly in both lists get the highest combined scores.
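
A minimal RRF sketch matching the formula above, with ranks starting at 1 and the conventional k = 60 (the constant used in the actual retriever may differ):

```typescript
// Merge several ranked lists of chunk IDs into one ranking using
// Reciprocal Rank Fusion: each appearance contributes 1 / (k + rank).
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest combined score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A result ranked near the top of both lists ("b" below) outscores the single-list winners of either search.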

Step 4: Quality Boost

The quality scorer (src/lib/rag/quality.ts) adjusts chunk relevance scores based on content quality signals:

| Signal | Boost | Rationale |
| --- | --- | --- |
| Content density | Higher score for information-rich text | Filters out tables of contents and header-only chunks |
| Statistics presence | Boost for chunks containing numbers and data | Prioritizes evidence with concrete data points |
| Content type | Category-specific weighting | TF-CBT chunks boosted for therapy-related queries |
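
The first two signals can be sketched as simple multiplicative adjustments. The thresholds and weights below are illustrative assumptions, not the values in src/lib/rag/quality.ts.

```typescript
// Scale a base relevance score by crude content-quality signals:
// penalize very short (likely header-only) chunks, boost chunks
// that contain concrete numbers.
function applyQualityBoost(baseScore: number, content: string): number {
  let score = baseScore;
  const wordCount = content.trim().split(/\s+/).length;
  if (wordCount < 20) score *= 0.5; // likely a ToC entry or bare heading
  if (/\d/.test(content)) score *= 1.1; // concrete data points present
  return score;
}
```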

Step 5: Format as Evidence Context

The context formatter (src/lib/rag/format-context.ts) takes the top-K ranked chunks and produces the EVIDENCE_CONTEXT block injected into the LLM prompt:

  1. Group by document: Chunks from the same source document are grouped together.
  2. Assign citation keys: Each document group gets a sequential key — [S1], [S2], [S3], etc.
  3. Include metadata: Document title, organization, category, and source URL.
  4. Token budget: The total context is capped at approximately 3,000 tokens to leave room for the story generation prompt.
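
Steps 1 and 2 can be sketched as below; the `Chunk` shape and the output layout are illustrative assumptions, not the formatter's actual types.

```typescript
interface Chunk {
  documentTitle: string;
  content: string;
}

// Group chunks by source document (preserving first-seen order) and
// assign sequential citation keys [S1], [S2], ... per document group.
function formatEvidenceContext(chunks: Chunk[]): string {
  const groups = new Map<string, Chunk[]>();
  for (const chunk of chunks) {
    const list = groups.get(chunk.documentTitle) ?? [];
    list.push(chunk);
    groups.set(chunk.documentTitle, list);
  }
  return [...groups.entries()]
    .map(([title, docChunks], i) =>
      `[S${i + 1}] ${title}\n${docChunks.map((c) => c.content).join("\n")}`)
    .join("\n\n");
}
```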

Citation Flow

After the LLM generates a story, citations are extracted and stored for display.

Step 1: LLM Generates with Source Markers

The story generation prompt instructs the LLM to tag factual claims with [SOURCE:S1], [SOURCE:S2], etc. — matching the citation keys from the evidence context. For example:

Cyberbullying affects 1 in 3 young people worldwide [SOURCE:S1].
The 5-4-3-2-1 grounding technique can help manage anxiety [SOURCE:S3].

Step 2: Extract and Map Citations

The citation extractor (src/lib/rag/citations.ts) parses the generated Twee source for [SOURCE:Sn] patterns, then maps each key back to the corresponding RagChunk and RagDocument records. This produces a structured citation object with:

  • Document title and source organization
  • Category (TF-CBT, TF-VAWG, General)
  • Original chunk text (the specific evidence used)
  • Source URL (link to the original document)
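
The marker-parsing step reduces to a regex scan of this shape (the database mapping that follows it is omitted here):

```typescript
// Extract the distinct citation keys (S1, S2, ...) referenced by
// [SOURCE:Sn] markers in the generated Twee source, in order of
// first appearance.
function extractCitationKeys(twee: string): string[] {
  const keys = new Set<string>();
  for (const match of twee.matchAll(/\[SOURCE:(S\d+)\]/g)) {
    keys.add(match[1]);
  }
  return [...keys];
}
```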

Step 3: Store Citations

Each citation is saved as a RagCitation record in the database, linking the course to the specific chunks that informed its content. This creates an auditable trail from generated content back to source evidence.

Database Models

The RAG system uses five Prisma models:

| Model | Purpose | Key Fields |
| --- | --- | --- |
| RagDocument | Source document metadata | title, sourceUrl, category, status, contentHash |
| RagChunk | Individual text segments with embeddings | content, embedding (vector 1536), tokenCount, sectionTitle |
| KnowledgeConcept | Hierarchical concept taxonomy | name, nameKo, category, parentId |
| RagConceptTag | Many-to-many chunk-to-concept links | chunkId, conceptId, confidence |
| RagCitation | Course-to-chunk citation records | courseId, chunkId, citationKey, context |

The embedding column on RagChunk is managed via raw SQL (vector(1536)) because Prisma does not natively support pgvector types. The initializeVectorStore() function in src/lib/rag/vector-store.ts handles creating the extension and column programmatically.

Key Architecture Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Embedding model | OpenAI text-embedding-3-small | Best cost/quality ratio at 1,536 dimensions; consistent with the LLM provider abstraction |
| Vector index | IVFFlat (pgvector) | Good recall at low latency for < 1M vectors; no separate vector DB needed |
| Chunk size | 500–800 tokens, 100-token overlap | Balances context completeness with embedding quality; overlap prevents boundary information loss |
| Search strategy | Hybrid (cosine + keyword + RRF) | Semantic search alone misses exact terms; keyword search alone misses paraphrased content |
| Citation format | [SOURCE:Sn] markers in LLM output | Simple regex extraction; LLMs follow this pattern reliably |
| Storage | PostgreSQL with pgvector | Single database for relational and vector data; no overhead of a separate vector DB |
| Batch size | 50 embeddings per API call | Stays within OpenAI rate limits while maintaining reasonable ingestion speed |