How the RAG Pipeline Works
This page documents the full RAG pipeline — from document ingestion through vector search to citation extraction. Each stage includes the relevant code paths and architectural decisions.
Ingestion Pipeline
The ingestion pipeline converts raw documents into searchable, embedded chunks stored in PostgreSQL with pgvector.
Step 1: Scan Documents
The ingestion entry point (src/lib/rag/ingest.ts) recursively scans the RAG/Data/ directory for .pdf and .docx files. Each file is hashed (SHA-256) to detect changes — if a document’s hash matches an existing RagDocument record, it is skipped.
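The change-detection step can be sketched as follows. This is a minimal illustration, not the actual `ingest.ts` code; `shouldIngest` and the `existingHashes` map are hypothetical names for the lookup against stored `RagDocument.contentHash` values.

```typescript
import { createHash } from "crypto";

// SHA-256 over the file's raw bytes; if this matches the stored
// RagDocument.contentHash, the document is unchanged.
function contentHash(bytes: Buffer): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Hypothetical helper: existingHashes maps file path -> stored hash.
// The caller is assumed to have read the file into a Buffer already.
function shouldIngest(
  filePath: string,
  bytes: Buffer,
  existingHashes: Map<string, string>,
): boolean {
  return existingHashes.get(filePath) !== contentHash(bytes);
}
```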
Step 2: Extract Text
Text extraction (src/lib/rag/extract.ts) uses format-specific libraries:
- PDF: `pdfjs-dist` for text extraction with page boundary detection.
- DOCX: `mammoth` for converting Word documents to plain text while preserving paragraph structure.
Step 3: Detect Category and Language
The directory structure determines the document’s category:
- `RAG/Data/TF-CBT/` → `TF_CBT` category
- `RAG/Data/TF-VAWG/` → `TF_VAWG` category
- `RAG/Data/General/` → `GENERAL` category
Language is detected from filename conventions or defaults to English.
Step 4: Structure-Aware Chunking
The chunker (src/lib/rag/chunk.ts) splits extracted text into segments of 500–800 tokens with 100-token overlap. Unlike naive fixed-size splitting, it is structure-aware:
- Section header detection: Recognizes markdown-style headings, numbered sections, and all-caps headers. Chunk boundaries prefer to align with section breaks.
- Paragraph preservation: Avoids splitting mid-paragraph where possible.
- Overlap: 100-token overlap between consecutive chunks ensures that no information is lost at boundaries.
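The overlap mechanics can be sketched as below. This is a simplified illustration, not the structure-aware chunker itself: tokens are approximated by pre-split words, and the section/paragraph boundary preferences are omitted.

```typescript
// Simplified sliding-window chunker: emit windows of up to `maxTokens`
// tokens, stepping back `overlap` tokens so consecutive chunks share
// content at their boundary. The real chunker additionally prefers to
// cut at section headers and paragraph breaks within the 500-800 range.
function chunkTokens(tokens: string[], maxTokens = 800, overlap = 100): string[][] {
  const chunks: string[][] = [];
  let start = 0;
  while (start < tokens.length) {
    const end = Math.min(start + maxTokens, tokens.length);
    chunks.push(tokens.slice(start, end));
    if (end === tokens.length) break;
    start = end - overlap; // step back so adjacent chunks overlap by 100 tokens
  }
  return chunks;
}
```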
Step 5: Generate Embeddings
The embedder (src/lib/rag/embed.ts) converts each chunk’s text content into a 1,536-dimensional vector using OpenAI’s text-embedding-3-small model. Embeddings are generated in batches of 50 to stay within API rate limits.
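The batching logic amounts to slicing the chunk list into groups of 50 and issuing one embeddings request per group. The sketch below is illustrative: `embedBatch` stands in for the actual call to OpenAI's embeddings endpoint, which `src/lib/rag/embed.ts` wraps.

```typescript
const BATCH_SIZE = 50;

// Split a list into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size = BATCH_SIZE): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// One API call per batch; `embedBatch` is a placeholder for the real
// text-embedding-3-small request, returning one 1536-d vector per input.
async function embedAll(
  texts: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>,
): Promise<number[][]> {
  const out: number[][] = [];
  for (const batch of toBatches(texts)) {
    out.push(...(await embedBatch(batch)));
  }
  return out;
}
```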
Step 6: Store and Index
The vector store (src/lib/rag/vector-store.ts) performs the following:
- Creates the `pgvector` extension if not present.
- Adds the `embedding` column (`vector(1536)`) to the `RagChunk` table via raw SQL (Prisma does not natively support vector types).
- Upserts `RagDocument` and `RagChunk` records.
- Builds an IVFFlat index for fast approximate nearest-neighbor search.
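The setup steps above correspond to raw SQL along these lines. This is a hedged sketch of what `initializeVectorStore()` issues, not the verbatim statements; the index's `lists` parameter and the use of cosine distance are assumptions.

```sql
-- Enable pgvector (no-op if already installed).
CREATE EXTENSION IF NOT EXISTS vector;

-- Add the embedding column Prisma cannot model natively.
ALTER TABLE "RagChunk" ADD COLUMN IF NOT EXISTS embedding vector(1536);

-- Approximate nearest-neighbor index; `lists` is tuned to corpus size.
CREATE INDEX IF NOT EXISTS rag_chunk_embedding_idx
  ON "RagChunk" USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
```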
Query-Time Retrieval
When a teacher requests story generation, the RAG system retrieves relevant evidence to inject into the LLM prompt.
Step 1: Query Construction
The teacher’s topic selection and learning objectives are combined into a search query. For example, a course about “cyberbullying on social media” with objectives about “recognizing warning signs” and “reporting mechanisms” produces a query that captures both the threat type and the desired educational outcomes.
Step 2: Embed the Query
The same embedding model (text-embedding-3-small) converts the query into a 1,536-dimensional vector, ensuring it lives in the same vector space as the document chunks.
Step 3: Hybrid Search
The retrieval module (src/lib/rag/retrieve.ts) runs two parallel searches:
| Search Type | Method | Strengths |
|---|---|---|
| Semantic | Cosine similarity between query embedding and chunk embeddings | Finds conceptually related content even with different wording |
| Keyword | PostgreSQL full-text search on chunk content | Catches exact terms, acronyms, and proper nouns that embeddings may miss |
Reciprocal Rank Fusion (RRF) merges the two ranked result lists into a single ranking. RRF assigns each result a score based on its rank position in each list — score = 1 / (k + rank) — and sums the scores across lists. Results that rank highly in both lists receive the highest combined scores.
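A minimal RRF merge looks like this. The function is a sketch rather than the `retrieve.ts` implementation; the default k = 60 is the constant commonly used with RRF and is an assumption here.

```typescript
// Merge two ranked lists of chunk IDs with Reciprocal Rank Fusion:
// each appearance contributes 1 / (k + rank), and contributions are
// summed before re-sorting by combined score.
function rrfMerge(semantic: string[], keyword: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of [semantic, keyword]) {
    list.forEach((id, i) => {
      const rank = i + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A result that is rank 2 in one list and rank 1 in the other outscores a result that is rank 1 in only one list, which is exactly the "highly in both lists" behavior described above.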
Step 4: Quality Boost
The quality scorer (src/lib/rag/quality.ts) adjusts chunk relevance scores based on content quality signals:
| Signal | Boost | Rationale |
|---|---|---|
| Content density | Higher score for information-rich text | Filters out tables of contents, headers-only chunks |
| Statistics presence | Boost for chunks containing numbers and data | Prioritizes evidence with concrete data points |
| Content type | Category-specific weighting | TF-CBT chunks boosted for therapy-related queries |
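The signals in the table can be combined multiplicatively, roughly as below. The thresholds and weights are hypothetical illustrations; the actual heuristics live in `src/lib/rag/quality.ts`.

```typescript
interface ScoredChunk {
  content: string;
  category: "TF_CBT" | "TF_VAWG" | "GENERAL";
  score: number; // relevance score from hybrid search
}

// Hypothetical quality booster: penalize sparse chunks, reward chunks
// with concrete numbers, and up-weight category matches. All constants
// here are illustrative assumptions.
function qualityBoost(chunk: ScoredChunk, queryCategory: string): number {
  let boost = 1.0;
  const words = chunk.content.split(/\s+/).length;
  if (words < 20) boost *= 0.5;               // content density: demote ToC / bare headers
  if (/\d/.test(chunk.content)) boost *= 1.1; // statistics presence: favor data points
  if (chunk.category === queryCategory) boost *= 1.2; // content type: category match
  return chunk.score * boost;
}
```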
Step 5: Format as Evidence Context
The context formatter (src/lib/rag/format-context.ts) takes the top-K ranked chunks and produces the EVIDENCE_CONTEXT block injected into the LLM prompt:
- Group by document: Chunks from the same source document are grouped together.
- Assign citation keys: Each document group gets a sequential key — `[S1]`, `[S2]`, `[S3]`, etc.
- Include metadata: Document title, organization, category, and source URL.
- Token budget: The total context is capped at approximately 3,000 tokens to leave room for the story generation prompt.
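The grouping, key assignment, and budget cap can be sketched together. This is an illustrative simplification of `format-context.ts`: the `Chunk` shape is reduced to two fields, and tokens are crudely approximated as whitespace-separated words.

```typescript
interface Chunk { docTitle: string; content: string; }

// Group chunks by source document, assign [S1], [S2], ... keys in order,
// and stop adding groups once the token budget would be exceeded.
function formatEvidence(chunks: Chunk[], tokenBudget = 3000): string {
  const byDoc = new Map<string, string[]>();
  for (const c of chunks) {
    if (!byDoc.has(c.docTitle)) byDoc.set(c.docTitle, []);
    byDoc.get(c.docTitle)!.push(c.content);
  }
  let used = 0;
  let key = 1;
  const blocks: string[] = [];
  for (const [title, texts] of byDoc) {
    const block = `[S${key}] ${title}\n${texts.join("\n")}`;
    const tokens = block.split(/\s+/).length; // crude token estimate
    if (used + tokens > tokenBudget) break;   // respect the ~3,000-token cap
    used += tokens;
    blocks.push(block);
    key++;
  }
  return blocks.join("\n\n");
}
```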
Citation Flow
After the LLM generates a story, citations are extracted and stored for display.
Step 1: LLM Generates with Source Markers
The story generation prompt instructs the LLM to tag factual claims with [SOURCE:S1], [SOURCE:S2], etc. — matching the citation keys from the evidence context. For example:
Cyberbullying affects 1 in 3 young people worldwide [SOURCE:S1].
The 5-4-3-2-1 grounding technique can help manage anxiety [SOURCE:S3].

Step 2: Extract and Map Citations
The citation extractor (src/lib/rag/citations.ts) parses the generated Twee source for [SOURCE:Sn] patterns, then maps each key back to the corresponding RagChunk and RagDocument records. This produces a structured citation object with:
- Document title and source organization
- Category (TF-CBT, TF-VAWG, General)
- Original chunk text (the specific evidence used)
- Source URL (link to the original document)
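The parsing pass is essentially a regex scan. The sketch below shows only the key-extraction half (deduplicated in order of first appearance); mapping each key back to its `RagChunk` and `RagDocument` records is a database lookup omitted here.

```typescript
// Pull [SOURCE:Sn] citation keys out of generated Twee source,
// keeping each key once, in order of first appearance.
function extractCitationKeys(twee: string): string[] {
  const keys: string[] = [];
  for (const match of twee.matchAll(/\[SOURCE:(S\d+)\]/g)) {
    if (!keys.includes(match[1])) keys.push(match[1]);
  }
  return keys;
}
```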
Step 3: Store Citations
Each citation is saved as a RagCitation record in the database, linking the course to the specific chunks that informed its content. This creates an auditable trail from generated content back to source evidence.
Database Models
The RAG system uses five Prisma models:
| Model | Purpose | Key Fields |
|---|---|---|
| RagDocument | Source document metadata | title, sourceUrl, category, status, contentHash |
| RagChunk | Individual text segments with embeddings | content, embedding (vector 1536), tokenCount, sectionTitle |
| KnowledgeConcept | Hierarchical concept taxonomy | name, nameKo, category, parentId |
| RagConceptTag | Many-to-many chunk-to-concept links | chunkId, conceptId, confidence |
| RagCitation | Course-to-chunk citation records | courseId, chunkId, citationKey, context |
The embedding column on RagChunk is managed via raw SQL (vector(1536)) because Prisma does not natively support pgvector types. The initializeVectorStore() function in src/lib/rag/vector-store.ts handles creating the extension and column programmatically.
Key Architecture Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Embedding model | OpenAI text-embedding-3-small | Best cost/quality ratio for 1536-d; consistent with LLM provider abstraction |
| Vector index | IVFFlat (pgvector) | Good recall at low latency for < 1M vectors; no separate vector DB needed |
| Chunk size | 500–800 tokens, 100 overlap | Balances context completeness with embedding quality; overlap prevents boundary information loss |
| Search strategy | Hybrid (cosine + keyword + RRF) | Semantic search alone misses exact terms; keyword alone misses paraphrased content |
| Citation format | [SOURCE:Sn] markers in LLM output | Simple regex extraction; LLMs follow this pattern reliably |
| Storage | PostgreSQL with pgvector | Single database for all data (relational + vector); no infrastructure overhead of a separate vector DB |
| Batch size | 50 embeddings per API call | Stays within OpenAI rate limits while maintaining reasonable ingestion speed |