What is RAG?
SisterShield uses Retrieval-Augmented Generation (RAG) to ensure that every AI-generated interactive story is grounded in verified, expert-authored educational content. Instead of relying on the AI model’s training data alone, the system retrieves relevant evidence from a curated knowledge base and injects it into the generation prompt — so every fact can be traced back to its source.
Why RAG Matters
Traditional AI models generate text from patterns learned during training. This means they can produce plausible-sounding content that is factually incorrect — a phenomenon called hallucination. For an educational platform about technology-facilitated violence against women and girls (TF-VAWG), incorrect information is not just unhelpful — it can be harmful.
RAG addresses this by:
| Benefit | How RAG Achieves It |
|---|---|
| Grounded in evidence | Every story draws from verified sources — UN Women reports, TF-CBT clinical guides, child safety frameworks |
| Reduced hallucination | The AI is instructed to base factual claims only on the provided evidence, not on its own training data |
| Traceable citations | Every generated fact is tagged with a source reference ([S1], [S2]) that maps back to a specific document and chunk |
| Extensible without code changes | Adding new knowledge is as simple as uploading documents and re-indexing — no prompt engineering or code changes required |
| Curriculum-aligned | A knowledge graph of 24 concepts organizes the evidence by TF-VAWG risk types, prevention strategies, legal frameworks, and coping skills |
Key Concepts
These are the building blocks of the RAG system, explained for non-technical readers.
Chunks
Large documents (50+ pages of clinical guidelines, policy reports) cannot be fed to an AI model all at once. Instead, the system breaks them into chunks — smaller segments of 500–800 tokens (roughly a long paragraph) that each capture a single idea or section. Chunks overlap slightly (100 tokens) so that no information falls through the cracks.
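The chunking step can be sketched as a sliding window over a tokenized document. This is a minimal illustration, not SisterShield's actual implementation; the exact chunk size within the 500–800 range is an assumption here.

```python
def chunk_text(tokens, chunk_size=700, overlap=100):
    """Split a token list into overlapping chunks.

    chunk_size and overlap mirror the 500-800 token chunks with
    100-token overlap described above; 700 is an assumed midpoint.
    """
    chunks = []
    step = chunk_size - overlap  # advance less than a full chunk, creating overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final window already reached the end of the document
    return chunks
```

Because each window starts `overlap` tokens before the previous one ends, a sentence split by one chunk boundary appears whole in the neighboring chunk.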
Embeddings
An embedding is a list of 1,536 numbers that captures the meaning of a piece of text. Two chunks about the same topic will have similar embeddings, even if they use completely different words. SisterShield uses OpenAI’s text-embedding-3-small model to convert every chunk — and every search query — into embeddings.
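"Similar embeddings" is usually measured with cosine similarity — the cosine of the angle between two vectors. A toy sketch (real embeddings from text-embedding-3-small have 1,536 dimensions; the 3-dimensional vectors here are illustrative only):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score near 1.0 regardless of magnitude;
# orthogonal vectors (unrelated meaning) score near 0.0.
same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
unrelated = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Because cosine similarity ignores vector length, two chunks are "close" when their embeddings point in the same direction in meaning-space, even if one text is much longer than the other.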
Vector Database
A vector database stores embeddings and can quickly find chunks whose meaning is closest to a query. SisterShield uses PostgreSQL with the pgvector extension and an IVFFlat index for fast approximate nearest-neighbor search. This means the system finds chunks by semantic similarity (what they mean), not just keyword matching (what words they contain).
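With pgvector, a semantic lookup is ordinary SQL. The schema below (a `chunks` table with a 1,536-dimension `embedding` column) is a hypothetical sketch, not SisterShield's actual schema; the `<=>` operator and IVFFlat index syntax are standard pgvector.

```python
# Hypothetical schema: chunks(id, content, embedding vector(1536)).
# IVFFlat trades a little recall for much faster approximate search.
CREATE_INDEX_SQL = """
CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
"""

# pgvector's <=> operator is cosine *distance*, so 1 - distance is similarity.
# %(query)s / %(k)s are psycopg-style named parameters.
SEMANTIC_SEARCH_SQL = """
SELECT id, content, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""
```

Ordering by the distance operator (rather than the computed similarity) is what lets PostgreSQL use the IVFFlat index instead of scanning every row.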
Hybrid Search
Neither semantic search nor keyword search is perfect on its own. Hybrid search combines both:
- Semantic search finds chunks with similar meaning (cosine similarity on embeddings).
- Keyword search finds chunks containing exact terms (useful for names, acronyms, legal references).
- Reciprocal Rank Fusion (RRF) merges the two ranked lists into a single result set that captures the best of both.
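RRF itself is a small formula: each result scores 1/(k + rank) in every list it appears in, and the scores are summed. A minimal sketch, assuming the two searches return ranked lists of chunk ids (k = 60 is the constant from the original RRF formulation):

```python
def rrf_merge(semantic_ids, keyword_ids, k=60):
    """Merge two ranked lists with Reciprocal Rank Fusion.

    A chunk ranked highly in either list — or moderately in both —
    rises to the top of the merged result.
    """
    scores = {}
    for ranked in (semantic_ids, keyword_ids):
        for rank, cid in enumerate(ranked, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, a chunk ranked #2 semantically and #1 by keyword beats a chunk ranked #1 in only one list, because appearing in both lists adds two score contributions.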
Citations
Every piece of evidence used in a generated story is tagged with a citation key like [S1] or [S2]. After the AI generates the story, a post-processor extracts these markers and maps them back to the original document, page, and chunk. These citations are stored in the database and can be displayed to teachers reviewing the content.
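The post-processing step can be sketched with a regular expression; `source_map` here is a hypothetical lookup from citation key to document/chunk metadata, standing in for whatever the real database stores.

```python
import re

def extract_citations(story_text, source_map):
    """Find [S1]-style markers in generated text and resolve each one.

    source_map is a hypothetical dict such as
    {"S1": {"doc": "un-women-report.pdf", "page": 12, "chunk": 47}}.
    """
    keys = re.findall(r"\[(S\d+)\]", story_text)
    ordered = []
    for key in keys:
        if key not in ordered:  # keep first-occurrence order, drop repeats
            ordered.append(key)
    return {key: source_map.get(key) for key in ordered}
```

A marker that resolves to `None` would indicate the model cited a source it was never given — a useful red flag for the teacher review step.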
How It All Fits Together
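Under the assumptions above, the whole pipeline — retrieve two ways, fuse, prompt with labeled evidence, generate, extract citations — can be sketched end to end. All names and callables here are illustrative, not SisterShield's actual API.

```python
import re

def generate_story(topic, retrieve_semantic, retrieve_keyword, generate, sources, k=60):
    """End-to-end RAG flow sketch; components are passed in as callables."""
    # 1. Retrieve candidate chunk ids by meaning and by exact terms.
    semantic_hits = retrieve_semantic(topic)
    keyword_hits = retrieve_keyword(topic)
    # 2. Merge the two ranked lists with Reciprocal Rank Fusion.
    scores = {}
    for ranked in (semantic_hits, keyword_hits):
        for rank, cid in enumerate(ranked, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    evidence = sorted(scores, key=scores.get, reverse=True)[:3]
    # 3. Label each evidence chunk [S1], [S2], ... inside the prompt.
    labelled = [f"[S{i + 1}] {sources[cid]}" for i, cid in enumerate(evidence)]
    prompt = "Use only this evidence:\n" + "\n".join(labelled) + f"\nTopic: {topic}"
    # 4. Generate, then map citation markers back to the chunks they label.
    story = generate(prompt)
    citations = {m: evidence[int(m[2:-1]) - 1] for m in re.findall(r"\[S\d+\]", story)}
    return story, citations
```

Each numbered step corresponds to a concept above: hybrid retrieval, RRF fusion, evidence-grounded prompting, and citation post-processing.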
Current Knowledge Base Stats
| Metric | Value |
|---|---|
| Indexed documents | 20 |
| Total chunks | 655 |
| Embedding dimensions | 1,536 |
| Embedding model | OpenAI text-embedding-3-small |
| Knowledge concepts | 24 |
| Document categories | TF-CBT, TF-VAWG, General |
| Supported file types | PDF, DOCX |
Connection to Responsible AI
RAG directly supports SisterShield’s commitment to responsible AI use:
- Transparency: Teachers can inspect exactly which sources informed a generated story through the citation panel.
- Accountability: Every factual claim traces back to a verified source document from a trusted organization (UN Women, UNICEF, WHO, etc.).
- Human oversight: The RAG system provides evidence — it does not make decisions. Teachers review all generated content before it reaches students.
- No black box: The retrieval process is deterministic and inspectable. The system logs which chunks were retrieved, their relevance scores, and how they were formatted into the prompt.