What is RAG?
SisterShield uses Retrieval-Augmented Generation (RAG) to ensure that every AI-generated interactive story is grounded in verified, expert-authored educational content. Instead of relying on the AI model’s training data alone, the system retrieves relevant evidence from a curated knowledge base and injects it into the generation prompt — so every fact can be traced back to its source.
Why RAG Matters
Traditional AI models generate text from patterns learned during training. This means they can produce plausible-sounding content that is factually incorrect — a phenomenon called hallucination. For an educational platform about technology-facilitated violence against women and girls (TF-VAWG), incorrect information is not just unhelpful — it can be harmful.
RAG addresses this by:
| Benefit | How RAG Achieves It |
|---|---|
| Grounded in evidence | Every story draws from verified sources — UN Women reports, TF-CBT clinical guides, child safety frameworks |
| Reduced hallucination | The AI is instructed to base factual claims only on the provided evidence, not on its own training data |
| Traceable citations | Every generated fact is tagged with a source reference ([S1], [S2]) that maps back to a specific document and chunk |
| Extensible without code changes | Adding new knowledge is as simple as uploading documents and re-indexing — no prompt engineering or code changes required |
| Curriculum-aligned | A knowledge graph of 24 concepts organizes the evidence by TF-VAWG risk types, prevention strategies, legal frameworks, and coping skills |
Key Concepts
These are the building blocks of the RAG system, explained for non-technical readers.
Chunks
Large documents (50+ pages of clinical guidelines, policy reports) cannot be fed to an AI model all at once. Instead, the system breaks them into chunks — smaller segments of 500–800 tokens (roughly a long paragraph) that each capture a single idea or section. Chunks overlap slightly (100 tokens) so that no information falls through the cracks.
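The chunking step can be sketched as a sliding window over a tokenized document. This is a minimal illustration, not SisterShield's actual implementation; the exact chunk size within the 500–800 range is an assumption here.

```python
def chunk_text(tokens, chunk_size=700, overlap=100):
    """Split a token list into overlapping chunks.

    chunk_size and overlap mirror the 500-800 token chunks with
    100-token overlap described above; 700 is an assumed midpoint.
    """
    chunks = []
    step = chunk_size - overlap  # advance less than a full chunk, creating overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final window already reached the end of the document
    return chunks
```

Because each window starts `overlap` tokens before the previous one ends, a sentence split by one chunk boundary appears whole in the neighboring chunk.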
Embeddings
An embedding is a list of 1,536 numbers that captures the meaning of a piece of text. Two chunks about the same topic will have similar embeddings, even if they use completely different words. SisterShield uses OpenAI’s text-embedding-3-small model to convert every chunk — and every search query — into embeddings.
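"Similar embeddings" is usually measured with cosine similarity — the cosine of the angle between two vectors. A toy sketch (real embeddings from text-embedding-3-small have 1,536 dimensions; the 3-dimensional vectors here are illustrative only):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score near 1.0 regardless of magnitude;
# orthogonal vectors (unrelated meaning) score near 0.0.
same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
unrelated = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Because cosine similarity ignores vector length, two chunks are "close" when their embeddings point in the same direction in meaning-space, even if one text is much longer than the other.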
Vector Database
A vector database stores embeddings and can quickly find chunks whose meaning is closest to a query. SisterShield uses PostgreSQL with the pgvector extension and an IVFFlat index for fast approximate nearest-neighbor search. This means the system finds chunks by semantic similarity (what they mean), not just keyword matching (what words they contain).
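With pgvector, a semantic lookup is ordinary SQL. The schema below (a `chunks` table with a 1,536-dimension `embedding` column) is a hypothetical sketch, not SisterShield's actual schema; the `<=>` operator and IVFFlat index syntax are standard pgvector.

```python
# Hypothetical schema: chunks(id, content, embedding vector(1536)).
# IVFFlat trades a little recall for much faster approximate search.
CREATE_INDEX_SQL = """
CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
"""

# pgvector's <=> operator is cosine *distance*, so 1 - distance is similarity.
# %(query)s / %(k)s are psycopg-style named parameters.
SEMANTIC_SEARCH_SQL = """
SELECT id, content, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""
```

Ordering by the distance operator (rather than the computed similarity) is what lets PostgreSQL use the IVFFlat index instead of scanning every row.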
Hybrid Search
Neither semantic search nor keyword search is perfect on its own. Hybrid search combines both:
- Semantic search finds chunks with similar meaning (cosine similarity on embeddings).
- Keyword search finds chunks containing exact terms (useful for names, acronyms, legal references).
- Reciprocal Rank Fusion (RRF) merges the two ranked lists into a single result set that captures the best of both.
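RRF itself is a small formula: each result scores 1/(k + rank) in every list it appears in, and the scores are summed. A minimal sketch, assuming the two searches return ranked lists of chunk ids (k = 60 is the constant from the original RRF formulation):

```python
def rrf_merge(semantic_ids, keyword_ids, k=60):
    """Merge two ranked lists with Reciprocal Rank Fusion.

    A chunk ranked highly in either list — or moderately in both —
    rises to the top of the merged result.
    """
    scores = {}
    for ranked in (semantic_ids, keyword_ids):
        for rank, cid in enumerate(ranked, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, a chunk ranked #2 semantically and #1 by keyword beats a chunk ranked #1 in only one list, because appearing in both lists adds two score contributions.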
Citations
Every piece of evidence used in a generated story is tagged with a citation key like [S1] or [S2]. After the AI generates the story, a post-processor extracts these markers and maps them back to the original document, page, and chunk. These citations are stored in the database and can be displayed to teachers reviewing the content.
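The post-processing step can be sketched with a regular expression; `source_map` here is a hypothetical lookup from citation key to document/chunk metadata, standing in for whatever the real database stores.

```python
import re

def extract_citations(story_text, source_map):
    """Find [S1]-style markers in generated text and resolve each one.

    source_map is a hypothetical dict such as
    {"S1": {"doc": "un-women-report.pdf", "page": 12, "chunk": 47}}.
    """
    keys = re.findall(r"\[(S\d+)\]", story_text)
    ordered = []
    for key in keys:
        if key not in ordered:  # keep first-occurrence order, drop repeats
            ordered.append(key)
    return {key: source_map.get(key) for key in ordered}
```

A marker that resolves to `None` would indicate the model cited a source it was never given — a useful red flag for the teacher review step.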
How It All Fits Together
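Under the assumptions above, the whole pipeline — retrieve two ways, fuse, prompt with labeled evidence, generate, extract citations — can be sketched end to end. All names and callables here are illustrative, not SisterShield's actual API.

```python
import re

def generate_story(topic, retrieve_semantic, retrieve_keyword, generate, sources, k=60):
    """End-to-end RAG flow sketch; components are passed in as callables."""
    # 1. Retrieve candidate chunk ids by meaning and by exact terms.
    semantic_hits = retrieve_semantic(topic)
    keyword_hits = retrieve_keyword(topic)
    # 2. Merge the two ranked lists with Reciprocal Rank Fusion.
    scores = {}
    for ranked in (semantic_hits, keyword_hits):
        for rank, cid in enumerate(ranked, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    evidence = sorted(scores, key=scores.get, reverse=True)[:3]
    # 3. Label each evidence chunk [S1], [S2], ... inside the prompt.
    labelled = [f"[S{i + 1}] {sources[cid]}" for i, cid in enumerate(evidence)]
    prompt = "Use only this evidence:\n" + "\n".join(labelled) + f"\nTopic: {topic}"
    # 4. Generate, then map citation markers back to the chunks they label.
    story = generate(prompt)
    citations = {m: evidence[int(m[2:-1]) - 1] for m in re.findall(r"\[S\d+\]", story)}
    return story, citations
```

Each numbered step corresponds to a concept above: hybrid retrieval, RRF fusion, evidence-grounded prompting, and citation post-processing.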
Current Knowledge Base Stats
| Metric | Value |
|---|---|
| Indexed documents | 20 |
| Total chunks | 655 |
| Embedding dimensions | 1,536 |
| Embedding model | OpenAI text-embedding-3-small |
| Knowledge concepts | 24 |
| Document categories | TF-CBT, TF-VAWG, General |
| Supported file types | PDF, DOCX |
Connection to Responsible AI
RAG directly supports SisterShield’s commitment to responsible AI use:
- Transparency: Teachers can inspect exactly which sources informed a generated story through the citation panel.
- Accountability: Every factual claim traces back to a verified source document from a trusted organization (UN Women, UNICEF, WHO, etc.).
- Human oversight: The RAG system provides evidence — it does not make decisions. Teachers review all generated content before it reaches students.
- No black box: The retrieval process is deterministic and inspectable. The system logs which chunks were retrieved, their relevance scores, and how they were formatted into the prompt.