Retrieval-Augmented Generation (RAG) for Chatbots
1. Identifying RAG Architecture
What is RAG? Retrieval-Augmented Generation (RAG) is an AI architecture that combines two components:
- Retrieval – fetching relevant external data such as private documents, knowledge bases, APIs, or databases.
- Generation – passing that data into a Large Language Model (LLM) to generate context-aware answers.
Why? LLMs (such as GPT, Claude, or LLaMA) are not trained on private or sensitive data (for example, company policies, internal manuals, or medical records). RAG lets the LLM draw on current, domain-specific knowledge without retraining.
How RAG is Made
Step-by-step process:
1. Data Ingestion – Load documents, PDFs, websites, databases, or APIs.
2. Data Chunking and Embedding – Split text into manageable chunks and convert them into vector embeddings.
3. Store in Vector Database – Save embeddings in a vector database such as Pinecone, Weaviate, FAISS, or Milvus.
4. Query Handling – When a user asks a question:
   - Convert the query into an embedding.
   - Retrieve the most relevant chunks from the vector database.
5. Augment LLM Input – Add the retrieved chunks as context to the user query.
6. LLM Generates Answer – Produce a context-aware, accurate, and grounded response.
Diagram: What Happens When a User Queries
User Query → Embed Query → Retrieve Relevant Chunks → Augment Prompt → LLM → Answer
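The same flow can be sketched in a few lines of Python. The example below is deliberately minimal and self-contained: it uses a toy bag-of-words "embedding" and an in-memory list in place of a real embedding model and vector database, and it stops at building the augmented prompt rather than calling an actual LLM.

```python
import numpy as np

# Toy corpus standing in for chunked documents stored in a vector database.
CHUNKS = [
    "Employees accrue 20 days of paid vacation per year.",
    "Remote work requests must be approved by a manager.",
    "The support hotline is available 24/7 on weekdays.",
]

VOCAB = sorted({w for c in CHUNKS for w in c.lower().split()})

def embed(text: str) -> np.ndarray:
    """Toy bag-of-words embedding; a real system would call an embedding model."""
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query (cosine similarity)."""
    q = embed(query)
    def cosine(a, b):
        denom = (np.linalg.norm(a) * np.linalg.norm(b)) or 1.0
        return float(a @ b) / denom
    ranked = sorted(CHUNKS, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Augment the user query with the retrieved chunks as context."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# In a real system, this prompt would be sent to an LLM to generate the answer.
print(build_prompt("How many vacation days do employees get?"))
```

In production, `embed` would call an embedding model and `retrieve` would query a vector database such as the ones described below.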
AI Integration and Frameworks
- Amazon Bedrock
  - Fully managed service for building RAG applications.
  - Supports foundation models without needing to manage infrastructure.
- Knowledge Bases
  - A managed vector store inside Bedrock.
  - Can connect to Amazon OpenSearch, Pinecone, Weaviate, or self-managed databases.
- Bedrock Agents
  - Orchestrate the RAG workflow.
  - Automate retrieval, reasoning, and response.
  - Can integrate with enterprise APIs.
- Frameworks for Building RAG Systems
  - LlamaIndex (formerly GPT Index): Specialized for data connectors and indexing, useful for managing structured and unstructured data pipelines.
  - LangChain: Popular framework for chaining LLM workflows, with modules for retrieval, prompt templates, and integration with vector databases (see the sketch after this list).
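As a rough illustration of how one of these frameworks is typically wired together, the sketch below follows the classic LangChain pattern of split → embed → store → retrieve. Treat it as a sketch only: module paths and class names move between LangChain releases, it assumes an OpenAI API key and the faiss package are available, and the file name is hypothetical.

```python
# Sketch of a typical LangChain RAG pipeline; exact imports vary by LangChain version.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings   # assumes an OpenAI API key is configured
from langchain.vectorstores import FAISS            # assumes faiss-cpu is installed

raw_text = open("company_policy.txt").read()         # hypothetical source document

# 1. Chunk the document (character counts here, not tokens).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_text(raw_text)

# 2. Embed the chunks and store them in an in-memory FAISS index.
store = FAISS.from_texts(chunks, OpenAIEmbeddings())

# 3. At query time, retrieve the chunks most relevant to the user question.
question = "What is the remote work policy?"
relevant = store.similarity_search(question, k=3)

# 4. Augment the LLM prompt with the retrieved context.
context = "\n\n".join(doc.page_content for doc in relevant)
prompt = f"Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: {question}"
# The prompt is then sent to the LLM of your choice (Bedrock, OpenAI, etc.).
```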
2. Chunking Strategies
RAG depends heavily on how you chunk the data. If chunks are too large, retrieval becomes noisy. If chunks are too small, context is lost.
Preparing the Data
- Data Ingestion
  - Sources include PDFs, websites, databases, APIs, and CSVs.
  - Convert everything into text format.
- Prepare the Data
  - Clean the data (remove headers, footers, ads).
  - Normalize text (lowercasing, removing special characters).
- Store the Data
  - Generate embeddings using an embedding model such as OpenAI text-embedding-ada-002, Cohere, or HuggingFace models.
  - Store the embeddings in a vector database.
- Sanitizing: remove PII, redundant content, or noisy data.
- Chunking: break large text into smaller pieces for efficient retrieval.
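A minimal, illustrative cleaning pass might look like the sketch below. The regular expressions are simplified stand-ins for real PII detection, which usually relies on dedicated libraries or services.

```python
import re

def sanitize(text: str) -> str:
    """Basic cleanup before chunking: normalize whitespace and redact obvious PII (illustrative only)."""
    # Normalize: collapse whitespace and lowercase.
    text = re.sub(r"\s+", " ", text).strip().lower()
    # Redact e-mail addresses and phone-number-like patterns.
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"\+?\d[\d\s().-]{7,}\d", "[PHONE]", text)
    return text

print(sanitize("Contact John at john.doe@example.com  or +1 (555) 123-4567."))
# -> "contact john at [EMAIL] or [PHONE]."
```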
Ways to Chunk Data
- Sentence-based – Each sentence is a chunk (useful for FAQs or Q&A).
- Paragraph-based – Chunk by paragraphs (good for structured documents); a simple sketch of sentence- and paragraph-based splitting follows this list.
- Page-based – Each page is a chunk (common for scanned documents and contracts).
- Section-based – Based on logical document structure such as headings or chapters.
- Semantic-based – NLP-driven splitting at points where meaning shifts (best for contextual RAG).
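The first two strategies are simple enough to sketch with plain regular expressions (a deliberately naive illustration; production pipelines normally rely on a framework's text splitters):

```python
import re

def sentence_chunks(text: str) -> list[str]:
    """Naive sentence-based chunking: split after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def paragraph_chunks(text: str) -> list[str]:
    """Paragraph-based chunking: split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "RAG retrieves context. It augments prompts.\n\nChunking controls retrieval quality."
print(sentence_chunks(doc))   # ['RAG retrieves context.', 'It augments prompts.', 'Chunking controls retrieval quality.']
print(paragraph_chunks(doc))  # ['RAG retrieves context. It augments prompts.', 'Chunking controls retrieval quality.']
```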
Chunking Strategies
- Chunk Size
  - Typical size: 200–500 tokens per chunk.
  - Too large makes retrieval difficult; too small fragments context.
- Chunk Overlap
  - Overlap chunks by 10–20 percent to preserve continuity between them.
  - Example: chunk size = 300 tokens, overlap = 50 tokens (see the sketch after this list).
- Splitting Techniques
  - Rule-based splitting: by sentence, paragraph, or delimiter.
  - Recursive splitting: start with large chunks and recursively split until within the size limit.
  - Semantic splitting: use embeddings to detect shifts in meaning.
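A fixed-size chunker with overlap is easy to write directly. The sketch below approximates tokens with whitespace-separated words; a real pipeline would count tokens with the model's tokenizer.

```python
def chunk_with_overlap(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap, using words as a rough proxy for tokens."""
    words = text.split()
    step = chunk_size - overlap              # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # last window reached the end of the text
            break
    return chunks

chunks = chunk_with_overlap("word " * 700, chunk_size=300, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])  # 3 chunks of 300, 300, and 200 words
```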
Best Practices for Chunking in RAG
- Use semantic chunking when possible.
- Keep chunks well within the LLM’s context window; 300–500 tokens per chunk is a common target.
- Maintain overlap for smooth flow of context.
- Always test retrieval quality with sample queries (a small evaluation sketch follows).
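One lightweight way to act on the last point is to keep a small set of test queries whose expected source text is known, and check whether that text shows up in the top-k retrieved chunks (a simple hit rate, sometimes called recall@k). The sketch below assumes a `retrieve(query, k)` function like the toy one shown earlier; the test cases are hypothetical.

```python
# Hypothetical evaluation set: each query is paired with a snippet expected in the retrieved context.
TEST_CASES = [
    ("How many vacation days do employees get?", "20 days of paid vacation"),
    ("Who approves remote work?", "approved by a manager"),
]

def hit_rate_at_k(retrieve, k: int = 3) -> float:
    """Fraction of test queries whose expected snippet appears in the top-k retrieved chunks."""
    hits = 0
    for query, expected_snippet in TEST_CASES:
        top_chunks = retrieve(query, k)
        if any(expected_snippet in chunk for chunk in top_chunks):
            hits += 1
    return hits / len(TEST_CASES)

# Example: print(hit_rate_at_k(retrieve))  # using the toy retrieve() defined earlier
```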

