# RAG (Retrieval-Augmented Generation)

Question-answering system using vector embeddings and hybrid search.

## Glossary

| Term | Description |
|---|---|
| **RAG** | Retrieval-Augmented Generation. Technique that enhances LLM responses by first retrieving relevant documents from a knowledge base, then using them as context for answer generation. Reduces hallucinations and grounds answers in actual content. |
| **Embedding** | A vector (list of numbers) representing text in a high-dimensional space. Similar texts have similar vectors, enabling semantic search. Our embeddings have 1024 dimensions. |
| **Vector Search / kNN** | k-Nearest Neighbors search. Finds documents whose embedding vectors are closest to the query vector using cosine similarity. Captures semantic meaning beyond keyword matching. |
| **BM25** | Best Match 25. Traditional text search algorithm that ranks documents by term frequency and inverse document frequency. Good for exact keyword matches. |
| **Hybrid Search** | Combines vector search (semantic) with BM25 (keyword) for better results. Vector search finds conceptually similar content; BM25 finds exact term matches. |
| **Chunk** | A segment of text from a larger document. Long documents are split into overlapping chunks (default: 1000 chars with 200 char overlap) so each chunk fits in the embedding model's context and retrieval is more precise (see the sketch after this table). |
| **LLM** | Large Language Model. AI model trained on text that can generate human-like responses. Used here to formulate answers based on retrieved context (Ollama/Mistral). |
| **Ollama** | Local LLM runtime for running open-source models. Used for development with models like `llama3.1:8b` (chat) and `mxbai-embed-large` (embeddings). |
| **Cosine Similarity** | Measure of similarity between two vectors (-1 to 1). A value of 1 means identical direction; 0 means orthogonal. Used to find semantically similar text chunks. |
| **Dense Vector** | Elasticsearch field type for storing embedding vectors. Enables efficient approximate nearest neighbor (ANN) search at scale. |
| **Context Window** | Maximum amount of text an LLM can process at once. Retrieved chunks are concatenated as context for the LLM, limited by `RAG_MAX_CONTEXT_LENGTH`. |
| **Top-K** | Number of most relevant results to return from search. Higher values provide more context but may include less relevant content. |
| **num_candidates** | Elasticsearch kNN parameter controlling how many candidates to consider before selecting the top-K. Higher values improve accuracy but slow down search. |
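To make the **Chunk** entry concrete, here is a minimal character-based splitter using the defaults `RAG_CHUNK_SIZE=1000` and `RAG_CHUNK_OVERLAP=200`. This is an illustration only, not the worker's actual splitter, which may respect word or sentence boundaries:

```javascript
// Illustrative only: a naive character-based splitter with overlap.
// The real splitter in the worker may align chunks to word/sentence boundaries.
function splitIntoChunks(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  const step = chunkSize - overlap; // each chunk starts 800 chars after the previous one
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

// A 2600-char document yields three chunks starting at offsets 0, 800, and 1600;
// each consecutive pair shares 200 chars, so no sentence is cut without context.
```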
## Architecture

```text
+------------------+     +------------------+     +------------------+
|    Plone CMS     |     |     Redis/RQ     |     |  Elasticsearch   |
|                  |     |      Worker      |     |                  |
|  - REST API      |---->|  - Embedding     |---->|  - Chunks Index  |
|  - Subscribers   |     |    Generation    |     |  - kNN + BM25    |
+------------------+     |  - RAG Ask       |     +------------------+
                         +------------------+
                                  |
                                  v
                         +------------------+
                         |  Ollama/Mistral  |
                         |  - Embeddings    |
                         |  - LLM Chat      |
                         +------------------+
```

## Components

### REST API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `@rag-ask` | POST | Ask a question (async by default; sync with `"sync": true` for logged-in users) |
| `@rag-ask` | GET | Poll for an async result by `job_id` |
| `@rag-status` | GET | RAG system status and statistics |
| `@rag-index` | POST | Manually trigger embedding for a single content item |
| `@rag-index-all` | POST | Reindex all RAG-enabled content |

### Modules

- **config.py** - Environment-based configuration and registry access
- **client.py** - OpenAI-compatible API client for embeddings and LLM
- **chunks.py** - Elasticsearch chunks index management
- **index.py** - Content embedding and hybrid search interface
- **tasks.py** - Redis/RQ background tasks for async embedding and question-answering
- **subscribers.py** - Zope event handlers for automatic indexing

## Configuration

### Environment Variables

```bash
# Feature flag
RAG_ENABLED=true

# Embedding API (Ollama local / Mistral production)
EMBEDDING_BASE_URL=http://localhost:11434/v1
EMBEDDING_API_KEY=ollama
EMBEDDING_MODEL=mxbai-embed-large
EMBEDDING_DIMENSIONS=1024

# LLM API (Ollama local / Mistral production)
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_KEY=ollama
LLM_MODEL=llama3.1:8b

# Chunking
RAG_CHUNK_SIZE=1000
RAG_CHUNK_OVERLAP=200

# Search
RAG_TOP_K=5
RAG_NUM_CANDIDATES=100
```

### Plone Registry

Content types and LLM behavior are configured via the Plone registry:

| Record | Description |
|---|---|
| `wcs.backend.rag_content_types` | Content types to index (default: ContentPage, News, File, Contact, Book, Chapter) |
| `wcs.backend.rag_system_prompt` | System prompt for the LLM |
| `wcs.backend.rag_no_answer_message` | Message returned when no relevant context is found |
| `wcs.backend.rag_error_message` | Message returned when answer generation fails |
| `wcs.backend.rag_llm_temperature` | LLM temperature (0.0 = deterministic, 1.0 = creative) |
| `wcs.backend.rag_llm_max_tokens` | Maximum tokens in the LLM response |
| `wcs.backend.rag_boost_bm25` | BM25 text search boost factor |
| `wcs.backend.rag_boost_knn` | kNN vector search boost factor |
| `wcs.backend.rag_title_boost_factor` | Additional boost for title matches |
| `wcs.backend.rag_score_high_threshold` | Minimum score for high-confidence results |
| `wcs.backend.rag_score_medium_threshold` | Minimum score for medium-confidence results |

## Elasticsearch Index

A separate index `{plone-index}-rag-chunks` stores the chunks, with the following mapping:

```json
{
  "properties": {
    "chunk_id": {"type": "keyword"},
    "parent_uid": {"type": "keyword"},
    "parent_title": {"type": "text"},
    "parent_path": {"type": "keyword"},
    "portal_type": {"type": "keyword"},
    "allowedRolesAndUsers": {"type": "keyword"},
    "chunk_index": {"type": "integer"},
    "chunk_text": {"type": "text"},
    "embedding": {
      "type": "dense_vector",
      "dims": 1024,
      "index": true,
      "similarity": "cosine"
    }
  }
}
```
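The mapping above, together with the boost and `num_candidates` settings, shapes the hybrid query. The sketch below shows one plausible form of such a query, combining a top-level `knn` clause with a boosted BM25 `match`; it is not the exact query built by `index.py`, and the concrete boost values and the security filter on `allowedRolesAndUsers` are assumptions:

```javascript
// Illustrative shape of a hybrid search body; NOT the exact query index.py builds.
// Boost values and the security filter are assumptions for the sketch.
const question = 'What are the opening hours?';
const queryVector = [/* 1024-dim embedding of the question */];
const userRolesAndUsers = ['Anonymous']; // assumed security filter values

const hybridQuery = {
  knn: {
    field: 'embedding',
    query_vector: queryVector,
    k: 5,                 // RAG_TOP_K
    num_candidates: 100,  // RAG_NUM_CANDIDATES
    boost: 1.0,           // wcs.backend.rag_boost_knn (assumed value)
    filter: { terms: { allowedRolesAndUsers: userRolesAndUsers } }
  },
  query: {
    bool: {
      must: [{ match: { chunk_text: { query: question, boost: 1.0 } } }],     // rag_boost_bm25 (assumed value)
      should: [{ match: { parent_title: { query: question, boost: 1.5 } } }], // rag_title_boost_factor (assumed value)
      filter: [{ terms: { allowedRolesAndUsers: userRolesAndUsers } }]
    }
  },
  size: 5
};
```

Elasticsearch adds the kNN and BM25 scores for documents matched by both clauses, so `rag_boost_knn` and `rag_boost_bm25` control the balance between semantic and keyword relevance.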
## Data Flow

### Indexing (Async via Worker)

1. Content created/modified triggers `on_content_added`/`on_content_modified`
2. `queue_embedding_task` queues a job with the ES settings and index name
3. Worker fetches the content via `fetch_data` (REST API)
4. Text is split into overlapping chunks
5. Embeddings are generated via the OpenAI-compatible API
6. Chunks are indexed to Elasticsearch

### Question-Answering (Async via Worker)

1. `@rag-ask` (POST) receives the question from the client
2. A job ID is generated from the question hash + user security context + optional path
3. If a cached result exists, it is returned immediately
4. Otherwise, `rag_ask_task` is queued to the Redis worker
5. A pending status with a `job_id` for polling is returned
6. The worker performs the hybrid search and generates the LLM answer
7. The result is cached in Redis (TTL: 5 minutes)
8. The client polls `@rag-ask` (GET) with the `job_id` until completed

## Local Development

Start Ollama:

```bash
ollama serve
ollama pull mxbai-embed-large
ollama pull llama3.1:8b
```

Enable RAG:

```bash
export RAG_ENABLED=true
```

## REST API

### Ask a Question (Async)

The `@rag-ask` endpoint processes questions asynchronously by default, allowing the frontend to poll for results without blocking. Results are cached based on the question, user permissions, and optional path filter. (A reusable polling helper is sketched at the end of this section.)

```javascript
// Submit question (async)
const response = await fetch('/Plone/@rag-ask', {
  method: 'POST',
  headers: {
    'Accept': 'application/json',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    question: 'What are the opening hours?',
    path: '/plone/section'  // optional: filter to specific section
  })
});
const data = await response.json();
// data.status === 'pending', data.job_id === 'ragask_abc123'
```

**Pending response:**

```json
{
  "@id": "http://localhost:8080/Plone/@rag-ask",
  "status": "pending",
  "job_id": "ragask_abc123",
  "question": "What are the opening hours?"
}
```

**Poll for result:**

```javascript
// Poll until completed
const pollResponse = await fetch('/Plone/@rag-ask?job_id=ragask_abc123', {
  headers: { 'Accept': 'application/json' }
});
const result = await pollResponse.json();
```

**Completed response:**

```json
{
  "@id": "http://localhost:8080/Plone/@rag-ask",
  "status": "completed",
  "question": "What are the opening hours?",
  "answer": "Based on the information...",
  "sources": [
    {
      "title": "Contact",
      "path": "/contact",
      "portal_type": "Contact",
      "score": 0.92,
      "chunk_index": 0,
      "snippet": "Opening hours: Monday to Friday..."
    }
  ]
}
```

### Ask a Question (Sync)

Logged-in users can request synchronous processing by adding `"sync": true`. This bypasses the queue and returns the answer directly.

```javascript
const response = await fetch('/Plone/@rag-ask', {
  method: 'POST',
  headers: {
    'Accept': 'application/json',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    question: 'What are the opening hours?',
    sync: true
  })
});
const data = await response.json();
```

**Sync response:**

```json
{
  "@id": "http://localhost:8080/Plone/@rag-ask",
  "question": "What are the opening hours?",
  "answer": "Based on the information...",
  "sources": [
    {"title": "Contact", "path": "/contact", "score": 0.92, "chunk_index": 0}
  ]
}
```

### Path Filtering

Use the `path` parameter to restrict search results to a specific section of the site:

```javascript
const response = await fetch('/Plone/@rag-ask', {
  method: 'POST',
  headers: {
    'Accept': 'application/json',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    question: 'How do I configure this?',
    path: '/plone/documentation/admin-guide'
  })
});
```

This filters results to only include content under the specified path prefix.
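The GET example above shows a single poll; in practice the frontend repeats the request until the job leaves the pending state. A minimal polling helper might look like this (the interval and retry limit are arbitrary choices, not values mandated by the API):

```javascript
// Poll @rag-ask until the job leaves 'pending' or we give up.
// intervalMs and maxAttempts are illustrative defaults, not API requirements.
async function waitForAnswer(jobId, { intervalMs = 1000, maxAttempts = 30 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetch(`/Plone/@rag-ask?job_id=${jobId}`, {
      headers: { 'Accept': 'application/json' }
    });
    const result = await response.json();
    if (result.status !== 'pending') {
      return result; // e.g. 'completed'; any non-pending status is returned as-is
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`RAG job ${jobId} did not complete in time`);
}

// Usage: const result = await waitForAnswer(data.job_id);
```

Since results are cached in Redis with a 5-minute TTL, polling comfortably finishes before a completed result expires.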
### Reindex All Content

```javascript
const response = await fetch('/Plone/@rag-index-all', {
  method: 'POST',
  headers: {
    'Accept': 'application/json',
    'Content-Type': 'application/json'
  }
});
const data = await response.json();
```

**Response:**

```json
{
  "@id": "http://localhost:8080/Plone/@rag-index-all",
  "status": "queued",
  "queued_count": 150,
  "content_types": ["ContentPage", "News", "File", "Contact", "Book", "Chapter"]
}
```

### Index Single Content

```javascript
const response = await fetch('/Plone/@rag-index', {
  method: 'POST',
  headers: {
    'Accept': 'application/json',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    uid: 'content-uid-here',
    async: true  // default: true
  })
});
const data = await response.json();
```

**Response:**

```json
{
  "@id": "http://localhost:8080/Plone/@rag-index",
  "status": "queued",
  "uid": "content-uid-here"
}
```

### Check Status

```javascript
const response = await fetch('/Plone/@rag-status', {
  headers: { 'Accept': 'application/json' }
});
const data = await response.json();
```

**Response:**

```json
{
  "@id": "http://localhost:8080/Plone/@rag-status",
  "enabled": true,
  "index_name": "plone-rag-chunks",
  "exists": true,
  "chunk_count": 1250,
  "parent_count": 150,
  "dimensions": 1024
}
```
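Because `@rag-index-all` only queues jobs, one way to watch a full reindex is to poll `@rag-status` and compare `parent_count` against the `queued_count` from the reindex response. The completion heuristic below is an assumption for illustration, not a guarantee made by the API:

```javascript
// Rough progress monitor after @rag-index-all: poll @rag-status until the
// number of indexed parent documents reaches the number of queued items.
// The comparison is a heuristic, not an API guarantee.
async function watchReindex(queuedCount, intervalMs = 5000) {
  for (;;) {
    const response = await fetch('/Plone/@rag-status', {
      headers: { 'Accept': 'application/json' }
    });
    const status = await response.json();
    console.log(`${status.parent_count}/${queuedCount} items indexed`);
    if (status.parent_count >= queuedCount) return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage: await watchReindex(data.queued_count);
```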