# RAG (Retrieval-Augmented Generation)

Question-answering system using vector embeddings and hybrid search.

## Glossary

| Term | Description |
|---|---|
| **RAG** | Retrieval-Augmented Generation. Technique that enhances LLM responses by first retrieving relevant documents from a knowledge base, then using them as context for answer generation. Reduces hallucinations and grounds answers in actual content. |
| **Embedding** | A vector (list of numbers) representing text in a high-dimensional space. Similar texts have similar vectors, enabling semantic search. Our embeddings have 1024 dimensions. |
| **Vector Search / kNN** | k-Nearest Neighbors search. Finds documents whose embedding vectors are closest to the query vector using cosine similarity. Captures semantic meaning beyond keyword matching. |
| **BM25** | Best Match 25. Traditional text search algorithm that ranks documents by term frequency and inverse document frequency. Good for exact keyword matches. |
| **Hybrid Search** | Combines vector search (semantic) with BM25 (keyword) for better results. Vector search finds conceptually similar content; BM25 finds exact term matches. |
| **Chunk** | A segment of text from a larger document. Long documents are split into overlapping chunks (default: 1000 chars with 200 char overlap) so each chunk fits in the embedding model's context and retrieval is more precise (see the sketch after this table). |
| **LLM** | Large Language Model. AI model trained on text that can generate human-like responses. Used here to formulate answers based on retrieved context (Ollama/Mistral). |
| **Ollama** | Local LLM runtime for running open-source models. Used for development with models like `llama3.1:8b` (chat) and `mxbai-embed-large` (embeddings). |
| **Cosine Similarity** | Measure of similarity between two vectors (-1 to 1). A value of 1 means identical direction; 0 means orthogonal. Used to find semantically similar text chunks. |
| **Dense Vector** | Elasticsearch field type for storing embedding vectors. Enables efficient approximate nearest neighbor (ANN) search at scale. |
| **Context Window** | Maximum amount of text an LLM can process at once. Retrieved chunks are concatenated as context for the LLM, limited by `RAG_MAX_CONTEXT_LENGTH`. |
| **Top-K** | Number of most relevant results to return from search. Higher values provide more context but may include less relevant content. |
| **num_candidates** | Elasticsearch kNN parameter controlling how many candidates to consider before selecting the top-K. Higher values improve accuracy but slow down search. |
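To make the **Chunk** entry concrete, here is a minimal character-based splitter using the defaults `RAG_CHUNK_SIZE=1000` and `RAG_CHUNK_OVERLAP=200`. This is an illustration only, not the worker's actual splitter, which may respect word or sentence boundaries:

```javascript
// Illustrative only: a naive character-based splitter with overlap.
// The real splitter in the worker may align chunks to word/sentence boundaries.
function splitIntoChunks(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  const step = chunkSize - overlap; // each chunk starts 800 chars after the previous one
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

// A 2600-char document yields three chunks starting at offsets 0, 800, and 1600;
// each consecutive pair shares 200 chars, so no sentence is cut without context.
```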
## Architecture

```text
+------------------+     +------------------+     +------------------+
|    Plone CMS     |     |     Redis/RQ     |     |  Elasticsearch   |
|                  |     |      Worker      |     |                  |
|  - REST API      |---->|  - Embedding     |---->|  - Chunks Index  |
|  - Subscribers   |     |    Generation    |     |  - kNN + BM25    |
+------------------+     |  - RAG Ask       |     +------------------+
                         +------------------+
                                  |
                                  v
                         +------------------+
                         |  Ollama/Mistral  |
                         |  - Embeddings    |
                         |  - LLM Chat      |
                         +------------------+
```

## Components

### REST API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `@rag-ask` | POST | Ask a question (async by default; sync with `"sync": true` for logged-in users) |
| `@rag-ask` | GET | Poll for an async result by `job_id` |
| `@rag-status` | GET | RAG system status and statistics |
| `@rag-index` | POST | Manually trigger embedding for a single content item |
| `@rag-index-all` | POST | Reindex all RAG-enabled content |

### Modules

- **config.py** - Environment-based configuration and registry access
- **client.py** - OpenAI-compatible API client for embeddings and LLM
- **chunks.py** - Elasticsearch chunks index management
- **index.py** - Content embedding and hybrid search interface
- **tasks.py** - Redis/RQ background tasks for async embedding and question-answering
- **subscribers.py** - Zope event handlers for automatic indexing

## Configuration

### Environment Variables

```bash
# Feature flag
RAG_ENABLED=true

# Embedding API (Ollama local / Mistral production)
EMBEDDING_BASE_URL=http://localhost:11434/v1
EMBEDDING_API_KEY=ollama
EMBEDDING_MODEL=mxbai-embed-large
EMBEDDING_DIMENSIONS=1024

# LLM API (Ollama local / Mistral production)
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_KEY=ollama
LLM_MODEL=llama3.1:8b

# Chunking
RAG_CHUNK_SIZE=1000
RAG_CHUNK_OVERLAP=200

# Search
RAG_TOP_K=5
RAG_NUM_CANDIDATES=100
```

### Plone Registry

Content types and LLM behavior are configured via the Plone registry:

| Record | Description |
|---|---|
| `wcs.backend.rag_content_types` | Content types to index (default: ContentPage, News, File, Contact, Book, Chapter) |
| `wcs.backend.rag_system_prompt` | System prompt for the LLM |
| `wcs.backend.rag_no_answer_message` | Message returned when no relevant context is found |
| `wcs.backend.rag_error_message` | Message returned when answer generation fails |
| `wcs.backend.rag_llm_temperature` | LLM temperature (0.0 = deterministic, 1.0 = creative) |
| `wcs.backend.rag_llm_max_tokens` | Maximum tokens in the LLM response |
| `wcs.backend.rag_boost_bm25` | BM25 text search boost factor |
| `wcs.backend.rag_boost_knn` | kNN vector search boost factor |
| `wcs.backend.rag_title_boost_factor` | Additional boost for title matches |
| `wcs.backend.rag_score_high_threshold` | Minimum score for high-confidence results |
| `wcs.backend.rag_score_medium_threshold` | Minimum score for medium-confidence results |

## Elasticsearch Index

A separate index `{plone-index}-rag-chunks` stores the chunks, with the following mapping:

```json
{
  "properties": {
    "chunk_id": {"type": "keyword"},
    "parent_uid": {"type": "keyword"},
    "parent_title": {"type": "text"},
    "parent_path": {"type": "keyword"},
    "portal_type": {"type": "keyword"},
    "allowedRolesAndUsers": {"type": "keyword"},
    "chunk_index": {"type": "integer"},
    "chunk_text": {"type": "text"},
    "embedding": {
      "type": "dense_vector",
      "dims": 1024,
      "index": true,
      "similarity": "cosine"
    }
  }
}
```
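The mapping above, together with the boost and `num_candidates` settings, shapes the hybrid query. The sketch below shows one plausible form of such a query, combining a top-level `knn` clause with a boosted BM25 `match`; it is not the exact query built by `index.py`, and the concrete boost values and the security filter on `allowedRolesAndUsers` are assumptions:

```javascript
// Illustrative shape of a hybrid search body; NOT the exact query index.py builds.
// Boost values and the security filter are assumptions for the sketch.
const question = 'What are the opening hours?';
const queryVector = [/* 1024-dim embedding of the question */];
const userRolesAndUsers = ['Anonymous']; // assumed security filter values

const hybridQuery = {
  knn: {
    field: 'embedding',
    query_vector: queryVector,
    k: 5,                 // RAG_TOP_K
    num_candidates: 100,  // RAG_NUM_CANDIDATES
    boost: 1.0,           // wcs.backend.rag_boost_knn (assumed value)
    filter: { terms: { allowedRolesAndUsers: userRolesAndUsers } }
  },
  query: {
    bool: {
      must: [{ match: { chunk_text: { query: question, boost: 1.0 } } }],     // rag_boost_bm25 (assumed value)
      should: [{ match: { parent_title: { query: question, boost: 1.5 } } }], // rag_title_boost_factor (assumed value)
      filter: [{ terms: { allowedRolesAndUsers: userRolesAndUsers } }]
    }
  },
  size: 5
};
```

Elasticsearch adds the kNN and BM25 scores for documents matched by both clauses, so `rag_boost_knn` and `rag_boost_bm25` control the balance between semantic and keyword relevance.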
## Data Flow

### Indexing (Async via Worker)

1. Content created/modified triggers `on_content_added`/`on_content_modified`
2. `queue_embedding_task` queues a job with the ES settings and index name
3. Worker fetches the content via `fetch_data` (REST API)
4. Text is split into overlapping chunks
5. Embeddings are generated via the OpenAI-compatible API
6. Chunks are indexed to Elasticsearch

### Question-Answering (Async via Worker)

1. `@rag-ask` (POST) receives the question from the client
2. A job ID is generated from the question hash + user security context + optional path
3. If a cached result exists, it is returned immediately
4. Otherwise, `rag_ask_task` is queued to the Redis worker
5. A pending status with a `job_id` for polling is returned
6. The worker performs the hybrid search and generates the LLM answer
7. The result is cached in Redis (TTL: 5 minutes)
8. The client polls `@rag-ask` (GET) with the `job_id` until completed

## Local Development

Start Ollama:

```bash
ollama serve
ollama pull mxbai-embed-large
ollama pull llama3.1:8b
```

Enable RAG:

```bash
export RAG_ENABLED=true
```

## REST API

### Ask a Question (Async)

The `@rag-ask` endpoint processes questions asynchronously by default, allowing the frontend to poll for results without blocking. Results are cached based on the question, user permissions, and optional path filter. (A reusable polling helper is sketched at the end of this section.)

```javascript
// Submit question (async)
const response = await fetch('/Plone/@rag-ask', {
  method: 'POST',
  headers: {
    'Accept': 'application/json',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    question: 'What are the opening hours?',
    path: '/plone/section'  // optional: filter to specific section
  })
});
const data = await response.json();
// data.status === 'pending', data.job_id === 'ragask_abc123'
```

**Pending response:**

```json
{
  "@id": "http://localhost:8080/Plone/@rag-ask",
  "status": "pending",
  "job_id": "ragask_abc123",
  "question": "What are the opening hours?"
}
```

**Poll for result:**

```javascript
// Poll until completed
const pollResponse = await fetch('/Plone/@rag-ask?job_id=ragask_abc123', {
  headers: { 'Accept': 'application/json' }
});
const result = await pollResponse.json();
```

**Completed response:**

```json
{
  "@id": "http://localhost:8080/Plone/@rag-ask",
  "status": "completed",
  "question": "What are the opening hours?",
  "answer": "Based on the information...",
  "sources": [
    {
      "title": "Contact",
      "path": "/contact",
      "portal_type": "Contact",
      "score": 0.92,
      "chunk_index": 0,
      "snippet": "Opening hours: Monday to Friday..."
    }
  ]
}
```

### Ask a Question (Sync)

Logged-in users can request synchronous processing by adding `"sync": true`. This bypasses the queue and returns the answer directly.

```javascript
const response = await fetch('/Plone/@rag-ask', {
  method: 'POST',
  headers: {
    'Accept': 'application/json',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    question: 'What are the opening hours?',
    sync: true
  })
});
const data = await response.json();
```

**Sync response:**

```json
{
  "@id": "http://localhost:8080/Plone/@rag-ask",
  "question": "What are the opening hours?",
  "answer": "Based on the information...",
  "sources": [
    {"title": "Contact", "path": "/contact", "score": 0.92, "chunk_index": 0}
  ]
}
```

### Path Filtering

Use the `path` parameter to restrict search results to a specific section of the site:

```javascript
const response = await fetch('/Plone/@rag-ask', {
  method: 'POST',
  headers: {
    'Accept': 'application/json',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    question: 'How do I configure this?',
    path: '/plone/documentation/admin-guide'
  })
});
```

This filters results to only include content under the specified path prefix.
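The GET example above shows a single poll; in practice the frontend repeats the request until the job leaves the pending state. A minimal polling helper might look like this (the interval and retry limit are arbitrary choices, not values mandated by the API):

```javascript
// Poll @rag-ask until the job leaves 'pending' or we give up.
// intervalMs and maxAttempts are illustrative defaults, not API requirements.
async function waitForAnswer(jobId, { intervalMs = 1000, maxAttempts = 30 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetch(`/Plone/@rag-ask?job_id=${jobId}`, {
      headers: { 'Accept': 'application/json' }
    });
    const result = await response.json();
    if (result.status !== 'pending') {
      return result; // e.g. 'completed'; any non-pending status is returned as-is
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`RAG job ${jobId} did not complete in time`);
}

// Usage: const result = await waitForAnswer(data.job_id);
```

Since results are cached in Redis with a 5-minute TTL, polling comfortably finishes before a completed result expires.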
### Reindex All Content

```javascript
const response = await fetch('/Plone/@rag-index-all', {
  method: 'POST',
  headers: {
    'Accept': 'application/json',
    'Content-Type': 'application/json'
  }
});
const data = await response.json();
```

**Response:**

```json
{
  "@id": "http://localhost:8080/Plone/@rag-index-all",
  "status": "queued",
  "queued_count": 150,
  "content_types": ["ContentPage", "News", "File", "Contact", "Book", "Chapter"]
}
```

### Index Single Content

```javascript
const response = await fetch('/Plone/@rag-index', {
  method: 'POST',
  headers: {
    'Accept': 'application/json',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    uid: 'content-uid-here',
    async: true  // default: true
  })
});
const data = await response.json();
```

**Response:**

```json
{
  "@id": "http://localhost:8080/Plone/@rag-index",
  "status": "queued",
  "uid": "content-uid-here"
}
```

### Check Status

```javascript
const response = await fetch('/Plone/@rag-status', {
  headers: { 'Accept': 'application/json' }
});
const data = await response.json();
```

**Response:**

```json
{
  "@id": "http://localhost:8080/Plone/@rag-status",
  "enabled": true,
  "index_name": "plone-rag-chunks",
  "exists": true,
  "chunk_count": 1250,
  "parent_count": 150,
  "dimensions": 1024
}
```
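Because `@rag-index-all` only queues jobs, one way to watch a full reindex is to poll `@rag-status` and compare `parent_count` against the `queued_count` from the reindex response. The completion heuristic below is an assumption for illustration, not a guarantee made by the API:

```javascript
// Rough progress monitor after @rag-index-all: poll @rag-status until the
// number of indexed parent documents reaches the number of queued items.
// The comparison is a heuristic, not an API guarantee.
async function watchReindex(queuedCount, intervalMs = 5000) {
  for (;;) {
    const response = await fetch('/Plone/@rag-status', {
      headers: { 'Accept': 'application/json' }
    });
    const status = await response.json();
    console.log(`${status.parent_count}/${queuedCount} items indexed`);
    if (status.parent_count >= queuedCount) return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage: await watchReindex(data.queued_count);
```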