Arabic Hadith RAG Pipeline

Hadith RAG is an AI-powered retrieval and reasoning engine designed for Islamic text understanding — specifically Hadith corpora such as Sahih Bukhari and Sahih Muslim. It leverages semantic chunking, Arabic embeddings, and LLM reasoning to answer complex questions accurately while preserving the sacred context and meaning of the Hadith.

A comprehensive Retrieval-Augmented Generation (RAG) system specialized for Arabic Hadith texts, built with LlamaIndex, Ollama, and Qdrant vector store.

صلى الله عليه وسلم (peace and blessings of Allah be upon him)

🌟 Features

  • Arabic Hadith Processing: Specialized pipeline for Arabic religious texts
  • Modern RAG Architecture: LlamaIndex + Ollama + Qdrant integration
  • Semantic Chunking: Intelligent document segmentation for better context
  • Sentence Window Context: Enhanced retrieval with surrounding context
  • Multiple Data Formats: Support for .txt, .md, and .json files
  • Interactive CLI: Rich terminal interface for queries
  • Modular Design: Clean, extensible codebase
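The sentence-window idea above can be illustrated with a toy function (a sketch of the concept only, not the actual LlamaIndex implementation):

```python
def sentence_window(sentences, i, window=3):
    """Return sentence i joined with up to `window` neighbours on each
    side -- the extra context handed to the LLM along with the match."""
    lo = max(0, i - window)
    return " ".join(sentences[lo:i + window + 1])

# The retriever matches a single sentence, but the LLM sees its neighbourhood:
docs = ["First hadith.", "Second hadith.", "Third hadith.", "Fourth hadith."]
context = sentence_window(docs, 1, window=1)
# -> "First hadith. Second hadith. Third hadith."
```

In the real pipeline this windowing is applied at retrieval time, which is why the `--no-window` flag described below can toggle it off.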

๐Ÿ› ๏ธ Technology Stack

  • LLM: Ollama qwen2.5:7b (multilingual Qwen model with Arabic support)
  • Embeddings: Ollama qwen3-embedding:4b
  • Vector Store: Qdrant (in-memory or server)
  • Framework: LlamaIndex ≥0.11.0
  • Interface: Typer + Rich for beautiful CLI

🚀 Quick Start

Prerequisites

  1. Install Ollama (https://ollama.ai)
  2. Pull required models:
    ollama pull qwen2.5:7b
    ollama pull qwen3-embedding:4b

Installation

Option 1: Automated Setup

git clone <repository-url>
cd hadith-rag
./setup.sh

Option 2: Conda Environment Setup

git clone <repository-url>
cd hadith-rag

# Create conda environment
conda env create -f environment.yml
conda activate hadith-rag

# Or use make commands
make conda-setup
conda activate hadith-rag

Option 3: Manual Setup

git clone <repository-url>
cd hadith-rag

# With pip
pip install -r requirements.txt

# Or with conda
conda create -n hadith-rag python=3.11 -y
conda activate hadith-rag
pip install -r requirements.txt

Add Data and Run

  1. Add your Hadith data to the data/ directory:

    • .json files with structure: {"hadiths": [{"text": "...", ...}]}
    • .txt or .md files with Hadith texts
    • Sample files included for testing
  2. Run the pipeline:

    python main.py interactive

📖 Usage

Interactive Mode

Start an interactive query session:

python main.py interactive [OPTIONS]

Options:

  • --rebuild: Rebuild the index from scratch
  • --no-window: Disable sentence window context
  • --top-k N: Number of documents to retrieve (default: 5)
  • --data-dir PATH: Custom data directory
  • --storage-dir PATH: Custom storage directory

Single Query

Execute a single query and exit:

python main.py query-single "ما هو حديث النية؟"

Build Index Only

Build or rebuild the document index:

python main.py build-index [OPTIONS]

System Check

Verify setup and dependencies:

python main.py check-setup

๐Ÿ—‚๏ธ Project Structure

hadith-rag/
├── main.py                 # CLI entry point
├── requirements.txt        # Python dependencies
├── data/                   # Hadith documents
│   ├── sahih_bukhari_sample.json
│   └── hadith_collection.md
├── storage/                # Vector index storage
├── src/                    # Core modules
│   ├── __init__.py
│   ├── config.py           # Configuration management
│   ├── embeddings.py       # Custom Ollama embedding wrapper
│   ├── document_loader.py  # Multi-format document loading
│   ├── index_builder.py    # Vector index creation
│   └── query_engine.py     # Query processing & response generation
└── README.md

โš™๏ธ Configuration

Settings are managed in src/config.py:

# Ollama settings
OLLAMA_BASE_URL = "http://localhost:11434"
EMBEDDING_MODEL = "qwen3-embedding:4b"
LLM_MODEL = "qwen2.5:7b"

# Retrieval settings
SIMILARITY_TOP_K = 5
SENTENCE_WINDOW_SIZE = 3

# Chunking settings  
CHUNK_SIZE = 512
CHUNK_OVERLAP = 50

Environment variables can override defaults via .env file.
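The override mechanism can be sketched like this (illustrative only; the actual src/config.py may differ in detail):

```python
import os

# Defaults match the values shown above; each setting can be overridden
# by an environment variable of the same name (e.g. loaded from .env).
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "qwen3-embedding:4b")
LLM_MODEL = os.getenv("LLM_MODEL", "qwen2.5:7b")

# Numeric settings are cast from their string environment values.
SIMILARITY_TOP_K = int(os.getenv("SIMILARITY_TOP_K", "5"))
SENTENCE_WINDOW_SIZE = int(os.getenv("SENTENCE_WINDOW_SIZE", "3"))
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "512"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "50"))
```

With this pattern, `EMBEDDING_MODEL=my-model python main.py interactive` would switch embeddings for one run without editing the source.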

📊 Data Formats

JSON Format

{
  "collection": "sahih_bukhari",
  "hadiths": [
    {
      "text": "Arabic hadith text here...",
      "english": "English translation",
      "narrator": "Narrator name",
      "grade": "Sahih",
      "number": 1
    }
  ]
}
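Loading this format can be sketched as follows (the function names are illustrative, not the project's actual API; fields match the sample above):

```python
import json
from pathlib import Path

def parse_hadith_collection(data, default_name="unknown"):
    """Turn a {"hadiths": [...]} dict into (text, metadata) records,
    skipping entries that have no Arabic text."""
    collection = data.get("collection", default_name)
    records = []
    for h in data.get("hadiths", []):
        if not h.get("text"):
            continue  # nothing to index
        meta = {"collection": collection}
        for key in ("english", "narrator", "grade", "number"):
            if key in h:
                meta[key] = h[key]
        records.append({"text": h["text"], "metadata": meta})
    return records

def load_hadith_json(path):
    """Read a collection file, defaulting the collection name to the filename."""
    path = Path(path)
    data = json.loads(path.read_text(encoding="utf-8"))
    return parse_hadith_collection(data, default_name=path.stem)
```

Keeping parsing separate from file I/O makes the format logic easy to unit-test.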

Text/Markdown Format

## Hadith Title

Arabic hadith text here...

## Another Hadith

More Arabic text...

The system automatically extracts metadata from file paths and content structure.
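That extraction can be sketched along these lines (a simplification of what a loader like document_loader.py might do; the names are illustrative):

```python
from pathlib import Path

def metadata_from_path(path):
    """Derive basic metadata from a data file's path,
    e.g. data/hadith_collection.md -> collection 'hadith_collection'."""
    p = Path(path)
    return {"file_name": p.name,
            "file_type": p.suffix.lstrip("."),
            "collection": p.stem}

def split_markdown_hadiths(text):
    """Split a markdown file into (title, body) pairs at '## ' headings,
    matching the Text/Markdown format shown above."""
    sections, title, lines = [], None, []

    def flush():
        body = "\n".join(lines).strip()
        if title and body:
            sections.append((title, body))

    for line in text.splitlines():
        if line.startswith("## "):
            flush()
            title, lines = line[3:].strip(), []
        else:
            lines.append(line)
    flush()
    return sections
```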

🎯 Example Queries

  • "ما هو حديث النية؟" (What is the hadith about intention?)
  • "أحاديث عن بر الوالدين" (Hadiths about honoring parents)
  • "قال رسول الله عن الصدق" (What the Prophet said about truthfulness)
  • "أحاديث في صحيح البخاري عن الصلاة" (Hadiths in Sahih Bukhari about prayer)

🔧 Advanced Usage

Custom Embedding Model

from src.embeddings import create_embedding_model

# Use different embedding model
embed_model = create_embedding_model(
    model_name="custom-arabic-model",
    base_url="http://localhost:11434"
)

Programmatic Usage

from src import HadithQueryEngine, build_hadith_index

# Build index
index = build_hadith_index(
    data_dir="./data",
    use_sentence_window=True,
    rebuild=True
)

# Create query engine  
engine = HadithQueryEngine(index, similarity_top_k=10)

# Query
result = engine.query("ما هو الإسلام؟")  # "What is Islam?"
print(result["answer"])

๐Ÿ›ก๏ธ Best Practices

  1. Data Quality: Ensure Arabic text is properly encoded (UTF-8)
  2. Index Management: Use --rebuild when adding new documents
  3. Performance: Start with smaller document collections for testing
  4. Memory: Qdrant's in-memory mode is suitable for smaller datasets; use a Qdrant server for larger corpora
  5. Models: Verify Ollama models are running before starting
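The last point can be automated with a small pre-flight check against Ollama's /api/tags endpoint (a sketch; `missing_models` is not part of this project):

```python
import json
import urllib.request

REQUIRED_MODELS = ["qwen2.5:7b", "qwen3-embedding:4b"]

def missing_models(required, base_url="http://localhost:11434", installed=None):
    """Return the required model names not reported by the Ollama server.

    `installed` may be supplied directly (e.g. in tests); otherwise the
    list is fetched from Ollama's /api/tags endpoint.
    """
    if installed is None:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            installed = [m["name"] for m in json.load(resp).get("models", [])]
    return [name for name in required if name not in installed]
```

Calling `missing_models(REQUIRED_MODELS)` before starting the pipeline reports anything you still need to `ollama pull`.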

🚨 Troubleshooting

Common Issues

Ollama Connection Failed:

# Check Ollama is running
ollama serve

# Verify models are available
ollama list

Import Errors:

# Install missing dependencies
pip install -r requirements.txt

Empty Index:

# Check data directory has valid files
python main.py check-setup

# Rebuild index
python main.py build-index --rebuild

Arabic Text Issues:

  • Ensure files are UTF-8 encoded
  • Check Arabic text renders correctly in terminal
  • Use --no-window if sentence-window processing causes errors
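A quick way to pin down encoding problems (`utf8_error` is a standalone helper, not part of the project):

```python
def utf8_error(raw):
    """Return None if `raw` bytes decode as UTF-8, otherwise a message
    locating the first invalid byte."""
    try:
        raw.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        return f"invalid byte at offset {exc.start}: {exc.reason}"

def check_file_utf8(path):
    """Apply utf8_error to a file's raw contents."""
    with open(path, "rb") as f:
        return utf8_error(f.read())
```

Files saved in legacy Arabic encodings (e.g. Windows-1256) will fail this check and should be re-saved as UTF-8.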

๐Ÿ“ Development

Adding New Features

  1. New Data Sources: Extend HadithDocumentLoader
  2. Custom Retrievers: Modify query_engine.py
  3. UI Improvements: Enhance main.py CLI
  4. New Models: Update embeddings.py and config.py
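For example, a new data source could be added as its own loader class (hypothetical: the real HadithDocumentLoader interface is not shown in this README, so the method names here are illustrative):

```python
import csv

class CsvHadithLoader:
    """Hypothetical loader for CSV files with a 'text' column;
    every other column becomes metadata."""

    def load_rows(self, lines):
        # csv.DictReader accepts any iterable of lines, which keeps
        # the parsing logic testable without touching the filesystem.
        return [
            {"text": row["text"],
             "metadata": {k: v for k, v in row.items() if k != "text"}}
            for row in csv.DictReader(lines)
            if row.get("text")
        ]

    def load(self, path):
        with open(path, encoding="utf-8", newline="") as f:
            return self.load_rows(f)
```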

Running Tests

# Install test dependencies
pip install pytest

# Run tests (when available)
pytest tests/

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make changes with proper documentation
  4. Add tests for new functionality
  5. Submit a pull request

📜 License

This project is open source and available under the MIT License.

๐Ÿ™ Acknowledgments

  • LlamaIndex: Excellent RAG framework
  • Ollama: Local LLM serving made simple
  • Qdrant: High-performance vector database
  • Arabic NLP Community: Research and tools for Arabic text processing

Built with ❤️ for Arabic Hadith preservation and accessibility.
