Hadith RAG is an AI-powered retrieval and reasoning engine for Islamic text understanding, specifically Hadith corpora such as Sahih Bukhari and Sahih Muslim. It combines semantic chunking, Arabic embeddings, and LLM reasoning to answer complex questions accurately while preserving the sacred context and meaning of the Hadith.
A comprehensive Retrieval-Augmented Generation (RAG) system specialized for Arabic Hadith texts, built with LlamaIndex, Ollama, and Qdrant vector store.
## Features
- Arabic Hadith Processing: Specialized pipeline for Arabic religious texts
- Modern RAG Architecture: LlamaIndex + Ollama + Qdrant integration
- Semantic Chunking: Intelligent document segmentation for better context
- Sentence Window Context: Enhanced retrieval with surrounding context
- Multiple Data Formats: Support for .txt, .md, and .json files
- Interactive CLI: Rich terminal interface for queries
- Modular Design: Clean, extensible codebase
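The sentence-window feature above pairs each embedded sentence with its surrounding sentences, so retrieval matches on a precise unit while the LLM sees wider context. A minimal sketch of the idea (illustrative only; in this project it is handled by LlamaIndex's sentence-window node parser):

```python
# Illustrative sketch of sentence-window retrieval context.
# Not the LlamaIndex implementation - just the underlying idea.

def sentence_windows(text, window_size=3):
    """Split text into sentences and attach +/- window_size
    neighbouring sentences to each one as retrieval context."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    nodes = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({
            "sentence": sent,                       # what gets embedded
            "window": ". ".join(sentences[lo:hi]),  # what the LLM sees
        })
    return nodes

nodes = sentence_windows("First sentence. Second sentence. Third sentence.", window_size=1)
print(nodes[1]["window"])  # → First sentence. Second sentence. Third sentence
```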
## Tech Stack

- LLM: Ollama `qwen2.5:7b` (Chinese model with Arabic capabilities)
- Embeddings: Ollama `qwen3-embedding:4b`
- Vector Store: Qdrant (in-memory or server)
- Framework: LlamaIndex ≥ 0.11.0
- Interface: Typer + Rich for a beautiful CLI

## Installation

### Prerequisites

1. Install Ollama (https://ollama.ai)
2. Pull the required models:

```bash
ollama pull qwen2.5:7b
ollama pull qwen3-embedding:4b
```
### Quick Setup

```bash
git clone <repository-url>
cd hadith-rag
./setup.sh
```

### Conda Setup

```bash
git clone <repository-url>
cd hadith-rag

# Create conda environment
conda env create -f environment.yml
conda activate hadith-rag

# Or use make commands
make conda-setup
conda activate hadith-rag
```

### Manual Setup

```bash
git clone <repository-url>
cd hadith-rag

# With pip
pip install -r requirements.txt

# Or with conda
conda create -n hadith-rag python=3.11 -y
conda activate hadith-rag
pip install -r requirements.txt
```
## Quick Start

1. Add your Hadith data to the `data/` directory:
   - `.json` files with the structure `{"hadiths": [{"text": "...", ...}]}`
   - `.txt` or `.md` files with Hadith texts
   - Sample files are included for testing
2. Run the pipeline:

```bash
python main.py interactive
```
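For a quick smoke test, you can generate a minimal data file matching the JSON schema above (the filename and collection name here are arbitrary; the hadith is the well-known hadith of intention):

```python
import json
from pathlib import Path

# A minimal sample file matching the documented {"hadiths": [...]} schema.
sample = {
    "collection": "sample_collection",
    "hadiths": [
        {
            "text": "إنما الأعمال بالنيات",
            "english": "Actions are but by intentions.",
            "narrator": "Umar ibn al-Khattab",
            "grade": "Sahih",
            "number": 1,
        }
    ],
}

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
path = data_dir / "sample_collection.json"
# ensure_ascii=False keeps the Arabic text readable in the file
path.write_text(json.dumps(sample, ensure_ascii=False, indent=2), encoding="utf-8")
print(f"Wrote {len(sample['hadiths'])} hadith(s) to {path}")
```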
## Usage

### Interactive Mode

Start an interactive query session:

```bash
python main.py interactive [OPTIONS]
```

Options:

- `--rebuild`: Rebuild the index from scratch
- `--no-window`: Disable sentence window context
- `--top-k N`: Number of documents to retrieve (default: 5)
- `--data-dir PATH`: Custom data directory
- `--storage-dir PATH`: Custom storage directory

### Single Query

Execute a single query and exit:

```bash
python main.py query-single "ما هو حديث النية؟"
```

### Build Index

Build or rebuild the document index:

```bash
python main.py build-index [OPTIONS]
```

### Check Setup

Verify setup and dependencies:

```bash
python main.py check-setup
```

## Project Structure

```
hadith-rag/
├── main.py                # CLI entry point
├── requirements.txt       # Python dependencies
├── data/                  # Hadith documents
│   ├── sahih_bukhari_sample.json
│   └── hadith_collection.md
├── storage/               # Vector index storage
├── src/                   # Core modules
│   ├── __init__.py
│   ├── config.py          # Configuration management
│   ├── embeddings.py      # Custom Ollama embedding wrapper
│   ├── document_loader.py # Multi-format document loading
│   ├── index_builder.py   # Vector index creation
│   └── query_engine.py    # Query processing & response generation
└── README.md
```
## Configuration

Settings are managed in `src/config.py`:

```python
# Ollama settings
OLLAMA_BASE_URL = "http://localhost:11434"
EMBEDDING_MODEL = "qwen3-embedding:4b"
LLM_MODEL = "qwen2.5:7b"

# Retrieval settings
SIMILARITY_TOP_K = 5
SENTENCE_WINDOW_SIZE = 3

# Chunking settings
CHUNK_SIZE = 512
CHUNK_OVERLAP = 50
```

Environment variables can override defaults via a `.env` file.
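The exact override mechanism in `src/config.py` is not shown here, but the usual pattern is for each setting to fall back to its default unless an environment variable of the same name is set:

```python
import os

# Sketch of the .env override pattern (the actual code in src/config.py
# may differ): environment variable wins, otherwise use the default.
def load_settings():
    return {
        "OLLAMA_BASE_URL": os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
        "LLM_MODEL": os.getenv("LLM_MODEL", "qwen2.5:7b"),
        "SIMILARITY_TOP_K": int(os.getenv("SIMILARITY_TOP_K", "5")),
        "CHUNK_SIZE": int(os.getenv("CHUNK_SIZE", "512")),
    }

os.environ["CHUNK_SIZE"] = "1024"     # e.g. exported from a .env file
print(load_settings()["CHUNK_SIZE"])  # → 1024
```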
## Data Formats

### JSON Files

```json
{
  "collection": "sahih_bukhari",
  "hadiths": [
    {
      "text": "Arabic hadith text here...",
      "english": "English translation",
      "narrator": "Narrator name",
      "grade": "Sahih",
      "number": 1
    }
  ]
}
```

### Markdown/Text Files

```markdown
## Hadith Title

Arabic hadith text here...

## Another Hadith

More Arabic text...
```

The system automatically extracts metadata from file paths and content structure.
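The loader's exact heuristics live in `document_loader.py`; path-based metadata extraction typically looks something like this (the `_sample` stripping and title-casing are illustrative assumptions, not the project's actual rules):

```python
from pathlib import Path

# Illustrative sketch of deriving collection metadata from a file path,
# in the spirit of document_loader.py (actual heuristics may differ).
def metadata_from_path(path_str):
    path = Path(path_str)
    stem = path.stem.replace("_sample", "")  # drop a sample-file suffix
    return {
        "source_file": path.name,
        "collection": stem.replace("_", " ").title(),
        "format": path.suffix.lstrip("."),
    }

meta = metadata_from_path("data/sahih_bukhari_sample.json")
print(meta["collection"])  # → Sahih Bukhari
```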
## Example Queries

- "ما هو حديث النية؟" (What is the hadith about intention?)
- "أحاديث عن بر الوالدين" (Hadiths about honoring parents)
- "قال رسول الله عن الصدق" (What the Prophet said about truthfulness)
- "أحاديث في صحيح البخاري عن الصلاة" (Hadiths in Sahih Bukhari about prayer)
## Python API

Use a custom embedding model:

```python
from src.embeddings import create_embedding_model

# Use a different embedding model
embed_model = create_embedding_model(
    model_name="custom-arabic-model",
    base_url="http://localhost:11434"
)
```

Build an index and query it programmatically:

```python
from src import HadithQueryEngine, build_hadith_index

# Build index
index = build_hadith_index(
    data_dir="./data",
    use_sentence_window=True,
    rebuild=True
)

# Create query engine
engine = HadithQueryEngine(index, similarity_top_k=10)

# Query
result = engine.query("ما هو الإسلام؟")
print(result["answer"])
```

## Tips

- Data Quality: Ensure Arabic text is properly encoded (UTF-8)
- Index Management: Use `--rebuild` when adding new documents
- Performance: Start with smaller document collections for testing
- Memory: Qdrant's in-memory mode is suitable for smaller datasets
- Models: Verify the Ollama models are running before starting
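The encoding tip above is easy to automate: scan the data directory and flag any file that does not decode as UTF-8 before building the index. A small standalone sketch (the demo directory and file names are made up for illustration):

```python
from pathlib import Path
import tempfile

# Flag any data files that are not valid UTF-8 before indexing.
def find_non_utf8(data_dir):
    bad = []
    for path in sorted(Path(data_dir).rglob("*")):
        if path.suffix in {".txt", ".md", ".json"}:
            try:
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError:
                bad.append(path.name)
    return bad

# Demo on a throwaway directory: one valid file, one broken one.
demo = Path(tempfile.mkdtemp())
(demo / "good.txt").write_text("إنما الأعمال بالنيات", encoding="utf-8")
(demo / "bad.txt").write_bytes(b"\xff\xfe broken bytes")
print(find_non_utf8(demo))  # → ['bad.txt']
```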
## Troubleshooting

Ollama Connection Failed:

```bash
# Check Ollama is running
ollama serve

# Verify models are available
ollama list
```

Import Errors:

```bash
# Install missing dependencies
pip install -r requirements.txt
```

Empty Index:

```bash
# Check data directory has valid files
python main.py check-setup

# Rebuild index
python main.py build-index --rebuild
```

Arabic Text Issues:

- Ensure files are UTF-8 encoded
- Check that Arabic text renders correctly in your terminal
- Use `--no-window` if semantic chunking fails
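The `ollama list` check can also be done from Python: Ollama's HTTP API serves the locally available models at `GET /api/tags`, so the same request doubles as a connectivity check before starting the pipeline.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

# Programmatic version of `ollama list`: GET /api/tags returns the
# locally available models, so this doubles as a connectivity check.
def ollama_models(base_url="http://localhost:11434", timeout=3):
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except (URLError, OSError):
        return None  # Ollama not reachable

models = ollama_models()
if models is None:
    print("Ollama is not running - start it with `ollama serve`")
else:
    print("Available models:", models)
```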
## Extending

- New Data Sources: Extend `HadithDocumentLoader`
- Custom Retrievers: Modify `query_engine.py`
- UI Improvements: Enhance the `main.py` CLI
- New Models: Update `embeddings.py` and `config.py`
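As an example of a new data source, here is the kind of parsing step you might plug into the loader for CSV input. The `HadithDocumentLoader` interface is not shown in this README, so this standalone function is a hypothetical sketch of only the row-to-record conversion:

```python
import csv
import io

# Hypothetical new data source: parse CSV rows into the same dict shape
# the JSON loader produces. Only the parsing step is sketched here.
def load_csv_hadiths(fileobj):
    return [
        {
            "text": row["text"],
            "narrator": row.get("narrator", ""),
            "grade": row.get("grade", ""),
        }
        for row in csv.DictReader(fileobj)
    ]

sample = io.StringIO("text,narrator,grade\nإنما الأعمال بالنيات,Umar,Sahih\n")
print(load_csv_hadiths(sample))
```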
## Testing

```bash
# Install test dependencies
pip install pytest

# Run tests (when available)
pytest tests/
```

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes with proper documentation
4. Add tests for new functionality
5. Submit a pull request
## License

This project is open source and available under the MIT License.
## Acknowledgments

- LlamaIndex: Excellent RAG framework
- Ollama: Local LLM serving made simple
- Qdrant: High-performance vector database
- Arabic NLP Community: Research and tools for Arabic text processing

Built with ❤️ for Arabic Hadith preservation and accessibility.
