An internal Retrieval-Augmented Generation (RAG) assistant that answers your questions using a local embedding model and LLM pulled via Ollama, plus a rerank model downloaded from Hugging Face and served with vLLM.
This version uses community-sourced cooking tips as its knowledge base, but you can easily swap in your own content.
- Fully local RAG setup (no cloud APIs required)
- Uses Ollama-compatible models for both embedding and generation
- Incorporates a Hugging Face rerank model served with vLLM
- Built in Python, using the LangChain and FAISS libraries
- Combines embedding and rerank models to improve knowledge retrieval accuracy
- Interactive terminal interface
- Falls back to general knowledge if the answer isn't found in the local data
- Python 3.11+
- Ollama installed and running
- An LLM model and an embedding model pulled from Ollama to your machine (after pulling, update lines 21 and 25 of the code accordingly)
- vLLM installed and running (update line 31 after vLLM is set up)
- huggingface_hub installed
- The rerank model 'bge-reranker-v2-m3' downloaded from huggingface.co with the huggingface-cli command
Install Python library dependencies:
pip install langchain langchain-community langchain-ollama faiss-cpu requests

(The json module is part of the Python standard library and does not need to be installed.)

Check that the knowledge base file is in the same directory as "internal-rag-cookbot.py". I have named mine "cooking-tips-comments.txt"; both the name and the contents of the file can be changed.
This project uses Retrieval-Augmented Generation (RAG), which combines an embedding vector database created from your content, a reranker that scores the relevance of retrieved content, and a language model, to produce more accurate and contextual answers.
- The text file (cooking-tips-comments.txt) is our knowledge base file.
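As a minimal sketch, loading the knowledge base and splitting it into chunks might look like the following. The filename and the blank-line splitting rule are illustrative assumptions; the actual project may use a LangChain text splitter instead.

```python
# Hypothetical sketch: load the knowledge base file and split it into
# chunks on blank lines. The real project may chunk text differently.

def load_chunks(text: str) -> list[str]:
    """Split raw text into chunks on blank lines, dropping empty pieces."""
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]

sample = "Brown the butter first.\n\nSalt your pasta water generously."
chunks = load_chunks(sample)
print(chunks)  # one chunk per tip
```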
- The contents of the file are converted into dense numeric vectors (embeddings) by a local embedding model (in the code provided, we used 'snowflake-arctic-embed:335m' from Ollama).
- These vectors capture the semantic meaning of the text: two similar tips will have similar embeddings.
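The idea that similar tips get similar embeddings is usually measured with cosine similarity. Here is a toy illustration with hand-made 3-dimensional vectors; real embeddings have hundreds of dimensions and come from the embedding model, so the numbers below are assumptions purely for demonstration.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cookie_tip_a = [0.9, 0.1, 0.2]  # pretend embedding of a cookie tip
cookie_tip_b = [0.8, 0.2, 0.3]  # pretend embedding of a similar cookie tip
pasta_tip    = [0.1, 0.9, 0.1]  # pretend embedding of an unrelated pasta tip

print(cosine_similarity(cookie_tip_a, cookie_tip_b))  # close to 1.0
print(cosine_similarity(cookie_tip_a, pasta_tip))     # much lower
```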
- The vectors are stored in FAISS, a fast, in-memory vector store that supports efficient similarity search.
- This lets the system quickly find the most relevant chunks of text when a new question is asked.
- When you ask a question, it is also embedded into a vector on the spot, so it can be compared against the stored vectors.
- FAISS compares this vector to the ones in the vector database and returns the top matching text chunks. The specific term used here is "top-k".
- top-k is the number of top matching entries we retrieve from the vector store using a fast but rough similarity search. If k is too small, we might miss relevant information; if it's too large, we are likely to pull in unrelated content. The code provided uses top-k = 10, which is a good fit for our data size. For a much larger knowledge base, a bigger k is preferred (some knowledge bases are so large that they use k = 100).
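Conceptually, the top-k step scores every stored vector against the query vector and keeps the k best. FAISS does this with optimized index structures; the brute-force loop below is only a sketch of what "top-k" means, and the tiny 2-dimensional vectors are made-up stand-ins for real embeddings.

```python
import heapq
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query: list[float], store: list[tuple[str, list[float]]], k: int = 10):
    """store: list of (chunk_text, vector). Returns the k best-scoring entries."""
    return heapq.nlargest(k, store, key=lambda item: cosine(query, item[1]))

store = [
    ("brown the butter", [0.9, 0.1]),
    ("salt the pasta water", [0.1, 0.9]),
    ("chill cookie dough", [0.8, 0.3]),
]
query = [0.85, 0.2]  # pretend embedding of a cookie question
for text, _ in top_k(query, store, k=2):
    print(text)  # the two cookie-related chunks win
```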
- These chunks are sent to a rerank API using 'bge-reranker-v2-m3' to reorder them by relevance.
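A hedged sketch of that rerank call is shown below. The endpoint path, port, and response shape are assumptions that depend on your vLLM version (recent vLLM builds expose a rerank endpoint such as /v1/rerank); the project itself uses the requests library, but the standard library is used here to keep the sketch dependency-free.

```python
import json
import urllib.request

RERANK_URL = "http://localhost:8000/v1/rerank"  # assumed local vLLM server

def build_payload(query: str, chunks: list[str]) -> dict:
    """Request body for a rerank endpoint (assumed Jina/Cohere-style schema)."""
    return {
        "model": "BAAI/bge-reranker-v2-m3",
        "query": query,
        "documents": chunks,
    }

def rerank(query: str, chunks: list[str]) -> list[str]:
    """POST the chunks to the reranker and return them sorted by relevance."""
    req = urllib.request.Request(
        RERANK_URL,
        data=json.dumps(build_payload(query, chunks)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        # Assumed response shape: {"results": [{"index": ..., "relevance_score": ...}]}
        results = json.load(resp)["results"]
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [chunks[r["index"]] for r in ranked]

payload = build_payload("how do I improve cookies?", ["brown the butter", "salt pasta water"])
print(sorted(payload))  # the payload's keys
```

Calling rerank() requires the vLLM server to be running; check your vLLM version's docs for the exact endpoint and schema.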
- Only the top result is used as context for generation.
- The context is passed to the LLM (in the provided code we used 'gemma3:12b') along with our question.
- The LLM uses this context to generate a more accurate, grounded, and helpful response.
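The generation step boils down to stuffing the top chunk into the prompt next to the question. The template wording below is an illustrative assumption, not the project's exact prompt; the resulting string is what gets sent to the local Ollama model ('gemma3:12b').

```python
# Hypothetical prompt template; the project's actual wording may differ.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question. "
    "If the context is not relevant, fall back to general knowledge.\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)

def build_prompt(context: str, question: str) -> str:
    """Combine the top reranked chunk and the user's question into one prompt."""
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    context="Brown the butter before adding it to cookie dough.",
    question="i am trying to make chocolate chip cookies",
)
print(prompt)
```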
Run the following command in the directory of the Python file:

python internal-rag-cookbot.py

Type in what you want to ask when the prompt "Your question: " shows up.
Example use of the code:
<some-user-directory>:~$ python chatbot-rerank.py
Internal RAG Q&A Bot
Ask questions about cooking hacks and kitchen tips.
This assistant is powered by a local language model and a custom knowledge base built from community-sourced cooking advice.
Type 'exit' to quit.
Here we enter our question:

Your question: i am trying to make chocolate chip cookies

The answer returned is:
Top relevant chunk:
Not mine, but my wife browns the butter before she adds it to chocolate chip cookie dough and they're the best freakin' cookies I've ever eaten! ...
Answer: That's great! My wife browns the butter before adding it to her chocolate chip cookie dough, and it makes a huge difference – they're amazing! You should try it!
Now we can type "exit" to close this chatbot:
Your question: exit
Goodbye!