Unlocking the Full Potential of Local Llama 3 on Windows
I view Retrieval-Augmented Generation (RAG) as fundamental for the evolution of AI technologies. Our goal isn't just to have AI that produces random responses; we need an AI that can pull answers from designated document collections, comprehend the context of inquiries, navigate its embeddings, and perform web searches as required. It should also be able to evaluate the accuracy of its responses to avoid generating misleading information, ultimately providing answers that are as coherent and human-like as possible, based on the documents we provide.
The wait is over. Let's delve into the details.
This article draws inspiration from this video:
Several modifications have been made to how the source data is used. Instead of depending on a single PDF, the system now loads an entire directory of PDFs as one of its sources. In addition, the router now prefers the vector store for inquiries, falling back to a web search only when the vector store cannot cover the question.
In this comprehensive analysis, we will break down the provided code snippet to unveil the mechanics of Langchain:
# Install modules
!pip install ollama langchain beautifulsoup4 chromadb gradio unstructured langchain-nomic langchain_community tiktoken langchainhub langgraph tavily-python gpt4all -q
!pip install "unstructured[all-docs]" -q
!ollama pull llama3
!ollama pull nomic-embed-text
These commands kickstart the installation of necessary modules and libraries required for Langchain and its functionalities. The pip install commands guarantee that all vital dependencies are included, while ollama pull retrieves specific models and resources needed for text processing.
# Importing libraries
import os
import bs4
import getpass
import ollama
from typing import List
from typing_extensions import TypedDict
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import (
WebBaseLoader,
UnstructuredPDFLoader,
OnlinePDFLoader,
UnstructuredFileLoader,
PyPDFDirectoryLoader,
)
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings, GPT4AllEmbeddings
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.prompts import PromptTemplate, ChatPromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_core.documents import Document
Here, we import various libraries and modules essential for the operation of Langchain. These include tools for text splitting, document loading, vector embedding, output parsing, and others. Each import statement contributes vital functionality for diverse aspects of natural language processing (NLP) tasks.
# Options
local_llm = 'llama3'
llm = ChatOllama(model=local_llm, format="json", temperature=0)
# embeddings
embeddings = GPT4AllEmbeddings()
This segment configures the options for Langchain. local_llm identifies the local model to be utilized, while llm initializes a ChatOllama instance (constrained to JSON output via format="json") that will power the grading and routing chains. The embedding model is also chosen here; this example uses GPT4AllEmbeddings, with OllamaEmbeddings (e.g., the pulled nomic-embed-text model) imported as an alternative.
# Sources
# URL
urls = [
"https://lilianweng.github.io/posts/2023-06-23-agent/",
"https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
"https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]
loader = PyPDFDirectoryLoader("C://Users//ASUS//Downloads//sources//")
data = loader.load()
docs_list.extend(data)
This code snippet retrieves textual information from various sources, including web links and PDF documents. The WebBaseLoader is utilized to load data from URLs, while the PyPDFDirectoryLoader is responsible for fetching PDF files from a specified local directory.
# Splitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
chunk_size=1000, chunk_overlap=200)
doc_splits = text_splitter.split_documents(docs_list)
In this section, the text splitter is initialized to divide documents into smaller segments for efficient processing. This step is vital for tasks such as vectorization and retrieval, where managing large documents can be challenging.
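A quick, optional sanity check (the print statement below is only illustrative) shows how many chunks the splitter produced from the loaded documents:
# Optional: inspect the result of splitting
print(f"{len(docs_list)} source documents -> {len(doc_splits)} chunks")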
# Add to vectorDB
vectorstore = Chroma.from_documents(
documents=doc_splits,
collection_name="rag-chroma",
embedding=embeddings,
)
retriever = vectorstore.as_retriever()
This code creates a vector store using Chroma, a Langchain component designed for storing and querying document embeddings. The documents are vectorized with the specified embeddings and stored in the vector store, enabling efficient retrieval based on semantic similarity.
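Before wiring the retriever into the graph, it can be exercised on its own; the query below is only an example and should be adapted to your own documents:
# Optional: query the retriever directly (example question)
sample_docs = retriever.invoke("What is task decomposition for LLM agents?")
print(len(sample_docs), "documents retrieved")
print(sample_docs[0].page_content[:200])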
# Retrieval Grader
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing relevance
    of a retrieved document to a user question. If the document contains keywords related to the user question,
    grade it as relevant. It does not need to be a stringent test. The goal is to filter out erroneous retrievals.
    Give a binary score 'yes' or 'no' to indicate whether the document is relevant to the question.
    Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    Here is the retrieved document: \n\n {document} \n\n
    Here is the user question: {question} \n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    """,
    input_variables=["question", "document"],
)
retrieval_grader = prompt | llm | JsonOutputParser()
This section defines a prompt template for grading the relevance of retrieved documents to a user question. The template sets the grading criteria and instructs the model to return a binary 'yes'/'no' score indicating the document's relevance. The prompt is piped through the ChatOllama instance and a JsonOutputParser, so the score comes back as JSON for further evaluation.
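To see the grader in isolation, you can run it on a single retrieved chunk; the question below is only an example, and the expected output shape is an assumption based on the prompt's instructions:
# Optional: test the retrieval grader on one chunk (example question)
sample_question = "agent memory"
sample_docs = retriever.invoke(sample_question)
print(retrieval_grader.invoke({"question": sample_question, "document": sample_docs[0].page_content}))
# Expected output shape (assumption): {'score': 'yes'} or {'score': 'no'}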
# Generate
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are an assistant for question-answering tasks.
    Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know.
    Use three sentences maximum and keep the answer concise <|eot_id|><|start_header_id|>user<|end_header_id|>
    Question: {question}
    Context: {context}
    Answer: <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["question", "context"],
)
This prompt template is designed for generating responses to user questions based on retrieved context. It instructs the assistant to provide concise answers using up to three sentences, leveraging the retrieved documents as context. This template aids in question-answering tasks within the Langchain framework.
# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
This function processes retrieved documents, formatting them into a readable text format by concatenating the page content of each document with double line breaks for clarity.
# Chain
# Use a plain (non-JSON) model instance for free-form answer generation
llm_generator = ChatOllama(model=local_llm, temperature=0)
rag_chain = prompt | llm_generator | StrOutputParser()
A processing chain (rag_chain) is established using Langchain components: the prompt template, a second ChatOllama instance (llm_generator) created without format="json" so that answers come back as plain text rather than forced JSON, and a string output parser. This chain generates responses to user queries based on the provided context, while the JSON-constrained llm remains in use for the grading and routing chains.
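The generation chain can also be tested outside the graph; the question below is merely illustrative:
# Optional: standalone test of the generation chain (example question)
sample_question = "What is prompt engineering?"
sample_docs = retriever.invoke(sample_question)
print(rag_chain.invoke({"context": format_docs(sample_docs), "question": sample_question}))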
# Hallucination Grader
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing whether
    an answer is grounded in / supported by a set of facts. Give a binary score 'yes' or 'no' to indicate
    whether the answer is grounded in / supported by a set of facts. Provide the binary score as a JSON with a
    single key 'score' and no preamble or explanation. <|eot_id|><|start_header_id|>user<|end_header_id|>
    Here are the facts:
    \n ------- \n
    {documents}
    \n ------- \n
    Here is the answer: {generation} <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["generation", "documents"],
)
hallucination_grader = prompt | llm | JsonOutputParser()
This section sets up a prompt template to evaluate whether the generated answer is supported by a set of facts. The template instructs the model to decide if the answer is grounded in the provided documents, and the resulting hallucination_grader chain pipes the answer and documents through ChatOllama and a JsonOutputParser so the score is returned as JSON.
# Answer Grader
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing whether an
    answer is useful to resolve a question. Give a binary score 'yes' or 'no' to indicate whether the answer is
    useful to resolve a question. Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.
    <|eot_id|><|start_header_id|>user<|end_header_id|> Here is the answer:
    \n ------- \n
    {generation}
    \n ------- \n
    Here is the question: {question} <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["generation", "question"],
)
answer_grader = prompt | llm | JsonOutputParser()
This prompt template assesses whether a generated answer actually resolves the user's question. The model is instructed to return a binary 'yes'/'no' score, and the answer_grader chain feeds the answer and question through ChatOllama and parses the resulting score into JSON.
# Router
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You excel in directing user inquiries either to a vector store or a web search.
    For queries related to documents within the vector store, prioritize utilizing the vector store.
    There's no need to strictly match keywords in the question to topics within the vector store.
    If the question isn't covered by the vector store's content, resort to a web search.
    Provide a binary decision, 'web_search' or 'vectorstore', depending on the nature of the question.
    Return a JSON with a single key 'datasource' and
    no preamble or explanation. Question to route: {question} <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["question"],
)
question_router = prompt | llm | JsonOutputParser()
This segment defines a prompt template for routing user inquiries to either the vector store or a web search, depending on the nature of the query. The model is instructed to return a binary decision ('web_search' or 'vectorstore') indicating the preferred data source, and the question_router chain parses that decision into JSON.
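A quick check of the router outside the graph (the question, and the expected output shown in the comment, are only illustrative):
# Optional: test the router on an example question
print(question_router.invoke({"question": "What is agent memory?"}))
# Expected output shape (assumption): {'datasource': 'vectorstore'}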
# Search
os.environ["TAVILY_API_KEY"] = "tvly-XXXX"
web_search_tool = TavilySearchResults(k=3)
This code initializes a web search tool (web_search_tool) powered by the Tavily API. The API key is set as an environment variable (replace the placeholder with your own key), and k=3 limits each search to the top three results.
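To confirm the tool is wired up correctly, it can be called directly; the query is only an example, and each returned hit is expected to include a 'content' snippet, which the web_search node below relies on:
# Optional: call the search tool directly (example query)
results = web_search_tool.invoke({"query": "latest Llama 3 release"})
for r in results:
    print(r["content"][:120])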
# State
class GraphState(TypedDict):
"""
Represents the state of our graph.
Attributes:
question: user question
generation: generated response
web_search: flag for web search
documents: list of documents
"""
question: str
generation: str
web_search: str
documents: List[str]
Here, a GraphState class is defined to represent the state of the Langchain graph, encompassing attributes like the user question, generated answer, whether a web search is needed, and a list of pertinent documents.
# Nodes
def retrieve(state):
    """
    Retrieve documents from vectorstore
    Args:
        state (dict): The current graph state
    Returns:
        state (dict): Updated state with retrieved documents
    """
    print("---RETRIEVE---")
    question = state["question"]
    # Retrieval
    documents = retriever.invoke(question)
    return {"documents": documents, "question": question}
def generate(state):
    """
    Generate answer using RAG on retrieved documents
    Args:
        state (dict): The current graph state
    Returns:
        state (dict): Updated state with generated answer
    """
    print("---GENERATE---")
    question = state["question"]
    documents = state["documents"]
    # RAG generation: format the retrieved documents into a single context string
    generation = rag_chain.invoke({"context": format_docs(documents), "question": question})
    return {"documents": documents, "question": question, "generation": generation}
def grade_documents(state):
    """
    Determines whether the retrieved documents are relevant to the question.
    If any document is not relevant, we set a flag to run web search.
    Args:
        state (dict): The current graph state
    Returns:
        state (dict): Updated state with relevant documents and web_search flag
    """
    print("---CHECK DOCUMENT RELEVANCE TO QUESTION---")
    question = state["question"]
    documents = state["documents"]
    # Score each doc
    filtered_docs = []
    web_search = "No"
    for d in documents:
        score = retrieval_grader.invoke({"question": question, "document": d.page_content})
        grade = score['score']
        # Document relevant
        if grade.lower() == "yes":
            print("---GRADE: DOCUMENT RELEVANT---")
            filtered_docs.append(d)
        # Document not relevant
        else:
            print("---GRADE: DOCUMENT NOT RELEVANT---")
            web_search = "Yes"
            continue
    return {"documents": filtered_docs, "question": question, "web_search": web_search}
These functions represent distinct nodes in the Langchain graph, each responsible for a specific task: retrieve fetches relevant documents from the vector store, generate produces an answer using the RAG model, and grade_documents assesses the relevance of retrieved documents to the user query, determining if a web search is necessary.
def web_search(state):
    """
    Perform web search based on the question
    Args:
        state (dict): The current graph state
    Returns:
        state (dict): Updated state with appended web results
    """
    print("---WEB SEARCH---")
    question = state["question"]
    documents = state["documents"]
    # Web search
    docs = web_search_tool.invoke({"query": question})
    web_results = "\n".join([d["content"] for d in docs])
    web_results = Document(page_content=web_results)
    if documents is not None:
        documents.append(web_results)
    else:
        documents = [web_results]
    return {"documents": documents, "question": question}
This function conducts a web search based on the user question and adds the retrieved results to the existing document list. It employs the Tavily API to perform the web search, formatting and appending the acquired content to the document list.
def route_question(state):
    """
    Route question to web search or RAG.
    Args:
        state (dict): The current graph state
    Returns:
        str: Next node to call
    """
    print("---ROUTE QUESTION---")
    question = state["question"]
    print(question)
    source = question_router.invoke({"question": question})
    print(source)
    print(source['datasource'])
    if source['datasource'] == 'web_search':
        print("---ROUTE QUESTION TO WEB SEARCH---")
        return "websearch"
    elif source['datasource'] == 'vectorstore':
        print("---ROUTE QUESTION TO RAG---")
        return "vectorstore"
This function determines how to route user questions based on their nature. It uses the question_router to evaluate if the question should be directed to a web search or processed with the RAG model, based on the output from the question_router component.
def decide_to_generate(state):
    """
    Determine whether to generate an answer or add web search
    Args:
        state (dict): The current graph state
    Returns:
        str: Decision for next node to call
    """
    print("---ASSESS GRADED DOCUMENTS---")
    question = state["question"]
    web_search = state["web_search"]
    filtered_documents = state["documents"]
    if web_search == "Yes":
        # Some documents were filtered out by the relevance check,
        # so supplement the context with a web search
        print("---DECISION: ALL DOCUMENTS ARE NOT RELEVANT TO QUESTION, INCLUDE WEB SEARCH---")
        return "websearch"
    else:
        # We have relevant documents, so generate answer
        print("---DECISION: GENERATE---")
        return "generate"
This function decides whether to generate an answer using the RAG model or to proceed with a web search based on the relevance of retrieved documents. If all documents are considered irrelevant, it opts for a web search; otherwise, it proceeds with answer generation.
def grade_generation_v_documents_and_question(state):
    """
    Determine if the generation is grounded in the document and answers the question.
    Args:
        state (dict): The current graph state
    Returns:
        str: Decision for next node to call
    """
    print("---CHECK HALLUCINATIONS---")
    question = state["question"]
    documents = state["documents"]
    generation = state["generation"]
    score = hallucination_grader.invoke({"documents": documents, "generation": generation})
    grade = score['score']
    # Check hallucination
    if grade == "yes":
        print("---DECISION: GENERATION IS GROUNDED IN DOCUMENTS---")
        # Check question-answering
        print("---GRADE GENERATION vs QUESTION---")
        score = answer_grader.invoke({"question": question, "generation": generation})
        grade = score['score']
        if grade == "yes":
            print("---DECISION: GENERATION ADDRESSES QUESTION---")
            return "useful"
        else:
            print("---DECISION: GENERATION DOES NOT ADDRESS QUESTION---")
            return "not useful"
    else:
        print("---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---")
        return "not supported"
This function evaluates the factual basis and relevance of the generated answer in relation to the user question. It employs the hallucination_grader to verify if the answer is supported by the provided documents and the answer_grader to assess its effectiveness in addressing the question. Based on these evaluations, it determines whether the answer is useful.
from langgraph.graph import END, StateGraph
workflow = StateGraph(GraphState)
# Define the nodes
workflow.add_node("websearch", web_search) # web search
workflow.add_node("retrieve", retrieve) # retrieve
workflow.add_node("grade_documents", grade_documents) # grade documents
workflow.add_node("generate", generate) # generate
# Build graph
workflow.set_conditional_entry_point(
route_question,
{
"websearch": "websearch",
"vectorstore": "retrieve",
},
)
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
"grade_documents",
decide_to_generate,
{
"websearch": "websearch",
"generate": "generate",
},
)
workflow.add_edge("websearch", "generate")
workflow.add_conditional_edges(
"generate",
grade_generation_v_documents_and_question,
{
"not supported": "generate",
"useful": END,
"not useful": "websearch",
},
)
In this section, a Langchain graph (workflow) is built to orchestrate the sequence of operations. Nodes representing various tasks, such as document retrieval, grading, answer generation, and web search, are integrated into the graph. Conditional edges are established to manage routing decisions and guide the execution flow based on the current state.
try:
    # Compile
    app = workflow.compile()
    # Test
    from pprint import pprint
    inputs = {"question": "Who is bedy kharisma?"}
    for output in app.stream(inputs):
        for key, value in output.items():
            pprint(f"Finished running: {key}:")
    pprint(value["generation"])
except Exception as e:
    # Handle the error
    print("An error occurred:", e)
Finally, the Langchain graph is compiled into a functional application (app), which is then tested using sample inputs. The graph processes inputs through its defined nodes and edges, executing the specified tasks and producing output. Any errors encountered during execution are handled gracefully, ensuring robustness.
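The install list at the top also includes gradio, although the notebook itself never uses it. As a rough idea of how the compiled graph could be exposed through a small web UI, here is a minimal sketch; the function name ask and the interface settings are illustrative and not part of the original notebook:
# Hypothetical Gradio wrapper around the compiled graph (not in the original notebook)
import gradio as gr

def ask(question):
    answer = ""
    # Stream the graph and keep the last generated answer
    for output in app.stream({"question": question}):
        for key, value in output.items():
            if "generation" in value:
                answer = value["generation"]
    return answer

gr.Interface(fn=ask, inputs="text", outputs="text", title="Local Llama 3 RAG").launch()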
This thorough breakdown offers an in-depth view of the intricate mechanisms within the Langchain framework, demonstrating its versatility and capability in addressing complex natural language processing tasks. By leveraging Langchain, developers can unlock new avenues and revolutionize interactions with textual data.
For those interested, the complete Python notebook can be downloaded here:
rag-llama3/llama3-rag.ipynb at 93c5808b87b7885c2b4bc7d3b633063dcf72115c · bedy-kharisma/rag-llama3
You can easily adapt the PDF directory in the future by simply updating it here:
# sources
# url
urls = [
"https://lilianweng.github.io/posts/2023-06-23-agent/",
"https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
"https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]
loader = PyPDFDirectoryLoader("C://Users//ASUS//Downloads//sources//")
data = loader.load()
docs_list.extend(data)
And modify the question here:
try:
    # Compile
    app = workflow.compile()
    # Test
    from pprint import pprint
    inputs = {"question": "Who is bedy kharisma?"}
    for output in app.stream(inputs):
        for key, value in output.items():
            pprint(f"Finished running: {key}:")
    pprint(value["generation"])
except Exception as e:
    # Handle the error
    print("An error occurred:", e)