Unlocking the Full Potential of Local Llama 3 on Windows
I view Retrieval-Augmented Generation (RAG) as fundamental for the evolution of AI technologies. Our goal isn't just to have AI that produces random responses; we need an AI that can pull answers from designated document collections, comprehend the context of inquiries, navigate its embeddings, and perform web searches as required. It should also be able to evaluate the accuracy of its responses to avoid generating misleading information, ultimately providing answers that are as coherent and human-like as possible, based on the documents we provide.
The wait is over. Let's delve into the details.
This article draws inspiration from this video:
Several modifications have been made to how the source data is used. Instead of depending on a single PDF, the system now loads an entire directory of PDFs as one of its sources. In addition, the router now prefers the vector store for inquiries, falling back to a web search only when the vector store cannot cover the question.
In this comprehensive analysis, we will break down the provided code snippet to unveil the mechanics of Langchain:
# Install modules
!pip install ollama langchain beautifulsoup4 chromadb gradio unstructured langchain-nomic langchain_community tiktoken langchainhub langgraph tavily-python gpt4all -q
!pip install "unstructured[all-docs]" -q
!ollama pull llama3
!ollama pull nomic-embed-text
These commands kickstart the installation of necessary modules and libraries required for Langchain and its functionalities. The pip install commands guarantee that all vital dependencies are included, while ollama pull retrieves specific models and resources needed for text processing.
# Importing libraries
import os
import bs4
import getpass
import ollama
from typing import List
from typing_extensions import TypedDict
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import (
WebBaseLoader,
UnstructuredPDFLoader,
OnlinePDFLoader,
UnstructuredFileLoader,
PyPDFDirectoryLoader,
)
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings, GPT4AllEmbeddings
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.prompts import PromptTemplate, ChatPromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_core.documents import Document
Here, we import various libraries and modules essential for the operation of Langchain. These include tools for text splitting, document loading, vector embedding, output parsing, and others. Each import statement contributes vital functionality for diverse aspects of natural language processing (NLP) tasks.
# Options
local_llm = 'llama3'
llm = ChatOllama(model=local_llm, format="json", temperature=0)
# embeddings
embeddings = GPT4AllEmbeddings()
This segment configures the options for Langchain. local_llm identifies the local model to be utilized, while llm initializes a ChatOllama instance (constrained to JSON output via format="json") that will power the grading and routing chains. The embedding model is also chosen here; this example uses GPT4AllEmbeddings, with OllamaEmbeddings (e.g., the pulled nomic-embed-text model) imported as an alternative.
# Sources
# URL
urls = [
"https://lilianweng.github.io/posts/2023-06-23-agent/",
"https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
"https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]
loader = PyPDFDirectoryLoader("C://Users//ASUS//Downloads//sources//")
data = loader.load()
docs_list.extend(data)
This code snippet retrieves textual information from various sources, including web links and PDF documents. The WebBaseLoader is utilized to load data from URLs, while the PyPDFDirectoryLoader is responsible for fetching PDF files from a specified local directory.
# Splitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
chunk_size=1000, chunk_overlap=200)
doc_splits = text_splitter.split_documents(docs_list)
In this section, the text splitter is initialized to divide documents into smaller segments for efficient processing. This step is vital for tasks such as vectorization and retrieval, where managing large documents can be challenging.
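A quick, optional sanity check (the print statement below is only illustrative) shows how many chunks the splitter produced from the loaded documents:
# Optional: inspect the result of splitting
print(f"{len(docs_list)} source documents -> {len(doc_splits)} chunks")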
# Add to vectorDB
vectorstore = Chroma.from_documents(
documents=doc_splits,
collection_name="rag-chroma",
embedding=embeddings,
)
retriever = vectorstore.as_retriever()
This code creates a vector store using Chroma, a Langchain component designed for storing and querying document embeddings. The documents are vectorized with the specified embeddings and stored in the vector store, enabling efficient retrieval based on semantic similarity.
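Before wiring the retriever into the graph, it can be exercised on its own; the query below is only an example and should be adapted to your own documents:
# Optional: query the retriever directly (example question)
sample_docs = retriever.invoke("What is task decomposition for LLM agents?")
print(len(sample_docs), "documents retrieved")
print(sample_docs[0].page_content[:200])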
# Retrieval Grader
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing relevance
    of a retrieved document to a user question. If the document contains keywords related to the user question,
    grade it as relevant. It does not need to be a stringent test. The goal is to filter out erroneous retrievals.
    Give a binary score 'yes' or 'no' to indicate whether the document is relevant to the question.
    Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    Here is the retrieved document: \n\n {document} \n\n
    Here is the user question: {question} \n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    """,
    input_variables=["question", "document"],
)
retrieval_grader = prompt | llm | JsonOutputParser()
This section defines a prompt template for grading the relevance of retrieved documents to a user question. The template sets the grading criteria and instructs the model to return a binary 'yes'/'no' score indicating the document's relevance. The prompt is piped through the ChatOllama instance and a JsonOutputParser, so the score comes back as JSON for further evaluation.
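To see the grader in isolation, you can run it on a single retrieved chunk; the question below is only an example, and the expected output shape is an assumption based on the prompt's instructions:
# Optional: test the retrieval grader on one chunk (example question)
sample_question = "agent memory"
sample_docs = retriever.invoke(sample_question)
print(retrieval_grader.invoke({"question": sample_question, "document": sample_docs[0].page_content}))
# Expected output shape (assumption): {'score': 'yes'} or {'score': 'no'}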
# Generate
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are an assistant for question-answering tasks.
    Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know.
    Use three sentences maximum and keep the answer concise <|eot_id|><|start_header_id|>user<|end_header_id|>
    Question: {question}
    Context: {context}
    Answer: <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["question", "context"],
)
This prompt template is designed for generating responses to user questions based on retrieved context. It instructs the assistant to provide concise answers using up to three sentences, leveraging the retrieved documents as context. This template aids in question-answering tasks within the Langchain framework.
# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
This function processes retrieved documents, formatting them into a readable text format by concatenating the page content of each document with double line breaks for clarity.
# Chain
# Use a plain (non-JSON) model instance for free-form answer generation
llm_generator = ChatOllama(model=local_llm, temperature=0)
rag_chain = prompt | llm_generator | StrOutputParser()
A processing chain (rag_chain) is established using Langchain components: the prompt template, a second ChatOllama instance (llm_generator) created without format="json" so that answers come back as plain text rather than forced JSON, and a string output parser. This chain generates responses to user queries based on the provided context, while the JSON-constrained llm remains in use for the grading and routing chains.
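The generation chain can also be tested outside the graph; the question below is merely illustrative:
# Optional: standalone test of the generation chain (example question)
sample_question = "What is prompt engineering?"
sample_docs = retriever.invoke(sample_question)
print(rag_chain.invoke({"context": format_docs(sample_docs), "question": sample_question}))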
# Hallucination Grader
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing whether
    an answer is grounded in / supported by a set of facts. Give a binary score 'yes' or 'no' to indicate
    whether the answer is grounded in / supported by a set of facts. Provide the binary score as a JSON with a
    single key 'score' and no preamble or explanation. <|eot_id|><|start_header_id|>user<|end_header_id|>
    Here are the facts:
    \n ------- \n
    {documents}
    \n ------- \n
    Here is the answer: {generation} <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["generation", "documents"],
)
hallucination_grader = prompt | llm | JsonOutputParser()
This section sets up a prompt template to evaluate whether the generated answer is supported by a set of facts. The template instructs the model to decide if the answer is grounded in the provided documents, and the resulting hallucination_grader chain pipes the answer and documents through ChatOllama and a JsonOutputParser so the score is returned as JSON.
# Answer Grader
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing whether an
    answer is useful to resolve a question. Give a binary score 'yes' or 'no' to indicate whether the answer is
    useful to resolve a question. Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.
    <|eot_id|><|start_header_id|>user<|end_header_id|> Here is the answer:
    \n ------- \n
    {generation}
    \n ------- \n
    Here is the question: {question} <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["generation", "question"],
)
answer_grader = prompt | llm | JsonOutputParser()
This prompt template assesses whether a generated answer actually resolves the user's question. The model is instructed to return a binary 'yes'/'no' score, and the answer_grader chain feeds the answer and question through ChatOllama and parses the resulting score into JSON.
# Router
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You excel in directing user inquiries either to a vector store or a web search.
    For queries related to documents within the vector store, prioritize utilizing the vector store.
    There's no need to strictly match keywords in the question to topics within the vector store.
    If the question isn't covered by the vector store's content, resort to a web search.
    Provide a binary decision, 'web_search' or 'vectorstore', depending on the nature of the question.
    Return a JSON with a single key 'datasource' and
    no preamble or explanation. Question to route: {question} <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["question"],
)
question_router = prompt | llm | JsonOutputParser()
This segment defines a prompt template for routing user inquiries to either the vector store or a web search, depending on the nature of the query. The model is instructed to return a binary decision ('web_search' or 'vectorstore') indicating the preferred data source, and the question_router chain parses that decision into JSON.
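A quick check of the router outside the graph (the question, and the expected output shown in the comment, are only illustrative):
# Optional: test the router on an example question
print(question_router.invoke({"question": "What is agent memory?"}))
# Expected output shape (assumption): {'datasource': 'vectorstore'}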
# Search
os.environ["TAVILY_API_KEY"] = "tvly-XXXX"
web_search_tool = TavilySearchResults(k=3)
This code initializes a web search tool (web_search_tool) powered by the Tavily API. The API key is set as an environment variable (replace the placeholder with your own key), and k=3 limits each search to the top three results.
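To confirm the tool is wired up correctly, it can be called directly; the query is only an example, and each returned hit is expected to include a 'content' snippet, which the web_search node below relies on:
# Optional: call the search tool directly (example query)
results = web_search_tool.invoke({"query": "latest Llama 3 release"})
for r in results:
    print(r["content"][:120])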
# State
class GraphState(TypedDict):
"""
Represents the state of our graph.
Attributes:
question: user question
generation: generated response
web_search: flag for web search
documents: list of documents
"""
question: str
generation: str
web_search: str
documents: List[str]
Here, a GraphState class is defined to represent the state of the Langchain graph, encompassing attributes like the user question, generated answer, whether a web search is needed, and a list of pertinent documents.
# Nodes
def retrieve(state):
    """
    Retrieve documents from vectorstore
    Args:
        state (dict): The current graph state
    Returns:
        state (dict): Updated state with retrieved documents
    """
    print("---RETRIEVE---")
    question = state["question"]
    # Retrieval
    documents = retriever.invoke(question)
    return {"documents": documents, "question": question}
def generate(state):
    """
    Generate answer using RAG on retrieved documents
    Args:
        state (dict): The current graph state
    Returns:
        state (dict): Updated state with generated answer
    """
    print("---GENERATE---")
    question = state["question"]
    documents = state["documents"]
    # RAG generation: format the retrieved documents into a single context string
    generation = rag_chain.invoke({"context": format_docs(documents), "question": question})
    return {"documents": documents, "question": question, "generation": generation}
def grade_documents(state):
    """
    Determines whether the retrieved documents are relevant to the question.
    If any document is not relevant, we set a flag to run web search.
    Args:
        state (dict): The current graph state
    Returns:
        state (dict): Updated state with relevant documents and web_search flag
    """
    print("---CHECK DOCUMENT RELEVANCE TO QUESTION---")
    question = state["question"]
    documents = state["documents"]
    # Score each doc
    filtered_docs = []
    web_search = "No"
    for d in documents:
        score = retrieval_grader.invoke({"question": question, "document": d.page_content})
        grade = score['score']
        # Document relevant
        if grade.lower() == "yes":
            print("---GRADE: DOCUMENT RELEVANT---")
            filtered_docs.append(d)
        # Document not relevant
        else:
            print("---GRADE: DOCUMENT NOT RELEVANT---")
            web_search = "Yes"
            continue
    return {"documents": filtered_docs, "question": question, "web_search": web_search}
These functions represent distinct nodes in the Langchain graph, each responsible for a specific task: retrieve fetches relevant documents from the vector store, generate produces an answer using the RAG model, and grade_documents assesses the relevance of retrieved documents to the user query, determining if a web search is necessary.
def web_search(state):
    """
    Perform web search based on the question
    Args:
        state (dict): The current graph state
    Returns:
        state (dict): Updated state with appended web results
    """
    print("---WEB SEARCH---")
    question = state["question"]
    documents = state["documents"]
    # Web search
    docs = web_search_tool.invoke({"query": question})
    web_results = "\n".join([d["content"] for d in docs])
    web_results = Document(page_content=web_results)
    if documents is not None:
        documents.append(web_results)
    else:
        documents = [web_results]
    return {"documents": documents, "question": question}
This function conducts a web search based on the user question and adds the retrieved results to the existing document list. It employs the Tavily API to perform the web search, formatting and appending the acquired content to the document list.
def route_question(state):
    """
    Route question to web search or RAG.
    Args:
        state (dict): The current graph state
    Returns:
        str: Next node to call
    """
    print("---ROUTE QUESTION---")
    question = state["question"]
    print(question)
    source = question_router.invoke({"question": question})
    print(source)
    print(source['datasource'])
    if source['datasource'] == 'web_search':
        print("---ROUTE QUESTION TO WEB SEARCH---")
        return "websearch"
    elif source['datasource'] == 'vectorstore':
        print("---ROUTE QUESTION TO RAG---")
        return "vectorstore"
This function determines how to route user questions based on their nature. It uses the question_router to evaluate if the question should be directed to a web search or processed with the RAG model, based on the output from the question_router component.
def decide_to_generate(state):
    """
    Determine whether to generate an answer or add web search
    Args:
        state (dict): The current graph state
    Returns:
        str: Decision for next node to call
    """
    print("---ASSESS GRADED DOCUMENTS---")
    question = state["question"]
    web_search = state["web_search"]
    filtered_documents = state["documents"]
    if web_search == "Yes":
        # Some documents were filtered out by the relevance check,
        # so supplement the context with a web search
        print("---DECISION: ALL DOCUMENTS ARE NOT RELEVANT TO QUESTION, INCLUDE WEB SEARCH---")
        return "websearch"
    else:
        # We have relevant documents, so generate answer
        print("---DECISION: GENERATE---")
        return "generate"
This function decides whether to generate an answer using the RAG model or to proceed with a web search based on the relevance of retrieved documents. If all documents are considered irrelevant, it opts for a web search; otherwise, it proceeds with answer generation.
def grade_generation_v_documents_and_question(state):
    """
    Determine if the generation is grounded in the document and answers the question.
    Args:
        state (dict): The current graph state
    Returns:
        str: Decision for next node to call
    """
    print("---CHECK HALLUCINATIONS---")
    question = state["question"]
    documents = state["documents"]
    generation = state["generation"]
    score = hallucination_grader.invoke({"documents": documents, "generation": generation})
    grade = score['score']
    # Check hallucination
    if grade == "yes":
        print("---DECISION: GENERATION IS GROUNDED IN DOCUMENTS---")
        # Check question-answering
        print("---GRADE GENERATION vs QUESTION---")
        score = answer_grader.invoke({"question": question, "generation": generation})
        grade = score['score']
        if grade == "yes":
            print("---DECISION: GENERATION ADDRESSES QUESTION---")
            return "useful"
        else:
            print("---DECISION: GENERATION DOES NOT ADDRESS QUESTION---")
            return "not useful"
    else:
        print("---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---")
        return "not supported"
This function evaluates the factual basis and relevance of the generated answer in relation to the user question. It employs the hallucination_grader to verify if the answer is supported by the provided documents and the answer_grader to assess its effectiveness in addressing the question. Based on these evaluations, it determines whether the answer is useful.
from langgraph.graph import END, StateGraph
workflow = StateGraph(GraphState)
# Define the nodes
workflow.add_node("websearch", web_search) # web search
workflow.add_node("retrieve", retrieve) # retrieve
workflow.add_node("grade_documents", grade_documents) # grade documents
workflow.add_node("generate", generate) # generate
# Build graph
workflow.set_conditional_entry_point(
route_question,
{
"websearch": "websearch",
"vectorstore": "retrieve",
},
)
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
"grade_documents",
decide_to_generate,
{
"websearch": "websearch",
"generate": "generate",
},
)
workflow.add_edge("websearch", "generate")
workflow.add_conditional_edges(
"generate",
grade_generation_v_documents_and_question,
{
"not supported": "generate",
"useful": END,
"not useful": "websearch",
},
)
In this section, a Langchain graph (workflow) is built to orchestrate the sequence of operations. Nodes representing various tasks, such as document retrieval, grading, answer generation, and web search, are integrated into the graph. Conditional edges are established to manage routing decisions and guide the execution flow based on the current state.
try:
    # Compile
    app = workflow.compile()
    # Test
    from pprint import pprint
    inputs = {"question": "Who is bedy kharisma?"}
    for output in app.stream(inputs):
        for key, value in output.items():
            pprint(f"Finished running: {key}:")
    pprint(value["generation"])
except Exception as e:
    # Handle the error
    print("An error occurred:", e)
Finally, the Langchain graph is compiled into a functional application (app), which is then tested using sample inputs. The graph processes inputs through its defined nodes and edges, executing the specified tasks and producing output. Any errors encountered during execution are handled gracefully, ensuring robustness.
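The install list at the top also includes gradio, although the notebook itself never uses it. As a rough idea of how the compiled graph could be exposed through a small web UI, here is a minimal sketch; the function name ask and the interface settings are illustrative and not part of the original notebook:
# Hypothetical Gradio wrapper around the compiled graph (not in the original notebook)
import gradio as gr

def ask(question):
    answer = ""
    # Stream the graph and keep the last generated answer
    for output in app.stream({"question": question}):
        for key, value in output.items():
            if "generation" in value:
                answer = value["generation"]
    return answer

gr.Interface(fn=ask, inputs="text", outputs="text", title="Local Llama 3 RAG").launch()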
This thorough breakdown offers an in-depth view of the intricate mechanisms within the Langchain framework, demonstrating its versatility and capability in addressing complex natural language processing tasks. By leveraging Langchain, developers can unlock new avenues and revolutionize interactions with textual data.
For those interested, the complete Python notebook can be downloaded here:
rag-llama3/llama3-rag.ipynb at 93c5808b87b7885c2b4bc7d3b633063dcf72115c · bedy-kharisma/rag-llama3
You can easily adapt the PDF directory in the future by simply updating it here:
# sources
# url
urls = [
"https://lilianweng.github.io/posts/2023-06-23-agent/",
"https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
"https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]
loader = PyPDFDirectoryLoader("C://Users//ASUS//Downloads//sources//")
data = loader.load()
docs_list.extend(data)
And modify the question here:
try:
    # Compile
    app = workflow.compile()
    # Test
    from pprint import pprint
    inputs = {"question": "Who is bedy kharisma?"}
    for output in app.stream(inputs):
        for key, value in output.items():
            pprint(f"Finished running: {key}:")
    pprint(value["generation"])
except Exception as e:
    # Handle the error
    print("An error occurred:", e)