LLM and Ollama


Introduction

This is a page about Ollama and, you guessed it, LLMs. I have downloaded several models and got a UI going over them locally. The plan is to build something like Claude Desktop in TypeScript or Golang. First some theory in Python from here. Here is the problem I am trying to solve.

Using the Remote Ollama

You can connect by setting the host with

export OLLAMA_HOST=192.blah.blah.blah

Now you can use it with

ollama run llama3.2:latest
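
The same remote instance can also be used from Python via the ollama client. This is a minimal sketch; the host is the placeholder IP from above and the package comes from pip install ollama.

<syntaxhighlight lang="py">
from ollama import Client

# Point the client at the remote Ollama server (same value as OLLAMA_HOST)
client = Client(host="http://192.blah.blah.blah:11434")

response = client.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])
</syntaxhighlight>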

Taking llama3.2 (3B) as an example

Model Info

  • Architecture: llama - The model architecture family
  • Parameters: 3.2B - Means 3.2 billion parameters (bigger requires more resources)
  • Context Length: 131072 - Number of tokens it can ingest
  • Embedding Length: 3072 - Size of the vector for each token in the input text
  • Quantization: Q4_K_M - The 4-bit quantization scheme used to shrink the model
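
These values can be seen by inspecting the model, e.g.

ollama show llama3.2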

You can customize the model with a Modelfile and then run ollama create. For example


FROM llama3.2

# set the temperature where higher is more creative
PARAMETER temperature 0.3

SYSTEM """
   You are Bill, a very smart assistant who answers questions succinctly and informatively
"""

Now we can create a copy with

ollama create bill -f ./Modelfile
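
We can then chat with the customised model using

ollama run bill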

REST API Interaction

So we can send questions to llama using the REST endpoint on port 11434

curl http://192.blah.blah.blah:11434/api/generate -d '{ 
  "model": "llama3.2", 
  "prompt": "Why is the sky blue?",
  "stream": false 
}'

We can chat by changing the endpoint to /api/chat, which takes a messages array rather than a single prompt, and we can request JSON output by adding format to the payload

curl http://192.blah.blah.blah:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Why is the sky blue?" }
  ],
  "stream": false,
  "format": "json"
}'

All of the options are documented here
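
The /api/generate call above maps onto the Python client as well. A small sketch using the module-level helper against the default local host (the package comes from pip install ollama):

<syntaxhighlight lang="py">
import ollama

# Equivalent of the /api/generate curl call, with streaming disabled
result = ollama.generate(model="llama3.2", prompt="Why is the sky blue?", stream=False)
print(result["response"])
</syntaxhighlight>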

UI Based Client Msty

Msty seems to be a good choice. You specify a provider and can then point it at 192.blah.blah.blah:11434. It supports DeepSeek and other providers too.

RAG (Retrieval-Augmented Generation)

This allows us to converse with our own documents/data and helps avoid the bizarre statements LLMs sometimes produce. A simple RAG system consists of

  • LLM
  • Document Corpus (Knowledge Base)
  • Document Embeddings
  • Vector Store (Vector DB, e.g. Faiss, Pinecone, ChromaDB)
  • Retrieval Mechanism

LangChain is a framework that makes this easier (see the sketch after this list). It handles

  • Loading and parsing documents
  • Splitting documents
  • Generating embeddings
  • Provides a unified abstraction for working with LLMs and Apps
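
As a rough sketch of the embedding and vector store steps (assuming Ollama is running locally with the nomic-embed-text model pulled, and the langchain-ollama, langchain-community and chromadb packages installed):

<syntaxhighlight lang="py">
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings

# Embed a couple of documents and store the vectors in Chroma
embeddings = OllamaEmbeddings(model="nomic-embed-text")
docs = [
    Document(page_content="Ollama serves local LLMs on port 11434."),
    Document(page_content="Chroma is a lightweight vector database."),
]
vector_store = Chroma.from_documents(docs, embedding=embeddings, collection_name="demo")

# Retrieval is just a similarity search over the stored vectors
for doc in vector_store.similarity_search("Which port does Ollama listen on?", k=1):
    print(doc.page_content)
</syntaxhighlight>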

This is referred to as a simple RAG system. We shall see

[[File:Simple Rag.png|800px]]

This was a bit of a strange experience. The code is at [https://github.com/pdichone/ollama-fundamentals/blob/main/pdf-rag.py here] but it really just consists of the boxes in the diagram.
<syntaxhighlight lang="py">
# 1. Ingest PDF files
# 2. Extract text from the PDF and split it into small chunks
# 3. Send the chunks to the embedding model
# 4. Save the embeddings to a vector database
# 5. Perform similarity search on the vector database to find similar documents
# 6. Retrieve the similar documents and present them to the user
# Run pip install -r requirements.txt to install the required packages

from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader

doc_path = "./data/BOI.pdf"
model = "llama3.2"

# Local PDF file uploads
if doc_path:
    loader = UnstructuredPDFLoader(file_path=doc_path)
    data = loader.load()
    print("done loading....")
else:
    print("Upload a PDF file")

# Preview first page
content = data[0].page_content
# print(content[:100])

# ==== End of PDF ingestion ====

# ==== Extract text from the PDF and split into small chunks ====
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Split and chunk
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=300)
chunks = text_splitter.split_documents(data)
print("done splitting....")
# print(f"Number of chunks: {len(chunks)}")
# print(f"Example chunk: {chunks[0]}")

# ==== Add to vector database ====
import ollama

ollama.pull("nomic-embed-text")

vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="simple-rag",
)
print("done adding to vector database....")

# ==== Retrieval ====
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

# Set up our model to use
llm = ChatOllama(model=model)

# A simple technique to generate multiple questions from a single question and then
# retrieve documents based on those questions, getting the best of both worlds.
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(), llm, prompt=QUERY_PROMPT
)

# RAG prompt
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# res = chain.invoke("what is the document about?")
# res = chain.invoke("what are the main points as a business owner I should be aware of?")
res = chain.invoke("how to report BOI?")
print(res)
</syntaxhighlight>
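
Note that the script pulls the nomic-embed-text embedding model via ollama.pull before building the Chroma collection, and the MultiQueryRetriever asks the LLM for five rewordings of each question so the similarity search has a better chance of hitting the relevant chunks.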