
LangChain Code Journey From Basics PART-4 RAG

Understanding Retrieval-Augmented Generation (RAG) with LangChain

Retrieval-Augmented Generation (RAG) enhances the capabilities of large language models (LLMs) like GPT-3.5 by allowing them to access external information sources. This matters because a model like GPT-3.5 is limited to the knowledge in its training data, which ends in 2021; without RAG, it can't answer questions about events or information beyond that point.

How RAG Works

RAG works by:

  1. Retrieving Information: When you ask a question, the system searches for relevant documents or data sources, like web pages, PDFs, or articles, containing up-to-date information.

  2. Augmenting the Prompt: The retrieved information is combined with the original prompt, giving the LLM additional context.

  3. Generating a Response: The LLM uses this enriched input to generate a more accurate and informed response.

For example, if you ask, "Who won the Miss Universe award in 2023?" GPT-3.5 wouldn’t know because its knowledge stops at 2021. With RAG, you could provide it with a relevant Wikipedia page or article. The model would then retrieve the necessary information and give a correct answer.
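
To make the "augment the prompt" step concrete, here is a minimal sketch in plain Python (no LangChain involved yet). The context string is just a stand-in for whatever the retrieval step returns; in a real pipeline it would come from a search over your documents.

Python
# A hypothetical snippet that the retrieval step might return from an article
retrieved_context = "Sheynnis Palacios of Nicaragua was crowned Miss Universe in 2023."
question = "Who won the Miss Universe award in 2023?"

# The "augmented" prompt simply places the retrieved context next to the question
augmented_prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{retrieved_context}\n\n"
    f"Question: {question}"
)
print(augmented_prompt)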

Practical Applications of RAG

  • Web Pages: Feed web content to the LLM for the latest information.
  • PDFs and Textbooks: Use documents as sources, allowing the model to extract relevant details.
  • Video Transcripts: Provide transcripts from videos or podcasts for the model to generate informed responses.
  • Blogs and Articles: Use recent blog posts or news articles to keep the model updated on current events.

Example: Processing PDFs with RAG

To see RAG in action, you can start by processing PDFs. By inputting text from a PDF, the LLM can retrieve and use the latest information to answer questions, even if the data wasn’t part of its original training.

RAG is a powerful tool that makes LLMs more versatile by enabling them to access and use up-to-date, context-specific information. This significantly improves their ability to answer questions accurately and efficiently.

How a PDF is used in RAG:


Understanding Retrieval-Augmented Generation (RAG) with PDFs

RAG is a powerful technique that enhances the capabilities of language models by integrating external sources of information, such as PDFs, web pages, or transcripts. Here’s how it works, specifically focusing on processing PDFs:

Part 1: Processing and Embedding the PDF

  1. Chunking the PDF: The first step involves breaking down the PDF into smaller chunks (in the examples below, about 1,000 characters each). This is crucial because LLMs have a limit on how much text they can process at once.

  2. Embedding the Chunks: Each chunk of text is then converted into numerical vectors, known as embeddings. These embeddings capture the semantic meaning of the text and are generated using tools like OpenAI’s embedding services.

  3. Storing in a Vector Database: Once the text is embedded, these vectors are stored in a vector database such as ChromaDB. This database is optimized for fast retrieval of relevant information based on semantic similarity.

Part 2: Querying and Generating Responses

  1. User Query: When a user asks a question, the query is also embedded into a numerical vector using the same embedding technique.

  2. Similarity Search: The system then performs a similarity search within the vector database to find chunks of the PDF that are most relevant to the user’s query. This ensures that the response is informed by the most pertinent information in the document.

  3. Combining Information: The retrieved chunks (usually the top 3-5 most relevant ones) are combined with the user’s query to form a new prompt.

  4. Generating the Response: This enriched prompt is passed to the LLM, which then generates a response based on both its pre-trained knowledge and the specific information retrieved from the PDF.

Practical Example

Imagine you have a PDF of a data science book. If you ask, “Who is the father of machine learning?”, the system would:

  • Break the PDF into chunks and embed them.
  • Store these embeddings in ChromaDB.
  • When you ask your question, it will find the most relevant chunks, combine them with your query, and use the LLM to provide a well-informed answer, even if the LLM wasn't specifically trained on the latest information.

This process allows LLMs to effectively "learn" from new documents and provide accurate, up-to-date responses, making RAG a highly effective tool for extending the capabilities of AI beyond its initial training.

This code demonstrates how to implement Retrieval-Augmented Generation (RAG) using a PDF as the source document. Here's a breakdown of how it works:

Python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import os 
from dotenv import load_dotenv
load_dotenv()

ld=PyPDFLoader(r"C:\Users\amang\Downloads\proj celeb search by krish naik\1706.03762v7.pdf")
dc=ld.load()

ts=RecursiveCharacterTextSplitter(chunk_size=1000,chunk_overlap=200)
docs=ts.split_documents(dc)

if docs:
    db = Chroma.from_documents(docs, OpenAIEmbeddings(),persist_directory="./chroma_db")
else:
    print("No documents to process")


q="Discoveries of attention is all you need?"
r=db.similarity_search(q)

print(r)



Explanation

  1. Module Imports: The code starts by importing necessary modules. We use PyPDFLoader to load the PDF, RecursiveCharacterTextSplitter for splitting the text, OpenAIEmbeddings for creating text embeddings, and Chroma as the vector database to store and search the embeddings.

  2. Loading the PDF: The PyPDFLoader is used to load the PDF file into a document object. This allows us to process the PDF content.

  3. Text Splitting: The RecursiveCharacterTextSplitter is chosen because it preserves the contextual integrity of the text, splitting it into chunks of 1,000 characters with a 200-character overlap. The overlap ensures that context carries over from one chunk to the next.

  4. Embedding and Storing: The code then checks if there are any documents to process. If so, it creates embeddings for the text chunks using OpenAI’s embedding model. These embeddings are stored in a local Chroma database. The persist_directory argument specifies where the database is stored locally.

  5. Similarity Search: Finally, the code performs a similarity search on the stored embeddings. The query is converted into an embedding, and the system searches for the most similar chunks in the database. The results are printed out, showing the most relevant text chunks related to the query.

Key Concepts

  • Chunking with Overlap: By splitting the text into overlapping chunks, the process ensures that important contextual connections between paragraphs or sentences are preserved, improving the accuracy of similarity searches.

  • Embeddings: These are numerical representations of the text chunks that capture their semantic meaning, enabling effective similarity searches.

  • Vector Database: Chroma is used here to store and retrieve the embeddings. The database allows for efficient searching and matching of relevant text based on the user's query.

Practical Use

This setup is ideal for querying large documents like research papers, textbooks, or legal documents. By converting these documents into embeddings and storing them in a vector database, you can quickly retrieve and utilize specific information in response to user queries.

This method effectively combines the strengths of LLMs with external data sources, enabling the model to provide informed answers even when the information is not part of its original training.

Alternate Databases:

This code demonstrates how to implement Retrieval-Augmented Generation (RAG) using a PDF as the source document, but this time using FAISS as the vector database instead of Chroma. Here’s how it works:

Python
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
import os 
from dotenv import load_dotenv
load_dotenv()

ld=PyPDFLoader(r"C:\Users\amang\Downloads\proj celeb search by krish naik\1706.03762v7.pdf")
dc=ld.load()

ts=RecursiveCharacterTextSplitter(chunk_size=1000,chunk_overlap=200)
docs=ts.split_documents(dc)

if docs:
    # Embed the chunks and index them in an in-memory FAISS store (no persist_directory needed)
    db1 = FAISS.from_documents(docs, OpenAIEmbeddings())
else:
    print("No documents to process")


q="Discoveries of attention is all you need?"
r=db1.similarity_search(q)

print(r)

Using Database Retrievers:

In LangChain, a retriever is a powerful tool used to query a database with more control and flexibility compared to simple similarity searches. A retriever allows you to define specific parameters, such as the number of top results to return (k) and a similarity score threshold, which helps in refining the search results according to your needs. Here's how it works:

Python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import os 
from dotenv import load_dotenv
load_dotenv()

ld=PyPDFLoader(r"C:\Users\amang\Downloads\proj celeb search by krish naik\1706.03762v7.pdf")
dc=ld.load()

ts=RecursiveCharacterTextSplitter(chunk_size=1000,chunk_overlap=200)
docs=ts.split_documents(dc)

if docs:
    db = Chroma.from_documents(docs, OpenAIEmbeddings(),persist_directory="./chroma_db")
else:
    print("No documents to process")


q="Discoveries of attention is all you need?"

# A retriever gives finer control than a raw similarity search:
# return at most k results, and only those scoring above the threshold
retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.4},
)
relevant_docs = retriever.invoke(q)

print(relevant_docs)

In this example, we enhance our document search process by using a retriever with specific parameters, providing more control and flexibility compared to a basic similarity search.

Key Points about the Retriever:

  • Retriever Creation: We create the retriever using db.as_retriever(). This allows us to define how we want to search the database.

  • Search Type: The retriever is configured with search_type="similarity_score_threshold". This method returns results based on a defined similarity score threshold, ensuring that only results above a certain relevance level are considered.

  • Search Parameters:

    • k: 3: Retrieves the top 3 most relevant results.
    • score_threshold: 0.4: Only results with a similarity score of 0.4 or higher are returned. This helps filter out less relevant results, ensuring higher quality in the information retrieved.

Practical Use:

By using a retriever, you gain finer control over the retrieval process, allowing you to specify not just how many results to return, but also the minimum relevance level they must meet. This makes it particularly useful when precision is important, such as when querying large and complex documents.




When using retrievers with relevance scores, the precision of the results can vary significantly depending on the threshold set. Here’s what happens in different scenarios:

  • Low Relevance Score (0.4): When the score threshold is set to 0.4, the retriever is more lenient and can find documents that are somewhat related to the query. In this case, it’s easier to find relevant chunks within a single PDF because the criteria aren’t too strict.

  • High Relevance Score (0.9): With a higher threshold like 0.9, the retriever only returns results that are very closely related to the query. If no chunk meets this high standard, the retriever returns no documents at all. This strict approach ensures high accuracy but may lead to no results if the information is not closely aligned with the query. The sketch below shows both settings side by side.
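
As a rough sketch (reusing the `db` store and query from the retriever example above), the two settings can be compared directly; the exact hit counts depend on your document and embeddings.

Python
# Lenient retriever: accepts anything scoring 0.4 or above
lenient = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.4},
)

# Strict retriever: only near-exact matches at 0.9 or above
strict = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.9},
)

q = "Discoveries of attention is all you need?"
print(len(lenient.invoke(q)), "chunks at threshold 0.4")
print(len(strict.invoke(q)), "chunks at threshold 0.9")  # often 0 with such a strict cutoff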

Metadata

Metadata plays a crucial role, especially when dealing with multiple documents or large datasets. Metadata helps verify the source of the information provided by the LLM, which is essential because:

  • Hallucination Risk: LLMs can sometimes generate answers that seem plausible but are not based on the provided source material. By including metadata, you can trace the answer back to its source, ensuring it’s accurate.

  • Single vs. Multiple Sources: In this example, since we're working with a single PDF, metadata might seem less critical. However, in more complex scenarios with multiple documents, metadata is invaluable. It indicates exactly where the retrieved information came from, including the document name and page number, helping to validate the accuracy of the LLM's response.

Practical Takeaway

Using relevance scores effectively helps balance between finding useful information and avoiding irrelevant results. Including metadata ensures transparency and reliability, especially when dealing with multiple sources. This approach not only enhances the accuracy of the results but also provides a way to validate the LLM's output, reducing the risk of relying on potentially incorrect or hallucinated information.

Multiple PDFs and Metadata:

In this example, we extend the previous approach to handle multiple PDF files. When working with multiple sources, it’s crucial to include metadata to keep track of where each piece of information comes from. Here’s how the code works:

Python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

import os 
from dotenv import load_dotenv
load_dotenv()
current_dir = os.path.dirname(os.path.abspath(__file__))
books_dir = os.path.join(current_dir, "data")
persistent_directory = os.path.join(current_dir, "faissmeta")
book_files = [f for f in os.listdir(books_dir) if f.endswith(".pdf")]
print(books_dir)
dc = []
for book_file in book_files:
    file_path = os.path.join(books_dir, book_file)
    loader = PyPDFLoader(file_path)
    book_docs = loader.load()
    for doc in book_docs:
        # Keep the page metadata from PyPDFLoader and record which PDF the page came from
        doc.metadata["source"] = book_file
        dc.append(doc)

ts=RecursiveCharacterTextSplitter(chunk_size=1000,chunk_overlap=200)
docs=ts.split_documents(dc)

if docs:
    db= FAISS.from_documents(docs, OpenAIEmbeddings())
else:
    print("No documents to process")

q="what are terms for the lease"
retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.6},
)
relevant_docs = retriever.invoke(q)
print("\n--- Relevant Documents ---")
for i, doc in enumerate(relevant_docs, 1):
    print(f"Document {i}:\n{doc.page_content}\n")
    print(f"Source: {doc.metadata['source']}\n")

Key Steps

  1. Loading Multiple PDFs: The code loads multiple PDF files from a specified directory. Each file is processed individually, and its content is loaded into a list of documents.

  2. Adding Metadata: For each document, metadata is manually added to track the source (i.e., which PDF the content came from). This is essential for later identifying the origin of retrieved information, especially when dealing with multiple sources.

  3. Chunking the Documents: The documents are split into chunks of 1,000 characters with a 200-character overlap, just like in previous examples. This ensures that context is preserved across chunks.

  4. Creating the Vector Database: The documents are then embedded using OpenAI's embedding model and stored in a FAISS vector database. FAISS is chosen here for its efficiency in handling similarity searches.

  5. Querying with Retrieval: The retriever is created with a similarity score threshold of 0.6, meaning that only chunks with a similarity score above this threshold are considered relevant. The query is executed, and the relevant documents are retrieved.

  6. Displaying Results with Metadata: The results are printed, including both the relevant content and the source of the content, thanks to the metadata we added earlier. This helps verify where the information comes from and ensures accuracy.

Practical Takeaways

  • Metadata Importance: When dealing with multiple files, metadata is crucial for tracking the origin of information. This is especially important to counteract the risk of hallucination in LLMs.

  • Flexible Database Use: The code uses FAISS as the vector database, but the approach is flexible enough to work with other databases like Chroma, depending on the environment and specific needs.

  • Scalability: This approach can be scaled to handle large datasets across multiple documents, making it suitable for complex information retrieval tasks.

By including metadata and carefully managing multiple documents, this approach ensures that the retrieved information is accurate and traceable, enhancing the reliability of the RAG process.



The RAG process worked as expected, accurately retrieving relevant information from the correct PDF sources:

  • Correct Source Matching: The retriever correctly identified and returned three outputs from the right source, demonstrating the system's accuracy.

  • Efficient Retrieval: The use of LLMs with FAISS vector databases ensures efficient and precise retrieval, filtering out irrelevant results with a set relevance threshold.

  • Next Steps: We'll explore using RAG with websites and discuss advanced chunking and splitting techniques.

This setup shows RAG's capability to effectively couple LLMs with external data, ensuring relevant and accurate responses.

Different Text Splitters:

In this section, we explore various types of text splitters offered by LangChain. Text splitters break down large chunks of text (like a PDF or document) into smaller, manageable pieces, ensuring the content can fit within LLM token limits for better processing.

Here are the main types of text splitters:

1. Character-based Splitting  

   How it Works: Splits the text at a set number of characters without considering sentence structure.  

   Example: For the text, "The quick brown fox jumps over the lazy dog," with a chunk size of 10 characters and a small overlap, it might split into chunks like:  

   `['The quick ', 'ick brown ', 'wn fox jum', 'ox jumps o', 'ps over th', 'ver the la', ' lazy dog.']`

2. Sentence-based Splitting  

   How it Works: Splits text at sentence boundaries, keeping sentences intact.  

   Example: For the text, "The quick brown fox jumps over the lazy dog. It was a sunny day.", the splitting would result in:  

   `['The quick brown fox jumps over the lazy dog.', 'It was a sunny day.']`

3. Token-based Splitting  

   How it Works: Splits based on tokens (words or subwords).  

   Example: For the sentence, "The quick brown fox jumps over the lazy dog," a token-based splitter with a token size of 5 and a small overlap might split it into:  

   `['The quick brown fox jumps', 'fox jumps over the lazy dog.']`

4. Recursive Character-based Splitting  

   How it Works: First tries to split the text at larger boundaries like sentences. If still too long, it falls back on character splitting.  

   Example: For the text, "The quick brown fox jumps over the lazy dog. It was a sunny day.", with a chunk size of 30 characters and some overlap, it might give:  

   `['The quick brown fox jumps over', 'umps over the lazy dog. It was', ' It was a sunny day.']`

5. Custom Text Splitter  

   How it Works: You can create custom text splitters based on your requirements (e.g., splitting by paragraphs).  

   Example: A custom splitter can be set up to split by two new lines (`"\n\n"`), which might result in chunks that represent individual paragraphs.  

Python
import os
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    SentenceTransformersTokenTextSplitter,
    TextSplitter,
    TokenTextSplitter,
)
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader

ld=PyPDFLoader("411 Lease 2024.pdf")
dc=ld.load()

from dotenv import load_dotenv
load_dotenv()

# 1. Character-based splitting: fixed-size character chunks
char_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
char_docs = char_splitter.split_documents(dc)

# 2. Sentence-based splitting: keeps sentences intact, counting sentence-transformer tokens
sent_splitter = SentenceTransformersTokenTextSplitter(chunk_size=1000)
sent_docs = sent_splitter.split_documents(dc)

# 3. Token-based splitting: chunks measured in LLM tokens
token_splitter = TokenTextSplitter(chunk_overlap=0, chunk_size=512)
token_docs = token_splitter.split_documents(dc)

# 4. Recursive character-based splitting: tries paragraph/sentence boundaries before characters
rec_char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100)
rec_char_docs = rec_char_splitter.split_documents(dc)

# 5. Custom splitter: here, split on blank lines (roughly one chunk per paragraph)
class CustomTextSplitter(TextSplitter):
    def split_text(self, text):
        return text.split("\n\n")

custom_splitter = CustomTextSplitter()
custom_docs = custom_splitter.split_documents(dc)

def printer(db):
    q="what are terms for the lease"
    retriever = db.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={"k": 3, "score_threshold": 0.6},
    )
    relevant_docs = retriever.invoke(q)
    print("*******************************************************"+"--- Relevant Documents ---"+"*******************************************************")
    for i, doc in enumerate(relevant_docs, 1):
        print(f"Document {i}:\n{doc.page_content}\n")
        print(f"Source: {doc.metadata['source']}\n")


db= FAISS.from_documents(char_docs, OpenAIEmbeddings())
printer(db)
db1= FAISS.from_documents(sent_docs, OpenAIEmbeddings())
printer(db1)
db2= FAISS.from_documents(token_docs, OpenAIEmbeddings())
printer(db2)
db3= FAISS.from_documents(rec_char_docs, OpenAIEmbeddings())
printer(db3)
db4= FAISS.from_documents(custom_docs, OpenAIEmbeddings())
printer(db4)

Key Takeaways

  • Character-based: Simple splitting without context preservation.
  • Sentence-based: Maintains meaning by keeping sentences intact.
  • Token-based: Useful for fitting text within token limits for LLMs.
  • Recursive Character-based: Intelligent splitting, maintaining structure while ensuring size constraints.
  • Custom Splitter: Customizable for specific needs.

By using different text splitters based on the use case, you can optimize how your documents are processed, ensuring meaningful chunks while maintaining efficiency for downstream tasks like querying and retrieval in RAG setups.

Different Embedding Types:

Now, let's talk about embeddings. There are several options to choose from: OpenAI embeddings are among the most common, but there are also alternatives such as Llama embeddings and Hugging Face's open-source models. These come in different variants depending on the underlying model, its training data, and other factors, so you can pick whichever fits your requirements.

In the code below, I demonstrate both OpenAI embeddings and Hugging Face embeddings; you could just as easily swap in other options such as Llama or Google embeddings. Embeddings matter whenever textual data needs to be stored in a vector database: the text must first be converted into vectors before it can be stored, and only then can it be retrieved quickly when needed. So let's pick one database, run the embedding process, and see how it works.

If you're new to embeddings, I suggest checking out my earlier post, where I explain how embeddings are generated, how the process works behind the scenes, and the basics (plus some slightly more advanced algorithms) that tie it all together.

Python
import os
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from dotenv import load_dotenv
load_dotenv()
ld=PyPDFLoader("411 Lease 2024.pdf")
documents = ld.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

openai_embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")


# Build (or reuse) a Chroma store for the OpenAI embeddings
persistent_directory1 = os.path.join("vc")
if not os.path.exists(persistent_directory1):
    Chroma.from_documents(
        docs,openai_embeddings, persist_directory=persistent_directory1)
else:
    print(
        f"Vector store already exists. No need to initialize.")


q="what is lease?"
db = Chroma(
            persist_directory=persistent_directory1,
            embedding_function=openai_embeddings,
        )
retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.1},
)
relevant_docs = retriever.invoke(q)

print("OpenAI embeddings")
print(relevant_docs)

#huggingface embeddings 
huggingface_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

# A separate Chroma store for the HuggingFace embeddings
persistent_directory2 = os.path.join("vh")
if not os.path.exists(persistent_directory2):
    Chroma.from_documents(
        docs, huggingface_embeddings, persist_directory=persistent_directory2)
else:
    print(
        f"Vector store already exists. No need to initialize.")


q="what is attention?"
db = Chroma(
            persist_directory=persistent_directory2,
            embedding_function=huggingface_embeddings,
        )
retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.1},
)
relevant_docs = retriever.invoke(q)

print("HuggingFace Docs")
print(relevant_docs)

print("Querying demonstrations completed.")

In the code above, we load the environment variables, import the required libraries, and use a loader to read the document (a PDF file); the remaining steps mirror the earlier examples. The key difference is that we create separate Chroma databases, each in its own persistent directory, so that the embeddings produced by the two models (OpenAI and Hugging Face) are kept apart.

With this setup, you can experiment with embeddings from both models. One important thing to note is that there is no single "best" embedding model; the right choice depends on your use case. Many people default to OpenAI embeddings because of their strong, general-purpose performance, but feel free to try different options and pick the one that works best for your needs.
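
One quick way to see that these models are not interchangeable is to embed the same query with both and compare the vector sizes (a small sketch, assuming the same packages and API key as in the code above). Because the dimensions and vector spaces differ, embeddings from different models should never be mixed in one vector store, which is exactly why the code keeps separate persistent directories.

Python
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

openai_emb = OpenAIEmbeddings(model="text-embedding-ada-002")
hf_emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

vec_openai = openai_emb.embed_query("What is a lease?")
vec_hf = hf_emb.embed_query("What is a lease?")

print(len(vec_openai))  # 1536 dimensions for text-embedding-ada-002
print(len(vec_hf))      # 768 dimensions for all-mpnet-base-v2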

Different Types of Retrievers:

We can primarily look at three major types of retrievers, and you can choose any of them based on your specific needs. As I mentioned earlier, there is no "perfect" method for retrieving documents—it all depends on your application and data. Some methods might suit your data really well, while others may not. It often requires a bit of trial and error to figure out what works best. Let me walk you through a few of the methods we use. But first, let me show you the code, and then we’ll discuss how each method operates.

Python
import os
from dotenv import load_dotenv
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.vectorstores import Chroma
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

load_dotenv()
persistent_directory = os.path.join("vc")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

db = Chroma(persist_directory=persistent_directory, embedding_function=embeddings)

retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},
)

llm = ChatOpenAI(model="gpt-4o")

contextualize_q_system_prompt = (
    "Given a chat history and the latest user question "
    "which might reference context in the chat history, "
    "formulate a standalone question which can be understood "
    "without the chat history. Do NOT answer the question, just "
    "reformulate it if needed and otherwise return it as is."
)

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)

qa_system_prompt = (
    "You are an assistant for question-answering tasks. Use "
    "the following pieces of retrieved context to answer the "
    "question. If you don't know the answer, just say that you "
    "don't know. Use three sentences maximum and keep the answer "
    "concise."
    "\n\n"
    "{context}"
)

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

chat_history = []
query = input("You: ")
result = rag_chain.invoke({"input": query, "chat_history": chat_history})
print(f"AI: {result['answer']}")
chat_history.append(HumanMessage(content=query))
chat_history.append(AIMessage(content=result["answer"]))

query = input("You: ")
result = rag_chain.invoke({"input": query, "chat_history": chat_history})
print(f"AI: {result['answer']}")
chat_history.append(HumanMessage(content=query))
chat_history.append(AIMessage(content=result["answer"]))

From the code above, you can see that we are doing the same process as before, but with different vector retrieval options. The first retrieval method we're using is similarity search, where we define the k value (in this case, 3). This means it will try to find the top 3 most similar documents to the query based on their similarity scores. Essentially, it retrieves the documents that are closest in context to the question we’re asking.

Next, we are using Max Marginal Relevance (MMR). This method is useful when you want to explore not just simple, closely related content, but a broader set of results with more diverse relevance. For example, if you're querying a document about a story like Harry Potter, and you ask "How did Harry Potter discover his powers?"—instead of focusing only on specific, repeated information, MMR will look for a more diverse set of relevant documents. This could include not only how Harry discovered his powers but also other contextual information, like details about his mother or his background. It essentially provides a wider range of context, avoiding redundancy.

MMR works with three parameters:

  1. k: Determines how many top results you want.
  2. fetch_k: Specifies how many documents to fetch initially (in this case, 20).
  3. lambda_mult: Controls the balance between relevance and diversity. A value closer to 1 prioritizes relevance, meaning it will focus more on retrieving documents that are very similar to the query. A value closer to 0, on the other hand, prioritizes diversity, meaning it will retrieve documents that are more varied and not too similar to each other.

The final method is similarity score threshold. Here, you set a threshold value (e.g., 0.1), which filters the results so that only documents with a similarity score higher than the threshold are retrieved. This method is helpful when you want to be more selective and ensure that only highly relevant documents are considered, eliminating less relevant ones.

Overall, the code remains similar to the previous examples; the main difference lies in which retrieval method you choose to suit your needs. The sketch below shows how each of the three options is configured.
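
For reference, here is a small sketch showing how each of the three options can be configured on the same `db` store from the code above (the parameter values are illustrative, and only one retriever would normally be used at a time).

Python
# 1. Plain similarity search: top k closest chunks
similarity_retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},
)

# 2. Max Marginal Relevance: fetch 20 candidates, then pick 3 that balance
#    relevance and diversity (lambda_mult near 1 = more relevance,
#    near 0 = more diversity)
mmr_retriever = db.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "fetch_k": 20, "lambda_mult": 0.5},
)

# 3. Similarity score threshold: only chunks scoring above the cutoff
threshold_retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.1},
)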

Combining Retriever and LLM:

Now, let's talk about how we can turn retrieved documents into a reasonable response for the user. Simply returning similar documents in their raw form isn't very helpful, because the user probably doesn't want to wade through a thousand characters of raw chunk text. The better approach is to present the information in a well-organized, understandable way.

How can we achieve this? It's quite simple: we combine the retrieved documents and pass them to a Large Language Model (LLM), which is good at processing text. We then ask the model to refine the answer, using those documents as its reference. For example, if you ask about something specific, like the terms of your lease, a bare LLM may be unsure and fall back on generic, pre-trained knowledge; it has no access to how your particular lease is defined.

This is where RAG (Retrieval-Augmented Generation) comes in, as we have learned so far. In this approach, we retrieve relevant documents based on the user’s question and then use those documents to help the model generate an accurate response. By pulling in all the relevant documents, we give the LLM the right context so it can provide a response based on the actual lease document, rather than generic or incorrect information from the internet.

So the process works by first retrieving the documents and then passing them, together with the question, to the LLM, which uses those documents to generate the final answer. This way, the response is grounded in the correct information and aligned with the user's specific scenario, like their lease terms.

Currently, I'm just demonstrating this process with some random text. But later, I’ll show you how to use prompt templates and how to invoke all of this properly. There are also special methods that we will explore further, which help in running the entire chain smoothly.

Python
import os
from dotenv import load_dotenv
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.vectorstores import Chroma
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

load_dotenv()
persistent_directory = os.path.join("vc")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

db = Chroma(persist_directory=persistent_directory, embedding_function=embeddings)

retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},
)

llm = ChatOpenAI(model="gpt-4o")

contextualize_q_system_prompt = (
    "Given a chat history and the latest user question "
    "which might reference context in the chat history, "
    "formulate a standalone question which can be understood "
    "without the chat history. Do NOT answer the question, just "
    "reformulate it if needed and otherwise return it as is."
)

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)

qa_system_prompt = (
    "You are an assistant for question-answering tasks. Use "
    "the following pieces of retrieved context to answer the "
    "question. If you don't know the answer, just say that you "
    "don't know. Use three sentences maximum and keep the answer "
    "concise."
    "\n\n"
    "{context}"
)

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

chat_history = []
query = input("You: ")
result = rag_chain.invoke({"input": query, "chat_history": chat_history})
print(f"AI: {result['answer']}")
chat_history.append(HumanMessage(content=query))
chat_history.append(AIMessage(content=result["answer"]))

query = input("You: ")
result = rag_chain.invoke({"input": query, "chat_history": chat_history})
print(f"AI: {result['answer']}")
chat_history.append(HumanMessage(content=query))
chat_history.append(AIMessage(content=result["answer"]))

Let me explain what's happening in this code. It's quite similar to the previous examples, with just a few differences. Here, we retrieve a document and, once it's retrieved, pass it to the LLM; in this case, we're using GPT-4o.

In this scenario, we’re retrieving a specific document—I've kept it simple by limiting the retrieval to one document to avoid complexity. You can see I’m asking a query about MOVE-OUT POLICY 43, and the retriever automatically pulls the document that contains relevant information related to the query.

Once the document is retrieved, we are formatting it and pushing it to the LLM. We're creating a custom prompt where we ask the model to refer to the retrieved documents and answer the user's question. In this way, the model has both the context of the query and the document right in front of it. It tries to understand the content and refines the answer based on the useful information it finds in the document.

The cool part is how well this works—the model can accurately respond by leveraging the content provided to it. I'll show you a screenshot of how effective this process is because it's really impressive how smoothly the entire thing works together. As mentioned earlier, we can use different methods to achieve this, and we'll dive into the details of those approaches in upcoming topics.


Conversational Retriever:

In this topic, we are not just learning about one function but multiple functions. The main idea of the program I’m going to write is to create a conversational retrieval system that works in a continuous manner. It is essential to keep track of the chat history when using an LLM (Large Language Model), as it’s necessary for answering follow-up questions.

For instance, if you ask, “Who is the founder of Apple?” and then follow up with, “What is his age?”—the LLM needs to understand that “his” refers to the founder of Apple. In order to do this, the model needs to maintain a record of the chat history so it can properly reference the context, meaning the word "his" should point back to "the founder of Apple". This is why it’s important to remember the conversation history.

We have already explored various methods to store the chat history, such as storing it locally, in the cloud, or even deploying it to your Docker file system as textual context. We’ve already covered these topics, so now, I will demonstrate how to integrate this into a retrieval system.

For example, in a conversational retrieval system, every time a new question is asked, the relevant documents need to be retrieved. But it’s not always straightforward. Say you first ask, “Who is the founder of Apple?”, and then follow up with, “What’s his age?”. A simple retrieval system may not understand the context of "his"—because retrieval is based on similarity scores. In this case, it may search for “his age” across the entire document, which isn’t correct.

The right way to search in this case would be something like, “What’s the age of the founder of Apple?”. This way, the retrieval system can recognize the relevant keywords and retrieve the correct document. So, it’s mandatory to reformulate or rephrase the question to help the retrieval system properly search the document in the vector database.

We are working to build a system that can automatically reformulate the question. Once the question is reformulated, the system will retrieve the relevant documents, and those documents, along with the previous chat history, will be passed to the LLM. Now, the LLM has both the chat history and the updated content from the retrieval system. It can combine this information to produce a well-structured, accurate response.

This process is key to ensuring that the LLM has a complete understanding of both the ongoing conversation and the new documents. Once I write the code, I’ll be able to describe it in more detail.

Python
import os
from dotenv import load_dotenv
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.vectorstores import Chroma
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

load_dotenv()
persistent_directory = os.path.join("vc")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

db = Chroma(persist_directory=persistent_directory, embedding_function=embeddings)

retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},
)

llm = ChatOpenAI(model="gpt-4o")

contextualize_q_system_prompt = (
    "Given a chat history and the latest user question "
    "which might reference context in the chat history, "
    "formulate a standalone question which can be understood "
    "without the chat history. Do NOT answer the question, just "
    "reformulate it if needed and otherwise return it as is."
)

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)

qa_system_prompt = (
    "You are an assistant for question-answering tasks. Use "
    "the following pieces of retrieved context to answer the "
    "question. If you don't know the answer, just say that you "
    "don't know. Use three sentences maximum and keep the answer "
    "concise."
    "\n\n"
    "{context}"
)

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

chat_history = []
query = input("You: ")
result = rag_chain.invoke({"input": query, "chat_history": chat_history})
print(f"AI: {result['answer']}")
chat_history.append(HumanMessage(content=query))
chat_history.append(AIMessage(content=result["answer"]))

query = input("You: ")
result = rag_chain.invoke({"input": query, "chat_history": chat_history})
print(f"AI: {result['answer']}")
chat_history.append(HumanMessage(content=query))
chat_history.append(AIMessage(content=result["answer"]))

Let's take a deep dive into this code. Initially, we load the required documents and set up our file system directory, similar to what we've done before. We then create a retriever using the similarity approach, retrieving the top three most similar documents. So far, everything is straightforward. We also initialize GPT-4o, which follows the same setup steps as before.

Once we've set up the retrieval system, we move on to something called the contextualized question system. The idea here is to instruct the LLM to look at the chat history and the latest user question, and reformulate the question as a standalone query. This is essential because a vague question like "What's his age?" carries no useful context for retrieval on its own; we want it turned into something like "What's the age of the founder of Apple?".

To get this right, we ask the LLM to reformulate the question based on the chat history. Once the model reformulates the question, we use this revised question to filter the documents. After the reformulation, the question is passed to a chat prompt template, where the system message contains the instructions, and the chat history and human input (the user’s question) are placeholders.

Next, we use create_history_aware_retriever, a function that helps retrieve relevant documents based on the reformulated question. We pass three arguments here: the LLM, the retriever, and the contextualization prompt. What happens is that every time the user asks a question, the prompt ensures that the LLM reformulates the query before sending it to the retriever. The reformulated question is then used to retrieve the most relevant documents.

This is how the history-aware retriever works, which is one part of the code. The second part involves creating another prompt for question answering. Here, once the relevant documents are retrieved, we ask the LLM to answer the user’s query. If it doesn’t know the answer, the LLM is instructed to respond with “I don’t know”. This part is handled by the qa_prompt, which includes the chat history and the user’s question as input.

Now, let’s discuss two important functions that work together here:

1. create_stuff_documents_chain:

This function combines all the retrieved documents and sends them as input to the LLM along with the prompt. Since multiple documents could be retrieved, this function ensures they are merged into one set of inputs, providing the LLM with proper context. Once the documents are combined, they are passed to the LLM along with the qa_prompt to generate the answer. The main idea here is to ensure that the LLM receives all the necessary context from multiple documents.

2. create_retrieval_chain:

This function ties together the retriever and the document chain. It takes two arguments: the retriever (here, the history-aware retriever) and the document chain. When invoked, it runs the retriever to fetch the relevant documents and then hands them to the document chain, which combines them and sends them to the LLM for the final response.

How They Work Together:

These two functions—create_stuff_documents_chain and create_retrieval_chain—work together because one combines the documents and the other drives the overall flow: create_retrieval_chain runs the retrieval step and then passes the retrieved documents to create_stuff_documents_chain, which stuffs them into the prompt for the LLM.

In summary:

  • create_history_aware_retriever retrieves relevant documents based on the chat history and reformulated question.
  • create_stuff_documents_chain combines the retrieved documents and feeds them into the LLM.
  • create_retrieval_chain connects the document retriever and document chain, triggering the document retrieval and combination process.

Once the relevant documents are passed through these functions, the system can provide an accurate answer based on both the retrieved context and the chat history.

Lastly, we store the chat history as a list, saving the human messages and the AI's responses. This list keeps track of the conversation, ensuring that the model maintains context throughout the interaction and can build on previous turns, making its responses more accurate and context-aware.
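
To make the conversation truly continuous rather than a fixed two questions, the same pattern can be wrapped in a simple loop. This is just a sketch that reuses `rag_chain` and `chat_history` from the code above; type "exit" to stop.

Python
while True:
    query = input("You: ")
    if query.lower() == "exit":
        break
    result = rag_chain.invoke({"input": query, "chat_history": chat_history})
    print(f"AI: {result['answer']}")
    # Append both sides of the turn so the next question keeps its context
    chat_history.append(HumanMessage(content=query))
    chat_history.append(AIMessage(content=result["answer"]))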

Web Scraping:

Up until now, we’ve been working with documents—converting them into embeddings, storing them in the vector database, and using a retriever to pull relevant information. But there are other sources of information we can retrieve beyond just documents. One of the popular retrieval options is web scraping, where we can extract content directly from websites.

LangChain provides built-in loaders that can retrieve information from specific websites. Beyond these, you can also use search platforms such as Google or other integrations to extract content from the web.

Once the information is scraped from the website, we can convert the extracted text into embeddings, similar to how we handle document data, and store them in our vector database. This way, instead of using only static text from documents, we can incorporate fresh information from the web.

Furthermore, if you prefer or need more control over the data extraction process, you can also use custom web scrapers. These tools allow you to scrape specific data from websites if you don’t want to rely on built-in or integrated retrievers. There are plenty of web scraping technologies and techniques available, giving you the flexibility to choose the best option for your needs.
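
As a rough illustration of that last option, a hand-rolled scraper can be as small as a few lines of requests and BeautifulSoup (both assumed to be installed); the `scrape_page` helper below is hypothetical, not part of LangChain, and the resulting Document objects can then go through the same splitting and embedding pipeline used for PDFs.

Python
import requests
from bs4 import BeautifulSoup
from langchain_core.documents import Document

def scrape_page(url: str) -> Document:
    # Fetch the page and strip it down to plain text
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator="\n", strip=True)
    return Document(page_content=text, metadata={"source": url})

docs = [scrape_page("https://machinelearningguider.blogspot.com/2024/09/langchain-code-journey-from-basics-part_2.html")]

With that aside, the rest of this section uses LangChain's built-in WebBaseLoader: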

Python
import os
from dotenv import load_dotenv
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.messages import HumanMessage, SystemMessage

load_dotenv()
persistent_directory = os.path.join("chroma_db_web")

urls = ["https://machinelearningguider.blogspot.com/2024/09/langchain-code-journey-from-basics-part_2.html"]
loader = WebBaseLoader(urls)
documents = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")


if not os.path.exists(persistent_directory):
    print(f"\n--- Creating vector store in {persistent_directory} ---")
    db = Chroma.from_documents(docs, embeddings, persist_directory=persistent_directory)
    print(f"--- Finished creating vector store in {persistent_directory} ---")
else:
    print(f"Vector store {persistent_directory} already exists. No need to initialize.")
    db = Chroma(persist_directory=persistent_directory, embedding_function=embeddings)

retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)
query = "What are key fucntions for conversational Retriver"
relevant_docs = retriever.invoke(query)

combined_input = (
    "Here are some documents that might help answer the question: "
    + query
    + "\n\nRelevant Documents:\n"
    + "\n\n".join([doc.page_content for doc in relevant_docs])
    + "\n\nPlease provide an answer based only on the provided documents. If the answer is not found in the documents, respond with 'I'm not sure'."
)

model = ChatOpenAI(model="gpt-4o")

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content=combined_input),
]

result = model.invoke(messages)

print("\n--- Generated Response ---")
print("Content only:")
print(result.content)

We are trying to retrieve information from one of my blog posts. I've provided the blog URL, and we are using a loader called WebBaseLoader. It loads the content from the page you give it, but it doesn't crawl through every link on that page; it only extracts the content of the specific URL you've provided, from top to bottom.

Once the information is extracted, it follows the same steps we’ve used earlier. The content is converted into a set of documents, and each document is chunked using CharacterTextSplitter. You can also use RecursiveCharacterTextSplitter if needed—both methods work fine. After splitting the content into chunks, we convert them into a set of documents and follow the same process of creating embeddings. We then store these embeddings in a vector database using Chroma.

Next, we use a retriever to fetch the most relevant result (here, just the top 1 by similarity). Once that chunk is fetched from the vector database, it is combined with the query into a custom prompt and passed to the chat model (GPT-4o), which uses this context to generate a response.

This process allows us to extract content from any web page by simply providing the URLs. You can give a list of multiple links, and the function will return all the information from those pages. A key point to note here is that WebBaseLoader only extracts content from the specific webpage you’ve provided; it doesn’t crawl into every link, subpage, or embedded page.

If you need to scrape deeper content across multiple links or subpages, you’ll need to use external tools (as I mentioned earlier) that are more capable of crawling through complex site structures.

FireCrawl Loader:

FireCrawl is one of the newer and more efficient web crawling tools available, and it offers a free trial for scraping and related tasks. It stands out as a high-performing scraping solution for analysis-driven applications: rather than returning the entire HTML structure and other unnecessary elements, it extracts just the content, which keeps token counts down and makes the output easier to work with.

FireCrawl offers two main modes of operation:

  1. Scrape Mode: This mode allows you to scrape content from a single webpage. It focuses on extracting the data from one URL without navigating deeper into linked pages.
  2. Crawl Mode: In this mode, FireCrawl can go deeper, following more links across multiple pages. It crawls through the website, retrieving content from various linked pages, allowing for more comprehensive data extraction.

Now, let's take a closer look at how FireCrawl works in the code. First, you need to create an account on FireCrawl and generate an API key. Once you have the API key, you load it into your environment variables to securely use it within your code. From there, the process is quite similar to using a WebBaseLoader, but with FireCrawl's specific functionality for scraping and crawling.

In the code, we first import the required modules, including FireCrawl. Once everything is set up, we load the content from the desired webpage or crawl through multiple linked pages, depending on the mode we use. After the content is extracted, we handle it the same way as other documents—storing it in the database, creating embeddings, and using the retriever to query relevant information.

This process allows you to take advantage of FireCrawl's efficient content extraction and integrate it seamlessly into your workflow. FireCrawl is a powerful tool, particularly for those needing to extract clean, usable content without the clutter of HTML structures.



Python
import os
from dotenv import load_dotenv
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import FireCrawlLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.messages import HumanMessage, SystemMessage
load_dotenv()

persistent_directory = os.path.join("chroma_db_firecrawl")
api_key = os.getenv("FIRECRAWL_API_KEY")
# Scrape a single page; switch to mode="crawl" to follow links across the site
loader = FireCrawlLoader(
    api_key=api_key,
    url="https://about.fb.com/news/2024/04/introducing-our-open-mixed-reality-ecosystem/",
    mode="scrape",
)
docs = loader.load()
print(docs)
# Chroma metadata values must be simple types, so flatten any list values into strings
for doc in docs:
    for key, value in doc.metadata.items():
        if isinstance(value, list):
            doc.metadata[key] = ", ".join(map(str, value))

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
split_docs = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

if not os.path.exists(persistent_directory):
    print(f"\n--- Creating vector store in {persistent_directory} ---")
    db = Chroma.from_documents(
        split_docs, embeddings, persist_directory=persistent_directory
    )
    print(f"--- Finished creating vector store in {persistent_directory} ---")
else:
    print(f"Vector store {persistent_directory} already exists. No need to initialize.")
    db = Chroma(persist_directory=persistent_directory, embedding_function=embeddings)

retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},
)

query = "what is meta horizon os?"
relevant_docs = retriever.invoke(query)

print(relevant_docs)

combined_input = (
    "Here are some documents that might help answer the question: "
    + query
    + "\n\nRelevant Documents:\n"
    + "\n\n".join([doc.page_content for doc in relevant_docs])
    + "\n\nPlease provide an answer based only on the provided documents. If the answer is not found in the documents, respond with 'I'm not sure'."
)

model = ChatOpenAI(model="gpt-4o")

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content=combined_input),
]

result = model.invoke(messages)

print("\n--- Generated Response ---")
print("Content only:")
print(result.content)

The FireCrawlLoader is a web crawling and scraping tool, and like other crawlers it exposes options for customizing how you scrape and retrieve content from websites. The list below mixes the options FireCrawlLoader takes directly (url, mode, and api_key) with settings that are common in web-crawling frameworks in general; the optional ones are inferred from general web-scraping practice and may not be available, or may go by different names, in your FireCrawl version. A hedged sketch of how such options might be passed appears after the list:

1. url (Required)

  • This is the URL of the website or webpage you want to scrape. It’s a mandatory field.
  • Example: url="https://about.fb.com/news"

2. mode (Required)

  • This sets the scraping mode. "scrape" is typically for retrieving content from a single webpage, while "crawl" can retrieve multiple linked pages.
  • Example: mode="scrape" or mode="crawl"

3. api_key (Required)

  • The API key is used to authenticate with FireCrawl. This is essential for accessing the FireCrawl service.
  • Example: api_key=os.getenv("FIRECRAWL_API_KEY")

4. depth (Optional, relevant for crawl mode)

  • This sets how deep you want the crawler to go when following links on a webpage. Depth defines how many levels of links the crawler should follow.
  • Example: depth=2 (crawls up to 2 levels of links)

5. max_pages (Optional, relevant for crawl mode)

  • This limits the number of pages that FireCrawl should scrape when operating in "crawl" mode. It can prevent the crawler from scraping too many pages.
  • Example: max_pages=50 (limits the crawler to 50 pages)

6. selectors (Optional)

  • Allows you to specify CSS selectors or XPath to scrape only certain parts of a webpage (e.g., specific divs, paragraphs, or images).
  • Example: selectors={"content": ".article-body", "title": "h1"}

7. headers (Optional)

  • Custom HTTP headers (e.g., user-agent) can be used if the target website requires it. This helps mimic a real browser and avoid getting blocked.
  • Example: headers={"User-Agent": "Mozilla/5.0"}

8. timeout (Optional)

  • Set a timeout for each page load. This ensures that the crawler doesn’t get stuck waiting for a slow website.
  • Example: timeout=10 (timeout after 10 seconds)

9. allow_redirects (Optional)

  • Whether or not to follow HTTP redirects when scraping.
  • Example: allow_redirects=True

10. follow_links (Optional, crawl mode)

  • This option controls whether to follow internal or external links during crawling. By default, it usually follows only internal links.
  • Example: follow_links=True (follows internal links)

11. crawl_rate_limit (Optional, crawl mode)

  • This option helps avoid overloading the target website by specifying a delay between requests. It’s useful for avoiding getting blocked by the website.
  • Example: crawl_rate_limit=2 (waits 2 seconds between each page crawl)

12. skip_media (Optional)

  • Skips downloading media content like images, videos, and audio files. It focuses only on textual data.
  • Example: skip_media=True

13. output_format (Optional)

  • Specifies the format for the crawled data, such as JSON, plain text, or HTML.
  • Example: output_format="json"

14. max_token_limit (Optional)

  • Limits the number of tokens or characters extracted from a webpage. Useful for ensuring that only the most relevant content is retrieved.
  • Example: max_token_limit=5000
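
To make this more concrete, here is a minimal, hedged sketch of how such crawl settings could be supplied. Recent versions of FireCrawlLoader accept a params dictionary that is forwarded to the Firecrawl API; the exact key names shown here (limit, maxDepth) are assumptions and should be verified against the current Firecrawl documentation for your version:

Python
from langchain_community.document_loaders import FireCrawlLoader

# Hypothetical crawl configuration: the key names below are assumptions that
# roughly correspond to max_pages and depth above; check the Firecrawl docs.
loader = FireCrawlLoader(
    api_key="YOUR_FIRECRAWL_API_KEY",  # placeholder
    url="https://example.com",
    mode="crawl",
    params={
        "limit": 50,     # cap the number of pages crawled
        "maxDepth": 2,   # how many levels of links to follow
    },
)
docs = loader.load()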

BeautifulSoup Web-based Loader:

We can also use a library called BeautifulSoup, which is extremely useful when paired with web-based loaders. You can pass BeautifulSoup keyword arguments (bs_kwargs) to the loader to filter out unnecessary elements from a webpage.

For example, when you visit a webpage, the main content might be found within specific containers, such as titles, paragraphs, or other structured elements in the HTML, depending on the layout of the site. You can inspect the webpage to see where the main content is stored—whether in specific div containers, p tags, or other HTML elements. Once you inspect the page, you’ll notice that the containers with the main content often repeat themselves because of the structured nature of HTML pages.

By identifying the container names that store the main content, you can filter (or "strain") out the irrelevant parts and focus only on the classes or elements that contain the information you need. This is where BeautifulSoup’s SoupStrainer comes in handy. It allows you to remove unnecessary content and zero in on specific elements or classes, helping to reduce the amount of text being processed. This, in turn, reduces the number of tokens passed to the language model (LLM), making the process more efficient.

So essentially, using the SoupStrainer helps eliminate irrelevant parts of the page and focus only on the desired elements. Below is the same code we’ve been working with, with an extra bs_kwargs argument that incorporates BeautifulSoup’s SoupStrainer.


Python
import os
from dotenv import load_dotenv
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.messages import HumanMessage, SystemMessage
import bs4

load_dotenv()
persistent_directory = os.path.join("chroma_db_web")

urls = ["https://en.wikipedia.org/wiki/Machine_learning"]
# Parse only the main article container, straining out navigation, sidebars, etc.
loader = WebBaseLoader(
    urls,
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_="mw-body-content")),
)
documents = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")


if not os.path.exists(persistent_directory):
    print(f"\n--- Creating vector store in {persistent_directory} ---")
    db = Chroma.from_documents(docs, embeddings, persist_directory=persistent_directory)
    print(f"--- Finished creating vector store in {persistent_directory} ---")
else:
    print(f"Vector store {persistent_directory} already exists. No need to initialize.")
    db = Chroma(persist_directory=persistent_directory, embedding_function=embeddings)

retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)
query = "What is Machine Learning"
relevant_docs = retriever.invoke(query)

combined_input = (
    "Here are some documents that might help answer the question: "
    + query
    + "\n\nRelevant Documents:\n"
    + "\n\n".join([doc.page_content for doc in relevant_docs])
    + "\n\nPlease provide an answer based only on the provided documents. If the answer is not found in the documents, respond with 'I'm not sure'."
)

model = ChatOpenAI(model="gpt-4o")

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content=combined_input),
]

result = model.invoke(messages)

print("\n--- Generated Response ---")
print("Content only:")
print(result.content)

We can also use BeautifulSoup for many other purposes. One example is the code provided below, where we extract content specifically from paragraph (<p>) tags, which are block-level elements. Besides extracting paragraph content, BeautifulSoup provides a variety of powerful functions that allow us to manipulate and extract data from HTML documents.

Here are some of the key functions you can use with BeautifulSoup:

  1. find():

    • Finds the first occurrence of an element that matches the specified criteria.
    • Example: soup.find("div", {"class": "content"}) to find the first div element with class content.
  2. find_all():

    • Returns all elements that match the given criteria, in the form of a list.
    • Example: soup.find_all("p") extracts all <p> elements (paragraphs).
  3. get_text():

    • Extracts all the text inside an element, stripping away the HTML tags.
    • Example: element.get_text() returns the text content of a tag, excluding any HTML.
  4. select():

    • Uses CSS selectors to find elements.
    • Example: soup.select(".content p") finds all <p> tags inside elements with class content.
  5. decompose():

    • Removes an element from the DOM, including all of its children.
    • Example: soup.find("script").decompose() removes the first <script> tag from the document (loop over find_all("script") to remove them all).
  6. replace_with():

    • Replaces a tag or string with another element or string.
    • Example: soup.find("b").replace_with(soup.new_tag("strong")) swaps the <b> tag for a new <strong> tag (passing a plain string like "strong" would insert that literal text instead).
  7. find_parents():

    • Finds all parent elements that match the specified criteria.
    • Example: soup.find("p").find_parents("div") finds all div elements that are parents of the p element.
  8. find_next_siblings():

    • Finds the sibling elements that follow the current element.
    • Example: soup.find("h2").find_next_siblings("p") finds all paragraph siblings after an h2 element.
  9. attrs:

    • Allows access to an element's attributes.
    • Example: soup.find("img")["src"] will return the src attribute of the first image element.
  10. strip():

    • A standard Python string method (not a BeautifulSoup method) that removes leading and trailing whitespace; handy on text returned by get_text().
    • Example: soup.get_text().strip() removes extra whitespace from the extracted text.
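
Before returning to the full RAG example, here is a small, self-contained sketch of a few of these methods run against an inline HTML snippet (the HTML itself is invented purely for demonstration):

Python
from bs4 import BeautifulSoup

html = """
<div class="content">
  <h1>Sample Title</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <img src="/images/pic.png" alt="demo">
  <script>console.log("noise");</script>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.find("h1").get_text())                # Sample Title
print(len(soup.find_all("p")))                   # 2
print(soup.select(".content p")[0].get_text())   # First paragraph.
print(soup.find("img")["src"])                   # /images/pic.png

soup.find("script").decompose()                  # drop the <script> noise
print(soup.get_text().strip())                   # text content only

The full example below applies the same ideas inside the RAG pipeline we have been building: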

Python
import os
from dotenv import load_dotenv
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.messages import HumanMessage, SystemMessage
from bs4 import BeautifulSoup

load_dotenv()

persistent_directory = os.path.join("chroma_db_web")

urls = ["https://en.wikipedia.org/wiki/Machine_learning"]

loader = WebBaseLoader(urls)
documents = loader.load()

for doc in documents:
    soup = BeautifulSoup(doc.page_content, "html.parser")

    # Example 1: Remove all script and style elements before extracting text
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Example 2: Extract the text of all <p> elements
    paragraphs = soup.find_all("p")
    paragraph_text = " ".join(p.get_text() for p in paragraphs)

    # Example 3: Strip extra whitespace; fall back to the full page text if no
    # <p> tags were found (WebBaseLoader often returns already-extracted text)
    doc.page_content = (paragraph_text or soup.get_text()).strip()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

if not os.path.exists(persistent_directory):
    db = Chroma.from_documents(docs, embeddings, persist_directory=persistent_directory)
else:
    db = Chroma(persist_directory=persistent_directory, embedding_function=embeddings)

retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 1})

query = "What is Machine Learning?"
relevant_docs = retriever.invoke(query)

combined_input = (
    "Here are some documents that might help answer the question: "
    + query
    + "\n\nRelevant Documents:\n"
    + "\n\n".join([doc.page_content for doc in relevant_docs])
    + "\n\nPlease provide an answer based only on the provided documents. If the answer is not found in the documents, respond with 'I'm not sure'."
)

model = ChatOpenAI(model="gpt-4o")

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content=combined_input),
]

result = model.invoke(messages)

print("\n--- Generated Response ---")
print("Content only:")
print(result.content)
From the above code, you can see an example of how to use the different techniques I mentioned for BeautifulSoup. I’ve provided examples of the basic methods available, which you can use in web scraping. These include extracting elements like <p> tags, cleaning up the content, and more. You can experiment with different options and find which ones work best for your specific use cases.

These are the foundational methods commonly used in web scraping. I hope everything has been clear so far.

Now, let's move on to the next topic: Agents. This is a very exciting topic, especially when used in combination with RAG (Retrieval-Augmented Generation). Agents can help us take things a step further, making our system more dynamic and intelligent.

That’s it for BeautifulSoup and web scraping. Let’s dive into Agents!
