🔍 Why Chunking Strategy is Key in Building a RAG System with LLMs 🔍

Recently, a student of mine was asked in an interview to explain how he would choose a chunking strategy to build a better RAG system. When building a Retrieval-Augmented Generation (RAG) system with LLMs, chunking is a critical step. How you split data into manageable pieces affects not only how well relevant information is retrieved but also the LLM's ability to generate coherent and accurate responses.

• Contextual Understanding: LLMs have limits on how much text they can process at once. Chunking helps maintain context and keeps the model focused on relevant information. This matters especially if you are using an open-source LLM with a small context window.
• Efficiency: Smaller chunks reduce the computational burden on the LLM, leading to faster responses.
• Retrieval Accuracy: Chunking can improve retrieval accuracy by making it easier to find relevant segments.

Essentially, chunks should be neither too large nor too small. So, what's the right answer? It depends on the problem you are solving. First, ask yourself these questions:

• What is the nature of the content being indexed? Long documents, short sentences, paragraphs?
• Which embedding model are you using, and what chunk sizes does it perform best on?
• What length and complexity of user queries do you expect? Will they be short and specific or long and complex? This may inform how you chunk.
• How will the retrieved results be used in your application? For example, will they feed semantic search, question answering, summarization, or something else?

Then choose one of the strategies below.

𝗖𝗼𝗺𝗺𝗼𝗻 𝗖𝗵𝘂𝗻𝗸𝗶𝗻𝗴 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗲𝘀:
• Sentence-Based Chunking
• Paragraph-Based Chunking
• Token-Based Chunking
• Overlapping Chunking (Sliding Window)
• Topic-Based Chunking: identify topic boundaries using techniques like topic modeling or keyword extraction.
• Specialized Chunking: formats like LaTeX or Markdown need specialized splitting packages, so take the source format into consideration.
• Semantic Chunking: variable-length chunks that retain semantic coherence, for example by grouping related sentences. LangChain has a semantic chunking splitter.
• Hybrid Chunking: combine multiple strategies to optimize for your specific use case.

Choosing the Right Strategy: The best chunking strategy depends on factors like the structure of your documents, the complexity of the information, and the desired level of granularity. Experimentation is often key to finding the optimal approach.

By carefully considering chunking strategies, you can significantly enhance the performance and effectiveness of your RAG system.
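To make two of these strategies concrete, here is a minimal Python sketch using LangChain's splitters for overlapping (sliding-window) chunking, with semantic chunking shown as a commented variant. The package names, class names, and chunk sizes are assumptions based on recent LangChain releases and may differ in your version; the input file is hypothetical.

# Minimal sketch: overlapping chunking with LangChain, plus a commented
# semantic-chunking variant. Assumes the langchain-text-splitters and
# langchain-experimental packages; names may differ across versions.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = open("my_document.txt").read()  # hypothetical input document

# Overlapping (sliding-window) chunking: fixed-size chunks with overlap so
# that ideas are not cut exactly at chunk boundaries.
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_text(text)

# Semantic chunking: variable-length chunks that keep related sentences
# together (requires an embedding model, hence commented out here).
# from langchain_experimental.text_splitter import SemanticChunker
# from langchain_openai import OpenAIEmbeddings
# semantic_chunks = SemanticChunker(OpenAIEmbeddings()).split_text(text)

In practice, a sensible starting point is an overlap of roughly 10-20% of the chunk size, then tuning both against retrieval quality on your own queries.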
More Relevant Posts
-
The Ultimate Guide to RAGs — Each Component Dissected
https://ift.tt/Co4KNg0

A visual tour of what it takes to build CHAD-level LLM pipelines. Let's learn RAGs! (Image by Author)

If you have worked with Large Language Models, there is a great chance that you have at least heard the term RAG — Retrieval Augmented Generation. The idea of RAG is pretty simple: suppose you want to ask a question to an LLM. Instead of relying only on the LLM's pre-trained knowledge, you first retrieve relevant information from an external knowledge base. This retrieved information is then provided to the LLM along with the question, allowing it to generate a more informed and up-to-date response. Comparing standard LLM calls with RAG (Source: Image by Author)

So, why use Retrieval Augmented Generation? When providing accurate and up-to-date information is key, you cannot rely on the LLM's built-in knowledge. RAGs are a cheap, practical way to use LLMs to generate content about recent or niche topics without needing to finetune them yourself and burn away your life's savings. Even when an LLM's internal knowledge may be enough to answer questions, it can be a good idea to use RAG anyway, since recent studies have shown that it could help reduce LLM hallucinations.

The different components of a bare-bones RAG
Before we dive into the advanced portion of this article, let's review the basics. Generally, RAGs consist of two pipelines — preprocessing and inferencing. Inferencing is all about using data from your existing database to answer a user query. Preprocessing is the process of setting up the database in the correct way so that retrieval works correctly later on. Here is a diagrammatic look at the entire basic, bare-bones RAG pipeline. The Basic RAG pipeline (Image by Author)

The Indexing or Preprocessing Steps
This is the offline preprocessing stage, where we set up our database.
Identify Data Source: Choose a relevant data source based on the application, such as Wikipedia, books, or manuals. Since this is domain dependent, I am going to skip over this step in this article. Go choose any data you want to use, knock yourself out!
Chunking the Data: Break down the dataset into smaller, manageable documents or chunks.
Convert to Searchable Format: Transform each chunk into a numerical vector or similar searchable representation.
Insert into Database: Store these searchable chunks in a custom database, though external databases or search engines could also be used.

The Inferencing Steps
During the query inferencing stage, the following components stand out.
Query Processing: A method to convert the user's query into a format suitable for search.
Retrieval/Search Strategy: A similarity search mechanism to retrieve the most relevant documents.
Post-Retrieval Answer Generation: Use retrieved documents as context to generate the answer with...
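To tie the preprocessing and inference steps above together, here is a small self-contained sketch of the retrieval half of such a pipeline, using sentence-transformers and plain cosine similarity instead of a real vector database. The model name, example chunks, and query are illustrative assumptions, not the article's actual setup.

# Sketch of the basic RAG retrieval flow: embed chunks, embed the query,
# rank chunks by cosine similarity, and build the context for the LLM.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Preprocessing: chunk the corpus (here, two toy chunks) and embed each chunk.
chunks = [
    "RAG retrieves external context before the LLM generates an answer.",
    "Chunking splits documents into smaller, searchable pieces.",
]
chunk_embeddings = model.encode(chunks, normalize_embeddings=True)

# Inference: embed the query and run a cosine-similarity search.
query = "Why do RAG systems chunk documents?"
query_embedding = model.encode([query], normalize_embeddings=True)[0]
scores = chunk_embeddings @ query_embedding  # cosine similarity of unit vectors
top_k = np.argsort(scores)[::-1][:2]
context = "\n".join(chunks[i] for i in top_k)
# `context` would now be passed to the LLM together with the original query.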
-
5 Chunking Strategies For RAG explained in a single frame 🧩

Before embedding the additional info, it is advised to chunk it, i.e., divide a large document into smaller, manageable pieces. This step is crucial since it ensures the text fits the input size of the embedding model. Moreover, it improves the efficiency and accuracy of the retrieval step, which directly impacts the quality of generated responses. The visual below explains 5 common strategies for chunking.

1) Fixed-size chunking
- Generate chunks by splitting the text into uniform segments.
- Not a great strategy on its own since it usually breaks sentences (or ideas) mid-way.
- As a result, important information is likely to get split across chunks.

2) Semantic chunking
- Segment the document into meaningful units like sentences, paragraphs, or thematic sections.
- Next, create embeddings for each segment.
- Start with the first segment and its embedding.
↳ If the first segment's embedding has a high cosine similarity with that of the second segment, both segments form a chunk.
↳ This continues until cosine similarity drops significantly.
↳ The moment it does, we start a new chunk and repeat.

3) Recursive chunking
- First, chunk based on inherent separators like paragraphs or sections.
- Split each chunk into smaller chunks if its size exceeds a pre-defined chunk size limit.

4) Document structure-based chunking
- Uses the inherent structure of documents, like headings, sections, or paragraphs, to define chunk boundaries.
- This maintains structural integrity by aligning with the document's logical sections.
- But it assumes the document has a clear structure, which may not be true.
- Also, chunks may vary in length, possibly exceeding model token limits. You can try combining it with recursive splitting.

5) LLM-based chunking
- Since every approach has upsides and downsides, why not use an LLM to create chunks?
- The LLM can be prompted to generate semantically isolated and meaningful chunks.
- Quite evidently, this method ensures high semantic accuracy since the LLM can understand context and meaning beyond the simple heuristics used in the four approaches above.
- The catch is that it is the most computationally demanding of the five techniques discussed here.
- Also, since LLMs typically have a limited context window, that has to be managed as well.

--
If you want to learn AI/ML engineering, we have put together a free PDF (530+ pages) with 150+ core DS/ML lessons. Get here: https://lnkd.in/gKB_VaaM
--
👉 Over to you: What other chunking strategies do you know?
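As a rough illustration of the semantic chunking approach in point 2, here is a hedged Python sketch that merges consecutive sentences while their embeddings stay similar and starts a new chunk when similarity drops. The embedding model and similarity threshold are assumptions; a real splitter would tune both.

# Semantic chunking sketch: group consecutive sentences into a chunk while
# their embeddings remain similar; start a new chunk when similarity drops.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def semantic_chunks(sentences, threshold=0.6):
    if not sentences:
        return []
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Compare each sentence with the previous one; high similarity means
        # we are still on the same topic and can extend the current chunk.
        similarity = float(util.cos_sim(embeddings[i - 1], embeddings[i]))
        if similarity >= threshold:
            current.append(sentences[i])
        else:
            chunks.append(" ".join(current))  # similarity dropped: close chunk
            current = [sentences[i]]
    chunks.append(" ".join(current))
    return chunks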
-
𝐁𝐮𝐢𝐥𝐝𝐢𝐧𝐠 𝐄𝐟𝐟𝐞𝐜𝐭𝐢𝐯𝐞 𝐑𝐀𝐆 𝐒𝐲𝐬𝐭𝐞𝐦𝐬: 𝐂𝐡𝐮𝐧𝐤𝐢𝐧𝐠 𝐒𝐭𝐫𝐚𝐭𝐞𝐠𝐢𝐞𝐬 𝐟𝐨𝐫 𝐁𝐞𝐭𝐭𝐞𝐫 𝐂𝐨𝐧𝐭𝐞𝐱𝐭𝐮𝐚𝐥 𝐔𝐧𝐝𝐞𝐫𝐬𝐭𝐚𝐧𝐝𝐢𝐧𝐠

In Retrieval-Augmented Generation (RAG) systems, chunking is a crucial step for dividing large documents into manageable pieces for indexing and retrieval. A thoughtful chunking strategy directly impacts the accuracy and performance of downstream tasks like search and response generation. Here are the 3 chunking strategies I use to optimize RAG systems:

1️⃣ 𝐅𝐢𝐱𝐞𝐝-𝐒𝐢𝐳𝐞 𝐂𝐡𝐮𝐧𝐤𝐢𝐧𝐠
How it works: Divide text into fixed-size chunks (e.g., 256 or 512 tokens) using sliding windows or static boundaries.
Applications: Best for indexing and retrieval when working with unstructured or semi-structured data where context length needs to be uniform.
Tools: Libraries like LangChain or Haystack are great for implementing token-based sliding windows.

2️⃣ 𝐑𝐞𝐜𝐮𝐫𝐬𝐢𝐯𝐞 𝐂𝐡𝐮𝐧𝐤𝐢𝐧𝐠
How it works: Use logical delimiters (e.g., new line characters or repeated line breaks) as proxies for paragraphs or semantic sections. This approach respects natural content boundaries and retains meaningful context.
Applications: Ideal for structured documents like reports, articles, or books.
Tools: Python libraries like spaCy or NLTK can help process and detect logical delimiters efficiently.

3️⃣ 𝐃𝐨𝐜𝐮𝐦𝐞𝐧𝐭 𝐋𝐚𝐲𝐨𝐮𝐭 𝐀𝐰𝐚𝐫𝐞 𝐂𝐡𝐮𝐧𝐤𝐢𝐧𝐠
How it works: Parse document elements such as paragraphs, tables, and images separately. Use modality-specific models to embed each type of content (e.g., text embeddings for paragraphs, image embeddings for visuals). For non-text modalities (e.g., images), you can summarize the content and index the summary alongside the text.
Applications: Perfect for working with multi-modal documents like PDFs, resumes, or research papers.
𝐓𝐨𝐨𝐥𝐬:
PDF Parsing: PyMuPDF, pdfplumber
Image Embeddings: CLIP by OpenAI
Table Processing: Pandas, Tabular Transformers
Summarization Models: Hugging Face Transformers like BART or T5

𝐖𝐡𝐲 𝐓𝐡𝐢𝐬 𝐌𝐚𝐭𝐭𝐞𝐫𝐬?
These chunking strategies ensure that both semantic meaning and context are preserved, enabling more accurate indexing and search capabilities. When combined with efficient vector databases (e.g., Pinecone, Weaviate, or FAISS) and robust retriever models, they significantly enhance the quality of responses in RAG workflows.

𝑯𝒐𝒘 𝒅𝒐 𝒚𝒐𝒖 𝒂𝒑𝒑𝒓𝒐𝒂𝒄𝒉 𝒄𝒉𝒖𝒏𝒌𝒊𝒏𝒈 𝒊𝒏 𝒚𝒐𝒖𝒓 𝑹𝑨𝑮 𝒔𝒚𝒔𝒕𝒆𝒎𝒔? 𝑳𝒆𝒕’𝒔 𝒆𝒙𝒄𝒉𝒂𝒏𝒈𝒆 𝒊𝒅𝒆𝒂𝒔 𝒂𝒏𝒅 𝒍𝒆𝒂𝒓𝒏 𝒇𝒓𝒐𝒎 𝒆𝒂𝒄𝒉 𝒐𝒕𝒉𝒆𝒓!

#RAG #NLP #MachineLearning #GenAI #DocumentProcessing #AI
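For strategy 1, here is a minimal hedged sketch of fixed-size, token-based chunking with a sliding window, using a Hugging Face tokenizer to count tokens. The tokenizer, chunk size, and overlap are illustrative assumptions rather than recommended settings.

# Fixed-size, token-based chunking with a sliding window (overlap between
# consecutive chunks). Tokenizer choice and sizes are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def fixed_size_chunks(text, chunk_size=256, overlap=32):
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    stride = chunk_size - overlap  # step between window starts
    chunks = []
    for start in range(0, len(token_ids), stride):
        window = token_ids[start:start + chunk_size]
        chunks.append(tokenizer.decode(window))
        if start + chunk_size >= len(token_ids):
            break  # the last window already covers the end of the text
    return chunks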
-
Reranking Using Huggingface Transformers for Optimizing Retrieval in RAG Pipelines
https://ift.tt/aN1OIbQ

Understanding when reranking makes a difference. Visualization of the reranking results for the user query “What is rigid motion?”. Original ranks on the left, new ranks on the right. (image created by author)

In this article I will show you how you can use the Huggingface Transformers and Sentence Transformers libraries to boost your RAG pipelines using reranking models. Concretely, we will do the following:
Establish a baseline with a simple vanilla RAG pipeline.
Integrate a simple reranking model using the Huggingface Transformers library.
Evaluate in which cases the reranking model significantly improves context quality, to gain a better understanding of the benefits.
For all of this, I will link to the corresponding code on Github.

What is Reranking?
Before we dive right into our evaluation, I want to say a few words on what rerankers are. Rerankers are usually applied as follows: a simple embedding-based retrieval approach retrieves an initial set of candidates in the retrieval step of a RAG pipeline, and a reranker is then used to reorder those results into an order that better suits the user query. But why should the reranker model yield something different than my already quite powerful embedding model, and why not leverage the semantic understanding of a reranker at an earlier stage, you may ask? The answer is multi-faceted, but some key points are that the bge-reranker we use here processes queries and documents together in a cross-encoding approach and can thus explicitly model query-document interactions. Another major difference is that the reranking model is trained in a supervised manner to predict relevance scores obtained through human annotation. What that means in practice will be shown in the evaluation section later on.

Our Baseline
For our baseline we choose the simplest RAG pipeline possible and focus solely on the retrieval part. Concretely, we:
Choose one large PDF document. I went for my Master’s Thesis, but you can choose whatever you like.
Extract the text from the PDF and split it into equal chunks of about 10 sentences each.
Create embeddings for our chunks and insert them into a vector database, in this case LanceDB.
For details about this part, check out the notebook on Github. After following this, a simple semantic search is possible in two lines of code, namely:

query_embedding = model.encode([query])[0]
results = table.search(query_embedding).limit(INITIAL_RESULTS).to_pandas()

Here query would be the query provided by the user, e.g., the question “What is shape completion about?”. Limit, in this case, is the number of results to retrieve. In a normal RAG pipeline, the retrieved results would now just be provided directly as context to the LLM that will synthesize the answer. In many...
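To illustrate the reranking step the article describes, here is a short hedged sketch using the Sentence Transformers CrossEncoder wrapper with a bge-reranker checkpoint. The query and candidate documents are invented for illustration, and the exact model name may differ from the one used in the article.

# Rerank a small set of retrieved candidates with a cross-encoder: the model
# scores each (query, document) pair jointly, so it can capture query-document
# interactions that a bi-encoder retriever cannot.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")  # assumed reranker checkpoint

query = "What is rigid motion?"
candidates = [
    "Rigid motion preserves distances and angles between points.",
    "Shape completion reconstructs missing geometry from partial scans.",
]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates),
                                     key=lambda pair: pair[0], reverse=True)]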
-
Read “How to Implement Graph RAG Using Knowledge Graphs and Vector Databases” by Steve Hedden on Medium: https://lnkd.in/g7f28tzQ

This is likely a very good way to fine-tune an LLM's behavior. Keeping metadata close to the LLM, whether it's a knowledge graph or a book of knowledge, is a different kind of RAG than using a database of structured information: this is about focusing related ideas to solve a problem.

1. Use an LLM to create a knowledge graph: either a small mobile one, or a massive one that may be an ontology of something important with millions of nodes. Consider a knowledge graph about, say, computer science; this is a very useful ontology for building and maintaining code.

2. As a prompt is created and sent to the GPT, it can be structured, organized, or improved by using one or more knowledge graphs to enhance the prompt and add context to what is being requested. It also constrains the context. If you're asking about software patterns, say to solve a performance problem, the context of that software problem (what software is, how it's made into components) is all in a knowledge graph and can be added to the prompt automatically. This is useful for novices who may not know how things roll up. A toy sketch of this step follows the list below.

3. After the enhanced prompt goes to the GPT, with the knowledge graph(s) adding context and organizing the prompt with ontological cohesion, what is returned will probably be more useful to the user, because part of the knowledge graph is now part of the request.

4. The knowledge graph(s) can also be used for structured reasoning or planning when creating a prompt. If the problem is about software programming, the ontology of the SDLC could be used to consider the prompt from the point of view of architecture, requirements, design, development, quality assurance, deployment, etc. The business user here may not know this cycle, and this will improve the reasoning, as a Q&A cycle could walk through not only the right steps but the logical steps to get a response.

5. On generation through the chatbot or GPT, depending on the request, multiple knowledge graphs can be applied as part of the answer. If a verbose answer is required, the output can be expanded along the lines of the knowledge graph, going from the general to the specific depending on the needs of the user. In education this is how educators teach: from the general to the specific, the big picture first, then following the knowledge graph down to the specific result and expanding on it. That's called knowledge-graph-oriented expansion.

6. A knowledge graph can also be something that a GPT user could use after a generation to explore the data, as different choices along the branches of the graph could be used as answering guides. /more
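As a toy illustration of point 2 (using a knowledge graph to enrich a prompt before it reaches the model), here is a hedged sketch with a tiny in-memory graph; the graph contents, lookup logic, and prompt template are hypothetical and not taken from the linked article.

# Toy knowledge-graph-enhanced prompting: look up entities mentioned in the
# user's prompt and prepend their facts as extra context for the LLM.
knowledge_graph = {
    "observer pattern": [
        ("is_a", "behavioral design pattern"),
        ("used_for", "notifying dependents about state changes"),
    ],
}

def enrich_prompt(user_prompt: str) -> str:
    facts = []
    for entity, triples in knowledge_graph.items():
        if entity in user_prompt.lower():  # naive entity matching
            facts.extend(f"{entity} {relation} {obj}." for relation, obj in triples)
    context = "\n".join(facts) if facts else "(no matching entities)"
    return f"Context from knowledge graph:\n{context}\n\nQuestion: {user_prompt}"

print(enrich_prompt("When should I use the observer pattern?"))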
-
About 6 months ago I made these PDF notes on RAG. Resharing them with the community here.

At Yarnit, we are developing a SOTA RAG system that combines advanced retrieval, pre-retrieval and post-retrieval techniques along with concepts like GraphRAG, RAG agents, etc. While the in-production application of RAG is complex, I've compiled these notes that'll help you get introduced to some paradigms and terms. These include -

⭐ What is Retrieval Augmented Generation? - The inherent problems of LLMs (like GPT-4, Llama 3) that led to the popularity of RAG.
⭐ How does RAG help? - How RAG addresses the problem of hallucination and enables relevant context.
⭐ What are some popular RAG use cases? - RAG is being used for everything from summarization to conversational agents to content generation.
⭐ What does the RAG architecture look like? - The indexing pipeline and the generation pipeline.
⭐ How is data loaded and chunked? - Creating a contextual knowledge source.
⭐ What are embeddings? - Converting text data into vector form using OpenAI and open-source embeddings from Hugging Face.
⭐ What are vector stores? - Storing embeddings in vector databases and using vector indices.
⭐ What are the best retrieval strategies? - Simple to nuanced retrieval strategies for better context.
⭐ How to evaluate RAG outputs? - The complexities of evaluation and popular metrics and frameworks.
⭐ RAG vs Finetuning - What is better? - RAG and finetuning being complementary to each other.
⭐ How does the evolving LLMOps stack look? - Tools and frameworks to put RAG in production.
⭐ What is Multimodal RAG? - The growing popularity of image+text RAG.
⭐ What is Naive, Advanced and Modular RAG? - The state of the art in RAG and its evolution.

These notes largely focus on inference RAG. Research papers, blogs from thought leaders like Aman Chadha, Lillian Weng, Leonie Monigatti, and Chip Huyen, and official documentation from HuggingFace, Pinecone, TruLens, etc. continue to help me immensely in my journey with RAG.

While these notes may be dated, I am thrilled to be working on A Simple Guide to Retrieval Augmented Generation with Manning Publications Co. These notes are an inspiration behind the book, and the book goes into the details of each of the RAG components. The first three chapters of the book are available via the Manning Early Access Program. You can also look at the source code for the indexing and generation pipelines along with evaluation using RAGAS.

Link to the official source code repository - https://lnkd.in/gNJASE_w
If you'd like to be an early subscriber of the book and access it at a discount while it is being developed, join the MEAP. Link to join the MEAP - https://mng.bz/8wdg
Here's a 60-second intro to the book - https://lnkd.in/gF7JZ2WB

I believe RAG will continue to play a significant role in operationalizing LLMs and multimodal models for business. I'll be eager to hear what you think about the book and RAG in general.
-
Advanced Retrieval-Augmented Generation (#RAG)

## What is Advanced RAG?

Advanced RAG models incorporate more complex retrieval techniques, better integration of retrieved information, and often, the ability to iteratively refine both the retrieval and generation processes. Here are some key characteristics of advanced RAG:

1. **Advanced Retrieval Algorithms**:
   - These algorithms go beyond simple keyword matching. They include techniques like semantic search and contextual understanding. By leveraging these advanced retrieval methods, RAG models can retrieve more relevant and accurate information from knowledge repositories.
2. **Enhanced Integration**:
   - Advanced RAG seamlessly integrates retrieved content with the generated response. It ensures that the information retrieved from external sources is effectively incorporated into the final answer.
   - This integration allows RAG to produce responses rooted in factual information, making it more informative and accurate than conventional generative models operating independently.
3. **Iterative Refinement**:
   - Unlike naive RAG, which follows a straightforward pipeline, advanced RAG allows for iterative refinement.
   - This means that the model can learn from its own output, continuously improving both the retrieval and generation steps based on feedback and context.

## Techniques in Advanced RAG

Advanced RAG techniques can be categorized into three main areas:

1. **Pre-Retrieval Optimization**:
   - These optimizations focus on data indexing and query enhancements.
   - Techniques include:
     - **Sliding Window Retrieval**: Using an overlap between chunks for efficient retrieval.
     - **Enhancing Data Granularity**: Cleaning data by removing irrelevant information and ensuring factual accuracy.
     - **Adding Metadata**: Incorporating dates, purposes, or chapters for filtering.
     - **Optimizing Index Structures**: Strategies like adjusting chunk sizes or using multi-indexing approaches.
   - An interesting technique we'll explore is **Sentence Window Retrieval**, which embeds single sentences for retrieval and replaces them with a larger text window during inference (see the sketch after this list).
2. **Retrieval Optimization**:
   - These techniques enhance the retrieval process itself:
     - **Hybrid Search**: Combining different search methods (e.g., keyword-based and semantic search) for better results.
     - **Query Routing, Rewriting, and Expansion**: Improving query formulation and routing.
     - **Re-ranking**: Adjusting the order of retrieved documents based on relevance.
3. **Post-Retrieval Optimization**:
   - After retrieval, additional steps can enhance the final answer:
     - **Re-ranking**: Further refining the ranking of retrieved documents.
     - **Contextual Understanding**: Incorporating context-aware information.
     - **Iterative Refinement**: Continuously improving the answer based on context and feedback.
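Here is the sketch referenced above for Sentence Window Retrieval: index individual sentences, find the best match for a query, and hand the LLM that sentence together with its neighbours. The embedding model and window size are assumptions for illustration.

# Sentence Window Retrieval sketch: embed single sentences for search, but
# return the matched sentence plus its neighbours as the LLM context.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def sentence_window_retrieve(sentences, query, window=1):
    embeddings = model.encode(sentences, normalize_embeddings=True)
    query_embedding = model.encode([query], normalize_embeddings=True)[0]
    best = int(np.argmax(embeddings @ query_embedding))  # best-matching sentence
    start = max(0, best - window)
    end = min(len(sentences), best + window + 1)
    return " ".join(sentences[start:end])  # expanded window used as context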
-
The retrieval step is crucial in Retrieval-Augmented Generation (RAG) systems: it interprets the user's query to grasp its intent and employs advanced algorithms to scour a vast knowledge base for semantically relevant information. This ensures the gathered data is pertinent, forming the backbone for accurate and relevant response generation, and it largely determines the relevance and accuracy of the final answer. Here are five ways to improve the retrieval step:

𝟭. 𝗥𝗲𝗰𝘂𝗿𝘀𝗶𝘃𝗲 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗲 𝗳𝗼𝗿 𝗖𝗼𝗺𝗽𝗿𝗲𝗵𝗲𝗻𝘀𝗶𝘃𝗲 𝗜𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻 𝗚𝗮𝘁𝗵𝗲𝗿𝗶𝗻𝗴: Implement a step-by-step approach (e.g. Chain-of-Thought) where you dissect complex queries into simpler, manageable segments. This method ensures no critical piece of information is overlooked, providing a more thorough and enriched basis for generating responses (see here: https://lnkd.in/eMpxxfgF).

𝟮. 𝗖𝗵𝘂𝗻𝗸 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗳𝗼𝗿 𝗘𝗻𝗵𝗮𝗻𝗰𝗲𝗱 𝗥𝗲𝗹𝗲𝘃𝗮𝗻𝗰𝗲: Employ strategic retrieval methods such as sentence-window retrieval, which selects small text snippets along with adjacent context, and auto-merge retrieval, which dynamically combines these snippets based on relevance. This approach ensures a nuanced understanding of the topic by capturing essential details and their surrounding context. LlamaIndex offers multiple methods for chunk optimization.

𝟯. 𝗙𝗶𝗻𝗲-𝘁𝘂𝗻𝗶𝗻𝗴 𝘁𝗵𝗲 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗲𝗿 𝗳𝗼𝗿 𝗣𝗿𝗲𝗰𝗶𝘀𝗶𝗼𝗻 𝗮𝗻𝗱 𝗥𝗲𝗹𝗲𝘃𝗮𝗻𝗰𝗲: Elevate your retrieval model's accuracy by incorporating domain- or task-specific data into its training. This refinement sharpens the model's ability to discern and prioritize content that shares a closer semantic relationship with the query (see here: https://lnkd.in/exYUXwwr).

𝟰. 𝗥𝗲-𝗿𝗮𝗻𝗸𝗶𝗻𝗴 𝗳𝗼𝗿 𝗢𝗽𝘁𝗶𝗺𝗮𝗹 𝗜𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻 𝗦𝗲𝗹𝗲𝗰𝘁𝗶𝗼𝗻: Apply advanced algorithms to reassess and reorder the initially retrieved content, prioritizing pieces that offer the greatest diversity and relevance to the query. This re-ranking step ensures that the final selection of content not only covers the broad spectrum of the query's intent but also presents the most authoritative and comprehensive answers, enriching the final output (see here: https://lnkd.in/eDMXQxmm).

𝟱. 𝗠𝗲𝘁𝗮𝗱𝗮𝘁𝗮 𝗙𝗶𝗹𝘁𝗲𝗿𝗶𝗻𝗴 𝗳𝗼𝗿 𝗖𝗼𝗻𝘁𝗲𝘅𝘁𝘂𝗮𝗹 𝗣𝗿𝗲𝗰𝗶𝘀𝗶𝗼𝗻: Incorporate a layer of metadata analysis to refine the retrieval process further. By filtering documents based on metadata attributes like publication date, authorship, or relevance to the query's context, you ensure that the retrieved information is not just topically relevant but also contextually aligned with the specific needs of the query. Pinecone offers great functionality for this one.
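As a small illustration of point 5, here is a hedged sketch of metadata filtering applied before ranking retrieved candidates; the document structure, date field, and similarity scores are invented for illustration and do not reflect Pinecone's actual API.

# Metadata filtering sketch: drop candidates that fail a metadata constraint
# (here, a publication-date cutoff) before ranking by similarity score.
from datetime import date

retrieved = [
    {"text": "Chunking guide (2024 edition)", "published": date(2024, 3, 1), "score": 0.81},
    {"text": "Legacy retrieval notes", "published": date(2019, 6, 5), "score": 0.87},
]

def filter_and_rank(candidates, min_date):
    fresh = [doc for doc in candidates if doc["published"] >= min_date]
    return sorted(fresh, key=lambda doc: doc["score"], reverse=True)

results = filter_and_rank(retrieved, min_date=date(2023, 1, 1))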