Enhancing RAG Through Anote

Anote
5 min read · Aug 8, 2024


In today’s world, having accurate and relevant information at your fingertips is crucial, especially in domains like finance, healthcare, and law. Large Language Models (LLMs) like GPT-4 have made impressive advances in generating human-like responses, but they are prone to failure and hallucination on more complex questions about specific topics. The reason is that most of these large models are trained on domain-agnostic data, and that data may not be entirely correct. However, there are many cases where we would like to use the capabilities of such models with data the model has never seen. A popular method for this kind of knowledge injection is Retrieval Augmented Generation (RAG). This post covers how the method works, its current limitations, and how the work at Anote addresses them to improve its accuracy and reliability.

Overview of RAG

If you have ever used a platform like ChatGPT, you are probably familiar with how it works. The user asks the model a question, and the model returns a response in a chatbot-like interface. In most cases, the user just inputs their question, like “where are the 2024 Olympics taking place?”, and the model can answer from the query alone. Sometimes, however, there is additional context relevant to the question. In these cases, the user may paste that context into the prompt itself, for example when asking “can you provide a summary of the following text” or when asking questions about a specific excerpt. When the relevant context is of a manageable size, it can be included in the prompt. But what happens when the relevant context is not an excerpt of a few hundred words but a document over 50 pages long? The entire text cannot be included in the prompt because of the limits of the context window.

This is where Retrieval-Augmented Generation (RAG) comes in: a technique that helps language models improve their answers by looking up information from reliable sources before responding. The retrieval step finds the most relevant parts of the document given the question and augments the prompt with them before it goes to the LLM to generate the answer. This is a very effective way to inject new information into an LLM without additional training, and since the information is included in the prompt itself, it significantly reduces hallucinations as well.
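
To make this concrete, here is a minimal sketch of that retrieve-then-generate loop, assuming a sentence-transformers embedding model. The call_llm stub is a placeholder for whatever completion API you use, not a specific library call.

```python
# Minimal sketch of a RAG loop: embed the chunks once, then for each
# question retrieve the most similar chunk and prepend it to the prompt.
# Assumes the sentence-transformers package; call_llm is a stub standing
# in for whatever LLM API you use.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = ["...chunk 1 text...", "...chunk 2 text..."]  # pre-split document
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # swap in your chat/completion API here

def answer(question: str) -> str:
    q_emb = model.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, chunk_embeddings)[0]  # cosine similarity to each chunk
    best_chunk = chunks[int(scores.argmax())]          # top-1 retrieval
    prompt = f"Context:\n{best_chunk}\n\nQuestion: {question}\nAnswer:"
    return call_llm(prompt)
```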

However, the real challenge lies in ensuring that the context retrieved by the pipeline is truly relevant and helpful. While the capabilities of LLMs themselves are rapidly improving, far less attention is paid to retrieval. If the prompt does not contain the relevant section, even the best model cannot answer accurately. Traditional RAG systems often stumble here, leading to less precise and sometimes irrelevant answers.

This is where Anote.ai comes in — a transformative solution that refines RAG processes to deliver sharper, more accurate insights from various types of documents.

Challenges in Traditional RAG Systems

  • Uniform Chunking: Many RAG systems break documents into equal-sized pieces without considering the document’s natural flow. This can result in chunks that either miss important context or include irrelevant information.
  • Similarity vs. Relevance: Most RAG pipelines use a metric like cosine similarity to find the most relevant chunk. However, this overlooks a lot of nuance and often misses information that is relevant, but not necessarily similar by such metrics.
  • Lack of Domain-Specific Insight: Generic embedding algorithms may not fully grasp the specialized language and nuances of different fields.

How Anote is Transforming RAG

Anote.ai is designed to tackle these challenges head-on with techniques that combine human annotations and zero-shot methods to make retrieval more accurate and enhance these RAG systems.

1. Dynamic Chunking

Instead of splitting the document into uniform chunks, we can employ more advanced techniques that better capture the semantic meaning and structure of the document, minimizing both irrelevant information and information loss within the chunks. One approach is recursive chunking, which dynamically splits documents based on contextual indicators like punctuation or headings, as in the sketch below. Taking it further, element-based chunking recognizes structures like headers or tables as separate chunks and preserves that information as well.
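
As a rough illustration (one common open-source implementation, not Anote’s exact pipeline), LangChain’s RecursiveCharacterTextSplitter tries a list of separators in order, so splits fall on paragraph or sentence boundaries rather than at arbitrary character counts.

```python
# Illustrative recursive chunking via LangChain's RecursiveCharacterTextSplitter:
# it tries each separator in order, so splits land on paragraph or sentence
# boundaries rather than at arbitrary character counts.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,                        # target chunk length in characters
    chunk_overlap=50,                      # overlap so context isn't cut at boundaries
    separators=["\n\n", "\n", ". ", " "],  # tried in order, coarse to fine
)
chunks = splitter.split_text(document_text)  # document_text: your raw document string
```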

2. Metadata Annotations

With more complex documents, it is not only the content of a chunk that carries valuable information; metadata about the chunk can be relevant as well. Anote is working on connecting its data labeling platform to these metadata annotations so that human feedback can be incorporated directly into retrieval.
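
As a hypothetical sketch of what metadata-aware retrieval can look like, the snippet below attaches annotations to each chunk and filters on them before the similarity search. The field names are illustrative, not Anote’s actual schema.

```python
# Hypothetical sketch of metadata-aware retrieval: each chunk carries
# annotations (section label, page number) that a human labeler could
# supply or correct, and retrieval filters on them before the similarity
# search. Field names are illustrative, not Anote's actual schema.
annotated_chunks = [
    {"text": "Q3 revenue grew 12%...", "section": "financials", "page": 4},
    {"text": "Risk factors include...", "section": "risks", "page": 17},
]

def candidates(section: str | None = None) -> list[str]:
    # Narrow the pool by metadata first, then run the usual embedding
    # similarity search over what remains.
    return [c["text"] for c in annotated_chunks
            if section is None or c["section"] == section]
```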

3. Query Expansion

Query expansion, or query transformation, refers to changing the question fed into the RAG pipeline so that relevant chunks are found based on more than the user’s original wording. The idea is that the user’s question often does not explicitly contain all the information needed to indicate to the retrieval algorithm where in the document it should look, especially when retrieval is a cosine similarity search. Query expansion adds information to the original question, such as a closed-book hypothetical answer from an LLM, and uses the combination of the two to find a relevant chunk.
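
One well-known variant of this idea is HyDE-style expansion. Here is a sketch under the same assumptions as the earlier snippets: call_llm is the stub from the first sketch, and retrieve_chunks stands in for any embedding-based retriever.

```python
# Sketch of HyDE-style query expansion: ask the LLM for a short closed-book
# "hypothetical answer", then run retrieval on question + hypothetical answer,
# so the query shares more vocabulary with the truly relevant chunk.
# call_llm and retrieve_chunks are placeholders for your LLM API and retriever.
def expanded_retrieve(question: str) -> list[str]:
    hypothetical = call_llm(
        f"Answer briefly from general knowledge, even if unsure: {question}"
    )
    expanded_query = f"{question}\n{hypothetical}"
    return retrieve_chunks(expanded_query)  # embedding search over the expanded text
```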

4. Re-Ranking

Most RAG pipelines retrieve the top one or two chunks by similarity and augment the prompt with them, but as discussed earlier, the most similar chunks are not necessarily the most relevant to answering the question. With a re-ranking algorithm, we might first retrieve the top 10 most similar chunks and then re-rank them by relevance; the 8th most similar chunk might turn out to be the most relevant one. A separate model, such as a cross-encoder or another ML algorithm, can make this judgment instead of relying on metrics like cosine similarity alone.
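
As an illustration, the sentence-transformers CrossEncoder class ships off-the-shelf re-rankers trained on MS MARCO. The sketch below rescores the candidates and keeps the best few; this is one possible implementation, not necessarily Anote’s production setup.

```python
# Sketch of retrieve-then-rerank with a sentence-transformers CrossEncoder:
# take the top-N chunks by cosine similarity, rescore each (question, chunk)
# pair with a model that reads both texts jointly, and keep the best few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, candidate_chunks: list[str], top_k: int = 2) -> list[str]:
    scores = reranker.predict([(question, c) for c in candidate_chunks])
    ranked = sorted(zip(candidate_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```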

5. Fine-Tuned Embeddings

Embedding algorithms convert text into numerical representations and play a crucial role in RAG pipelines. Similar to how the Anote platform uses your data to fine-tune models, we are working on features that fine-tune embedding models on domain-specific knowledge to enhance retrieval in that domain. Since retrieval is a similarity search over embedding vectors, a model that better captures meaning in a domain-specific context will be more effective at finding relevant chunks.
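
As a sketch of what such fine-tuning can look like with open-source tooling (not Anote’s internal training code), sentence-transformers supports contrastive fine-tuning on pairs of domain questions and their relevant passages. The training pairs below are illustrative.

```python
# Sketch of contrastive fine-tuning with sentence-transformers. Each
# InputExample pairs a domain question with its relevant passage;
# MultipleNegativesRankingLoss pulls each pair together and pushes apart
# the other passages in the batch.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")
train_pairs = [  # illustrative domain-specific pairs, not real training data
    InputExample(texts=["What was the net interest margin?",
                        "NIM rose to 3.1% in Q3..."]),
    InputExample(texts=["List the loan covenant terms.",
                        "The facility requires a leverage ratio below..."]),
]
loader = DataLoader(train_pairs, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```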

Conclusion

Improving retrieval in document-based question-answering systems enhances the overall quality of the system. By retrieving the right information, we not only provide more relevant citations but also deliver more accurate answers to users’ questions. This underscores the critical role of robust retrieval algorithms: without proper context, even the most advanced models produce incorrect responses. Anote and the research we conduct aim to use human feedback to improve both retrieval and generation, building robust systems for domain-specific use cases.

Additional Information

You can access our paper, which goes into detail on some of the methods discussed, as well as the public GitHub repository with all relevant code.

GitHub Repo: https://github.com/nv78/Anote

Research Paper: https://arxiv.org/abs/2404.07221
