Improved RAG: More effective Semantic Search with content transformations

One of the more pragmatic ways to get going on the current AI hype, and to get some value out of it, is by leveraging semantic search. This is, in itself, a relatively simple concept: You have a bunch of documents and want to find the correct one based on a given query. The semantic part now allows you to find the correct document based on the meaning of its contents, in contrast to simply finding words or parts of words in it like we usually do with lexical search. In our last projects, we gathered some experience with search bots, and with this article, I'd love to share our insights with you.

Sebastian Gingter is architect consultant and loves to explain things at Thinktecture. He focuses on Generative AI as well as on backends with ASP.NET Core.

Noticing a problem...

While the technology behind semantic search is quite complex and very mathematical, it is relatively easy to use. There are already a lot of libraries and SDKs out there that provide this functionality out of the box. They claim that you simply feed them data, and they will do the rest for you with a few lines of code. You will find that claim to be true in your first simple experiments, and you might even be surprised by how astonishingly good the results are, at least at first glance.

But then, when you dig deeper and increase the amount of data, you will find that the quality deteriorates quite quickly, and the results are not quite as good as hoped. Sometimes you will get a lot of documents that are very similar to your query, but none of them really answers your question. And sometimes you simply know that a very specific document has the exact answer to your question, but it does not show up in the search results at all, and several other documents that are somewhat related but less accurate are shown instead.

While setting up a semantic search on our internal Thinktecture knowledge base, we ran into exactly this problem: We have a lot of documents that are very similar to each other, and we have a lot of documents that are relatively similar to the query, but the searches were still quite “fuzzy” and not as accurate as we had hoped.

To understand what the underlying problem is, we first need to understand how semantic search works. And then, when we realize what the actual issue is, we can try to find a solution for it.

Semantic Search - what is it?

There are basically only four steps involved in semantic search.

Step 1: Preparation

You take your documents and put them through a bit of tooling to prepare the data. In most cases, this involves splitting large documents into several smaller chunks, because overly large pieces of content tend to cover too many topics, which makes it harder to determine the most relevant piece of information. Also, you might later want to use a Large Language Model (LLM) to generate an answer based on the found documents, and these models can only process a certain amount of information (they have a maximum size of their so-called context window). This is an additional reason to limit the size of the chunks.
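The splitting step can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming a fixed chunk size measured in characters and a small overlap between neighbouring chunks. The function name `split_into_chunks` and the default sizes are made up for this example; real pipelines often split on headings, paragraphs, or token counts instead.

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so text that is cut at a
    chunk boundary shows up again at the start of the next chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop as soon as the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "word " * 300  # 1500 characters of dummy content
chunks = split_into_chunks(sample, chunk_size=500, overlap=50)
print(len(chunks), [len(chunk) for chunk in chunks])
```

The overlap is a design choice: it costs a little extra storage but reduces the chance that a relevant sentence is split in a way that neither chunk represents well.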

Step 2: Indexing, or embedding creation

The smaller chunks of content are transformed into “embeddings”. To do this, the content is passed to a specialized AI model that is trained to convert its input into an output value that encodes the semantic meaning of the content in a computer-readable representation. Since a computer only works on numbers, the meaning is represented by several numbers, like an array of positions on several axes, or dimensions. This is called a vector. I try to imagine this as a bunch of abstract measurements: on each of the axes, an abstract concept about the input is encoded, and a value is assigned to it. The abstract concepts would be something like “How wet is the content? How business is the content? How animal is the content?” and so on. Taking all these measurements positions the document somewhere “embedded in the brain” of the embedding model. This “brain” is a multidimensional space. Depending on the model you use, there are hundreds or even several thousand dimensions. The result of this step is a “coordinate” of that chunk of data “within the embedding model’s brain”. This array of floating point values is a mathematical vector, and in our context this vector is called an embedding.

Step 3: Storage

The embedding is stored in a database, together with the ID and also often the complete content of the document it belongs to. In general, a simple key-value store would be sufficient, but to search for a document you would have to look at all the keys for each search operation, and this is not very efficient. For this reason, there are specialized databases that are optimized for this kind of search. They are called vector databases, or vector stores. Vector databases build indices based on the embedding values, and they also implement algorithms that can search for similar vectors in the index very efficiently. Depending on the type of database, you can also store additional data (like the full chunk of the content) and also additional metadata (like the title of the document, the author, the date of creation, and so on).
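To make the storage idea more tangible, here is a deliberately naive in-memory store. It is a sketch only: `Record` and `NaiveVectorStore` are hypothetical names, the two-dimensional embeddings are invented toy values, and Euclidean distance via `math.dist` is just one possible distance measure. The point is the linear scan in `nearest`, which is exactly the inefficiency that real vector databases avoid with their indices.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    doc_id: str
    embedding: list[float]
    content: str
    metadata: dict = field(default_factory=dict)


class NaiveVectorStore:
    """Keeps all records in a list and scans every one of them per query."""

    def __init__(self) -> None:
        self.records: list[Record] = []

    def add(self, record: Record) -> None:
        self.records.append(record)

    def nearest(self, query: list[float], k: int = 3) -> list[Record]:
        # Linear scan: the cost grows with the number of stored records.
        # Vector databases replace this with specialized index structures.
        ranked = sorted(self.records, key=lambda r: math.dist(query, r.embedding))
        return ranked[:k]


naive_store = NaiveVectorStore()
naive_store.add(Record("travel", [0.9, 0.1], "Travel guidelines", {"title": "Travel"}))
naive_store.add(Record("expenses", [0.2, 0.8], "Expense policy", {"title": "Expenses"}))
naive_store.add(Record("hardware", [0.5, 0.5], "Hardware orders", {"title": "Hardware"}))

closest = naive_store.nearest([0.85, 0.2], k=2)
print([record.doc_id for record in closest])
```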

This image shows a diagram of the indexing process as it is described in the last two paragraphs.

Step 4: Search

The query is passed to the same embedding model that was used to create the embeddings. The result is a vector that represents the meaning of the query. It is very important to use the exact same model: different embedding models encode different meanings on each dimension, if they even have the same number of dimensions, so the coordinate of a document in one embedding model will be totally different from the one in another model. This vector is then passed to the vector database, which calculates the mathematical distance between the query vector and the stored embedding vectors. This distance is small when the vectors point in the same direction (close together, or very similar meaning) and large when they point in different directions (far apart, or a big difference in meaning). Sort of like “Where in the multidimensional AI’s brain is this document, where is the question, and how far apart are they?” The distance values are then sorted, and the few documents nearest to the query are returned from the vector database.
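One common way to measure whether two vectors “point in the same direction” is the cosine distance. The following sketch assumes that measure and ranks three invented toy embeddings against a query vector. Which distance function is actually used depends on your vector database and its configuration, so treat `cosine_distance` and `rank_by_distance` as illustrative helpers, not as what any particular database does internally.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity: 0 for the same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query: list[float], embeddings: dict[str, list[float]], k: int = 3):
    """Return the k (doc_id, distance) pairs closest to the query, nearest first."""
    scored = [(doc_id, cosine_distance(query, vec)) for doc_id, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Invented three-dimensional toy embeddings; real ones have far more dimensions
toy_embeddings = {
    "train-delays": [0.9, 0.1, 0.0],
    "expense-report": [0.1, 0.9, 0.2],
    "office-plants": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

ranking = rank_by_distance(query_vector, toy_embeddings, k=2)
for doc_id, distance in ranking:
    print(f"{doc_id}: {distance:.4f}")
```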

Found! Now what?

Technically, that’s it. You can now search for documents based on their meaning.

In a lot of cases, the full text of the found document and the question of the user will be passed to a Large Language Model (LLM) like GPT to formulate a full response that is then returned to the user. This is called the RAG pattern (Retrieval Augmented Generation). But as said, there is no need for that other than to get a more readable answer. You can also simply display the list of found documents ranked by their similarity (or distance), and have the user look for the piece of information in the document.
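The “augmentation” part of RAG is essentially string assembly: the retrieved documents and the question are combined into one prompt. The sketch below shows one possible way to do that. The function name `build_rag_prompt`, the prompt wording, and the sample texts are assumptions for illustration, and the actual call to the LLM is intentionally left out.

```python
def build_rag_prompt(question: str, documents: list[str]) -> str:
    """Combine the retrieved documents and the user's question into one prompt."""
    context = "\n\n".join(
        f"Document {index}:\n{text}" for index, text in enumerate(documents, start=1)
    )
    return (
        "Answer the question using only the documents below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


prompt = build_rag_prompt(
    "What should I do if I missed the last train?",
    ["Placeholder text of the first found document.", "Placeholder text of the second one."],
)
print(prompt)
```

Telling the model to rely only on the supplied documents is a common way to keep the answer grounded in your own content.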

A diagram of the RAG process, where the found document and the question are passed to an LLM which generates a natural-language answer based on the document.

A problem with our embeddings

The main problem with that approach is, like pretty much everything in the field of AI, the quality of the input data.

The AI embedding model that is used to create the vectors is trained on a very large corpus of data. The more data, the better. The more diverse the data, the better. These models are trained on Wikipedia, the world’s history, animals, plants, seas, stars, cities, maps and so on.

And now, please be honest with yourself: how much data do you have? And how diverse is your data really?

In most businesses, your documents have to do with your business. Not with astronomy, not with animals or plants. They all have to do with your business problems in your field of expertise. They are probably all written in a very formal style, by experts in their field. And to avoid confusing the reader, most authors will try to restrict their vocabulary to your domain-specific language and be very specific.

So, naturally, when you put your documents in the huge multidimensional brain of an AI embeddings model, they will probably all be very close to each other. And when you search for a document, you will get a lot of documents that are very similar to the query, but not necessarily the one you are looking for.

To make things worse, a query is usually a question. Your documents are usually a collection of facts, statements, numbers, and so on in your field of expertise. Questions about these documents and the documents themselves are, semantically, not very similar. So in general, while all documents are very close to each other, a query is by nature usually at least a bit further away from the bulk of your documents. This means that while you will get a lot of documents that are similar to your query, you might simply not get the specific one that contains the answer to your question, as it is still a needle in your haystack.

So this explains why our semantic search results often aren’t as good as they could be – at least initially.

Other issues...

In this article, I want to concentrate on the single issue I just outlined. There are, however, a lot of other variables that might greatly impact the quality of your semantic search results.

For one, the embedding model used can be a big factor. If the embedding model isn’t capable of creating a good semantic representation of the contents, this will also lead to inaccurate search results. This could be because the model isn’t trained enough on the language of your content: a model primarily trained on English content will not grasp the semantics of a German text as well as required. Similarly, if your use case involves contracts, the results achieved with a model trained on Wikipedia articles might not be as good as those of a model trained on legal texts.

So, for the sake of brevity, we assume that you already did a bit of research and tested which embedding model yields the best results, and committed to that specific embedding model. I am going to provide an additional article on how to do just that, but this is for another time.

Improve the embeddings

So, what can we do about our search not finding the correct tree in our forest? How can we improve the quality of our embeddings?

The main idea is to create the embeddings from content that is closer to our questions. And what is closer to a question about a specific topic than a document about this topic? Correct: Another question about this specific topic.

So we need a way to get questions for our documents to create the embeddings for these. Luckily, that is also what language models excel in. Understanding our content, and crafting questions about it that are answered in this specific piece of our document, is exactly what we need.

In this example we will use LangChain to load a lot of content in the form of markdown files from a directory (our knowledge base). Then we will use an LLM to transform each document into a few questions about its specific contents, and store these transformed documents in a vector database together with their embeddings. For this sample the database will be Chroma, as it runs locally and stores the data in files.

A diagram showing the HyQE (hypothetical question embedding) process as described above.

Creating questions and indexing them

First, the code:

				
# File: index.py

# Prepare the transformation chain
from langchain.chat_models.openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

transformation_llm = ChatOpenAI(model = "gpt-3.5-turbo", temperature = 0.7)
transformation_prompt = PromptTemplate(input_variables=["input"],
  template = "Create six different questions that the following document is going to answer. Each question should be about a different specific topic covered in the document. This is the document: {input}")
transformation_chain = LLMChain(llm = transformation_llm, prompt = transformation_prompt)

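# Function to post-process the content
import copy
from langchain.schema import Document

def content_postprocess(doc: Document) -> Document:
    # Keep the original text in the metadata; it is needed again after retrieval
    doc.metadata["original_content"] = copy.copy(doc.page_content)
    # Replace the content with the generated questions, which are what gets embedded
    doc.page_content = transformation_chain.run(doc.page_content)
    return doc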

# Load the documents
from langchain.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader

contents = DirectoryLoader(
        "../knowledgebase",
        glob = "**/*.md",
        loader_cls = UnstructuredMarkdownLoader).load()

docs = list(map(content_postprocess, contents))

# Prepare the store
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model = "text-embedding-3-large")
store = Chroma(embedding_function = embeddings, persist_directory = "./db")

# Index the documents
store.add_documents(docs)
store.persist()
				
			

The indexing is quite straightforward: First, we set up the chain that will run each document through our model to transform it into something that is more similar to the expected search query.

We then use the standard LangChain document loaders to load our markdown files from the directory and run each document through the transformation chain. The issue here is that this overwrites the original content, which we will need again after retrieval, so we store a copy of it in the metadata of the document.
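The “keep the original in the metadata” pattern can be tried out without an API key by replacing the LLM with a stub. In the sketch below, `SimpleDoc` is a hypothetical stand-in for LangChain's `Document`, and `fake_question_generator` is a placeholder for the transformation chain. Unlike the listing above, this variant returns a new object instead of modifying the input document, which is a design choice and not a requirement.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for LangChain's Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_question_doc(doc: SimpleDoc, generate_questions: Callable[[str], str]) -> SimpleDoc:
    """Return a new document whose content is the generated questions.

    The original text is preserved under metadata["original_content"].
    """
    return SimpleDoc(
        page_content=generate_questions(doc.page_content),
        metadata={**doc.metadata, "original_content": doc.page_content},
    )


def fake_question_generator(text: str) -> str:
    # Placeholder for the LLM call; a real run would invoke the transformation chain
    return f"What does this text of {len(text)} characters explain?"


source_doc = SimpleDoc("If you miss your train, call the travel desk.", {"source": "travel.md"})
question_doc = to_question_doc(source_doc, fake_question_generator)
print(question_doc.page_content)
print(question_doc.metadata["original_content"])
```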

Then we set up the vector store, and add the documents to it. The vector store will create the embeddings for each transformed document and store them in the database.

Retrieval

Now that we have our documents indexed, we can search for them. The code for that is also quite simple:

				
# Prepare the store
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model = "text-embedding-3-large")
store = Chroma(embedding_function = embeddings, persist_directory = "./db")

# Function to restore the original content after retrieval
def retransform_content(doc):
    original_content = doc.metadata.get("original_content")
    if original_content is not None:
        doc.page_content = original_content
    return doc


def search(query):
    "Searches and returns documents from our knowledge base as well as articles from our website. The input is a full question that asks for specific information."
    document_count = 3
    # The number of results is configured on the retriever via search_kwargs
    retriever = store.as_retriever(search_kwargs = {"k": document_count})
    docs = retriever.get_relevant_documents(query)
    docs = list(map(retransform_content, docs))
    return docs

found_docs = search("What should I do if I missed the last train?")
				
			

In this specific case, the document search is a simple function that queries the vector store for our documents. After it has returned the found documents, we restore the original content from the metadata (the page_content contains our questions and not the answer) and then return the documents.

The search function returns the top 3 documents that are most similar to the query. You can try to increase the number of documents to search for, but in our case we want to later pass the documents into another call to an LLM to generate an answer to the query, and with larger documents the result might be too large to fit into the context window of the LLM. This is why we limit the number of documents to 3.
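Instead of a fixed document count, the limit can also be expressed as a size budget. The following sketch assumes a character budget as a rough proxy for the context window; `select_within_budget` is a hypothetical helper, and a real implementation would more likely count tokens with the tokenizer of the model in use.

```python
def select_within_budget(documents: list[str], max_chars: int) -> list[str]:
    """Keep documents in ranked order until the character budget is used up."""
    selected = []
    used = 0
    for text in documents:
        # Stop at the first document that would exceed the budget,
        # so the ranking order of the results is preserved
        if used + len(text) > max_chars:
            break
        selected.append(text)
        used += len(text)
    return selected


ranked_documents = ["a" * 400, "b" * 500, "c" * 300]
fitting = select_within_budget(ranked_documents, max_chars=1000)
print(len(fitting))
```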

A diagram showing the retrieval operations as described and implemented above.

The results

In our projects, the search results over the transformed documents were much better than the ones over the original documents. The results were a lot more specific and more accurate. We did index our original documents as well as multiple different transformations of them, and had the search also return the relevance score of each document. The results were quite interesting.

On our own internal data, we have a document that describes what to do when you missed your connecting train and are “stranded at a train station”. In this test run, we didn’t get the distance, but a “relevance score”. This is 1 minus the distance, so a larger value is better in this case.

The relevance score of the original document was 0.8102, and it was only in second place. Another document that was also about travel, but not relevant to our question, was in first place with 0.8113.

The document that got transformed to questions was in first place with a relevance score of 0.8785. That is a huge difference. After that, another, much better document was in second place with a relevance score of 0.8195. The wrong document, which was in first place in the untransformed search, only came in third with 0.8091.
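Since the relevance score is simply 1 minus the distance, converting between the two is trivial. The short sketch below applies that conversion to the three scores quoted above for the transformed index. The helper names and the dictionary labels are made up for illustration.

```python
def relevance_from_distance(distance: float) -> float:
    """Convert a distance into a relevance score (larger is better)."""
    return 1.0 - distance


def distance_from_relevance(score: float) -> float:
    """Convert a relevance score back into a distance (smaller is better)."""
    return 1.0 - score


# Relevance scores quoted above for the search over the transformed documents
reported_scores = {
    "transformed target document": 0.8785,
    "second, better document": 0.8195,
    "unrelated travel document": 0.8091,
}

for label, score in sorted(reported_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: relevance {score:.4f}, distance {distance_from_relevance(score):.4f}")
```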

This is of course only a single example, but we tested with a lot of different queries and documents, and the results were always better with the transformed documents, and the difference was always quite significant. Also, please keep in mind that the scores are specific to the combination of embedding model and vector database used. Another model or another database with a different algorithm will lead to different values and also different ranges: Another model might already treat documents with a score of 0.6 and upwards as very close, a value that would be terribly far apart with the model used in this example.

To be able to tell the actual relevance from a relevance score in your specific environment, you should do some tests with pieces of content that are very similar, extremely different, and somewhere in the middle. The values that your setup gives you for these examples can give you a feeling for the informational value of the scores in your environment.
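Such a calibration run could look roughly like the sketch below. It scores one text pair per category and returns the values side by side. Note that `toy_embed` is only a word-counting stand-in so that the example runs without any model, and the reference sentences are invented. To learn something about your own environment, you would replace it with calls to your embedding model and use the scoring of your vector database.

```python
import math
from collections import Counter
from typing import Callable


def toy_embed(text: str) -> Counter:
    # Stand-in only: counts words. Replace with a call to your embedding model.
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def calibrate(pairs: dict[str, tuple[str, str]], embed: Callable = toy_embed) -> dict[str, float]:
    """Score one text pair per label to see which score range means what."""
    return {
        label: cosine_similarity(embed(first), embed(second))
        for label, (first, second) in pairs.items()
    }


reference_pairs = {
    "very similar": ("the train was cancelled", "the train was cancelled today"),
    "in between": ("the train was cancelled", "the flight was delayed"),
    "very different": ("the train was cancelled", "quarterly revenue grew strongly"),
}

calibration_scores = calibrate(reference_pairs)
for label, score in calibration_scores.items():
    print(f"{label}: {score:.3f}")
```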

Conclusion

If your semantic search results are not as good as you hoped for, it might be because the contents of the documents and the questions asked to retrieve them are not as similar as they could be. In this case, you might want to try transforming your documents into questions and using these transformed documents to create the embeddings for your search. This can make your search results a lot more accurate and specific to your use case.
