Hybrid Search Using PincodeDB

Hybrid Search:

Hybrid search combines dense vector search (for semantic matching) and sparse vector search (for keyword or syntactic matching) to deliver more relevant and comprehensive results. It lets you search by the meaning of the words (semantic search) and by exact keyword matches (syntactic search) at the same time.

Let's break down both components before we talk about hybrid search:

1. Dense Vector Search (Semantic Search)

Semantic search focuses on the meaning of words or phrases rather than the exact words themselves.
Dense vectors are used in semantic search. A dense vector is a high-dimensional representation of a word, phrase, or sentence in which most or all components are non-zero and each one contributes information. For instance, a 768-dimensional embedding (the output size of BERT-base) encodes nuanced relationships between pieces of text as positions in that high-dimensional space.
How it works: Semantic search converts text into vectors using an embedding model (such as BERT or Word2Vec). The system then compares these vectors, typically with a similarity measure such as cosine similarity, to find the closest matches. Vector databases are commonly used to store and index these embeddings.
Example: If you search for "king," the system can retrieve related words like "queen" or "monarch" because the vectors for these words are close in the embedding space.
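
The "king" example can be sketched in a few lines of Python. This is only an illustration: the 4-dimensional vectors below are invented for the example (real embeddings come from a trained model and have hundreds of dimensions), and `cosine_similarity` and `nearest` are hypothetical helper names, not part of any particular vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy embeddings (assumption): related words point in similar directions.
embeddings = {
    "king":    [0.90, 0.80, 0.10, 0.20],
    "queen":   [0.85, 0.75, 0.20, 0.30],
    "monarch": [0.80, 0.90, 0.15, 0.10],
    "banana":  [0.10, 0.05, 0.90, 0.80],
}

def nearest(query, k=2):
    """Return the k words whose vectors are most similar to the query word."""
    query_vector = embeddings[query]
    scored = [
        (word, cosine_similarity(query_vector, vector))
        for word, vector in embeddings.items()
        if word != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

for word, score in nearest("king"):
    print(f"{word}: {score:.3f}")
```

With these toy vectors, "queen" and "monarch" come out as the nearest neighbours of "king", while "banana" scores far lower.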

2. Sparse Vector Search (Syntactic Search)

Syntactic search (or keyword-based search) uses methods like TF-IDF, Bag of Words (BoW), or one-hot encoding to represent text as sparse vectors. These vectors are called sparse because most of their entries are zero; only a few non-zero values indicate the presence or frequency of specific words.
How it works: Sparse vectors represent documents or queries by indicating which words from a predefined vocabulary appear in the text. This approach focuses on matching exact words rather than understanding the meaning behind them.
Example: In a sparse vector representing a sentence, each word in the vocabulary corresponds to a unique position. If the word "machine" appears in the sentence, its position in the vector holds a value representing its frequency, while the positions of words that do not appear stay at zero.
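
A minimal Bag of Words sketch of that idea follows. The three documents, the whitespace tokenizer, and the function names `build_vocabulary` and `to_sparse_vector` are all assumptions made for illustration; a production system would use a proper tokenizer and usually TF-IDF or BM25 weights instead of raw counts.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign each distinct lowercase word a fixed position in the vector."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def to_sparse_vector(text, vocabulary):
    """Return {position: count}; positions that are absent are implicit zeros."""
    counts = Counter(w for w in text.lower().split() if w in vocabulary)
    return {vocabulary[word]: count for word, count in counts.items()}

# Hypothetical mini-corpus used only to define the vocabulary.
documents = [
    "machine learning improves search",
    "the machine reads the manual",
    "keyword search matches exact words",
]
vocabulary = build_vocabulary(documents)

# Words outside the vocabulary ("beats", "guessing") are simply ignored.
sparse = to_sparse_vector("machine learning beats machine guessing", vocabulary)
print(len(vocabulary), "vocabulary positions,", len(sparse), "non-zero entries")
print("count for 'machine':", sparse[vocabulary["machine"]])
```

Storing only the non-zero positions in a dictionary is what makes sparse vectors cheap to keep even when the vocabulary has many thousands of entries.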

3. How Hybrid Search Works

Hybrid search is designed to take advantage of both semantic and syntactic approaches, allowing for a richer and more comprehensive search experience.

Dense Vector Search (Semantic Component): Focuses on finding documents with semantically similar meanings to the query, even if the exact words don't match.
Sparse Vector Search (Syntactic Component): Focuses on retrieving documents that contain exact matches to the query’s keywords.

Combining the Two:

Weighting the two approaches: In hybrid search, the system allows you to assign different weights to semantic search and syntactic search. For example:
- You could assign 70% weight to the dense vector (semantic) search and 30% weight to the sparse vector (syntactic) search, depending on how important meaning is compared to exact keyword matching.
- This results in a final ranking of documents that considers both the meaning behind the query and the exact presence of keywords.
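
The weighting idea can be written as a small function. This is a sketch under the assumption that both scores are already on the same 0-to-1 scale; `hybrid_score` and `semantic_weight` are illustrative names, not the API of any specific database.

```python
def hybrid_score(semantic_score, keyword_score, semantic_weight=0.7):
    """Weighted sum of a dense (semantic) score and a sparse (keyword) score.

    semantic_weight is the share given to the semantic score; the keyword
    score receives the remainder, so the two weights always sum to 1.
    """
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    keyword_weight = 1.0 - semantic_weight
    return semantic_weight * semantic_score + keyword_weight * keyword_score

# Same document, three different weightings.
print(round(hybrid_score(0.6, 0.9), 3))                       # 70% semantic, 30% keyword
print(round(hybrid_score(0.6, 0.9, semantic_weight=0.5), 3))  # equal weights
print(round(hybrid_score(0.6, 0.9, semantic_weight=0.0), 3))  # keyword only
```

Setting the weight to 1 gives pure semantic search and setting it to 0 gives pure keyword search, so a single parameter covers the whole range between the two.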

Example of Hybrid Search in Action:

Let’s walk through hybrid search with a concrete, real-world scenario to clarify the concept and the calculation behind it.

Scenario: Finding the Best Recipe for Chicken Biryani

Imagine you are running a recipe app, and you’ve stored a huge collection of recipes. Now, users want to search for "best chicken biryani recipe." Some users will look for recipes by their ingredients, while others may prefer recipes based on user reviews or their cooking techniques.

In this case, your database has two kinds of vectors:

Dense vectors (semantic search): These are generated by an embedding model that converts the recipes into vectors based on their meaning (e.g., the recipe for Chicken Biryani is close to the recipe for Mutton Biryani because they both involve similar steps and spices).
Sparse vectors (keyword search): These are based on traditional keyword matching (e.g., "chicken biryani" matches recipes where these words appear frequently, but the match doesn’t account for meaning).

Now, let’s say a user is searching for the "best chicken biryani recipe" and you want to perform hybrid search, combining both semantic and keyword-based retrieval.

Step-by-Step Process

Sparse Vector Search (Keyword-Based Search):
- You search the database for recipes that match "chicken biryani" using exact keyword matching, like TF-IDF (Term Frequency-Inverse Document Frequency) or Bag of Words.
- Let’s assume the search finds 5 recipes with the following keyword relevance scores (higher scores mean better keyword matching):
  - Recipe A: Relevance score of 0.9
  - Recipe B: Relevance score of 0.8
  - Recipe C: Relevance score of 0.7
  - Recipe D: Relevance score of 0.5
  - Recipe E: Relevance score of 0.3
Dense Vector Search (Semantic Search):
- At the same time, a semantic search is performed based on the meaning of the query "best chicken biryani recipe."
- Your system may be using an embedding model such as Word2Vec or BERT, which maps the query and each recipe to vectors that capture overall meaning rather than exact wording.
- The semantic search retrieves the following recipes, but this time based on meaning:
  - Recipe D: Semantic similarity score of 0.9
  - Recipe B: Semantic similarity score of 0.85
  - Recipe A: Semantic similarity score of 0.7
  - Recipe E: Semantic similarity score of 0.6
  - Recipe F: Semantic similarity score of 0.4
Combining the Results (Hybrid Search):
- Now that you have two lists of results (one from the sparse search and one from the dense search), you need to combine them. You do this by assigning each method a weight that reflects its importance.
  - Let’s assume you give 70% weight to the semantic search and 30% weight to the keyword-based search.
- The formula to combine the scores can be: $\text{Final Score} = (\text{Semantic Score} \times 0.7) + (\text{Keyword Score} \times 0.3)$
- This assumes both scores are on the same scale (here, 0 to 1). Scores on different scales would need to be normalized before they are combined.

Breaking Down the Calculation

Let’s apply this formula to each recipe:

Recipe A:
- Semantic score = 0.7, Keyword score = 0.9
- Final score = (0.7 * 0.7) + (0.9 * 0.3) = 0.49 + 0.27 = 0.76
Recipe B:
- Semantic score = 0.85, Keyword score = 0.8
- Final score = (0.85 * 0.7) + (0.8 * 0.3) = 0.595 + 0.24 = 0.835
Recipe C:
- Semantic score = N/A (not returned by the dense search, so treated as 0), Keyword score = 0.7
- Final score = (0 * 0.7) + (0.7 * 0.3) = 0 + 0.21 = 0.21
Recipe D:
- Semantic score = 0.9, Keyword score = 0.5
- Final score = (0.9 * 0.7) + (0.5 * 0.3) = 0.63 + 0.15 = 0.78
Recipe E:
- Semantic score = 0.6, Keyword score = 0.3
- Final score = (0.6 * 0.7) + (0.3 * 0.3) = 0.42 + 0.09 = 0.51
Recipe F:
- Semantic score = 0.4, Keyword score = N/A (not returned by the keyword search, so treated as 0)
- Final score = (0.4 * 0.7) + (0 * 0.3) = 0.28 + 0 = 0.28

Final Results

After combining both semantic and keyword search, the recipes are ranked as follows:

Recipe B: 0.835
Recipe D: 0.78
Recipe A: 0.76
Recipe E: 0.51
Recipe F: 0.28
Recipe C: 0.21

Here, Recipe B is the top result because it performed well in both the semantic (meaning-based) and keyword-based searches. Recipe C, although it scored 0.7 in the keyword search, drops to last place because it wasn’t returned by the semantic search and therefore gets 0 for the component that carries 70% of the weight.
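
The whole calculation can be reproduced with a short script. It uses the scores from the two result lists in this example and makes the same simplifying assumption as the breakdown: a recipe missing from one list gets a score of 0 for that component. The function name `combine` is illustrative only.

```python
# Scores taken from the sparse (keyword) and dense (semantic) result lists.
keyword_scores = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.5, "E": 0.3}
semantic_scores = {"D": 0.9, "B": 0.85, "A": 0.7, "E": 0.6, "F": 0.4}

def combine(semantic_scores, keyword_scores, semantic_weight=0.7):
    """Merge two score dictionaries into one list ranked by weighted score."""
    keyword_weight = 1.0 - semantic_weight
    recipes = set(semantic_scores) | set(keyword_scores)
    combined = {
        recipe: semantic_weight * semantic_scores.get(recipe, 0.0)  # missing -> 0
        + keyword_weight * keyword_scores.get(recipe, 0.0)          # missing -> 0
        for recipe in recipes
    }
    # Highest final score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

ranking = combine(semantic_scores, keyword_scores)
for recipe, score in ranking:
    print(f"Recipe {recipe}: {score:.3f}")
```

Calling `combine(semantic_scores, keyword_scores, semantic_weight=0.3)` instead shows how the order shifts when keyword matches matter more than meaning.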

Conclusion: Hybrid Search in Action

In this hybrid search, we’re leveraging both dense vector (semantic) search and sparse vector (keyword-based) search to provide more comprehensive and meaningful search results. By adjusting the weight between semantic and keyword importance, we can fine-tune how much emphasis to place on either type of search based on the use case.

Machine learning guider
