Semantic Search With Vectors

Learn what it takes to rank well. Find out more about vectors and how search engines could evolve into a hybrid of keyword and vector search.

If you’ve been keeping up with the latest search news, you’ve probably heard of vector search.

You may even have started researching the subject, only to come away confused. Didn’t you leave math behind in college?

Building vector search is difficult. Understanding it doesn’t have to be.

Just as important is understanding that the likely future is not vector search on its own but hybrid search.

What Are Vectors?

When we talk about vectors in machine learning, we mean this: vectors are groups of numbers that represent something.

That something may be an image, a word, or almost anything else.

Naturally, the most important questions are why these vectors are valuable and how they’re created.

Let’s first look at where these vectors come from. The short answer is machine learning.

Jay Alammar has written perhaps the most insightful blog post on what vectors are.

As a recap, machine learning takes input items (let’s use words from now on) and attempts to find the formulas that best predict what comes next.

For instance, you might have a model that takes the word “bee” and tries to find the formulas that accurately predict that “bee” appears in contexts similar to “insects” or “wasps.”

Once the model has the best formula, it can turn the word “bee” into a group of numbers that is similar to the groups of numbers for “insects” or “wasps.”

Why Vectors Are Powerful

Vectors are powerful for the following reason: large language models such as Generative Pre-trained Transformer 3 (GPT-3) or Google’s models are trained on billions of words and sentences. That lets them start making these connections and become remarkably capable.

It’s not difficult to see why people are eager to apply that capability to search.

Some are even suggesting that vector search will replace the traditional keyword search we’ve known and loved for years.

However, the reality is that vector search isn’t replacing keyword search completely. To believe that keyword search won’t retain its immense value is to put too much faith in the new and shiny.

Keyword and vector searches have strengths and weaknesses, and work best when they complement each other.

Vector Search for Long Tail Queries

If you’re in the search industry, you’re probably acquainted with the long tail of search queries.

This concept, popularized by Chris Anderson to describe digital content, holds that a few items (for search, queries) are far more popular than everything else, but that there are many unique items that are still wanted by someone.

So it is with search.

A few queries (also known as “head” queries) are each searched very often, but most queries are searched only a handful of times, perhaps just once.

The numbers vary from site to site. However, on a typical website, around a third of the total search volume could come from only a few dozen queries, while nearly 50% of searches come from queries outside the top 1,000.
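If you want to see how head-heavy your own traffic is, you can measure the split from a query log. The following is a minimal sketch, not a prescribed method: the `head_share` function and the sample log are made up for illustration.

```python
from collections import Counter


def head_share(queries, top_n):
    """Fraction of total search volume that comes from the top_n most frequent queries."""
    counts = Counter(queries)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    head_volume = sum(count for _, count in counts.most_common(top_n))
    return head_volume / total


# Hypothetical query log: two head queries plus four one-off tail queries.
sample_log = (
    ["dress"] * 5
    + ["shoes"] * 3
    + ["mauve dress", "wristlet", "red scarf", "linen blazer"]
)

print(f"Top 2 queries: {head_share(sample_log, 2):.0%} of searches")
# Top 2 queries: 67% of searches
```

On a real log you would raise `top_n` to a few dozen or to 1,000 to compare your site against the figures above.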

Long tail queries tend to be longer, and some of them may even be natural language queries.

Research at my company, Algolia, found that 75% of all queries contain two or fewer words, and 95% contain four or fewer. To cover 99% of queries, you have to go all the way up to 13 words!

But long tail queries aren’t always long. They may simply be obscure. For a women’s fashion website, “mauve dress” could be a long tail query because people don’t ask for that color often. “Wristlet” could also be an uncommon query, even though the site does have bracelets for sale.

Vector search works well for long tail queries. It can recognize that wristlets have a lot in common with bracelets and surface bracelets without any synonyms being set up. It can show dresses in matching shades when someone searches for something mauve.

Vector search can also work well for long or natural language queries. “Something to keep my drinks cool” will surface refrigerators in a well-tuned vector search. With keyword search, you had better hope that text appears in the product description.

In other words, vector search improves the recall of search results: how many relevant results are found.

How Vector Search Works

Vector search does this by taking those groups of numbers we mentioned above and having the vector search engine ask, “If I were to graph these groups of numbers as lines, which ones would be closest together?”

One way to conceptualize this is to imagine groups with only two numbers. A group of [1,2] is going to be more similar to a nearby group such as [2,2] than it would be to the group [2,500].
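The two-number comparison can be checked with straight-line (Euclidean) distance. This is a minimal sketch that assumes [2,2] as the nearby group and uses Python’s built-in `math.dist`. Real engines may measure closeness differently, for example with cosine similarity.

```python
import math

a = [1, 2]
near = [2, 2]    # assumed nearby group, chosen for illustration
far = [2, 500]

dist_near = math.dist(a, near)
dist_far = math.dist(a, far)

print(dist_near)             # 1.0
print(dist_far > dist_near)  # True
```

The smaller the distance, the more similar the two groups are considered to be.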

(Of course, as vectors contain dozens or hundreds of numbers, they’re “graphed” in many dimensions, which isn’t so easy to visualize.)

This method of finding similarity is effective because the vectors that represent words such as “doctor” and “medicine” will be “graphed” far closer together than the vectors for “doctor” and “rock” would be.
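To put numbers on the “doctor” and “medicine” example, here is a minimal sketch using cosine similarity, one common way of measuring how closely two vectors point in the same direction. The three-number vectors are invented for illustration. Real ones come from a trained model and are much longer.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: near 1.0 means very similar direction."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)


# Made-up vectors; the values are assumptions, not output from a real model.
toy_vectors = {
    "doctor": [0.9, 0.8, 0.1],
    "medicine": [0.8, 0.9, 0.2],
    "rock": [0.1, 0.2, 0.9],
}

doctor_medicine = cosine_similarity(toy_vectors["doctor"], toy_vectors["medicine"])
doctor_rock = cosine_similarity(toy_vectors["doctor"], toy_vectors["rock"])

print(f"doctor vs. medicine: {doctor_medicine:.2f}")
print(f"doctor vs. rock:     {doctor_rock:.2f}")
```

With these toy values, “doctor” scores much closer to “medicine” than to “rock,” which is the behavior a vector search engine relies on.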

Downsides of Vector Search

However, there are some drawbacks to vector search.

The first is cost. All that machine learning we talked about earlier? It comes at a price.

For instance, storing vectors can be more costly than storing a keyword-based search index. Searching those vectors is also slower than keyword search in most cases.

Hashing can now address both of these issues.

Yes, this introduces another technical concept, but the fundamentals are quite easy to grasp.

Hashing performs a series of steps to transform a piece of data (like an object, string, or number) into a number that takes up much less space than the original data.

It turns out that we can also use hashing to shrink vectors while retaining what makes them useful: their ability to match conceptually similar items.

With hashing, we can make vector searches faster and have the vectors take up less space overall.
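One well-known family of techniques that fits this description is locality-sensitive hashing. The sketch below uses random hyperplanes to turn each vector into a short string of bits. It is shown only as an illustration and is not necessarily what any particular search engine does. The function names and toy vectors are made up.

```python
import random


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (as normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_vector(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its positive side, else 0."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        bits.append(1 if dot > 0 else 0)
    return bits


def hamming_distance(bits_a, bits_b):
    """Number of positions where the two bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


planes = make_hyperplanes(n_bits=64, dim=3)

# Made-up vectors: two that point in similar directions, one pointing the opposite way.
doctor = [0.9, 0.8, 0.1]
medicine = [0.8, 0.9, 0.2]
opposite = [-x for x in doctor]

h_doctor = hash_vector(doctor, planes)
h_medicine = hash_vector(medicine, planes)
h_opposite = hash_vector(opposite, planes)

print(hamming_distance(h_doctor, h_medicine))  # few differing bits
print(hamming_distance(h_doctor, h_opposite))  # every bit differs
```

Similar vectors end up with hashes that differ in only a few bits, so comparing hashes stands in for comparing the full vectors. The toy vectors here are only three numbers long, so they show the mechanism rather than the savings. With real vectors of hundreds of 32-bit numbers, a 64-bit hash is far smaller.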

The specifics are highly technical, but the most important thing is to realize that it’s feasible.

The Continued Usefulness of Keyword Search

None of this means that keyword search is no longer useful! It’s generally faster than vector search.

It is also simpler to understand why results are ranked the way they are.

Consider the query “texas,” with “Tejano” and “state” as possible word matches. “Tejano” is clearly the closer match from a keyword search standpoint. It’s much harder to say which would be closer from a vector search standpoint.

Keyword-based search treats “texas” as more similar to “Tejano” because it takes a text-based approach to finding records.

If a record contains words that are identical to those in the query (or close enough to account for typos), it is considered relevant and appears in the result set.
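Text-based closeness can be made concrete with edit distance, the number of single-character changes needed to turn one word into another. The sketch below is a simplified assumption about how typo-tolerant matching might work, not a description of any specific engine. The `keyword_match` helper and its `max_typos` parameter are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def keyword_match(query, document, max_typos=1):
    """True if every query word appears in the document within max_typos edits."""
    doc_words = document.lower().split()
    return all(
        any(edit_distance(word, doc_word) <= max_typos for doc_word in doc_words)
        for word in query.lower().split()
    )


print(edit_distance("texas", "tejano"))  # 3
print(edit_distance("texas", "tejano") < edit_distance("texas", "state"))  # True
print(keyword_match("texaz boots", "Handmade Texas leather boots"))  # True
```

By this measure, “texas” is textually closer to “tejano” than to “state,” even though a person might consider the state more relevant.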

In other words, keyword search focuses on precision: making sure that the results that come back are highly relevant, even if there are fewer of them.

Keyword Search Is Beneficial for Head Queries

This is why keyword search is extremely effective for head queries: the queries that are searched most often.

Head queries are generally shorter and easier to optimize for. This means that if a keyword fails to match the right text in a record for whatever reason, the chances are that it will be caught through analytics, and you can add a synonym.

Since keyword search works best for head queries and vector search works best for long tail queries, the two are most effective when used together.

This is known as hybrid search.

Hybrid search is when a search engine uses both keyword and vector search for a single query and ranks the records correctly, regardless of which search approach surfaced them.

Ranking Records Across Search Sources

Ranking records that come from two different sources is not easy.

The two approaches have inherently different ways of scoring records.

Vector search returns a score, but some keyword-based engines don’t. Even when a keyword-based engine does return a score, there’s no guarantee that the two scores are equivalent.

If the scores aren’t equivalent, you can’t say that a score of 0.8 from the keyword engine is more relevant than a score of 0.79 from the vector engine.
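One common workaround, offered here as an illustration and not as a full solution, is to rescale each engine’s scores to the same 0-to-1 range before comparing them. The scores and record names below are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a {record: score} mapping so the lowest score is 0.0 and the highest 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All scores are identical, so treat every record as equally relevant.
        return {record: 1.0 for record in scores}
    return {record: (score - low) / (high - low) for record, score in scores.items()}


# Hypothetical raw scores: the two engines use very different scales.
keyword_scores = {"doc_a": 12.0, "doc_b": 7.5, "doc_c": 3.0}
vector_scores = {"doc_b": 0.79, "doc_d": 0.74, "doc_a": 0.60}

normalized_keyword = min_max_normalize(keyword_scores)
normalized_vector = min_max_normalize(vector_scores)

print(normalized_keyword)
print(normalized_vector)
```

This only puts the numbers on the same scale within each result set. It does not guarantee that a normalized 0.8 from one engine means the same thing as a 0.8 from the other, and a single outlier score can skew the whole range.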

Another option is to run all results through the scoring of a single engine, either the vector engine or the keyword engine.

This has the benefit of gaining the vector engine’s extra recall, but it also has some drawbacks. The additional results surfaced by the vector engine won’t be judged relevant by keyword scoring; otherwise, they would already have been in the keyword result set.

Alternatively, you could run all the results, keyword or otherwise, through vector scoring; however, this can be expensive and slow.

Vector Search as a Fallback

That’s why some search engines don’t even try to blend the two and instead always display keyword results first and vector results second.

The thinking is that if a keyword search yields no results or only a few, you can fall back to the results of a vector search.
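The fallback logic can be sketched in a few lines. This is a minimal illustration under stated assumptions: `search_with_fallback`, its `min_results` threshold, and the two toy engines with canned results are all made up for this example.

```python
def search_with_fallback(query, keyword_search, vector_search, min_results=3):
    """Keyword results first; vector results are appended only when keyword results are scarce."""
    results = list(keyword_search(query))
    if len(results) >= min_results:
        return results
    seen = set(results)
    for record in vector_search(query):
        if record not in seen:  # skip records the keyword engine already returned
            results.append(record)
            seen.add(record)
    return results


# Stand-in engines with canned results, purely for illustration.
def toy_keyword_search(query):
    canned = {"bracelet": ["gold bracelet", "silver bracelet", "charm bracelet"]}
    return canned.get(query, [])


def toy_vector_search(query):
    canned = {
        "bracelet": ["charm bracelet", "beaded anklet"],
        "wristlet": ["gold bracelet", "silver bracelet"],
    }
    return canned.get(query, [])


print(search_with_fallback("bracelet", toy_keyword_search, toy_vector_search))
# ['gold bracelet', 'silver bracelet', 'charm bracelet']
print(search_with_fallback("wristlet", toy_keyword_search, toy_vector_search))
# ['gold bracelet', 'silver bracelet']
```

Note that the two sets of results are never ranked against each other. Vector results simply fill the space that keyword results leave empty.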

Remember that vector search is geared toward improving recall, that is, finding more results. Therefore, it might surface relevant results that keyword search could not.

It’s a decent interim solution but isn’t the future of hybrid search.

True hybrid search ranks results from multiple search sources in the same result set by giving each record a score that is comparable across all sources.

There’s a lot of research into this approach today, but only a few are doing it well and making their engines publicly available.
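One published way to place results from several engines into a single ranking is reciprocal rank fusion (RRF). It ignores raw scores and uses only each record’s position in each list. It is shown here as an illustration, not as the method any particular engine uses. The constant `k=60` is a commonly used default, and the record names are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each record earns 1 / (k + rank) from every list it appears in."""
    fused = {}
    for ranking in rankings:
        for rank, record in enumerate(ranking, start=1):
            fused[record] = fused.get(record, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(fused, key=lambda record: fused[record], reverse=True)


# Hypothetical ranked results from each engine, best match first.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Records that both engines rank highly rise to the top, while a record found by only one engine can still appear in the final list.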

What Does This Mean for You?

The best thing to do right now is probably to sit tight and keep up with the latest developments in the field.

Hybrid search engines that combine keywords and vectors will arrive in the coming years and be accessible to teams without data scientists.

In the meantime, keyword search remains valuable and will only improve when vector search is added to it later.