Makes one wonder what other use cases are lurking that would need just another small modification and haven't even been thought of yet because they used to be impossible to implement.
If I'm searching for something whose words have a more common meaning than the context I care about, then exact matching (of my carefully crafted search term) performs much better than vector search.
Not every query is looking for the average result.
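To make that concrete, here is a toy sketch (corpus and query invented for illustration) of plain boolean exact matching over an inverted index: a carefully chosen rare term pins down exactly the document you want, with no ranking model deciding what you "meant".

```python
from collections import defaultdict

# Toy corpus: doc 2 uses "jaguar" in the rare (wildlife) sense,
# the others in the common (car) sense.
docs = {
    0: "jaguar unveils new electric car model",
    1: "jaguar sports car review and pricing",
    2: "camera trap survey of jaguar territory in the pantanal",
}

# Simple inverted index: token -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[token].add(doc_id)

def exact_and_search(query):
    """Return ids of docs containing *every* query token (boolean AND)."""
    tokens = query.split()
    result = index[tokens[0]].copy()
    for token in tokens[1:]:
        result &= index[token]
    return sorted(result)

# The rare disambiguating term does all the work.
print(exact_and_search("jaguar territory"))  # [2]
```

No averaging, no embedding: the query either matches or it doesn't, which is exactly what you want when your term is deliberately unusual.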
So how would you construct an interface specifically designed for a vector search?
What I've seen:
- Vector search without good models does not tend to perform that well. I've seen comparisons where off-the-shelf free models struggle to keep up with simple text search and a few manually tuned queries. Many companies use those as a starting point but end up investing in their own models. BM25, a classic text-ranking algorithm, provides a pretty solid baseline for a lot of use cases.
- Building good models is typically left as an exercise to the reader by those who provide vector search engines. These engines are great for comparing vectors once you have them. Getting good vectors, however, is a bit of a dark art, and it's actually the hard part of the problem. Using a vector search engine is easy; getting good vectors isn't.
- Building good models to get good vectors requires a lot of expertise and skill. And not just technical skills. For example, understanding and building good processes for evaluating your search performance is not something they teach people in universities. I know some people and companies that can do this; they are not cheap (or bored). Cutting corners here leads to predictably meh results.
- The other thing you need is lots of data. The free open stuff that everybody else trains on as well is nice as a start but generally not good enough. That's why the likes of Google and other big tech companies are so casual about sharing algorithms. The algorithms are worthless without data, and they're mostly not sharing the data.
- Implementing vector search can be expensive. Basically it's a function of hardware, people, and time. It takes ages to train models, and it requires people who understand how to do that. You can speed it up with really expensive hardware. If your people make mistakes (because they don't know what they are doing), you'll burn through a lot of time and hardware budget.
- Most startups or smaller companies don't really have the level of funding needed to do a proper job. Hence a lot of startups are a bit hand-wavy about doing something-something-AI vector search on some beautiful slides. When you scrutinize these companies, it usually turns out they have a (very) junior data "scientist" fresh out of college who has heard a thing or two about how these things might actually work, and not a whole lot else.
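The BM25 baseline mentioned above is simple enough to sketch in a few lines. This is a minimal Okapi BM25 scorer over a toy corpus (corpus, query, and the parameter values k1=1.5, b=0.75 are my illustrative choices; the formula itself is the standard one):

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # standard BM25 free parameters

docs = [
    "cheap flights to berlin",
    "berlin travel guide and city tips",
    "how to train large embedding models",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N
# Document frequency: in how many docs does each term appear?
df = Counter(t for d in tokenized for t in set(d))

def bm25_score(query, doc_tokens):
    """Okapi BM25 score of one document for a whitespace-tokenized query."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query.split():
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        denom = tf[term] + K1 * (1 - B + B * len(doc_tokens) / avgdl)
        score += idf * tf[term] * (K1 + 1) / denom
    return score

ranked = sorted(range(N),
                key=lambda i: bm25_score("berlin flights", tokenized[i]),
                reverse=True)
print(ranked)  # [0, 1, 2]: the doc matching both query terms wins
```

A few dozen lines, no training, no GPUs, and it already gives you term-frequency weighting, rare-term boosting via IDF, and length normalization. That's the bar a custom embedding model has to clear to be worth the investment.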
I've seen some companies doing this stuff properly. Some startups, even. But not a lot. Sometimes you get the right mix of people, knowledge, and ideas.
A vector embedding chooses a single meaning for my search terms, and if that single meaning is wrong (because I'm not the average person), then I'll struggle to get relevant results.
I guess you can use context to do the mapping, but the rarer the thing I'm trying to find, the less likely that is to work?
Note this happens at both ends: in parsing the query, and in parsing and indexing the original web page.
I suppose if it misinterprets both the query and the page you might get a hit, but then the result you want might be on page 700.
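The "single meaning" failure mode can be shown with a toy example. The 2-d "embeddings" below are hand-made purely for illustration (no real model produces them): one axis loosely means "cars", the other "wildlife". Because the query gets one vector, the model has to commit to a blend of meanings, and if it leans toward the common sense of an ambiguous word, the rare-sense page sinks.

```python
import math

# Hand-made toy vectors: axis 0 ~ "cars", axis 1 ~ "wildlife".
doc_vecs = {
    "jaguar car review": (0.95, 0.05),
    "jaguar dealership near me": (0.90, 0.10),
    "jaguar habitat in the amazon": (0.10, 0.90),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# One query, one vector. Suppose the model resolves "jaguar" mostly
# to the common (car) sense:
query_vec = (0.8, 0.2)

ranked = sorted(doc_vecs,
                key=lambda d: cosine(query_vec, doc_vecs[d]),
                reverse=True)
print(ranked)  # the wildlife page ranks last
```

With exact matching I could have typed "habitat" and been done; here the embedding's choice of meaning decides for me, and there's no term I can add that's guaranteed to override it.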
There is nothing more annoying than using a search term you know should be pretty defining, only to find the engine substituting a much more common term for it.
It's a bit like the Google equivalent of MS Clippy: "you appear to be searching for ..."