Jina AI launches open-source 8k text embedding (opens in new tab)

(jina.ai)

563 pointsartex_xh2y ago201 comments

201 comments

I'm always happy to see OSS contributions but I don't quite understand why this model is so remarkable. As the leaderboard suggests it's ranking lower than OpenAI embeddings, while 14 other contributions are even better than that. Many of which feature a comparable or lower dimensionality than 768.

The 8k context window is new, but isn't the 512 token limitation a soft limit anyway? I'm pretty sure I can stuff bigger documents into BGE for example.

Furthermore, I think that most (all?) benchmarks in the MTEB leaderboard deal with very small documents. So there is nothing here that validates how well this model does on larger documents. If anything, I'd pick a higher ranking model because I put little trust in one that only ranks 17th on small documents. Should I expect it to magically get better when the documents get larger?

Plus, you can expect that this model was designed to perform well on the datasets in MTEB while the OpenAI model probably wasn't.

Many also stated that a 8k context embeddings will not be very useful in list situations.

When would anyone use this model?

larodi2y ago

Potentially useful for paragraph embedding, where... well, paragraphs can grow a lot. Not sure how this model fares in comparison to other embedding engines (yet), but I can definitely tell you mpnet models fare much better for paragraph embeddings than the leader in HF's leaderboard (being thenlper/gte-large at time of writing).

I can guess the Davinci and similar embeddings work better for code than MPNET and it really matters what you are encoding, not only the context length. What features are actually being extracted by the emb.engine.

infecto2y ago

I have been trying to understand the hype as well. Happy to see all the work happening in this space still.

I was pretty curious about the context limit. I am not an expert in this area but I always thought the biggest problem was the length of your original text. So typically you might only encode a sentence or a selection of sentences. You could always stuff more in but they you are potentially losing the specificity, I would think that is a function of the dimensionality. This model is 768, are they saying I can stuff 8k tokens worth of text and can utilize it just as well as I have with other models on a per 1-3 sentence level?

infecto2y ago

Thinking about it some more as I read through more comments. I guess in the stated case of research papers it can make sense if your task is looking for the common themes and not specific details. If you are embedding a sentence or a paragraph you miss out on the connection between those sentences across the whole paper...or at least its harder to manage that. By encoding a large number of pages from the paper (or the entire paper) you can hopefully do a better job of capturing the theme of that paper.

This also opens up another question though, how would that compare to using a LLM to summarize that paper and then just embed on top of that summary.

1 more reply

egorfine2y ago

I fail to imagine a 8k-token-length piece of text that has just one single semantic coordinate and is appropriate for embedding and vector search.

In my experience, any text is better embedded using a sliding window of a few dozen words - this is the approximate size of semantic units in a written document in english; although this will wildly differ for different texts and topics.

simonw2y ago

What are you using those embeddings for?

I can see a sliding window working for semantic search and RAG, but not so much for clustering or finding related documents.

1 more reply

theptip2y ago

> The 8k context window is new

Hasn’t Claude had this for many months (before they bumped to 100k)?

Edit: ah, you mean new for OSS maybe?

simonw2y ago

Claude is a large language model, which is a different thing from an embedding model.

2 more replies

burcs2y ago

This is great news!

It feels like open-source is closing the gap with "Open"AI which is really exciting, and the acceleration towards parity is faster than more advancements made on the closed source models. Maybe it's wishful thinking though?

udev40962y ago

Is it tho? It's not really open source if they don't give us the information regarding training datasets

jerpint2y ago

It definitely is open source even if they don’t disclose all details behind the training

3 more replies

infecto2y ago

Wishful thinking? Embeddings to me were never the interesting or bleeding edge thing at OpenAI. Maybe the various ada models at one point reigned supreme but there have been open-source models at the top of the leaderboard for a while and from a cost/performance perspective, often even the Bert models did a really fine job.

Gasp0de2y ago

They compare it to OpenAI's ada model though, which is light-years away from ChatGPT.

simonw2y ago

Don't confuse the current Ada embedding model the old Ada GPT3 model.

It turns out OpenAI have used the name "Ada" for several very different things, purely because they went through a phase of giving everything Ada/Babbage/Curie/DaVinci names because they liked the A/B/C/D thing to indicate which of their models were largest.

infecto2y ago

Does that not conflate two different things though? Embedding model != LLM Model ?

jncraton2y ago

This is great to see. It looks like the size of the embedding vector is half the size of text-embedding-ada-002 (768 vs 1536) while providing competitive performance. This will save space in databases and make lookups somewhat faster.

For those unaware, if 512 tokens of context is sufficient for your use case, there are already many options that outperform text-embedding-ada-002 on common benchmarks:

https://huggingface.co/spaces/mteb/leaderboard

minimaxir2y ago

The 768D-sized embeddings compared to OpenAI's 1536D embeddings are actually a feature outside of index size.

In my experience, OpenAI's embeddings are overspecified and do very poorly with cosine similarity out of the box as they match syntax more than semantic meaning (which is important as that's the metric for RAG). Ideally you'd want cosine similarity in the range of [-1, 1] on a variety of data but in my experience the results are [0.6, 0.8].

TeMPOraL2y ago

Unless I'm missing something, it should be possible to map out in advance which dimensions represent syntactic aspects, and then downweigh or remove them for similarity comparisons. And that map should be a function of the model alone, i.e. fully reusable. Are there any efforts to map out the latent space of ada models like that?

karxxm2y ago

You wrote „out of the box“, did you find a way to improve this?

1 more reply

andy992y ago

What is the use case for an 8k token embedding? My (somewhat limited) experience with long context models is they aren't great for RAG. I get the impression they are optimized for something else, like writing 8k+ tokens rather than synthesizing responses.

Isn't the normal way of using embedding to find relevant text snippets for a RAG prompt? Where is it better to have coarser retrieval?

dragonwriter2y ago

> What is the use case for an 8k token embedding?

Calculating embeddings on larger documents than smaller-window embedding models.

> My (somewhat limited) experience with long context models is they aren't great for RAG.

The only reason they wouldn't be great for RAG is that they aren't great at using information in their context window, which is possible (ISTR that some models have a strong recency bias within the window, for instance) but I don't think is a general problem of long context models.

> Isn't the normal way of using embedding to find relevant text snippets for a RAG prompt?

I would say the usual use is for search and semantic similarity comparisons generally. RAG is itself an application of search, but its not the only one.

3abiton2y ago

I wonder how the perfomance fair when context size is increased. Intuitively this should be higher, but some quantized models I've tested showed noticeably worst performance.

1 more reply

teaearlgraycold2y ago

You could get a facsimile to a summary for a full article or short story. Reducing an 8k token article to a summary using a completions model would cost far more. So if you need to search through collections of contracts, scientific papers, movie scripts, etc. for recommendations/clustering then bigger input sizes can do that in one shot.

Think of it like skipping the square root step in Euclidean distance. Perfectly valid as long as you don’t want a distance so much as a way to compare distances. And doing so skips the most computationally expensive operation.

refulgentis2y ago

I think I'm missing something: like, yeah, it's vector search for bigger text chunks. But arguably vector search with bigger text chunks is _definitively_ worse -- this isn't doing summarization, just turning about 25 pages of text to 1024 floats, which you then can use cosine similarity to measure the semantic similarity to other text

I'd much rather know what paragraph to look in than what 25 pages to look in

3 more replies

kristopolous2y ago

Is this what you mean by RAG? https://www.promptingguide.ai/techniques/rag?

simonw2y ago

I have an explanation of RAG in the context of embeddings here: https://simonwillison.net/2023/Oct/23/embeddings/#answering-...

1 more reply

teaearlgraycold2y ago

Yes

egorfine2y ago

One thing that is missing in comparison: OpenAI's model is multilingual.

And not only it supports and embeds a variety of languages, it also computes the same coordinates for the same semantics in different languages. I.e. if you embed "russia is a terrorist state" and "россия - страна-террорист", both of these embeddings will have almost the same coordinates.

jimmySixDOF2y ago

I heard one of the developers on a regular Open Source DIY AI X/twitter space [1] & they are targeting two new models German/English and French/English for the next release

https://x.com/thursdai_pod

m3kw92y ago

I don’t really know what that means but it seems useful

do-me2y ago

Just quantized the models for onnx usage in e.g. transformers.js and got 4x reduced file size:

- 𝟐𝟖.𝟓 𝐌𝐁 jina-embeddings-v2-small-en (https://huggingface.co/do-me/jina-embeddings-v2-small-en)

- 𝟏𝟎𝟗 𝐌𝐁 jina-embeddings-v2-base-en (https://huggingface.co/do-me/jina-embeddings-v2-base-en)

However, I noted, that the base model is performing quite poorly on small text chunks (a few words) while the small version seems to be unaffected. Might this be some kind of side effect due to the way they deal with large contexts?

If you want to test, you can head over to SemanticFinder (https://do-me.github.io/SemanticFinder/), go to advanced settings, choose the Jina AI base model (at the very bottom) and run with "Find". You'll see that all other models perform just fine and find "food"-related chunks but the base version doesn't.

Havoc2y ago

Why quantize something that is already very small (270mb)?

pietz2y ago

Just making up stuff here, but smaller models are great for serverless compute like functions, which would also benefit from lighter computation. Don't forget, some people are dealing with hundreds of millions of documents. Accelerating this by 4x may be worth a small performance hit.

simonw2y ago

I just shipped a new llm-embed-jina plugin for my LLM tool which provides access to these new Jina models: https://github.com/simonw/llm-embed-jina

Here's how to try it out.

First, install LLM. Use pip or pipx or brew:

    brew install llm

Next install the new plugin:

    llm install llm-embed-jina

You can confirm the new models are now available to LLM by running:

    llm embed-models

You should see a list that includes "jina-embeddings-v2-small-en" and "jina-embeddings-v2-base-en"

To embed a string using the small model, run this:

    llm embed -m jina-embeddings-v2-small-en -c 'Hello world'

That will output a JSON array of 512 floating point numbers (see my explainer here for what those are: https://simonwillison.net/2023/Oct/23/embeddings/#what-are-e...)

Embeddings are only really interesting if you store them and use them for comparisons.

Here's how to use the "llm embed-multi" command to create embeddings for the 30 most recent issues in my LLM GitHub repository:

    curl 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
    | jq '[.[] | {id: .id, title: .title}]' \
    | llm embed-multi -m jina-embeddings-v2-small-en jina-llm-issues - \
    --store

This creates a collection called "jina-llm-issues" in a default SQLite database on your machine (the path to that can be found using "llm collections path").

To search for issues in that collection with titles most similar to the term "bug":

    llm similar jina-llm-issues -c 'bug'

Or for issues most similar to another existing issue by ID:

    llm similar jina-llm-issues 1922688957

Full documentation on what you can do with LLM and embeddings here: https://llm.datasette.io/en/stable/embeddings/index.html

Alternative recipe - this creates embeddings for every single README.md in the current directory and its subdirectories. Run this somewhere with a node_modules folder and you should get a whole lot of interesting stuff:

    llm embed-multi jina-readmes \
      -m jina-embeddings-v2-small-en \
      --files . '**/README.md' --store

Then search them like this:

    llm similar jina-readmes -c 'backup tools'

simonw2y ago

Wrote this up on my blog: https://simonwillison.net/2023/Oct/26/llm-embed-jina/

bosky1012y ago

The only feedback I had from your embedding post was

    wish we could create the array of floating points without openai

Great timely turnaround time, good sir. Ht

X6S1x6Okd1st2y ago

Thank you so much for all the work you've put into llm!

jillesvangurp2y ago

Thanks, this is wonderfully simple to use. Just managed to package this up using docker and was able to use it without a lot of drama. Nice how simple this is to use.

I've dabbled a bit with elasticsearch dense vectors before and this model should work great for that. Basically, I just need to feed it a lot of content and add the vectors and vector search should work great.

michalmatczuk2y ago

FYI it seems that llm install llm-embed-jina is missing yaml dependency

  File "/opt/homebrew/Cellar/llm/0.11_1/libexec/lib/python3.12/site-packages/llm/default_plugins/openai_models.py", line 17, in <module>
    import yaml

ModuleNotFoundError: No module named 'yaml'

simonw2y ago

Thanks! I wonder if the Python 3.12 upgrade broke something.

The pyyaml package is correctly listed on the formula page though: https://formulae.brew.sh/formula/llm

dazzaji2y ago

Excellent! And you were just saying how risky it is to rely long-term on OpenAI text embeddings in your post on the topic. The timing for this open source option worked out nicely.

mike_ivanov2y ago

JFYI, this is what happens on my M1 Macbook:

$ brew install llm $ llm ModuleNotFoundError: No module named 'typing_extensions'

Not sure where to report it.

simonw2y ago

Whoa, that is a weird one. Do you know what version of Python you have from Homebrew?

It looks like that package is correctly listed in the formula: https://github.com/Homebrew/homebrew-core/blob/a0048881ba9a2...

1 more reply

IanCal2y ago

Probably not this, but check with `which llm` what that's running. I had weird issues not matching the documentation but just had some other random python cli tool called llm I'd put in my home bin for and forgotten about it.

1 more reply

marinhero2y ago

How well do LLMS like this work with a non-English language? Or are these open source models limited to English?

simonw2y ago

Quite a few of the top ranked models on this leaderboard are multilingual: https://huggingface.co/spaces/mteb/leaderboard

https://huggingface.co/BAAI/bge-large-en-v1.5 FlagEmbedding for example describes itself as covering Chinese and English.

anigbrowl2y ago

Stability has a Japanese port which is getting lots of work https://twitter.com/StabilityAI_JP/status/171699857824440759...

m3at2y ago

This is not an embedding model though. Yes you can always extract some embeddings from somewhere, but for most LLMs those won't perform well for retrieval (which makes sense as it's not what the models are optimizing for)

1 more reply

ttul2y ago

That depends on whether the training data contained languages other than English.

omneity2y ago

Impressive work.

I wonder what would be the best way to use 8k embeddings. It’s a lot of information to keep in a vector, so things like “precision” of the embedding space and its ability to distinguish very similar large documents will be key.

Maybe it can be useful for coarse similarity matching, for example to detect plagiarism?

sroussey2y ago

8K is the context length. Their vector dimension size is actual much smaller, which is great for a number of use cases, though maybe not the ones you are thinking about.

omneity2y ago

Yes that’s also how I understood it. Maybe it was ambiguously expressed, but I mean “8k tokens as input is a lot of information to encode”

moralestapia2y ago

Ada is one of the (if not the) worst model offered by OpenAI, though ...

simonw2y ago

You're thinking of the old "ada" GPT-3 model - the one that was a companion to "davinci" and "babbage".

I believe "text-embedding-ada-002" is entirely unrelated to those old GPT-3 models. It's a recent embedding model (released in December 2022 - https://openai.com/blog/new-and-improved-embedding-model ) which OpenAI claim is their best current best available embedding model.

I understand your confusion: OpenAI are notoriously bad at naming things!

moralestapia2y ago

Oh, thanks for clarifying!

Edit: looking at the press release, the improvement over old Ada is ... marginal? And Ada-01 is/was a poor performing model, tbh. I guess I'll have to run some tests, but at first sight it doesn't seem that wow-ey.

1 more reply

luke-stanley2y ago

When I go to this leaderboard: https://huggingface.co/spaces/mteb/leaderboard I click on the "Classification" tab, then I see "jina-embeddings-v2-base-en" at number 12, with an average score of 73.45. But the highest scoring model there is llmrails/ember-v1 with 75.99 average score but it only supports 512 tokens, so if you need 8K tokens to be embedded, I guess they are the best. Do people need 8K of tokens for embedding? Maybe not but they might need more than 512 often enough. It could save a summary extraction step.

cztomsik2y ago

Small context window means you cannot embed the whole document, you are embedding just a part.

So, if there is some information at the bottom which is dependent on something which is at the top, your embedding could be entirely wrong.

woofwoofwoof2y ago

Just noticed that they (jina.ai) have offices both in Berlin and China. I am wondering how they will they operate with the presence of chip export restrictions and other side effects of USA / China tensions.

1 more reply

itronitron2y ago

It's weird to think there are entire companies built around providing access to a pre-computed vector space model.

nicognaw2y ago

Jina AI itself is also a great framework to expose APIs from deep neural net models and deploy them to Kubernetes clusters, which I think is very promising, but they didn't get as much hype as I predicted that they deserved.

dcastm2y ago

I wonder how much better is this, compared to taking the average ( or some other aggregation) of embeddings with a smaller context length. Has anyone done a similar comparison?

pietro72ohboy2y ago

The issue with averaging is that over large inputs, it drowns out small signal. For example, there is a chance that it completely loses a reference to something made only in a single sentence somewhere in a large document.

Kutsuya2y ago

this is super cool! I wish there was an easy to understand and follow guide on how to make your own embedding, for llama2 for example. All I can find are various guides that already assume you know everything there is to training an embedding.

I just want to make an embedding between a conversation of me and my friend and simulate talking to them. Is this a hard thing to train to begin with?

If anyone knows or could help me with this, I would be very grateful!

infecto2y ago

I will butcher this so if any experts see this please don't flame me. I think you might be conflating ideas? You could definitely fine-tune existing embedding models or train your own from scratch but the goals of embeddings models are different than a LLM conversation. Embedding models are used for things like, classifying, search, image captioning...maybe at a high level anything where you have high dimensionality that you need to condense?

What you are asking for sounds like fine tuning an existing LLM...where the data will be tokenized but the outcomes are different? There is a lot of writeups on how people have done it. You should especially follow some of the work on Huggingface. To replicate talking to your friend though, you will need a very large dataset to train off of I would think and its unclear to me if you can just fine-tune it or you would need to train a model from scratch. So a dataset with 10s of thousands of examples and then you need to train it on a GPU.

https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehe...

Kutsuya2y ago

Thank you for sending this. It's still quite puzzling to me if it's actually possible or not. Maybe what I want to train is a style? But then again, it should also remember other important things related to the friend..

1 more reply

tayo422y ago

You can't fine tune without using their library tied to their cloud? Did I misunderstand? Do you need fine tune?

1 more reply

backendEngineer2y ago

oh thank god I first read Jira...

eshack942y ago

You're not the only one... glad I misread that.

pknerd2y ago

Pardon my ignorance in advance but could it be used to "chat" with PDFs and websites? I am looking for OpenAI alternatives as I am in learning phase

lofties2y ago

No. “Chatting with PDFs” is (mostly) taking a users chat message, retrieve relevant content via e.g embedding search, then feed that into an LLM with a prompt that’s something along the lines of “given this information, can you answer this question”.

This tool helps with embedding part.

I’ve built a bunch of ”chat with your PDFs” bots, do reach out if you have any questions me at brian.jp.

pknerd2y ago

Actually I wanna use langchain. OpwnAI is not free. I wanted to test two use cases:

- chat with documents(pdf, doc etc)

- chat with website. Like, if I integrate with an ecommerce site, I can ask questions from the website. What options do I have in free for both cloud and locally?

clarkmcc2y ago

Check out my little side project for chatting with PDFs. You should be able to load most models including this one. https://github.com/clarkmcc/chitchat

pknerd2y ago

This looks cool so can it be used to feed Website/Products data in CSV/JSON format and "chat" with it?

1 more reply

canadaduane2y ago

No, this is an embedding model, not a text completion model.

seydor2y ago

using the bing tab of microsoft edge browser, you can chat with PDFs and i think they use GTP4 or equivalent

Nitrolo2y ago

Is there something like oobabooga to easily run this in a click-and-run way? Where I can load up a model, a text, and ask it questions?

simonw2y ago

See my comment here: https://news.ycombinator.com/item?id=38020655 for a CLI tool that lets you do this.

Note that embedding models are a different kind of thing from a Large Language Model, so it's not the kind of model you can ask questions.

It's a model which can take text and turn it into an array of floating point numbers, which you can then use to implement things like semantic search and related documents.

More on that here: https://simonwillison.net/2023/Oct/23/embeddings/

minimaxir2y ago

The Hugging Face page for the model has a two line load-and-encode Python code demo: https://huggingface.co/jinaai/jina-embeddings-v2-base-en

brucethemoose22y ago

iirc ooba has its own integrated vectordb called superbooga.

I bet you could hack this in.

sroussey2y ago

Does anyone know what they are using for this comparison and ranking? And where does instruct-xl stand in the mix?

sroussey2y ago

Oh duh, it’s right in the post and instructor-xl is number 9. And so many new participants now!

sroussey2y ago

The ranking are here:

https://huggingface.co/spaces/mteb/leaderboard

It’s amazing how many new and better ones there are since I last looked a few months ago. Instructor-xl was number 1, now it is number 9, and its size is more than 10x the number 2 ranked!

Things move fast!

extasia2y ago

Is this a text encoder model, BERT style?

neximo642y ago

Does it match OpenAI on number of params?

minimaxir2y ago

No one knows since OpenAI has not disclosed the number of paramerers their embeddings model uses.

nwhnwh2y ago

What does this even do?

tyingq2y ago

See this story from yesterday: https://news.ycombinator.com/item?id=37985489

3cats-in-a-coat2y ago

Great company name.

smcl2y ago

I'm gonna try to explain this because I thought the same thing, though you may enjoy it for another reason. Among Czech or other slavic software people - "jiná AI" could be like "another AI" and, to me at least, brings to mind the "yet another {thing}" naming convention (yacc = "yet another compiler compiler" for example).

e1g2y ago

Their OpenAI benchmark is GPT3 (text-embedding-ada-002), not GPT4.

simonw2y ago

"text-embedding-ada-002" isn't GPT3, it's a different kind of model. Embedding models and Large Language Models aren't the same thing.

e1g2y ago

LLMs and embedding models are certainly different, but it's a useful benchmark to calibrate expectations. OpenAI released text-embedding-ada-002 a year ago, and they describe the ada model as[1] "the original GPT-3 base model [...] capable of very simple tasks, usually the fastest model in the GPT-3 series".

It's fair to expect GPT3-level results - not GPT 3.5 and certainly not open-source tiny GPT4 as some might think when they read "rivaling OpenAI".

[1] https://platform.openai.com/docs/models/whisper

2 more replies

andrewstuart2y ago

Anyone got links to examples of text embedding?

BoorishBears2y ago

Easiest example is taking three words: Universe, University, College.

- University and Universe are similar alphabetically.

- University and College are similar in meaning.

Take embeddings for those three words and `University` will be near `College`, while `Universe` will be further away, because embeddings capture meaning:

University<-->College<-------------->Universe

With old school search you'd need to handle the special case of treating University and College as similar, but embeddings already handle it.

With embeddings you can do math to find how similar two results are, based on how close their vectors are. The closer the embeddings, the closer the meaning.

osigurdson2y ago

Another interesting point is that math can be performed on embedding vectors: emb("king") - emb("man") + emb("woman") = emb("queen").

1 more reply

RossBencina2y ago

OpenAI have a brief explainer with a bunch of example use cases here:

https://platform.openai.com/docs/guides/embeddings/what-are-...

Zuiii2y ago

Color me surprised! it looks like its actually open source (Apache 2.0) and not the usual false advertising by some two-faced company or institution. Links here:

* https://huggingface.co/jinaai/jina-embeddings-v2-base-en * https://huggingface.co/jinaai/jina-embeddings-v2-small-en

1 more reply

RossBencina2y ago

Some relevant stats from the link:

8192 token input sequence length

768 embedding dimensions

0.27GB model (with 0.07GB model also available)

Tokeniser: BertTokenizer [1], 30528 token vocab [2]

Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary.

[1] https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blo...

[2] https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blo...

LoganDark2y ago

> Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary.

Words that aren't in the vocabulary can still be represented by multiple tokens. Some models can input and output valid UTF-8 at the byte level (rather than needing a unique token for each codepoint). For example RWKV-World.

space_fountain2y ago

A large vocabulary means less tokens are needed to represent the same information

2 more replies

DavidSJ2y ago

A uniform distribution over 30528 tokens is just under 15 bits of information per token, whereas a vocabulary size of ~60000 would be just under 16 bits per token. In practice it's not uniform, but this shows that they're in the same ballpark.

rajin1122y ago

Thanks what size gpu would you need to fine tune or do an inference?

j / k navigate · click thread line to collapse

201 comments

pietz2y ago

The 8k context window is new, but isn't the 512 token limitation a soft limit anyway? I'm pretty sure I can stuff bigger documents into BGE for example.

Plus, you can expect that this model was designed to perform well on the datasets in MTEB while the OpenAI model probably wasn't.

Many also stated that a 8k context embeddings will not be very useful in list situations.

When would anyone use this model?

larodi2y ago

infecto2y ago

I have been trying to understand the hype as well. Happy to see all the work happening in this space still.

infecto2y ago

This also opens up another question though, how would that compare to using a LLM to summarize that paper and then just embed on top of that summary.

1 more reply

egorfine2y ago

I fail to imagine a 8k-token-length piece of text that has just one single semantic coordinate and is appropriate for embedding and vector search.

simonw2y ago

What are you using those embeddings for?

I can see a sliding window working for semantic search and RAG, but not so much for clustering or finding related documents.

1 more reply

theptip2y ago

> The 8k context window is new

Hasn’t Claude had this for many months (before they bumped to 100k)?

Edit: ah, you mean new for OSS maybe?

simonw2y ago

Claude is a large language model, which is a different thing from an embedding model.

2 more replies

burcs2y ago

This is great news!

udev40962y ago

Is it tho? It's not really open source if they don't give us the information regarding training datasets

jerpint2y ago

It definitely is open source even if they don’t disclose all details behind the training

3 more replies

infecto2y ago

Gasp0de2y ago

They compare it to OpenAI's ada model though, which is light-years away from ChatGPT.

simonw2y ago

Don't confuse the current Ada embedding model the old Ada GPT3 model.

infecto2y ago

Does that not conflate two different things though? Embedding model != LLM Model ?

jncraton2y ago

For those unaware, if 512 tokens of context is sufficient for your use case, there are already many options that outperform text-embedding-ada-002 on common benchmarks:

https://huggingface.co/spaces/mteb/leaderboard

minimaxir2y ago

The 768D-sized embeddings compared to OpenAI's 1536D embeddings are actually a feature outside of index size.

TeMPOraL2y ago

karxxm2y ago

You wrote „out of the box“, did you find a way to improve this?

1 more reply

andy992y ago

Isn't the normal way of using embedding to find relevant text snippets for a RAG prompt? Where is it better to have coarser retrieval?

dragonwriter2y ago

> What is the use case for an 8k token embedding?

Calculating embeddings on larger documents than smaller-window embedding models.

> My (somewhat limited) experience with long context models is they aren't great for RAG.

> Isn't the normal way of using embedding to find relevant text snippets for a RAG prompt?

I would say the usual use is for search and semantic similarity comparisons generally. RAG is itself an application of search, but its not the only one.

3abiton2y ago

I wonder how the perfomance fair when context size is increased. Intuitively this should be higher, but some quantized models I've tested showed noticeably worst performance.

1 more reply

teaearlgraycold2y ago

refulgentis2y ago

I'd much rather know what paragraph to look in than what 25 pages to look in

3 more replies

kristopolous2y ago

Is this what you mean by RAG? https://www.promptingguide.ai/techniques/rag?

simonw2y ago

I have an explanation of RAG in the context of embeddings here: https://simonwillison.net/2023/Oct/23/embeddings/#answering-...

1 more reply

teaearlgraycold2y ago

Yes

egorfine2y ago

One thing that is missing in comparison: OpenAI's model is multilingual.

jimmySixDOF2y ago

I heard one of the developers on a regular Open Source DIY AI X/twitter space [1] & they are targeting two new models German/English and French/English for the next release

https://x.com/thursdai_pod

m3kw92y ago

I don’t really know what that means but it seems useful

do-me2y ago

Just quantized the models for onnx usage in e.g. transformers.js and got 4x reduced file size:

- 𝟐𝟖.𝟓 𝐌𝐁 jina-embeddings-v2-small-en (https://huggingface.co/do-me/jina-embeddings-v2-small-en)

- 𝟏𝟎𝟗 𝐌𝐁 jina-embeddings-v2-base-en (https://huggingface.co/do-me/jina-embeddings-v2-base-en)

Havoc2y ago

Why quantize something that is already very small (270mb)?

pietz2y ago

simonw2y ago

I just shipped a new llm-embed-jina plugin for my LLM tool which provides access to these new Jina models: https://github.com/simonw/llm-embed-jina

Here's how to try it out.

First, install LLM. Use pip or pipx or brew:

    brew install llm

Next install the new plugin:

    llm install llm-embed-jina

You can confirm the new models are now available to LLM by running:

    llm embed-models

You should see a list that includes "jina-embeddings-v2-small-en" and "jina-embeddings-v2-base-en"

To embed a string using the small model, run this:

    llm embed -m jina-embeddings-v2-small-en -c 'Hello world'

That will output a JSON array of 512 floating point numbers (see my explainer here for what those are: https://simonwillison.net/2023/Oct/23/embeddings/#what-are-e...)

Embeddings are only really interesting if you store them and use them for comparisons.

Here's how to use the "llm embed-multi" command to create embeddings for the 30 most recent issues in my LLM GitHub repository:

    curl 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
    | jq '[.[] | {id: .id, title: .title}]' \
    | llm embed-multi -m jina-embeddings-v2-small-en jina-llm-issues - \
    --store

This creates a collection called "jina-llm-issues" in a default SQLite database on your machine (the path to that can be found using "llm collections path").

To search for issues in that collection with titles most similar to the term "bug":

    llm similar jina-llm-issues -c 'bug'

Or for issues most similar to another existing issue by ID:

    llm similar jina-llm-issues 1922688957

Full documentation on what you can do with LLM and embeddings here: https://llm.datasette.io/en/stable/embeddings/index.html

    llm embed-multi jina-readmes \
      -m jina-embeddings-v2-small-en \
      --files . '**/README.md' --store

Then search them like this:

    llm similar jina-readmes -c 'backup tools'

simonw2y ago

Wrote this up on my blog: https://simonwillison.net/2023/Oct/26/llm-embed-jina/

bosky1012y ago

The only feedback I had from your embedding post was

    wish we could create the array of floating points without openai

Great timely turnaround time, good sir. Ht

X6S1x6Okd1st2y ago

Thank you so much for all the work you've put into llm!

jillesvangurp2y ago

Thanks, this is wonderfully simple to use. Just managed to package this up using docker and was able to use it without a lot of drama. Nice how simple this is to use.

michalmatczuk2y ago

FYI it seems that llm install llm-embed-jina is missing yaml dependency

  File "/opt/homebrew/Cellar/llm/0.11_1/libexec/lib/python3.12/site-packages/llm/default_plugins/openai_models.py", line 17, in <module>
    import yaml

ModuleNotFoundError: No module named 'yaml'

simonw2y ago

Thanks! I wonder if the Python 3.12 upgrade broke something.

The pyyaml package is correctly listed on the formula page though: https://formulae.brew.sh/formula/llm

dazzaji2y ago

Excellent! And you were just saying how risky it is to rely long-term on OpenAI text embeddings in your post on the topic. The timing for this open source option worked out nicely.

mike_ivanov2y ago

JFYI, this is what happens on my M1 Macbook:

$ brew install llm $ llm ModuleNotFoundError: No module named 'typing_extensions'

Not sure where to report it.

simonw2y ago

Whoa, that is a weird one. Do you know what version of Python you have from Homebrew?

It looks like that package is correctly listed in the formula: https://github.com/Homebrew/homebrew-core/blob/a0048881ba9a2...

1 more reply

IanCal2y ago

1 more reply

marinhero2y ago

How well do LLMS like this work with a non-English language? Or are these open source models limited to English?

simonw2y ago

Quite a few of the top ranked models on this leaderboard are multilingual: https://huggingface.co/spaces/mteb/leaderboard

https://huggingface.co/BAAI/bge-large-en-v1.5 FlagEmbedding for example describes itself as covering Chinese and English.

anigbrowl2y ago

Stability has a Japanese port which is getting lots of work https://twitter.com/StabilityAI_JP/status/171699857824440759...

m3at2y ago

1 more reply

ttul2y ago

That depends on whether the training data contained languages other than English.

omneity2y ago

Impressive work.

Maybe it can be useful for coarse similarity matching, for example to detect plagiarism?

sroussey2y ago

8K is the context length. Their vector dimension size is actual much smaller, which is great for a number of use cases, though maybe not the ones you are thinking about.

omneity2y ago

Yes that’s also how I understood it. Maybe it was ambiguously expressed, but I mean “8k tokens as input is a lot of information to encode”

moralestapia2y ago

Ada is one of the (if not the) worst model offered by OpenAI, though ...

simonw2y ago

You're thinking of the old "ada" GPT-3 model - the one that was a companion to "davinci" and "babbage".

I understand your confusion: OpenAI are notoriously bad at naming things!

moralestapia2y ago

Oh, thanks for clarifying!

1 more reply

luke-stanley2y ago

cztomsik2y ago

Small context window means you cannot embed the whole document, you are embedding just a part.

So, if there is some information at the bottom which is dependent on something which is at the top, your embedding could be entirely wrong.

woofwoofwoof2y ago

1 more reply

itronitron2y ago

It's weird to think there are entire companies built around providing access to a pre-computed vector space model.

nicognaw2y ago

dcastm2y ago

I wonder how much better is this, compared to taking the average ( or some other aggregation) of embeddings with a smaller context length. Has anyone done a similar comparison?

pietro72ohboy2y ago

Kutsuya2y ago

I just want to make an embedding between a conversation of me and my friend and simulate talking to them. Is this a hard thing to train to begin with?

If anyone knows or could help me with this, I would be very grateful!

infecto2y ago

https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehe...

Kutsuya2y ago

1 more reply

tayo422y ago

You can't fine tune without using their library tied to their cloud? Did I misunderstand? Do you need fine tune?

1 more reply

backendEngineer2y ago

oh thank god I first read Jira...

eshack942y ago

You're not the only one... glad I misread that.

pknerd2y ago

Pardon my ignorance in advance but could it be used to "chat" with PDFs and websites? I am looking for OpenAI alternatives as I am in learning phase

lofties2y ago

This tool helps with embedding part.

I’ve built a bunch of ”chat with your PDFs” bots, do reach out if you have any questions me at brian.jp.

pknerd2y ago

Actually I wanna use langchain. OpwnAI is not free. I wanted to test two use cases:

- chat with documents(pdf, doc etc)

- chat with website. Like, if I integrate with an ecommerce site, I can ask questions from the website. What options do I have in free for both cloud and locally?

clarkmcc2y ago

Check out my little side project for chatting with PDFs. You should be able to load most models including this one. https://github.com/clarkmcc/chitchat

pknerd2y ago

This looks cool so can it be used to feed Website/Products data in CSV/JSON format and "chat" with it?

1 more reply

canadaduane2y ago

No, this is an embedding model, not a text completion model.

seydor2y ago

using the bing tab of microsoft edge browser, you can chat with PDFs and i think they use GTP4 or equivalent

Nitrolo2y ago

Is there something like oobabooga to easily run this in a click-and-run way? Where I can load up a model, a text, and ask it questions?

simonw2y ago

See my comment here: https://news.ycombinator.com/item?id=38020655 for a CLI tool that lets you do this.

Note that embedding models are a different kind of thing from a Large Language Model, so it's not the kind of model you can ask questions.

It's a model which can take text and turn it into an array of floating point numbers, which you can then use to implement things like semantic search and related documents.

More on that here: https://simonwillison.net/2023/Oct/23/embeddings/

minimaxir2y ago

The Hugging Face page for the model has a two line load-and-encode Python code demo: https://huggingface.co/jinaai/jina-embeddings-v2-base-en

brucethemoose22y ago

iirc ooba has its own integrated vectordb called superbooga.

I bet you could hack this in.

sroussey2y ago

Does anyone know what they are using for this comparison and ranking? And where does instruct-xl stand in the mix?

sroussey2y ago

Oh duh, it’s right in the post and instructor-xl is number 9. And so many new participants now!

sroussey2y ago

The ranking are here:

https://huggingface.co/spaces/mteb/leaderboard

It’s amazing how many new and better ones there are since I last looked a few months ago. Instructor-xl was number 1, now it is number 9, and its size is more than 10x the number 2 ranked!

Things move fast!

extasia2y ago

Is this a text encoder model, BERT style?

neximo642y ago

Does it match OpenAI on number of params?

minimaxir2y ago

No one knows since OpenAI has not disclosed the number of paramerers their embeddings model uses.

nwhnwh2y ago

What does this even do?

tyingq2y ago

See this story from yesterday: https://news.ycombinator.com/item?id=37985489

3cats-in-a-coat2y ago

Great company name.

smcl2y ago

e1g2y ago

Their OpenAI benchmark is GPT3 (text-embedding-ada-002), not GPT4.

simonw2y ago

"text-embedding-ada-002" isn't GPT3, it's a different kind of model. Embedding models and Large Language Models aren't the same thing.

e1g2y ago

It's fair to expect GPT3-level results - not GPT 3.5 and certainly not open-source tiny GPT4 as some might think when they read "rivaling OpenAI".

[1] https://platform.openai.com/docs/models/whisper

2 more replies

andrewstuart2y ago

Anyone got links to examples of text embedding?

BoorishBears2y ago

Easiest example is taking three words: Universe, University, College.

- University and Universe are similar alphabetically.

- University and College are similar in meaning.

Take embeddings for those three words and `University` will be near `College`, while `Universe` will be further away, because embeddings capture meaning:

University<-->College<-------------->Universe

With old school search you'd need to handle the special case of treating University and College as similar, but embeddings already handle it.

With embeddings you can do math to find how similar two results are, based on how close their vectors are. The closer the embeddings, the closer the meaning.

osigurdson2y ago

Another interesting point is that math can be performed on embedding vectors: emb("king") - emb("man") + emb("woman") = emb("queen").

1 more reply

RossBencina2y ago

OpenAI have a brief explainer with a bunch of example use cases here:

https://platform.openai.com/docs/guides/embeddings/what-are-...

Zuiii2y ago

Color me surprised! it looks like its actually open source (Apache 2.0) and not the usual false advertising by some two-faced company or institution. Links here:

* https://huggingface.co/jinaai/jina-embeddings-v2-base-en * https://huggingface.co/jinaai/jina-embeddings-v2-small-en

1 more reply

RossBencina2y ago

Some relevant stats from the link:

8192 token input sequence length

768 embedding dimensions

0.27GB model (with 0.07GB model also available)

Tokeniser: BertTokenizer [1], 30528 token vocab [2]

Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary.

[1] https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blo...

[2] https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blo...

LoganDark2y ago

> Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary.

space_fountain2y ago

A large vocabulary means less tokens are needed to represent the same information

2 more replies

DavidSJ2y ago

rajin1122y ago

Thanks what size gpu would you need to fine tune or do an inference?

j / k navigate · click thread line to collapse