That said, I don't think it's impossible for a small model to be very good. I see their "synthetic data" as essentially a way of distilling GPT-4 into smaller models. It would be exciting if a large fraction of the performance of huge models could be transferred to small ones! If true, then Chinchilla-optimal training could make sense again, as you could optimally train a ginormous model and then distill it afterward for efficient inference.
They mention this model's relative weakness on the TruthfulQA eval, since packing 'knowledge' into a small model is lossier than packing problem-solving skills (which shine on MMLU)
Regardless - still a very useful thing to have offline and on the fly. Those scores are nothing to scoff at.
Given that these pipelines are likely harder to imitate than new architectures like Transformers, I assume there has been and will be an intense focus on synthetic data generation and cleansing. Llama 3 used 15T tokens in its training corpus vs 4.8T in the "scaled-up" version of phi-3. If you made it to the end of this disjointed ramble, I'm sorry
Wait, people still use this benchmark? I hear there's a huge flaw in it.
For example, fine-tuning a model on 4chan makes it score better on TruthfulQA. It becomes very offensive afterwards, though, for obvious reasons. See GPT-4chan [1]
The question that Chinchilla tries to answer is: for a given training budget (which you can think of as dollars or FLOPs), what is the optimal trade-off between model size and quantity of training data to get the most performant model? Build a large model and train it on less data, or build a smaller one and train it on more data?
However, another consideration is minimizing total lifetime cost of the model: training cost + inference cost. You could train a model for longer (costing more) in order to get a given level of performance from a smaller model that will be cheaper for inference, or vice versa. For any given projected model lifetime inference volume, there is going to be a different answer.
It's not that Chinchilla-optimal models stopped making sense, but rather that this sort of consideration has people willing to pump more money (tokens) into smaller models to reduce inference cost for that level of capability.
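The trade-off above can be sketched numerically. This is a rough back-of-the-envelope illustration using two common approximations, not exact figures from the Chinchilla paper: training compute C ≈ 6·N·D (N = parameters, D = tokens), and the roughly 20-tokens-per-parameter rule of thumb.

```python
# Rough sketch of the Chinchilla compute allocation, assuming
# C ~= 6 * N * D and the ~20 tokens-per-parameter rule of thumb.
# Both constants are approximations, not exact values.

def chinchilla_optimal(flops_budget, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens) under C ~= 6*N*D
    with D = tokens_per_param * N."""
    # C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: Chinchilla itself was trained with roughly 5.8e23 FLOPs,
# which this rule of thumb maps to ~70B params and ~1.4T tokens.
n, d = chinchilla_optimal(5.8e23)
print(f"{n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")
```

Lowering `tokens_per_param` shifts the budget toward a bigger model trained on less data; raising it gives the "overtrained" small models (e.g. Llama 3 8B on 15T tokens) that are cheaper at inference time.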
Distilling can work and there are papers which suggest it does, but we still do not have a reliable mechanism which can distill knowledge from larger teacher models to smaller student models.
This is an important distinction when it comes to assessing model collapse risk, a risk I think has probably been overstated to the point where it's now being understated.
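For reference, the most common distillation objective is the Hinton-style one: train the student to match the teacher's temperature-softened output distribution. A minimal sketch of that loss (illustrative only; real LLM distillation involves much more, e.g. sequence-level objectives and data curation):

```python
import math

# Minimal sketch of logit distillation: KL divergence between the
# teacher's and student's temperature-softened output distributions.

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures
    return temperature ** 2 * sum(
        pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0
    )

# The loss is zero exactly when the student matches the teacher:
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # -> 0.0
```

This transfers the teacher's output behaviour well; transferring its "knowledge" reliably into a much smaller student is the part that remains unsolved.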
It felt a lot like it was overfitted to the exact types of tasks in the benchmarks (i.e., not a data leak), but if you tried something a bit off track it didn't know what to do. At the time my hypothesis was that the small model just didn't have the capacity to generalise well enough, but since then Gemma 2B has come out and seems to be ok.
So now I have no idea why, but yes: the benchmarks for Phi-2 didn't represent how it worked for me on real-world tasks where you'd expect it to be ok.
To me, what the parent comment is saying is that even though the benchmarks are cool, it's not super helpful to the everyday person. Because if you can't chat with it very well (even in a narrow context), what utility does it have, great benchmarks or not?
And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.
Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown)
So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones? Kinda? Wild.
(I'm sure there's a lot of nuance to it, for one these benchmarks are not so hard to game, we'll see how the dust settles, but still...)
Phi-3-mini 3.8b: 71.2
Phi-3-small 7b: 74.9
Phi-3-medium 14b: 78.2
Phi-2 2.7b: 58.8
Mistral 7b: 61.0
Gemma 7b: 62.0
Llama-3-In 8b: 68.0
Mixtral 8x7b: 69.9
GPT-3.5 1106: 75.3
(these are averages across all tasks for each model, but looking at individual scores shows a similar picture)
> Incredible, beat Llama 3 8B with 3.8B parameters after less than a week of release.
Judging by a single benchmark? Without even trying it out with real world usage?
> And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.
Any potential caveats of such a leaderboard notwithstanding, on that leaderboard alone there is a huge gap between Llama 3 8B and Mistral-Large, let alone any of the GPT-4 variants.
By the way, on beating benchmarks: "Pretraining on the Test Set Is All You Need"
As I've stated in other comments, yeah... Agreed, I'm stretching it a bit. It's just that any indication of a 3.8B model being in the vicinity of GPT-4 is huge.
I'm sure that when things are properly measured by third-parties it will show a more sober picture. But still, with good fine-tunes, we'll probably get close.
It's a very significant demonstration of what could be possible soon.
Per the paper, phi3-mini (which is english-only) quantised to 4bit uses 1.8gb RAM and outputs 1212 tokens/sec (correction: 12 tokens/sec) on iOS.
A model on par with GPT-3.5 running on phones!
(weights haven't been released, though)
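The reported footprint checks out as a back-of-the-envelope calculation: 3.8B parameters at 4 bits each is about 1.8 GiB, ignoring quantization-block overhead (scales/zero-points) and KV-cache/activation memory.

```python
# Back-of-the-envelope check of the reported 1.8GB figure for
# phi3-mini quantised to 4 bits. Ignores quantization metadata
# (per-block scales) and runtime memory (KV cache, activations).
params = 3.8e9
bytes_per_param = 4 / 8       # 4-bit quantization = half a byte
gib = params * bytes_per_param / 2**30
print(f"~{gib:.1f} GiB of weights")  # roughly matches the reported 1.8GB
```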
Phi-1, Phi-1.5, and Phi-2 have all had their weights released, and those weights are available under the MIT License.
Hopefully Microsoft will continue that trend with Phi-3.
> outputs 1212 tokens/sec on iOS
I think you meant "12 tokens/sec", which is still nice, just a little less exciting than a kilotoken/sec.
> So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones
No, not even close. Even Gemini has a huge UX gap compared to GPT-4/Opus; for an 8B model I won't even attempt that argument.
Feels incredible to be living in a time of such breakneck innovation. What are the chances we'll have a <100B-parameter GPT-4/Claude Opus-level model in the next 5 years?
In 5 years time we'll have adaptive compute and the idea of talking about the parameter count of a model will seem as quaint as talking about the cylinder capacity of a jet engine.
What we need is a standardised open harness for open source LLMs to sit in that gives them both access to tools and the ability to write their own, and that's (comparatively speaking) a much easier job than training up another raw frontier LLM: it's just code, and they can write a lot of it.
We’ll have small local models beating GPT-4/Claude Opus in 2024. We already have sub-100B models trading blows with former GPT-4 models, and the future is racing toward us. All these little breakthroughs are piling up.
I really wish these companies would release the training source, evaluation suites, and code used to curate/filter training data (since safety efforts can lead to biases). Ideally they would also share the training data but that may not be fully possible due to licensing.
Source?
No, we don't. LMsys is just one, very flawed benchmark.
Many people treat LMsys as gospel because it's the only large-scale, up-to-date qualitative benchmark. All the numeric benchmarks seem to miss real-world applicability.
But it was slow for its size, generated the longest responses with the most hallucinations, and produced the most empty responses. It was also ranked as having the lowest-quality answers.
Someday their joke book will become so mediocre it won't stick anymore, but I think they're safe on this one, for now.
To me this advances the state of the art on the impact of data quality, but it doesn't look to me like the phi series has some magical special sauce otherwise. Quality data and synthetic data creation are not magical moats that Apple can't cross.
I'll say too that I'm psyched to try Phi-3; the sweet spot for me is a model that can be a local coding assistant and still answer random Q&A questions with some sophistication. I'm skeptical that 3-8B parameter models will bring the high level of sophistication sometimes needed in this cycle; there's still a very large gap with the larger models in daily use, despite some often close benchmark scores.
Anyway, Apple-Phi-3 is in no way an impossibility.
They're working on MLX, but it only recently got Swift bindings. They just don't have the DEVELOPERS DEVELOPERS DEVELOPERS coked-out attitude, I guess
They’ll be fine.
Are you next going to tell us that the CIA's access to iCloud data protects their users from terrorism too?
What I’m trying to say is that user experience is now as key as model smarts, and these models that barely touch GPT-4 cannot beat OpenAI right now as a whole package.
i mean, fair
It costs so little to share the credit if someone was an asset.
I'd trust Microsoft to do decontamination testing, although the paper doesn't explicitly mention it other than "The prompts and number of shots are part of a Microsoft internal tool to evaluate language models, and in particular we did no optimization to the pipeline for the phi-3 models."
tl;dr I'm looking forward to having lots of models trained with a wide range of parameter counts to narrow down what is actually optimal
I think there is an interesting tradeoff of data quality and data volume, though
(E.g., if we train on the highest-quality 10% of our data, does the model improve if we also use the other 90%? What if we increase our data size by 10x?)