`phi-4` really blew me away in terms of learning from few-shots. It measured as being 97% consistent with `gpt-4o` when using high-precision few-shots! Without the few-shots, it was only 37%. That's a huge improvement!
By contrast, with few-shots it performs as well as `gpt-4o-mini` (though `gpt-4o-mini`'s baseline without few-shots was 59% – quite a bit higher than `phi-4`'s).
[1] https://bits.logic.inc/p/getting-gpt-4o-mini-to-perform-like
1. The only ultimate absolute quality metric I saw in that blog post, as far as I can tell, was expert agreement... at 90%. All of our customers would fire us at that level across all of the different B2B domains we work in. I'm surprised 90% is considered acceptable quality in a paying business context like retail.
2. `gpt-4o-mini` is great. I find that, for the kind of simple tasks you describe, we can get `gpt-4o-mini` to achieve about 95-98% agreement with `gpt-4o` by iteratively and manually improving prompts over increasingly large synthetic evals. Given data and a good dev, we do this basically same-day for a lot of simple tasks, which is astounding.
I do expect automatic prompt optimizers to win here long-term, and I keep hopefully revisiting DSPy et al. For now, they underperform standard prompt engineering. Likewise, I do believe in example learning over time for areas like personalization... but doing semantic-search recall of high-rated answers was a V1 thing we had to rethink due to too many issues.
It's, admittedly, a tough task to measure objectively though, in that it's like a code review. If a Principal Engineer pointed out 20 deficiencies in a code change and another Principal Engineer pointed out 18 of the same 20 things, but also pointed out 3 other things that the first reviewer didn't, it doesn't necessarily mean either review is wrong – they just meaningfully deviate from each other.
In this case, we chose an expert that we treat as an objective "source of truth".
re: simple tasks – We run hundreds of thousands of tasks every month with more-or-less deterministic behavior (in that, we'll reliably do it correctly a million out of a million times). We chose a particularly challenging task for the case-study though.
re: in a paying business context – FWIW, most industries are filled with humans doing tasks where the rate of perfection is far below 90%.
search(T,θ,m) retrieves the first m historical tasks that are semantically similar above the θ threshold
Are both m's here the same or different numbers? I found this a bit confusing
You could, for example, include all few-shots that meet the similarity threshold, but you’ll use more tokens for (I assume) marginal gain. Definitely worth a try though.
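For concreteness, here's a minimal sketch of what a `search(T, θ, m)` along those lines might look like, assuming task embeddings are precomputed by some embedding model. All names here are hypothetical illustrations, not the author's actual implementation:

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class HistoricalTask:
    text: str
    answer: str
    embedding: list  # precomputed, e.g. by an embedding model

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(task_embedding, history, theta, m):
    """Return up to m historical tasks whose similarity to the new task
    exceeds theta, most similar first."""
    scored = [(cosine(task_embedding, h.embedding), h) for h in history]
    above = [(s, h) for s, h in scored if s >= theta]
    above.sort(key=lambda sh: sh[0], reverse=True)
    return [h for _, h in above[:m]]
```

Whether you take the top m by similarity (as sketched) or the first m found is an implementation choice; sorting by similarity makes the token budget go to the closest matches.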
Phrased differently, when a task has many valid and correct conclusions, this technique allows the LLM to see "How did I do similar tasks before?" and it'll tend to solve new tasks by making similar decisions it made for previous similar tasks.
Two things to note:
- You'll typically still want to have some small epsilon where you choose to run the task without few-shots. This will help prevent mistakes from propagating forward indefinitely.
- You can have humans correct historical examples, and use their feedback to improve the large model dynamically in real-time. This is basically FSKD where the human is the "large model" and the large foundation model is the "small model".

I genuinely think we're only 2 years away from full custom local voice-to-voice LLM assistants that grow with you, like JOI in BR2049, and it's going to change how we think about being human and being social, and how we grow up.
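The epsilon note in the first bullet above can be sketched roughly like this (hypothetical names, not anyone's actual implementation):

```python
import random

def build_prompt(task, few_shots, epsilon=0.05, rng=random):
    """With probability epsilon, deliberately run the task without
    few-shots, so occasional baseline runs keep past mistakes from
    propagating forward indefinitely."""
    if few_shots and rng.random() >= epsilon:
        return "\n\n".join(few_shots) + "\n\nTask: " + task
    return "Task: " + task
```

The right epsilon depends on how costly a bad few-shot lineage is versus how much the few-shots help on average.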
I've been experimenting with running local LLMs for nearly two years now, ever since the first LLaMA release back in March 2023.
About six months ago I had mostly lost interest in them. They were fun to play around with but the quality difference between the ones I could run on my MacBook and the ones I could access via an online API felt insurmountable.
This has completely changed in the second half of 2024. The models I can run locally had a leap in quality - they feel genuinely GPT-4 class now.
They're not as good as the best hosted models (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet) but they're definitely good enough to be extremely useful.
This started with the Qwen 2 and 2.5 series, but I also rate Llama 3.3 70B and now Phi-4 as GPT-4 class models that run on my laptop.
I wrote more about this here: https://simonwillison.net/2024/Dec/31/llms-in-2024/#some-of-...
A 'word calculator' this effective is the best substitute we have for a logic calculator. And the fact that it's enough in 90% of situations is as terrifying as it is transformative, as is the fact that no one is awake to it.
Exponential power scaling in an unstable world feels like it only makes it exponentially more unstable though.
BTW, a few days ago I published a book on using Ollama. Here is a link to read it online https://leanpub.com/ollama/read
In fact, during the onboarding process they ask the user to choose which AI companion movie they relate to the most: Her, BR2049 or Ex Machina. The experience is then tailored to align more closely with the movie chosen.
It's quite a terrible app from a product design perspective: filled with dark patterns (like sending the user blurred images to "unlock") and upsells, but it's become successful amongst the masses that have adopted it, which I find fascinating. 30M+ users https://en.wikipedia.org/wiki/Replika#:~:text=Replika%20beca....
She appears to be a local model runnable on a small device without cloud.
I don’t see anything in the tech that indicates a singular pattern that will be “good” or “bad”.
Hunyuan (open source video) has been remarkable. Flux dev makes some incredible images.
It's hard to wrap my head around the fact that it's still only going to get better from here.
For these models probably no. But for proprietary things that are mission critical and purpose-built (think Adobe Creative Suite) the calculus is very different.
MS, Google, and Amazon all win from providing infra for open-source models. I have no idea what game Meta is playing.
Based on their business moves in recent history, I’d guess most of them are playing Farmville.
I think they're commoditizing their complement [1]. Engaging content helps Meta, and LLMs make it easier to create that content. Their business model has never been selling API access and releasing the model enables the community to improve it for them.
> I think the strategy is now offer cheap and performant infra to run the models.
Is this not what Microsoft is doing? What can Microsoft possibly lose by releasing a model?
This is a very clever move by Microsoft. OpenAI has no technological moat and is a very unreliable partner.
Yes, because you can't build a moat. Open source will very quickly catch up.
Unfortunately I'm only getting 6 tok/s on an NVidia A4000, so it's still not great for real-time queries. But luckily, now that it's MIT licensed, it's available on OpenRouter [2] for a great price of $0.07/$0.14 per million tokens at a fast 78 tok/s.
Because it yields better results and we're able to self-host Phi-4 for free, we've replaced Mistral NeMo with it in our default models for answering new questions [3].
[1] https://pvq.app/leaderboard
Edit: they have a blog post https://pvq.app/posts/individual-voting-comparison although it could go deeper
We would have liked to pick a neutral model like Gemini, which was fast, reliable and low cost; unfortunately it gave good grades to too many poor answers [1]. If we had to pick a new grading model now, hopefully the much-improved Gemini Flash 2.0 might yield better results.
[1] https://pvq.app/posts/individual-voting-comparison#gemini-pr...
The one red flag with Phi-4 is that its IFEval score is relatively low. IFEval tests for specific types of constraints (forbidden words, capitalization, etc.) [2], but it's one area especially worth keeping an eye on for those testing Phi-4 for themselves...
[1] https://docs.google.com/spreadsheets/u/3/d/18n--cIaVt49kOh-G...
[2] https://github.com/google-research/google-research/blob/mast...
Did it have a different license before? If so, why did they change it?
Then a quick search revealed you can, as of a few weeks ago.
Lots of other models will work nearly as well though if you just give them a clear schema to follow and ask them to output json only, then parse it yourself. Like I've been using gemma2:9b to analyze text and output a json structure and it's nearly 100% reliable despite it being a tiny model and not supporting tools or structured output officially.
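The "parse it yourself" step is mostly about tolerating models that wrap their JSON in markdown fences or surrounding chatter. A rough sketch of that, with a hypothetical schema prompt (the helper assumes a single flat JSON object in the reply):

```python
import json
import re

# Hypothetical schema hint appended to the prompt sent to the model.
SCHEMA_HINT = (
    "Respond with JSON only, matching this schema:\n"
    '{"sentiment": "positive" | "negative" | "neutral", "confidence": number}'
)

def extract_json(raw):
    """Pull the first JSON object out of a model reply, tolerating
    ```json fences or chatter around it."""
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(candidate[start:end + 1])
```

With a retry-on-ValueError loop around the model call, this gets small models without official structured-output support surprisingly close to reliable.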
Edit: so, for example, if you want the unsloth "debugged" version of Phi-4, you would run:
`$ ollama pull hf.co/unsloth/phi-4-GGUF:Q8_0`
(check on the right side of the hf.co/unsloth/phi-4-GGUF page for the available quants)
For the Phi-4 uploaded to Ollama, the hyperparameters were set to avoid the error. The error should stop occurring in the next version of Ollama [2] for imported GGUF files as well
In retrospect, a new architecture name should probably have been used entirely, instead of re-using "phi3".
[1]: https://news.ycombinator.com/item?id=42660335 Phi-4 Bug Fixes
Also on hugging face https://huggingface.co/microsoft/phi-4
For context: I've made some simple neural nets with backprop. I read [1].
The gist of it is that they curated a smaller, high-quality synthetic dataset from textbooks, problem sets, etc., instead of dumping in a massive dataset with tons of information.
Does this mean the model was trained without copyright infringements?
llama.cpp basically dropped support for multimodal vision models. Ollama still does support them, but only a handful. Also, Ollama still does not support Vulkan, even though llama.cpp has had Vulkan support for a long, long time now.
This has been very sad to watch. I'm more and more convinced that vllm is the way to go, not ollama.