`phi-4` really blew me away in terms of learning from few-shots. It measured as being 97% consistent with `gpt-4o` when using high-precision few-shots! Without the few-shots, it was only 37%. That's a huge improvement!
By contrast, with few-shots it performs as well as `gpt-4o-mini` (though `gpt-4o-mini`'s baseline without few-shots was 59% – quite a bit higher than `phi-4`'s).
[1] https://bits.logic.inc/p/getting-gpt-4o-mini-to-perform-like
1. The only ultimate absolute quality metric I saw in that blog post, as far as I can tell, was expert agreement... at 90%. All of our customers would fire us at that level across all of the different B2B domains we work in. I'm surprised 90% is considered acceptable quality in a paying business context like retail.
2. `gpt-4o-mini` is great. I find that, for the kind of simple tasks you describe, we can get `gpt-4o-mini` to achieve about 95-98% agreement with `gpt-4o` by iteratively and manually improving prompts over increasingly large synthetic evals. Given data and a good dev, we do this basically same-day for a lot of simple tasks, which is astounding.
I do expect automatic prompt optimizers to win here long-term, and I keep hopefully revisiting DSPy et al. For now, they underperform standard prompt engineering. Likewise, I do believe in example learning over time for areas like personalization... but doing semantic-search recall of high-rated answers was a V1 thing we had to rethink due to too many issues.
It's, admittedly, a tough task to measure objectively though, in that it's like a code review. If a Principal Engineer pointed out 20 deficiencies in a code change and another Principal Engineer pointed out 18 of the same 20 things, but also pointed out 3 other things that the first reviewer didn't, it doesn't necessarily mean either review is wrong – they just meaningfully deviate from each other.
In this case, we chose an expert that we treat as an objective "source of truth".
re: simple tasks – We run hundreds of thousands of tasks every month with more-or-less deterministic behavior (in that, we'll reliably do it correctly a million out of a million times). We chose a particularly challenging task for the case-study though.
re: in a paying business context – FWIW, most industries are filled with humans doing tasks where the rate of perfection is far below 90%.
search(T,θ,m) retrieves the first m historical tasks that are semantically similar above the θ threshold
Are both m's here the same or different numbers? I found this a bit confusing
You could, for example, include all few-shots that meet the similarity threshold, but you’ll use more tokens for (I assume) marginal gain. Definitely worth a try though.
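For concreteness, here's a minimal sketch of what a `search(T, θ, m)` along those lines might look like, assuming task embeddings are precomputed by some embedding model. All names here are hypothetical illustrations, not the author's actual implementation:

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class HistoricalTask:
    text: str
    answer: str
    embedding: list  # precomputed, e.g. by an embedding model

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(task_embedding, history, theta, m):
    """Return up to m historical tasks whose similarity to the new task
    exceeds theta, most similar first."""
    scored = [(cosine(task_embedding, h.embedding), h) for h in history]
    above = [(s, h) for s, h in scored if s >= theta]
    above.sort(key=lambda sh: sh[0], reverse=True)
    return [h for _, h in above[:m]]
```

Whether you take the top m by similarity (as sketched) or the first m found is an implementation choice; sorting by similarity makes the token budget go to the closest matches.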
Phrased differently, when a task has many valid and correct conclusions, this technique allows the LLM to see "How did I do similar tasks before?" and it'll tend to solve new tasks by making similar decisions it made for previous similar tasks.
Two things to note:
- You'll typically still want to have some small epsilon where you choose to run the task without few-shots. This will help prevent mistakes from propagating forward indefinitely.
- You can have humans correct historical examples, and use their feedback to improve the large model dynamically in real-time. This is basically FSKD where the human is the "large model" and the large foundation model is the "small model".

I genuinely think we're only 2 years away from full custom local voice-to-voice LLM assistants that grow with you, like JOI in BR2049, and it's going to change how we think about being human and being social, and how we grow up.
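The epsilon note in the first bullet above can be sketched roughly like this (hypothetical names, not anyone's actual implementation):

```python
import random

def build_prompt(task, few_shots, epsilon=0.05, rng=random):
    """With probability epsilon, deliberately run the task without
    few-shots, so occasional baseline runs keep past mistakes from
    propagating forward indefinitely."""
    if few_shots and rng.random() >= epsilon:
        return "\n\n".join(few_shots) + "\n\nTask: " + task
    return "Task: " + task
```

The right epsilon depends on how costly a bad few-shot lineage is versus how much the few-shots help on average.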
I've been experimenting with running local LLMs for nearly two years now, ever since the first LLaMA release back in March 2023.
About six months ago I had mostly lost interest in them. They were fun to play around with but the quality difference between the ones I could run on my MacBook and the ones I could access via an online API felt insurmountable.
This has completely changed in the second half of 2024. The models I can run locally had a leap in quality - they feel genuinely GPT-4 class now.
They're not as good as the best hosted models (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet) but they're definitely good enough to be extremely useful.
This started with the Qwen 2 and 2.5 series, but I also rate Llama 3.3 70B and now Phi-4 as GPT-4 class models that run on my laptop.
I wrote more about this here: https://simonwillison.net/2024/Dec/31/llms-in-2024/#some-of-...
A 'word calculator' this effective is the best substitute we have for a logic calculator. And the fact that it's enough in 90% of situations is as terrifying as it is transformative, as is the fact that no one is awake to it.
Exponential power scaling in an unstable world feels like it only makes it exponentially more unstable though.
BTW, a few days ago I published a book on using Ollama. Here is a link to read it online https://leanpub.com/ollama/read
In fact, during the onboarding process they ask the user to choose which AI companion movie they relate to the most: Her, BR2049 or Ex Machina. The experience is then tailored to align more closely with the movie chosen.
It's quite a terrible app from a product design perspective: filled with dark patterns (like sending the user blurred images to "unlock") and upsells, but it's become successful amongst the masses that have adopted it, which I find fascinating. 30M+ users https://en.wikipedia.org/wiki/Replika#:~:text=Replika%20beca....
She appears to be a local model runnable on a small device without cloud.
I don’t see anything in the tech that indicates a singular pattern that will be “good” or “bad”.
Hunyuan (open source video) has been remarkable. Flux dev makes some incredible images.
It's hard to wrap my head around the fact that it's still only going to get better from here.
For these models probably no. But for proprietary things that are mission critical and purpose-built (think Adobe Creative Suite) the calculus is very different.
MS, Google, and Amazon all win from providing infra for open-source models. I have no idea what game Meta is playing.
Based on their business moves in recent history, I’d guess most of them are playing Farmville.
I think they're commoditizing their complement [1]. Engaging content helps Meta, and LLMs make it easier to create that content. Their business model has never been selling API access and releasing the model enables the community to improve it for them.
> I think the strategy is now offer cheap and performant infra to run the models.
Is this not what Microsoft is doing? What can Microsoft possibly lose by releasing a model?
This is a very clever move by Microsoft. OpenAI has no technological moat and is a very unreliable partner.
Yes, because you can't build a moat. Open source will very quickly catch up.
Unfortunately I'm only getting 6 tok/s on an NVidia A4000, so it's still not great for real-time queries. But luckily, now that it's MIT licensed, it's available on OpenRouter [2] for a great price of $0.07/$0.14 per million tokens at a fast 78 tok/s.
Because it yields better results and we're able to self-host Phi-4 for free, we've replaced Mistral NeMo with it in our default models for answering new questions [3].
[1] https://pvq.app/leaderboard
Edit: they have a blog post https://pvq.app/posts/individual-voting-comparison although it could go deeper
We would have liked to pick a neutral model like Gemini, which was fast, reliable and low cost; unfortunately it gave good grades to too many poor answers [1]. If we had to pick a new grading model now, hopefully the much-improved Gemini Flash 2.0 might yield better results.
[1] https://pvq.app/posts/individual-voting-comparison#gemini-pr...
The one red flag with Phi-4 is that its IFEval score is relatively low. IFEval tests for specific types of constraints (forbidden words, capitalization, etc.) [2], but it's one area especially worth keeping an eye on for those testing Phi-4 for themselves...
[1] https://docs.google.com/spreadsheets/u/3/d/18n--cIaVt49kOh-G...
[2] https://github.com/google-research/google-research/blob/mast...
Did it have a different license before? If so, why did they change it?
Then a quick search revealed you can, as of a few weeks ago.
Lots of other models will work nearly as well though if you just give them a clear schema to follow and ask them to output json only, then parse it yourself. Like I've been using gemma2:9b to analyze text and output a json structure and it's nearly 100% reliable despite it being a tiny model and not supporting tools or structured output officially.
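The "parse it yourself" step is mostly about tolerating models that wrap their JSON in markdown fences or surrounding chatter. A rough sketch of that, with a hypothetical schema prompt (the helper assumes a single flat JSON object in the reply):

```python
import json
import re

# Hypothetical schema hint appended to the prompt sent to the model.
SCHEMA_HINT = (
    "Respond with JSON only, matching this schema:\n"
    '{"sentiment": "positive" | "negative" | "neutral", "confidence": number}'
)

def extract_json(raw):
    """Pull the first JSON object out of a model reply, tolerating
    ```json fences or chatter around it."""
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(candidate[start:end + 1])
```

With a retry-on-ValueError loop around the model call, this gets small models without official structured-output support surprisingly close to reliable.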
Edit: so, for example, if you want the unsloth "debugged" version of Phi-4, you would run:
`$ ollama pull hf.co/unsloth/phi-4-GGUF:Q8_0`
(check on the right side of the hf.co/unsloth/phi-4-GGUF page for the available quants)
For the Phi-4 uploaded to Ollama, the hyperparameters were set to avoid the error. The error should stop occurring in the next version of Ollama [2] for imported GGUF files as well
In retrospect, a new architecture name should probably have been used entirely, instead of re-using "phi3".
[1]: https://news.ycombinator.com/item?id=42660335 Phi-4 Bug Fixes
Also on hugging face https://huggingface.co/microsoft/phi-4
For context: I've made some simple neural nets with backprop. I read [1].
The gist of it is that they curated a smaller, high-quality synthetic dataset from textbooks, problem sets, etc., instead of dumping in a massive dataset with tons of information.
Does this mean the model was trained without copyright infringements?
llama.cpp basically dropped support for multimodal vision models. Ollama still does support them, but only a handful. Also, Ollama still does not support Vulkan, even though llama.cpp has had Vulkan support for a long, long time now.
This has been very sad to watch. I'm more and more convinced that vllm is the way to go, not ollama.