- They compare the performance of this model to the worst 7B code llama model. The base code llama 7B python model scores 38.4% on humaneval, versus the non-python model, which only scores 33%.
- They compare their instruct tuned model to non-instruct-tuned models. Instruction tuning can add 20% or more to humaneval performance. For example, WizardLM 7B scores 55% on humaneval [1], and I've trained a 7B model that scores 62% [2].
- For another example of instruction tuning, StableCode's instruct-tuned version benchmarks at 26%, not the 20% they cite for the base model [3]
- Starcoder, when prompted properly, scores 40% on humaneval [4]
- They do not report their base model performance (as far as I can tell)
This is interesting work, and a good contribution, but it's important to compare similar models.
[1] https://github.com/nlpxucan/WizardLM
[2] https://huggingface.co/vikp/llama_coder
[3] https://stability.ai/blog/stablecode-llm-generative-ai-codin...
[4] https://github.com/huggingface/blog/blob/main/starcoder.md
> They compare the performance of this model to the worst 7B code llama model. The base code llama 7B python model scores 38.4% on humaneval, versus the non-python model, which only scores 33%.
We are comparing multilingual models, and we are not focused on python-finetuned versions
> They compare their instruct tuned model to non-instruct-tuned models. Instruction tuning can add 20% or more to humaneval performance. For example, WizardLM 7B scores 55% on humaneval [1], and I've trained a 7B model that scores 62% [2].
> For another example of instruction tuning, Stablecode instruct tuned benchmarks at 26%, not the 20% they cite for the base model [3]
We have two separate comparisons (see https://huggingface.co/smallcloudai/Refact-1_6B-fim): one for completion-based models and one for instruction-following models, with different HumanEval formats. But we consider our model a completion (FIM) model first and foremost, and we used 85% non-instruction-following data to make the final model. Chat functionality is really limited for models this small.
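For readers unfamiliar with FIM: the model fills in a gap between a prefix and a suffix rather than only continuing text. A minimal sketch of building such a prompt, assuming the StarCoder-style `<fim_prefix>`/`<fim_suffix>`/`<fim_middle>` special tokens (check the model card for the exact tokens a given model uses):

```python
# Sketch of PSM (prefix-suffix-middle) fill-in-the-middle prompting.
# The <fim_*> token names follow the StarCoder convention and are an
# assumption here; other models may use different special tokens.
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a FIM prompt; the model then generates the missing middle."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = build_fim_prompt(
    prefix="def hello():\n    print(",
    suffix=")\n",
)
print(prompt)
```

The model's completion (the "middle") is everything it emits after `<fim_middle>`, which the editor splices between the prefix and suffix.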
> Starcoder, when prompted properly, scores 40% on humaneval
Yep, that is right. But it's worth mentioning that StarCoder showed 40% while being further finetuned exclusively on Python.
> They do not report their base model performance (as far as I can tell)
Our base model gets around 20-23% on HumanEval. But that's not a representative number, since the model was trained on 50% non-code data (given the model's size, it was really hard to keep the model converging).
The Open-RAIL license seems to reference some sort of limitations on safety and unethical use, but I can't see where in the repo it's spelled out precisely what the authors have in mind.
This is not really true. Llama 7B runs with Vulkan/llama.cpp on ~8GB smartphones and ~12GB laptops. That ease is going to get much better over time, as lower RAM hardware starts dropping out of the market and the Vulkan implementations get more widespread.
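A rough back-of-envelope for why a 7B model fits on an ~8GB device once quantized (approximate numbers only; a real runtime also needs room for the KV cache and other overhead):

```python
# Approximate weight memory for an LLM: parameters * bits-per-weight / 8.
# Uses decimal GB (1e9 bytes) for simplicity; KV cache and runtime
# overhead are ignored, so treat these as lower bounds.
def model_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B @ {bits}-bit ~= {model_weight_gb(7, bits):.1f} GB")
# 7B @ 16-bit ~= 14.0 GB, 8-bit ~= 7.0 GB, 4-bit ~= 3.5 GB
```

At 4-bit quantization the weights alone are ~3.5 GB, which is why an 8GB phone is workable and a 12GB laptop is comfortable.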
For users trying to run LLMs on 8GB or less machines, the AI Horde approach of distributed models seems much more practical anyway.
(Not even /s - while the developers of LLM applications may have 64GB RAM in their laptops or desktops, the less-technical early adopters of LLMs running locally are likely to be power users with lower-powered laptops, much more stringent RAM limits, and numerous line-of-business applications and browser tabs contending for that RAM. Causing those applications to be swapped onto disk will almost certainly result in a degraded overall experience that could easily be blamed on the LLM application itself.)
But for code completion in an IDE where it has to react as you type, every 100 millisecond delay in response time is noticeable.
Even with a 24GB GPU, a 7B model doesn't feel snappy enough for code-completion in an IDE.
Would that be enough? shrug
i wasn't aware this was a term of art. is there a definitive blog post or product explaining this approach?
RAM is only about 6x the speed of SSDs for sequential access. Most people don't actually need truly random access to much data; they're mostly streaming video or loading video game assets to their GPU. So they shift spending to other components like the video card or monitors that actually provide significant value.
Which is how you get people with 16 GB of system RAM using graphics cards that also have 16GB of RAM.
What is the point of a new model that isn’t better than the best possible model (example: OpenAI GPT-4)?
What’s the point in having a smaller model? Who cares?
—-
This is a real, genuine question that I don’t have a clear answer to. Excuse my ignorance, plz enlighten your boi.
Here are a few use cases where I wouldn’t want to use OpenAI/GPT:
- Advanced autocomplete for texting and private communications
- Querying sensitive document databases like emails
- Traveling in low connectivity areas
- Politically incorrect use cases (generating erotic content, for example)
List kinda goes on and on
GPT4 can't even be finetuned at the moment (though I expect that to change).
- You can fine tune these models for very specific tasks, which GPT-4 might not be as good at.
- Open source models are free. You can use them as much as you want without worrying about a $xx,xxx bill at the end of the month which makes tinkering with them easier.
- Smaller models like this can run on consumer hardware, even phones, and can run offline.
- Privacy and not having to abide by a third party's terms. You don't have to deal with "As a large language model...", especially with uncensored models.
- Tools like jsonformer https://github.com/1rgs/jsonformer are not possible with OpenAI's API.
- It's also just really cool, let's be honest.
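To illustrate the jsonformer point: with local logits you can emit the JSON skeleton verbatim and only ask the model to fill in the values, so the output is structurally valid by construction. A toy sketch of that idea (`fake_model` is a stand-in for a real LM call; this is not jsonformer's actual API):

```python
import json

# Toy illustration of schema-constrained generation: the JSON structure
# is produced by the harness, and the model only supplies field values.
# `fake_model` is a hypothetical stand-in for a real language-model call.
def fake_model(prompt: str, field: str) -> str:
    canned = {"name": "Ada", "age": "36"}
    return canned[field]

def generate_json(schema: dict, prompt: str) -> dict:
    out = {}
    for field, ftype in schema["properties"].items():
        raw = fake_model(prompt, field)
        # Coerce to the schema's declared type; a real tool would instead
        # mask invalid tokens during decoding.
        out[field] = int(raw) if ftype["type"] == "number" else raw
    return out

schema = {"properties": {"name": {"type": "string"}, "age": {"type": "number"}}}
result = generate_json(schema, "Describe a person")
print(json.dumps(result))  # {"name": "Ada", "age": 36}
```

The real tools do this at the token level, masking any token that would break the schema, which is exactly what a text-in/text-out API can't offer.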
2) any model that runs on computational resources you own or lease will have more privacy than an explicit cloud offering. running completely on your own local hardware will be private. this means you don't have to think twice about asking the LLM about the proprietary code or information you are working on.
3) smaller models gain the performance improvements from all the other improvements in interpreters and quantizing, allowing for even more consumer friendly offline use
4) oh yeah, offline use. could expand use cases to having LLMs baked into operating systems directly, including leading phones
5) showing what's possible, pushing towards the benchmarks of the best possible model while using less computational resources. this also makes the hosts of the best possible model realize that they could either A) be using less computational resources and increasing the bandwidth for their users B) further improve their own model because of competition. Basically if ChatGPT 4 was using similar improvements in technology across all areas of reasoning/whatever, there never would have been a rate limit on ChatGPT 4.
6) more demand for other computational resources. Nvidia is backordered till maybe Q2 2024 right now. If people realize AMD or even their ARM chips can offer same performance with the right combination of hardware and software, It alleviates pressure on other ventures that want computation power.
- You can run it behind an air-gap, where your systems are disconnected from the world.
- You can run it on the edge with low or no internet connectivity
- You do not need to worry about breaching geographic data restrictions, e.g.: medical data from Country X cannot leave Country X
I think the point is to reach a baseline of something being super lightweight yet still useful that could be production for a number of use cases.
The web interface for the LLM server is especially nice and clean compared to many of the others I've tried - and it "just works". Very interested to see how this evolves.
my understanding is that there are 2 usages of the pass@{number} syntax. the HumanEval/Codex paper interprets the {number} as the number of attempts [0]. however, language modelers seem to use it to denote the number of few-shot example demonstrations given in the context. these are starkly different, and i wish the syntax wasn't overloaded
---
[0] https://arxiv.org/pdf/2107.03374.pdf
> Kulal et al. (2019) evaluate functional correctness using the pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported.
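For reference, the Codex paper's unbiased estimator for pass@k is simple to compute from n samples per problem, of which c pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex/HumanEval paper.
    n: total samples generated per problem
    c: samples that passed the unit tests
    k: attempt budget
    """
    if n - c < k:
        # Too few failing samples to fill a size-k subset: every subset
        # of k samples contains at least one pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(200, 50, 1))  # 0.25 (for k=1 this is exactly c/n)
```

The paper computes it this way (rather than naively resampling) because it is an exact expectation over all size-k subsets of the n generations.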
I have a feeling that the more robust models might be the ones that don’t perform best on benchmarks right away.
I've been testing the best performing models on the huggingface leaderboard lately. Some of them are really impressive, and others are so bad that I second guess the prompt format or if the benchmarked model is actually the same one I'm testing.
See last page for restrictions
Darn! Foiled again! I was planning on breaking some federal laws, but the license says that I can't ;( \s
Open-RAIL license has to be the worst license in existence claiming to be "open".
> You shall undertake reasonable efforts to use the latest version of the Model.
Plea to folks releasing models: Please stop using this user-hostile and deranged license
E.g. a model specializing in chemistry doesn't need to include data on world's history or to be able to write poetry.
I don't know enough about fine-tuning; I'm not sure if the process is capable of removing "unused" parts of the model (I guess it's not possible, similar to un-learning).
We’re using formal logic in the form of abstract rewrite systems over a causal graph to perform geometric deep learning. In theory it should be able to learn the same topological structure of data that neural networks do, but using entirely discrete operations and without the random walk inherent to stochastic gradient descent.
Current experiments are really promising, and assuming the growth curve continues as we scale up you should be able to train a GPT-4 scale LLM in a few weeks on commodity hardware (we are using a desktop with 4 4090’s currently), and be able to do both inference and continual fine tuning/online learning on device.
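For anyone unfamiliar with the formalism: an abstract rewrite system repeatedly applies rules until no rule fires (a normal form). A minimal string-rewriting toy, purely to illustrate the concept and not the graph-based system described above:

```python
# Minimal abstract rewrite system: apply rules until a fixpoint
# (normal form) is reached. A toy illustration of the formalism only;
# real systems rewrite terms/graphs, not flat strings.
RULES = [
    ("x+0", "x"),  # additive identity
    ("x*1", "x"),  # multiplicative identity
    ("x*0", "0"),  # annihilation
]

def normalize(term: str, max_steps: int = 100) -> str:
    for _ in range(max_steps):
        nxt = term
        for lhs, rhs in RULES:
            nxt = nxt.replace(lhs, rhs)
        if nxt == term:
            return term  # no rule fired: normal form reached
        term = nxt
    return term

print(normalize("x*1*1+0"))  # x
```

The "entirely discrete" claim in the comment maps onto this: each step is an exact rule application rather than a gradient update.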
Abstract rewrite like a computer algebra system's (e.g. Wolfram) term rewriting equation simplication method?
Drop me a link at (my first name) @ brainchain dot AI if you'd like to chat, I'd love to hear more about what you're working on!
This isn't how Sparse MoE models work. There isn't really any complexity like that. And different models will or can pick each token.
Sparse models aren't an ensemble of models.
My sincere belief is that local models are the way of the future, with flexible base models adapted via LoRA and context to specific use cases. I think open source models and techniques are inexorable at this point barring some sort of regulatory moat, and will rival commercial models in all but extreme cases.
At all other times I support tech freedom. I use libre software, I use Tor, I donate to privacy and FOSS organizations constantly. I only write my software projects under an AGPL license. AI is qualitatively different. A world run amok with intelligent infinite Sybils is not good for anyone. I hope massive compute continues to be necessary, it may be the only hard chokepoint we have to keep a handle on the beast.
I agree with and appreciate the sentiment, but it feels way too late for that. These people do have and exert direct control over pretty much all of our digital devices. It's funny (or sad) that we only seem to care about this when shiny doodads like AI come around every so-often.
I can already run Llama 2 @70b on my laptop, and that’ll look like a quaint old AI artifact in 5-7 years. I think the consumer market will keep pace yet stay well below SotA, just as it always has. That still leaves plenty of room for incredible open-source stuff!
It performs much better than all of the code models of similar size, and almost reaches StarCoder's HumanEval score while being 10x smaller.
With its small size, it can work on most modern GPUs, requiring just 3GB of RAM.
You can try self-hosting it in Refact https://github.com/smallcloudai/refact/ and get a local fast copilot alternative with decent suggestions.
Weights and model card https://huggingface.co/smallcloudai/Refact-1_6B-fim.
We would love to hear your feedback!
I see the model type "gpt_refact" in https://huggingface.co/smallcloudai/Refact-1_6B-fim/blob/mai...
how can you tell that HumanEval is not leaked to your training data in some form?
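The usual answer is n-gram decontamination: flag any training document that shares a long n-gram with a benchmark prompt. A hedged sketch of the idea (whitespace tokenization and n=13 are simplifying assumptions; real pipelines tokenize and normalize more carefully):

```python
# Sketch of n-gram decontamination, the common way trainers check for
# benchmark leakage: a training document is flagged if it shares any
# long n-gram with a benchmark prompt. n=13 follows common practice;
# whitespace splitting stands in for real tokenization.
def ngrams(text: str, n: int) -> set:
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc: str, benchmark_prompt: str, n: int = 13) -> bool:
    return bool(ngrams(train_doc, n) & ngrams(benchmark_prompt, n))

print(is_contaminated("x = 1 print(x)", "totally unrelated text here", n=2))  # False
```

This only catches near-verbatim leakage, of course; paraphrased or translated copies of the benchmark slip through, which is part of why the question is a fair one.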
https://algora.io/org/smallcloudai/bounties
disclaimer: i'm a cofounder of algora, the platform enabling these bounties
bigscience-openrail-m
https://huggingface.co/smallcloudai/Refact-1_6B-fim/blob/mai...