- They compare the performance of this model to the worst 7B code llama model. The base code llama 7B python model scores 38.4% on humaneval, versus the non-python model, which only scores 33%.
- They compare their instruct tuned model to non-instruct-tuned models. Instruction tuning can add 20% or more to humaneval performance. For example, WizardLM 7B scores 55% on humaneval [1], and I've trained a 7B model that scores 62% [2].
- For another example of instruction tuning, StableCode's instruct-tuned version benchmarks at 26%, not the 20% they cite for the base model [3]
- Starcoder, when prompted properly, scores 40% on humaneval [4]
- They do not report their base model performance (as far as I can tell)
This is interesting work, and a good contribution, but it's important to compare similar models.
[1] https://github.com/nlpxucan/WizardLM
[2] https://huggingface.co/vikp/llama_coder
[3] https://stability.ai/blog/stablecode-llm-generative-ai-codin...
[4] https://github.com/huggingface/blog/blob/main/starcoder.md
> They compare the performance of this model to the worst 7B code llama model. The base code llama 7B python model scores 38.4% on humaneval, versus the non-python model, which only scores 33%.
We are comparing multilingual models, and we are not focused on python-finetuned versions
> They compare their instruct tuned model to non-instruct-tuned models. Instruction tuning can add 20% or more to humaneval performance. For example, WizardLM 7B scores 55% on humaneval [1], and I've trained a 7B model that scores 62% [2].
> For another example of instruction tuning, Stablecode instruct tuned benchmarks at 26%, not the 20% they cite for the base model [3]
We have two separate comparisons (see https://huggingface.co/smallcloudai/Refact-1_6B-fim): one for completion-based models and one for instruction-following models, with different HumanEval formats. But we consider our model a completion (FIM) model first and foremost, and we used 85% non-instruction-following data to make the final model. Chat functionality is really limited for models this small.
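For readers unfamiliar with FIM: the model fills in a gap between a prefix and a suffix rather than only continuing text. A minimal sketch of building such a prompt, assuming the StarCoder-style `<fim_prefix>`/`<fim_suffix>`/`<fim_middle>` special tokens (check the model card for the exact tokens a given model uses):

```python
# Sketch of PSM (prefix-suffix-middle) fill-in-the-middle prompting.
# The <fim_*> token names follow the StarCoder convention and are an
# assumption here; other models may use different special tokens.
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a FIM prompt; the model then generates the missing middle."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = build_fim_prompt(
    prefix="def hello():\n    print(",
    suffix=")\n",
)
print(prompt)
```

The model's completion (the "middle") is everything it emits after `<fim_middle>`, which the editor splices between the prefix and suffix.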
> Starcoder, when prompted properly, scores 40% on humaneval
Yep, that is right. But it's worth mentioning that StarCoder showed 40% while being further finetuned exclusively on Python.
> They do not report their base model performance (as far as I can tell)
Our base model gets around 20-23% on HumanEval. But that's not a representative number, since the model was trained on 50% non-code data (given the model's size, it was really hard to keep the model converging).
The Open-RAIL license seems to reference some sort of limitations on safety and unethical use, but I can't see where in the repo it's spelled out precisely what the authors have in mind.
This is not really true. Llama 7B runs with Vulkan/llama.cpp on ~8GB smartphones and ~12GB laptops. That ease is going to get much better over time, as lower RAM hardware starts dropping out of the market and the Vulkan implementations get more widespread.
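A rough back-of-envelope for why a 7B model fits on an ~8GB device once quantized (approximate numbers only; a real runtime also needs room for the KV cache and other overhead):

```python
# Approximate weight memory for an LLM: parameters * bits-per-weight / 8.
# Uses decimal GB (1e9 bytes) for simplicity; KV cache and runtime
# overhead are ignored, so treat these as lower bounds.
def model_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B @ {bits}-bit ~= {model_weight_gb(7, bits):.1f} GB")
# 7B @ 16-bit ~= 14.0 GB, 8-bit ~= 7.0 GB, 4-bit ~= 3.5 GB
```

At 4-bit quantization the weights alone are ~3.5 GB, which is why an 8GB phone is workable and a 12GB laptop is comfortable.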
For users trying to run LLMs on 8GB or less machines, the AI Horde approach of distributed models seems much more practical anyway.
(Not even /s - while the developers of LLM applications may have 64GB RAM in their laptops or desktops, the less-technical early adopters of LLMs running locally are likely to be power users with lower-powered laptops, much more stringent RAM limits, and numerous line-of-business applications and browser tabs contending for that RAM. Causing those applications to be swapped onto disk will almost certainly result in a degraded overall experience that could easily be blamed on the LLM application itself.)
But for code completion in an IDE where it has to react as you type, every 100 millisecond delay in response time is noticeable.
Even with a 24GB GPU, a 7B model doesn't feel snappy enough for code-completion in an IDE.
Would that be enough? shrug
i wasn't aware this was a term of art. is there a definitive blog post or product explaining this approach?
RAM is only about 6x the speed of SSDs for sequential access. Most people don't actually need truly random access to much data; they're mostly streaming video or loading video game assets to their GPU. So they shift spending to other components like the video card or monitors that actually provide significant value.
Which is how you get people with 16 GB of system RAM using graphics cards that also have 16GB of RAM.
What is the point of a new model that isn’t better than the best possible model (example: OpenAI GPT-4)?
What’s the point in having a smaller model? Who cares?
—-
This is a real, genuine question that I don’t have a clear answer to. Excuse my ignorance, plz enlighten your boi.
Here are a few use cases where I wouldn’t want to use OpenAI/GPT:
- Advanced autocomplete for texting and private communications
- Querying sensitive document databases like emails
- Traveling in low connectivity areas
- Politically incorrect use cases (generating erotic content, for example)
List kinda goes on and on
GPT4 can't even be finetuned at the moment (though I expect that to change).
- You can fine tune these models for very specific tasks, which GPT-4 might not be as good at.
- Open source models are free. You can use them as much as you want without worrying about a $xx,xxx bill at the end of the month which makes tinkering with them easier.
- Smaller models like this can run on consumer hardware, even phones, and can run offline.
- Privacy and not having to abide by a third party's terms. You don't have to deal with "As a large language model...", especially with uncensored models.
- Tools like jsonformer https://github.com/1rgs/jsonformer are not possible with OpenAI's API.
- It's also just really cool, let's be honest.
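To illustrate the jsonformer point: with local logits you can emit the JSON skeleton verbatim and only ask the model to fill in the values, so the output is structurally valid by construction. A toy sketch of that idea (`fake_model` is a stand-in for a real LM call; this is not jsonformer's actual API):

```python
import json

# Toy illustration of schema-constrained generation: the JSON structure
# is produced by the harness, and the model only supplies field values.
# `fake_model` is a hypothetical stand-in for a real language-model call.
def fake_model(prompt: str, field: str) -> str:
    canned = {"name": "Ada", "age": "36"}
    return canned[field]

def generate_json(schema: dict, prompt: str) -> dict:
    out = {}
    for field, ftype in schema["properties"].items():
        raw = fake_model(prompt, field)
        # Coerce to the schema's declared type; a real tool would instead
        # mask invalid tokens during decoding.
        out[field] = int(raw) if ftype["type"] == "number" else raw
    return out

schema = {"properties": {"name": {"type": "string"}, "age": {"type": "number"}}}
result = generate_json(schema, "Describe a person")
print(json.dumps(result))  # {"name": "Ada", "age": 36}
```

The real tools do this at the token level, masking any token that would break the schema, which is exactly what a text-in/text-out API can't offer.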
2) any model that runs on computational resources you own or lease will have more privacy than an explicit cloud offering. running completely on your own local hardware will be private. this means you don't have to think twice about asking the LLM about the proprietary code or information you are working on.
3) smaller models gain the performance improvements from all the other improvements in interpreters and quantizing, allowing for even more consumer friendly offline use
4) oh yeah, offline use. could expand use cases to having LLMs baked into operating systems directly, including leading phones
5) showing what's possible, pushing towards the benchmarks of the best possible model while using less computational resources. this also makes the hosts of the best possible model realize that they could either A) be using less computational resources and increasing the bandwidth for their users B) further improve their own model because of competition. Basically if ChatGPT 4 was using similar improvements in technology across all areas of reasoning/whatever, there never would have been a rate limit on ChatGPT 4.
6) more demand for other computational resources. Nvidia is backordered till maybe Q2 2024 right now. If people realize AMD or even their ARM chips can offer same performance with the right combination of hardware and software, It alleviates pressure on other ventures that want computation power.
- You can run it behind an air-gap, where your systems are disconnected from the world.
- You can run it on the edge with low or no internet connectivity
- You do not need to worry about breaching geographic data restrictions, e.g.: medical data from Country X cannot leave Country X
I think the point is to reach a baseline of something being super lightweight yet still useful that could be production for a number of use cases.
The web interface for the LLM server is especially nice and clean compared to many of the others I've tried - and it "just works". Very interested to see how this evolves.
my understanding is that there are 2 usages of the pass@{number} syntax. the HumanEval/Codex paper interprets the {number} as the number of attempts [0]. however, language modelers seem to use it to denote the number of few-shot example demonstrations given in the context. these are starkly different, and i wish the syntax wasn't overloaded
---
[0] https://arxiv.org/pdf/2107.03374.pdf
> Kulal et al. (2019) evaluate functional correctness using the pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported.
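For reference, the Codex paper's unbiased estimator for pass@k is simple to compute from n samples per problem, of which c pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex/HumanEval paper.
    n: total samples generated per problem
    c: samples that passed the unit tests
    k: attempt budget
    """
    if n - c < k:
        # Too few failing samples to fill a size-k subset: every subset
        # of k samples contains at least one pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(200, 50, 1))  # 0.25 (for k=1 this is exactly c/n)
```

The paper computes it this way (rather than naively resampling) because it is an exact expectation over all size-k subsets of the n generations.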
I have a feeling that the more robust models might be the ones that don’t perform best on benchmarks right away.
I've been testing the best performing models on the huggingface leaderboard lately. Some of them are really impressive, and others are so bad that I second guess the prompt format or if the benchmarked model is actually the same one I'm testing.
See last page for restrictions
Darn! Foiled again! I was planning on breaking some federal laws, but the license says that I can't ;( \s
Open-RAIL license has to be the worst license in existence claiming to be "open".
> You shall undertake reasonable efforts to use the latest version of the Model.
Plea to folks releasing models: Please stop using this user-hostile and deranged license
E.g. a model specializing in chemistry doesn't need to include data on world's history or to be able to write poetry.
I don't know enough about fine-tuning; I'm not sure if the process is capable of removing "unused" parts of the model (I guess it's not possible, similar to un-learning).
We’re using formal logic in the form of abstract rewrite systems over a causal graph to perform geometric deep learning. In theory it should be able to learn the same topological structure of data that neural networks do, but using entirely discrete operations and without the random walk inherent to stochastic gradient descent.
Current experiments are really promising, and assuming the growth curve continues as we scale up you should be able to train a GPT-4 scale LLM in a few weeks on commodity hardware (we are using a desktop with 4 4090’s currently), and be able to do both inference and continual fine tuning/online learning on device.
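For anyone unfamiliar with the formalism: an abstract rewrite system repeatedly applies rules until no rule fires (a normal form). A minimal string-rewriting toy, purely to illustrate the concept and not the graph-based system described above:

```python
# Minimal abstract rewrite system: apply rules until a fixpoint
# (normal form) is reached. A toy illustration of the formalism only;
# real systems rewrite terms/graphs, not flat strings.
RULES = [
    ("x+0", "x"),  # additive identity
    ("x*1", "x"),  # multiplicative identity
    ("x*0", "0"),  # annihilation
]

def normalize(term: str, max_steps: int = 100) -> str:
    for _ in range(max_steps):
        nxt = term
        for lhs, rhs in RULES:
            nxt = nxt.replace(lhs, rhs)
        if nxt == term:
            return term  # no rule fired: normal form reached
        term = nxt
    return term

print(normalize("x*1*1+0"))  # x
```

The "entirely discrete" claim in the comment maps onto this: each step is an exact rule application rather than a gradient update.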
Abstract rewrite like a computer algebra system's (e.g. Wolfram) term rewriting equation simplication method?
Drop me a link at (my first name) @ brainchain dot AI if you'd like to chat, I'd love to hear more about what you're working on!
This isn't how Sparse MoE models work. There isn't really any complexity like that. And different models will or can pick each token.
Sparse models aren't an ensemble of models.
My sincere belief is that local models are the way of the future, with flexible base models adapted via LoRA and context to specific use cases. I think open source models and techniques are inexorable at this point barring some sort of regulatory moat, and will rival commercial models in all but extreme cases.
At all other times I support tech freedom. I use libre software, I use Tor, I donate to privacy and FOSS organizations constantly. I only write my software projects under an AGPL license. AI is qualitatively different. A world run amok with intelligent infinite Sybils is not good for anyone. I hope massive compute continues to be necessary, it may be the only hard chokepoint we have to keep a handle on the beast.
I agree with and appreciate the sentiment, but it feels way too late for that. These people do have and exert direct control over pretty much all of our digital devices. It's funny (or sad) that we only seem to care about this when shiny doodads like AI come around every so-often.
I can already run Llama 2 @70b on my laptop, and that’ll look like a quaint old AI artifact in 5-7 years. I think the consumer market will keep pace yet stay well below SotA, just as it always has. That still leaves plenty of room for incredible open-source stuff!
It performs much better than all of the code models of similar size, and almost reaches StarCoder's HumanEval score while being 10x smaller.
With its small size, it can work on most modern GPUs, requiring just 3GB of RAM.
You can try self-hosting it in Refact https://github.com/smallcloudai/refact/ and get a local fast copilot alternative with decent suggestions.
Weights and model card https://huggingface.co/smallcloudai/Refact-1_6B-fim.
We would love to hear your feedback!
I see the model type "gpt_refact" in https://huggingface.co/smallcloudai/Refact-1_6B-fim/blob/mai...
how can you tell that HumanEval is not leaked to your training data in some form?
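The usual answer is n-gram decontamination: flag any training document that shares a long n-gram with a benchmark prompt. A hedged sketch of the idea (whitespace tokenization and n=13 are simplifying assumptions; real pipelines tokenize and normalize more carefully):

```python
# Sketch of n-gram decontamination, the common way trainers check for
# benchmark leakage: a training document is flagged if it shares any
# long n-gram with a benchmark prompt. n=13 follows common practice;
# whitespace splitting stands in for real tokenization.
def ngrams(text: str, n: int) -> set:
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc: str, benchmark_prompt: str, n: int = 13) -> bool:
    return bool(ngrams(train_doc, n) & ngrams(benchmark_prompt, n))

print(is_contaminated("x = 1 print(x)", "totally unrelated text here", n=2))  # False
```

This only catches near-verbatim leakage, of course; paraphrased or translated copies of the benchmark slip through, which is part of why the question is a fair one.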
https://algora.io/org/smallcloudai/bounties
disclaimer: i'm a cofounder of algora, the platform enabling these bounties
bigscience-openrail-m
https://huggingface.co/smallcloudai/Refact-1_6B-fim/blob/mai...