Microgpt (opens in new tab)

(karpathy.github.io)

1936 pointstambourine_man25d ago329 comments

329 comments

Someone has modified microgpt to build a tiny GPT that generates Korean first names, and created a web page that visualizes the entire process [1].

Users can interactively explore the microgpt pipeline end to end, from tokenization until inference.

[1] English GPT lab:

https://ko-microgpt.vercel.app/

camkego24d ago

I have no affiliation with the website, but the website is pretty neat if you are learning LLM internals. It explains: Tokenization, Embedding, Attention, Loss & Gradient, Training, Inference and comparison to "Real GPT"

Pretty nifty. Even if you are not interested in the Korean language

sprobertson24d ago

This kind of thing is pretty easy to do with a much leaner model https://docs.pytorch.org/tutorials/intermediate/char_rnn_gen...

arijun23d ago

I assume the goal isn't to generate Korean names but to learn GPTs.

1 more reply

love2read24d ago

By "modified" this person of course means that they swapped out the list of X0,000 names from English to Korean names. That is seemingly the only change.

The attached website is a fully ai-generated "visualization" based on the original blog post with little added.

big-chungus424d ago

It's a good website and probably AI generated with some insane expensive model that us mere mortals are too poor to afford, thus it has a value

sahildeepreel23d ago

so impressive!

verma725d ago

I wrote a C++ translation of it: https://github.com/verma7/microgpt/blob/main/microgpt.cc

2x the number of lines of code (~400L), 10x the speed

The hard part was figuring out how to represent the Value class in C++ (ended up using shared_ptrs).

WithinReason24d ago

I made an explicit reverse pass (no autodiff), it was 8x faster in Python

hu324d ago

I made an explicit double-reverse pass (no code!), it was 80x faster in my head!

2 more replies

bear3r24d ago

tradeoff worth naming: you avoid the autodiff graph overhead (hence the speedup), but any architecture change means rewriting every gradient by hand. fine for a pedagogical project, but that's exactly why autodiff exists.

love2read24d ago

Can you share a link?

1 more reply

freakynit20d ago

24x speedup (over 10x already) and similar loss profile (for c++ version, optimized by claude): https://gist.github.com/freakynit/3982eab8413a89941bd0018e63......

verma718d ago

This is amazing! Thanks for optimizing the code using Claude!

red_hare25d ago

This is beautiful and highly readable but, still, I yearn for a detailed line-by-line explainer like the backbone.js source: https://backbonejs.org/docs/backbone.html

tomjakubowski24d ago

I believe that Backbone's annotated source is generated with Docco, another project from the creator of CoffeeScript.

https://ashkenas.com/docco/

It's really neat. I wish I published more of my code this way.

ashish0125d ago

That is really beautiful literate program. Seeing it after a long time. Here is a opus generate version of this code - https://ashish01.github.io/microgpt.html

subset25d ago

Andrej Karpathy has a walkthrough blog post here: https://karpathy.github.io/2026/02/12/microgpt/

OJFord24d ago

That is the article being discussed.

1 more reply

altcognito25d ago

ask a high end LLM to do it

subset25d ago

I had good fun transliterating it to Rust as a learning experience (https://github.com/stochastical/microgpt-rs). The trickiest part was working out how to represent the autograd graph data structure with Rust types. I'm finalising some small tweaks to make it run in the browser via WebAssmebly and then compile it up for my blog :) Andrej's code is really quite poetic, I love how much it packs into such a concise program

amelius24d ago

Storing the partial derivatives into the weights structure is quite the hack, to be honest. But everybody seems to do it like that.

hei-lima24d ago

Great work! Might do it too in some other language...

thomasmg24d ago

I got a convertion to Java. It worked (at least I think...) in the first try.

Then I want to convert this to my own programming language (which traspiles to C). I like those tiny projects very much!

1 more reply

pmarreck24d ago

Zig, here.

Anything but Python

1 more reply

jtesp23d ago

how did you do the transliteration/port?

subset21d ago

Handwritten! (aka no LLM assistance :) It wasn't transpiled or anything like that. I've been meaning to post a little about it on my blog; just been caught up with other stuff atm.

One thing that was a _little_ frustrating coming from Python, though, was the need to rely on crates for basic things like random number generation and network requests. It pulls in a lot, even if you only need a little. I understand the Rust community prefers it that way as it's easier to evolve rather than be stuck with backwards-compatability requirements. But I still missed "batteries included" Python.

1 more reply

geokon24d ago

> What’s the deal with “hallucinations”? The model generates tokens by sampling from a probability distribution. It has no concept of truth, it only knows what sequences are statistically plausible given the training data.

Extremely naiive question.. but could LLM output be tagged with some kind of confidence score? Like if I'm asking an LLM some question does it have an internal metric for how confident it is in its output? LLM outputs seem inherently rarely of the form "I'm not really sure, but maybe this XXX" - but I always felt this is baked in the model somehow

andy12_24d ago

The model could report the confidence of its output distribution, but it isn't necessarily calibrated (that is, even if it tells you that it's 70% confident, it doesn't mean that it is right 70% of the time). Famously, pre-trained base models are calibrated, but they stop being calibrated when they are post-trained to be instruction-following chatbots [1].

Edit: There is also some other work that points out that chat models might not be calibrated at the token-level, but might be calibrated at the concept-level [2]. Which means that if you sample many answers, and group them by semantic similarity, that is also calibrated. The problem is that generating many answer and grouping them is more costly.

[1] https://arxiv.org/pdf/2303.08774 Figure 8

[2] https://arxiv.org/pdf/2511.04869 Figure 1.

geokon24d ago

In absolute terms sure, but the token stream's confidence changes as it's coming out right? Consumer LLMs typically have a lot window dressing. My sense is this encourages the model to stay on-topic and it's mostly "high confidence" fluff. As it's spewing text/tokens back at you maybe when it starts hallucinating you'd expect a sudden dip in the confidence?

You could color code the output token so you can see some abrupt changes

It seems kind of obvious, so I'm guessing people have tried this

1 more reply

chongli24d ago

Having a confidence score isn't as useful as it seems unless you (the user) know a lot about the contents of the training set.

Think of traditional statistics. Suppose I said "80% of those sampled preferred apples to oranges, and my 95% confidence interval is within +/- 2% of that" but then I didn't tell you anything about how I collected the sample. Maybe I was talking to people at an apple pie festival? Who knows! Without more information on the sampling method, it's hard to make any kind of useful claim about a population.

This is why I remain so pessimistic about LLMs as a source of knowledge. Imagine you had a person who was raised from birth in a completely isolated lab environment and taught only how to read books, including the dictionary. They would know how all the words in those books relate to each other but know nothing of how that relates to the world. They could read the line "the killer drew his gun and aimed it at the victim" but what would they really know of it if they'd never seen a gun?

radarsat124d ago

I think your last point raises the following question: how would you change your answer if you know they read all about guns and death and how one causes the other? What if they'd seen pictures of guns? And pictures of victims of guns annotated as such? What if they'd seen videos of people being shot by guns?

I mean I sort of understand what you're trying to say but in fact a great deal of knowledge we get about the world we live in, we get second hand.

There are plenty of people who've never held a gun, or had a gun aimed at them, and.. granted, you could argue they probably wouldn't read that line the same way as people who have, but that doesn't mean that the average Joe who's never been around a gun can't enjoy media that features guns.

Same thing about lots of things. For instance it's not hard for me to think of animals I've never seen with my own eyes. A koala for instance. But I've seen pictures. I assume they exist. I can tell you something about their diet. Does that mean I'm no better than an LLM when it comes to koala knowledge? Probably!

1 more reply

DavidSJ24d ago

Yes, the actual LLM returns a probability distribution, which gets sampled to produce output tokens.

[Edit: but to be clear, for a pretrained model this probability means "what's my estimate of the conditional probability of this token occurring in the pretraining dataset?", not "how likely is this statement to be true?" And for a post-trained model, the probability really has no simple interpretation other than "this is the probability that I will output this token in this situation".]

mr_toad24d ago

It’s often very difficult (intractable) to come up with a probability distribution of an estimator, even when the probability distribution of the data is known.

Basically, you’d need a lot more computing power to come up with a distribution of the output of an LLM than to come up with a single answer.

podnami24d ago

What happens before the probability distribution? I’m assuming say alignment or other factors would influence it?

1 more reply

Lionga24d ago

The LLM has an internal "confidence score" but that has NOTHING to do with how correct the answer is, only with how often the same words came together in training data.

E.g. getting two r's in strawberry could very well have a very high "confidence score" while a random but rare correct fact might have a very well a very low one.

In short: LLM have no concept, or even desire to produce of truth

sharperguy24d ago

Still, it might be interesting information to have access to, as someone running the model? Normally we are reading the output trying to build an intuition for the kinds of patterns it outputs when it's hallucinating vs creating something that happens to align with reality. Adding in this could just help with that even when it isn't always correlated to reality itself.

alexwebb224d ago

Huge leap there in your conclusion. Looks like you’re hand-waving away the entire phenomenon of emergent properties.

amelius24d ago

> In short: LLM have no concept, or even desire to produce of truth

They do produce true statements most of the time, though.

1 more reply

Otterly9923d ago

There is this paper that proposed data compression as a way to judge the ability of a LLM to "understand" things correctly, training on older texts and trying to predict more recent articles:

https://ar5iv.labs.arxiv.org/html//2402.00861

podnami24d ago

I would assume this is from case to case, such as:

- How aligned has it been to “know” that something is true (eg ethical constraints)

- Statistical significance and just being able to corroborate one alternative in Its training data more strongly than another

- If it’s a web search related query, is the statement from original sources vs synthesised from say third party sources

But I’m just a layman and could be totally off here.

jorvi24d ago

> I'm not really sure, but maybe this XXX

You never see this in the response but you do in the reasoning.

danlitt24d ago

Can it generate one? Sure. But it won't mean anything, since you don't know (and nobody knows) the "true" distribution.

kuberwastaken25d ago

I'm half shocked this wasn't on HN before? Haha I built PicoGPT as a minified fork with <35 lines of JS and another in python

And it's small enough to run from a QR code :) https://kuber.studio/picogpt/

You can quite literally train a micro LLM from your phone's browser

dang24d ago

Wow I agree - surprising that it took 2 weeks to make HN's frontpage.

We do generally like HN to be a bit uncorrelated with the rest of the internet, but it feels like a miss to me that neither https://news.ycombinator.com/item?id=47000263 nor https://news.ycombinator.com/item?id=47018557 made the frontpage.

3abiton24d ago

I think he caught some flack for promoting claudebot at that time, and giving it a rave review. Some people are hardliner. His work has always been amazing nonetheless.

1 more reply

cootsnuck25d ago

It was: https://news.ycombinator.com/item?id=47000263

iberator25d ago

[flagged]

dang24d ago

Please don't be a jerk on HN, and especially not when responding to someone's work. This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.

lelandfe24d ago

https://github.com/Kuberwastaken/picogpt/blob/main/picogpt.j...

kuberwastaken24d ago

lol there is source code as a gist

growingswe25d ago

Great stuff! I wrote an interactive blogpost that walks through the code and visualizes it: https://growingswe.com/blog/microgpt

O4epegb24d ago

> By the end of training, the model produces names like "kamon", "karai", "anna", and "anton". None of them are copies from the dataset.

All 4 are in the dataset, btw

mym199024d ago

This is likely because the blog is AI generated and keys off this point from Karpathy: "As a preview, by the end of the script our model will generate (“hallucinate”!) new, plausible-sounding names.", so the LLM just repackaged that into something that is obviously wrong, which is kind of ironic.

joenot44324d ago

This is awesome! Normally I'm pretty critical of LLM-assisted-blogging, but this one's a real winner.

evntdrvn24d ago

You should totally submit that to HN as an article, if you haven't already.

dang24d ago

We've put https://news.ycombinator.com/item?id=47205208 in the second-chance pool (https://news.ycombinator.com/pool, explained at https://news.ycombinator.com/item?id=26998308), so it will get a random placement on HN's front page.

spinningslate24d ago

That’s beautifully done, thanks for posting. As helpful again to an ML novice like me as Karpathy’s original.

hei-lima24d ago

Great!

evntdrvn24d ago

really nice, thanks

la_fayette24d ago

This guy is so amazing! With his video and the code base I really have the feeling I understand gradient descent, back propagation, chain rule etc. Reading math only just confuses me, together with the code it makes it so clear! It feels like a lifetime achievement for me :-)

mentos24d ago

Curious if you could try to explain it. It’s my goal to sit down with it and attempt to understand it intuitively.

Karpathy says if you want to truly understand something then you also have to attempt to teach it to someone else ha

la_fayette24d ago

Yes, that’s true! That could be my next step… though I have to admit, writing this in a HN comment feels like a bit of a challenge.

etothet24d ago

Even if you have some basic understanding of how LLMs work, I highly recommend Karpathy’s intro to LLMs videos on YouTube.

- https://m.youtube.com/watch?v=7xTGNNLPyMI - https://m.youtube.com/watch?v=EWvNQjAaOHw

arvid-lind24d ago

thanks for the recommendations. it seems like i keep coming back to the basics of how i interact with LLMs and how they work to learn the new stuff. every time i think i understand, someone else explaining their approach usually makes me think again about how it all works.

trying my best to keep up with what and how to learn and threads like this are dense with good info. feel like I need an AI helper to schedule time for my youtube queue at this point!

grey-area24d ago

Thanks, this is very very long but very good background on how production LLMs work.

rramadass25d ago

C++ version - https://github.com/Charbel199/microgpt.cpp?tab=readme-ov-fil...

Rust version - https://github.com/mplekh/rust-microgpt

almaight23d ago

https://github.com/novvoo/gogpt

znnajdla25d ago

Super useful exercise. My gut tells me that someone will soon figure out how to build micro-LLMs for specialized tasks that have real-world value, and then training LLMs won’t just be for billion dollar companies. Imagine, for example, a hyper-focused model for a specific programming framework (e.g. Laravel, Django, NextJS) trained only on open-source repositories and documentation and carefully optimized with a specialized harness for one task only: writing code for that framework (perhaps in tandem with a commodity frontier model). Could a single programmer or a small team on a household budget afford to train a model that works better/faster than OpenAI/Anthropic/DeepSeek for specialized tasks? My gut tells me this is possible; and I have a feeling that this will become mainstream, and then custom model training becomes the new “software development”.

allovertheworld24d ago

It just doesn’t work that way, LLMs need to be generalised a lot to be useful even in specific tasks.

It really is the antithesis to the human brain, where it rewards specific knowledge

rapnie24d ago

Yesterday an interesting video was posted "Is AI Hiding Its Full Power?", interviewing professor emeritus and nobel laureate Geoffrey Hinton, with some great explanations for the non-LLM experts. Some remarkable and mindblowing observations in there. Like saying that AI's hallucinate is incorrect language, and we should use "confabulation" instead, same as people do too. And that AI agents once they are launched develop a strong survivability drive, and do not want to be switched off. Stuff like that. Recommended watch.

Here the explanation was that while LLM's thinking has similarities to how humans think, they use an opposite approach. Where humans have enormous amount of neurons, they have only few experiences to train them. And for AI that is the complete opposite, and they store incredible amounts of information in a relatively small set of neurons training on the vast experiences from the data sets of human creative work.

[0] https://www.youtube.com/watch?v=l6ZcFa8pybE

3 more replies

jeremyjh24d ago

> It just doesn’t work that way, LLMs need to be generalised a lot to be useful even in specific tasks.

This is the entire breakthrough of deep learning on which the last two decades of productive AI research is based. Massive amounts of data are needed to generalize and prevent over-fitting. GP is suggesting an entirely new research paradigm will win out - as if researchers have not yet thought of "use less data".

> It really is the antithesis to the human brain, where it rewards specific knowledge

No, its completely analogous. The human brain has vast amounts of pre-training before it starts to learn knowledge specific to any kind of career or discipline, and this fact to me intuitively suggests why GP is baked: You cannot learn general concepts such as the english language, reasoning, computing, network communication, programming, relational data from a tiny dataset consisting only of code and documentation for one open-source framework and language.

It is all built on a massive tower of other concepts that must be understood first, including ones much more basic than the examples I mentioned but that are practically invisible to us because they have always been present as far back as our first memories can reach.

1 more reply

avaer24d ago

The human brain rewards specific knowledge because it's already pre-trained by evolution to have the basics.

You'd need a lot of data to train an ocean soup to think like a human too.

It's not really the antithesis to the human brain if you think of starting with an existing brain as starting with an existing GPT.

rytill24d ago

Are you trying to imply that humans don’t need generalized knowledge, or that we’re not “rewarded” for having highly generalized knowledge?

If so, good luck walking to your kitchen this morning, knowing how to breathe, etc.

1 more reply

teleforce25d ago

This is possible but not for training but fine-tuning the existing open source models.

This can be mainstream, and then custom model fine-tuning becomes the new “software development”.

Please check out this new fine-tuning method for LLM by MIT and ETH Zurich teams that used a single NVIDIA H200 GPU [1], [2], [3].

Full fine-tuning of the entire model’s parameters were performed based on the Hugging Face TRL library.

[1] MIT's new fine-tuning method lets LLMs learn new skills without losing old ones (news):

https://venturebeat.com/orchestration/mits-new-fine-tuning-m...

[2] Self-Distillation Enables Continual Learning (paper):

https://arxiv.org/abs/2601.19897

[3] Self-Distillation Enables Continual Learning (code):

https://self-distillation.github.io/SDFT.html

jeremyjh24d ago

Fine tuning does not make a model any smaller. It can make a smaller model more effective at a specific task, but a larger model with the same architecture fine-tuned on the same dataset will always be more capable in a domain as general as programming or software design. Of course, as architecture and related tooling improves the smallest model that is "good enough" will continue to get smaller.

ManlyBread24d ago

>someone will soon figure out how to build micro-LLMs for specialized tasks that have real-world value

You've just reinvented machine learning

willio5825d ago

Hank Green in collaboration with Cal Newport just released a video where Cal makes the argument for exactly that, that for many reasons not least being cost, smaller more targeted models will become more popular for the foreseeable future. Highly recommend this long video posted today https://youtu.be/8MLbOulrLA0

otabdeveloper425d ago

We had good small language models for decades. (E.g. BERT)

The entire point of LLMs is that you don't have to spend money training them for each specific case. You can train something like Qwen once and then use it to solve whatever classification/summarization/translation problem in minutes instead of weeks.

mootothemax24d ago

> We had good small language models for decades. (E.g. BERT)

BERT isn’t a SLM, and the original was released in 2018.

The whole new era kicked off with Attention Is All You Need; we haven’t reached even a single decade of work on it.

1 more reply

znnajdla25d ago

> The entire point of LLMs is that you don't have to spend money training them for each specific case.

I don’t agree. I would say the entire point of LLMs is to be able to solve a certain class of non-deterministic problems that cannot be solved with deterministic procedural code. LLMs don’t need to be generally useful in order to be useful for specific business use cases. I as a programmer would be very happy to have a local coding agent like Claude Code that can do nothing but write code in my chosen programming language or framework, instead of using a general model like Opus, if it could be hyper-specialized and optimized for that one task, so that it is small enough to run on my MacBook. I don’t need the other general reasoning capabilities of Opus.

2 more replies

ghm219924d ago

Economics of producing goods(software code) would dictate that the world would settle to a new price per net new "unit" of code and the production pipeline(some wierd unrecognizable LLM/Human combination) to go with it. The price can go to near zero since software pipeline could be just AI and engineers would be bought in as needed(right now AI is introduced as needed and humans still build a bulk of the system). This would actually mean software engineering does not exist as u know it today, it would become a lot more like a vocation with a narrower defied training/skill needed than now. It would be more like how a plumber operates: he comes and fixes things once in a while a needed. He actually does not understand fluid dynamics and structural engineering. the building runs on auto 99% of the time.

Put it another way: Do you think people will demand masses of _new_ code just because it becomes cheap? I don't think so. It's just not clear what this would mean even 1-3 years from now for software engineering.

This round of LLM driven optimizations is really and purely about building a monopoly on _labor replacement_ (anthropic and openai's code and cowork tools) until there is clear evidence to the contrary: A Jevon's paradoxian massive demand explosion. I don't see that happening for software. If it were true — maybe it will still take a few quarters longer — SaaS companies stocks would go through the roof(i mean they are already tooling up as we speak, SAP is not gonna jus sit on its ass and wait for a garage shop to eat their lunch).

asim24d ago

This is my gut feeling also. I forked the project and got Claude to rewrite it in Go as a form of exploration. For a long time I've felt smaller useful models could exist and they could also be interconnected and routed via something else if needed but also provide streaming for real time training or evolution. The large scale stuff will be dominated by the huge companies but the "micro" side could be just as valuable.

killerstorm24d ago

You're missing the point.

Karpathy has other projects, e.g. : https://github.com/karpathy/nanochat

You can train a model with GPT-2 level of capability for $20-$100.

But, guess what, that's exactly what thousands of AI researchers have been doing for the past 5+ years. They've been training smallish models. And while these smallish models might be good for classification and whatnot, people strongly prefer big-ass frontier models for code generation.

the_arun25d ago

If we can run them on commodity hardware with cpus, nothing like it

npn25d ago

what gut? we are already doing that. there are a lot of "tiny" LLMs that are useful: M$ Phi-4, Gemma 3/3n, Qwen 7B... There are even smaller models like Gemma 270M that is fine tuned for function calls.

they are not flourish yet because of the simple reason: the frontier models are still improving. currently it is better to use frontier models than training/fine-tuning one by our own because by the time we complete the model the world is already moving forward.

heck even distillation is a waste of time and money because newer frontier models yield better outputs.

you can expect that the landscape will change drastically in the next few years when the proprietary frontier models stop having huge improvements every version upgrade.

znnajdla25d ago

I’ve tried those tiny LLMs and they don’t seem useful to me for real world tasks. They are toys for super simple autocomplete.

2 more replies

maipen24d ago

That would only produce a model that you can ask questions to.

freakynit25d ago

Is there something similar for diffusion models? By the way, this is incredibly useful for learning in depth the core of LLM's.

fulafel25d ago

This could make an interesting language shootout benchmark.

hrmtst9383725d ago

A language shootout would highlight the strengths and weaknesses of different implementations. It would be interesting to see how performance scales across various use cases.

0xbadcafebee25d ago

Since this post is about art, I'll embed here my favorite LLM art: the IOCCC 2024 prize winner in bot talk, from Adrian Cable (https://www.ioccc.org/2024/cable1/index.html), minus the stdlib headers:

  #define a(_)typedef _##t
  #define _(_)_##printf
  #define x f(i,
  #define N f(k,
  #define u _Pragma("omp parallel for")f(h,
  #define f(u,n)for(I u=0;u<(n);u++)
  #define g(u,s)x s%11%5)N s/6&33)k[u[i]]=(t){(C*)A,A+s*D/4},A+=1088*s;
  
  a(int8_)C;a(in)I;a(floa)F;a(struc){C*c;F*f;}t;enum{Z=32,W=64,E=2*W,D=Z*E,H=86*E,V='}\0'};C*P[V],X[H],Y[D],y[H];a(F
  _)[V];I*_=U" 炾ોİ䃃璱ᝓ၎瓓甧染ɐఛ瓁",U,s,p,f,R,z,$,B[D],open();F*A,*G[2],*T,w,b,c;a()Q[D];_t r,L,J,O[Z],l,a,K,v,k;Q
  m,e[4],d[3],n;I j(I e,F*o,I p,F*v,t*X){w=1e-5;x c=e^V?D:0)w+=r[i]*r[i]/D;x c)o[i]=r[i]/sqrt(w)*i[A+e*D];N $){x
  W)l[k]=w=fmax(fabs(o[i])/~-E,i?w:0);x W)y[i+k*W]=*o++/w;}u p)x $){I _=0,t=h*$+i;N W)_+=X->c[t*W+k]*y[i*W+k];v[h]=
  _*X->f[t]*l[i]+!!i*v[h];}x D-c)i[r]+=v[i];}I main(){A=mmap(0,8e9,1,2,f=open(M,f),0);x 2)~f?i[G]=malloc(3e9):exit(
  puts(M" not found"));x V)i[P]=(C*)A+4,A+=(I)*A;g(&m,V)g(&n,V)g(e,D)g(d,H)for(C*o;;s>=D?$=s=0:p<U||_()("%s",$[P]))if(!
  (*_?$=*++_:0)){if($<3&&p>=U)for(_()("\n\n> "),0<scanf("%[^\n]%*c",Y)?U=*B=1:exit(0),p=_(s)(o=X,"[INST] %s%s [/INST]",s?
  "":"<<SYS>>\n"S"\n<</SYS>>\n\n",Y);z=p-=z;U++[o+=z,B]=f)for(f=0;!f;z-=!f)for(f=V;--f&&f[P][z]|memcmp(f[P],o,z););p<U?
  $=B[p++]:fflush(0);x D)R=$*D+i,r[i]=m->c[R]*m->f[R/W];R=s++;N Z){f=k*D*D,$=W;x 3)j(k,L,D,i?G[~-i]+f+R*D:v,e[i]+k);N
  2)x D)b=sin(w=R/exp(i%E/14.)),c=1[w=cos(w),T=i+++(k?v:*G+f+R*D)],T[1]=b**T+c*w,*T=w**T-c*b;u Z){F*T=O[h],w=0;I A=h*E;x
  s){N E)i[k[L+A]=0,T]+=k[v+A]*k[i*D+*G+A+f]/11;w+=T[i]=exp(T[i]);}x s)N E)k[L+A]+=(T[i]/=k?1:w)*k[i*D+G[1]+A+f];}j(V,L
  ,D,J,e[3]+k);x 2)j(k+Z,L,H,i?K:a,d[i]+k);x H)a[i]*=K[i]/(exp(-a[i])+1);j(V,a,D,L,d[$=H/$,2]+k);}w=j($=W,r,V,k,n);x
  V)w=k[i]>w?k[$=i]:w;}}

dwroberts24d ago

I enjoyed the footnote on their entry, where they link to ChatGPT confidently asserting that it was impossible for such an LLM to exist

> You're about as close to writing this in 1800 characters of C as you are to launching a rocket to Mars with a paperclip and a match.

thatxliner25d ago

wiat what does this do?

aix125d ago

As the contest entry page explains:

> ChatIOCCC is the world’s smallest LLM (large language model) inference engine - a “generative AI chatbot” in plain-speak. ChatIOCCC runs a modern open-source model (Meta’s LLaMA 2 with 7 billion parameters) and has a good knowledge of the world, can understand and speak multiple languages, write code, and many other things. Aside from the model weights, it has no external dependencies and will run on any 64-bit platform with enough RAM.

(Model weights need to be downloaded using an enclosed shell script.)

https://www.ioccc.org/2024/cable1/index.html

1 more reply

mr_toad24d ago

Without the weights, nothing (or anything, given arbitrary weights).

ruszki24d ago

> [p for mat in state_dict.values() for row in mat for p in row]

I'm so happy without seeing Python list comprehensions nowadays.

I don't know why they couldn't go with something like this:

[state_dict.values() for mat for row for p]

or in more difficult cases

[state_dict.values() for mat to mat*2 for row for p to p/2]

I know, I know, different times, but still.

WithinReason24d ago

I would have gone for:

[for p in row in mat in state_dict.values()]

ruszki24d ago

That’s also an option. The left to right flow is better for the sake of autocomplete and comprehension: when you start to read your right to left version, you don’t know what is p, then row, then mat. With left to right, this problem doesn’t exist.

One for sure, both are superior to the garbled mess of Python’s.

Of course if the programming language would be in a right to left natural language, then these are reversed.

ThrowawayTestr25d ago

This is like those websites that implement an entire retro console in the browser.

colonCapitalDee25d ago

Beautiful work

astroanax24d ago

I feel its wrong to call it microgpt, since its smaller than nanogpt, so maybe picogpt would have been a better name? nice project tho

MattyRad25d ago

Hoenikker had been experimenting with melting and re-freezing ice-nine in the kitchen of his Cape Cod home.

Beautiful, perhaps like ice-nine is beautiful.

meta-level17d ago

In case you intend to play around with microgpt: use Python 3.14! I found it to be 3x faster compared to 3.13 or 3.12..

easygenes23d ago

Inspiring. Definitely got nerd sniped by this. Now you can train it in under a second on one CPU core with no dependencies: https://github.com/Entrpi/eemicrogpt

Detailed optimizing journey in the readme too.

lynx9723d ago

Question: Can this be modified to score a "document"? I'd basically like to pass it a name, and get a score (0..1) on how realistic themodel "thinks" the document is? This would be extremely helpful for a project of mine.

vadimf24d ago

I’m 100% sure the future consists of many models running on device. LLMs will be the mobile apps of the future (or a different architecture, but still intelligence).

ajnin24d ago

The future right now looks more like everything in remote datacenters, no autonomous capabilities and no control by the user. But I like yours better.

latchkey24d ago

I don't mind the remote datacenters, I just don't like the lack of control.

pizzafeelsright24d ago

This is the path forward, with some overhead.

1. Generic model that calls other highly specific, smaller, faster models. 2. Models loaded on demand, some black box and some open. 3. There will be a Rust model specifically for Rust (or whatever language) tasks.

In about 5-8 years we will have personalized models based upon all our previous social/medical/financial data that will respond as we would, a clone, capable of making decisions similar with direction of desired outcomes.

The big remaining blocker is that generic model that can be imprinted with specifics and rebuilt nightly. Excluding the training material but the decision making, recall, and evaluation model. I am curious if someone is working on that extracted portion that can be just a 'thinking' interface.

coldtea24d ago

If anything, memory ain't getting cheaper, disks aren't either, and as for graphics cards, forget it.

People wont be competing with even a current 2026 SOTA from their home LLM nowhere soon. Even actual SOTA LLM providers are not competing either - they're losing money on energy and costs, hopping to make it up on market capture and win the IPO races.

OtherShrezzing24d ago

I don’t think anyone needs to compete with the LLM SOTA to get the benefits of these technologies on-device.

Consumers don’t need a 100k context window oracle that knows everything about both T-Cells and the ancient Welsh Royal lineage. We need focused & small models which are specialised, and then we need a good query router.

1 more reply

sieste24d ago

The typos are interesting ("vocavulary", "inmput") - One of the godfathers of LLMs clearly does not use an LLM to improve his writing, and he doesn't even bother to use a simple spell checker.

shepherdjerred24d ago

> Write me an AI blog post

$ Sure, here's a blog post called "Microgpt"!

> "add in a few spelling/grammar mistakes so they think I wrote it"

$ Okay, made two errors for you!

meltyness24d ago

  vocabulary*

  *In the code above, we collect all unique characters across the dataset

coolThingsFirst25d ago

Incredibly fascinating. One thing is that it seems still very conceptual. What id be curious about how good of a micro llm we can train say with 12 hours of training on macbook.

smj-edison24d ago

Somewhat unrelated, but the generated names are surprisingly good! They're certainly more sane then appending -eigh to make a unique name.

WithinReason24d ago

Previously:

https://news.ycombinator.com/item?id=47000263

jonjacky24d ago

I wonder if such a small GPT exhibits plagiarism. Are some of the generated names the same as names in the input data?

borplk24d ago

Can anyone mention how you can "save the state" so it doesn't have to train from scratch on every run?

retube24d ago

Can you train this on say Wikipedia and have it generate semi-sensible responses?

krisoft24d ago

No. But there are a few layers to that.

First no is that the model as is has too few parameters for that. You could train it on the wikipedia but it wouldn’t do much of any good.

But what if you increase the number of parameters? Then you get to the second layer of “no”. The code as is is too naive to train a realistic size LLM for that task in realistic timeframes. As is it would be too slow.

But what if you increase the number of parameters and improve the performance of the code? I would argue that would by that point not be “this” but something entirely different. But even then the answer is still no. If you run that new code with increased parameters and improved efficiencly and train it on wikipedia you would still not get a model which “generate semi-sensible responses”. For the simple reason that the code as is only does the pre-training. Without the RLHF step the model would not be “responding”. It would just be completing the document. So for example if you ask it “How long is a bus?” it wouldn’t know it is supposed to answer your question. What exactly happens is kinda up to randomness. It might output a wikipedia like text about transportation, or it might output a list of questions similar to yours, or it might output broken markup garbage. Quite simply without this finishing step the base model doesn’t know that it is supposed to answer your question and it is supposed to follow your instructions. That is why this last step is called “instruction tuning” sometimes. Because it teaches the model to follow instructions.

But if you would increase the parameter count, improve the efficiency, train it on wikipedia, then do the instruction tuning (wich involves curating a database of instruction - response pairs) then yes. After that it would generate semi-sensible responses. But as you can see it would take quite a lot more work and would stretch the definition of “this”.

It is a bit like asking if my car could compete in formula-1. The answer is yes, but first we need to replace all parts of it with different parts, and also add a few new parts. To the point where you might question if it is the same car at all.

nebben6424d ago

Very useful breakdown; thank you!

OJFord24d ago

If you increase all the numbers (including, as a result, the time to train).

geon24d ago

That’s exactly what chatgpt etc are.

stuckkeys24d ago

That web interface that someone commented in your github was flawless.

jimbokun25d ago

It’s pretty staggering that a core algorithm simple enough to be expressed in 200 lines of Python can apparently be scaled up to achieve AGI.

Yes with some extra tricks and tweaks. But the core ideas are all here.

darkpicnic25d ago

LLMs won’t lead to AGI. Almost by definition, they can’t. The thought experiment I use constantly to explain this:

Train an LLM on all human knowledge up to 1905 and see if it comes up with General Relativity. It won’t.

We’ll need additional breakthroughs in AI.

foxglacier24d ago

That's an assertion, not a thought experiment. You can't logically reach the conclusion ("It won't") by thinking about it. But it doesn't sound so grand if you say "The assertion I use constantly to explain this".

1 more reply

johnmaguire25d ago

I'm not sure - with tool calling, AI can both fetch and create new context.

1 more reply

joefourier24d ago

When did AGI start meaning ASI?

LLMs are artificial general intelligence, as per the Wikipedia definition:

> generalise knowledge, transfer skills between domains, and solve novel problems without task‑specific reprogramming

Even GPT-3 could meet that bar.

1 more reply

tehjoker25d ago

Part of the issue there is that the data quantity prior to 1905 is a small drop in the bucket compared to the internet era even though the logical rigor is up to par.

2 more replies

TiredOfLife24d ago

> Train an LLM on all human knowledge up to 1905 and see if it comes up with General Relativity. It won’t.

Same thing is true for humans.

1 more reply

canjobear24d ago

It's not obvious why it wouldn't, especially if it gets to read Poincaré and Riemann.

xdennis25d ago

> Train an LLM on all human knowledge up to 1905 and see if it comes up with General Relativity. It won’t.

AGI just means human level intelligence. I couldn't come up with General Relativity. That doesn't mean I don't have general intelligence.

I don't understand why people are moving the goalposts.

3 more replies

crazy5sheep25d ago

The 1905 thought experiment actually cuts both ways. Did humans "invent" the airplane? We watched birds fly for thousands of years — that's training data. The Wright brothers didn't conjure flight from pure reasoning, they synthesized patterns from nature, prior failed attempts, and physics they'd absorbed. Show me any human invention and I'll show you the training data behind it.

Take the wheel. Even that wasn't invented from nothing — rolling logs, round stones, the shape of the sun. The "invention" was recognizing a pattern already present in the physical world and abstracting it. Still training data, just physical and sensory rather than textual.

And that's actually the most honest critique of current LLMs — not that they're architecturally incapable, but that they're missing a data modality. Humans have embodied training data. You don't just read about gravity, you've felt it your whole life. You don't just know fire is hot, you've been near one. That physical grounding gives human cognition a richness that pure text can't fully capture — yet.

Einstein is the same story. He stood on Faraday, Maxwell, Lorentz, and Riemann. General Relativity was an extraordinary synthesis — not a creation from void. If that's the bar for "real" intelligence, most humans don't clear it either. The uncomfortable truth is that human cognition and LLMs aren't categorically different. Everything you've ever "thought" comes from what you've seen, heard, and experienced. That's training data. The brain is a pattern-recognition and synthesis machine, and the attention mechanism in transformers is arguably our best computational model of how associative reasoning actually works.

So the question isn't whether LLMs can invent from nothing — nothing does that, not even us.

Are there still gaps? Sure. Data quality, training methods, physical grounding — these are real problems. But they're engineering problems, not fundamental walls. And we're already moving in that direction — robots learning from physical interaction, multimodal models connecting vision and language, reinforcement learning from real-world feedback. The brain didn't get smart because it has some magic ingredient. It got smart because it had millions of years of rich, embodied, high-stakes training data. We're just earlier in that journey with AI. The foundation is already there — AGI isn't a question of if anymore, it's a question of execution.

2 more replies

kilroy12324d ago

I strongly suspect we're like 4 more elegant algorithms away from a real AGI.

wasabi99101125d ago

1000 lines??

What is going on in this thread

jimbokun25d ago

Ok 200 lines.

Don’t know how I ended up typing 1000.

1 more reply

ViktorRay25d ago

It’s pretty sad.

The only way we know these comments are from AI bots for now is due to the obvious hallucinations.

What happens when the AI improves even more…will HN be filled with bots talking to other bots?

3 more replies

ksherlock25d ago

It's a honey pot for low quality llm slop.

anonym2925d ago

Wow, you're so right, jimbokun! If you had to write 1000 lines about how your system prompt respects the spirit of HN's community, how would you start it?

chenster24d ago

The best ML learning for dummies.

spopejoy23d ago

Sorry to RFELI5 but but ... I thought a "token" was a word? The example is of names and the output is new improvised names, implying that a character is a token? Or do all LLMs operate at character level?

Also is there some minima of training data? E.g. if you just trained on "True" "False" I assume it would be .5 Bernoulli? What is the minimum to see "interesting" results I guess.

bytesandbits24d ago

sensei karpathy has done it again

hmcamp23d ago

Cool

dhruv300625d ago

Karapthy with another gem !

1 more reply

geon24d ago

Is there a similarly simple implementation with tensorflow?

I tried building a tiny model last weekend, but it was very difficult to find any articles that weren’t broken ai slop.

joefourier24d ago

Tensorflow is largely dead, it’s been years since I’ve seen a new repo use it. Go with Jax if you want a PyTorch alternative that can have better performance for certain scenarios.

nickpsecurity24d ago

Also, TPU support. Hardware diversity.

geon24d ago

Any recommendations for Typescript?

hoppp24d ago

Was that code generated by claude?

huqedato24d ago

Looking for alternative in Julia.

mold_aid24d ago

"art" project?

shevy-java24d ago

Microslop is alive!

ViktorRay25d ago

Which license is being used for this?

dilap25d ago

MIT (https://gist.github.com/karpathy/8627fe009c40f57531cb1836010...)

ViktorRay25d ago

Thank you

kelvinjps1025d ago

Why there is multiple comments talking about 1000 c lines, bots?

the_af25d ago

Or even 1000 python lines, also wrong.

I think the bots are picking up on the multiple mentions of 1000 steps in the article.

thatxliner25d ago

btw my friend is asking if your username is a "Klara and the Sun" reference

1 more reply

tithos25d ago

What is the prime use case

antonvs25d ago

To confuse people who only think in terms of use cases.

Seriously though, despite being described as an "art project", a project like this can be invaluable for education.

bourjwahwah25d ago

Even art serves a use case of giving the artist a means of expression.

Use case does not need to be technical.

hrmtst9383724d ago

Education often hinges on breaking down complex ideas into digestible chunks, and projects like this can spark creativity and critical thinking. What may seem whimsical can lead to deeper discussions about AI's role and limitations.

keyle25d ago

it's a great learning tool and it shows it can be done concisely.

geerlingguy25d ago

Looks like to learn how a GPT operates, with a real example.

foodevl25d ago

Yeah, everyone learns differently, but for me this is a perfect way to better understand how GPTs work.

inerte25d ago

Kaparthy to tell you things you thought were hard in fact fit in a screen.

jackblemming25d ago

Case study to whenever a new copy of Programming Pearls is released.

aaronblohowiak25d ago

“Art project”

pixelatedindex25d ago

If writing is art, then I’ve been amazed at the source code written by this legend

with25d ago

"everything else is just efficiency" is a nice line but the efficiency is the hard part. the core of a search engine is also trivial, rank documents by relevance. google's moat was making it work at scale. same applies here.

lukan25d ago

Sure, but understanding the core concepts are essential to make things efficient and as far as I understand, this has mainly educational purposes ( it does not even run on a GPU).

with25d ago

yep, agreed. wasn’t knocking the project at all, it’s great for exactly that purpose

geon24d ago

I think the hard part is improving on the basic concept.

The current top of the line models are extremely overfitted and produce so much nonsense they are useless for anything but the most simple tasks.

This architecture was an interesting experiment, but is not the future.

profsummergig25d ago

If anyone knows of a way to use this code on a consumer grade laptop to train on a small corpus (in less than a week), and then demonstrate inference (hallucinations are okay), please share how.

simsla25d ago

The blog post literally explains how to do so.

hrmtst9383724d ago

It's true, the post lays out the details clearly, but a hands-on example can often make the concepts more tangible. Seeing it in action helps solidify understanding.

hrmtst9383724d ago

The post lays out the steps clearly, but implementing them often reveals unexpected challenges. It's usually more complicated in practice than it appears on paper.

1 more reply

hrmtst9383724d ago

If the implementation details are clear, replicating the setup can be worthwhile. Sometimes seeing it in action helps to better understand the nuances.

j / k navigate · click thread line to collapse

329 comments

teleforce25d ago

Someone has modified microgpt to build a tiny GPT that generates Korean first names, and created a web page that visualizes the entire process [1].

Users can interactively explore the microgpt pipeline end to end, from tokenization until inference.

[1] English GPT lab:

https://ko-microgpt.vercel.app/

camkego24d ago

Pretty nifty. Even if you are not interested in the Korean language

sprobertson24d ago

This kind of thing is pretty easy to do with a much leaner model https://docs.pytorch.org/tutorials/intermediate/char_rnn_gen...

arijun23d ago

I assume the goal isn't to generate Korean names but to learn GPTs.

1 more reply

love2read24d ago

By "modified" this person of course means that they swapped out the list of X0,000 names from English to Korean names. That is seemingly the only change.

The attached website is a fully ai-generated "visualization" based on the original blog post with little added.

big-chungus424d ago

It's a good website and probably AI generated with some insane expensive model that us mere mortals are too poor to afford, thus it has a value

sahildeepreel23d ago

so impressive!

verma725d ago

I wrote a C++ translation of it: https://github.com/verma7/microgpt/blob/main/microgpt.cc

2x the number of lines of code (~400L), 10x the speed

The hard part was figuring out how to represent the Value class in C++ (ended up using shared_ptrs).

WithinReason24d ago

I made an explicit reverse pass (no autodiff), it was 8x faster in Python

hu324d ago

I made an explicit double-reverse pass (no code!), it was 80x faster in my head!

2 more replies

bear3r24d ago

love2read24d ago

Can you share a link?

1 more reply

freakynit20d ago

24x speedup (over 10x already) and similar loss profile (for c++ version, optimized by claude): https://gist.github.com/freakynit/3982eab8413a89941bd0018e63......

verma718d ago

This is amazing! Thanks for optimizing the code using Claude!

red_hare25d ago

This is beautiful and highly readable but, still, I yearn for a detailed line-by-line explainer like the backbone.js source: https://backbonejs.org/docs/backbone.html

tomjakubowski24d ago

I believe that Backbone's annotated source is generated with Docco, another project from the creator of CoffeeScript.

https://ashkenas.com/docco/

It's really neat. I wish I published more of my code this way.

ashish0125d ago

That is really beautiful literate program. Seeing it after a long time. Here is a opus generate version of this code - https://ashish01.github.io/microgpt.html

subset25d ago

Andrej Karpathy has a walkthrough blog post here: https://karpathy.github.io/2026/02/12/microgpt/

OJFord24d ago

That is the article being discussed.

1 more reply

altcognito25d ago

ask a high end LLM to do it

subset25d ago

amelius24d ago

Storing the partial derivatives into the weights structure is quite the hack, to be honest. But everybody seems to do it like that.

hei-lima24d ago

Great work! Might do it too in some other language...

thomasmg24d ago

I got a convertion to Java. It worked (at least I think...) in the first try.

Then I want to convert this to my own programming language (which traspiles to C). I like those tiny projects very much!

1 more reply

pmarreck24d ago

Zig, here.

Anything but Python

1 more reply

jtesp23d ago

how did you do the transliteration/port?

subset21d ago

Handwritten! (aka no LLM assistance :) It wasn't transpiled or anything like that. I've been meaning to post a little about it on my blog; just been caught up with other stuff atm.

1 more reply

geokon24d ago

andy12_24d ago

[1] https://arxiv.org/pdf/2303.08774 Figure 8

[2] https://arxiv.org/pdf/2511.04869 Figure 1.

geokon24d ago

You could color code the output token so you can see some abrupt changes

It seems kind of obvious, so I'm guessing people have tried this

1 more reply

chongli24d ago

Having a confidence score isn't as useful as it seems unless you (the user) know a lot about the contents of the training set.

radarsat124d ago

I mean I sort of understand what you're trying to say but in fact a great deal of knowledge we get about the world we live in, we get second hand.

1 more reply

DavidSJ24d ago

Yes, the actual LLM returns a probability distribution, which gets sampled to produce output tokens.

mr_toad24d ago

It’s often very difficult (intractable) to come up with a probability distribution of an estimator, even when the probability distribution of the data is known.

Basically, you’d need a lot more computing power to come up with a distribution of the output of an LLM than to come up with a single answer.

podnami24d ago

What happens before the probability distribution? I’m assuming say alignment or other factors would influence it?

1 more reply

Lionga24d ago

The LLM has an internal "confidence score" but that has NOTHING to do with how correct the answer is, only with how often the same words came together in training data.

E.g. getting two r's in strawberry could very well have a very high "confidence score" while a random but rare correct fact might have a very well a very low one.

In short: LLM have no concept, or even desire to produce of truth

sharperguy24d ago

alexwebb224d ago

Huge leap there in your conclusion. Looks like you’re hand-waving away the entire phenomenon of emergent properties.

amelius24d ago

> In short: LLM have no concept, or even desire to produce of truth

They do produce true statements most of the time, though.

1 more reply

Otterly9923d ago

There is this paper that proposed data compression as a way to judge the ability of a LLM to "understand" things correctly, training on older texts and trying to predict more recent articles:

https://ar5iv.labs.arxiv.org/html//2402.00861

podnami24d ago

I would assume this is from case to case, such as:

- How aligned has it been to “know” that something is true (eg ethical constraints)

- Statistical significance and just being able to corroborate one alternative in Its training data more strongly than another

- If it’s a web search related query, is the statement from original sources vs synthesised from say third party sources

But I’m just a layman and could be totally off here.

jorvi24d ago

> I'm not really sure, but maybe this XXX

You never see this in the response but you do in the reasoning.

danlitt24d ago

Can it generate one? Sure. But it won't mean anything, since you don't know (and nobody knows) the "true" distribution.

kuberwastaken25d ago

I'm half shocked this wasn't on HN before? Haha I built PicoGPT as a minified fork with <35 lines of JS and another in python

And it's small enough to run from a QR code :) https://kuber.studio/picogpt/

You can quite literally train a micro LLM from your phone's browser

dang24d ago

Wow I agree - surprising that it took 2 weeks to make HN's frontpage.

3abiton24d ago

I think he caught some flack for promoting claudebot at that time, and giving it a rave review. Some people are hardliner. His work has always been amazing nonetheless.

1 more reply

cootsnuck25d ago

It was: https://news.ycombinator.com/item?id=47000263

iberator25d ago

[flagged]

dang24d ago

Please don't be a jerk on HN, and especially not when responding to someone's work. This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.

lelandfe24d ago

https://github.com/Kuberwastaken/picogpt/blob/main/picogpt.j...

kuberwastaken24d ago

lol there is source code as a gist

growingswe25d ago

Great stuff! I wrote an interactive blogpost that walks through the code and visualizes it: https://growingswe.com/blog/microgpt

O4epegb24d ago

> By the end of training, the model produces names like "kamon", "karai", "anna", and "anton". None of them are copies from the dataset.

All 4 are in the dataset, btw

mym199024d ago

joenot44324d ago

This is awesome! Normally I'm pretty critical of LLM-assisted-blogging, but this one's a real winner.

evntdrvn24d ago

You should totally submit that to HN as an article, if you haven't already.

dang24d ago

spinningslate24d ago

That’s beautifully done, thanks for posting. As helpful again to an ML novice like me as Karpathy’s original.

hei-lima24d ago

Great!

evntdrvn24d ago

really nice, thanks

la_fayette24d ago

mentos24d ago

Curious if you could try to explain it. It’s my goal to sit down with it and attempt to understand it intuitively.

Karpathy says if you want to truly understand something then you also have to attempt to teach it to someone else ha

la_fayette24d ago

Yes, that’s true! That could be my next step… though I have to admit, writing this in a HN comment feels like a bit of a challenge.

etothet24d ago

Even if you have some basic understanding of how LLMs work, I highly recommend Karpathy’s intro to LLMs videos on YouTube.

- https://m.youtube.com/watch?v=7xTGNNLPyMI - https://m.youtube.com/watch?v=EWvNQjAaOHw

arvid-lind24d ago

trying my best to keep up with what and how to learn and threads like this are dense with good info. feel like I need an AI helper to schedule time for my youtube queue at this point!

grey-area24d ago

Thanks, this is very very long but very good background on how production LLMs work.

rramadass25d ago

C++ version - https://github.com/Charbel199/microgpt.cpp?tab=readme-ov-fil...

Rust version - https://github.com/mplekh/rust-microgpt

almaight23d ago

https://github.com/novvoo/gogpt

znnajdla25d ago

allovertheworld24d ago

It just doesn’t work that way, LLMs need to be generalised a lot to be useful even in specific tasks.

It really is the antithesis to the human brain, where it rewards specific knowledge

rapnie24d ago

[0] https://www.youtube.com/watch?v=l6ZcFa8pybE

3 more replies

jeremyjh24d ago

> It just doesn’t work that way, LLMs need to be generalised a lot to be useful even in specific tasks.

> It really is the antithesis to the human brain, where it rewards specific knowledge

1 more reply

avaer24d ago

The human brain rewards specific knowledge because it's already pre-trained by evolution to have the basics.

You'd need a lot of data to train an ocean soup to think like a human too.

It's not really the antithesis to the human brain if you think of starting with an existing brain as starting with an existing GPT.

rytill24d ago

Are you trying to imply that humans don’t need generalized knowledge, or that we’re not “rewarded” for having highly generalized knowledge?

If so, good luck walking to your kitchen this morning, knowing how to breathe, etc.

1 more reply

teleforce25d ago

This is possible but not for training but fine-tuning the existing open source models.

This can be mainstream, and then custom model fine-tuning becomes the new “software development”.

Please check out this new fine-tuning method for LLM by MIT and ETH Zurich teams that used a single NVIDIA H200 GPU [1], [2], [3].

Full fine-tuning of the entire model’s parameters were performed based on the Hugging Face TRL library.

[1] MIT's new fine-tuning method lets LLMs learn new skills without losing old ones (news):

https://venturebeat.com/orchestration/mits-new-fine-tuning-m...

[2] Self-Distillation Enables Continual Learning (paper):

https://arxiv.org/abs/2601.19897

[3] Self-Distillation Enables Continual Learning (code):

https://self-distillation.github.io/SDFT.html

jeremyjh24d ago

ManlyBread24d ago

>someone will soon figure out how to build micro-LLMs for specialized tasks that have real-world value

You've just reinvented machine learning

willio5825d ago

otabdeveloper425d ago

We had good small language models for decades. (E.g. BERT)

mootothemax24d ago

> We had good small language models for decades. (E.g. BERT)

BERT isn’t a SLM, and the original was released in 2018.

The whole new era kicked off with Attention Is All You Need; we haven’t reached even a single decade of work on it.

1 more reply

znnajdla25d ago

> The entire point of LLMs is that you don't have to spend money training them for each specific case.

2 more replies

ghm219924d ago

asim24d ago

killerstorm24d ago

You're missing the point.

Karpathy has other projects, e.g. : https://github.com/karpathy/nanochat

You can train a model with GPT-2 level of capability for $20-$100.

the_arun25d ago

If we can run them on commodity hardware with cpus, nothing like it

npn25d ago

heck even distillation is a waste of time and money because newer frontier models yield better outputs.

you can expect that the landscape will change drastically in the next few years when the proprietary frontier models stop having huge improvements every version upgrade.

znnajdla25d ago

I’ve tried those tiny LLMs and they don’t seem useful to me for real world tasks. They are toys for super simple autocomplete.

2 more replies

maipen24d ago

That would only produce a model that you can ask questions to.

freakynit25d ago

Is there something similar for diffusion models? By the way, this is incredibly useful for learning in depth the core of LLM's.

fulafel25d ago

This could make an interesting language shootout benchmark.

hrmtst9383725d ago

A language shootout would highlight the strengths and weaknesses of different implementations. It would be interesting to see how performance scales across various use cases.

0xbadcafebee25d ago

Since this post is about art, I'll embed here my favorite LLM art: the IOCCC 2024 prize winner in bot talk, from Adrian Cable (https://www.ioccc.org/2024/cable1/index.html), minus the stdlib headers:

  #define a(_)typedef _##t
  #define _(_)_##printf
  #define x f(i,
  #define N f(k,
  #define u _Pragma("omp parallel for")f(h,
  #define f(u,n)for(I u=0;u<(n);u++)
  #define g(u,s)x s%11%5)N s/6&33)k[u[i]]=(t){(C*)A,A+s*D/4},A+=1088*s;
  
  a(int8_)C;a(in)I;a(floa)F;a(struc){C*c;F*f;}t;enum{Z=32,W=64,E=2*W,D=Z*E,H=86*E,V='}\0'};C*P[V],X[H],Y[D],y[H];a(F
  _)[V];I*_=U" 炾ોİ䃃璱ᝓ၎瓓甧染ɐఛ瓁",U,s,p,f,R,z,$,B[D],open();F*A,*G[2],*T,w,b,c;a()Q[D];_t r,L,J,O[Z],l,a,K,v,k;Q
  m,e[4],d[3],n;I j(I e,F*o,I p,F*v,t*X){w=1e-5;x c=e^V?D:0)w+=r[i]*r[i]/D;x c)o[i]=r[i]/sqrt(w)*i[A+e*D];N $){x
  W)l[k]=w=fmax(fabs(o[i])/~-E,i?w:0);x W)y[i+k*W]=*o++/w;}u p)x $){I _=0,t=h*$+i;N W)_+=X->c[t*W+k]*y[i*W+k];v[h]=
  _*X->f[t]*l[i]+!!i*v[h];}x D-c)i[r]+=v[i];}I main(){A=mmap(0,8e9,1,2,f=open(M,f),0);x 2)~f?i[G]=malloc(3e9):exit(
  puts(M" not found"));x V)i[P]=(C*)A+4,A+=(I)*A;g(&m,V)g(&n,V)g(e,D)g(d,H)for(C*o;;s>=D?$=s=0:p<U||_()("%s",$[P]))if(!
  (*_?$=*++_:0)){if($<3&&p>=U)for(_()("\n\n> "),0<scanf("%[^\n]%*c",Y)?U=*B=1:exit(0),p=_(s)(o=X,"[INST] %s%s [/INST]",s?
  "":"<<SYS>>\n"S"\n<</SYS>>\n\n",Y);z=p-=z;U++[o+=z,B]=f)for(f=0;!f;z-=!f)for(f=V;--f&&f[P][z]|memcmp(f[P],o,z););p<U?
  $=B[p++]:fflush(0);x D)R=$*D+i,r[i]=m->c[R]*m->f[R/W];R=s++;N Z){f=k*D*D,$=W;x 3)j(k,L,D,i?G[~-i]+f+R*D:v,e[i]+k);N
  2)x D)b=sin(w=R/exp(i%E/14.)),c=1[w=cos(w),T=i+++(k?v:*G+f+R*D)],T[1]=b**T+c*w,*T=w**T-c*b;u Z){F*T=O[h],w=0;I A=h*E;x
  s){N E)i[k[L+A]=0,T]+=k[v+A]*k[i*D+*G+A+f]/11;w+=T[i]=exp(T[i]);}x s)N E)k[L+A]+=(T[i]/=k?1:w)*k[i*D+G[1]+A+f];}j(V,L
  ,D,J,e[3]+k);x 2)j(k+Z,L,H,i?K:a,d[i]+k);x H)a[i]*=K[i]/(exp(-a[i])+1);j(V,a,D,L,d[$=H/$,2]+k);}w=j($=W,r,V,k,n);x
  V)w=k[i]>w?k[$=i]:w;}}

dwroberts24d ago

I enjoyed the footnote on their entry, where they link to ChatGPT confidently asserting that it was impossible for such an LLM to exist

> You're about as close to writing this in 1800 characters of C as you are to launching a rocket to Mars with a paperclip and a match.

thatxliner25d ago

wiat what does this do?

aix125d ago

As the contest entry page explains:

(Model weights need to be downloaded using an enclosed shell script.)

https://www.ioccc.org/2024/cable1/index.html

1 more reply

mr_toad24d ago

Without the weights, nothing (or anything, given arbitrary weights).

ruszki24d ago

> [p for mat in state_dict.values() for row in mat for p in row]

I'm so happy without seeing Python list comprehensions nowadays.

I don't know why they couldn't go with something like this:

[state_dict.values() for mat for row for p]

or in more difficult cases

[state_dict.values() for mat to mat*2 for row for p to p/2]

I know, I know, different times, but still.

WithinReason24d ago

I would have gone for:

[for p in row in mat in state_dict.values()]

ruszki24d ago

One for sure, both are superior to the garbled mess of Python’s.

Of course if the programming language would be in a right to left natural language, then these are reversed.

ThrowawayTestr25d ago

This is like those websites that implement an entire retro console in the browser.

colonCapitalDee25d ago

Beautiful work

astroanax24d ago

I feel its wrong to call it microgpt, since its smaller than nanogpt, so maybe picogpt would have been a better name? nice project tho

MattyRad25d ago

Hoenikker had been experimenting with melting and re-freezing ice-nine in the kitchen of his Cape Cod home.

Beautiful, perhaps like ice-nine is beautiful.

meta-level17d ago

In case you intend to play around with microgpt: use Python 3.14! I found it to be 3x faster compared to 3.13 or 3.12..

easygenes23d ago

Inspiring. Definitely got nerd sniped by this. Now you can train it in under a second on one CPU core with no dependencies: https://github.com/Entrpi/eemicrogpt

Detailed optimizing journey in the readme too.

lynx9723d ago

vadimf24d ago

I’m 100% sure the future consists of many models running on device. LLMs will be the mobile apps of the future (or a different architecture, but still intelligence).

ajnin24d ago

The future right now looks more like everything in remote datacenters, no autonomous capabilities and no control by the user. But I like yours better.

latchkey24d ago

I don't mind the remote datacenters, I just don't like the lack of control.

pizzafeelsright24d ago

This is the path forward, with some overhead.

coldtea24d ago

If anything, memory ain't getting cheaper, disks aren't either, and as for graphics cards, forget it.

OtherShrezzing24d ago

I don’t think anyone needs to compete with the LLM SOTA to get the benefits of these technologies on-device.

1 more reply

sieste24d ago

The typos are interesting ("vocavulary", "inmput") - One of the godfathers of LLMs clearly does not use an LLM to improve his writing, and he doesn't even bother to use a simple spell checker.

shepherdjerred24d ago

> Write me an AI blog post

$ Sure, here's a blog post called "Microgpt"!

> "add in a few spelling/grammar mistakes so they think I wrote it"

$ Okay, made two errors for you!

meltyness24d ago

  vocabulary*

  *In the code above, we collect all unique characters across the dataset

coolThingsFirst25d ago

Incredibly fascinating. One thing is that it seems still very conceptual. What id be curious about how good of a micro llm we can train say with 12 hours of training on macbook.

smj-edison24d ago

Somewhat unrelated, but the generated names are surprisingly good! They're certainly more sane then appending -eigh to make a unique name.

WithinReason24d ago

Previously:

https://news.ycombinator.com/item?id=47000263

jonjacky24d ago

I wonder if such a small GPT exhibits plagiarism. Are some of the generated names the same as names in the input data?

borplk24d ago

Can anyone mention how you can "save the state" so it doesn't have to train from scratch on every run?

retube24d ago

Can you train this on say Wikipedia and have it generate semi-sensible responses?

krisoft24d ago

No. But there are a few layers to that.

First no is that the model as is has too few parameters for that. You could train it on the wikipedia but it wouldn’t do much of any good.

nebben6424d ago

Very useful breakdown; thank you!

OJFord24d ago

If you increase all the numbers (including, as a result, the time to train).

geon24d ago

That’s exactly what chatgpt etc are.

stuckkeys24d ago

That web interface that someone commented in your github was flawless.

jimbokun25d ago

It’s pretty staggering that a core algorithm simple enough to be expressed in 200 lines of Python can apparently be scaled up to achieve AGI.

Yes with some extra tricks and tweaks. But the core ideas are all here.

darkpicnic25d ago

LLMs won’t lead to AGI. Almost by definition, they can’t. The thought experiment I use constantly to explain this:

Train an LLM on all human knowledge up to 1905 and see if it comes up with General Relativity. It won’t.

We’ll need additional breakthroughs in AI.

foxglacier24d ago

1 more reply

johnmaguire25d ago

I'm not sure - with tool calling, AI can both fetch and create new context.

1 more reply

joefourier24d ago

When did AGI start meaning ASI?

LLMs are artificial general intelligence, as per the Wikipedia definition:

> generalise knowledge, transfer skills between domains, and solve novel problems without task‑specific reprogramming

Even GPT-3 could meet that bar.

1 more reply

tehjoker25d ago

Part of the issue there is that the data quantity prior to 1905 is a small drop in the bucket compared to the internet era even though the logical rigor is up to par.

2 more replies

TiredOfLife24d ago

> Train an LLM on all human knowledge up to 1905 and see if it comes up with General Relativity. It won’t.

Same thing is true for humans.

1 more reply

canjobear24d ago

It's not obvious why it wouldn't, especially if it gets to read Poincaré and Riemann.

xdennis25d ago

> Train an LLM on all human knowledge up to 1905 and see if it comes up with General Relativity. It won’t.

AGI just means human level intelligence. I couldn't come up with General Relativity. That doesn't mean I don't have general intelligence.

I don't understand why people are moving the goalposts.

3 more replies

crazy5sheep25d ago

So the question isn't whether LLMs can invent from nothing — nothing does that, not even us.

2 more replies

kilroy12324d ago

I strongly suspect we're like 4 more elegant algorithms away from a real AGI.

wasabi99101125d ago

1000 lines??

What is going on in this thread

jimbokun25d ago

Ok 200 lines.

Don’t know how I ended up typing 1000.

1 more reply

ViktorRay25d ago

It’s pretty sad.

The only way we know these comments are from AI bots for now is due to the obvious hallucinations.

What happens when the AI improves even more…will HN be filled with bots talking to other bots?

3 more replies

ksherlock25d ago

It's a honey pot for low quality llm slop.

anonym2925d ago

Wow, you're so right, jimbokun! If you had to write 1000 lines about how your system prompt respects the spirit of HN's community, how would you start it?

chenster24d ago

The best ML learning for dummies.

spopejoy23d ago

Also is there some minima of training data? E.g. if you just trained on "True" "False" I assume it would be .5 Bernoulli? What is the minimum to see "interesting" results I guess.

bytesandbits24d ago

sensei karpathy has done it again

hmcamp23d ago

Cool

dhruv300625d ago

Karapthy with another gem !

1 more reply

geon24d ago

Is there a similarly simple implementation with tensorflow?

I tried building a tiny model last weekend, but it was very difficult to find any articles that weren’t broken ai slop.

joefourier24d ago

Tensorflow is largely dead, it’s been years since I’ve seen a new repo use it. Go with Jax if you want a PyTorch alternative that can have better performance for certain scenarios.

nickpsecurity24d ago

Also, TPU support. Hardware diversity.

geon24d ago

Any recommendations for Typescript?

hoppp24d ago

Was that code generated by claude?

huqedato24d ago

Looking for alternative in Julia.

mold_aid24d ago

"art" project?

shevy-java24d ago

Microslop is alive!

ViktorRay25d ago

Which license is being used for this?

dilap25d ago

MIT (https://gist.github.com/karpathy/8627fe009c40f57531cb1836010...)

ViktorRay25d ago

Thank you

kelvinjps1025d ago

Why there is multiple comments talking about 1000 c lines, bots?

the_af25d ago

Or even 1000 python lines, also wrong.

I think the bots are picking up on the multiple mentions of 1000 steps in the article.

thatxliner25d ago

btw my friend is asking if your username is a "Klara and the Sun" reference

1 more reply

tithos25d ago

What is the prime use case

antonvs25d ago

To confuse people who only think in terms of use cases.

Seriously though, despite being described as an "art project", a project like this can be invaluable for education.

bourjwahwah25d ago

Even art serves a use case of giving the artist a means of expression.

Use case does not need to be technical.

hrmtst9383724d ago

keyle25d ago

it's a great learning tool and it shows it can be done concisely.

geerlingguy25d ago

Looks like to learn how a GPT operates, with a real example.

foodevl25d ago

Yeah, everyone learns differently, but for me this is a perfect way to better understand how GPTs work.

inerte25d ago

Kaparthy to tell you things you thought were hard in fact fit in a screen.

jackblemming25d ago

Case study to whenever a new copy of Programming Pearls is released.

aaronblohowiak25d ago

“Art project”

pixelatedindex25d ago

If writing is art, then I’ve been amazed at the source code written by this legend

with25d ago

lukan25d ago

Sure, but understanding the core concepts are essential to make things efficient and as far as I understand, this has mainly educational purposes ( it does not even run on a GPU).

with25d ago

yep, agreed. wasn’t knocking the project at all, it’s great for exactly that purpose

geon24d ago

I think the hard part is improving on the basic concept.

The current top of the line models are extremely overfitted and produce so much nonsense they are useless for anything but the most simple tasks.

This architecture was an interesting experiment, but is not the future.

profsummergig25d ago

If anyone knows of a way to use this code on a consumer grade laptop to train on a small corpus (in less than a week), and then demonstrate inference (hallucinations are okay), please share how.

simsla25d ago

The blog post literally explains how to do so.

hrmtst9383724d ago

It's true, the post lays out the details clearly, but a hands-on example can often make the concepts more tangible. Seeing it in action helps solidify understanding.

hrmtst9383724d ago

The post lays out the steps clearly, but implementing them often reveals unexpected challenges. It's usually more complicated in practice than it appears on paper.

1 more reply

hrmtst9383724d ago

If the implementation details are clear, replicating the setup can be worthwhile. Sometimes seeing it in action helps to better understand the nuances.

j / k navigate · click thread line to collapse