With the same data augmentation / 'test time training' setting, the vanilla Transformers do pretty well, close to the "breakthrough" HRM reported. From a brief skim, this paper is using similar settings to compare itself on ARC-AGI.
I too want to believe in smaller models with excellent reasoning performance. But first understand what ARC-AGI tests for, what the general setting is -- the one commercial LLMs use to compare against each other -- and what specialised setting HRM and this paper use for evaluation.
The naming of that benchmark lends itself to hype, as we've seen in both HRM and this paper.
I think ARC-AGI was supposed to be a challenge for any model, the assumption being that you'd need the reasoning abilities of large language models to solve it. It turns out that this assumption is somewhat wrong. Do you mean that HRM and TRM are specifically trained on a small dataset of ARC-AGI samples, while LLMs are not? Or which difference exactly are you hinting at?
Yes, precisely this. The question is really: what is ARC-AGI evaluating for?
1. If the goal is to see whether models can generalise to the ARC-AGI evals, then models being evaluated on it should not be trained on the tasks -- especially if the ARC-AGI evaluations are constructed to be OOD from the ARC-AGI training data (I don't know if they are). Further, in the HRM case there seems to be usage of the few-shot examples from the evals to construct more training data. TRM may achieve something similar through its training data by other means.
2. If the goal is that even _having seen_ the training examples, and creating more training examples (after having peeked at the test set), these evaluations should still be difficult, then the ablations show that you can get pretty far without universal/recurrent Transformers.
If 1, then I think the ARC-prize organisers should have better rules laid out for the challenge. From the blog post, I do wonder how far people will push the boundary (how much can I look at the test data to 'augment' my training data?) before the organisers say "This is explicitly not allowed for this challenge."
If 2, the organisers of the challenge should have evaluated how much of a challenge it would actually have been allowing extreme 'data augmentation', and maybe realised it wasn't that much of a challenge to begin with.
I tend to agree that, given the outcome of both HRM and this paper, the ARC-AGI folks do seem to allow this setting, _and_ the task isn't as "AGI-complete" as it sets out to be.
Which is still a fun idea to play around with - this approach clearly has its strengths. But it doesn't appear to be an actual "better Transformer". I don't think it deserves nearly as much hype as it gets.
With recurrence: the idea has been around for a while -- see Universal Transformers: https://arxiv.org/abs/1807.03819
There are reasons why it hasn't really been picked up at scale, and the method tends to do well on synthetic tasks.
Anyway, with an FIR filter[1] you typically need many, many more coefficients to get cutoff performance similar to what a few IIR[2] coefficients can achieve.
You can convert an IIR to an FIR using, for example, the window design method[3]: if you use a rectangular window function, you essentially unroll the recursion but stop after some finite depth.
Similarly, it seems that unrolling the TRM yields the traditional LLM architecture of many repeated attention+FF blocks, minus the global feedback part. And unlike a true IIR, the TRM implements a finite cut-off, so in that sense it is more like a traditional FIR/LLM than the structure suggests.
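For concreteness, a minimal sketch of that IIR-to-FIR truncation (the one-pole filter, the coefficient 0.5, and the depth of 16 are all illustrative choices, not taken from the paper):

```python
import numpy as np

def iir_one_pole(x, a):
    # y[n] = x[n] + a*y[n-1]: the recursion gives an infinite impulse response.
    y = np.empty(len(x))
    acc = 0.0
    for n, xn in enumerate(x):
        acc = xn + a * acc
        y[n] = acc
    return y

def fir_truncated(x, a, depth):
    # Unroll the recursion and stop after `depth` steps: the impulse
    # response [1, a, a^2, ...] becomes a finite set of FIR coefficients.
    h = a ** np.arange(depth)
    return np.convolve(x, h)[: len(x)]

x = np.random.default_rng(0).normal(size=64)
y_iir = iir_one_pole(x, 0.5)
y_fir = fir_truncated(x, 0.5, depth=16)
err = np.max(np.abs(y_iir - y_fir))  # bounded by ~0.5**16 / (1 - 0.5)
```

Deeper truncation tracks the IIR output more closely; the analogy is the TRM's finite recursion depth versus an explicitly unrolled stack of blocks.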
So it would perhaps be interesting to compare the TRM network to a similarly unrolled version.
Then again, maybe this is all mad ramblings from a sleep deprived mind.
[1]: https://en.wikipedia.org/wiki/Finite_impulse_response
[2]: https://en.wikipedia.org/wiki/Infinite_impulse_response
[3]: https://en.wikipedia.org/wiki/Finite_impulse_response#Window...
>We present a new approach to modeling sequential data: the deep equilibrium model (DEQ). Motivated by an observation that the hidden layers of many existing deep sequence models converge towards some fixed point, we propose the DEQ approach that directly finds these equilibrium points via root-finding. Such a method is equivalent to running an infinite depth (weight-tied) feedforward network, but has the notable advantage that we can analytically backpropagate through the equilibrium point using implicit differentiation.
https://arxiv.org/abs/1909.01377
What's fascinating about deep equilibrium models is that you only need a single layer to be equivalent to a conventional deep neural network with multiple layers. Recursion is all you need! The model automatically uses more iterations for difficult tasks and fewer iterations for easy tasks.
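A toy sketch of that idea (the map, the sizes, and the tolerance here are made-up illustrations, and real DEQs use root-finders like Broyden's method rather than plain iteration): one weight-tied layer iterated to its fixed point, with the iteration count adapting to the input.

```python
import numpy as np

def layer(z, x, W):
    # One weight-tied layer; the DEQ output is its fixed point z* = f(z*, x).
    return np.tanh(W @ z + x)

def solve_fixed_point(x, W, tol=1e-6, max_iter=1000):
    # Naive fixed-point iteration; converges when the layer is a contraction.
    z = np.zeros_like(x)
    for it in range(1, max_iter + 1):
        z_new = layer(z, x, W)
        if np.max(np.abs(z_new - z)) < tol:
            return z_new, it
        z = z_new
    return z, max_iter

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
W *= 0.5 / np.linalg.norm(W, 2)  # spectral norm 0.5 -> the map is a contraction
x = rng.normal(size=8)
z_star, iters = solve_fixed_point(x, W)
residual = np.max(np.abs(layer(z_star, x, W) - z_star))  # ~0 at the fixed point
```

The same single layer stands in for arbitrarily many stacked copies of itself; `iters` is where the "more iterations for harder inputs" behaviour would show up.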
I read a paper recently on something similar for diffusion, called Fixed Point Diffusion Models. They specialize the first and last layers but recurse the middle layer some number of times until convergence.
Considering how a Transformer is a residual model, each layer must be adding more and more precise adjustments to the selected token. It makes a lot of sense to think of this like the steps of an optimisation method.
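That reading can be made concrete with a toy example (entirely illustrative, nothing here is from the paper): gradient descent on a quadratic, written as a stack of residual updates x <- x + f(x).

```python
import numpy as np

# Gradient descent on L(x) = 0.5 * ||x - t||^2, written as residual "layers":
# each layer adds a small correction, like a transformer block refining the
# residual stream toward the final prediction.
def residual_layer(x, t, lr=0.3):
    grad = x - t
    return x + (-lr * grad)  # residual update: x plus a correction term

t = np.array([1.0, -2.0, 0.5])   # illustrative target
x = np.zeros(3)
for _ in range(12):              # twelve "layers"
    x = residual_layer(x, t)
err = np.max(np.abs(x - t))      # shrinks by a factor (1 - lr) per layer
```

Each "layer" makes a progressively finer adjustment, which is the optimisation-steps view of the residual stream.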
This was a bit unfortunate. I think there is something in the idea of latent space reasoning.
Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies.
This biologically inspired method beats large language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (around 1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal.
We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers.
With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.
Well, that's pretty compelling when taken in isolation. I wonder what the catch is?
My gut feeling is that this will limit its capability, because creativity and intelligence involve connecting disparate things, and to do that you need to know them first. Though philosophers have tried, you can't unravel the mysteries of the universe through reasoning alone. You need observations, facts.
What I could see it good for is a dedicated reasoning module.
Also would possibly instantly void the value of trillions of pending AI datacenter capex, which would be funny. (Though possibly not for very long.)
https://arcprize.org/blog/hrm-analysis
This here looks like a stripped down version of HRM - possibly drawing on the ablation studies from this very analysis.
Worth noting that HRMs aren't generally applicable in the same way normal transformer LLMs are. Or, at least, no one has found a way to apply them to the typical generative AI tasks yet.
I'm still reading the paper, but I expect this version to be similar - it uses the same tasks as HRMs as examples. Possibly quite good at spatial reasoning tasks (ARC-AGI and ARC-AGI-2 are both spatial reasoning benchmarks), but it would have to be integrated into a larger more generally capable architecture to go past that.
I've got a major aesthetic problem with the fact that LLMs require this much training data to get where they are, namely "not there yet"; it's brute force by any other name, and just plain kind of vulgar. More importantly, it won't scale much further. Novel architectures will have to feature at some point, and I'll gladly take any positive result in that direction.
"These results suggest that the performance on ARC-AGI is not an effect of the HRM architecture. While it does provide a small benefit, a replacement baseline transformer in the HRM training pipeline achieves comparable performance."
GPU compute is not just for text inference. Video generation demand is something I don't think we'll saturate for quite a while, even with breakthroughs.
If a breakthrough in AI happens, you'll get multiplied benefits, not losses.
I think they would just adopt this idea and use it to continue training huge but more capable models.
That is very impressive.
Side note: Superficially reminds me of Hierarchical Temporal Memory from Jeff Hawkins "On Intelligence". Although this doesn't have the sparsity aspect, its hierarchical and temporal aspects are related.
https://en.wikipedia.org/wiki/Hierarchical_temporal_memory https://www.numenta.com
Language modeling:
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach https://arxiv.org/pdf/2502.05171
Puzzle solving:
A Simple Loss Function for Convergent Algorithm Synthesis using RNNs https://openreview.net/pdf?id=WaAJ883AqiY
End-to-end Algorithm Synthesis with Recurrent Networks: Logical Extrapolation Without Overthinking https://arxiv.org/abs/2202.05826
Can You Learn an Algorithm? Generalizing from Easy to Hard Problems with Recurrent Networks https://proceedings.neurips.cc/paper/2021/file/3501672ebc68a...
General:
Think Again Networks and the Delta Loss https://arxiv.org/pdf/1904.11816
Universal Transformers https://arxiv.org/abs/1807.03819
Adaptive Computation Time for Recurrent Neural Networks https://arxiv.org/pdf/1603.08983
I don't have a huge amount of experience in the nitty gritty details and I'm wondering if I'll be able to run some interesting training on a 3090 in a few days.
It's tiny in terms of number of weights. That's because it reuses and refines the same weights across recursion steps, instead of dedicating fresh weights to each layer, which is what the stacked transformers in usual LLMs do.
However, the FLOPs are exactly the same.
In usual LLMs the cost is (number of transformer blocks) * (per-block cost); here it is (number of recursion steps) * (number of blocks, smaller than usual, 2 here) * (per-block cost).
Basically, this needs compute like a 16-block LLM per training step, since recursions = 8 and blocks = 2. How many steps you need depends mostly on the dataset.
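A back-of-the-envelope sketch of that comparison (the block-cost formula and all sizes below are illustrative assumptions, not the paper's numbers):

```python
def block_flops(d_model, seq_len):
    # Rough per-block transformer cost: attention (4*L^2*d for the score and
    # value matmuls, 8*L*d^2 for the QKV/output projections) plus a
    # 4x-expansion FFN (16*L*d^2). Illustrative, not exact.
    attention = 4 * seq_len**2 * d_model + 8 * seq_len * d_model**2
    ffn = 16 * seq_len * d_model**2
    return attention + ffn

# 8 recursions over 2 weight-tied blocks vs. 16 distinct stacked blocks:
recursive = 8 * 2 * block_flops(d_model=512, seq_len=900)
stacked = 16 * block_flops(d_model=512, seq_len=900)
# Same compute per forward pass; only the parameter count
# (2 vs. 16 blocks' worth of weights) differs.
```

This is the sense in which the model is "tiny": memory for weights, not FLOPs.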
I'm particularly keen to see if you could do speech-to-text with this architecture, and replace Whisper for smaller devices.
But it has the potential to alter the economics of AI quite dramatically.