With the same data augmentation / 'test time training' setting, the vanilla Transformers do pretty well, close to the "breakthrough" HRM reported. From a brief skim, this paper is using similar settings to compare itself on ARC-AGI.
I too want to believe in smaller models with excellent reasoning performance. But first understand what ARC-AGI tests for, what the general setting is -- the one commercial LLMs use to compare against each other -- and what specialised setting HRM and this paper use for evaluation.
The naming of that benchmark lends itself to hype, as we've seen in both HRM and this paper.
I think ARC-AGI was supposed to be a challenge for any model, the assumption being that you'd need the reasoning abilities of large language models to solve it. It turns out that this assumption is somewhat wrong. Do you mean that HRM and TRM are specifically trained on a small dataset of ARC-AGI samples, while LLMs are not? Or which difference exactly are you hinting at?
Yes, precisely this. The question is really: what is ARC-AGI evaluating for?
1. If the goal is to see whether models can generalise to the ARC-AGI evals, then models being evaluated on it should not be trained on the tasks -- especially if the ARC-AGI evaluations are constructed to be OOD from the ARC-AGI training data (I don't know if they are). Further, in the HRM case there seems to be usage of the few-shot examples from the evals to construct more training data. TRM may achieve something similar through its training data by other means.
2. If the goal is that even _having seen_ the training examples, and creating more training examples (after having peeked at the test set), these evaluations should still be difficult, then the ablations show that you can get pretty far without universal/recurrent Transformers.
If 1, then I think the ARC-prize organisers should have better rules laid out for the challenge. From the blog post, I do wonder how far people will push the boundary (how much can I look at the test data to 'augment' my training data?) before the organisers say "This is explicitly not allowed for this challenge."
If 2, the organisers of the challenge should have evaluated how much of a challenge it would actually have been allowing extreme 'data augmentation', and maybe realised it wasn't that much of a challenge to begin with.
I tend to agree that, given the outcome of both HRM and this paper, the ARC-AGI folks do seem to allow this setting, _and_ the task isn't as "AGI-complete" as it sets out to be.
Which is still a fun idea to play around with - this approach clearly has its strengths. But it doesn't appear to be an actual "better Transformer". I don't think it deserves nearly as much hype as it gets.
With recurrence: the idea has been around for a while -- see Universal Transformers: https://arxiv.org/abs/1807.03819
There are reasons why it hasn't really been picked up at scale, and the method tends to do well on synthetic tasks.
Anyway, with an FIR filter[1] you typically need many, many more coefficients to get cutoff performance similar to what a few IIR[2] coefficients can achieve.
You can convert an IIR to an FIR using, for example, the window design method[3]: if you use a rectangular window function, you essentially unroll the recursion but stop after some finite depth.
Similarly, it seems that unrolling the TRM yields the traditional LLM architecture of many repeated attention+FF blocks, minus the global feedback part. And unlike a true IIR, the TRM implements a finite cut-off, so in that sense it is more like a traditional FIR/LLM than the structure suggests.
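For concreteness, a minimal sketch of that IIR-to-FIR truncation (the one-pole filter, the coefficient 0.5, and the depth of 16 are all illustrative choices, not taken from the paper):

```python
import numpy as np

def iir_one_pole(x, a):
    # y[n] = x[n] + a*y[n-1]: the recursion gives an infinite impulse response.
    y = np.empty(len(x))
    acc = 0.0
    for n, xn in enumerate(x):
        acc = xn + a * acc
        y[n] = acc
    return y

def fir_truncated(x, a, depth):
    # Unroll the recursion and stop after `depth` steps: the impulse
    # response [1, a, a^2, ...] becomes a finite set of FIR coefficients.
    h = a ** np.arange(depth)
    return np.convolve(x, h)[: len(x)]

x = np.random.default_rng(0).normal(size=64)
y_iir = iir_one_pole(x, 0.5)
y_fir = fir_truncated(x, 0.5, depth=16)
err = np.max(np.abs(y_iir - y_fir))  # bounded by ~0.5**16 / (1 - 0.5)
```

Deeper truncation tracks the IIR output more closely; the analogy is the TRM's finite recursion depth versus an explicitly unrolled stack of blocks.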
So it would perhaps be interesting to compare the TRM network to a similarly unrolled version.
Then again, maybe this is all mad ramblings from a sleep deprived mind.
[1]: https://en.wikipedia.org/wiki/Finite_impulse_response
[2]: https://en.wikipedia.org/wiki/Infinite_impulse_response
[3]: https://en.wikipedia.org/wiki/Finite_impulse_response#Window...
>We present a new approach to modeling sequential data: the deep equilibrium model (DEQ). Motivated by an observation that the hidden layers of many existing deep sequence models converge towards some fixed point, we propose the DEQ approach that directly finds these equilibrium points via root-finding. Such a method is equivalent to running an infinite depth (weight-tied) feedforward network, but has the notable advantage that we can analytically backpropagate through the equilibrium point using implicit differentiation.
https://arxiv.org/abs/1909.01377
What's fascinating about deep equilibrium models is that you only need a single layer to be equivalent to a conventional deep neural network with multiple layers. Recursion is all you need! The model automatically uses more iterations for difficult tasks and fewer iterations for easy tasks.
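A toy sketch of that idea (the map, the sizes, and the tolerance here are made-up illustrations, and real DEQs use root-finders like Broyden's method rather than plain iteration): one weight-tied layer iterated to its fixed point, with the iteration count adapting to the input.

```python
import numpy as np

def layer(z, x, W):
    # One weight-tied layer; the DEQ output is its fixed point z* = f(z*, x).
    return np.tanh(W @ z + x)

def solve_fixed_point(x, W, tol=1e-6, max_iter=1000):
    # Naive fixed-point iteration; converges when the layer is a contraction.
    z = np.zeros_like(x)
    for it in range(1, max_iter + 1):
        z_new = layer(z, x, W)
        if np.max(np.abs(z_new - z)) < tol:
            return z_new, it
        z = z_new
    return z, max_iter

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
W *= 0.5 / np.linalg.norm(W, 2)  # spectral norm 0.5 -> the map is a contraction
x = rng.normal(size=8)
z_star, iters = solve_fixed_point(x, W)
residual = np.max(np.abs(layer(z_star, x, W) - z_star))  # ~0 at the fixed point
```

The same single layer stands in for arbitrarily many stacked copies of itself; `iters` is where the "more iterations for harder inputs" behaviour would show up.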
I read a paper recently on something similar for diffusion, called Fixed Point Diffusion Models. They specialize the first and last layers but recurse the middle layer some number of times until convergence.
Considering how a Transformer is a residual model, each layer must be adding more and more precise adjustments to the selected token. It makes a lot of sense to think of this like the steps of an optimisation method.
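That reading can be made concrete with a toy example (entirely illustrative, nothing here is from the paper): gradient descent on a quadratic, written as a stack of residual updates x <- x + f(x).

```python
import numpy as np

# Gradient descent on L(x) = 0.5 * ||x - t||^2, written as residual "layers":
# each layer adds a small correction, like a transformer block refining the
# residual stream toward the final prediction.
def residual_layer(x, t, lr=0.3):
    grad = x - t
    return x + (-lr * grad)  # residual update: x plus a correction term

t = np.array([1.0, -2.0, 0.5])   # illustrative target
x = np.zeros(3)
for _ in range(12):              # twelve "layers"
    x = residual_layer(x, t)
err = np.max(np.abs(x - t))      # shrinks by a factor (1 - lr) per layer
```

Each "layer" makes a progressively finer adjustment, which is the optimisation-steps view of the residual stream.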
This was a bit unfortunate. I think there is something in the idea of latent space reasoning.
Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies.
This biologically inspired method beats large language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (around 1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal.
We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers.
With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.
Well, that's pretty compelling when taken in isolation. I wonder what the catch is?
My gut feeling is that this will limit its capability, because creativity and intelligence involve connecting disparate things, and to do that you need to know them first. Though philosophers have tried, you can't unravel the mysteries of the universe through reasoning alone. You need observations, facts.
What I could see it good for is a dedicated reasoning module.
Also would possibly instantly void the value of trillions of pending AI datacenter capex, which would be funny. (Though possibly not for very long.)
https://arcprize.org/blog/hrm-analysis
This here looks like a stripped down version of HRM - possibly drawing on the ablation studies from this very analysis.
Worth noting that HRMs aren't generally applicable in the same way normal transformer LLMs are. Or, at least, no one has found a way to apply them to the typical generative AI tasks yet.
I'm still reading the paper, but I expect this version to be similar - it uses the same tasks as HRMs as examples. Possibly quite good at spatial reasoning tasks (ARC-AGI and ARC-AGI-2 are both spatial reasoning benchmarks), but it would have to be integrated into a larger more generally capable architecture to go past that.
I've got a major aesthetic problem with the fact that LLMs require this much training data to get where they are, namely "not there yet"; it's brute force by any other name, and just plain kind of vulgar. More importantly, it won't scale much further. Novel architectures will have to feature at some point, and I'll gladly take any positive result in that direction.
"These results suggest that the performance on ARC-AGI is not an effect of the HRM architecture. While it does provide a small benefit, a replacement baseline transformer in the HRM training pipeline achieves comparable performance."
GPU compute is not just for text inference. Video generation demand is something I don't think we'll saturate for quite a while, even with breakthroughs.
If a breakthrough in AI happens, you'll get multiplied benefits, not losses.
I think they would just adopt this idea and use it to continue training huge but more capable models.
That is very impressive.
Side note: Superficially reminds me of Hierarchical Temporal Memory from Jeff Hawkins "On Intelligence". Although this doesn't have the sparsity aspect, its hierarchical and temporal aspects are related.
https://en.wikipedia.org/wiki/Hierarchical_temporal_memory https://www.numenta.com
Language modeling:
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach https://arxiv.org/pdf/2502.05171
Puzzle solving:
A Simple Loss Function for Convergent Algorithm Synthesis using RNNs https://openreview.net/pdf?id=WaAJ883AqiY
End-to-end Algorithm Synthesis with Recurrent Networks: Logical Extrapolation Without Overthinking https://arxiv.org/abs/2202.05826
Can You Learn an Algorithm? Generalizing from Easy to Hard Problems with Recurrent Networks https://proceedings.neurips.cc/paper/2021/file/3501672ebc68a...
General:
Think Again Networks and the Delta Loss https://arxiv.org/pdf/1904.11816
Universal Transformers https://arxiv.org/abs/1807.03819
Adaptive Computation Time for Recurrent Neural Networks https://arxiv.org/pdf/1603.08983
I don't have a huge amount of experience in the nitty gritty details and I'm wondering if I'll be able to run some interesting training on a 3090 in a few days.
It's tiny in terms of number of weights. That's because it reuses and refines the same weights across recursion steps, instead of dedicating fresh weights to each layer, which is what the stacked transformers in usual LLMs do.
However, the FLOPs are exactly the same.
In usual LLMs the cost is (number of transformer blocks) * (per-block cost); here it is (number of recursion steps) * (number of blocks, smaller than usual, 2 here) * (per-block cost).
Basically, this needs compute like a 16-block LLM per training step, since recursions = 8 and blocks = 2. How many steps you need depends mostly on the dataset.
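A back-of-the-envelope sketch of that comparison (the block-cost formula and all sizes below are illustrative assumptions, not the paper's numbers):

```python
def block_flops(d_model, seq_len):
    # Rough per-block transformer cost: attention (4*L^2*d for the score and
    # value matmuls, 8*L*d^2 for the QKV/output projections) plus a
    # 4x-expansion FFN (16*L*d^2). Illustrative, not exact.
    attention = 4 * seq_len**2 * d_model + 8 * seq_len * d_model**2
    ffn = 16 * seq_len * d_model**2
    return attention + ffn

# 8 recursions over 2 weight-tied blocks vs. 16 distinct stacked blocks:
recursive = 8 * 2 * block_flops(d_model=512, seq_len=900)
stacked = 16 * block_flops(d_model=512, seq_len=900)
# Same compute per forward pass; only the parameter count
# (2 vs. 16 blocks' worth of weights) differs.
```

This is the sense in which the model is "tiny": memory for weights, not FLOPs.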
I'm particularly keen to see if you could do speech-to-text with this architecture, and replace Whisper for smaller devices.
But it has the potential to alter the economics of AI quite dramatically.