I don’t say this out of spite; it’s awesome he’s been able to get such opportunities and has made the most of them. But I think it’s important to be realistic about what these narratives imply without additional context, and the answer is: sure, you can be successful with grit, brains, and determination, but the odds are very much not in your favor if you are in the wrong income bracket.
^ This is not doxxing. His LinkedIn profile states it plainly.
Sadly, the same can't be said about India (infrastructure/food security lags China).
And quality of leadership.
They (barring a few exceptions) are happy to gloat over imagined past glories of Vedic aeroplanes, inter-species head transplants apparently performed in a Hindu golden age, and loyalty-based funding that produces institutions like Galgotia University.
I'm not so sure about that: https://www.populationpyramid.net/china/2026/ suggests peak high school in china was years ago.
In China, since this class doesn't exist and forced busywork for class signaling is largely driven underground by government policy, young people just work on whatever the government policy points at, which right now is tech. Hence you see smart young people getting into tech. In the US, meanwhile, they do finance (if they built up the right portfolio to be allowed into the elite jobs club) or speedrunning (if they didn't). Ironically, the current US looks a lot like imperial China, while China looks like the US of the late 1800s, but fully vertically integrated.
Even if food security holds back 10% of Indians (which would still be a huge tragedy), that would still leave the other 90% for the 'onslaught'. 10% is just a made up number. But even with 50% you'd get an 'onslaught'.
So if we are seeing less than that, it's probably down to other factors.
So this is what really unsettles me. Not that China graduates more engineers every year than we have employed in the entire US, but rather that these individuals are not about delegating work, but actually doing it. Whereas the western credo is to get someone else to do the work (or in the words of Patton, to get the other guy to die for his country), I get the feeling that China will get robots and AI to do the work. I am reminded of the joke about Chinese factories having only 1 security guard and 1 dog. The guard is there to feed the dog.
This. One of the things that most shocked me when I moved to London was how bad English people were at hard skills, but also how easily giving orders and "projecting gravitas" came to them. Everyone wants to be a "leader", which sadly has become code for reaping benefits of other people's work.
> Abstract: Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. [...]
> Full AttnRes is straightforward but requires O(Ld) memory at scale. Block AttnRes partitions layers into N blocks, accumulates within each block via standard residuals, and applies attention only over block-level representations. With ~8 blocks, it recovers most of Full AttnRes's gains while serving as a practical drop-in replacement with marginal overhead.
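Reading the quoted description, the block variant seems to amount to something like the toy sketch below. This is my own paraphrase in numpy, not the paper's code; the per-block query vectors and the exact placement of the block-level attention step are assumptions. The point it illustrates is the memory claim: you keep N block-level states around for the attention instead of all L layer outputs.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, L, N = 16, 8, 2          # 8 layers split into 2 blocks -> O(Nd) stored states, not O(Ld)
layers_per_block = L // N

def layer_fn(x, W):
    # stand-in for a full transformer block
    return np.tanh(W @ x)

Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(L)]
queries = [rng.normal(size=d) / np.sqrt(d) for _ in range(N)]  # hypothetical per-block queries

h = rng.normal(size=d)
block_reps = [h]            # only block-level representations are kept for the attention

for b in range(N):
    # attend over block-level representations only (toy input-dependent weights)
    hist = np.stack(block_reps)
    w = softmax(hist @ queries[b])
    h = w @ hist
    # inside the block: plain residual accumulation, as in a standard transformer
    for l in range(b * layers_per_block, (b + 1) * layers_per_block):
        h = h + layer_fn(h, Ws[l])
    block_reps.append(h)

print(len(block_reps))      # N + 1 = 3 stored states, instead of L + 1 for full AttnRes
```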
I had a similar idea at the back of my head but here is a layman explanation:
A standard transformer threads the previous layer's output into the next layer's input. By adding residual connections to each layer, the layers learn an update rule.
There is an obvious limitation here. Only the first layer gets to see the original input, and all subsequent layers only get to see the previous layer's output.
With attention residuals, the idea is that you have a tiny attention operator that decides between using the original input and any of the previous layers' outputs.
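A minimal numpy sketch of that idea (my own toy version, not the paper's exact formulation; the per-layer query vector used to score earlier outputs is an assumption): instead of the usual `h + f(h)` residual, each layer forms a softmax-weighted combination of the original input and all preceding layer outputs.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, L = 16, 6

def layer_fn(x, W):
    # stand-in for a full transformer block
    return np.tanh(W @ x)

Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(L)]
q_proj = [rng.normal(size=d) / np.sqrt(d) for _ in range(L)]  # hypothetical per-layer queries

x0 = rng.normal(size=d)
outputs = [x0]              # h_0 is the original input / embedding

for l in range(L):
    # learned, input-dependent weights over the input and all preceding outputs
    hist = np.stack(outputs)            # (l + 1, d)
    scores = hist @ q_proj[l]           # one score per earlier representation
    w = softmax(scores)
    residual_in = w @ hist              # selective aggregation, replacing the fixed unit-weight sum
    outputs.append(residual_in + layer_fn(residual_in, Ws[l]))

print(len(outputs), outputs[-1].shape)  # 7 (16,)
```

With fixed weights of 1 on every earlier output this degenerates back to the ordinary accumulated residual stream, which is what the abstract means by "uncontrolled hidden-state growth with depth."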
1. Drops compute required for training by ~20%. This approach won't just help the ever-escalating model sizes larger companies are pushing for; it means things like autoresearch can iterate on new model architectures faster.
2. WAY lower bandwidth requirements for inference, meaning approaches like this should run far better on consumer hardware. It apparently requires 1/6th the memory bandwidth of a traditional approach for better results.
This is a big improvement if it can be generalized. They're claiming it's a drop-in replacement, so it seems like it can be.
This is not true. The authors claim that w.r.t. training, their method adds negligible overhead for AttnRes with no memory impact (but it is way more complicated for Block AttnRes, since we need to use pipelining for larger models, hence the O(Ld) and O(Nd) figures, with N ≪ L).
> WAY lower bandwidth requirements for inference.
Also not true. Paper has nothing to do with inference, apart from the benchmarks. If you're looking at the graph about "compute advantage," it's about training compute. They do some interpolation to get to the 1.25x number, basically answering the question "if non-AttnRes architecture were trained, how much compute would it take to get to the same loss as AttnRes?" (The answer being ~20% more compute.) It's an interesting claim, but there's all kinds of weird and unexpected convergence that can happen, so take it with a grain of salt.
If model A reaches performance level 100 using 100 units of compute using old methods, and you train model B using AttnRes, aiming at performance level 100, it costs you 80 units of compute.
It probably doesn't map precisely, but that's where people are diverging from the claim - it doesn't explicitly say anything about reduced inference or training time, but that's the implicit value of these sorts of things. Less compute to equivalent performance can be a huge win for platforms at scale as well as for local models.
That should be the headline right there. Giant size-60-font headline.
Some people have PhDs in burying the lede!
Is it guaranteed to have the same effect on vanishing gradients though? What if it put weight 1 on a layer that had a tiny gradient?