I am aware I could Google it all or ask an LLM, but I'm still interested in a good human explanation.
Of course with Kimi there is fear because the Chinese government can easily pressure Moonshot AI into sharing the data, and other countries have to work to stealthily siphon data off without being caught by Chinese counterintelligence. As opposed to GPT-5, where the American government can easily pressure OpenAI, and every other country has to stealthily siphon data off without being caught by American counterintelligence. The only way to be reasonably certain that you aren't spied on is to run your own models, or to rent raw GPU time and run them yourself.
The bigger worry imho is whether the models are booby-trapped to give poisoned answers when they detect certain queries, or when they detect that you work for a competitor or enemy of China. But that would have to be reasonably stealthy to work.
Not doubting that they're spying on people, but regardless, how would you really know? Are you basing this on the fact that no Chinese police have visited you? How would you really know whether it's "simply true" or not?
With that said, I use plenty of models coming out of China too with no fear, but I'm also using them locally, not cloud platforms.
The model, the tooling you use, and even which prompts you use with which model all have a big impact on the quality of the responses you get.
They must all be doing humanity a great favor out of goodwill, then.
Sorry, but seriously -- the Chinese government, controlled by the Chinese Communist Party (CCP), can effectively seize or shut down internet services and infrastructure at will within its borders under its national security laws.
No need to read the TOS; it's in the law.
How likely are we to NOT see the AI data center apocalypse through better algorithms?
But so far this has just led to more induced demand. There are a lot of things we would use LLMs for if they were just cheap enough, and every increase in efficiency makes more of those use cases viable.
Near certain IMO. Algorithmic improvements have outpaced hardware improvements for decades. We're already seeing the rise of small models and how simple tweaks can make small models very capable problem solvers, better even than state-of-the-art large models. Data center scaling is nearing its peak IMO, as we're hitting data limits which cap model size anyway.
If anything, the US is massively underproducing.
Consider the implications of increases in efficiency *when you hold compute constant*.
The win is far more obvious when it's "we can do more with what we have" instead of "we can do the same with less".
A lot of the optimizations are not some groundbreaking new way to program; they're techniques already known to any software or systems engineer.
Evaluation Benchmarks
Our evaluation encompasses five primary categories of benchmarks, each designed to assess distinct capabilities of the model:
• Language Understanding and Reasoning: HellaSwag [121], ARC-Challenge [14], Winogrande [83], MMLU [36], TriviaQA [47], MMLU-Redux [26], MMLU-Pro [103], GPQA-Diamond [82], BBH [94], and [105].
• Code Generation: LiveCodeBench v6 [44], EvalPlus [60].
• Math & Reasoning: AIME 2025, MATH 500, HMMT 2025, PolyMath-en.
• Long-context: MRCR, RULER [38], Frames [52], HELMET-ICL [118], RepoQA [61], Long Code Arena [13], and LongBench v2 [6].
• Chinese Language Understanding and Reasoning: C-Eval [43], and CMMLU [55].
Honestly, it would take 24 hours just to download the 98 GB model if I wanted to try it out (assuming I had a card with 98 GB of VRAM).
With an RTX 3070 (8 GB VRAM), 32 GB RAM and an SSD, I can run such models at speeds tolerable for casual use.
> Can you explain what this means and its significance? Assume that I'm a layperson with no familiarity with LLM jargon so explain all of the technical terms, references, names. https://github.com/MoonshotAI/Kimi-Linear
Imagine your brain could only “look at” a few words at a time when you read a long letter. Today’s big language models (the AI that powers chatbots) have the same problem: the longer the letter gets, the more scratch paper they need to keep track of it all. That scratch paper is called the “KV cache,” and for a 1 000 000-word letter it can fill a small library.
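To get a feel for why that "scratch paper" gets so big, here's a rough back-of-the-envelope calculation. All of the dimensions below (layer count, head count, head size, precision) are made-up illustrative values for a generic full-attention transformer, not Kimi Linear's actual configuration:

```python
# Rough KV-cache size for a plain full-attention transformer.
# Every parameter here is an assumed, illustrative value.
def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_val=2):
    # Per token, each layer stores one key and one value vector per KV head
    per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_val
    return context_len * per_token

for ctx in (4_000, 128_000, 1_000_000):
    print(f"{ctx:>9} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
```

The point is that the cache grows linearly with context length: with these assumed dimensions, a million-token context needs over a hundred gigabytes of cache before you even count the model weights.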
Kimi Linear is a new way for the AI to read and write that throws away most of that scratch paper yet still understands the letter. It does this by replacing the usual “look at every word every time” trick (full attention) with a clever shortcut called linear attention. The shortcut is packaged into something they call Kimi Delta Attention (KDA).
What the numbers mean in plain English
51.0 on MMLU-Pro: on a 4 000-word school-test set, the shortcut scores about as well as the old, slow method.
84.3 on RULER at 128 000 words: on a much longer test it keeps the quality high while running almost four times faster.
6× faster TPOT (time per output token): when the AI is writing its reply, each new word appears up to six times sooner than with the previous best method (MLA, an efficient variant of full attention).
75 % smaller KV cache: the scratch paper is only one-quarter the usual size, so you can fit longer conversations in the same memory.
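That 75% figure falls out of the hybrid design described below: if 3 of every 4 layers replace their per-token cache with a small fixed-size state, only a quarter of the layers still pay the full cost. A toy calculation, with all byte counts assumed for illustration:

```python
# Toy cache-size comparison for a 3:1 hybrid (all numbers are made up for illustration)
layers = 24
full_layers = layers // 4            # 1 in every 4 layers keeps full attention
kda_layers = layers - full_layers    # the other 3 use a fixed-size state instead

ctx = 1_000_000                      # context length in tokens
kv_per_layer_per_token = 4096        # bytes of K+V stored per token per layer (assumed)
kda_state_bytes = 8 * 2**20          # fixed-size state per fast layer, independent of ctx (assumed)

all_full = layers * ctx * kv_per_layer_per_token
hybrid = full_layers * ctx * kv_per_layer_per_token + kda_layers * kda_state_bytes
print(f"hybrid cache is {hybrid / all_full:.1%} of the all-full-attention cache")
```

With these assumed numbers the hybrid cache comes out at roughly a quarter of the full-attention cache, matching the "75% smaller" claim; the fixed-size states barely register because they don't grow with context length.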
Key pieces explained
Full attention: the old, accurate but slow "look back at every word" method.
KV cache: the scratch paper that stores which words were already seen.
Linear attention: a faster but traditionally weaker way of summarising what was read.
Gated DeltaNet: an improved linear attention trick that keeps the most useful bits of the summary.
Kimi Delta Attention (KDA): Moonshot’s even better version of Gated DeltaNet.
Hybrid 3:1 mix: three layers use the fast KDA shortcut, one layer still uses the old reliable full attention, giving speed without losing smarts.
48 B total, 3 B active: the model has 48 billion total parameters, but only 3 billion "turn on" for any given word (a design called mixture-of-experts), saving compute.
Context length 1 M: it can keep track of about 1 000 000 words in one go—longer than most novels.
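The difference between full and linear attention can be sketched in a few lines of NumPy. This is a generic kernel-based linear attention toy, not KDA itself (which adds learned gates and a delta-rule update); the feature map is an arbitrary positive function chosen just for the demo. The two computations produce identical outputs, but the second one never stores a per-token cache:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                              # 6 tokens, 4-dimensional heads (toy sizes)
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

def phi(x):
    # Arbitrary positive feature map for the demo (real models learn theirs)
    return np.maximum(x, 0.0) + 1e-3

# Causal "full attention" with a kernel score: O(n^2) work, and the whole
# K/V history (the KV cache) must be kept around.
mask = np.tril(np.ones((n, n)))
scores = (phi(Q) @ phi(K).T) * mask
full_out = (scores @ V) / scores.sum(axis=1, keepdims=True)

# Linear attention: same answer, but it only carries a fixed-size running
# state (a d x d matrix plus a d-vector), no matter how long the sequence is.
S = np.zeros((d, d))     # running sum of outer(phi(k), v)
z = np.zeros(d)          # running normalizer
lin_out = np.zeros((n, d))
for t in range(n):
    S += np.outer(phi(K[t]), V[t])
    z += phi(K[t])
    lin_out[t] = (phi(Q[t]) @ S) / (phi(Q[t]) @ z)

assert np.allclose(full_out, lin_out)    # identical outputs, constant-size "scratch paper"
```

The catch, and the reason linear attention was traditionally weaker, is that squeezing the whole history into one fixed-size matrix loses information; the gating and delta-rule tricks in KDA are about deciding what to keep in that matrix and what to forget.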
Bottom line
Kimi Linear lets an AI read very long documents or hold very long conversations with far less memory and much less waiting time, while still giving answers as good as—or better than—the big, slow models we use today.