Leanstral: Open-source agent for trustworthy coding and formal proof engineering (opens in new tab)

(mistral.ai)

783 pointsPoudlardo2mo ago191 comments

Lean 4 paper (2021): https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37

191 comments

It’s great to see this pattern of people realising that agents can specify the desired behavior then write code to conform to the specs.

TDD, verification, whatever your tool; verification suites of all sorts accrue over time into a very detailed repository of documentation of how things are supposed to work that, being executable, puts zero tokens in the context when the code is correct.

It’s more powerful than reams upon reams of markdown specs. That’s because it encodes details, not intent. Your intent is helpful at the leading edge of the process, but the codified result needs shoring up to prevent regression. That’s the area software engineering has always ignored because we have gotten by on letting teams hold context in their heads and docs.

As software gets more complex we need better solutions than “go ask Jim about that, bloke’s been in the code for years”.

bluGill2mo ago

> That’s because it encodes details, not intent.

Be careful here - make sure you encode the right details. I've seen many cases where the tests are encoding the details of how it was implemented and not what it is intended to do. This means that you can't refactor anything because your tests are enforcing a design. (refactor is changing code without deleting tests, the trick is how can you make design changes without deleting tests - which means you have to test as much as possible at a point where changing that part of the design isn't possible anyway)

necovek2mo ago

While you are right that you need to be encoding the right details, I disagree on the tests enforcing a design point.

As part of the proper testing strategy, you will have tests that cover individual behavior of a small block/function (real "unit" tests), tests that cover integration points only up to the integration itself, and a small number of end-to-end or multi-component integration tests.

Only the last category should stay mostly idempotent under refactoring, depending on the type of refactor you are doing.

Integration tests will obviously be affected when you are refactoring the interfaces between components, and unit tests will be affected when you are refactoring the components themselves. Yes, you should apply the strategy that keeps it under incremental reverse TDD approach (do the refactor and keep the old interface, potentially by calling into new API from the old; then in second step replace use of old API as well, including in tests).

Tests generally define behavior and implementation in a TDD approach: it'd be weird if they do not need changing at all when you are changing the implementation.

bluGill2mo ago

Fine, but don't check in the tests that prove implementation since they will be deleted soon anyway. The only tests to check in are ones that - by failing - informed you that you broke something. We don't know which those tests are and because most tests run fast we tend to check in lots of tests that will never fail in a useful way.

1 more reply

BowBun2mo ago

I feel like the difference is minimal, if not entirely dismissable. Code in this sense is just a representation of the same information as someone would write in an .md file. The resolution changes, and that's where both detail and context are lost.

I'm not against TDD or verification-first development, but I don't think writing that as code is the end-goal. I'll concede that there's millions of lines of tests that already exist, so we should be using those as a foundation while everything else catches up.

chriswarbo2mo ago

Tests (and type-checkers, linters, formal specs, etc.) ground the model in reality: they show it that it got something wrong (without needing a human in the loop). It's empiricism, "nullius in verba"; the scientific approach, which lead to remarkable advances in a few hundred years; that over a thousand years of ungrounded philosophy couldn't achieve.

discreteevent2mo ago

The scientific approach is not only or primarily empiricism. We didn't test our way to understanding. The scientific approach starts with a theory that does it's best to explain some phenomenon. Then the theory is criticized by experts. Finally, if it seems to be a promising theory tests are constructed. The tests can help verify the theory but it is the theory that provides the explanation which is the important part. Once we have explanation then we have understanding which allows us to play around with the model to come up with new things, diagnose problems etc.

The scientific approach is theory driven, not test driven. Understanding (and the power that gives us) is the goal.

2 more replies

applfanboysbgon2mo ago

It most certainly is not. All your tests are doing is seeding the context with tokens that increase the probability of tokens related to solving the problem being selected next. One small problem: if the dataset doesn't have sufficiently well-represented answers to the specific problem, no amount of finessing the probability of token selection is going to lead to LLMs solving the problem. The scientific method is grounded in the ability to reason, not probabilistically retrieve random words that are statistically highly correlated with appearing near other words.

cowboy_henk2mo ago

This only holds if you understand what's in the tests, and the tests are realistic. The moment you let the LLM write the tests without understanding them, you may as well just let it write the code directly.

2 more replies

pydry2mo ago

This assumes that tests are realistic, which for the most part they are not.

cadamsdotcom2mo ago

Say you describe your kitchen as “I want a kitchen” - where are the knives? Where’s the stove? Answer: you abdicated control over those details, so it’s wherever the stochastic parrot decided to put them, which may or may not be where they ended up last time you pulled your LLM generate-me-a-kitchen lever. And it may not be where you want.

Don’t like the layout? Let’s reroll! Back to the generative kitchen agent for a new one! ($$$)

The big labs will gladly let you reroll until you’re happy. But software - and kitchens - should not be generated in a casino.

A finished software product - like a working kitchen - is a fractal collection of tiny details. Keeping your finished software from falling apart under its own weight means upholding as many of those details as possible.

Like a good kitchen a few differences are all that stands between software that works and software that’s hell. In software the probability that an agent will get 100% of the details right is very very small.

Details matter.

vidarh2mo ago

If it is fast enough, and cheap enough, people would very happily reroll specific subsets of decisions until happy, and then lock that down. And specify in more details the corner cases that it doesn't get just how you want it.

People metaphorically do that all the time when designing rooms, in the form of endless browsing of magazines or Tik Tok or similar to find something they like instead of starting from first principles and designing exactly what they want, because usually they don't know exactly what they want.

A lot of the time we'd be happier with a spec at the end of the process than at the beginning. A spec that ensures the current understanding of what is intentional vs. what is an accident we haven't addressed yet is nailed down would be valuable. Locking it all down at the start, on the other hand, is often impossible and/or inadvisable.

1 more reply

refulgentis2mo ago

I've seen this sentiment and am a big fan of it, but I was confused by the blog post, and based on your comment you might be able to help: how does Lean help me? FWIW, context is: code Dart/Flutter day to day.

I can think of some strawmen: for example, prove a state machine in Lean, then port the proven version to Dart? But I'm not familiar enough with Lean to know if that's like saying "prove moon made of cheese with JavaScript, then deploy to the US mainframe"

baq2mo ago

yesterday I had to tell a frontier model to translate my code to tla+ to find a tricky cache invalidation bug which nothing could find - gpt 5.4, gemini 3.1, opus 4.6 all failed. translation took maybe 5 mins, the bug was found in seconds, total time to fix from idea to commit - about 15 mins.

if you can get a model to quickly translate a relevant subset of your code to lean to find tricky bugs and map lean fixes back to your codebase space, you've got yourself a huge unlock. (spoiler alert: you basically can, today)

refulgentis2mo ago

Thanks for following up on this: I was really surprised by how much air this paeon to, idk, TDD, took out of the comments by getting off-topic.

Before you commented, I started poking at what you described for 15 minutes, then forget about it and fell asleep. Now I remembered, and I know it's viable and IIUC it's almost certainly going to make a big difference in my work practice moving forward. Cheers.

Paracompact2mo ago

I don't think he's referring to Lean specifically, but any sort of executable testing methodology. It removes the human in the loop in the confidence assurance story, or at least greatly reduces their labor. You cannot ever get such assurance just by saying, "Well this model seems really smart to me!" At best, you would wind up with AI-Jim.

(One way Lean or Rocq could help you directly, though, would be if you coded your program in it and then compiled it to C via their built-in support for it. Such is very difficult at the moment, however, and in the industry is mostly reserved for low-level, high-consequence systems.)

trenchgun2mo ago

>Such is very difficult at the moment

What do you mean? It's a nice and simple language. Way easier to get started than OCaml or Haskell for example. And LLMs write programs in Lean4 with ease as well. Only issue is that there are not as many libraries (for software, for math proofs there is plenty).

But for example I worked with Claude Code and implemented a shell + most of unix coreutils in like a couple of hours. Claude did some simple proofs as well, but that part is obvs harder. But when the program is already in Lean4, you can start moving up the verification ladder up piece by piece.

1 more reply

refulgentis2mo ago

But isn't that tantamount with "his comment is a complete non-sequitor"?

1 more reply

tonymet2mo ago

AI is the reality that TDD never before had the opportunity to live up to

nextos2mo ago

Not just TDD. Amazon, for instance, is heading towards something between TDD and lightweight formal methods.

They are embracing property-based specifications and testing à la Haskell's QuickCheck: https://kiro.dev

Then, already in formal methods territory, refinement types (e.g. Dafny, Liquid Haskell) are great and less complex than dependent types (e.g. Lean, Agda).

pydry2mo ago

The deluge of amazon bugs ive been seeing recently makes me hesitant to follow in amazon's lead.

prohobo2mo ago

What about model-driven development? Spec to code was the name of the game for UML.

1 more reply

oakpond2mo ago

It makes sense to me as long as you're not vibe coding the PBTs.

viking1232mo ago

Kiro is such garbage though

1 more reply

strujillo2mo ago

That matches what I’ve seen as well — generation is the easy part, validation is the bottleneck.

I’ve been experimenting with a small sparse-regression system that infers governing equations from raw data, and it can produce a lot of plausible candidates quickly. The hard part is filtering out the ones that look right but violate underlying constraints.

For example, it recovered the Sun’s rotation (~25.1 days vs 27 actual) from solar wind data, but most candidate equations were subtly wrong until you enforced consistency checks.

Feels like systems that treat verification as the source of truth (not just an afterthought) are the ones that will actually scale.

lsb2mo ago

The real world success they report reminds me of Simon Willison’s Red Green TDD: https://simonwillison.net/guides/agentic-engineering-pattern...

> Instead of taking a stab in the dark, Leanstral rolled up its sleeves. It successfully built test code to recreate the failing environment and diagnosed the underlying issue with definitional equality. The model correctly identified that because def creates a rigid definition requiring explicit unfolding, it was actively blocking the rw tactic from seeing the underlying structure it needed to match.

saberience2mo ago

That article is literally a definition of TDD that has been around for years and years. There's nothing novel there at all. It's literally test driven development.

jatins2mo ago

If Agent is writing the tests itself, does it offer better correctness guarantees than letting it write code and tests?

bluGill2mo ago

In my experience the agent regularly breaks some current features while adding a new one - much more often than a human would. Agents too often forget about the last feature when adding the next and so will break things. Thus I find Agent generated tests important as they stop the agent from making a lot of future mistakes.

MillionOClock2mo ago

It is definitely not foolproof but IMHO, to some extent, it is easier to describe what you expect to see than to implement it so I don't find it unreasonable to think it might provide some advantages in terms of correctness.

stingraycharles2mo ago

That definitely depends upon the situation. More often than not, properly testing a component takes me more time than writing it.

1 more reply

rvz2mo ago

Given the issues with AWS with Kiro and Github, We already have just a few high-profile examples of what happens when AI is used at scale and even when you let it generate tests which is something you should absolutely not do.

Otherwise in some cases, you get this issue [0].

[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...

vlfig2mo ago

Don't "let it" generate tests. Be intentional. Define them in a way that's slightly oblique to how the production code approaches the problem, so the seams don't match. Heck, that's why it's good to write them before even thinking about the prod side.

louiskottmann2mo ago

The linked article does not speak of tests, it speaks of a team that failed to properly review an LLM refactor then proceeds to blame the tooling.

LLMs are good at writing tests in my experience.

skanga2mo ago

TDD == Prompt Engineering, for Agentic coding tasks.

_boffin_2mo ago

Wild it’s taken people this long to realize this. Also lean tickets / tasks with all needed context to complete the task, including needed references / docs, places to look in source, acceptance criteria, other stuff.

rothific2mo ago

There have been a lot of conversations recently about how model alignment is relative and diversity of alignment is important - see the recent podcast episode between Jack Clark (co-founder of Anthropic) and Ezra Klein.

Many comments here point out that Mistral's models are not keeping up with other frontier models - this has been my personal experience as well. However, we need more diversity of model alignment techniques and companies training them - so any company taking this seriously is valuable.

nicman232mo ago

they ll get there

kimsant2mo ago

AI agents will become a comodity.

Europeans not wanting to be dependent, and they are giving for free what US investors planed to charge with 90% margin.

Amazing! What a blast. Thank you for your service (this first 100M$ burned to POC GPT1 and from here, we are so good to go)

warpspin2mo ago

The problem with the European independence story is, that it seems Mistral runs its own stuff also on US cloud act affected infrastructure. This makes them a very weird value proposition: If I accept a level of "independence" whereby I run on AWS or Azure, I could as well pay for Anthropic or GPT to have SOTA performance.

If I do not accept that level of independence but want more, I need to buy what's on OVH, Scaleway, Ionos etc. or host my own, but that usually means even smaller, worse models or a lot of investment.

Nevertheless, the "band" that Mistral occupies for economic success is very narrow. Basically just people who need independence "on paper" but not really. Because if I'm searching for actual independence, there's no way I could give them money at the moment for one of their products and it making sense, cause none of their plans are an actual independence-improvement over, let's say, Amazon Bedrock.

I really really want to support them, but it must make economic sense for my company, too, and it doesn't.

kimsant2mo ago

I don’t care about the servers, they are a comodity already.

The key is to avoid chantage, remember Oracle with DBs, people learned not to build on top of unreplaceable stuff

tin7in2mo ago

They are building their own infra - south of Paris and another one was announced in Sweden recently.

warpspin2mo ago

Then why does their list of subprocessors list Google and Microsoft "for cloud infrastructure", specifically for "Le Chat, La Plateforme, Mistral Code"? Sounds to me as if they're mainly running on Azure.

Also, they're listing CoreWeave as inference provider in "EEA" area, but CoreWeave is of course also an US company. Even if they have their data center physically in the EU, it must be considered open access for the USA due to the CLOUD act.

https://trust.mistral.ai/subprocessors

If what you say is true, they have a communications problem and they need to fix that urgently. Right now, this is why they don't get my business. Others will have made the same decision based on their own subprocessor list.

Or did you mean, they're like, right now building it and plan to move there, but it's not up yet?

1 more reply

bigfudge2mo ago

I really hope you're right. Sadly, though, I don't see any evidence of UK companies disinvesting from big US tech. There aren't good alternatives and what there is is too complex. As long as 'everyone else is still using MS', it seems like it's a brave CTO that switches to European providers. Unless that happens, the network effect of having AI+data is likely to mean US tech still has a big advantage in corp settings. But, HN - please tell me I'm wrong!

utopiah2mo ago

> There aren't good alternatives and what there is is too complex.

Sounds like a worth challenge for this community, mind giving actual examples and see what others can suggest?

coffeebeqn2mo ago

Vertical integration and breadth and depth of offerings on the cloud and customer lock-in from dominating it for 20 years

worldsayshi2mo ago

I wonder what the biggest (non-AI) moats are for US tech against the alternatives?

1 more reply

baq2mo ago

they will, but the jagged frontier is fractal and each one will have different capabilities; you'll want to mix models and to get best results consistently you'll need to.

jasonjmcghee2mo ago

Curious if anyone else had the same reaction as me

This model is specifically trained on this task and significantly[1] underperforms opus.

Opus costs about 6x more.

Which seems... totally worth it based on the task at hand.

[1]: based on the total spread of tested models

beernet2mo ago

Agreed. The idea is nice and honorable. At the same time, if AI has been proving one thing, it's that quality usually reigns over control and trust (except for some sensitive sectors and applications). Of course it's less capital-intense, so makes sense for a comparably little EU startup to focus on that niche. Likely won't spin the top line needle much, though, for the reasons stated.

isodev2mo ago

> quality usually reigns over control and trust

Most Copilot customers use Copilot because Microsoft has been able to pinky promise some level of control for their sensitive data. That's why many don't get to use Claude or Codex or Mistral directly at work and instead are forced through their lobotomised Copilot flavours.

Remember, as of yet, companies haven't been able to actually measure the value of LLMs ... so it's all in the hands of Legal to choose which models you can use based on marketing and big words.

Eridrus2mo ago

This too will be solved. You can get tye frontier models from AWS/Google/Azure without needing to send your data to anyone else already.

hermanzegerman2mo ago

EU could help them very much if they would start enforcing the Laws, so that no US Company can process European data, due to the Americans not willing to budge on Cloud Act.

That would also help to reduce our dependency on American Hyperscalers, which is much needed given how untrustworthy the US is right now. (And also hostile towards Europe as their new security strategy lays out)

bcye2mo ago

This would be unfortunately a rather nuclear option due to the continent’s insane reliance on technology that breaks its unenforced laws.

1 more reply

segmondy2mo ago

Ha, keep putting your prompts and workflows into cloud models. They are not okay with being a platform, they intend to cannibalize all businesses. Quality doesn't always reign over control and trust. Your data and original ideas are your edge and moat.

KetoManx642mo ago

The same old speech that has been used throughout history. When cars were invented people complained to everyone that Ford intended to cannbolize all horse drawn carriages. When manufacturing was invented it cannibalized the work of all the sewing and knitting companies that had women making one item at a time. When Google was invented it cannabolized libraries, and encyclopedias, etc. etc. Yet nobody wants a horse drawn carriage, nor to knit their own sweaters, nor go to the library to look things up in a physical encyclopedia.

miohtama2mo ago

Alignment tax directly eats to model quality, double digit percents.

DarkNova62mo ago

I'm never sure how much faith one can put into such benchmarks but in any case the optics seem to shift once you have pass@2 and pass@3.

Still, the more interesting comparison would be against something such as Codex.

speedgoose2mo ago

But you can run this model for free on a common battery powered laptop sitting on your laps without cooking your legs.

hobofan2mo ago

Sorry, but what are you talking about? This is a 120B-A6B model, which isn't runnable on any laptop except the most beefed up Macbooks, and then will certainly drain its battery and cook your legs.

naasking2mo ago

You can easily run a quant of this on a DGX Spark though. Seems like a small investment if it meaningful improves Lean productivity.

1 more reply

speedgoose2mo ago

Yeah my bad, it requires an expensive MacBook.

I think it would still be fine for the legs and on battery for relatively short loads: https://www.notebookcheck.net/Apple-MacBook-Pro-M5-2025-revi...

But 40 degrees and 30W of heat is a bit more than comfortable if you run the agent continuously.

nimchimpsky2mo ago

the model is open source, you can run it locally. You don't think thats significant ?

andai2mo ago

Trustworthy vibe coding. Much better than the other kind!

Not sure I really understand the comparisons though. They emphasize the cost savings relative to Haiku, but Haiku kinda sucks at this task, and Leanstral is worse? If you're optimizing for correctness, why would "yeah it sucks but it's 10 times cheaper" be relevant? Or am I misunderstanding something?

On the promising side, Opus doesn't look great at this benchmark either — maybe we can get better than Opus results by scaling this up. I guess that's the takeaway here.

flowerbreeze2mo ago

They haven't made the chart very clear, but it seems it has configurable passes and at 2 passes it's better than Haiku and Sonnet and at 16 passes starts closing in on Opus although it's not quite there, while consistently being less expensive than Sonnet.

ainch2mo ago

pass@k means that you run the model k times and give it a pass if any of the answers is correct. I guess Lean is one of the few use cases where pass@k actually makes sense, since you can automatically validate correctness.

andai2mo ago

Oh my bad. I'm not sure how that works in practice. Do you just keep running it until the tests pass? I guess with formal verification you can run it as many times as you need, right?

teekert2mo ago

I also don't understand the focus on vibe coding in the marketing. Vibe coding kind of has the image of being for non-devs, right?

I do like agents (like Claude Code), but I don't consider myself to be vibe coding when I use them. Either I'm using a language/framework I know and check every step. OR I'm learning, checking every step and asking for explanations.

I tried vibe coding, and really dislike the feeling I have when doing it. It feels like building a house, but without caring about it, and just using whatever tech. Sure I may have moisture problems later, but it's a throwaway house anyway. That's how I feel about it. Maybe I have a wrong definition.

Maybe it's good to not use "vibe coding" as a synonym for programming with agent assistance. Just to protect our profession. Like: "Ah you're vibing" (because you have Claude Code open), "No, I'm using CC to essentially type faster and prevent syntax errors and get better test coverage, maybe to get some smart solutions without deep research. But I understand and vouch for every loc here. 'We are not the same.'"

benterix2mo ago

> I tried vibe coding, and really dislike the feeling I have when doing it. It feels like building a house, but without caring about it, and just using whatever tech. Sure I may have moisture problems later, but it's a throwaway house anyway. That's how I feel about it. Maybe I have a wrong definition.

No, I feel the same. I vibe-coded a few projects and after a few weeks I just threw them away, ultimately I felt I just wasted my time and wished I coudl get it back to do something useful.

DANmode2mo ago

> It feels like building a house, but without caring about it, and just using whatever tech.

So, most homebuilders (in the US) unfortunately.

teekert2mo ago

I myself am now and expert at insulation and all the vapor-permeable and vapor-blocking membranes/foils/foams that come with it.

It came at great cost though, I hated the process of learning and the execution. I was less than happy for some years. But I feel even more uncomfortable vibe-home-improving than I do vibe-coding. The place is starting to look nice now though.

andai2mo ago

Yeah, the original meaning of Vibe Coding was "not looking at the code, just going on vibes", but a lot of people now use it to mean "AI was involved in some way".

I see a whole spectrum between those two. I typically alternate between "writing code manually and asking AI for code examples" (ChatGPT coding), and "giving AI specific instructions like, write a function blarg that does foo".

The latter I call Power Coding, in the sense of power armor, because you're still in control and mostly moving manually, but you're much stronger and faster.

I like this better than "tell agent to make a bunch of changes and come back later" because first of all it doesn't break flow (you can use a smaller model for such fine-grained changes so it goes very fast -- it's "realtime"), and second, you don't ever desync from the codebase and need to spend extra time figuring out what the AI did. Each change is sanity-checked as it comes in.

So you stay active, and the code stays slop-free.

I don't hear a lot of people doing this though? Maybe we just don't have good language for it.

teekert2mo ago

"I don't hear a lot of people doing this though? Maybe we just don't have good language for it."

Interesting thought. I guess we don't really, vibe coding is to powerful a term. But perhaps just call it LLM assisted programming? Where we used to do Stack Overflow assisted programming. LLM assisted programming is more focused, goes faster. But since you're wandering around less I guess you learn less, you're exposed to less new information, some of it was helpful in unexpected ways. Now you have to make learning a specific part of your flow, and that takes discipline/time. But is well worth it imho. Actually, for me it's the only way to enjoy it.

DrewADesign2mo ago

It’s really not hard — just explicitly ask for trustworthy outputs only in your prompt, and Bob’s your uncle.

miacycle2mo ago

Assuming that what you're dealing with is assertable. I guess what I mean to say is that in some situations is difficult to articulate what is correct and what isn't depending in some situations is difficult to articulate what is correct and what isn't depending upon the situation in which the software executes.

DrewADesign2mo ago

And Bob’s your uncle.

drdaeman2mo ago

Can someone please explain... If I don't know any Lean (and I suspect most people don't), is it of any direct value? Trying to understand if there's something it can help me with (e.g. automatically write proofs for my Go programs somehow... I'm not sure) or should I just cheer solely for more open models out there, but this one isn't for me?

TimTheTinker2mo ago

Presumably the idea is that an agent generates a Lean4 specification against which the software is measured.

But then the Lean4 specification effectively becomes the software artifact.

And we're sort of back to square 1. How do you verify a Lean4 spec is correct (and that it describes what needs to be built in the first place) without human review?

naasking2mo ago

> And we're sort of back to square 1.

Specifications are smaller than the full code, just as high level code is smaller than the functionally equivalent assembly. As we ascend the abstraction ladder the amount of reading a human needs to do decreases. I don't think this should really count as "back to square 1".

TimTheTinker2mo ago

That has always been the perceived promise of higher-abstraction software specs: automated code generation from something higher-level, thus making programmers increasingly obsolete.

  binary => hexadecimal instructions
  hexadecimal => assembly language
  assembly => portable, "high-level" languages (C, FORTRAN, COBOL, etc.)
  HLLs => 3GLs (BASIC, C++, Pascal, Java, C#, JavaScript, etc.)
  3GLs => 4GLs/DSLs/RADs and "low-code/no-code"[0]

Among the RADs is Microsoft Visual Basic, which along with WinForms and SQL was supposed to make business programmers nearly obsolete, but instead became a new onramp into programming.

In particular, I'd like to highlight UML, which was supposed to mostly obsolete programming through auto-generated code from object-oriented class diagrams.[1] The promise was that "business domain experts" could model their domain via visual UML tooling, and the codegen would handle it from there. In practice, UML-built applications became maintenance nightmares.

In every one of these examples, the artifact that people made "instead of programming" became the de-facto programming language, needing to be maintained over time, abstracted, updated, consumed behind APIs, etc. -- and programmers had to be called in to manage the mess.

It's interesting that Spec4 can be auto-generated, then used to generate code. My question is - what do you do when you have (a) consumers depending on a stable API, and (b) requests for new features? Maybe hand the job to Claude Code or a human developer with a suite of unit tests to guarantee API compatibility, but at that point we're back to an agent (LLM or human) doing the work of programming, with the Spec4 code as the programming language being updated and maintained.

[0] https://en.wikipedia.org/wiki/Fourth-generation_programming_...

[1] https://news.ycombinator.com/item?id=26934795

esperent2mo ago

I absolutely called this a couple of weeks ago, nice to be vindicated!

> I'm interested to see what it is in the age of LLMs or similar future tools. I suspect a future phase change might be towards disregarding how easy it is for humans to work with the code and instead focus on provability, testing, perhaps combined with token efficiency.

> Maybe Lean combined with Rust shrunk down to something that is very compiler friendly. Imagine if you could specify what you need in high level language and instead of getting back "vibe code", you get back proven correct code, because that's the only kind of code that will successfully compile.

https://news.ycombinator.com/item?id=47192116

AlotOfReading2mo ago

It's important to keep in mind that no proof system ensures your proof is the correct proof, only that it's a valid proof. Completely understanding what a proof proves is often nearly as difficult as understanding the program it's proving. Normally you benefit because the process of building a proof forces you to develop your understanding more fully.

specvsimpl2mo ago

Uhm, no? Even with "simple" examples like Dijkstra's shortest path, the spec is easier than the implementation. Maybe not for you, but try it out on an arbitrary 5-yr old. On the extreme end, you have results in maths, like Fermat's Last Theorem. Every teenager can understand the statement (certainly after 10 mins of explanation) but the proof is thousands of pages of super-specialized maths. It is a spectrum. For cryptography, compression, error-correction, databases, etc, the spec is often much simpler than the implementation.

AlotOfReading2mo ago

I don't know why you created a new account for this, but take the textbook example of a nontrivial formally verified system: SeL4. That implementation was 8.7k of C code, which correspondend to 15k lines of Isabelle that ultimately needed 100k+ lines of proof to satisfy. And that was with the formal model excluding lots of important properties like hardware failure that actual systems deal with.

2 more replies

patall2mo ago

Maybe a naive question: given that they see better performance with more passes but the effect hits a limit after a few passes, would performance increase if they used different models per pass, i.e leanstral, kimi, qwen and leanstral again instead of 4x leanstral?

andai2mo ago

This is called a "LLM alloy", you can even do it in agentic, where you simply swap the model on each llm invocation.

It does actually significantly boost performance. There was an article on here about it recently, I'll see if I can find it.

Edit: https://news.ycombinator.com/item?id=44630724

They found the more different the models were (the less overlap in correctly solved problems), the more it boosted the score.

patall2mo ago

That sounds quite interesting. Makes me wonder if sooner or later they will have to train multiple independent models that cover those different niches. But maybe we will see that sooner or later. Thanks for the link.

cyanydeez2mo ago

One would think that LoRAs being so successful in StableDiffusion, that more people would be focused on constructing framework based LoRas; but the economics of all this probably preclude trying to go niche in any direction and just keep building the do-all models.

1 more reply

andai2mo ago

Mixture of Mixtures of Experts ;)

JoshTriplett2mo ago

Pleasant surprise: someone saying "open source" and actually meaning Open Source. It looks like the weights are Apache-2.0 licensed.

jasonjmcghee2mo ago

Based on community definitions I've seen, this is considered "open weights". If you can't reproduce the model, it's not "open source"

xpe2mo ago

Yes “open weights” conveys the reality more clearly: merely having the parameters is very different than able to run a process that creates them. Without openness of the full process start to finish, much is hidden.*

Remember, language is what we make it. Dictionaries are useful catalogs of usage but we make the judgment calls.

* Even with the process, much is not well understood! / The ethics of releasing an open weights model at some capability level is a separate discussion.

strujillo2mo ago

Formal verification and code synthesis feel like natural companions for automated scientific discovery. I’ve been working on a small (~800‑line) Python agent that uses sparse regression to uncover governing equations directly from data; it’s managed to validate twelve physical laws, including deriving the Sun’s rotation rate from NASA plasma measurements and correcting Gemini’s plasma conservation. Having an agent like Leanstral that can reason about proofs and specifications would be a powerful complement to data‑driven model discovery — it closes the loop between experimentation and provable correctness.

flakiness2mo ago

FYI The Lean 4 paper: https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37

agentultra2mo ago

Very cool but I haven’t been able to convince software developers in industry to write property based tests. I sometimes joke that we will start writing formal proofs until the tests improve. Just so that they will appreciate the difference a little more.

I can’t even convince most developers to use model checkers. Far more informal than a full proof in Lean. Still highly useful in many engineering tasks. People prefer boxes and arrows and waving their hands.

Anyway, I don’t know that I’d want to have a system vibe code a proof. These types of proofs, I suspect, aren’t going to be generated to be readable, elegant, and be well understood by people. Like programs they generate it will look plausible.

And besides, you will still need a human to review the proof and make sure it’s specifying the right things. This doesn’t solve that requirement.

Although I have thought that it would be useful to have a system that could prove trivial lemmas in the proof. That would be very neat.

rowanG0772mo ago

The point is you just need to scrutinize the theorem. Not easy either, but still significantly less work than writing the proof.

storus2mo ago

I just feel like Mistral is heading for bad financial times when they are focusing on fringe academic areas and not on building a business out of their research. Initial Mistral was largely based on LLaMA, then they added innovative MoE and since then disappeared, doing AI consulting for big EU companies instead.

igravious2mo ago

"and continues to scale linearly"

it clearly and demonstrably does not. in fact, from eyeballing their chart Qwen, Kimi, and GLM scale linearly whereas Leanstral does not. But this is not surprising because the Alibaba, Moonshot, and Zhipu have hundreds of employees each and hundreds of millions of dollars of investment each.

robertwer2mo ago

I’ve never worked with formal validation (barely remember my CS course). This release looks impressive. But I'm trying to wrap my head around the near-term practical applications for everyday software.

Right now, we see a lot of business experts in enterprises tempted to use AI to impl. business logic so they don't have to wait for (or pay) software experts. Would this kind of technology help these users any time soon?

My current theory is that the real breakthrough for these non-developers will only happen when they can actually verify the result themselves without needing an another expert in the loop. But I don't see that with formal validation anytime soon.

Do I overlook something?

Andrei_dev2mo ago

Yeah, this tracks. Developers who actually read what the AI spits out catch the obvious mistakes. The ones who just tab-complete their way through a whole project don't. And where it bites you isn't where you'd expect — logic bugs get caught fast. It's the boring security stuff. No input validation, CORS wide open, admin routes with no auth at all. Formal verification tells you whether a function matches its spec. The problem with AI-generated code goes a level below that. It's everything nobody bothered specifying — like "maybe don't hardcode your database credentials."

piyh2mo ago

Automated theorem provers running on a $5k piece of hardware is a cool version of the future

wazHFsRy2mo ago

Is anyone using this approach with lean to ship production code? Writing lean spec as human, implementation and proof by agent? And then shipping lean or exporting to C? Would be great to understand how you are actually using this.

toastal2mo ago

Naturally the Microsoft-owned language is getting the AI hype instead of the more mature options that could do this sort of work… Agda, ATS, Coq/Rocq, Dafny, Fstar, Idris, Isabelle, Why3 just to name a few.

Paracompact2mo ago

A bit uncharitable. I'm a diehard fan of Rocq, but it's nothing unusual to see the young new hotness that is Lean continue to get the spotlight. It's not a sign of Microsoft putting its thumb on the scales, and the hype for Lean has long predated LLMs.

It's certainly less mature when it comes to verified programming, but its appeal to mathematicians (rather than formal methods experts) has earned it much respect.

markusde2mo ago

You should check out the recent PR's to the Agda repo... the community is currently very divided about AI. For better or worse, the people driving the Lean project have been interested in AI for quite some time.

mrklol2mo ago

Am I missing something? Isn’t that the language most are using currently when looking at research at openai, google, deepseek etc?

Havoc2mo ago

What are these "passes" they reference here? Haven't seen that before in LLM evals

Could definitely be interesting for having another model run over the codebase when looking for improvements

rockinghigh2mo ago

It's the number of attempts at answering the question.

michaelgdwn2mo ago

The formal verification angle is what makes this interesting. Most coding agents optimize for "code that compiles and passes tests" — that's a low bar. Curious whether the proof artifacts are persisted for audit trails or thrown away after verification.

westurner2mo ago

From https://mistral.ai/news/leanstral :

  Model        Cost ($) Score
  ..
  Claude Opus     1,650 39.6
  ..
  Leanstral pass@8  145 31.0
  Leanstral pass@16 290 31.9

maelito2mo ago

I don't understand how this can impact my JS (+yaml, css, etc) code writing in a complex app.

lefrenchy2mo ago

Does Mistral come close to Opus 4.6 with any of their models?

chucky_z2mo ago

I use mistral-medium-3.1 for a lot of random daily tasks, along with the vibe cli. I'd state from my personal opinion that mistral is my preferred 'model vendor' by far at this point. They're extremely consistent between releases while each of them just feels better. I also have a strong personal preference to the output.

I actively use gemini-3.1-pro-preview, claude-4.6-opus-high, and gpt-5.3-codex as well. I prefer them all for different reasons, however I usually _start_ with mistral if it's an option.

sa-code2mo ago

Why not Large 3? It's larger and cheaper

tjwebbnorfolk2mo ago

Mistral hasn't been in the running for SOTA for quite awhile now

DarkNova62mo ago

Not at the moment, but a release of Mistral 4 seems close which likely bridges the gap.

re-thc2mo ago

Mistral Small 4 is already announced.

androiddrew2mo ago

MOE but 120B range. Man I wish it was an 80B. I have 2 GPUs with 62Gib of usable VRAM. A 4bit 80B gives me some context window, but 120B puts me into system RAM

1 more reply

elAhmo2mo ago

I don’t know a single person using Mistral models.

consumer4512mo ago

Isn't their latest speech to text model SOTA? When I tested it on jargon, it was amazing.

https://news.ycombinator.com/item?id=46886735

troyvit2mo ago

I'm using this model for my first python project, coding using opencode along with devstral and Mistral Large 3. I know it's not as capable as other, more expensive models, but working with it this way is teaching me python. More directly to your point though, the speech to text model is really good.

It's funny because I just took a break from it to read some hn and found this post.

Adrig2mo ago

I used Ministral for data cleaning.

I was surprised: even tho it was the cheapest option (against other small models from Anthropic) it performed the best in my benchmarks.

Bombthecat2mo ago

Mistral is super smart in smaller context and asking questions about it

badsectoracula2mo ago

Pretty much all of my LLM usage has been using Mistral's open source models running on my PC. I do not do full agentic coding as when i tried it with Devstral Small 2 it was a bit too slow (though if i could get 2-3 times the speed of my PC from a second computer it'd be be a different story and AFAIK that is doable if i was willing to spend $2-3k on it). However i've used Mistral's models for spelling and grammar checks[0], translations[1][2], summaries[3] and trying to figure out if common email SPAM avoidance tricks are pointless in the LLM age :-P [4]. FWIW that tool you can see in the shots is a Tcl/Tk script calling a llama.cpp-based command-line utility i threw together some time ago when experimenting with llama.cpp.

I've also used Devstral Small to make a simple raytracer[5][6] (it was made using the "classic" chat by copy/pasting code, not any agentic approach and i did fix bits of it in the process) and a quick-and-dirty "games database" in Python+Flask+Sqlite for my own use (mainly a game backlog DB :-P).

I also use it to make various small snippets, have it generate some boilerplate stuff (e.g. i have an enum in C and want to write a function that prints names for each enum value or have it match a string i read from a json file with the appropriate enum value), "translate" between languages (i had it recently convert some matrix code that i had written in Pascal into C), etc.

[0] https://i.imgur.com/f4OrNI5.png

[1] https://i.imgur.com/Zac3P4t.png

[2] https://i.imgur.com/jPYYKCd.png

[3] https://i.imgur.com/WZGfCdq.png

[4] https://i.imgur.com/ytYkyQW.png

[5] https://i.imgur.com/FevOm0o.png (screenshot)

[6] https://app.filen.io/#/d/e05ae468-6741-453c-a18d-e83dcc3de92... (C code)

[7] https://i.imgur.com/BzK8JtT.png

ainch2mo ago

That's likely because they're chasing enterprise - see deals with HSBC, ASML, AXA, BNP Paribas etc... Given swelling anti-US sentiment and their status as a French 'national champion', Mistral are probably in a strong position for now regardless of model performance, research quality or consumer uptake.

brainless2mo ago

I'm building a knowledge graph on personal data (emails, files) with Ministral 3:3b. I try with Qwen 3.5:4b as well but mostly Ministral.

Works really well. Extracts companies you have dealt with, people, topics, events, locations, financial transactions, bills, etc.

pelagicAustral2mo ago

Me neither, they're not ready for prime imo. I have a yearly sub and the product is just orders of magnitude behind Anthropic's offering. I use Code for real world stuff and I am happy with the result, Mistral is just not something I can trust right now.

Fnoord2mo ago

I use them solely.

kittikitti2mo ago

This is great, congratulations to the Mistral team! I'm looking forward to the code arena benchmark results. Thanks for sharing.

techcam2mo ago

The tricky part is that prompts can look “correct” but still behave unpredictably depending on phrasing.

blueTiger332mo ago

I read it as Lanestra, and thought of that story :D

jasonjmcghee2mo ago

Curious if pass@2 was tested for haiku and sonnet?

miacycle2mo ago

The TDD foundation! We might need one of those. :)

casey22mo ago

I get why the massive roll out is happening, but it's still possible that everything LLMs are good for can be done on the cheap. It's just another point for Google's multimodal focus.

I think in 10 years most providers will implode because they can't justify the debt for a cheap commodity product. While Google (and probably OpenAI) will have a huge moat due to users/multimodal/world models

xpe2mo ago

Public service announcement to hopefully reduce unnecessary knife fights*:

There are two compatible and important (but different) questions in play:

1. Is a program correct relative to a formal specification?

2. Is the formal specification what we mean/want?

*: Worth asking: “What that other person necessarily wrong? Or perhaps they are discussing a different aspect or framing?” AKA: “be curious and charitable” I’m not going to link to the specific threads, but they are happened / are happening. Le Sigh.

rafph2mo ago

This is a typical AI announcement. Putting FLTEval scores ahead of explanations, copying code from Rocq and basically not explaining at all what the setup does.

The average quality of an AI announcement is that of a Memecoin. Lots of graphs, meandering text and no substance.

bb-connor2mo ago

this is very exciting work

jiehong2mo ago

Congratulations on the launch!

Mistral seems to focus on a different market than the others. Their best model is meh, their best ASR model locally is either rather slow compared to Parakeet on similar languages, or not as good for others (like qwen ASR).

Side note: Lean seems quite unreadable with tons of single letter variable names. Part of it is me being unaccustomed with it, but still.

aimanbenbaha2mo ago

Mistral seems to focus on some niche LLM model tooling that are somehow very needed in certain cases. Can't forget their OCR multimodal embedding model!

ucsandman2mo ago

love the opensource push for agents, the fleet grows!

blurbleblurble2mo ago

Truly exciting

hnipps2mo ago

Here we go.

htrp2mo ago

is the haiku comparison because they've distilled from the model?

atmosx2mo ago

lol, why does the paper abstract assume I know what Lean is and it goes on to talk about lean 4 improvements?

cicko2mo ago

Why do you expect to understand an article you randomly read off the interwebs?

j / k navigate · click thread line to collapse

191 comments

cadamsdotcom2mo ago

It’s great to see this pattern of people realising that agents can specify the desired behavior then write code to conform to the specs.

As software gets more complex we need better solutions than “go ask Jim about that, bloke’s been in the code for years”.

bluGill2mo ago

> That’s because it encodes details, not intent.

necovek2mo ago

While you are right that you need to be encoding the right details, I disagree on the tests enforcing a design point.

Only the last category should stay mostly idempotent under refactoring, depending on the type of refactor you are doing.

Tests generally define behavior and implementation in a TDD approach: it'd be weird if they do not need changing at all when you are changing the implementation.

bluGill2mo ago

1 more reply

BowBun2mo ago

chriswarbo2mo ago

discreteevent2mo ago

The scientific approach is theory driven, not test driven. Understanding (and the power that gives us) is the goal.

2 more replies

applfanboysbgon2mo ago

cowboy_henk2mo ago

2 more replies

pydry2mo ago

This assumes that tests are realistic, which for the most part they are not.

cadamsdotcom2mo ago

Don’t like the layout? Let’s reroll! Back to the generative kitchen agent for a new one! ($$$)

The big labs will gladly let you reroll until you’re happy. But software - and kitchens - should not be generated in a casino.

Details matter.

vidarh2mo ago

1 more reply

refulgentis2mo ago

baq2mo ago

refulgentis2mo ago

Thanks for following up on this: I was really surprised by how much air this paeon to, idk, TDD, took out of the comments by getting off-topic.

Paracompact2mo ago

trenchgun2mo ago

>Such is very difficult at the moment

1 more reply

refulgentis2mo ago

But isn't that tantamount with "his comment is a complete non-sequitor"?

1 more reply

tonymet2mo ago

AI is the reality that TDD never before had the opportunity to live up to

nextos2mo ago

Not just TDD. Amazon, for instance, is heading towards something between TDD and lightweight formal methods.

They are embracing property-based specifications and testing à la Haskell's QuickCheck: https://kiro.dev

Then, already in formal methods territory, refinement types (e.g. Dafny, Liquid Haskell) are great and less complex than dependent types (e.g. Lean, Agda).

pydry2mo ago

The deluge of amazon bugs ive been seeing recently makes me hesitant to follow in amazon's lead.

prohobo2mo ago

What about model-driven development? Spec to code was the name of the game for UML.

1 more reply

oakpond2mo ago

It makes sense to me as long as you're not vibe coding the PBTs.

viking1232mo ago

Kiro is such garbage though

1 more reply

strujillo2mo ago

That matches what I’ve seen as well — generation is the easy part, validation is the bottleneck.

For example, it recovered the Sun’s rotation (~25.1 days vs 27 actual) from solar wind data, but most candidate equations were subtly wrong until you enforced consistency checks.

Feels like systems that treat verification as the source of truth (not just an afterthought) are the ones that will actually scale.

lsb2mo ago

The real world success they report reminds me of Simon Willison’s Red Green TDD: https://simonwillison.net/guides/agentic-engineering-pattern...

saberience2mo ago

That article is literally a definition of TDD that has been around for years and years. There's nothing novel there at all. It's literally test driven development.

jatins2mo ago

If Agent is writing the tests itself, does it offer better correctness guarantees than letting it write code and tests?

bluGill2mo ago

MillionOClock2mo ago

stingraycharles2mo ago

That definitely depends upon the situation. More often than not, properly testing a component takes me more time than writing it.

1 more reply

rvz2mo ago

Otherwise in some cases, you get this issue [0].

[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...

vlfig2mo ago

louiskottmann2mo ago

The linked article does not speak of tests, it speaks of a team that failed to properly review an LLM refactor then proceeds to blame the tooling.

LLMs are good at writing tests in my experience.

skanga2mo ago

TDD == Prompt Engineering, for Agentic coding tasks.

_boffin_2mo ago

rothific2mo ago

nicman232mo ago

they ll get there

kimsant2mo ago

AI agents will become a comodity.

Europeans not wanting to be dependent, and they are giving for free what US investors planed to charge with 90% margin.

Amazing! What a blast. Thank you for your service (this first 100M$ burned to POC GPT1 and from here, we are so good to go)

warpspin2mo ago

I really really want to support them, but it must make economic sense for my company, too, and it doesn't.

kimsant2mo ago

I don’t care about the servers, they are a comodity already.

The key is to avoid chantage, remember Oracle with DBs, people learned not to build on top of unreplaceable stuff

tin7in2mo ago

They are building their own infra - south of Paris and another one was announced in Sweden recently.

warpspin2mo ago

https://trust.mistral.ai/subprocessors

Or did you mean, they're like, right now building it and plan to move there, but it's not up yet?

1 more reply

bigfudge2mo ago

utopiah2mo ago

> There aren't good alternatives and what there is is too complex.

Sounds like a worth challenge for this community, mind giving actual examples and see what others can suggest?

coffeebeqn2mo ago

Vertical integration and breadth and depth of offerings on the cloud and customer lock-in from dominating it for 20 years

worldsayshi2mo ago

I wonder what the biggest (non-AI) moats are for US tech against the alternatives?

1 more reply

baq2mo ago

they will, but the jagged frontier is fractal and each one will have different capabilities; you'll want to mix models and to get best results consistently you'll need to.

jasonjmcghee2mo ago

Curious if anyone else had the same reaction as me

This model is specifically trained on this task and significantly[1] underperforms opus.

Opus costs about 6x more.

Which seems... totally worth it based on the task at hand.

[1]: based on the total spread of tested models

beernet2mo ago

isodev2mo ago

> quality usually reigns over control and trust

Remember, as of yet, companies haven't been able to actually measure the value of LLMs ... so it's all in the hands of Legal to choose which models you can use based on marketing and big words.

Eridrus2mo ago

This too will be solved. You can get tye frontier models from AWS/Google/Azure without needing to send your data to anyone else already.

hermanzegerman2mo ago

EU could help them very much if they would start enforcing the Laws, so that no US Company can process European data, due to the Americans not willing to budge on Cloud Act.

bcye2mo ago

This would be unfortunately a rather nuclear option due to the continent’s insane reliance on technology that breaks its unenforced laws.

1 more reply

segmondy2mo ago

KetoManx642mo ago

miohtama2mo ago

Alignment tax directly eats to model quality, double digit percents.

DarkNova62mo ago

I'm never sure how much faith one can put into such benchmarks but in any case the optics seem to shift once you have pass@2 and pass@3.

Still, the more interesting comparison would be against something such as Codex.

speedgoose2mo ago

But you can run this model for free on a common battery powered laptop sitting on your laps without cooking your legs.

hobofan2mo ago

Sorry, but what are you talking about? This is a 120B-A6B model, which isn't runnable on any laptop except the most beefed up Macbooks, and then will certainly drain its battery and cook your legs.

naasking2mo ago

You can easily run a quant of this on a DGX Spark though. Seems like a small investment if it meaningful improves Lean productivity.

1 more reply

speedgoose2mo ago

Yeah my bad, it requires an expensive MacBook.

I think it would still be fine for the legs and on battery for relatively short loads: https://www.notebookcheck.net/Apple-MacBook-Pro-M5-2025-revi...

But 40 degrees and 30W of heat is a bit more than comfortable if you run the agent continuously.

nimchimpsky2mo ago

the model is open source, you can run it locally. You don't think thats significant ?

andai2mo ago

Trustworthy vibe coding. Much better than the other kind!

On the promising side, Opus doesn't look great at this benchmark either — maybe we can get better than Opus results by scaling this up. I guess that's the takeaway here.

flowerbreeze2mo ago

ainch2mo ago

andai2mo ago

Oh my bad. I'm not sure how that works in practice. Do you just keep running it until the tests pass? I guess with formal verification you can run it as many times as you need, right?

teekert2mo ago

I also don't understand the focus on vibe coding in the marketing. Vibe coding kind of has the image of being for non-devs, right?

benterix2mo ago

No, I feel the same. I vibe-coded a few projects and after a few weeks I just threw them away, ultimately I felt I just wasted my time and wished I coudl get it back to do something useful.

DANmode2mo ago

> It feels like building a house, but without caring about it, and just using whatever tech.

So, most homebuilders (in the US) unfortunately.

teekert2mo ago

I myself am now and expert at insulation and all the vapor-permeable and vapor-blocking membranes/foils/foams that come with it.

andai2mo ago

Yeah, the original meaning of Vibe Coding was "not looking at the code, just going on vibes", but a lot of people now use it to mean "AI was involved in some way".

The latter I call Power Coding, in the sense of power armor, because you're still in control and mostly moving manually, but you're much stronger and faster.

So you stay active, and the code stays slop-free.

I don't hear a lot of people doing this though? Maybe we just don't have good language for it.

teekert2mo ago

"I don't hear a lot of people doing this though? Maybe we just don't have good language for it."

DrewADesign2mo ago

It’s really not hard — just explicitly ask for trustworthy outputs only in your prompt, and Bob’s your uncle.

miacycle2mo ago

DrewADesign2mo ago

And Bob’s your uncle.

drdaeman2mo ago

TimTheTinker2mo ago

Presumably the idea is that an agent generates a Lean4 specification against which the software is measured.

But then the Lean4 specification effectively becomes the software artifact.

And we're sort of back to square 1. How do you verify a Lean4 spec is correct (and that it describes what needs to be built in the first place) without human review?

naasking2mo ago

> And we're sort of back to square 1.

TimTheTinker2mo ago

That has always been the perceived promise of higher-abstraction software specs: automated code generation from something higher-level, thus making programmers increasingly obsolete.

  binary => hexadecimal instructions
  hexadecimal => assembly language
  assembly => portable, "high-level" languages (C, FORTRAN, COBOL, etc.)
  HLLs => 3GLs (BASIC, C++, Pascal, Java, C#, JavaScript, etc.)
  3GLs => 4GLs/DSLs/RADs and "low-code/no-code"[0]

Among the RADs is Microsoft Visual Basic, which along with WinForms and SQL was supposed to make business programmers nearly obsolete, but instead became a new onramp into programming.

[0] https://en.wikipedia.org/wiki/Fourth-generation_programming_...

[1] https://news.ycombinator.com/item?id=26934795

esperent2mo ago

I absolutely called this a couple of weeks ago, nice to be vindicated!

https://news.ycombinator.com/item?id=47192116

AlotOfReading2mo ago

specvsimpl2mo ago

AlotOfReading2mo ago

2 more replies

patall2mo ago

andai2mo ago

This is called a "LLM alloy", you can even do it in agentic, where you simply swap the model on each llm invocation.

It does actually significantly boost performance. There was an article on here about it recently, I'll see if I can find it.

Edit: https://news.ycombinator.com/item?id=44630724

They found the more different the models were (the less overlap in correctly solved problems), the more it boosted the score.

patall2mo ago

cyanydeez2mo ago

1 more reply

andai2mo ago

Mixture of Mixtures of Experts ;)

JoshTriplett2mo ago

Pleasant surprise: someone saying "open source" and actually meaning Open Source. It looks like the weights are Apache-2.0 licensed.

jasonjmcghee2mo ago

Based on community definitions I've seen, this is considered "open weights". If you can't reproduce the model, it's not "open source"

xpe2mo ago

Remember, language is what we make it. Dictionaries are useful catalogs of usage but we make the judgment calls.

* Even with the process, much is not well understood! / The ethics of releasing an open weights model at some capability level is a separate discussion.

strujillo2mo ago

flakiness2mo ago

FYI The Lean 4 paper: https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37

agentultra2mo ago

And besides, you will still need a human to review the proof and make sure it’s specifying the right things. This doesn’t solve that requirement.

Although I have thought that it would be useful to have a system that could prove trivial lemmas in the proof. That would be very neat.

rowanG0772mo ago

The point is you just need to scrutinize the theorem. Not easy either, but still significantly less work than writing the proof.

storus2mo ago

igravious2mo ago

"and continues to scale linearly"

robertwer2mo ago

Do I overlook something?

Andrei_dev2mo ago

piyh2mo ago

Automated theorem provers running on a $5k piece of hardware is a cool version of the future

wazHFsRy2mo ago

toastal2mo ago

Paracompact2mo ago

It's certainly less mature when it comes to verified programming, but its appeal to mathematicians (rather than formal methods experts) has earned it much respect.

markusde2mo ago

mrklol2mo ago

Am I missing something? Isn’t that the language most are using currently when looking at research at openai, google, deepseek etc?

Havoc2mo ago

What are these "passes" they reference here? Haven't seen that before in LLM evals

Could definitely be interesting for having another model run over the codebase when looking for improvements

rockinghigh2mo ago

It's the number of attempts at answering the question.

michaelgdwn2mo ago

westurner2mo ago

From https://mistral.ai/news/leanstral :

  Model        Cost ($) Score
  ..
  Claude Opus     1,650 39.6
  ..
  Leanstral pass@8  145 31.0
  Leanstral pass@16 290 31.9

maelito2mo ago

I don't understand how this can impact my JS (+yaml, css, etc) code writing in a complex app.

lefrenchy2mo ago

Does Mistral come close to Opus 4.6 with any of their models?

chucky_z2mo ago

I actively use gemini-3.1-pro-preview, claude-4.6-opus-high, and gpt-5.3-codex as well. I prefer them all for different reasons, however I usually _start_ with mistral if it's an option.

sa-code2mo ago

Why not Large 3? It's larger and cheaper

tjwebbnorfolk2mo ago

Mistral hasn't been in the running for SOTA for quite awhile now

DarkNova62mo ago

Not at the moment, but a release of Mistral 4 seems close which likely bridges the gap.

re-thc2mo ago

Mistral Small 4 is already announced.

androiddrew2mo ago

MOE but 120B range. Man I wish it was an 80B. I have 2 GPUs with 62Gib of usable VRAM. A 4bit 80B gives me some context window, but 120B puts me into system RAM

1 more reply

elAhmo2mo ago

I don’t know a single person using Mistral models.

consumer4512mo ago

Isn't their latest speech to text model SOTA? When I tested it on jargon, it was amazing.

https://news.ycombinator.com/item?id=46886735

troyvit2mo ago

It's funny because I just took a break from it to read some hn and found this post.

Adrig2mo ago

I used Ministral for data cleaning.

I was surprised: even tho it was the cheapest option (against other small models from Anthropic) it performed the best in my benchmarks.

Bombthecat2mo ago

Mistral is super smart in smaller context and asking questions about it

badsectoracula2mo ago

[0] https://i.imgur.com/f4OrNI5.png

[1] https://i.imgur.com/Zac3P4t.png

[2] https://i.imgur.com/jPYYKCd.png

[3] https://i.imgur.com/WZGfCdq.png

[4] https://i.imgur.com/ytYkyQW.png

[5] https://i.imgur.com/FevOm0o.png (screenshot)

[6] https://app.filen.io/#/d/e05ae468-6741-453c-a18d-e83dcc3de92... (C code)

[7] https://i.imgur.com/BzK8JtT.png

ainch2mo ago

brainless2mo ago

I'm building a knowledge graph on personal data (emails, files) with Ministral 3:3b. I try with Qwen 3.5:4b as well but mostly Ministral.

Works really well. Extracts companies you have dealt with, people, topics, events, locations, financial transactions, bills, etc.

pelagicAustral2mo ago

Fnoord2mo ago

I use them solely.

kittikitti2mo ago

This is great, congratulations to the Mistral team! I'm looking forward to the code arena benchmark results. Thanks for sharing.

techcam2mo ago

The tricky part is that prompts can look “correct” but still behave unpredictably depending on phrasing.

blueTiger332mo ago

I read it as Lanestra, and thought of that story :D

jasonjmcghee2mo ago

Curious if pass@2 was tested for haiku and sonnet?

miacycle2mo ago

The TDD foundation! We might need one of those. :)

casey22mo ago

I get why the massive roll out is happening, but it's still possible that everything LLMs are good for can be done on the cheap. It's just another point for Google's multimodal focus.

xpe2mo ago

Public service announcement to hopefully reduce unnecessary knife fights*:

There are two compatible and important (but different) questions in play:

1. Is a program correct relative to a formal specification?

2. Is the formal specification what we mean/want?

rafph2mo ago

This is a typical AI announcement. Putting FLTEval scores ahead of explanations, copying code from Rocq and basically not explaining at all what the setup does.

The average quality of an AI announcement is that of a Memecoin. Lots of graphs, meandering text and no substance.

bb-connor2mo ago

this is very exciting work

jiehong2mo ago

Congratulations on the launch!

Side note: Lean seems quite unreadable with tons of single letter variable names. Part of it is me being unaccustomed with it, but still.

aimanbenbaha2mo ago

Mistral seems to focus on some niche LLM model tooling that are somehow very needed in certain cases. Can't forget their OCR multimodal embedding model!

ucsandman2mo ago

love the opensource push for agents, the fleet grows!

blurbleblurble2mo ago

Truly exciting

hnipps2mo ago

Here we go.

htrp2mo ago

is the haiku comparison because they've distilled from the model?

atmosx2mo ago

lol, why does the paper abstract assume I know what Lean is and it goes on to talk about lean 4 improvements?

cicko2mo ago

Why do you expect to understand an article you randomly read off the interwebs?

j / k navigate · click thread line to collapse