* Clear explanations of concepts that respond to questions and get reformulated when something doesn't land
* Step-by-step verification of solutions, spotting exactly where calculations have gone wrong
* Instantaneously generating new problem sets to reinforce concepts
LLMs are probably not going to live up to all sorts of claims their proponents make. But I don't think anyone who has actually tried using an LLM in a math course can reach the conclusion that it's "demoware" for that application. At what point, over 6 months of continuous work, does it stop being a "demo"?
In case I've spooked anyone: they have an adult course series (Foundations I, II, and III) that's accelerated by trimming out the material their authors believe is important only for things like school placement exams; the modal adult Math Academy person is doing I, II, and III as a lead-up to their Math for Machine Learning course, which is linear algebra and multivariable calc.
I think it's one of the three most mind-blowing learning resources I've ever used. One of the other two: Lingua Latina: Familia Romana. In both cases, I have the uncanny certainty that I am operating at the limit of my ability to acquire and retain new information, which is a fun place to be.
Basically all of the cognitive science literature on learning that I am aware of says that the more you do directly and the less hand-holding you are given, the better your acquisition and long-term retention. In particular, having the LLM elaborate concepts for you is probably one of the worst things you can do when it comes to learning. Struggling through that elaboration process yourself is going to make the learning stick much more strongly, at least if the research is to be believed.
But for math tutoring? If you claim LLM math tutoring is demoware, you're very clearly telling on yourself.
And then I realized[0].
[0] https://ludic.mataroa.blog/blog/contra-ptaceks-terrible-arti...
For the record, I'm a systems programmer and a security person and I don't work for an AI company (you can Six Degrees of Sam Altman any startup to AI now if you want to make the claim, but if you try I'm just going to say "Sir, This Is A Wendy's".)
This piece feels like an “I tried it out however I could” piece vs. an “I spent time learning how others are learning math with LLMs” piece.
LLMs will make meaningful advances in personalized learning.
Some of the frameworks might evolve along the way.
Cooking: does the food taste better as you learn more?
Programming: are you able to build functioning software that does what you want it to do, better than you could earlier on in your path?
Fixing a broken dishwasher: does the dishwasher work again now?
The idea that learning only works if you have an expert on hand to verify that you are learning is one of those things that seems obviously true until you think harder about it.
And keep in mind, what it's getting right is trickier than just answering Calc I questions: it's taking an answer I give it, calculating the correct answer itself, selecting its answer over mine, and then spotting where I e.g. forgot to check the domain of a variable inside a log.
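To make that concrete, here's the kind of slip it catches (an illustrative example of mine, not a logged exchange):

```latex
\log_2 x + \log_2 (x - 2) = 3
\;\Longrightarrow\; x(x - 2) = 8
\;\Longrightarrow\; (x - 4)(x + 2) = 0
\;\Longrightarrow\; x \in \{4,\, -2\}.
% Domain check: both logarithms require x > 2, so x = -2 is extraneous
% and the only valid solution is x = 4.
```

Forgetting that final domain check yields a plausible-looking but wrong answer, which is exactly the class of error being described.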
Yes, the numeric examples often don't work. The consequences, though, are similar to a failed web search: it's not a big deal, and when it does work it's very helpful.
Maths is one of those things with so much objectivity that even the LLM usually realizes it has failed to create a numeric example. "Here the numeric example breaks down since we cannot find a congruence of squares in this example without finding more B-smooth numbers in step 1." OK, that's a shame; I would have loved to see an end-to-end numeric example.
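For the curious, here's roughly what such an end-to-end example looks like at toy scale. This is a sketch of my own, not anything from the thread: it brute-forces subsets of B-smooth relations, where a real Dixon or quadratic-sieve implementation would solve a linear system over GF(2), and the modulus, smoothness bound, and relation count are arbitrary toy choices.

```python
from itertools import combinations
from math import gcd, isqrt

def is_b_smooth(n: int, b: int) -> bool:
    """Trial-divide n by everything up to b; True if nothing is left over."""
    for p in range(2, b + 1):
        while n % p == 0:
            n //= p
    return n == 1

def toy_dixon(n: int, b: int = 7, want: int = 8):
    """Collect relations x^2 ≡ r (mod n) with r B-smooth, then brute-force
    a subset whose product of residues is a perfect square (a congruence
    of squares). May return None if no subset yields a nontrivial factor."""
    relations = []
    x = isqrt(n) + 1
    while len(relations) < want and x < n:
        r = (x * x) % n
        if r and is_b_smooth(r, b):
            relations.append((x, r))
        x += 1
    for k in range(1, len(relations) + 1):
        for subset in combinations(relations, k):
            prod_x, prod_r = 1, 1
            for xi, ri in subset:
                prod_x = (prod_x * xi) % n
                prod_r *= ri
            s = isqrt(prod_r)
            if s * s == prod_r:                  # congruence of squares found
                g = gcd((prod_x - s) % n, n)
                if 1 < g < n:
                    return g
    return None

print(toy_dixon(84923))  # prints a nontrivial factor: 84923 = 163 * 521
```

Running it factors 84923 in well under a second; at real cryptographic sizes both the relation hunt and the dependency search become the entire problem, which is what the LLM's example kept stumbling over.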
I think people get too hung up on any possibility of LLMs not being perfect while still being extremely helpful.
Meanwhile, there are math education resources like IXL that maybe cost a little money but whose lessons and practice problems are fully curated by human experts (AFAICT). I'm not saying these resources are perfect either, but as a mathematician who has experimented a lot with LLMs, including in supposed tutoring modes, they make a lot of mistakes and take a lot of shortcuts that should materially decrease their effectiveness as tutors.
[1] LLM-based tutoring (edit: footnote added to clarify)
I can't think of a single instance where o4 or GPT-5 got one of these problems wrong. They see maybe 6-12 of them per day from me. I've been doing this since February.
That appears to be their whole thing, and they've been in business for longer than LLMs have been around.
If you're working on educational math problems with solutions you can validate against the solutions. If you're working with proofs you can evaluate the proofs in a proof checker. Or you can run the resulting math expressions through a calculator.
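As a sketch of that last option (SymPy standing in as the "calculator"; the integral is my own example, not one from the thread), you can differentiate an LLM-claimed antiderivative and compare symbolically rather than trusting the model's algebra:

```python
import sympy as sp

x = sp.symbols('x')

# Suppose an LLM claims the antiderivative of 2x/(x^2 - 4) is ln(x^2 - 4).
claimed = sp.log(x**2 - 4)
integrand = 2 * x / (x**2 - 4)

# Differentiate the claim and compare with the integrand; a zero residual
# means the claim checks out (up to the constant of integration).
residual = sp.simplify(sp.diff(claimed, x) - integrand)
print("claim verified" if residual == 0 else f"mismatch: {residual}")
```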
Understanding whether the student has actually learned is a competency piece; in math it's mostly "show your work" and/or "did you get the right answer."
The continued top-down attempts to boil the whole sea with LLMs are part of the current problem.
It’s getting pretty good though for focused tutoring.
For students, models set up to tutor are too often trying to boil a sea (all of education) instead of a kiddie pool. The reality is that, increasingly, it seems K-6 if not K-12 students can be supported.
If we look at the EdTech space from the bottom up, namely learner-centric, there is both a real need and opportunity.
For school-age students, math largely has not changed in hundreds of years, and doesn't change often. Either you understand it or you have to put in the work.
There’s no shortage of human-created written teaching resources. A teacher could create their own tutoring assistant based on their explanations.
Alternatively, an open-source textbook could be fed in. There’s a reason why training or fine-tuning on books has caused lawsuits: it can increase accuracy manyfold.
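As a sketch of that grounding idea (naive keyword overlap standing in for real embedding search; the toy "textbook", function names, and chunking are all mine, not any product's method):

```python
# Minimal retrieval-grounded tutoring sketch: rank textbook chunks by
# word overlap with the question and prepend the best ones to the prompt.
def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks sharing the most words with the question."""
    q_words = set(question.lower().split())
    return sorted(chunks, key=lambda c: -len(q_words & set(c.lower().split())))[:k]

textbook = [
    "A fraction a/b is in lowest terms when gcd(a, b) = 1.",
    "To add fractions, rewrite them over a common denominator first.",
    "A prime number has exactly two divisors: 1 and itself.",
]

question = "How do I add two fractions?"
context = "\n".join(retrieve(question, textbook))
# The assembled prompt keeps the model anchored to the teacher's material.
print(f"Using only this textbook excerpt:\n{context}\n\nAnswer: {question}")
```

Real systems would use embeddings and a vector index, but the accuracy gain comes from the same place: the model answers from the teacher's text instead of from its weights alone.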
Teachers are burdened with repetitive marking; there’s definitely a place for personalized marking tools.
We know LLMs respond differently to different input. Their superpower is being able to regenerate an input in many different ways, which can include personalization.
Just because one has experimented with LLMs and come up short doesn’t mean there’s no way to get a good result from them; we may simply not have figured out how yet.
If examples of the chat logs or prompts that did or didn’t work can be provided, it helps us have a conversation without the subjectivity.
Mathematics is a great lens to see that folks are trying to get non-deterministic software to behave like all the deterministic software we’ve had before, instead of finding the places where non-deterministic strengths can shine.
It’s not all or nothing, or one or the other.
LLMs getting it wrong is terrible when it matters, but I also don't think it's a huge problem when it comes to acting as an additional learning resource. Here the parent is using a lesson plan that costs money and using the LLM for a little more explanation. It's similar to using web search on a topic: sometimes you get a hit, sometimes you don't.
Asking LLMs for numeric examples of complex maths sometimes fails. It's easy to spot and no great loss. When it works, though, it's extremely helpful to follow through.
There are probably smart ways to incorporate LLM output into an application like the one you're lauding but your comment is a little like responding "but my cake tastes good" to someone who says you shouldn't eat raw flour.
The fact that it can generate human language that is very compelling in certain contexts makes it seem capable of doing so in many, many more contexts.
I like the term. I have been using a similar phrase "looks good in a snippet" when referring to certain styles of programming.
One such instance was when Node.js was becoming popular and everyone was showing how easy concurrent programming can be with a few callbacks in a snippet. However, building a large codebase that way would eventually turn into a nightmare.
Another example is databases which don't fsync after writes by default. They look great in benchmarks (webscale even!), then in production suddenly some of the data goes missing. But at least those initial benchmark demos were impressive.
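The fsync point fits in a few lines (a generic sketch of the durability step, not any particular database's code):

```python
import os

# A "successful" write may still live only in the OS page cache; a crash
# before writeback loses it. Durability takes an explicit flush + fsync,
# which is the step benchmark-friendly defaults sometimes skip.
with open("journal.log", "a") as f:
    f.write("record\n")
    f.flush()              # push Python's userspace buffer into the kernel
    os.fsync(f.fileno())   # force the kernel to write to stable storage
```

Skipping that last line is exactly what makes the benchmarks look webscale.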
For one thing, last year's LLMs were nowhere near winning gold on collegiate math and programming competitions. That's because the "reasoning" thing hadn't kicked off yet - the first model to demonstrate that trick was o1 in ... OK that was September 12th 2024 so it just makes it to a year old now.
The key difference is that I can context switch. Once the AI has context and is doing its thing, I can move on to another task that's not working in the same area or project. I can post on HN. I can catch up on my Slack inbounds, or my email.
Having two tasks running at once nets a small but nice improvement in velocity. Having an AI task running while I'm doing other things effectively doubles my output.
The one I use creates the migrations, locally, for free and deterministically in about 30 seconds.
I think a lot of people are massively underestimating how much knowledge and skill is needed in software engineering beyond typing code into a text editor.
I think even if we ever reach actual AGI (in the far far future), we'll still want low level meatbags around to blame :-P
Edit: I mean their outputs are procedurally generated, like in https://en.m.wikipedia.org/wiki/Demoscene
This article seems to be baitware pushing an outdated perspective. LLMs have only gotten more powerful over the last three years (able to do more things), and so far not much has stopped them from becoming even more powerful (with the help of reasoning, other external methods, etc.).
"daily use" is so subjective and this article will be out dated soon as we get closer to an AGI (with LLMs as the base layer and not the main driver)
(I'm not denying the possibility. I'm proclaiming a lack of evidence.)
The only times I've personally seen LLMs engaged in repos have been handling issues, and they made an astounding mess that hurt far more often than it helped for anything beyond automatically tagging issues. And I don't see any LLMs allowed off the leash to make commits. Not in anything with actual downstream users.
GitHub Copilot: 247,000 https://github.com/search?q=is%3Apr+author%3Acopilot-swe-age... - is:pr author:copilot-swe-agent[bot]
Claude: 147,000 https://github.com/search?q=is%3Apr+in%3Abody+%28%22Generate... - is:pr in:body ("Generated with Claude Code" OR "Co-Authored-By: Claude" OR "Co-authored-by: Claude")
OpenAI Codex: ~2,000,000 (an over-estimate; there's no obvious author reference here, so this is just title or body containing "codex"): https://github.com/search?q=is%3Apr+%28in%3Abody+OR+in%3Atit... - is:pr (in:body OR in:title) codex
Suggestions for improvements to this methodology are welcome!
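One scriptable variant (a sketch using GitHub's REST search endpoint, which matches PRs too; unauthenticated calls are tightly rate-limited, and the web UI's OR syntax may not carry over to the API, so only the simple author query is shown):

```python
import json
import urllib.parse
import urllib.request

# Count PRs authored by the Copilot agent, mirroring the first search above.
query = "is:pr author:copilot-swe-agent[bot]"
url = "https://api.github.com/search/issues?q=" + urllib.parse.quote(query)
req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["total_count"])
```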
The thing that this comment misses, imo, is that LLMs are not always enabling people who previously couldn't create value to create value. In fact I think they are likely to cause some people who created value previously to create even less of it!
However that's not mutually exclusive with enabling others to create more value than they did previously. Is it a net gain for society? Currently I'd bet not, by a large margin. However is it a net gain for some individual users of LLMs? I suspect yes.
LLMs are a powerful tool for the right job, and as time goes on the "right job" keeps expanding to more territory. The problem is it's a tool that takes a keen eye to analyze and train on. It's not easy to use for reliable output. It's currently a multiplier for those willing to use it on the right jobs and with the right training (reviews, suspicion, etc).
I am not a researcher, but I am a tech lead and I've seen it work again and again: IDEs work, and LLMs work.
They are force multipliers, though; they absolutely work best with people who already know a bit of software engineering.
I think that highly productive people who have incorporated LLMs into their workflows are enjoying a productivity multiplier.
I don’t think it’s 2x, but it’s greater than 1x if I had to guess. It’s just one of those things that’s impossible to measure beyond a reasonable doubt.
One of my favorite uses: i have configured my window manager (Window Maker) so that when i press Win+/ it launches xterm with a script built around a custom C++ utility based on llama.cpp. The script uses xclip to grab whatever i have selected, combines it with a prompt asking a quantized version of Mistral Small 3.2 to suggest fixes for grammar and spelling mistakes in the text, and then filters the program's output through another utility that colorizes it using some simple regexes. Whenever i write any text i care about having (more) correct grammar and spelling (e.g. documentation - i do not use it for informal text like this one or in chat), i use it to find mistakes, since English is not my first language (and it tends to find a lot of them). Since the output is shown in a separate window (xterm) instead of replacing the text, i can check whether the correction is fine (and the act of actually typing the correction helps me remember some stuff... in theory at least :-P). [0] shows an example of how it looks.
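For anyone wanting to replicate something similar, here is a rough Python approximation of that pipeline (the original is a custom C++ llama.cpp utility; this sketch assumes a local llama-server on port 8080 and uses its /completion endpoint, and the prompt wording and port are my guesses):

```python
import json
import subprocess
import urllib.request

# Grab the current X11 primary selection, as xclip does in the setup above.
selected = subprocess.run(
    ["xclip", "-o", "-selection", "primary"],
    capture_output=True, text=True, check=True,
).stdout

prompt = ("List the grammar and spelling mistakes in the following text, "
          "with corrections:\n\n" + selected)

# Send it to a local llama.cpp server (llama-server) for completion.
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps({"prompt": prompt, "n_predict": 512}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])
```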
I also wrote a simple Tcl/Tk script that calls some of the above with more generalized queries, one of which translates text to English; i mainly use it to translate comments on Steam games[1] :-P. It is also helpful whenever i want to try something out quickly. For example, recently i thought that common email obfuscation techniques in text (like "some AT example DOT com") are pointless nowadays with LLMs, so i tried examples from a site i found online[2] (pretty much everything that didn't rely on JavaScript was defeated by Mistral Small).
As for programming, i used Devstral Small 1.0 once to make a simple raytracer, though i wrote about half of the code by hand since it was making a bunch of mistakes[3]. Also, recently i needed to scrape some data from a page; normally i'd do it by hand, but i was feeling bored at the time, so i asked Devstral to write a Python script using Beautiful Soup to do it for me, and it worked just fine.
None of the above are things i'd value at billions, though. But at the same time, i wouldn't have any other solution for the grammar and translation stuff (at least not one that's free and under my control).
[0] https://i.imgur.com/f4OrNI5.png
[1] https://i.imgur.com/jPYYKCd.png
In particular, "producing stuff" is not necessarily "creating value"; some stuff has _negative_ value.
Lots of vibes and feelings, but zero measurable impact.
Rumors say that Google wasn't far behind at the time, but didn't push releases. Perhaps because they were not that impressed by the applications or did not want "AI" to cannibalize their other products.
So it seems very likely that everything has been squeezed out of the decades of research and we have plateaued.
Desperate measures like Nvidia buying its own graphics cards through circular investment schemes do not inspire confidence either. Nor does Microsoft now doing Copilot product placement ads on teen-oriented YouTube channels. When Google launched, people just used it because it was good. This all fits very well with the demoware angle of the article.