I think we will probably get (non-physical) AGI when the models can solve these as well. The implications of AGI might be much bigger than the loss of knowledge worker jobs.
Remember what happened to the chimps when a smarter-than-chimpanzee species multiplied and dominated the world.
That said, GPT has no model of the world. It has no concept of how true the text it is generating is. Its going to be hard for me to think of that as AGI.
I don't think this is necessarily true. Here is an example where researchers trained a transformer to generate legal sequences of moves in the board game Othello. Then they demonstrated that the internal state of the model did, in fact, have a representation of the board.
If an LLM can solve Codeforces problems as well as a strong competitor—-in my hypothetical future LLM—-what else can it not do as well as competent humans (aside from physical tasks)?
and I'm not so sure it has no model of the world. a textual model, sure, but considering it can recognize what svgs are pictures of from the coordinates alone, that's not much of a limitation maybe.
competing with them at what, precisely?
We're still a very very long way from machines being more generally capable and efficient than biological systems, so even an oppressive AI will want to keep us around as a partner for tasks that aren't well suited to machines. Since people work better and are less destructive when they aren't angry and oppressed, the machine will almost certainly be smart enough to veil its oppression, and not squeeze too hard. Ironically, an "oppressive" AI might actually treat people better than Republican politicians.
Language models that utilise beam search can calculate integrals ('Deep learning for symbolic mathematics', Lample, Charton, 2019, https://openreview.net/forum?id=S1eZYeHFDS), but without it it doesn't work.
However, beam search makes bad language models. I got linked this paper ('Locally typical sampling' https://arxiv.org/pdf/2202.00666.pdf) when I asked some people why beam search only works for the kind of stuff above. I haven't fully digested it though.
A blank test scores 37.5
The best score 60 is 5 correct answers + 20 blank answers; or 6 correct, 4 correct random guesses, and 15 incorrect random guesses. (20% chance of correct guess)
The 5 easiest questions are relatively simple calculations, once the parsing task is achieved.
(Example: https://artofproblemsolving.com/wiki/index.php/2022_AMC_12A_... ) so the main factor in that score is how good GPT is at refusing to answer a question, or doing a bit better to overcome the guessing penalty.
> It's AMC 10 score being dramatically lower is pretty bad though...
All versions (scoring 30, 36) It scored worse than leaving the test blank.
The only explanation I can imagine for that is that it can't understand diagrams.
It's also unclear if the AMC performance is based on Englush or the computer-encoded version from this benchmark set: https://arxiv.org/pdf/2109.00110.pdf https://openai.com/research/formal-math
AMC/AIME and even to some extent USAMO/IMO problems are hard for humans because they are time-limited and closed-book. But they aren't conceptually hard -- they are solved by applying a subset of known set of theorems a few times to the input data.
The hard part of math, for humans, is ingesting data into their brains, retaining it, and searching it. Humans are bad a memorizing large databases of symbolic data, but that's trivial for a large computer system.
An AI system has a comprehensive library, and high-speech search algorithms.
Can someone who pays $20/month please post some sample AMC10/AMC12 Q&A?
The Revenge of the Call Centre