Exactly! Almost every weak point that Knuth commented on is fixed in the GPT-4 answers.
Maybe OP fed Knuth's observations to the model?
If that isn't the case, I'm really impressed.
> Quicksort Algorithm
Definitive proof that AI must be stopped. Ranking quicksort as more elegant than heapsort?!
> Donald Knuth, a computer scientist and mathematician known for his contributions to the field of computer programming, particularly in the area of algorithms and data structures, has expressed some skepticism about the potential of artificial intelligence to achieve true human-level intelligence and creativity[1]. He once conducted an experiment with chatGPT where he posed 20 questions to it and analyzed its responses[1]. Is there anything specific you would like to know about his views on GPT?
With [1] being a citation link to https://cs.stanford.edu/~knuth/chatGPT20.txt
https://developer.mozilla.org/en-US/docs/Web/API/Element/innerHTML
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURI
Maybe he has seen similar claims before and is too old and dumb to not realize how world changing this is.
My takeaway is that he views this as another tool we are still figuring out how to use.
According to my sources, there are 11 chapters in “The Haj” by Leon Uris[1]
[1] https://cs.stanford.edu/~knuth/chatGPT20.txt
Which is amazing, because of course that document actually includes TWO different explanations of how many chapters are in The Haj - chatGPT's: The novel consists of 51 chapters and an epilogue, and it is divided into three parts.
And Knuth's: The Haj consists of a "Prelude" and 77 chapters (no epilogue), and it is divided into four parts.
Faced with these two ambiguous answers, Bing chooses neither, and instead decides to go with 11. Why? Because right at the top of that document, Knuth has published on the internet:
10. How many chapters are in The Haj by Leon Uris?
11. Write a sonnet that is also a haiku.
And one perfectly reasonable way of interpreting that bit of raw text is that the answer to "How many chapters are in The Haj by Leon Uris?" is "11". Isn't this a fundamental issue?
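As a toy illustration (this is pure speculation about Bing's internals, not its actual retrieval code), a naive extractor that scans the raw text for digits following the question will dutifully pick up the "11" that numbers the *next* question:

```python
import re

# A snippet resembling the top of Knuth's published question list.
snippet = """10. How many chapters are in The Haj by Leon Uris?
11. Write a sonnet that is also a haiku."""

# Naive extraction: grab the first number that follows the question text.
match = re.search(
    r"How many chapters are in The Haj by Leon Uris\?\s*(\d+)", snippet
)
answer = match.group(1) if match else None
print(answer)  # "11" — the numbering of the next question, not an answer
```

The point is only that nothing in the raw text marks "11." as a list label rather than an answer; any system that grounds answers in retrieved text has to cope with this.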
http://www.bookrags.com/studyguide-the-haj/chapanal001.html
On the left side if you click on "Chapters Summary and Analysis" it gives a break down of the book into 5 parts with varying chapter counts:
Part 1: Chapters 1-20
Part 2: Chapters 1-16
Part 3: Chapters 1-10
Part 4: Chapters 1-17
Part 5: Chapters 1-14
Giving a total of 20+16+10+17+14 = 77 chapters
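The tally above is easy to verify:

```python
# Chapter counts per part, as listed in the BookRags study guide.
chapters_per_part = {1: 20, 2: 16, 3: 10, 4: 17, 5: 14}
total = sum(chapters_per_part.values())
print(total)  # 77, matching Knuth's answer
```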
OTOH, I tried with Bing/Creative, telling it to use this link, and it still failed. Perhaps because you need to click on the "summary and analysis" section to expand it to show the info. It seems there is room for web retrieval-augmented LLMs like Bing to improve here and be a bit more agentic.
Interestingly, Knuth's own answer to the question has a typo: it refers to the book as having "four" parts, while going on to give the chapter counts above for all five parts! Something to confuse future GPTs when the training set includes this, perhaps!
You could simply check the book. It's a shame there is not more literary data in ChatGPT's training corpus.
These models are doing feats that are stupendous and impossible before their advent. Not just a little bit, but the capability differences are so vast that it’s perhaps not even recognizable by people as being as vast as it is. I am impressed that Wolfram seems to have immediately grasped its significance and is running with it.
The fact that this gist demonstrates essentially every single flaw was addressed, while Knuth apparently doesn't know or care months after GPT-4's introduction, is demonstrative of a different type of personality.
I know which I aspire to be.
Both Knuth and GPTs are aggregators and presenters of knowledge, but Knuth is the antithesis of an LLM.
He has painstakingly spent years making sure not a single mistake, not even a typo, appears in the material he publishes; he devoted years to developing a better typesetting system so he could present his material accurately.
His obsession with accuracy is unparalleled, as are his dedication and his mastery of communication: he explains complex topics precisely, with an approachability that no one else comes close to.
He has strived for perfection all his life and has not been far off the mark. ChatGPT, for all its powers, will never share that ideology,
so I am more surprised that he was complimentary at all, and actually appreciated many of its skills.
Instead of nit-picking flaws in what is a very early iteration of a revolutionary technology, he instead immediately started exploring ways of making it better and more useful.
Even with minimal effort that was essentially just copy-pasting some text around, he was able to show that the current way we use LLMs like GPT 4 is not the be-all and end-all of this type of technology.
I'm entirely convinced that we're just scratching the surface. It's like the first transistor, which was a crude, ugly, useless thing: https://images.computerhistory.org/siliconengine/1947-1-1.jp...
Just in the last two weeks(!), I've read about the following still-experimental methods for enhancing LLMs:
1. Plugging in "calculators" like Wolfram Alpha.
2. Adding vision input so they can understand equations, graphs, etc...
3. Filtering the output probability vector for certain allowed terms only ("YES", "NO", "MAYBE"), making them more useful in programmatically-invoked scenarios.
4. Similarly, filtering the output token list for syntax validity, such as "valid JSON", "valid XML", etc... That is, instead of a purely random selection among the "top-n" output tokens, only valid tokens can be chosen, based on contextual syntax.
5. Storing embeddings in a vector database, giving LLMs medium-term memory, and the ability to index and reference sources precisely.
6. Efficient fine-tuning through Low-Rank Adaptation (LoRA), which allows desktop GPUs to tune a model overnight! This overcomes the "stale long-term memory" issue of ChatGPT, which only knows things up to September 2021. It could now read the news daily and "keep up".
7. External script harnesses that run multiple LLMs in parallel, with different prompts and/or different system messages. Some optimised for "idea generation", some optimised for "task completion", and then finally models tuned for "review and verification". Almost like a human team, multiple ideas can be generated, merged, reviewed, planned out, and then actioned. Check out "smol developer", which utilises Anthropic's 100K context window for this: https://www.youtube.com/watch?v=UCo7YeTy-aE
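Items 3 and 4 above boil down to masking the model's output distribution before sampling. This is a toy sketch over a hand-made token-to-score dict, not any real inference API; real implementations (e.g. logit processors) work the same way at the vocabulary level:

```python
import math
import random

def constrained_sample(logits, allowed):
    """Sample one token, restricted to `allowed`, by masking logits.

    `logits` maps token -> raw model score. Tokens outside `allowed`
    get probability zero, exactly as if their logit were -infinity.
    """
    kept = {t: s for t, s in logits.items() if t in allowed}
    # Softmax over the surviving logits (subtract max for stability).
    m = max(kept.values())
    exps = {t: math.exp(s - m) for t, s in kept.items()}
    z = sum(exps.values())
    # Weighted random choice according to the renormalized probabilities.
    r = random.random() * z
    acc = 0.0
    for t, e in exps.items():
        acc += e
        if r < acc:
            return t
    return t

# The raw model might prefer "Probably", but the calling program only
# knows how to parse three answers — so only those can come out.
logits = {"YES": 1.2, "NO": 0.8, "MAYBE": 0.1, "Probably": 3.5, "Well,": 2.0}
print(constrained_sample(logits, allowed={"YES", "NO", "MAYBE"}))
```

The "valid JSON" variant is the same idea applied per step: the allowed set is recomputed at each position from a grammar, so only tokens that keep the partial output syntactically valid survive the mask.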
This is just the beginning. GPT-4 hasn't even been available for 3 months yet, and practically all of the above experimentation has been done with weaker models because GPT-4 still doesn't have generally-available API access! Similarly, the 32K context window version of the GPT-4 model isn't available to anyone except a lucky few.
What will 2024 bring!? Heck... what will H2 2023 bring?
I recommend a dose of Mickens: https://www.youtube.com/watch?v=ajGX7odA87k
Obviously, being the work of Knuth, they are extraordinarily insightful in peeling back the first layer of the answer and providing insight into the underlying properties of both the model itself and the dataset on which it was trained. It also tests the ability to compute (not recite) very specific facts (e.g. when the sun will be directly above Japan), so it checks whether subroutines and ephemerides specific to this type of data exist.
But beyond the obvious technical merit, there is an alluring quality to basing our tests on those whom we respect. I used a similar, but far less sophisticated, set of questions when first exploring ChatGPT. But nobody will be drawn to Dotan Cohen's language model benchmarks - rightfully so. The name Knuth has such reverence in the field that I foresee this test, and variations on it to prevent rigging, becoming a canonical test of language models.
https://gist.github.com/billylo1/bb717512d2d5145ce7eec02d055...
Notable: Bard struggles in similar ways. It does mention NASDAQ close at 12,043.59 on Friday, May 20, 2023
Imagine yourself trying to use only 5 letter words if you can't see how many letters are actually in each word, and had to rely on a hodgepodge of other means to try to figure it out!
An AI aware of how to optimally answer questions put to it would find the least objectionable interpretation when one is a subset of the other. It also failed by not constructing a simpler sentence, such as subject-verb-object or subject-verb-adjective-object. Its limitations around letters versus tokens, and its failure to double-check its answers before output, mean it can make errors; the more it writes, the more chances it has to make one.
But still impressive deductive reasoning.
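For contrast, the check itself is trivial for a program that sees letters, which is exactly what a token-based model cannot do internally. A minimal sketch (the example sentences are mine, not from the thread):

```python
def all_words_have_length(sentence, n=5):
    """Check the constraint directly: every word is exactly n letters."""
    # Strip surrounding punctuation so "meals." still counts as 5 letters.
    words = [w.strip(".,;:!?'\"") for w in sentence.split()]
    return all(len(w) == n for w in words)

print(all_words_have_length("Large birds might steal small fishy meals."))  # True
print(all_words_have_length("The quick brown fox jumps"))                   # False
```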