Show HN: Predictive text using only 13kb of JavaScript. no LLM (opens in new tab)

(adamgrant.info)

207 pointsadamkochanowicz2y ago39 comments

39 comments

Pretty cool, a js implementation of n-grams!

What is amazing to me is this: imagine that English only had 10,000 words. For each of those 10,000 words there’s 100 valid subsequent words. So there’s 1 million valid bigrams. Now if you want trigrams that takes you to 100 million, and for 4-grams it’ll be 10 billion. Just for that, you’d need 14 bytes per word and gigabytes of storage.

LLMs typically have context windows in the hundreds if not thousands. (Back in my day GPT2 had a context window of 1024 and we called that an LLM. And we liked it.) So it’s kind of amazing that a model that can fit on a flash drive can make reasonable next token prediction on the whole internet and with a context size that can fit a whole book.

10000truths2y ago

That's because a vast majority of the set of grammatically correct n-grams are not actually useful in practical language. "Colorless green ideas sleep furiously" is valid English, but it is semantically nonsense.

cchance2y ago

I mean ... technically Gemini 1.5 has 10 million context which just pushes the insanity further lol

throwaway4aday2y ago

Nice work! I built something similar years ago and I did compile the probabilities based on a corpus of text (public domain books) in an attempt to produce writing in the style of various authors. The results were actually quite similar to the output of nanoGPT[0]. It was very unoptimized and everything was kept in memory. I also knew nothing about embeddings at the time and only a little about NLP techniques that would certainly have helped. Using a graph database would have probably been better than the datastructure I came up with at the time. You should look into stuff like Datalog, Tries[1], and N-Triples[2] for more inspiration.

Your idea of splitting the probabilities based on whether you're starting the sentence or finishing it is interesting but you might be able to benefit from an approach that creates a "window" of text you can use for lookup, using an LCS[3] algorithm could do that. There's probably a lot of optimization you could do based on the probabilities of different sequences, I think this was the fundamental thing I was exploring in my project.

Seeing this has inspired me further to consider working on that project again at some point.

[0] https://github.com/karpathy/nanoGPT

[1] https://en.wikipedia.org/wiki/Trie

[2] https://en.wikipedia.org/wiki/N-Triples

[3] https://en.wikipedia.org/wiki/Longest_common_subsequence

sodimel2y ago

This is cool. I don't want a 1GB+ download and an entire LLM running on my machine (or worse, on someone else machine) to find words.

What I really want is some simple predictive text engine to make writing in English easier (because its not my first language), helping me find more complex words that I don't use because I don't know them well enough.

mromanuk2y ago

I've developed a neural network model for text correction, prediction, and autocompletion, specifically optimized for one of my iOS keyboard apps. The model is compact, at only 32MB, allowing for swift loading times. It predicts the next word based on the last four to five words, demonstrating robustness against errors, typos, and making minor grammar corrections.

The goal was to mimic the functionality of the stock iOS keyboard. This approach also opens up the possibility of training (or fine-tuning) the model with specialized vocabularies for various professional fields, such as law, medicine, and engineering, to predict, correct, and autocomplete domain-specific terms. I'm exploring whether this addresses a genuine need or is a solution in search of a problem.

Additionally, I'm considering integrating a comprehensive grammar correction feature directly into the keyboard. While there are existing apps with this capability, they rely on online APIs. My aim is to offer a self-contained, local model as an alternative.

efilife2y ago

Cool. But do you intend on sharing it? Or you just wanted to tell people that you made it?

pseudo_meta2y ago

Seeing that you use Obsidian: Have you thought about turning that into an Obsidian plugin that offers the predictions as editor suggestions?

elektor2y ago

This exists on a word level.

Completr is a plugin that provides auto-completion functionality for obsidian.

https://github.com/tth05/obsidian-completr

pseudo_meta2y ago

That does change completions from a list of words, but not completions based on predictions of the next word.

hidelooktropic2y ago

Interesting application

adzm2y ago

So this is basically a Markov chain right?

hidelooktropic2y ago

I don't think so. It's basically just traversing nested arrays and picking a random member on each level.

dartos2y ago

Yep. Sounds like a Markov chain to me.

fruktmix2y ago

Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

2 more replies

underscoring2y ago

Nice work!

> If it can do this in 13kb, it makes me wonder what it could do with more bytes.

Maybe I misunderstand, but is this not just the first baby steps of an LLM written in JS? "what it could do with more bytes" is surely "GPT2 in javascript"?

refulgentis2y ago

Author had LLM help them make a tree of words, and the algo choose which node we're at and offers children as completions. It's clever and cute but, not even close to an LLM.

cchance2y ago

I mean it's not far off from a super low quant LLM with limited params, like a 1bit quant LLM with low params XD

refulgentis2y ago

It's very far off, like "not even wrong" in the Pauli sense of the phrase.

There's a lot of abstractions one can have for this stuff, I think you're looking at that "text predictor" is one of them?

If you roll with that, then you're in a position where you're saying GPT-2 class LLMs were very close in 1960, because at the end of the day, it's just a dictionary lookup with a string key and a value of list<string> completions. That confuses instead of illuminates.

Legend24402y ago

The trouble with decision trees for language modeling is that they overfit really hard. They don't do the magical generalization that makes LLMs interesting.

zerojames2y ago

Another day, another story on HN that will have me down a rabbit hole (yesterday's was a three-hour tangent into ternary bit compression after the Microsoft paper) :D

Your project is delightful -- thank you for sharing. I have explored this realm a bit before [0] [1], but in Python. The tool I made was for personal use, but streaming every keystroke through a network connection added a lot of unnecessary latency.

I used word surprisals (entropy) to calculate the most likely candidates, and gave a boost to words from my own writing (thus, the predictive engine was "fine-tuned" on my writing). The result is a dictionary of words with their probabilities of use. This can be applied to bigrams, too. Your project has me thinking: how could that be pruned, massively, to create the smallest possible structure. Your engine feels like the answer.

My use case is technical writing: you know what you want to say, including long words you have to repeat over again, but you want a quicker way of typing.

[0]: https://jamesg.blog/2023/12/15/auto-write/

[1]: https://github.com/capjamesg/autowrite

amne2y ago

"What if we go to a place where there is no way"

Barely got 1 or 2 suggestions while typing that. Am I holding it wrong?

neom2y ago

I tried for "once in a while we all get sad" and it got exactly zero.

SamBam2y ago

What's the data corpus? I'm very surprised that the words "Be," "Of," and "And" are among the 25 most common first words of sentences.

hidelooktropic2y ago

I think the LLM that helped coauthor the tree hallucinated these suggestions. Probably better to actually train it on some text.

tagueis2y ago

What tool are you using to view dictionary.js in your screenshot?

schnuri2y ago

Cool thing!

But: The demo is not usable for me on my mobile device, the keyboard is missing a tab key.

luke-stanley2y ago

It's a small language model.

dariosalvi782y ago

An SLM?

valval2y ago

What's the point? I tested it out for around 10 seconds and noticed that it's completely useless.

awayto2y ago

For those moments when you just don't know what (5 phrases) to say?

qingcharles2y ago

I'm not sure why you're getting downvoted, the actual results were pretty subpar in my experience.

The idea is good, though.

k__2y ago

What does that mean no LLM?

Markov chains have been a thing for a long time and predictive text was used before LLMs.

hidelooktropic2y ago

It means it doesn't use a large language model. I don't think this is a claim to have been the first to do it.

j / k navigate · click thread line to collapse

39 comments

janalsncm2y ago

Pretty cool, a js implementation of n-grams!

10000truths2y ago

cchance2y ago

I mean ... technically Gemini 1.5 has 10 million context which just pushes the insanity further lol

throwaway4aday2y ago

Seeing this has inspired me further to consider working on that project again at some point.

[0] https://github.com/karpathy/nanoGPT

[1] https://en.wikipedia.org/wiki/Trie

[2] https://en.wikipedia.org/wiki/N-Triples

[3] https://en.wikipedia.org/wiki/Longest_common_subsequence

sodimel2y ago

This is cool. I don't want a 1GB+ download and an entire LLM running on my machine (or worse, on someone else machine) to find words.

mromanuk2y ago

efilife2y ago

Cool. But do you intend on sharing it? Or you just wanted to tell people that you made it?

pseudo_meta2y ago

Seeing that you use Obsidian: Have you thought about turning that into an Obsidian plugin that offers the predictions as editor suggestions?

elektor2y ago

This exists on a word level.

Completr is a plugin that provides auto-completion functionality for obsidian.

https://github.com/tth05/obsidian-completr

pseudo_meta2y ago

That does change completions from a list of words, but not completions based on predictions of the next word.

hidelooktropic2y ago

Interesting application

adzm2y ago

So this is basically a Markov chain right?

hidelooktropic2y ago

I don't think so. It's basically just traversing nested arrays and picking a random member on each level.

dartos2y ago

Yep. Sounds like a Markov chain to me.

fruktmix2y ago

Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

2 more replies

underscoring2y ago

Nice work!

> If it can do this in 13kb, it makes me wonder what it could do with more bytes.

Maybe I misunderstand, but is this not just the first baby steps of an LLM written in JS? "what it could do with more bytes" is surely "GPT2 in javascript"?

refulgentis2y ago

Author had LLM help them make a tree of words, and the algo choose which node we're at and offers children as completions. It's clever and cute but, not even close to an LLM.

cchance2y ago

I mean it's not far off from a super low quant LLM with limited params, like a 1bit quant LLM with low params XD

refulgentis2y ago

It's very far off, like "not even wrong" in the Pauli sense of the phrase.

There's a lot of abstractions one can have for this stuff, I think you're looking at that "text predictor" is one of them?

Legend24402y ago

The trouble with decision trees for language modeling is that they overfit really hard. They don't do the magical generalization that makes LLMs interesting.

zerojames2y ago

Another day, another story on HN that will have me down a rabbit hole (yesterday's was a three-hour tangent into ternary bit compression after the Microsoft paper) :D

My use case is technical writing: you know what you want to say, including long words you have to repeat over again, but you want a quicker way of typing.

[0]: https://jamesg.blog/2023/12/15/auto-write/

[1]: https://github.com/capjamesg/autowrite

amne2y ago

"What if we go to a place where there is no way"

Barely got 1 or 2 suggestions while typing that. Am I holding it wrong?

neom2y ago

I tried for "once in a while we all get sad" and it got exactly zero.

SamBam2y ago

What's the data corpus? I'm very surprised that the words "Be," "Of," and "And" are among the 25 most common first words of sentences.

hidelooktropic2y ago

I think the LLM that helped coauthor the tree hallucinated these suggestions. Probably better to actually train it on some text.

tagueis2y ago

What tool are you using to view dictionary.js in your screenshot?

schnuri2y ago

Cool thing!

But: The demo is not usable for me on my mobile device, the keyboard is missing a tab key.

luke-stanley2y ago

It's a small language model.

dariosalvi782y ago

An SLM?

valval2y ago

What's the point? I tested it out for around 10 seconds and noticed that it's completely useless.

awayto2y ago

For those moments when you just don't know what (5 phrases) to say?

qingcharles2y ago

I'm not sure why you're getting downvoted, the actual results were pretty subpar in my experience.

The idea is good, though.

k__2y ago

What does that mean no LLM?

Markov chains have been a thing for a long time and predictive text was used before LLMs.

hidelooktropic2y ago

It means it doesn't use a large language model. I don't think this is a claim to have been the first to do it.

j / k navigate · click thread line to collapse