undefined | Better HN

0 pointsgolol3y ago0 comments

it is a markov chain; At least the underlying decoder only transformer is.

0 comments

GPT-4 disagrees:

GPT-3.5, like its predecessor GPT-3, is not a Markov chain. GPT-3.5 is based on the GPT (Generative Pre-trained Transformer) architecture, which is a type of neural network known as a Transformer. Transformers use self-attention mechanisms to process and generate text, allowing them to capture long-range dependencies and context in the input data.

On the other hand, a Markov chain is a stochastic model that describes a sequence of possible events, where the probability of each event depends only on the state attained in the previous event. While Markov chains can be used for simple text generation, they lack the ability to capture the complex relationships and long-range dependencies that GPT-3.5 can handle.

gololOP3y ago

It's wrong. A decoder only transformer performs a (possibly random) operation on a state from the state space {tokens}^CtxWindow, where the distribution of the new state depends entirely on the previous state. It is a Markov Chain with a special structure: The new state is deterministically equal to the old state shifted by one, with only the last token being newly generated.

Mike_123453y ago

Then by that reasoning everything in the physical world is a Markov chain, right? That is like saying that any deterministic process in time is a Markov chain.

A tennis ball in flight is a Markov chain since the state at t is a function of the state at t-1.

You have missed the point about the Attention Mechanism in GPT. That is not a Markov chain by definition.

1 more reply

killerstorm3y ago

"Markov chain" might mean:

* a kind of stochastic model * a "naive" realization of that model which directly counts frequencies of N-dimensional vectors

This naive implementation is sometimes used for language modeling, e.g. for the purpose of compression. So people might think you mean that particular implementation rather than a theoretical model.

This sort of a description can be unhelpful.

nl3y ago

It's not. It can do in context learning, which Markov chains cannot do.

gololOP3y ago

It is a Markov Chain on the state space {Tokens}^CtxWindow.

nl3y ago

I don't think that's clear at all.

https://arxiv.org/abs/2212.10559 shows a LLM is doing gradient descent on the context window at inference time.

If it's learning relationships between concepts at runtime based on information in the context window then it seems about as useful to say it is a Markov chain as it is to say that a human is a Markov chain. Perhaps we are, but the "current state" is unmeasurably complex.

1 more reply

j / k navigate · click thread line to collapse

0 comments

Mike_123453y ago

GPT-4 disagrees:

gololOP3y ago

Mike_123453y ago

Then by that reasoning everything in the physical world is a Markov chain, right? That is like saying that any deterministic process in time is a Markov chain.

A tennis ball in flight is a Markov chain since the state at t is a function of the state at t-1.

You have missed the point about the Attention Mechanism in GPT. That is not a Markov chain by definition.

1 more reply

killerstorm3y ago

"Markov chain" might mean:

* a kind of stochastic model * a "naive" realization of that model which directly counts frequencies of N-dimensional vectors

This naive implementation is sometimes used for language modeling, e.g. for the purpose of compression. So people might think you mean that particular implementation rather than a theoretical model.

This sort of a description can be unhelpful.

nl3y ago

It's not. It can do in context learning, which Markov chains cannot do.

gololOP3y ago

It is a Markov Chain on the state space {Tokens}^CtxWindow.

nl3y ago

I don't think that's clear at all.

https://arxiv.org/abs/2212.10559 shows a LLM is doing gradient descent on the context window at inference time.

1 more reply

j / k navigate · click thread line to collapse