When checking this model, I found out [1] that it's apparently based on Llama 2?
```
Llama 3 70B Instruct AWQ — Parameters and Internals
LLM Name:      Llama 3 70B Instruct AWQ
Base Model(s): Llama 2 70B Instruct (quantumaikr/llama-2-70B-instruct)
Model Size:    70b
```
I added a question [2] on Hugging Face to learn more about this.
Could anyone explain what this means? Does it mean it was trained on version 2 and wrongly named version 3? Or is something ill-intended going on?
[1] https://llm.extractum.io/model/casperhansen%2Fllama-3-70b-in...
[2] https://huggingface.co/casperhansen/llama-3-70b-instruct-awq...
Go look at the model config, you can clearly see it's Llama 3.
And more seriously, it seems like the LLM could be used to pre-generate lots of filler prefixes corresponding to the RAG'd documents being sent to the model.
While it wouldn't help if you're GPU-bound, multiple prompts could be run in parallel with different pieces of context, and then the model could choose the most appropriate response (which could be done in parallel too).
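A minimal sketch of that parallel-context idea, fanning the same question out over several retrieved chunks and picking the best answer. Note that `generate` and `score` here are placeholders I've made up for illustration, not a real inference API — in practice `generate` would be an HTTP call to your inference server and scoring might use a reranker or the model's own log-probs:

```python
# Run the same question against several retrieved chunks concurrently,
# then pick the highest-scoring answer.
from concurrent.futures import ThreadPoolExecutor

def generate(question: str, context: str) -> str:
    # Placeholder for an actual LLM completion call.
    return f"Answer to {question!r} using context: {context}"

def score(answer: str) -> float:
    # Placeholder ranking; a real system might use the model itself,
    # a dedicated reranker, or log-prob based scoring.
    return len(answer)

def best_answer(question: str, contexts: list[str]) -> str:
    # Each context gets its own prompt, run in parallel.
    with ThreadPoolExecutor(max_workers=len(contexts)) as pool:
        answers = list(pool.map(lambda c: generate(question, c), contexts))
    # The "choose the most appropriate response" step from the comment.
    return max(answers, key=score)
```

With GPU-backed serving you'd typically get the parallelism for free via batched inference rather than client-side threads, which is why being GPU-bound changes the calculus.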
Skipping the LLM would be tough because there are so many devices in my house, not to mention it would take away from the assistant's personality.
However, a recommendation algorithm would actually work great, since I could augment the LLM prompt with it regardless of what the prompt is.
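A rough sketch of what that augmentation could look like: prepend the recommender's top device guesses to the prompt so the LLM can resolve vague requests. The frequency-based recommender and the `recent_usage` input are assumptions of mine, not anything from the comment:

```python
# Augment an LLM prompt with recommender output, independent of what
# the user actually asked.
from collections import Counter

def recommend_devices(recent_usage: list[str], top_n: int = 3) -> list[str]:
    # Trivial frequency-based recommender; a real one might weigh
    # time of day, room, or device co-occurrence.
    return [d for d, _ in Counter(recent_usage).most_common(top_n)]

def build_prompt(user_request: str, recent_usage: list[str]) -> str:
    hints = ", ".join(recommend_devices(recent_usage))
    return (f"Likely relevant devices: {hints}.\n"
            f"User request: {user_request}")
```

The point is that the recommendation rides along with every prompt, so the model keeps its personality while getting a strong hint about which of the many devices the user probably means.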
https://news.ycombinator.com/item?id=38985152 (187 comments, 2024-01-13)
I hate the introduction to the response. That's not even trying to sound human; it reads more like a deranged, patronizing butler.