The thing is, you can be economical. You don't need a GPT-4-quality model for everything. Some tasks are just low value, and a 3.5 model would do just fine for them.
I never use the $20 plan but I access everything via API and I spend a couple of dollars per month.
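To make the pay-per-token arithmetic concrete, here is a rough sketch. The usage numbers and both per-million-token prices are made-up placeholders, not real price-list figures, so plug in your own.

```python
def monthly_cost_usd(requests_per_day, tokens_per_request,
                     usd_per_million_tokens, days=30):
    """Estimate a monthly API bill from average usage.

    All inputs are assumptions you supply; nothing here is a real price.
    """
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder prices (hypothetical): a cheap tier and a premium tier.
CHEAP_USD_PER_M = 1.0
PREMIUM_USD_PER_M = 30.0

cheap = monthly_cost_usd(20, 1500, CHEAP_USD_PER_M)
premium = monthly_cost_usd(20, 1500, PREMIUM_USD_PER_M)
print(f"cheap tier:   ${cheap:.2f}/month")
print(f"premium tier: ${premium:.2f}/month")
```

The point is only that the bill scales with tokens times price, so sending low-value requests to a cheaper model is where the savings come from.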
Although lately I have a home server that can run Llama 3.1 8B uncensored, and that actually works amazingly well.
Yeah it's good, right?? Amazingly good. The first-gen small models were a bit iffy but Llama 3.1 is so good <3
The only thing I see is that it hallucinates a lot when you ask it for knowledge. Which makes sense because 8B parameters is just not a lot to keep detailed information around. But the ability to recite training knowledge is really a misuse of LLMs and only a peculiar side-effect. I combine it with Google searches (through OpenWebUI and SearXNG) and it works amazingly well then.
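For anyone curious what "combine it with search" looks like underneath, here is a minimal sketch of the idea. It is not how OpenWebUI implements it. It assumes a SearXNG instance at a hypothetical local URL with the JSON output format enabled in its settings, and that each result carries `title`, `url` and `content` fields.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical address of a local SearXNG instance (an assumption).
SEARXNG_URL = "http://localhost:8080/search"


def search(query, max_results=3):
    """Fetch the top results from SearXNG as a list of dicts."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    with urllib.request.urlopen(f"{SEARXNG_URL}?{params}", timeout=10) as resp:
        data = json.load(resp)
    return data.get("results", [])[:max_results]


def build_prompt(question, results):
    """Put the search snippets in front of the question so the model
    answers from the sources rather than from memory."""
    lines = ["Answer using only the sources below. "
             "Say so if they are not enough.", ""]
    for i, r in enumerate(results, start=1):
        lines.append(f"[{i}] {r.get('title', '')} ({r.get('url', '')})")
        lines.append(r.get("content", ""))
        lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

You would then send `build_prompt(question, search(question))` to the local model. The small model only has to read and summarise, which is what it is good at.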
Yeah, and realistically, once we can get hardware powerful but cheap/energy efficient enough to run LLM + TTS + ASR without any noticeable delay during a conversation, then who needs cloud services for most stuff? The really big models will still be useful, but really only for specific things.
An old Ryzen CPU (2600 IIRC) and Radeon Pro VII 16GB. Got it new at a really good price.
It works OK, but with a large context it can still run out of memory and it also gets a lot slower. With a small context it's super snappy and surprisingly good. What it is bad at are facts/knowledge, but this is not something an LLM is meant to do anyway. OpenWebUI has really good search engine integration, which makes it work like Perplexity does. That's a better option for knowledge use cases.
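A back-of-the-envelope sketch of why a large context eats memory: the key/value cache grows linearly with the number of tokens. The defaults below are the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128), and the 2 bytes per value assumes an unquantized fp16 cache. Real runtimes may quantize the cache or add overhead, so treat this as an estimate only.

```python
def kv_cache_bytes(context_tokens, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV cache size for a given context length.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per KV head. bytes_per_value=2 assumes fp16.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token


for ctx in (2048, 8192, 32768, 131072):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens: {gib:.2f} GiB")
```

Under these assumptions the cache alone would fill a 16GB card at the full 128k context, before counting the weights, which fits the out-of-memory behaviour with large contexts.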