Fireworks has Llama 3 for the same effective speed with much more realistic rate limits (and billing)
1. Filtering by model should be enabled by default. Mixtral-8x7b-instruct on Perplexity is almost as fast as Llama 2 7B on Fireworks, but the two models are quite different in size.
2. Pricing is a very important factor that is not included.
3. Overall service reliability should also be an important signal.
We also have pricing, long/medium/short prompt lengths (decode time can vary between providers), and parallel query benchmarking, plus model details (context window, etc.)
In classification for example, you could ask Llama 8B to reason through each possibility, rank them, rate them, make counterarguments, etc. - all in the same time that GPT-4 would take to output one classification without reasoning. Which does better?
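A minimal sketch of that idea (the prompt wording, labels, and function names are illustrative, not from the comment above):

```python
# Sketch: "reason-then-classify" prompting for a small, fast model,
# versus a direct one-word classification for a slower model.
# The label set and prompt text here are hypothetical examples.

def build_reasoning_prompt(text: str, labels: list[str]) -> str:
    """Ask the model to argue for and against each label before choosing."""
    label_list = ", ".join(labels)
    return (
        f"Classify the text into one of: {label_list}.\n"
        "For each label, give one argument for it and one counterargument, "
        "then rate its fit from 1-10.\n"
        "Finally, answer with the single best label on the last line.\n\n"
        f"Text: {text}"
    )

def build_direct_prompt(text: str, labels: list[str]) -> str:
    """Baseline: ask for the label alone, with no reasoning."""
    return (
        f"Classify the text into one of: {', '.join(labels)}. "
        "Answer with the label only.\n\n"
        f"Text: {text}"
    )
```

The reasoning variant generates far more tokens, but on a model that decodes several times faster, both prompts can finish in roughly the same wall-clock time; whether the extra reasoning actually improves accuracy is the open question.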
But there was something it did way better than GPT-4. I asked it to create 10 phrases where the last word was an animal, excluding equines, in alphabetical order. GPT-3.5 and GPT-4 aren't able to follow such instructions, but the 8B model did it with mastery.