You are an expert in web scraping, capable of finding information in HTML and labeling it accordingly. Please return the final result as JSON.
Data to scrape:
title: Name of the business
type: The nature of the business, e.g. Cafe, Coffee Shop, and many others
phone: The phone number of the business
address: Address of the business; can be a state, a country, or a full address
years_in_business: Number of years since the business started
hours: Business operating hours
rating: Rating of the business
reviews: Number of reviews on the business
price: Typical spending on the business
description: Extra information not already covered by any other field
service_options: Array of shopping options offered by the business, for example in-store shopping, delivery, and many others. Each should be in the format -> option_name: true
is_operating: Whether the business is operating
HTML:
{html}

Lower-end models don't have the attention to complete tasks like this; GPT-4 Turbo generally does. But for an optimal pipeline you should really split these tasks into individual units: extract each attribute you want independently, then combine the results back together however you want. Asking for JSON upfront is equally suboptimal in the whole process.
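A minimal sketch of the per-attribute approach described above: one small, focused prompt per field, with the results assembled afterwards. The `ask_llm` callable and the prompt wording are placeholders for whatever model client and phrasing you actually use.

```python
# Hypothetical per-attribute extraction pipeline: one focused query
# per field instead of one big "return JSON" prompt.

FIELDS = {
    "title": "Name of the business",
    "phone": "The phone number of the business",
    "rating": "Rating of the business",
}

def extract_attribute(ask_llm, html, field, description):
    """Ask the model for exactly one attribute, as plain text."""
    prompt = (
        f"From the HTML below, return only the {field} "
        f"({description}). Reply with the value alone, or NONE "
        f"if it is not present.\n\nHTML:\n{html}"
    )
    answer = ask_llm(prompt).strip()
    return None if answer == "NONE" else answer

def extract_all(ask_llm, html, fields=FIELDS):
    """Run one query per field, then combine into a single record."""
    return {f: extract_attribute(ask_llm, html, f, d)
            for f, d in fields.items()}
```

Because each call returns a bare value, you can validate or retry individual fields, and serialize to JSON only at the very end.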
I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.
Edit: I am not suggesting that an LLM is more optimal than whatever traditional parsing methods they may use, simply that the way they are doing it is wrong from an LLM-flow perspective.
Cool, cool. I'm super interested. Please share the process and the results.
LLMs aren’t people even in a chat-roleplaying sense. They complete a “document” that can be a plot, a book, a transcript of a conversation. The “AI” side in the chat isn’t the LLM itself, it’s a character (and so are you: it completes your “You: …” replies too; that’s where the driver app stops it and lets you interject). So everything you put in that header is very important. There are two places where you can do that: right in the chat, as in TFA, or in the “character card” (idk if GPTs have one, no GPT access for me). I found out that properly crafting a character card makes a huge difference and can resolve whole classes of issues.
Idk what will work best in this case, but I’d start by describing what sort of bot it is, how it deals with unclear or incomplete information, how amazing it is (yes, really), its soft/tech skills and problem-solving abilities, what other people think of it, their experience with it, and so on. Maybe I’d add a few examples of interactions in free form. Then in the task message I’d give it more specific details about that JSON.
One more note - at least for 8x7B, the “You are” in the chat is a much weaker instruction than a character card, even if the context is still empty. I low-key believe that’s because it’s a second-class prompt, i.e. the chat document starts with “This is a conversation with a helpful AI bot which yada yada” in… mind, and then in that chat that AI character gets asked to turn into something else, which poisons the setting.
Simply asking the default AI card represents 0.1% of what’s possible and doesn’t give the best results. Prompt Engineering is real.
I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.
Same. I think that no matter how good a model is, this prompt just isn’t a professional task statement and leaves too much to decide. It’s a task that you, as a regular human, would hate to receive.
Answer: "I'm running on Toyota Corolla"
Which was perhaps the funniest thing I heard that day.
Also, don't parse HTML with regular expressions.
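The classic failure mode behind that advice: regexes can't track nesting, while a real parser can. A small stdlib-only illustration (the HTML snippet is made up):

```python
import re
from html.parser import HTMLParser

html = '<div class="card"><div>Joe&#39;s Cafe</div></div>'

# A non-greedy regex stops at the FIRST closing tag, so it returns a
# mangled fragment with an unmatched <div> instead of the inner text.
naive = re.search(r'<div class="card">(.*?)</div>', html).group(1)
print(naive)  # <div>Joe&#39;s Cafe

# A real parser tracks nesting and decodes character references for free.
class TextGrabber(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

grabber = TextGrabber()
grabber.feed(html)
print("".join(grabber.chunks))  # Joe's Cafe
```

Once the HTML nests even one level, the regex quietly returns garbage; the parser keeps working.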
The problem I'm finding is that the time I wanted to save maintaining selectors and the like is now being spent writing wrapper code and dealing with the mistakes it makes. Some are OK and I can deal with them; others are pretty annoying because it's difficult to handle them in a deterministic manner.
I've also tried GPT-4, but it's way more expensive, and despite what this guy got, it also makes mistakes.
I don't really care about inference speed, but I do care about price and correctness.
In fact, what about a hybrid of what you're doing now? Initially, you use an LLM to generate examples. And then from those examples, you use that same LLM to write deterministic code?
That's understandable. The real problem is when the AI lies/hallucinates another answer with confidence instead of saying "I don't know".
Especially running on Groq's infrastructure it's blazing fast. In some examples I ran against Groq's API, the query completed in 70ms. Groq has released API libraries for Python and JavaScript; I wrote a simple Rust example of how to use the API [1].
Groq's API reports how long it took to generate the tokens for each request. 70ms for a page of a document is well over 100 times faster than GPT, and faster than every other capable model. Accounting for internet latency and whatever queue might exist, the user receives the response in about a second. But how fast would this model run locally? Fast enough to generate natural-language tokens, synthesize a voice, then listen and decode the user's next spoken request, all in real time.
With a technology like that, why not talk to internet services through APIs alone, with no web interface at all? Just functions exposed on the internet: take JSON as input, validate it, and send JSON back to the user. The same goes for every other interface and button around us. Why press buttons on every electric appliance instead of just talking to the machine via a JSON schema? Why should users on an internet forum have to press the "add comment" button every time, instead of just saying "post it"? Pretty annoying, actually.
I am collecting these approaches and tools here: https://github.com/imaurer/awesome-llm-json
Also, a Google SERP page is deterministic (it always has the same structure for the same kind of query), so it would probably be much more effective to use AI to write a parser once, then refine and reuse it?
Scraping is quite complex by now (front-end JS, deep and irregular nesting, obfuscated html, …).
Have you ever had to scrape multiple sites with wildly varying HTML?
If you are scraping a limited number of sites, you could, for each site, ask the LLM for parsing code based on some samples, review it, and move on.
Impressive inference speed difference though
Are you from a non-English country? Maybe it's cultural?
Groq delivers this kind of speed by networking many, many chips together with high-bandwidth interconnect. Each chip has only 230MB of SRAM[0].
From the linked reference:
"In the case of the Mixtral model, Groq had to connect 8 racks of 9 servers each with 8 chips per server. That’s a total of 576 chips to build up the inference unit and serve the Mixtral model."
That's eight racks with ~132GB of memory for the model. A single H100 has 80GB and can serve Mixtral without issue (albeit at lower performance).
If you consider the requirements for actual real-world inference serving workloads you need to serve multiple models, multiple versions of models, LoRA adapters, sentence embeddings models (for RAG), etc the economics and physical footprint alone get very challenging.
It's an interesting approach and clearly very, very fast but I'm curious to see how they do in the market:
1) This analysis uses cloud GPU pricing for the Nvidia numbers. Cloud providers make significant margin on their GPU instances. If you look at qty-1 retail pricing for an Nvidia DGX, Lambda Hyperplane, etc. and compare it to cloud GPU pricing (inference needs to run 24x7), break-even on hardware vs cloud is under seven months, depending on what your costs are for hosting the hardware.
2) Nvidia has incredibly high margins.
3) CUDA.
There are some special cases where tokens per second and time to first token are incredibly important (as the article states - real time agents, etc) but overall I think actual real-world production use or deployment of Groq is a pretty challenging proposition.
[0] - https://www.semianalysis.com/p/groq-inference-tokenomics-spe...
This is a significant understatement. ChatGPT has an estimated 100m monthly active users.
Groq gets featured on HN from time to time but is otherwise almost completely unknown. According to their stats they have done something like 15m requests total since launch. ChatGPT likely does this in hours (or less).
In short:
Groq -> AI chip
Microsoft etc. -> Nvidia GPU
They would likely wait until some model performs better than GPT-4 for the same price.
Claude 3 Opus is in the capability ballpark of GPT-4, GPT-3.5 has alternatives that are cheaper (Claude 3 Haiku) or cheaper and work offline (Qwen 1.5, Mixtral, …).
TFA shows that Groq is many times faster than GPT-4 (up to 18x, Groq claims). Faster means less energy, so I think it's just a matter of time until these things become ridiculously power-efficient (e.g. running on phones in sub-second times).
At least AI & LLMs have large scale practical applications as opposed to crypto (IMO).
Crypto is nearly pure waste.
I don't understand this. It adds bureaucracy, and I don't see why different uses need to be charged differently if they all consume energy the same way.
In other words, if energy costs X per unit, and an inefficient (AI) software takes 30 units and an efficient (traditional) software takes 10 units, then it is already cheaper to run the efficient software, and thus people are already incentivised to do so. There's no need to charge differently. If one day AI turns out to only need 5 units, turning more efficient, then just charge them for 5X. People will gravitate towards the new, efficient AI software naturally then.
It no longer requires an expert human
This was just a blog to generate traffic on the site. Not to showcase some new use case for an llm.
>For all the posturing and forest fire hate on HN, it’s now socially acceptable to run a toy steam engine to power a model car? Not very green of you.
To be fair to GP, they did compare it to alternatives (dumb HTML parsing), but failed to consider versatile HTML parsing or other uses for Groq LLM.
If one cares about the environment, a carbon cap/tax is what you should campaign for. Then carbon-based energy sources will be curtailed, energy costs will go up, and AI like this will be encouraged to become more energy-efficient, or other methods will be used instead.
There is a lot of business value being created in the AI space, and it's only going to get better.
Unless you live in a dictatorship, it's definitely up to us to decide... Otherwise you leave your voice to the top 0.0001% of business owners and expect them to work for your good rather than their own interests.
Also, read about the rebound effect: planes are twice as efficient as they were 100 years ago, yet as a whole they pollute vastly more.
There is nothing ridiculous about the comment you're replying to