I am doing it again now. I used Claude to import the data from CSV into a database, then asked it to help me normalize it, which produced a txt file with a lot of interesting facts about the data. Next, I asked it to write a "fix data" script that would fix all the issues I told it about.
Finally, I said "give me univariate analysis, output the results into CSV / PNG and then write a separate script to display everything in a jupyter notebook".
Weeks of work into about 2 hours...
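The final "univariate analysis to CSV" step described above can be sketched in a few lines of pandas; the column names and values here are entirely hypothetical stand-ins for the cleaned table:

```python
import pandas as pd

# Toy stand-in for the cleaned table; column names are hypothetical
df = pd.DataFrame({"age": [22, 35, 35, 41, 58],
                   "income": [30000, 52000, 48000, 61000, 75000]})

# One row of summary statistics per column (count, mean, std, quartiles)
summary = df.describe().T
summary.to_csv("univariate.csv")

# A histogram PNG per column would follow the same pattern, e.g. with
# df[col].plot(kind="hist") and matplotlib's savefig().
```

The point is not that this is hard to write, but that an LLM can produce the whole chain (load, fix, summarize, plot) without you writing any of it.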
1. Add your sources (Postgres, S3, CRM, Quickbooks, Google Sheets, etc.)
2. We deploy standard, pre-baked data models (e.g. how to calculate ARR from Stripe data)
3. AI answers questions using the standard models and starts updating the model with SQL for anything that's not already answered.
We spin up a data lake to store all the data (similar to this one[1]) for our customers, so it's very cost-effective.
Only if the output from Claude is correct. If not...
I also worry that this approach will lead to a sort of further reification of data science. While things have already trended this way, data science is not about applying a few routine formulas to a data set. Done properly, it is far more exploratory and all about building an understanding of the unique properties and significance of a particular data set. I worry the use of these tools will greatly reduce the exploratory phase and lead to analyses that simply confirm biases or typical conclusions rather than yielding new insight.
Had a task at work to clear unused metrics.
Exported a whole dashboard, thought about regexes to extract metrics out of the XML (bad, I know), and asked ChatGPT to produce one-liners to extract the data.
Got 22 used metrics.
Next day I just gave ChatGPT the whole file and asked it to spit out all the used metrics.
46 used metrics.
Asked Claude, DeepSeek and Gemini the same question. Only Gemini messed it up, missing some metrics and duplicating others.
Re-checked the one-liners ChatGPT produced. Turns out it/I messed up when I told it to generate a list of unique metrics from a file containing just the metric names, one per line. What I wanted was a script/one-liner that would print each metric name just once (de-duplicate); ChatGPT, taking me literally, produced a script that only prints metrics that show up exactly once in the whole file.
In the end, just asking the LLMs to extract the names from the Grafana dashboard directly worked better (parsing out expressions, producing only unique metric names and all that), but there was no way to know for sure; only that 3/4 of the LLMs producing the same output meant it was most likely correct.
I fixed the programmatic approach and got the same result, but it was a very weird feeling asking the LLMs to just give me the result of what, for me, was a whole process of many steps.
I find this "LLMs can be wrong" argument a bit tiresome, and also a bit lazy.
I feel like we have been here before. With wikipedia. With stack overflow. Or with the whole debate about c/assembler vs garbage collected languages.
What you are saying Claude helped you do is like 15 lines of python. A few weeks? 120 hours of effort?
The tutorials you reference? Yes, 15 lines of Python when you're starting with titanic.csv. But a real-world dataset normally takes hours or days of cleaning before it's ready for any statistical analysis.
Toy examples help teach a concept, and it helps when the example is relevant to the learner's interests. However, at some point we can't design real-world application examples because so much additional mess has to get thrown in. For example, a blog for learning web development isn't really useful to many, but it helps outline the basics of URL parameters, GET/POST requests, database management, etc.
It is on the learner to then take those skills and use them elsewhere. Or, like I did when I was learning, ignore the blog and make your own thing while roughly following the example.
If you look at tools like dspy, even if you disagree with their solutions, much of their effort goes into separating good results from bad. In practice, I find different LLM use cases call for different correctness approaches, but there aren't that many. I'd encourage anyone trying to teach here to always include how to get good results for every method presented; otherwise it is teaching bad and incomplete methods.
80% of the focus of an ETL pipeline is on ensuring edge cases are handled appropriately (e.g. not producing models from potentially erroneous data, dead-letter queuing unknown fields, etc.).
I think an LLM would be great for "take this JSON and make it a pandas dataframe", but a lot less great for "interact with this billing API to produce auditable payment tables".
For areas that are reliability-focused, LLMs still need a lot more improvement to be useful.
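As a sketch of the dead-letter idea above (the schema and field names are entirely hypothetical), a JSON-to-DataFrame step can quarantine records with unexpected or missing fields instead of guessing at them:

```python
import pandas as pd

EXPECTED = {"id", "amount", "currency"}  # hypothetical schema

def to_frame(records):
    """Load well-formed records; quarantine the rest as dead letters."""
    good, dead = [], []
    for rec in records:
        if set(rec) == EXPECTED:
            good.append(rec)
        else:
            dead.append(rec)  # unknown or missing fields: park, don't guess
    return pd.DataFrame(good), dead

df, dead = to_frame([
    {"id": 1, "amount": 10.0, "currency": "USD"},
    {"id": 2, "amount": 5.0},                                   # missing field
    {"id": 3, "amount": 7.5, "currency": "EUR", "ccy": "EUR"},  # unknown field
])
```

This is the kind of defensive plumbing that makes up most of the pipeline work, and it is exactly the part an LLM-generated "happy path" conversion tends to leave out.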
Yeah, it's great... so long as you don't care that it randomly screws up the conversion 10% of the time.
My first thought, when I saw the post title, was that this is the 2025 equivalent to people using MapReduce for a 1MB dataset. LLMs certainly have good applications in data pipelines, but cleaning structured data isn't it.
I tried using enterprise ChatGPT to write a query to load some JSON data into a data warehouse. I was impressed with how good a job it did, but it still required several rounds of refinement and hand-holding, and the end result was almost, but not quite, correct. So I'm not coming at this from the perspective of hating LLMs a priori, but I am unimpressed with the hype and over-selling of their capabilities. In the end, it was no faster than writing the query myself, but it wasn't slower either, so I can see it being somewhat helpful in limited conditions.
Unless the technology makes another quantum leap improvement at the same time the price drops like a stone, I don't see LLMs coming anywhere close to your claim.
That said, I expect to see a huge amount of snake oil and enterprise dollars wastefully burned on executive pipe dreams of "here's a pile of data now magic me a better business!" in the next few years of LLM over-hyped nonsense. There's always a quick buck to make in duping clueless execs drooling over replacing pesky, annoying, "over-paid" tech people.
What do we typically do in academic biomedical research in this situation?
The lead PI looks around the lab and finds a grad student or postdoc who knows how to turn on a computer and, if very lucky, also has had 6 months of experience noodling around with R or Python. This grad student or postdoc is then charged with running some statistical analyses without any training whatsoever in data science. What is an outlier anyway? What do you mean by “normalize”? What is metadata, exactly?
You get my drift: It is newbies in data science and programming (often 40-and 50-year-olds) leading novices (20- and 30-year-olds) to the slaughter. Might contribute to some lack of replicability ;-)
And it has been this way in the majority of academic labs since I started using CP/M on an Apple II in 1980 at UC Davis in an electrophysiology lab in Psychology, to the first Macs I set up at Yale in a developmental neurobiology lab in 1984, and up to the point at which I set up my own lab in neurogenetics at the University of Tennessee with a pair of Mac IIs in 1989 and $150,000 in set-up funds, just enough for me to hire one very inexperienced technician to help me do everything.
So in this context I hope all of you can appreciate that ANY help in bringing some real data science into mom-and-pop laboratories would be a huge huge boon.
And please god, let it be FOSS.
I also think there are unanswered questions about reliability, cost (dollar and energy), and AI business models; I don't think OpenAI can burn $2+ to make a dollar forever.
There's so much that goes into ensuring the reliability, scalability and monitoring of production-ready data pipelines. Not to mention the integration work for each use case. An LLM will give you short-term wins at the cost of long-term reliability, which is exactly why we already have DE teams to support DA and DS roles.
I agree. There is a lot of data people want that isn't made because of labor costs. Not just in quantity, but difficulty. If you can only afford to hire one analyst, and the analyst's time is only spent on cleaning data and generating basic sums, then that's all you'll get. But if the analyst can save a lot of time with LLMs, they'll have time to handle more complicated statistics using those counts like forecasts or other models.
What is the intoxication that assumes the engineering disciplines are now suddenly automatable?
I’m not trying to move the goal post here, but LLMs haven’t replaced a single headcount. In fact, it’s only been helping our business so far.