I am doing it again now. I used Claude to import the data from CSV into a database, then asked it to help me normalize it, which produced a txt file with a lot of interesting facts about the data. Next, I asked it to write a "fix data" script that would fix all the issues I told it about.
Finally, I said "give me univariate analysis, output the results into CSV / PNG and then write a separate script to display everything in a jupyter notebook".
Weeks of work into about 2 hours...
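The final "univariate analysis to CSV" step described above can be sketched in a few lines of pandas; the column names and values here are entirely hypothetical stand-ins for the cleaned table:

```python
import pandas as pd

# Toy stand-in for the cleaned table; column names are hypothetical
df = pd.DataFrame({"age": [22, 35, 35, 41, 58],
                   "income": [30000, 52000, 48000, 61000, 75000]})

# One row of summary statistics per column (count, mean, std, quartiles)
summary = df.describe().T
summary.to_csv("univariate.csv")

# A histogram PNG per column would follow the same pattern, e.g. with
# df[col].plot(kind="hist") and matplotlib's savefig().
```

The point is not that this is hard to write, but that an LLM can produce the whole chain (load, fix, summarize, plot) without you writing any of it.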
1. Add your sources (Postgres, S3, CRM, Quickbooks, Google Sheets, etc.)
2. We deploy standard, pre-baked data models (e.g. how to calculate ARR from Stripe data)
3. AI answers questions using the standard models and starts updating the model with SQL for anything that's not already answered.
We spin up a data lake to store all the data (similar to this one[1]) for our customers, so it's very cost-effective.
Only if the output from Claude is correct. If not...
I also worry that this approach will lead to a sort of further reification of data science. While things have already trended this way, data science is not about applying a few routine formulas to a data set. Done properly, it is far more exploratory and all about building an understanding of the unique properties and significance of a particular data set. I worry the use of these tools will greatly reduce the exploratory phase and lead to analyses that simply confirm biases or typical conclusions rather than yielding new insight.
Had a task at work to clear unused metrics.
Exported a whole dashboard, thought about regexes to extract metrics out of the XML (bad, I know), and asked ChatGPT to produce one-liners to extract the data.
Got 22 used metrics.
Next day I just gave ChatGPT the whole file and asked it to spit out all the used metrics.
46 used metrics.
Asked Claude, DeepSeek and Gemini the same question. Only Gemini messed it up, missing some metrics and duplicating others.
Re-checked the one-liners ChatGPT produced. Turns out it/I messed up when I told it to generate a list of unique metrics from a file containing just the metric names, one per line. What I wanted was a script/one-liner that would print each metric name just once (de-duplicate); ChatGPT, taking me literally, produced a script that only prints metrics that show up exactly once in the whole file.
In the end, just asking the LLMs to extract the names from the Grafana dashboard directly worked better (parsing out expressions, producing only unique metric names and all that), but there was no way to know for sure; only that 3/4 of the LLMs producing the same output meant it was most likely correct.
I fixed the programmatic approach and got the same result, but it was a very weird feeling asking the LLMs to just give me the result of what, for me, was a whole process of many steps.
I find this "LLMs can be wrong" argument a bit tiresome, and also a bit lazy.
I feel like we have been here before. With wikipedia. With stack overflow. Or with the whole debate about c/assembler vs garbage collected languages.
What you are saying Claude helped you do is like 15 lines of python. A few weeks? 120 hours of effort?
The tutorials you reference? Yes, 15 lines of Python when you're starting with titanic.csv. But a real-world dataset normally takes hours or days of cleaning before it's ready for any statistical analysis.
Toy examples help teach a concept, and it helps when the example is relevant to the learner's interests. However, at some point we can't design real-world application examples because so much additional mess has to get thrown in. For example, a blog for learning web development isn't really useful to many, but it helps outline the basics of URL parameters, GET/POST requests, database management, etc.
It is on the learner to then take those skills and use them elsewhere. Or, like I did when I was learning, ignore the blog and make your own thing while roughly following the example.
If you look at tools like dspy, even if you disagree with their solutions, much of their effort goes into separating good results from bad. In practice, I find different LLM use cases call for different correctness approaches, but there aren't that many. I'd encourage anyone trying to teach here to always include how to get good results for every method presented; otherwise it is teaching bad and incomplete methods.
80% of the focus of an ETL pipeline is on ensuring edge cases are handled appropriately (e.g. not producing models from potentially erroneous data, dead-letter queuing unknown fields, etc.).
I think an LLM would be great for "take this JSON and make it a pandas dataframe", but a lot less great for "interact with this billing API to produce auditable payment tables".
For areas that are reliability-focused, LLMs still need a lot more improvement to be useful.
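As a sketch of the dead-letter idea above (the schema and field names are entirely hypothetical), a JSON-to-DataFrame step can quarantine records with unexpected or missing fields instead of guessing at them:

```python
import pandas as pd

EXPECTED = {"id", "amount", "currency"}  # hypothetical schema

def to_frame(records):
    """Load well-formed records; quarantine the rest as dead letters."""
    good, dead = [], []
    for rec in records:
        if set(rec) == EXPECTED:
            good.append(rec)
        else:
            dead.append(rec)  # unknown or missing fields: park, don't guess
    return pd.DataFrame(good), dead

df, dead = to_frame([
    {"id": 1, "amount": 10.0, "currency": "USD"},
    {"id": 2, "amount": 5.0},                                   # missing field
    {"id": 3, "amount": 7.5, "currency": "EUR", "ccy": "EUR"},  # unknown field
])
```

This is the kind of defensive plumbing that makes up most of the pipeline work, and it is exactly the part an LLM-generated "happy path" conversion tends to leave out.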
Yeah, it's great... so long as you don't care that it randomly screws up the conversion 10% of the time.
My first thought, when I saw the post title, was that this is the 2025 equivalent to people using MapReduce for a 1MB dataset. LLMs certainly have good applications in data pipelines, but cleaning structured data isn't it.
I tried using enterprise ChatGPT to write a query to load some JSON data into a data warehouse. I was impressed with how good a job it did, but it still required several rounds of refinement and hand-holding, and the end result was almost, but not quite, correct. So I'm not coming at this from the perspective of hating LLMs a priori, but I am unimpressed with the hype and over-selling of their capabilities. In the end, it was no faster than writing the query myself, but it wasn't slower either, so I can see it being somewhat helpful in limited conditions.
Unless the technology makes another quantum leap improvement at the same time the price drops like a stone, I don't see LLMs coming anywhere close to your claim.
That said, I expect to see a huge amount of snake oil and enterprise dollars wastefully burned on executive pipe dreams of "here's a pile of data now magic me a better business!" in the next few years of LLM over-hyped nonsense. There's always a quick buck to make in duping clueless execs drooling over replacing pesky, annoying, "over-paid" tech people.
What do we typically do in academic biomedical research in this situation?
The lead PI looks around the lab and finds a grad student or postdoc who knows how to turn on a computer and, if very lucky, also has had 6 months of experience noodling around with R or Python. This grad student or postdoc is then charged with running some statistical analyses without any training whatsoever in data science. What is an outlier anyway? What do you mean by “normalize”? What is metadata, exactly?
You get my drift: It is newbies in data science and programming (often 40-and 50-year-olds) leading novices (20- and 30-year-olds) to the slaughter. Might contribute to some lack of replicability ;-)
And it has been this way in the majority of academic labs since I started using CP/M on an Apple II in 1980 at UC Davis in an electrophysiology lab in Psychology, to the first Macs I set up at Yale in a developmental neurobiology lab in 1984, and up to the point at which I set up my own lab in neurogenetics at the University of Tennessee with a pair of Mac IIs in 1989 and $150,000 in set-up funds, just enough for me to hire one very inexperienced technician to help me do everything.
So in this context I hope all of you can appreciate that ANY help in bringing some real data science into mom-and-pop laboratories would be a huge huge boon.
And please god, let it be FOSS.
I also think there are unanswered questions about reliability, cost (dollar and energy), and AI business models; I don't think OpenAI can burn $2+ to make a dollar forever.
There's so much that goes into ensuring the reliability, scalability and monitoring of production-ready data pipelines. Not to mention the integration work for each use case. An LLM will give you short-term wins at the cost of long-term reliability, which is exactly why we already have DE teams to support DA and DS roles.
I agree. There is a lot of data people want that isn't made because of labor costs. Not just in quantity, but difficulty. If you can only afford to hire one analyst, and the analyst's time is only spent on cleaning data and generating basic sums, then that's all you'll get. But if the analyst can save a lot of time with LLMs, they'll have time to handle more complicated statistics using those counts like forecasts or other models.
What is the intoxication that assumes the engineering disciplines are now suddenly automatable?
I’m not trying to move the goal post here, but LLMs haven’t replaced a single headcount. In fact, it’s only been helping our business so far.