A lot of our databases at work have columns with custom types and enums, and getting the LLM (Llama2) to write SQL queries to robustly answer natural language questions about the data is tough. It requires a lot of instruction prompting, context, and question-SQL examples (few-shot learning), and it still fails in unexpected ways. It's a tough ask for people to use a tool like this if they can't trust the results all the time. It's also a bit infeasible to scale this to tens or hundreds of tables across our data warehouse.
It's great that a lot of people are trying to crack this problem, I'm curious to try this model out. I'd also love to see if other people have tried solving this problem and made any headway.
In that sense I emphasized in our Blogpost that users should think of it as a documentation oracle that always gives you the exact DuckDB SQL query snippet you are looking for, which is a tremendoues time-saver if you have an abstrat idea of the query you want to write, but you're just not sure about the syntax, expecially with DuckDB having so many functions and SQL extensions.
Here are a few exammples:
- create tmp table from test.csv
- load aws credentials from 'test' profile
- get max of all columns in rideshare table
- show query plan with runtimes for 'SELECT * FROM rideshare'
- cast hvfhs_license_num column to int
- get all columns ending with _amount from taxi table
- show summary statistics of rideshare table
- get a 10% reservoir sample of rideshare table
- get length of drivers array in taxi table
- get violation_type field from other_violations json column in taxi table
- get passenger count, trip distance and fare amount from taxi table and oder by all of them
- list all tables in current database
- get all databases starting with test_
[edit: fixed list formatting]
We at MotherDuck took an incremental approach. We launched something more akin to lane-assist. We're calling it FixIt - an in-flow visual aid to help you identify and fix errors in your SQL [0].
I think there's gobs of opportunities to improve the analytics experiences without resorting to "L5 self-driving" (e.g. full text-to-results)
[0] https://motherduck.com/blog/introducing-fixit-ai-sql-error-f...
But as you say custom types and encoded domain knowledge is extremely tough and as a result it's very tough to "transfer" the system to different databases.
1. Business users aren’t prepared to talk to their data in meaningful ways and this is an opportunity for LLMs to enhance the way users ask questions.
2. SQL modeling languages exist (although I’m not sure there are well maintained open source good ones and this is the biggest obstacle to what I’m working on now) and LLMs can extend these effectively by adding components such as dimensions, metrics, relationships, filters, etc. with less chance of hallucination
3. Deterministic SQL generation from a modeling repository is easier to troubleshoot and provide guarantees than end-to-end generation.
4. Existing SQL can be parsed to generate modeling components that can be committed to the model repository with LLM assistance
5. Much of the richness of going to data to answer questions is context, e.g., how does this dept compare to others, this month to same month last year, etc. Modeling languages are good at expressing these transformations, but business users and often analysts aren’t good at capturing all the permutations of these types of comparisons. Another area where LLMs can help apply tooling.
IMO, LLMs are more effective at using tools than generating complex outputs.
One of the biggest challenges I've personally seen in this space is business "leaders" pushing teams to ship products asap lest they loose face among their fellow CEOs for not pushing out "AI" products before everyone else.
I'm fairly optimistic about LLMs being able to truly be transformative, but it's not going to be through forcing the bread-dead UX of hoisting yet another slightly re-imagined chat interface on users.
The idea of "talking to your data" is a promising one, and anyone who has worked for a large data driven org will quickly agree that organizing and searching in-house data is not a solved problem from the UX end of things. But to truly solving these problems, even/especially with LLMs, is going to require thought and experimentation. Something few "business leaders" have patience for.
It can work to support business analysts to crank out more reporters, but I wouldn’t roll it out to all my staff.
"What was the average order size per customer for product XYZ in the West region?"
Imagine turning that one loose against the typical legacy system.
I wouldn't trust an LLM to figure out the joins or aggregate calculation, LET ALONE the definition of a customer, a product, or a region.
But it would almost certainly generate AN answer.
2. How would you suggest using this model effectively if we have custom data in our DBs? For example, we might have a column called `purpose` that's a custom defined enum (i.e. not a very well-known concept outside of our business). Currently, we've fed it in as context by defining all the possible values it can have. Do you have any other recs on how to tune our prompts so that this model is just as effective with our own custom data?
3. Similar to above, do you know you can use the same model to work effectively on tens or even hundreds of tables? I've used multiple question-SQL example pairs as context, but I've found that I need 15-20 for it to be effective for even one table, let alone tens of tables.
1. definitely training data (for me), we explored about 10 different directions before settling on the current approach. It's easy to underestimate the effect of training data on the quality of the model. Starting point was the benchmark dataset though, which we assembled manually (to avoid data pollution and also because there was simply no text2sql benchmark that covers anything else than plain old SQL select statements with a handful of aggregate functions). And training is also not a one-off thing. With large datasets it is hard to evaluate the quality of the dataset without actually training a few epochs on it and run the benchmark.
2. I left a comment about my view on where such models are effective in a previous commment: https://news.ycombinator.com/item?id=39133155
3. No way - I see a common stack emerging (take a look at companies like https://vanna.ai/, https://www.dataherald.com/, or https://www.waii.ai) that is mainly centered around foundation models like GPT-4 with strong in-context learning capabilities (that's a kind of a must to make these approaches work and comes with long inference times and higher costs). These solutions include things like embedding-based schema filtering, options for users to enrich metadata about tables and columns, including previous related queries into the context etc. around the model. I'd say it's a bit of a different problem from what we aimed at solving.
Any other ideas about where you plan to utilize LLM? E.g.
- migrate from different SQL dialects (e.g. date functions)
- make my queries nicer (e.g., rewrite them in shorter form or use better variable names)
- warn about potential bugs (e.g. SQL fanout)
What's really exciting is what you can do with DuckDB, MotherDuck, and WASM. A powerful in-browser storage and execution engine tethered to a central serverless data warehouse using hybrid mode [1] opens the doors for unprecedented experiences. Imagine the possibilities if you have metadata, data, query logic, or even LLMs in the client 0ms away from the user and on user's own hardware.
So we're doing this in our UI of course, but we also released a WASM SDK so that developers can take advantage of this new architecture in their own apps! [2]
[0]https://motherduck.com/blog/introducing-fixit-ai-sql-error-f...
[1]https://motherduck.com/docs/architecture-and-capabilities
Do you think a 3B model might also be in the future, or something small enough that can be loaded up in Transformers.js?
[1] https://huggingface.co/datasets/NumbersStation/NSText2SQL
[2] https://huggingface.co/datasets/b-mc2/sql-create-context