Perhaps surprisingly, we decided to co-design the abstractions and the runtime, which allowed novel optimizations at the intersection of FaaS and data - e.g. rebuilding functions can be 15x faster than the corresponding AWS stack (https://arxiv.org/pdf/2410.17465). All capabilities are available to humans (CLI) and machines (SDK) through simple APIs.
Would love to hear the community’s thoughts on moving data engineering workflows closer to software abstractions: tables, functions, branches, CI/CD etc.
It mentions "Serverless pipelines. Run fast, stateless Python functions in the cloud." on the home page... but it took me a while of clicking around looking for exactly what the deployment model is
e.g. is it the cloud provider's own "serverless functions"? or is this a platform that maybe runs on k8s and provides its own serverless compute resources?
Under examples I found https://docs.bauplanlabs.com/en/latest/examples/data_product... which shows running a CLI command `serverless deploy` to deploy an AWS Lambda
for me deploying to a regular Lambda function is a plus, but this example raises more questions...
https://docs.bauplanlabs.com/en/latest/commands_cheatsheet.h... doesn't show any 'serverless' or 'deploy' command... presumably the example is using an external tool i.e. the Serverless framework?
which is fine, great even - I can presumably use my existing code deployment methodology like CDK or Terraform instead
Just suggesting that the underlying details could be spelled out a bit more up front.
In the end I kind of understand it as similar to sqlmesh, but with a "BYO compute" approach? So where sqlmesh wants to run on a Data Warehouse platform that provides compute, and only really supports Iceberg via Trino, bauplan is focused solely on Iceberg and defining/providing your own compute resources?
I like it
Last question is re here https://docs.bauplanlabs.com/en/latest/tutorial/index.html
> "Need credentials? Fill out this form to get started"
Should I understand therefore that this is only usable with an account from bauplanlabs.com ?
What does that provide? There's no pricing mentioned so far - what is the model?
The latter, although it's a custom orchestration system, not Kubernetes (there are some similarities, but our system is really optimized for data workloads).
We manage Iceberg for easy data versioning, take care of data caching and Python modules, etc., and you just write some Python and SQL and execute it over your data catalog without having to worry about Docker and all the infra stuff.
I wrote a bit on what the efficient SQL half takes care of for you here: https://www.bauplanlabs.com/blog/blending-duckdb-and-iceberg...
> In the end I kind of understand it as similar to sqlmesh, but with a "BYO compute" approach? So where sqlmesh wants to run on a Data Warehouse platform that provides compute, and only really supports Iceberg via Trino, bauplan is focused solely on Iceberg and defining/providing your own compute resources?
Philosophically, yes. In practice so far we manage the machines in separate AWS accounts _for_ the customers, in a sort of hybrid approach, but the idea is not dissimilar.
> Should I understand therefore that this is only usable with an account from bauplanlabs.com ?
Yep. We’d help you get started and use our demo team. Send jacopo.tagliabue@bauplanlabs.com an email
RE: pricing. Good question. Early startup stage bespoke at the moment. Contact your friendly neighborhood Bauplan founder to learn more :)
I am a bit concerned that you want users to swap out both their storage and workflow orchestrator. It's hard enough to convince users to drop one.
How does it compare to DuckDB or Polars for medium data?
- We are not quite live yet, but the pricing model is based on compute capacity and is divided into tiers (e.g. small = 50GB for concurrent scans = $1500/month; large can get up to a TB). Infinite queries, infinite jobs, infinite users. The idea is to have very clear pricing with no sudden increases due to volume.
- You do not have to swap your storage - our runner comes to your S3 bucket, and your data never has to live anywhere other than your S3.
- You do not have to swap your orchestrator either. Most of our clients are actually using it with their existing orchestrator: you call the platform's APIs, including `run`, from your Airflow/Prefect/Temporal tasks https://www.prefect.io/blog/prefect-on-the-lakehouse-write-a...
Does it help?
RE: workflow orchestrators. You can use the Bauplan SDK to query, launch jobs and get results from within your existing platform; we don’t want to replace your setup entirely if that doesn’t fit you, just to augment it.
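For instance, the wrapper task in your existing orchestrator can be as thin as this sketch (`PlatformClient` is a stand-in I made up for the real SDK client; names and signatures are illustrative only):

```python
# Hypothetical sketch: an orchestrator task (Airflow / Prefect / Temporal)
# that simply calls the platform's API. `PlatformClient` is a stand-in
# for the real SDK client, not its actual name or signature.
class PlatformClient:
    def run(self, project_dir, ref):
        # the real client would submit the pipeline and return job status
        return {"project": project_dir, "ref": ref, "state": "SUCCESS"}

def nightly_pipeline_task():
    # your scheduler owns retries and timing; the platform owns compute
    client = PlatformClient()
    job = client.run(project_dir="./pipelines/sales", ref="main")
    if job["state"] != "SUCCESS":
        raise RuntimeError(f"pipeline failed: {job}")
    return job

job = nightly_pipeline_task()
```

The point of the pattern is that the orchestrator keeps scheduling and retries, while the heavy lifting happens on the platform side.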
RE: DuckDB and Polars. It literally uses DuckDB under the hood, but with two huge upgrades: one, we plug into your data catalog for really efficient scanning even on massive data lakehouses, before anything hits the DuckDB step. Two, we do efficient data caching: query results and intermediate scans can be reused across runs.
More details here: https://www.bauplanlabs.com/blog/blending-duckdb-and-iceberg...
As for Polars, you can use Polars itself within your Python models easily by specifying it in a pip decorator. We install all requested packages within Python modules.
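To give a concrete flavor, a step requesting Polars might look roughly like this; the decorator names follow the public docs, but treat the exact signatures, versions, and the `orders` table as illustrative assumptions:

```python
import bauplan

@bauplan.model()
@bauplan.python("3.11", pip={"polars": "1.8.2"})  # versions are illustrative
def top_customers(data=bauplan.Model("orders")):  # 'orders' is a hypothetical table
    # the requested package is importable inside the function body
    import polars as pl
    df = pl.from_arrow(data)
    return df.group_by("customer_id").agg(pl.col("amount").sum()).to_arrow()
```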
1) you should not need to redeploy a Lambda if you're running January and February now vs only January before. In the same vein, you should not need to redeploy a Lambda if you upgrade from pandas to Polars: rebuilding functions is 15x faster than Lambda, 7x faster than Snowpark (-> https://arxiv.org/pdf/2410.17465)
2) the only way (even in popular orchestrators, e.g. Airflow, not just FaaS) to pass data around in DAGs is through object storage, which is slow and costly: we use Arrow as the intermediate data format and over the wire, with a bunch of optimizations in caching and zero-copy sharing to make the development loop extra fast and compute usage efficient!
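To see what zero-copy sharing buys, here is a plain-Python analogue using a memoryview; Arrow applies the same idea to typed columnar buffers instead of raw bytes:

```python
# Plain-Python illustration of zero-copy sharing: a memoryview slice
# references the producer's bytes instead of copying them, the way an
# Arrow table can be handed to a downstream step without serialization.
producer = bytearray(b"intermediate-table-bytes")
view = memoryview(producer)[:12]  # no bytes copied, just a window

# Mutating the underlying buffer is visible through the view,
# proving both names share a single allocation.
producer[0:1] = b"I"
shared = bytes(view)  # b"Intermediate"
```

Passing object-storage paths between steps forces a full write + read per edge; sharing buffers like this makes an edge essentially free.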
Our current customers run near-real-time analytics pipelines (Kafka -> S3 / Iceberg -> Bauplan run -> Bauplan query), DS / AI workloads, and WAP (write-audit-publish) for data ingestion.
So excited to see them take this step!
1. Great Python support. Piping something from a structured data catalog into Python is trivial, and so is persisting results. With materialization, you never need to recompute something in Python twice if you don’t want to — you can store it in your data catalog forever.
Also, you can request any Python package you want, and even have different Python versions and packages in different workflow steps.
2. Catalog integration. Safely make changes and run experiments in branches.
3. Efficient caching and data re-use. We do a ton of tricks behind the scenes to avoid recomputing or rescanning things that have already been done, and pass data between steps as zero-copy Arrow tables. This means your DAGs run a lot faster because the amount of time spent shuffling bytes around is minimal.
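A toy sketch of the caching idea, assuming content-addressed keys over a step's name, code version, and inputs (this is my illustration, not the actual implementation):

```python
import hashlib
import json

# Hypothetical sketch: skip re-running a DAG step when the hash of its
# name, code version, and inputs is unchanged since the last run.
_cache = {}

def cache_key(step_name, code_version, inputs):
    payload = json.dumps([step_name, code_version, inputs], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_step(step_name, code_version, inputs, fn):
    key = cache_key(step_name, code_version, inputs)
    if key in _cache:
        return _cache[key]   # cache hit: no recompute
    result = fn(inputs)      # cache miss: actually run the step
    _cache[key] = result
    return result

# The second call with identical inputs reuses the cached result,
# so the step body runs exactly once.
calls = []
def double(xs):
    calls.append(1)
    return [x * 2 for x in xs]

a = run_step("double", "v1", [1, 2, 3], double)
b = run_step("double", "v1", [1, 2, 3], double)  # served from cache
```

Changing either the inputs or the code version produces a new key, so stale results are never reused.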
the paper briefly mentions “bring your own cloud” in 4.5 but the docs page doesn’t seem to have any information on doing that (or at least none that i can find).
It's interesting that most vendors run compute in their own managed account instead of BYOC, though. I understand it's hard for vendors to manage compute in the customer's cloud, but I was under the impression that it's a no-go for most enterprise companies. Maybe I'm wrong?
Unlike warehouses or SQL lakehouses, we also run any Python code, including from your private AWS repositories for example, through a simple decorator, while giving you transactional pipelines, fully versioned and revertible, like it's a database on your S3.
Wrt deployment, I think things are a bit more nuanced: we are SOC 2 compliant and provide an enterprise-ready control plane vs data plane separation. Data is only processed in a single-tenant VPC, which is PrivateLink-ed to your account, effectively making it the same account networking-wise. If you insist on having the data plane in your own account, the architecture supports that, as our only data plane dependency is VMs (we install our own custom runtime there!).
To give you a sense, one of our large customers is a $4BN/year broadcaster with tens of millions of users, and they run with the above AWS security posture.
Happy to answer more offline if you're curious (jacopo.tagliabue@bauplanlabs.com)
For example: I saw the YouTube video demo someone linked here where they had an example of a quarterly report pipeline. Say that I'm one of two analysts tasked with producing that report, and my coworker would like to land a bunch of changes. Say in their data branch, the topline report numbers are different from `main` by X%. Clearly it's due to some change in the pipeline, but it seems like I will still have to fire up a notebook and copy+paste chunks of the pipeline to see step-by-step where things are different. Is there another recommended workflow (or even better: provided tooling) for determining which deltas in the pipeline contributed to the X% difference?
One thing we do have support for is “expectations” — model-like Python steps that check data quality, and can flag it if the pipeline violates them.
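A minimal illustration of the idea in plain Python (a sketch I put together, not the actual Bauplan API):

```python
# Hypothetical sketch of an "expectation" step: a function that checks
# data quality and fails the pipeline when the check is violated,
# instead of letting bad data flow downstream silently.
def expect_no_nulls(rows, column):
    bad = [r for r in rows if r.get(column) is None]
    if bad:
        raise AssertionError(f"{len(bad)} null values in '{column}'")
    return rows  # pass data through unchanged on success

rows = [{"id": 1, "amount": 9.5}, {"id": 2, "amount": 3.0}]
checked = expect_no_nulls(rows, "amount")

# A violating batch raises instead of propagating.
try:
    expect_no_nulls([{"id": 3, "amount": None}], "amount")
    violated = False
except AssertionError:
    violated = True
```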
I think this is kind of the answer I was looking for, and in other systems I've actually manually implemented things like this with a "temp materialize" operator that's enabled by a "debug_run=True" flag.

With the notebook thing, basically I'm trying to "step inside" the data pipeline, like how an IDE debugger might run a script line by line until an error hits and then drop you into a REPL located 'within' the code state. In the notebook I'll typically try to replicate (as closely as possible) the state of the data inside some intermediate step, and will then manually mutate the pipeline between the original and branch versions to determine how the pipeline changes relate to the data changes.

I think the dream for me would be having something that can say "the delta on line N is responsible for X% of the variance in the output", although I recognize that's probably not a well-defined calculation in many cases. But either way, at a high level my goal is to understand why my data changes, so I can be confident that those changes are legit and not an artifact of some error in the pipeline.
Asserting that a set of expectations is met at multiple pipeline stages also gets pretty close, although I do think it's not entirely the same. Seems loosely analogous to the difference between unit and integration/E2E tests. Obviously I'm not going to land something with failing unit tests, but even if tests are passing, the delta may include more subtle (logical) changes which violate the assumptions of my users or integrated systems (ex. that their understanding of the business logic is aligned with what was implemented in the pipeline).
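To make it concrete, the by-hand divergence search I described looks roughly like this (a hypothetical, stdlib-only sketch of the workflow, not a real tool):

```python
# Hypothetical sketch: run the main and branch versions of a pipeline
# step by step on the same input, and report the first step whose
# output diverges, to localize where an X% delta was introduced.
def first_divergence(steps_main, steps_branch, data):
    out_main, out_branch = data, data
    for name in steps_main:
        out_main = steps_main[name](out_main)
        out_branch = steps_branch[name](out_branch)
        if out_main != out_branch:
            return name, out_main, out_branch
    return None

steps_main = {
    "filter": lambda xs: [x for x in xs if x > 0],
    "total":  lambda xs: sum(xs),
}
# The branch changed the filter threshold: that's where outputs diverge.
steps_branch = dict(steps_main, filter=lambda xs: [x for x in xs if x > 1])

diverged_at = first_divergence(steps_main, steps_branch, [1, 2, 3])[0]
```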
On the data side of things, DVC is more about versioning static datasets / local files, while Bauplan manages your entire lakehouse, potentially hundreds of tables, with point-in-time versioning (time travel) and branching (different versions of the same table at any given time) -> https://docs.bauplanlabs.com/en/latest/tutorial/02_catalog.h....
On the compute side of things, Bauplan runs the functions for you, unlike catalogs, which only see a partial truth and provide only a piece of the puzzle: Bauplan knows both your code (because it runs your pipeline) and your data (because it handles all the commits on the lakehouse), which allows a one-liner reply to questions such as:
"who changed this table on this branch, when, and with which code?"
It also allows a lot of optimizations in multi-player mode, such as efficient caching of data (https://arxiv.org/abs/2411.08203) and packages (https://arxiv.org/pdf/2410.17465).
If you love videos and would like to understand the decisions behind it, the GeekNarrator episode is a good start: https://www.youtube.com/watch?v=8aMm7RHEgIw&t=4812s
For real-world enterprise-grade deployment stories, check our blog or reach out to any of us to learn more (jacopo.tagliabue@bauplanlabs.com).
My usual suggestion is to get a feeling for the APIs and capabilities in the public sandbox on our home page, which is free and comes with a lot of examples and datasets to start from!
As for the magic, the reasons behind building a FaaS runtime and the main optimizations have been shared in a few recent papers with the community - e.g. https://arxiv.org/pdf/2410.17465 and https://arxiv.org/abs/2411.08203 - and deep dive on podcasts (e.g. https://www.youtube.com/watch?v=gPJvgkHIEBY).
If you want to geek out more, just reach out!