Databases should contain their own Metadata – Use SQL Everywhere

(floedb.ai)

46 points · matheusalmeida · 3mo ago · 33 comments

zabzonk · 3mo ago

Not clear if the author realises that all commercial SQL database engines support querying of the database's metadata using SQL. Or maybe I have misunderstood - I only skimmed the article.

da_chicken · 3mo ago

Yeah, this seemed like a very long way to say, "Our RDBMS has system catalogs," as if it's 1987.

But then, they're also doing JOINs with the USING clause, which seems like one of those things that everybody tries... until they hit one of the several reasons not to use it, and then they go back to the ON clause, which is explicit and concrete and works great in all cases.

Personally, I'd like to hear more about the claims made about Snowflake IDs.

zabzonk · 3mo ago

> doing JOINs with the USING clause

I'm ashamed to say that despite using SQL from the late 1980s, and as someone that likes reading manuals and text books, I'd never come across USING. Probably a bit late for me now to use it (or not) :-(

tkejser · 3mo ago

@da_chicken: You can read more about Snowflake ID in the Wiki page linked in the article.

The short story:

They are a bit like UUIDs in that you can generate them across a system in a distributed way without coordination. Unlike UUIDs, they are only 64-bit.

The first bits of the Snowflake ID are structured in such a way that the values end up roughly sequentially ordered on disk. That makes them great for large tables where you need to locate specific values (such as those that store query information).
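
As a rough illustration of that layout, here is a minimal sketch in Python. It assumes the classic Twitter arrangement (41 timestamp bits, 10 worker bits, 12 sequence bits); `SnowflakeGenerator`, its parameters and the choice of epoch are hypothetical names for illustration, not anything taken from Floe.

```python
import threading
import time

# Assumed layout (classic Twitter Snowflake): 41 bits of milliseconds since a
# custom epoch, 10 bits of worker id, 12 bits of per-millisecond sequence.
EPOCH_MS = 1288834974657  # Twitter's epoch; any fixed past instant would do
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    """Hypothetical generator; each worker needs its own unique worker_id."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id does not fit in 10 bits")
        self.worker_id = worker_id
        self._clock = clock
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                # Same millisecond: bump the sequence counter.
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted: wait for the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            # Timestamp occupies the high bits, so ids sort roughly by time.
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )
```

Because the timestamp sits in the most significant bits, ids from one worker are strictly increasing and ids from different workers interleave in approximate time order, which is what keeps inserts clustered near the end of an index.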

whynotmaybe · 3mo ago

I don't think it's as easy to do the example in the article just by using information_schema.

> Which tables have a column with the name country where that column has more than two different values

But on their product page, the definition of floesql left me puzzled

> It uses intelligent caching and LLVM-based vectorized execution to deliver the query execution speed your business users expect.

> With its powerful query planner, FloeSQL executes queries with lots of joins and complicated SQL syntax without breaking your budget.

tkejser · 3mo ago

INFORMATION_SCHEMA is a good start, but it does not get you to full metadata flexibility. The columns you need just aren't there. It is good to have a standard for the metadata - but the standard isn't ambitious enough (a point I also make in the blog and, as you observe, the sample query isn't possible on Information Schema alone).
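
To make that concrete, here is a minimal sketch of the "country" question using SQLite as a stand-in (an assumption made purely so the example is runnable; this is not how Floe answers it). The catalog can say which tables have the column, but the distinct count needs a second, dynamically generated query per table. `tables_with_varied_column` and the demo tables are hypothetical.

```python
import sqlite3


def tables_with_varied_column(conn, column_name, min_distinct):
    """Return tables that have `column_name` with more than `min_distinct` distinct values."""
    # Step 1: the catalog tells us which tables exist.
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    matches = []
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        if column_name not in columns:
            continue
        # Step 2: the catalog has no distinct counts, so scan the table itself.
        # Identifiers are interpolated, so column_name must come from a trusted source.
        (distinct,) = conn.execute(
            f'SELECT COUNT(DISTINCT "{column_name}") FROM "{table}"').fetchone()
        if distinct > min_distinct:
            matches.append(table)
    return sorted(matches)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, country TEXT);
    CREATE TABLE suppliers (id INTEGER, country TEXT);
    CREATE TABLE products  (id INTEGER, name TEXT);
    INSERT INTO customers VALUES (1, 'DK'), (2, 'SE'), (3, 'NO');
    INSERT INTO suppliers VALUES (1, 'DK'), (2, 'DK'), (3, 'SE');
    INSERT INTO products  VALUES (1, 'widget');
""")
result = tables_with_varied_column(conn, "country", 2)
print(result)  # ['customers']
```

The loop over tables in application code is exactly the part that a single SQL statement could replace if per-column statistics were exposed as ordinary joinable system views.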

The Floe engine is a full database on top of Iceberg and Delta storage. The system views are just the tip of the iceberg. We will be blogging more about what we are building.

deepsun · 3mo ago

Differently though, AFAIR PostgreSQL does most of the schema changes transactionally, but not MySQL.

zabzonk · 3mo ago

I'd be too terrified to change the schema directly via SQL on the metadata tables even if the engine allowed it, transactional or not.

galaxyLogic · 3mo ago

Isn't this like it is in many relational databases, where you can query them about the tables in them?

matheusalmeida (OP) · 3mo ago

The key difference is that it's not just about schema metadata (tables, indexes, views, columns, etc...). PostgreSQL is fabulous regarding this. Even native types are part of the catalog (pg_catalog).

Things are great in your DB... until they aren't. The post is about making observability a first-class citizen. Plans and query execution statistics, for example, become queryable through a uniform interface (SQL) without the need to install DB extensions.

tkejser · 3mo ago

Thank you and yes!

By making the entire architecture of the database visible via system objects - you allow the user to form a mental model of how the database itself works. Instead of it being just a magic box that runs queries - it becomes a fully instrumented data model of itself.

Now, you could say: "The database should just work" and perhaps claim that it is a design error when it doesn't. Why do I need instrumentation at this level?

To that I can say: Every database ever made makes query planning mistakes or has places where it misbehaves. That's just the way this field works - because data is fiendishly complicated - particularly at high concurrency or when there is a lot of it. The solution isn't (just) to keep improving and fixing edge cases - it is to make those edge cases easy to detect for all users.

tkejser · 3mo ago

Hi All

Original author here (no, I am not an LLM).

First, a clarifying point on INFORMATION_SCHEMA. In the post I make it clear that this interface is supported by pretty much every database since the 1980s. Most tools would not exist without them. When you write an article like this - you are trying to hit a broad audience and not everyone knows that there are standards for this.

But, our design goes further and treats all metadata as data. It's joinable, persisted and acts, in every way, like all other data. Of course, some data we cannot allow you to delete - such as that in `sys.session_log` - because it is also an audit trail.

Consider, by contrast, PostgreSQL's `pg_stat_statements`. This is an aggregated, in-memory summary of recent statements. You can get the high-level view, but you cannot get every statement run and how that particular statement deviated from statements like it. You also cannot get the query plan for a statement that ran last week.

To address the obvious question: "Isn't that very expensive to store?"

Not really. Consider a pretty aggressive analytical system (not OLTP) - you get perhaps 1000 queries/sec. The query text is normalised and so is the plan - so the actual query data (runtimes, usernames, skewness, stats about various operators) is on the order of a few hundred bytes per query. Even on a very busy system, we are talking double-digit GB per day - on cheap Object Storage. Your company web servers store orders of magnitude more data than that in their logs.
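
The back-of-envelope arithmetic can be sketched as follows. The per-query sizes of 200, 300 and 500 bytes are assumptions standing in for "a few hundred bytes", the rate is assumed to be sustained around the clock, and `daily_log_bytes` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def daily_log_bytes(queries_per_sec, bytes_per_query):
    """Bytes of per-query metadata written per day at a sustained query rate."""
    return queries_per_sec * bytes_per_query * SECONDS_PER_DAY


# Assumed record sizes; the only figure given above is "a few hundred bytes".
for size in (200, 300, 500):
    gigabytes = daily_log_bytes(1000, size) / 1e9
    print(f"{size} bytes/query -> {gigabytes:.2f} GB/day")
```

At 300 bytes per query that comes to roughly 26 GB per day, which is consistent with the double-digit GB figure.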

With a bit of data rotation - you can keep the aggregate sizes manageable over time.

What stats do we store about queries?

- Rows in each node (count, not the actual row data as that would be a PII problem)
- Various runtimes
- Metadata about who, when and where (ex: cluster location)

Again, these are tiny amounts of data in the grand schema of things. But somehow our industry accepts that our web servers store all that - but our open source databases don't (this level of detail is not controversial in the old school databases by the way).

Of course, we can go further than just measuring the query plan.

Performance Profiling of workers is a concept you can talk about - so it is also metadata. Let us say you want to really understand what is going on inside a node in a cluster.

You can do this:

```sql
SELECT stack_frame, samples
FROM sys.node_trace
WHERE node_id = 42
```

Which returns a 10-second sample (via `perf`) of the process running on one of the cluster nodes.

(Obviously, that data is ephemeral - we are good at making things fast but we can't make tracing completely free)

Happy to answer all questions

waffletower · 3mo ago

"SQL everywhere" is decidedly dystopian. That being said, creating a standard for database introspection could be powerful for agents.

ewuhic · 3mo ago

AI shit slop

jwneil · 3mo ago

Def not - I know the author and he simply doesn't need to go there - read more carefully
