SELECT id, random()
FROM generate_series(0, 1000000) g (id);
There seems to be an equivalent in SQLite: https://sqlite.org/series.html

https://www.postgresql.org/docs/current/functions-srf.html

([]x?x:1000000)

Which gives a table of a million rows like:

x
------
877095
265141
935540
49015
...

?~1000000

or, if you don't need elements to be unique:

?$~1000000

Though it's just an array, not a persistent table - I don't know much about Jd :(

xxd -ps -c 16 -l $bignum < /dev/urandom | (echo '\copy random_data from stdin'; cat) | psql
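For comparison, the xxd pipeline hex-encodes raw bytes from /dev/urandom, 16 per row; a rough Python equivalent might look like this (`random_hex_keys` is a made-up helper name, not part of any library):

```python
import os

def random_hex_keys(count, width=16):
    """Yield `count` random keys, each `width` random bytes rendered
    as hex - similar to what `xxd -ps -c 16` does with /dev/urandom."""
    for _ in range(count):
        yield os.urandom(width).hex()

keys = list(random_hex_keys(5))  # each key is 32 lowercase hex characters
```

These could be fed to `\copy` the same way as the shell version.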
Depending on how you want your random keys formatted.

shuf -i 1-$bignum

or for random numbers with replacement:

shuf -i 1-$bignum -r -n $bignum

These give you a sample of random integers. `shuf` is part of GNU coreutils, so present in most Linux installs (but not on macOS).

WITH RECURSIVE
t AS (
SELECT 0 id
UNION ALL
SELECT id + 1
FROM t
WHERE id < 1000000
)
SELECT id, random() FROM t;
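For what it's worth, this recursive CTE runs unchanged in SQLite as well; a quick sketch via Python's bundled sqlite3 module, with the bound cut to 1000 so it runs fast (note SQLite's random() returns a signed 64-bit integer rather than PostgreSQL's 0-1 float):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same shape as the recursive CTE above; the endpoint is inclusive,
# so a bound of 1000 yields 1001 rows (0 through 1000).
rows = conn.execute("""
    WITH RECURSIVE t AS (
        SELECT 0 AS id
        UNION ALL
        SELECT id + 1 FROM t WHERE id < 1000
    )
    SELECT id, random() FROM t
""").fetchall()
print(len(rows))  # 1001
```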
It returns 1000001 rows, but so does cribwi's.

SELECT n FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) AS d(n)
for instance:

SELECT seq = units.n + tens.n*10 + hundreds.n*100 + thousands.n*1000
FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) AS units(n)
CROSS JOIN (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) AS tens(n)
CROSS JOIN (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) AS hundreds(n)
CROSS JOIN (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) AS thousands(n)
WHERE units.n + tens.n*10 + hundreds.n*100 + thousands.n*1000 BETWEEN 1 AND 1337
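As a sketch, the same cross-join trick works in SQLite too, though it lacks the T-SQL `seq =` alias form and the `AS d(n)` derived-column list, so this version (run via Python's sqlite3) uses standard `AS` aliasing and builds the digit table as a UNION chain:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no `AS d(n)` syntax, so the ten-row digit table is a UNION chain.
digits = "(SELECT 0 AS n " + " ".join(
    f"UNION ALL SELECT {i}" for i in range(1, 10)) + ")"
# Four crossed digit tables cover 0..9999; the WHERE trims that to 1..1337.
rows = conn.execute(f"""
    SELECT units.n + tens.n*10 + hundreds.n*100 + thousands.n*1000 AS seq
    FROM {digits} AS units
    CROSS JOIN {digits} AS tens
    CROSS JOIN {digits} AS hundreds
    CROSS JOIN {digits} AS thousands
    WHERE units.n + tens.n*10 + hundreds.n*100 + thousands.n*1000 BETWEEN 1 AND 1337
    ORDER BY seq
""").fetchall()
print(len(rows))  # 1337
```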
You can make it neater still by making the sub-table with a CTE:

WITH Digits AS (SELECT n FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) AS d(n))
SELECT seq = units.n + tens.n*10 + hundreds.n*100 + thousands.n*1000
FROM Digits AS units
CROSS JOIN Digits AS tens
CROSS JOIN Digits AS hundreds
CROSS JOIN Digits AS thousands
WHERE units.n + tens.n*10 + hundreds.n*100 + thousands.n*1000 BETWEEN 1 AND 1337
or

WITH Digits AS (SELECT n FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) AS d(n))
, Thousands AS (SELECT seq = units.n + tens.n*10 + hundreds.n*100 + thousands.n*1000
FROM Digits AS units
CROSS JOIN Digits AS tens
CROSS JOIN Digits AS hundreds
CROSS JOIN Digits AS thousands
)
SELECT seq FROM Thousands
WHERE seq BETWEEN 1 AND 1337
ORDER BY seq
though at that point you probably have the recursive option available, so you'd need to test to see which is more efficient. As a further hack you can combine this and the recursive CTE to extend the number of rows you can return when recursion limits would otherwise be a problem.

You can also use simple window functions to make a sequence from a system table that you know will have a few thousand rows:
SELECT n=ROW_NUMBER() OVER (ORDER BY object_id) FROM sys.all_columns
Or if you just need a few thousand rows to hang randomly generated data from, you don't even need the function.

All a bit hacky compared to some DBs that have sequence support, but useful for generating test data or numbers tables where that isn't supported.
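sys.all_columns is SQL Server-specific, but ROW_NUMBER() itself is widely supported (SQLite since 3.25, for instance). A sketch via Python's sqlite3, using a made-up throwaway table as a stand-in for a system table you know has enough rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in table: in SQL Server you'd number an existing big catalog view
# like sys.all_columns instead of creating anything.
conn.execute("CREATE TABLE stuff (name TEXT)")
conn.executemany("INSERT INTO stuff VALUES (?)", [("a",), ("b",), ("c",)])
rows = conn.execute(
    "SELECT ROW_NUMBER() OVER (ORDER BY name) AS n, name FROM stuff"
).fetchall()
print(rows)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```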
If you need to create test data with complex business logic, referential integrity and constraints, we've been working on a declarative data generator that is built exactly for this: https://github.com/openquery-io/synth
Needs a little Groovy but very convenient for generating random (or non-random) data.
While this looks like a good way to generate simple data, practical applications are more involved.
E.g. I have a table structure but not a lot of rows in it, so I can use this to get a lot of rows in and check how fast queries get processed?
You could create a new database and restore a production backup instead of using fake data but that might not be allowed or require some kind of approval due to rules protecting the privacy of customers or employees.
The other is for populating "numbers tables" which can be very useful in reporting contexts and/or dealing with certain gaps/islands type problems.
You can even use the techniques dynamically rather than a stored numbers table, if you need it in a DB that doesn't have one and you don't have schema permissions to create one, though this is generally less efficient.
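As a sketch of the gaps use case: the numbers can be generated dynamically with a recursive CTE (no schema permissions needed) and anti-joined against the data to list the missing ids. Table and column names here (orders, id) are invented for illustration, shown via Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (4,), (7,)])
# Dynamic numbers "table" spanning the data's id range,
# anti-joined against the data to surface the gaps.
gaps = [r[0] for r in conn.execute("""
    WITH RECURSIVE nums(n) AS (
        SELECT (SELECT MIN(id) FROM orders)
        UNION ALL
        SELECT n + 1 FROM nums WHERE n < (SELECT MAX(id) FROM orders)
    )
    SELECT n FROM nums
    LEFT JOIN orders ON orders.id = nums.n
    WHERE orders.id IS NULL
    ORDER BY n
""")]
print(gaps)  # [3, 5, 6]
```

A stored numbers table would replace the CTE with a plain table scan, which is generally cheaper.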
select top 1000000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) from sys.objects a, sys.objects b
or you can use INFORMATION_SCHEMA, which is more portable across different RDBMS engines.
A much more performant and naturally relational way of generating data than looping recursively.
Gotta select row number on some big enough table!
A table of which DBMSes support this would be useful.
[1]: http://dcx.sap.com/index.html#1001/en/dbrfen10/rf-select-sta...
[2]: http://dcx.sap.com/index.html#sqla170/en/html/8190aea36ce210...
Microsoft SQL Server, Firebird 2.1, PostgreSQL, SQLite, IBM Informix, CUBRID, MariaDB and MySQL
spark.sparkContext.range(1_000_000, numSlices=200).take(20)

select dbms_random.value from dual connect by level < 1000001;
edit: 1,000,001 as level starts with 1