In sqlite, this is just:
.mode csv
.import data.csv table
>When .import is run, its treatment of the first input row depends upon whether the target table already exists. If it does not exist, the table is automatically created and the content of the first input row is used to set the name of all the columns in the table.[0]
[0] https://www.sqlite.org/cli.html#importing_files_as_csv_or_ot...
I find it slightly annoying to have to switch mode back to something reasonable again, since mode impacts query results as well as imports.
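If you'd rather skip the CLI mode-switching dance entirely, here is a rough Python equivalent of what `.import` does with a fresh table (first row becomes the column names). A minimal sketch using only the stdlib; the function and argument names are made up:

```python
import csv
import sqlite3

def import_csv(db_path, csv_path, table):
    """Load a CSV into a SQLite table, using the header row as column names
    (roughly what the CLI's `.import` does for a nonexistent table)."""
    con = sqlite3.connect(db_path)
    with open(csv_path, newline="") as f:
        rows = csv.reader(f)
        header = next(rows)  # first input row names the columns
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" * len(header))
        con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    con.commit()
    return con
```

Note everything lands as TEXT (no declared column types), which matches the CLI's behavior for auto-created tables.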
Despite doing this every few weeks, I can never remember what the commands are! Zui might improve this workflow for me a bit. Worth a shot!
Hm, a hassle is not always the same hassle for everyone. Personally, I like SQL and I enjoy the power it can bring to bear on all kinds of data. As others wrote, working with sqlite for local data processing is a tool I do not want to miss, alongside all of the great Unix coreutils and their (mostly GNU) ecosystem.
For example, the "all the cities in Iran" query.
Import-Csv GeoLite2-City-Locations-en.csv |
Where-Object {$_.country_name -eq "Iran"} |
Select-Object -ExpandProperty city_name
You could probably find modules to help with the IP-aware stuff like `cidr_match`, but the real place where it would probably fall over is performance when joining, since you'd probably just be doing O(n*m) nested-loop scans. If you want to avoid SQL, it's really hard to beat a "data frame" data structure for tabular data processing, including things like joins.
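For the IP-aware part, Python's stdlib `ipaddress` can stand in for a `cidr_match`-style predicate. A minimal sketch (the function names here are made up, not Zed's actual API); note the comprehension below is exactly the O(n*m) nested scan described above:

```python
import ipaddress

def cidr_match(network, addr):
    """Rough stand-in for a cidr_match(network, addr) predicate."""
    return ipaddress.ip_address(addr) in ipaddress.ip_network(network)

def join_by_cidr(addrs, networks):
    """Naive join: every address checked against every network, i.e. O(n*m)."""
    return [(a, n) for a in addrs for n in networks if cidr_match(n, a)]
```

A real geo/IP index would avoid the full pairwise scan, which is where this approach stops scaling.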
And geo indexes are no joke. Using them has made 8hr long SQL queries take seconds.
Currently mangling a 4 GB file and working with APIs that use existing data columns to provide output. It's a great tool.
Honestly, being able to use GREL/Clojure/Python inside it to clean up and mangle data makes it the Swiss Army knife of data segmentation/cleanup.
- drag the 2 CSV files onto the canvas
- click 'Join'
- select the 2 columns to join
This can be done in the terminal as well.
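One way to do the same join from the terminal is a short script that builds a hash index on one file and streams the other through it, so it stays one pass per file rather than a nested scan. A sketch with stdlib `csv` only; the file and column names are placeholders:

```python
import csv

def join_csv(left_path, right_path, key):
    """Inner-join two CSV files on a shared column via a hash index."""
    # Index the right-hand file by the join key.
    with open(right_path, newline="") as f:
        index = {}
        for row in csv.DictReader(f):
            index.setdefault(row[key], []).append(row)
    # Stream the left-hand file, emitting merged rows on key matches.
    out = []
    with open(left_path, newline="") as f:
        for row in csv.DictReader(f):
            for match in index.get(row[key], []):
                out.append({**match, **row})
    return out
```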
tl;dr: [de]serialization is your bottleneck; after that it's general data processing. both are wasting insane levels of cpu cycles. network and disk, when accessed linearly, are free.
i remember first looking into this when ec2 i3 came out, only more so since. lambda for burst cpu capacity when you can’t wait 30s for ec2 spot is interesting too.
https://nathants.com/posts/performant-batch-processing-with-...
More composable than SQL. I have plenty of utility functions wrapped up in libraries (far easier than stored procedures). The code is far easier to step through with a debugger (thanks to the composability).