In sqlite, this is just:
.mode csv
.import data.csv table
>When .import is run, its treatment of the first input row depends upon whether the target table already exists. If it does not exist, the table is automatically created and the content of the first input row is used to set the name of all the columns in the table.[0]
[0] https://www.sqlite.org/cli.html#importing_files_as_csv_or_ot...
I find it slightly annoying to have to switch mode back to something reasonable again, since mode impacts query results as well as imports.
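If you'd rather skip the CLI mode-switching dance entirely, here is a rough Python equivalent of what `.import` does with a fresh table (first row becomes the column names). A minimal sketch using only the stdlib; the function and argument names are made up:

```python
import csv
import sqlite3

def import_csv(db_path, csv_path, table):
    """Load a CSV into a SQLite table, using the header row as column names
    (roughly what the CLI's `.import` does for a nonexistent table)."""
    con = sqlite3.connect(db_path)
    with open(csv_path, newline="") as f:
        rows = csv.reader(f)
        header = next(rows)  # first input row names the columns
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" * len(header))
        con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    con.commit()
    return con
```

Note everything lands as TEXT (no declared column types), which matches the CLI's behavior for auto-created tables.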
Despite doing this every few weeks, I can never remember what the commands are! Zui might improve this workflow for me a bit. Worth a shot!
Hm, a hassle is not always the same hassle for everyone. Personally, I like SQL and I enjoy the power it can bring to bear on all kinds of data. As others wrote, working with sqlite for local data processing is a tool I do not want to miss, alongside all of the great Unix coreutils and their (mostly GNU) ecosystem.
For example, the "all the cities in Iran" query.
Import-Csv GeoLite2-City-Locations-en.csv |
Where-Object {$_.country_name -eq "Iran"} |
Select-Object -ExpandProperty city_name
You could probably find modules to help with the IP-aware stuff like `cidr_match`, but the real place where it would probably fall over is performance when joining, since you'd probably just be doing O(n*m) nested-loop scans. If you want to avoid SQL, it's really hard to beat a "data frame" data structure for tabular data processing, including things like joins.
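For the IP-aware part, Python's stdlib `ipaddress` can stand in for a `cidr_match`-style predicate. A minimal sketch (the function names here are made up, not Zed's actual API); note the comprehension below is exactly the O(n*m) nested scan described above:

```python
import ipaddress

def cidr_match(network, addr):
    """Rough stand-in for a cidr_match(network, addr) predicate."""
    return ipaddress.ip_address(addr) in ipaddress.ip_network(network)

def join_by_cidr(addrs, networks):
    """Naive join: every address checked against every network, i.e. O(n*m)."""
    return [(a, n) for a in addrs for n in networks if cidr_match(n, a)]
```

A real geo/IP index would avoid the full pairwise scan, which is where this approach stops scaling.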
And geo indexes are no joke. Using them has made 8hr long SQL queries take seconds.
Currently mangling a 4 GB file and working with APIs that use existing data columns to provide output. It's a great tool.
Honestly, being able to use GREL/Clojure/Python inside it to clean up and mangle data makes it the Swiss Army knife of data segmentation/cleanup.
- drag the 2 CSV files onto the canvas
- click 'Join'
- select the 2 columns to join
This can be done in the terminal as well.
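One way to do the same join from the terminal is a short script that builds a hash index on one file and streams the other through it, so it stays one pass per file rather than a nested scan. A sketch with stdlib `csv` only; the file and column names are placeholders:

```python
import csv

def join_csv(left_path, right_path, key):
    """Inner-join two CSV files on a shared column via a hash index."""
    # Index the right-hand file by the join key.
    with open(right_path, newline="") as f:
        index = {}
        for row in csv.DictReader(f):
            index.setdefault(row[key], []).append(row)
    # Stream the left-hand file, emitting merged rows on key matches.
    out = []
    with open(left_path, newline="") as f:
        for row in csv.DictReader(f):
            for match in index.get(row[key], []):
                out.append({**match, **row})
    return out
```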
tl;dr: [de]serialization is your bottleneck; after that it's general data processing. both are wasting insane levels of cpu cycles. network and disk, when accessed linearly, are free.
i remember first looking into this when ec2 i3 came out, only more so since. lambda for burst cpu capacity when you can’t wait 30s for ec2 spot is interesting too.
https://nathants.com/posts/performant-batch-processing-with-...
More composable than SQL. I have plenty of utility functions wrapped up in libraries (far easier than stored procedures). The code is far easier to step through with a debugger (thanks to the composability).