clickhouse local --file medias.csv --query "SELECT edito, count() AS count from table group by all order by count FORMAT PrettyCompact"
┌─edito──────┬─count─┐
│ agence     │     1 │
│ agrégateur │    10 │
│ plateforme │    14 │
│ individu   │    30 │
│ media      │   423 │
└────────────┴───────┘
With clickhouse-local, I can do a lot more, since I can leverage the full power of ClickHouse.

clickhouse local
ClickHouse local version 25.4.1.1143 (official build).
:)
There are quite a few benefits to using clickhouse-local, since ClickHouse can simply do a lot more than DuckDB. One example is handling compressed files: ClickHouse supports formats including zstd, lz4, snappy, gz, xz, bz2, zip, tar, and 7zip.

clickhouse local --query "SELECT count() FROM file('top-1m-2018-01-10.csv.zip :: *.csv')"
1000000
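ClickHouse's `archive.zip :: *.csv` glob is doing real work in that one-liner. For comparison, here is a rough stdlib-Python sketch of the same zip-member row count (the function name and behavior are mine, not from the thread; it counts every line, headers included):

```python
import csv
import io
import zipfile

def count_csv_rows_in_zip(zip_path: str) -> int:
    """Count all rows (headers included) across every .csv member of a zip."""
    total = 0
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith(".csv"):
                # zf.open yields bytes; TextIOWrapper decodes for csv.reader.
                with zf.open(name) as member:
                    reader = csv.reader(io.TextIOWrapper(member, encoding="utf-8"))
                    total += sum(1 for _ in reader)
    return total
```

It streams each member rather than extracting to disk, which is the main point of the ClickHouse trick too.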
Also, clickhouse-local is much more efficient at handling big CSV files.[0]

Translating the examples from the README, having read the file with:
$medias = Get-Content .\medias.csv | ConvertFrom-Csv
Previewing the file in the terminal:
xan view medias.csv
$medias | Format-Table
Reading a flattened representation of the first row:
xan flatten -c medias.csv
$medias | Format-List
Searching for rows:
xan search -s outreach internationale medias.csv | xan view
$medias | Where-Object { $_.outreach -eq "internationale" } | Format-Table
Selecting some columns:
xan select foundation_year,name medias.csv | xan view
$medias | Select-Object -Property foundation_year, name | Format-Table
Sorting the file:
xan sort -s foundation_year medias.csv | xan view -s name,foundation_year
$medias | Sort-Object -Property foundation_year | Select-Object -Property name, foundation_year | Format-Table
Deduplicating the file on some column:
# Some medias of our corpus have the same ids on mediacloud.org
xan dedup -s mediacloud_ids medias.csv | xan count && xan count medias.csv
$medias | Select-Object -ExpandProperty mediacloud_ids -Unique | Measure-Object; $medias | Measure-Object -Property mediacloud_ids
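The same dedup-on-a-column step can be sketched in plain Python as well, keeping the first row seen per id (the function name is mine; medias.csv and the mediacloud_ids column are assumed from the thread):

```python
import csv

def dedup_by(path: str, column: str) -> list[dict]:
    """Keep only the first row seen for each distinct value of `column`."""
    seen, rows = set(), []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row[column] not in seen:
                seen.add(row[column])
                rows.append(row)
    return rows
```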
Computing frequency tables:
xan frequency -s edito medias.csv | xan view
$medias | Group-Object -Property edito | Sort-Object -Property Count -Descending
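For anyone with neither shell handy, the frequency-table step is also a few lines of stdlib Python (a sketch, assuming a medias.csv with an edito column as in the thread):

```python
import csv
from collections import Counter

def frequency(path: str, column: str) -> list[tuple[str, int]]:
    """Frequency table for one column, most common first."""
    with open(path, newline="", encoding="utf-8") as f:
        counts = Counter(row[column] for row in csv.DictReader(f))
    return counts.most_common()
```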
It's probably orders of magnitude slower, and of course, plotting graphs and so on gets tricky. But for the simple type of analysis I typically do, it's fast enough, I don't need to learn an extra tool, and the auto-completion of column/property names is very convenient.

$medias = open .\medias.csv

The above is the initial read and format into a table. I'm currently on my phone so I can't go through all the examples, but knowing both PS and nu, nu has the better syntax.
EDIT:
Get data and view in table:
let medias = http get https://github.com/medialab/corpora/raw/master/polarisation/medias.csv | from csv
$medias
Get headers: $medias | columns
Get count of rows: $medias | length
Get flattened, slightly more convoluted (caveat: there might be a better way): $medias | each {print $in}
Search rows: $medias | where outreach == 'internationale'
Select columns: $medias | select foundation_year name
Sort file: $medias | select foundation_year name | sort-by foundation_year
Dedup based on column: $medias | uniq-by mediacloud_ids
Computing frequency and histogram: $medias | histogram edito

It looks like xsv and xan are in the "csvkit but faster" niche, which is nice, but now I must learn another set of commands.
And there are now many more recent utilities called csvtool, including a Perl and a Python one.
For example, being able to define data types for each column and mark certain columns as required, then run the tool as a validator and get the errors back as an array: that would be amazing!
Coming from a dev who’s just over processing CSV files back into my apps.
Could definitely be done as a small bash script.
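A sketch of what such a validator could look like, written in Python here for readability; the schema shape, column names, and function name are all hypothetical, not an existing tool:

```python
import csv

# Hypothetical schema: column name -> (required?, parser). Parsers raise
# ValueError on bad input, which is how type errors are detected.
SCHEMA = {
    "name": (True, str),
    "foundation_year": (False, int),
}

def validate_csv(path: str, schema=SCHEMA) -> list[str]:
    """Return a list of error strings; an empty list means the file is valid."""
    errors = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Check that every required column is present in the header.
        missing = [c for c, (req, _) in schema.items()
                   if req and c not in (reader.fieldnames or [])]
        errors += [f"missing required column: {c}" for c in missing]
        # Line 1 is the header, so data rows start at line 2.
        for lineno, row in enumerate(reader, start=2):
            for col, (req, parse) in schema.items():
                value = row.get(col, "")
                if req and not value:
                    errors.append(f"line {lineno}: empty required field {col!r}")
                elif value:
                    try:
                        parse(value)
                    except ValueError:
                        errors.append(f"line {lineno}: bad {col!r} value {value!r}")
    return errors
```

Collecting errors into a list rather than failing on the first one matches the "take the errors as an array" wish above.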
There are a lot of tools one can use on the Linux CLI to work with CSV, but many of them have become unmaintained, or have terrible docs, or have really awkward usage (looking at you, "yq").
I know someone who uses csvtk (Golang), but haven't tried it yet. https://github.com/shenwei356/csvtk