clickhouse local --file medias.csv --query "SELECT edito, count() AS count from table group by all order by count FORMAT PrettyCompact"
┌─edito──────┬─count─┐
│ agence     │     1 │
│ agrégateur │    10 │
│ plateforme │    14 │
│ individu   │    30 │
│ media      │   423 │
└────────────┴───────┘
With clickhouse-local, I can do a lot more, since I can leverage the full power of ClickHouse.

clickhouse local
ClickHouse local version 25.4.1.1143 (official build).
:)
There are quite a few benefits to using clickhouse-local, since ClickHouse can simply do a lot more than DuckDB. One example is handling compressed files: ClickHouse supports formats including zstd, lz4, snappy, gz, xz, bz2, zip, tar, and 7zip.

clickhouse local --query "SELECT count() FROM file('top-1m-2018-01-10.csv.zip :: *.csv')"
1000000
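ClickHouse's `archive.zip :: *.csv` glob is doing real work in that one-liner. For comparison, here is a rough stdlib-Python sketch of the same zip-member row count (the function name and behavior are mine, not from the thread; it counts every line, headers included):

```python
import csv
import io
import zipfile

def count_csv_rows_in_zip(zip_path: str) -> int:
    """Count all rows (headers included) across every .csv member of a zip."""
    total = 0
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith(".csv"):
                # zf.open yields bytes; TextIOWrapper decodes for csv.reader.
                with zf.open(name) as member:
                    reader = csv.reader(io.TextIOWrapper(member, encoding="utf-8"))
                    total += sum(1 for _ in reader)
    return total
```

It streams each member rather than extracting to disk, which is the main point of the ClickHouse trick too.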
Also, clickhouse-local is much more efficient at handling big CSV files.[0]

Translating the examples from the README, having read the file with:
$medias = Get-Content .\medias.csv | ConvertFrom-Csv
Previewing the file in the terminal:
xan view medias.csv
$medias | Format-Table
Reading a flattened representation of the first row:
xan flatten -c medias.csv
$medias | Format-List
Searching for rows:
xan search -s outreach internationale medias.csv | xan view
$medias | Where-Object { $_.outreach -eq "internationale" } | Format-Table
Selecting some columns:
xan select foundation_year,name medias.csv | xan view
$medias | Select-Object -Property foundation_year, name | Format-Table
Sorting the file:
xan sort -s foundation_year medias.csv | xan view -s name,foundation_year
$medias | Sort-Object -Property foundation_year | Select-Object -Property name, foundation_year | Format-Table
Deduplicating the file on some column:
# Some medias of our corpus have the same ids on mediacloud.org
xan dedup -s mediacloud_ids medias.csv | xan count && xan count medias.csv
$medias | Select-Object -ExpandProperty mediacloud_ids -Unique | Measure-Object; $medias | Measure-Object -Property mediacloud_ids
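The same dedup-on-a-column step can be sketched in plain Python as well, keeping the first row seen per id (the function name is mine; medias.csv and the mediacloud_ids column are assumed from the thread):

```python
import csv

def dedup_by(path: str, column: str) -> list[dict]:
    """Keep only the first row seen for each distinct value of `column`."""
    seen, rows = set(), []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row[column] not in seen:
                seen.add(row[column])
                rows.append(row)
    return rows
```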
Computing frequency tables:
xan frequency -s edito medias.csv | xan view
$medias | Group-Object -Property edito | Sort-Object -Property Count -Descending
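For anyone with neither shell handy, the frequency-table step is also a few lines of stdlib Python (a sketch, assuming a medias.csv with an edito column as in the thread):

```python
import csv
from collections import Counter

def frequency(path: str, column: str) -> list[tuple[str, int]]:
    """Frequency table for one column, most common first."""
    with open(path, newline="", encoding="utf-8") as f:
        counts = Counter(row[column] for row in csv.DictReader(f))
    return counts.most_common()
```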
It's probably orders of magnitude slower, and of course, plotting graphs and so on gets tricky. But for the simple type of analysis I typically do, it's fast enough, I don't need to learn an extra tool, and the auto-completion of column/property names is very convenient.

$medias = open .\medias.csv

The above is the initial read and format into a table. I'm currently on my phone so I can't go through all the examples, but knowing both PS and nu, nu has the better syntax.
EDIT:
Get data and view in table:
let medias = http get https://github.com/medialab/corpora/raw/master/polarisation/medias.csv | from csv
$medias
Get headers: $medias | columns
Get count of rows: $medias | length
Get flattened, slightly more convoluted (caveat: there might be a better way): $medias | each {print $in}
Search rows: $medias | where outreach == 'internationale'
Select columns: $medias | select foundation_year name
Sort file: $medias | select foundation_year name | sort-by foundation_year
Dedup based on column: $medias | uniq-by mediacloud_ids
Computing frequency and histogram: $medias | histogram edito

It looks like xsv and xan are in the "csvkit but faster" niche, which is nice, but now I must learn another set of commands.
And there are now many more recent utilities called csvtool, including a Perl and a Python one.
For example, being able to define data types for each column and mark certain columns as required, then run the tool as a validator and get the errors back as an array: that would be amazing!
Coming from a dev who’s just over processing CSV files back into my apps.
Could definitely be done as a small bash script.
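A sketch of what such a validator could look like, written in Python here for readability; the schema shape, column names, and function name are all hypothetical, not an existing tool:

```python
import csv

# Hypothetical schema: column name -> (required?, parser). Parsers raise
# ValueError on bad input, which is how type errors are detected.
SCHEMA = {
    "name": (True, str),
    "foundation_year": (False, int),
}

def validate_csv(path: str, schema=SCHEMA) -> list[str]:
    """Return a list of error strings; an empty list means the file is valid."""
    errors = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Check that every required column is present in the header.
        missing = [c for c, (req, _) in schema.items()
                   if req and c not in (reader.fieldnames or [])]
        errors += [f"missing required column: {c}" for c in missing]
        # Line 1 is the header, so data rows start at line 2.
        for lineno, row in enumerate(reader, start=2):
            for col, (req, parse) in schema.items():
                value = row.get(col, "")
                if req and not value:
                    errors.append(f"line {lineno}: empty required field {col!r}")
                elif value:
                    try:
                        parse(value)
                    except ValueError:
                        errors.append(f"line {lineno}: bad {col!r} value {value!r}")
    return errors
```

Collecting errors into a list rather than failing on the first one matches the "take the errors as an array" wish above.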
There are a lot of tools one can use on the Linux CLI to work with CSV, but many of them have become unmaintained, or have terrible docs, or have really awkward usage (looking at you, "yq").
I know someone who uses csvtk (Golang), but haven't tried it yet. https://github.com/shenwei356/csvtk