Code is Apache 2 at https://github.com/clickHouse/clickpy
The downloads for Python modules are actually available in BigQuery, with a row for every package download in the world. Last time I checked, it was the largest BigQuery public dataset at almost 700 billion rows. Wanting to do some serious analytics led to a few frustrations though:
- Query speed: BigQuery is great for complex SQL, less so for fast analytics.
- Cost :) especially as I wanted to offer this for free.
Knowing that ClickHouse excels at this sort of problem (full disclosure: I'm a ClickHouse employee), I set about exporting the data to GCS and importing it into ClickHouse. A weekend learning Next.js and React, plus some help from a designer friend (thanks Daniel!), and ClickPy was born!
For now the analytics are quite simple, but I plan to enrich them over time. I've also made the cluster public and read-only so users can run the app themselves, and hopefully contribute back.
Finally, I'm planning to keep the dataset up to date daily (maybe more often in the future). It's proven a useful test case for the ClickHouse dev team and has already uncovered a few issues in the core db.
Contributions are welcome, and let me know if you find it useful for digging into the downloads of your own package.
Cheers, GingerWizard