Show HN: Git Heat Map – a tool for visualising git repo activity for each file

(github.com)

173 points · jmforsythe · 3y ago · 39 comments

Weidenwalker3y ago

If you think this is useful, you might also like codeatlas.dev and its GitHub Action (https://github.com/codeatlasHQ/codebase-visualizer-action). It currently does not support per-contributor activity, but we put a lot of effort into making the diagrams beautiful to look at, and the basic approach of using treemaps for visualisation seems very similar. In fact, it could be cool to collaborate on this. DM me if interested!

https://codeatlas.dev

toastal3y ago

The OP works on Git. This project only works on Microsoft GitHub.

bilekas3y ago

Are you sure? I don't have a chance to test right now, but from the brief look I had there's nothing specific to GitHub.com. As I see it, it uses the git log file.

Edit: based on your profile I see you're quite anti-Microsoft for some reason. I think one doth protest too much in this case though.


ArchieMaclean3y ago

Looks pretty. I can't scroll on the gallery page on mobile, and tapping anywhere on the page (including the top example) just brings up the navigation menu. Switching to desktop mode works.

arraypad3y ago

This is similar to the Structure dashboard in Repography [1] which is a web app I built, with the added advantage that it's kept up to date automatically so you can embed it in your README.md and not have to keep updating it yourself.

[1] https://repography.com/app/0/strawberry-graphql/strawberry/s...

darknavi3y ago

This is cool. The visualization reminds me of QDirStat/WinDirStat.

Also shout out to the (mostly useless but cool looking) git history visualizer, Gource[0].

[0] https://gource.io/

KronisLV3y ago

> Also shout out to the (mostly useless but cool looking) git history visualizer, Gource[0].

I'd say that Gource is actually pretty useful for figuring out where most of the effort has been concentrated recently! For example, when added to a new project, I might run it against the repo to see which packages in the project have been changed the most in the past month, what people are working on and so on.

basic_banana3y ago

WinDirStat is pretty cool! Thanks for sharing.

pushedx3y ago

This will show the 30 "hottest" files in a repo.

    git log --pretty=format: --name-only | sort | uniq -c | sort -rg | head -n 30
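For anyone who would rather do the counting in Python, here is a rough sketch of the same idea. The function name `hottest_files` and the sample log are made up for illustration. Unlike the shell pipeline, it skips blank lines instead of counting them.

```python
from collections import Counter


def hottest_files(log_text: str, top_n: int = 30) -> list[tuple[str, int]]:
    """Count how often each path appears in `git log --name-only` output."""
    # With --pretty=format: the log holds only file paths and blank lines,
    # so every non-empty line is one "file touched by a commit" event.
    counts = Counter(line for line in log_text.splitlines() if line.strip())
    return counts.most_common(top_n)


# Hypothetical sample standing in for real `git log` output.
sample_log = "src/app.py\nREADME.md\n\nsrc/app.py\n\nsrc/app.py\nsrc/util.py\n"
print(hottest_files(sample_log, top_n=2))
```

To run it on a real repo, pass in the text produced by the `git log --pretty=format: --name-only` command above, for example captured with `subprocess.run(..., capture_output=True, text=True).stdout`.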

thecodrr3y ago

I wonder if this can be done faster using the same approach as git-filter-repo or bfg. 68 minutes for something like this is a long time.

Otherwise looks really cool.

ByronBates3y ago

Computing diffs is what takes large amounts of time as the object database is used intensively along with limited efficiency of object caches.

I couldn't resist and threw `gitoxide` at it, and it turned out to be more than 2x as fast (even though it uses way more CPU to do that, there is definitely room for improvement).

The PR which adds the `db-gen` program: https://github.com/jmforsythe/Git-Heat-Map/pull/6

herrvogel-3y ago

I sometimes think of a similar tool that would give you this information at a more fine-grained level, like functions, scopes, or lines.

jemmyw3y ago

It didn't work for me:

  File "git-database.py", line 263, in <module>
    main()
  File "git-database.py", line 247, in main
    last_commit = handle_commit(cur, lines)
  File "git-database.py", line 178, in handle_commit
    handle_match(cur, matches[i], commit_lines[2+i], fields)
  File "git-database.py", line 197, in handle_match
    p, n = secondary_line.split("|")

  ValueError: not enough values to unpack (expected 2, got 1)

ByronBates3y ago

You could try the alternative version of the database-generator: https://github.com/jmforsythe/Git-Heat-Map/pull/6 - it shouldn't crash.

abhishekbasu3y ago

I wonder if this can somehow be used as a tool for quick visual inspection of architecture compliance. Maybe the size of the box could be selected before display to denote a custom metric. For example, if the size of the box for each file was proportional to the (number_of_lines_edited) * (current_date - file_creation_date), then modularity could demand that the size of the boxes remain small. (pardon the musings of a non software engineer on a Saturday night)
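A minimal sketch of the metric described above, assuming the age is measured in days (the comment does not say which unit). The function name `box_size` and the sample numbers are hypothetical.

```python
from datetime import date


def box_size(lines_edited: int, created: date, today: date) -> int:
    """Size = lines edited multiplied by the file's age in days (assumed unit)."""
    age_days = (today - created).days
    return lines_edited * age_days


# Hypothetical file: 120 edited lines, created 30 days before "today".
print(box_size(120, date(2023, 1, 1), date(2023, 1, 31)))
```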

jmforsytheOP3y ago

The size of the boxes is fairly customisable on the back end. The webpage just needs to be sent the right JSON data, and the only real criterion is that the size of each directory is the sum of the sizes of its children.
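The sum-of-children rule can be sketched roughly like this. The field names `name`, `size`, and `children` are assumptions for illustration and not necessarily the JSON shape the webpage actually expects.

```python
import json


def add_sizes(node: dict) -> int:
    """Set each directory's size to the sum of its children's sizes."""
    if "children" in node:
        # Directory: its size is derived from its children, never set by hand.
        node["size"] = sum(add_sizes(child) for child in node["children"])
    return node["size"]


# Hypothetical tree; file sizes could be any custom per-file metric.
tree = {
    "name": "repo",
    "children": [
        {"name": "README.md", "size": 4},
        {"name": "src", "children": [
            {"name": "app.py", "size": 10},
            {"name": "util.py", "size": 6},
        ]},
    ],
}
add_sizes(tree)
print(json.dumps(tree, indent=2))
```

Any per-file number works as the leaf size, as long as directories are filled in bottom-up like this.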

welder3y ago

WakaTime just released the same feature 4 days ago using IDE activity instead of commits. [0] Scroll to the bottom to see the bubble chart screenshot.

[0] https://wakatime.com/blog/58-chatgpt-prototyped-our-new-feat...

minton3y ago

I’d never heard of WakaTime before. Their Team Leader board looks like a dystopian nightmare.

phaer3y ago

Looks useful on big repositories!

https://github.com/nixos/nixpkgs/ would be a great benchmark for a tool like this :) One of the larger repos on github, close to half a million commits by a large set of contributors to thousands of files.

jeffreygoesto3y ago

If you are an org with a large codebase and want to get many more insights, there is Seerene. It's commercial but takes the idea of repo analysis to the extreme (not affiliated, just a happy user).

https://www.seerene.com/

nickcox3y ago

There's also Adam Tornhill's [0] Codescene. [1]

[0] https://www.amazon.com/Books-Adam-Tornhill/s?rh=n%3A283155%2...

[1] https://codescene.com/

ozim3y ago

Great, this is one of the tools I had somewhere in the back of my head.

It's nice to see which parts of a project change the most, either to decide whether they should be improved or at least to direct quality-improvement efforts to those spots.

hamasho3y ago

I love this, thanks!

It would be nice if the heatmap showed additions, deletions, and possibly modifications separately. I can't think of a suitable visualization method, but seeing them individually would give more insight.

pizza3y ago

This is really cool. Some other things I might want to visualize are this + a recency filter, or this + treesitter support, in order to query e.g. which classes or functions have dominated.

nativecoinc3y ago

I want to do some whitespace refactoring on the files in our project. File activity is useful for that since then I can change files that no one is likely to be touching.

nonethewiser3y ago

How much faster are your benchmarks using WAL mode? https://www.sqlite.org/wal.html
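For reference, switching a SQLite database to WAL mode from Python is a single pragma. This is only a sketch of the setting itself, using a throwaway database file. It is not taken from the project's code.

```python
import os
import sqlite3
import tempfile

# WAL needs a database file on disk; an in-memory database stays in "memory" mode.
db_path = os.path.join(tempfile.mkdtemp(), "example.db")
conn = sqlite3.connect(db_path)

# The pragma returns the journal mode that is actually in effect afterwards.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)
conn.close()
```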

avinassh3y ago

I experimented earlier to push the limits of SQLite inserts and wrote a blog post[0] about it. We can apply some of the learnings here.

I reviewed the OP's code and did some benchmarks; SQLite is not the bottleneck here. The code first generates the commit info from the git log and pipes it to the Python script's stdin [1], and the script reads it one commit at a time in a loop [2]. Each commit's info is then written to SQLite. So, with or without WAL, the time is almost the same.

To confirm my hypothesis, I ran the project without insert calls. On my machine, for cpython, it took 160 seconds and without sqlite inserts 159 ish.

I believe the git log will be fast anyway, so other ways to make it faster would be to read a bunch of commits at once and then do batch inserts. We can also make it run in parallel since each commit info is independent, and we don't need to care about ordering while inserting.

[0] - https://avi.im/blag/2021/fast-sqlite-inserts/

[1] - https://github.com/jmforsythe/Git-Heat-Map/blob/bd9bc22/git-...

[2] - https://github.com/jmforsythe/Git-Heat-Map/blob/bd9bc22/git-...
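A rough sketch of the batch-insert idea, not the project's actual code. The `commits` table, its columns, and the function name `insert_in_batches` are invented for illustration. It only covers batching and does not attempt the parallel part.

```python
import sqlite3
from itertools import islice


def insert_in_batches(conn, rows, batch_size=1000):
    """Insert (hash, author) rows with one executemany call per batch."""
    rows = iter(rows)
    total = 0
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        with conn:  # one transaction per batch, committed on success
            conn.executemany(
                "INSERT INTO commits (hash, author) VALUES (?, ?)", batch
            )
        total += len(batch)
    return total


conn = sqlite3.connect(":memory:")
# Hypothetical schema, assumed only for this example.
conn.execute("CREATE TABLE commits (hash TEXT PRIMARY KEY, author TEXT)")
fake_rows = ((f"{i:040x}", "someone@example.com") for i in range(2500))
inserted = insert_in_batches(conn, fake_rows, batch_size=1000)
count = conn.execute("SELECT COUNT(*) FROM commits").fetchone()[0]
print(inserted, count)
```

Grouping rows this way means one transaction per batch rather than one per commit, which is where batching usually saves time.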

jmforsytheOP3y ago

Unfortunately the commit info is not independent at the moment, and that is mostly due to wanting to track renaming/deletion of files.

nonethewiser3y ago

Nice work, thank you for the analysis

llagerlof3y ago

Add installation procedures.

jmforsytheOP3y ago

What exactly do you mean? The README describes how to run the website, and mentions the only dependency.

aren555553y ago

After cloning, I had to change the perms of the .sh files, install flask via pip and hope for the best. I wasn't sure what Python version to use, I used some 3.x. It would be really nice if this was fully containerized and I could launch in docker/podman.

Also I wasn't sure how to get the colors :shrug:


breck3y ago

Don't have time to install but I would pay $10 in NEAR coin if you can email or post the results of my repos to me (breck7@gmail.com):

https://github.com/breck7/jtree and https://github.com/breck7/pldb

Update: Got the email. Thank you! NEAR sent!

