Minimally Invasive (and More Accurate) Analytics: GoAccess and Athena/SQL (opens in new tab)

(brandur.org)

85 pointsChris9115y ago33 comments

33 comments

I have always used GoAccess on my blog (https://2byt.es) which gets very modest traffic because I don't post much and don't advertise outside of my few twitter followers. Privacy has always been a core principle of mine.

I've found that over time, crawlers drown out the numbers of actual visitors but I find GoAccess hard to use to get any meaningful data from when interesting things do happen.

Can anyone suggest a way I can do something similar to this without relying on a service I don't host (and without having to write parsers into a SQL or similar DB by hand)?

tstack5y ago

lnav[0] can take care of parsing the log files and displaying them in a TUI. There's also a SQLite interface for doing queries. However, you'll need to build the filters/queries yourself, there aren't any built-in ones at the moment.

[0] - https://lnav.org

alias_neo5y ago

Perfect, thank you, this is right up my street!

I host a static site for my blog, using Hugo so "no server" etc is exactly what I need, and writing the filters/queries myself leaves me in control of getting what I need out of them.

joshyi5y ago

I'm waiting as well for the ability to use filters directly from goaccess. Hope they get to it soon! https://github.com/allinurl/goaccess/issues/117

marvinblum5y ago

https://pirsch.io/ I'm one of the founders :) Check my post above (or below, who knows on hn?) for more details, or ask me.

[Edit] Sorry, I did read "which I don't host" instead of the other way around. You can check out the open-source core library, that might work for you if you put in some work.

mrwnmonm5y ago

Wow, Go support, thanks.

conradfr5y ago

There is the --ignore-crawlers argument, on my modest projects it seems efficient but I've not look at it too precisely.

alias_neo5y ago

I've always had this flag in my run script but find I still have huge amounts of crawler traffic. I might need to look at that again.

e12e5y ago

For a long time I've thought about scavenging the robot identifiers from Matomo, née Piwik - so one might leverage the hive mind in updating robot identifiers, and use it to strip plain access logs for easier use with tools like goaccess..

herunan5y ago

I use Cloudflare Web Analytics. Since I use Cf, I thought: why not utilize their analytics? Anonymous, no cookie, no fingerprinting and no localStorage.

Edit: also, no JS if you have Pro (20usd/mo).

mrwnmonm5y ago

I read that DNS analytics is not accurate too.

1 more reply

megous5y ago

Hmm, it ocured to me that you can probably get a nice list of robot user-agents by querying all UAs that accessed the robots.txt file. I don't think normal browsers touch that file.

Also a thing to do on the cheap, if you want more usable logs is to do JSON logging[1] (one object per line). This is trivial to import into PostgreSQL and also trivial to query via tools like jq, as is.

[1] Example: https://stackoverflow.com/questions/25049667/how-to-generate...

heipei5y ago

Logging JSON directly from nginx is what I currently do, and then the log output is ingested straight into ElasticSearch. One neat thing you can do is also log return headers from an upstream HTTP server, such as a username for example or any application-specific piece of data. That way you can interleave your HTTP access logs with application data and have everything available for querying in one index.

marvinblum5y ago

I've found the same issue. A lot of traffic will get blocked if you use a simple JavaScript integration. The solution is (obviously) to track from the backend and provide a simple dashboard for it. I've started building a library [0] written in Go, which I could integrate into my website and until the end of last year, it became a product (in beta right now) called Pirsch [1]. We offer a JS integration to onboard customers more easily, but one of the main reasons we build it is, that you can use it from your backend through our API [2]. We plan to add more SDKs and plugins (Wordpress, ...) to make the integration easier, but it should be fairly simple already.

I would love to hear feedback, as we plan to fully release it soon :)

[0] https://github.com/pirsch-analytics/pirsch

[1] https://pirsch.io/

[2] https://docs.pirsch.io/get-started/backend-integration/

[Edit]

I forgot to mention my website, which I initially created Pirsch for. The article I wrote about the issue and my solution is here: https://marvinblum.de/blog/server-side-tracking-without-cook...

fatsdomino0015y ago

Is it possible to integrate pirsch into a heroku deployment?

marvinblum5y ago

I haven't worked with heroku yet, but if you can make an API request, yes. You can read about how that works here: https://docs.pirsch.io/api-sdks/api/

1 more reply

twotwotwo5y ago

I want to second that plug for Athena for ad-hoc analysis. (If you're hosting your own stuff and at the scale where it'd be useful, there's Presto/Hive, which Athena is based on, and/or at Trino, the Presto fork maintained by some of its initial developers.)

It was useful for me when tweaking spam/bot detection rules a while ago; if I could roughly describe a rule in a query, I could back-test it on old traffic and follow up on questionable-looking results (e.g. what other requests did this IP make around the time of the suspicious ones?). We also used Athena on a project looking into performance, and on network flow logs. The lack of recurring charges for an always-on cluster makes it great for occasional use like that.

You can use what the docs call "partition projection" to efficiently limit the date range of logs to look at (https://docs.aws.amazon.com/athena/latest/ug/partition-proje...), so it was free-ish to experiment with a query on the last couple days of data before looking further back.

More generally, Athena/Presto/Hive support various data sources and formats (including applying regexps to text). Compressed plain-text formats like ALB logs can already be surprisingly cheap to store/scan. If you're producing/exporting data, it's worth looking into how these tools "like" to receive it--you may be able to use a more compact columnar format (Parquet or ORC) or take advantage of partitioning/bucketing (https://docs.aws.amazon.com/athena/latest/ug/partitions.html, https://trino.io/blog/2019/05/29/improved-hive-bucketing.htm...) for more efficient querying later.

As the blog post notes, usability was...imperfect, especially during initial setup. Error messages sometimes point at one of the first few tokens of the SQL, nowhere near the mistake, and there are lots of knobs to tweak, some controlled by 'magical.dotted.names.in.strings'. CLIs were sometimes easier than the GUI. But you can get a lot out of it once you've got it working!

bitsondatadev5y ago

One quick detail on the Trino description is that not only are some of the initial developers but all of the creators and the majority of contributors (https://github.com/prestodb/presto/graphs/contributors?from=...) and still have contributed the majority of the code in both Presto (https://github.com/prestodb/presto/graphs/contributors)/Trin... (https://github.com/trinodb/trino/graphs/contributors).

To really jump into this, take a look at https://trino.io/blog/2020/12/27/announcing-trino.html.

A few more stats and info:

Trino commits: 22,383 Presto commits: 18,582

Trino slack members: 3,603 Presto slack members: 1,575

Trino supports iceberg: https://trino.io/docs/current/connector/iceberg.html HDP3 Support: https://github.com/trinodb/trino/issues/1218

Trino has addressed a critical security vulnerability that still exists in Presto: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-1508...

Give our repo a star if you have a sec: https://github.com/trinodb/trino/blob/master/.github/star.pn...

twotwotwo5y ago

Yeah, when I was looking into this stuff it was sad to see the blog post about why Trino had to switch names. Unsurprisingly given the history, the Trino blog was the only place I could find written-up details about things like how to name files so they work with the new version of bucketing. The new features look really neat and hope the project has (continues to have) success!

3 more replies

blakesterz5y ago

Interesting, there's quite a big number of people running ad blockers!

"Both Google Analytics and Goatcounter agreed that I got ~13k unique visitors across the couple days where it spiked. GoAccess and my own custom Athena queries agreed that it was more like ~33k unique visitors, giving me a rough ratio of 2.5x more visitors than reported by analytics, and meaning that about 60% of my readers are using an adblocker."

joshyi5y ago

We use goaccess against a pretty busy centralized log server and has worked really well for years. We don't have to worry about JS and that's always a plus. I personally like how it follows the unix philosophy.

cube22225y ago

With OctoSQL[0], as I wanted to see how people are using it, I literally just set up an http endpoint which received a JSON request on each CLI invocation (you can see the data sent in the code, it's open source) and appended it to an on-disk JSON file.

Then I used... OctoSQL to analyze it!

Nit: The project may seem dead for a few months, but I'm just in midst of a rewrite (on a branch) which gets rid of wrong decisions and makes it easier to embed in existing applications.

[0]:https://github.com/cube2222/octosql

ruuda5y ago

I do a similar thing for my site, but instead of renting a database cluster in the cloud, I wrote a small Python script that converts nginx log files into a SQLite database. https://github.com/ruuda/sqlog

tobilg5y ago

For a readymade stack similar to what’s described in the article to self-host in your own AWS account have a look at https://ownstats.cloud

j / k navigate · click thread line to collapse

33 comments

alias_neo5y ago

I've found that over time, crawlers drown out the numbers of actual visitors but I find GoAccess hard to use to get any meaningful data from when interesting things do happen.

Can anyone suggest a way I can do something similar to this without relying on a service I don't host (and without having to write parsers into a SQL or similar DB by hand)?

tstack5y ago

[0] - https://lnav.org

alias_neo5y ago

Perfect, thank you, this is right up my street!

I host a static site for my blog, using Hugo so "no server" etc is exactly what I need, and writing the filters/queries myself leaves me in control of getting what I need out of them.

joshyi5y ago

I'm waiting as well for the ability to use filters directly from goaccess. Hope they get to it soon! https://github.com/allinurl/goaccess/issues/117

marvinblum5y ago

https://pirsch.io/ I'm one of the founders :) Check my post above (or below, who knows on hn?) for more details, or ask me.

[Edit] Sorry, I did read "which I don't host" instead of the other way around. You can check out the open-source core library, that might work for you if you put in some work.

mrwnmonm5y ago

Wow, Go support, thanks.

conradfr5y ago

There is the --ignore-crawlers argument, on my modest projects it seems efficient but I've not look at it too precisely.

alias_neo5y ago

I've always had this flag in my run script but find I still have huge amounts of crawler traffic. I might need to look at that again.

e12e5y ago

herunan5y ago

I use Cloudflare Web Analytics. Since I use Cf, I thought: why not utilize their analytics? Anonymous, no cookie, no fingerprinting and no localStorage.

Edit: also, no JS if you have Pro (20usd/mo).

mrwnmonm5y ago

I read that DNS analytics is not accurate too.

1 more reply

megous5y ago

Hmm, it ocured to me that you can probably get a nice list of robot user-agents by querying all UAs that accessed the robots.txt file. I don't think normal browsers touch that file.

[1] Example: https://stackoverflow.com/questions/25049667/how-to-generate...

heipei5y ago

marvinblum5y ago

I would love to hear feedback, as we plan to fully release it soon :)

[0] https://github.com/pirsch-analytics/pirsch

[1] https://pirsch.io/

[2] https://docs.pirsch.io/get-started/backend-integration/

[Edit]

I forgot to mention my website, which I initially created Pirsch for. The article I wrote about the issue and my solution is here: https://marvinblum.de/blog/server-side-tracking-without-cook...

fatsdomino0015y ago

Is it possible to integrate pirsch into a heroku deployment?

marvinblum5y ago

I haven't worked with heroku yet, but if you can make an API request, yes. You can read about how that works here: https://docs.pirsch.io/api-sdks/api/

1 more reply

twotwotwo5y ago

bitsondatadev5y ago

To really jump into this, take a look at https://trino.io/blog/2020/12/27/announcing-trino.html.

A few more stats and info:

Trino commits: 22,383 Presto commits: 18,582

Trino slack members: 3,603 Presto slack members: 1,575

Trino supports iceberg: https://trino.io/docs/current/connector/iceberg.html HDP3 Support: https://github.com/trinodb/trino/issues/1218

Trino has addressed a critical security vulnerability that still exists in Presto: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-1508...

Give our repo a star if you have a sec: https://github.com/trinodb/trino/blob/master/.github/star.pn...

twotwotwo5y ago

3 more replies

blakesterz5y ago

Interesting, there's quite a big number of people running ad blockers!

joshyi5y ago

cube22225y ago

Then I used... OctoSQL to analyze it!

Nit: The project may seem dead for a few months, but I'm just in midst of a rewrite (on a branch) which gets rid of wrong decisions and makes it easier to embed in existing applications.

[0]:https://github.com/cube2222/octosql

ruuda5y ago

tobilg5y ago

For a readymade stack similar to what’s described in the article to self-host in your own AWS account have a look at https://ownstats.cloud

j / k navigate · click thread line to collapse