> Let’s say I use Puppeteer: I can scrape this page a million times with completely wrong headers like "Netscape 8.1". GA purifies this type of malicious attempt; it will probably look at my IP address, figure out that the traffic is all coming from one IP, and decide that "Netscape" is too rare to be considered an actual browser, so it would probably ignore it.
I do a lot of work with GA, and have seen this misconception come up a few times. When it comes to data processing, GA is not intelligent. If you haven't explicitly told it to do something, it isn't doing it. And if you do tell it to do something, that only applies to new data going forward; it makes no attempt to reprocess historical data.
- GA is relatively robust against web scraping simply because most scrapers don't render the page, so the GA code on the page is never executed and no hit is ever sent to Google's servers. If the scraper uses a headless browser, such as Puppeteer, and does render the page, then it will in fact send that hit to GA (first sketch after this list).
- If you've checked the "Exclude bots" view setting[1], GA will apply the IAB Spiders and Bots list to your traffic[2]. That list is nothing more than a deterministic set of user-agent-based filters[3], and anyone is capable of paying for it; Google just gives it to GA users for free via that setting (the second sketch after this list shows the general idea).
- The Exclude Bots setting does nothing beyond that. Scrapers like Puppeteer report, by default, the user agent of the Chromium version they're running, so their hits show up exactly like any legitimate visitor browsing with that version of Chromium (third sketch after this list).
- GA has pretty robust filtering options[4], but you have to create them manually and they don't apply retroactively. You can filter IPs here, and only here: while you can apply reporting filters after the fact on a lot of fields, IP address isn't available as one of them. That makes it really frustrating to retroactively get rid of junk traffic, whether internal or automated/scraping; the best you can do is approximate it by getting creative with fields that make a good proxy. The only exception is GA360/Google Marketing Platform customers, since they can access their clickstream data via BigQuery as part of their subscription. (A filter-creation sketch is the last one after this list.)
- GA's interface will now show you smart-looking notifications like "Filter internal traffic. Hits from your corporate network are showing up in property example.com". It isn't doing anything clever like dynamically cross-referencing the IP address you're browsing the admin area from against the data collected in your GA property. It's literally just triggering that warning because you haven't applied any IP-based filters yet.
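To make the rendering point concrete, here's a minimal sketch (TypeScript, Node 18+ as an ES module, puppeteer installed; the URL is hypothetical) contrasting a plain HTTP fetch, which never executes the page's GA snippet, with a headless-browser render, which does:

    // Sketch only: hypothetical target URL, assumes Node 18+ and puppeteer installed.
    import puppeteer from 'puppeteer';

    const url = 'https://example.com/some-page';

    // 1. A "dumb" scraper: fetches HTML only. The GA snippet in the response is
    //    never executed, so no hit ever reaches Google's servers.
    const html = await (await fetch(url)).text();
    console.log('Got %d bytes of HTML; GA never saw us', html.length);

    // 2. A headless browser: actually renders the page, so the GA snippet runs and
    //    fires a collect request just like a real visitor's browser would.
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Log requests to Google's collection endpoints (UA and GA4 paths) to see the hit.
    page.on('request', (req) => {
      if (req.url().includes('google-analytics.com/collect') ||
          req.url().includes('/g/collect')) {
        console.log('GA hit fired:', req.url());
      }
    });
    await page.goto(url, { waitUntil: 'networkidle0' });
    await browser.close();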
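To illustrate what "a deterministic list of user-agent-based filters" amounts to, here's a toy sketch; the actual IAB list is licensed, so these patterns are invented:

    // Toy illustration only: invented patterns, not the actual IAB list.
    const botUserAgentPatterns: RegExp[] = [
      /googlebot/i,
      /bingbot/i,
      /ahrefsbot/i,
      /semrushbot/i,
    ];

    function isKnownBot(userAgent: string): boolean {
      // A hit is excluded only if its UA string matches a listed pattern.
      // A headless browser reporting a stock Chrome UA matches nothing here.
      return botUserAgentPatterns.some((p) => p.test(userAgent));
    }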
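On the user-agent point, a quick sketch of what Puppeteer reports and how trivially a scraper can override it (the UA string below is just an example):

    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // What the browser reports by default: a current Chromium UA string
    // (possibly with a "HeadlessChrome" token, depending on version and mode).
    console.log(await browser.userAgent());

    // A scraper can just as easily claim to be anything at all, including stock
    // Chrome on Windows, which GA will happily record as a normal visitor.
    await page.setUserAgent(
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    );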
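And for the filtering point, this is roughly what creating an IP-exclude filter looks like programmatically: a sketch against the legacy Universal Analytics Management API v3 via the googleapis Node client, where the account, property, and view IDs and the IP are hypothetical. A filter like this only affects hits processed after it exists; nothing is reapplied to old data.

    import { google } from 'googleapis';

    // Sketch: assumes credentials available via Application Default Credentials.
    const auth = new google.auth.GoogleAuth({
      scopes: ['https://www.googleapis.com/auth/analytics.edit'],
    });
    const analytics = google.analytics({ version: 'v3', auth });

    // Create the filter at the account level (IDs are hypothetical)...
    const filter = await analytics.management.filters.insert({
      accountId: '12345678',
      requestBody: {
        name: 'Exclude office IP',
        type: 'EXCLUDE',
        excludeDetails: {
          field: 'GEO_IP_ADDRESS',
          matchType: 'EQUAL',
          expressionValue: '203.0.113.42', // hypothetical office IP
        },
      },
    });

    // ...then link it to a specific view (profile) so it actually applies.
    await analytics.management.profileFilterLinks.insert({
      accountId: '12345678',
      webPropertyId: 'UA-12345678-1',
      profileId: '98765432',
      requestBody: { filterRef: { id: filter.data.id } },
    });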
There are quite a few other completely unintuitive aspects of GA that are rooted in the fact that its data processing model is incredibly straightforward, with very few exceptions and virtually no edge cases taken into account. That leads to a lot of situations where people's expectations diverge from the actual behavior. A good rule of thumb: if a particular feature or metric seems even remotely like it would require extra computation or complexity to work the way you're imagining, it almost certainly doesn't work the way you think.
[1] https://support.google.com/analytics/answer/1010249?hl=en
[2] https://www.iab.com/guidelines/iab-abc-international-spiders...
[3] https://www.iab.com/wp-content/uploads/2015/11/IAB_SpidersBo...
[4] https://support.google.com/analytics/topic/1032939?hl=en&ref...