- "Advanced URL Filtering" seems to have a feature where web content can either be evaluated "inline" or where "web payload data is also submitted to Advanced URL Filtering in the cloud" [1].
- If a URL is considered 2 spooky to load on the user's endpoint, it can instead be loaded via "Remote Browser Isolation" in a remote-desktop-like session, on demand, for that single page only [2].
I think either (or both) could explain the signals you're detecting.
[1]: https://docs.paloaltonetworks.com/advanced-url-filtering/adm....
[2]: https://docs.paloaltonetworks.com/advanced-url-filtering/adm...
When someone makes an HTTP request, the firewall takes the host and path from the request and looks them up first in a local cache on the data plane, then in the cloud. (As you can imagine, bypassing the entire feature is therefore trivial for malware. You just open a connection to an arbitrary IP address and put, say, google.com in the host header. As far as the firewall can tell, you are in fact talking to google.com.)
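A minimal sketch of that bypass, with a placeholder IP (the firewall keys on the Host header, not on where the connection actually goes):

```javascript
// Build a raw HTTP/1.1 request with a spoofed Host header. A URL filter
// that trusts the Host header will categorize this as google.com traffic,
// regardless of which IP the socket is actually connected to.
function buildSpoofedRequest(spoofedHost, path = "/") {
  return [
    `GET ${path} HTTP/1.1`,
    `Host: ${spoofedHost}`, // what the firewall's URL filter keys on
    "Connection: close",
    "", "",                 // blank line terminates the header block
  ].join("\r\n");
}

const req = buildSpoofedRequest("google.com");
// net.connect(80, "203.0.113.10").socket.write(req) would send it to an
// arbitrary IP (203.0.113.10 here is a documentation-range placeholder)
```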
When the URL isn't already known to the cloud, or its cached verdict is older than its TTL, it goes into a queue to be refreshed by the crawler, which will make its way there shortly thereafter to classify the page.
Palo Alto has other URL scanners, but none that would reliably visit the page after the user. URLs carved out of SMTP traffic, for example, would mostly be visited before the real user, not after.
> What we found were user agents purporting to be from a range of devices including mobile devices, all only ever loading a single page without any existing state like cookies.
> The behavior itself is also strange, how did it load these pages which were often behind an authwall without ever logging in or having auth cookies?
I don't think they mean to say that pages behind authentication were successfully loaded without authenticating. If cookies are required to load the page, you aren't loading it without them. So I read this as "The sessions weren't authenticated, so where on earth did they even find these URLs?"
The answer is that there's a real, authenticated user behind a firewall, and every unknown URL this user visits is getting queued up for the crawler to classify later, query string and all. So the crawler's behavior looks like the user's, but offset by a few seconds and without any state. Presumably the auth wall is doing its job and rejecting these requests.
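A toy model of that lookup-then-queue flow (the names and the TTL value here are my own invention, not Palo Alto's):

```javascript
// Toy verdict cache + crawl queue: a URL with a fresh cached verdict is
// answered locally; anything unknown or stale is queued for the crawler,
// which re-fetches it a few seconds later with no cookies or session state.
const TTL_MS = 24 * 60 * 60 * 1000; // assumed; the real TTL isn't public
const cache = new Map();            // url -> { verdict, fetchedAt }
const crawlQueue = [];

function lookup(url, now = Date.now()) {
  const hit = cache.get(url);
  if (hit && now - hit.fetchedAt < TTL_MS) return hit.verdict;
  crawlQueue.push(url);             // crawler will visit statelessly, query string and all
  return "unknown";                 // policy decides what to do in the meantime
}

cache.set("https://example.com/a", { verdict: "benign", fetchedAt: Date.now() });
lookup("https://example.com/a");            // fresh: answered from the cache
lookup("https://example.com/b?token=xyz");  // unknown: queued for the crawler
```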
I remember setting up a Confluence server which was only used by me, but had public access (still password protected).
When checking the logs, I noticed an external IP trying to access pages which I had accessed previously, but the requests got redirected to the log-in page. The paths were very specific, some of which I had bookmarked, so it was clear that there was an extension logging my browsing, and some server or person then tried to access my pages.
https://www.paloaltonetworks.com/company/press/2023/palo-alt...
> Dec. 28, 2023 Palo Alto Networks .. announced that it has completed the acquisition of Talon Cyber Security, a pioneer of enterprise browser technology ... Talon's Enterprise Browser will provide additional layers of protection against phishing attacks, web-based attacks and malicious browser extensions. Talon also offers extensive controls to help ensure that sensitive data does not escape the confines of the browser.
Set hyper-granular policies ... boundaries across all users, devices, apps, networks, locations, & assets
Log any and all browser behavior, review screenshots of critical actions, & trace incidents down to the click
Critical security tools embedded into the browser: like native browser isolation, automatic phishing protection, & web filtering

Well, that definitely tracks.
My machine wanted me to accept a client certificate from palo alto networks.
I did not and kept refusing.
I think they had some sort of intrusive MITM proxy that filtered everything everyone was doing/browsing.
I suspect most of the other non-dev machines in the building had the CA installed by IT.
In this case these appeared to all be MITM'ed pages from a security device, since the key wasn't in the URL and it contained user IDs for a specific user.
> * That somehow had the page content from a user
> * Would render and execute all scripts on that page as if it was that user
Oh, and this can also happen when a mobile user jumps off their home wifi network to an internationally roaming data card. Why would they do that? Because data is cheaper that way, or they are actually tourists. So please do not block users just because they are doing this teleportation dance.
Thankfully that could be resolved, but it wasn't a great way to start a vacation.
Some other code running in the browser window (probably a browser extension, but possibly another script tag in the page, inserted by an intermediate firewall/proxy) is doing this. It could be corporate spyware (i.e. forced on users by the IT department), or an extension that only tends to be used by large institutions (because it relates to some expensive enterprise product). Alternatively, it could be a much more popular browser extension, but it only executes this capture when it determines that the user is within a target list of large institutions.
I'm making the same guess as the author about the execution process: that the code is shipping a huge amount of page content to a cloud server, e.g. the full DOM, and then rendering that DOM in this older Chrome version. It's not fetching the same page from the origin server, which is how it's able to do this without auth cookies.
As part of rendering, the page's script tags all get executed again, which is why Upollo is seeing this. (Note that I don't know if this re-execution of script tags is deliberate. There's a good chance that it's an unintended side-effect of loading the DOM into Chrome, but it doesn't seem to break anything so nobody's bothered to disable it.)
It's only sampling a small percentage of executions, which is why it's not continually happening for every interaction by these users.
It's waiting ten seconds so that the page's network interactions are likely to have finished by then. Waiting longer would increase the odds of the user navigating to another page before the code has had a chance to run.
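If that guess is right, the capture side could be something as simple as the following. To be clear, this is a speculative sketch of the behavior described above, not recovered code; the sampling rate, function names, and injectable hooks are all made up.

```javascript
// Hypothetical extension-side capture: sample a small fraction of page
// loads, wait ~10s so the page's own network activity has settled, then
// ship the serialized DOM off for re-rendering elsewhere.
const SAMPLE_RATE = 0.01; // assumed; "a small percentage" per the thread
const DELAY_MS = 10_000;  // matches the observed ~10-second offset

function shouldCapture(rng = Math.random) {
  return rng() < SAMPLE_RATE;
}

// `send` would POST the HTML to a collector; rng/timer are injectable
// here only so the sketch can be exercised deterministically.
function scheduleCapture(doc, send, rng = Math.random, timer = setTimeout) {
  if (!shouldCapture(rng)) return false;
  timer(() => send(doc.documentElement.outerHTML), DELAY_MS);
  return true;
}
```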
The article doesn't say if there are particular kinds of pages being grabbed, but looking for commonality between them would help.
The main thing that stumps me – assuming I've understood it correctly – is why the second render is happening across such a diverse set of cloud networks.
The diversity of cloud networks looks to be due to these being deployed by individual institutions (e.g. universities, corporations, etc.) rather than only run from Palo Alto Networks' data centers.
We also saw slightly different configurations with different browser versions, but with the same pattern of behaviour.
https://support.apple.com/guide/iphone/get-extensions-iphab0...
Funnily enough, I got motivated to try to make my crawler show up the same way in my own server logs through raw scan breadth alone, i.e. by hitting so many servers I'd see my own crawler in the logs without any kind of targeting. A kind of "planetary level experiment" born of curiosity.
Had to tweak masscan settings till my crappy router could keep up with the routing load. Ended up with something like 500 addresses/sec, which pales in comparison to the best hardware used for this: combined with masscan, it scans the IPv4 space in 6 minutes. Managed to scan 1% of the IPv4 space while I slept before I started to get seriously throttled and got a quite angry email from my ISP. Just told them "Oh, thanks for noticing, I've now fixed the offending device" (pressed Ctrl+C) and never ran the scan again lol.
Ran the scan with masscan with no blacklist. Don't recommend that, at least not more than once, unless you put together a good blacklist first.
> This is an Internet-scale port scanner. It can scan the entire Internet in under 5 minutes, transmitting 10 million packets per second, from a single machine.
Absolutely insane
Just speculatively, if someone was managing the setup of a room full of NSA analysts browsing for OSINT, how would they cover their tracks? What would that traffic look like?
[0] https://docs.umbrella.com/deployment-umbrella/docs/install-c...
Edited to add link to docs.
It is scary that, for people in a corporate environment, this could be rendering banking, messaging, or any other page's contents.
Some employers might not actually do that, but that decision is usually not static, nor does a change in it have to be reported to you under most policies.
They have a feature that effectively "tests" what the user is about to load in a virtual environment, and sees if that content behaves abnormally. I forgot what they called it. It sounds like this could be it.
Maybe related somehow to that?
I don't know where the "security" bit comes from, but this is, to me, obviously web scraping.
I've recently been wondering how Omnivore, unlike e.g. Pocket, is able to store paywalled content (for which I have a subscription) on iOS when saving it via the Omnivore app target in the share sheet, but not when directly pasting the target URL in the webapp or iOS app.
Turns out that sharing to an iOS app actually enables [1] the app to run JavaScript in the Safari web context of the displayed page, including cookies and everything!
If I'm skimming the client and server source code correctly, it does just that: It seems to serialize and upload the HTML of the page [2] and then invokes Puppeteer on the server [3]. Puppeteer is a scriptable/headless Chrome – that would fit the bill of "an outdated Chrome running in a data center"!
Omnivore can also be self-hosted since both client and server are open-source; that would explain you seeing multiple data center IPs.
[1] https://developer.apple.com/library/archive/documentation/Ge...
[2] https://github.com/omnivore-app/omnivore/blob/main/apple/Sou...
[3] https://github.com/omnivore-app/omnivore/blob/57aca545388904...
> But wait, these are different devices, they have none of the same cookies. If this were a VPN it would be the same device.
Could be interesting, but I can't read this shit with flashing images.
/* Text Higlight Color /**/
:root {
--highlight-color: null;
}
::selection {
background: var(--highlight-color);
color:#FFFFFF;
}
::-moz-selection { /* Code for Firefox */
color: #FFFFFF;
background: var(--highlight-color);
}
</style>
<!-- Text Highlight -->
<script>
const colors = ["#F76808", "#30A46C", "#0091FF", "#6E56CF", "#E5484D"];
window.addEventListener("mousedown", (e) => {
const color = colors.shift();
document.documentElement.style.setProperty("--highlight-color", color);
colors.push(color);
});
</script>
Sad, because it sounded interesting, but no way I could focus enough to actually comprehend it.
> strange devices show up for some of our customers' users
> how did it load these pages which were often behind an authwall without ever logging in or having auth cookies?
Either
- The customer has screwed up user auth big time and some X knows that... let's go with no
- OP's data is wrong or they are reading it wrong
- They are explaining it badly.
Either way it feels like malware on a client machine, but doesn't necessarily mean that the page contents are being read by the malware.
I guess if you had some JavaScript which only loaded a URL if the Chrome version was not the latest, you could confirm this: the attempt to load the URL would never occur on GoodChrome, but it would on the "security" device. Therefore, if the page contents were being shipped to BadDevice wholesale, the URL would get loaded; but if BadDevice were just re-requesting the URLs called by GoodChrome, the URL would never be called.
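A sketch of that experiment (the canary path, constant, and function names are all hypothetical, and "latest Chrome major" would need to be kept current by some means):

```javascript
// Parse the Chrome major version from a user-agent string and only fire a
// beacon to a canary URL when the browser is stale. If the canary URL then
// shows up in logs from a data-center IP, the re-renderer is executing
// shipped page content rather than replaying URLs the real user loaded.
const LATEST_CHROME_MAJOR = 123; // placeholder value

function chromeMajor(userAgent) {
  const m = /Chrome\/(\d+)/.exec(userAgent);
  return m ? Number(m[1]) : null;
}

function shouldFireCanary(userAgent, latest = LATEST_CHROME_MAJOR) {
  const major = chromeMajor(userAgent);
  return major !== null && major < latest;
}

// In the page itself, something like:
// if (shouldFireCanary(navigator.userAgent)) fetch("/canary-only-for-old-chrome");
```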