Reverse Engineering TikTok's VM Obfuscation

(nullpt.rs)

683 points | hazebooth | 3y ago | 122 comments


This is really awesome work.

I spent a lot of time in the early 2000s coming up with nasty obfuscation techniques to protect certain IP that inherently needed to be run client-side in casino games. Up to and including inserting bytecode that was custom crafted to intentionally crash off-the-shelf decompilers that had to run the code to disassemble it (and forcing them to phone home in the process where possible!)

My view on obfuscation is that since it's never a valid security practice, it's only admissible for hiding machinery from the general public. For instance, if you have IP you want to protect from average script kiddies. Any serious IP can be replicated by someone with deep pockets anyway. Most other uses of code obfuscation are nefarious, and obfuscated code should always be assumed to be malicious until proven otherwise. I'm not a reputable large company, but no reputable large company should be going to these lengths to hide their process from the user, because doing so serves no valid security purpose.

dbrueck 3y ago

Agreed - obfuscation is useful for keeping honest people honest. If someone is sufficiently motivated, they will circumvent it, but for the vast majority of people it's just not worth the effort so they'll move to something else.

For example, in our application we have some optionally downloadable content that includes some code for an interpreted language. That code lives on disk in an obfuscated form because we are not yet ready to make the API public (it's on our "someday" roadmap), we don't want to clean up the code for public viewing, and above all because there are different licensing requirements around each content pack.

We looked at various "real" security options and they all have holes, and they all add a ton of complexity. We then also looked at the likely intersection between "people who would pay for this" and "people who could crack this", and there's not much there. In the end, obfuscation is cheap (especially in terms of implementation and maintenance) and steers our real customers away from violating the license, and we don't waste resources on dishonest people.

If I'm being charitable, the obfuscation in the article has an out of whack cost/benefit ratio. If I'm being cynical, the obfuscation they are doing strays well into the realm of nefarious. :)

thrashh 3y ago

People knock obfuscation, but everything in life is based on trust. Locks being breakable, the fruit stand in front of a shop being unprotected, fences being scalable. Everything is a cost/benefit tradeoff.

rowanG077 3y ago

It's the curse of ideological purity you see in a lot of the tech sector. Most of these types are of the sort that either something is unbreakable or it's useless.

noduerme 3y ago

Just as a fun aside, I was perusing some of my most ancient HN threads and came across this obfuscated monster which I totally forgot about. Pastebin links inside are still active ;)

https://news.ycombinator.com/item?id=3432800

jstanley 3y ago

Wait, why is a casino protecting its so-called "intellectual property" legitimate and above-board, but TikTok doing the same is not?

margalabargala 3y ago

I don't think OP was defending their own earlier work or otherwise exempting it from their assertion that all obfuscated code should be considered malicious.

jstanley 3y ago

Having reread it, I think you might be right.

> it's only admissible for hiding machinery from the general public.

I had originally read this to imply that somehow it's OK for a casino to hide its machinery from the general public, but it's not OK for TikTok to hide its machinery from the general public, but maybe "machinery" here is intended much more narrowly, and OP thinks it applies neither to casinos nor TikTok.


rnd0 3y ago

That's how I read it too. I had the feeling that the experience convinced the OP that it's not valid except in some circumstances.

noduerme 3y ago

Parent / casino founder here. The casino specialized in original, exotic games. The obfuscated portions of the front-end were game modules (including art assets) that were loaded after login. We had several games that we were filing for patents on. We were also in talks with a much larger online casino about licensing individual games and/or the software as a whole to them. The purpose of the obfuscation was to make it harder for competitors to decompile and get at raw assets or read the math by which the game mechanics worked. For instance, we had a 3D slot machine based on a Rubik's Cube that paid out based on the odds of being able to solve one side in N steps from any given randomly scrambled position. That algorithm had to exist client-side to calculate the odds visible to the user in realtime, along with server-side for confirmation against someone trying to cheat in the client.

I felt it was important to make it as hard as possible for someone to reverse engineer the unique mechanisms. Ultimately, it was probably a waste of time. This is why I think in most cases the uses of obfuscation are at best limited, but they can put a costly stumbling block for competitors if you want to encourage them to license your software rather than copy it. Where I think they tilt toward the nefarious is when they're designed to extract hidden data from end users. As a distinction, what went over the wire between the client game modules and the casino back-end were completely human-readable game states in all cases (besides the user's unique ID and session hash, which were named as such). There were no bullets of obfuscated fingerprints flying around. Any user was free to read what came and went from the API, and even to mess with it by adjusting parameters if they wanted to see what the server would accept or reject.

neodymiumphish 3y ago

I think the distinction in what's obfuscated is important. Casino apps are trying to hide their code that detects cheating, number generation, etc, while TikTok is trying to hide its data collection. Obfuscation itself isn't necessarily bad.

noduerme 3y ago

Cheating detection was essentially all conducted on the back-end in my casino, but I do think there's a use case for obfuscating some front-end monitoring, e.g. for bot-like inputs. We didn't explicitly ban poker bots, but we didn't make the API guide public, either. The cheating we were most concerned with was poker collusion, which could be detected by combing the log files for certain patterns of play correlated between users or IP addresses.

Random numbers are never generated in the client. Ours were generated on a dedicated server separate from anything else - in a different country, for legal reasons - whose sole purpose was to generate random numbers on demand.

im3w1l 3y ago

> Number generation

Number generation is extremely important and it's also regulated. You don't put such a thing in the client obfuscated or not.

kevin_thibedeau 3y ago

Because they're doing it on hardware that they control.

maria2 3y ago

White box crypto is kind of like obfuscation, but tries to make it impossible to extract the information.

awestroke 3y ago

No, encryption is very different from obfuscation, even if the former is often used in the latter

xurukefi 3y ago

You missed the point. maria2 is talking about whitebox crypto. The "whitebox" part means that the decryption process happens on your machine, including the secrets, which are present in some obfuscated, scrambled form in memory. Getting the secret key is a matter of debugging and understanding the obfuscation scheme. A prime example of this is DRM like Widevine (L3) in the Chrome browser.


krackers 3y ago

There's also indistinguishability obfuscation which I recall recently had a breakthrough in terms of practical construction

bobleeswagger 3y ago

> since it's never a valid security practice

Why not? It's just another tool in the security game.

I want to be with you on thinking that all obfuscation is malicious, I know that individuals have every right to obfuscation and privacy as a matter of the 1st and 4th amendments in the US, but I'm not sure I can always say that obfuscation by a corporation is evil, without a more compelling argument. I'm as anti-establishment as they come, too.

mtnygard 3y ago

I read the GP a bit differently... I didn't read it as saying obfuscation is evil, just that it is ineffective. More like "obfuscation can't prevent reversing, therefore it's not a valid security practice since all it does is slow down the casual observer but does not stop the determined adversary." The statement that most use of obfuscation is nefarious is a corollary... since obfuscation doesn't protect IP it is mostly used to hide malicious activity.

noduerme 3y ago

This, exactly. Thank you for putting it so succinctly.

ViViDboarder 3y ago

I think the reason is that it means they don’t trust their users, or don’t want them to know what is being done on their machine. To me, that is already a malicious premise. Even if they aren’t trying to exfiltrate my data or anything.


codedokode 3y ago

It is interesting that while technologies like canvas, WebGL or WebRTC were intended for other purposes, their main usage became fingerprinting. For example, WebGL provides valuable information about the GPU model and its drivers.

This shows how browser developers race to provide new features ignoring privacy impact.

I don't understand why features that allow fingerprinting (reading back canvas pixels or GPU buffers) are not hidden behind a permission.

jsnell 3y ago

It is absurd to claim that the main use of WebRTC is fingerprinting. Especially during the pandemic the world pretty much ran on WebRTC. Real-time media is clearly a pretty core functionality for the web to be a serious application platform, it wasn't just some kind of a trojan horse for tracking.

Now, it is true that a lot of older web APIs do expose too much fingerprinting surface. But design sensibilities have changed a lot over time; you can't make statements about what browser developers do now based on what designs from a decade or two ago look like. These days privacy is a top issue when it comes to any new browser APIs.

But let's take your question at face value: why aren't these specific things behind a permission dialog? Because the permissions would be totally unactionable to a normal user. "This page wants to send you notifications" or "this page wants to use the microphone" is understandable. "This page wants to read pixels from a canvas" isn't. If you go the permission route, the options are to either a) teach users that they need to click through nonsensical permission dialogs, with all the obvious downsides; b) make the notifications so scary or the permissions so inaccessible that the features might as well not exist. And the latter would be bad! Because the legit use cases for e.g. reading from a canvas do exist; they're just pretty rare.

The Privacy Sandbox approach to this is to track and limit how much entropy a site is extracting via these kinds of side channels. So if you legit need to read canvas pixels, you'll have to give up on other features that could leak fingerprinting data. (I personally don't really believe that approach will work, but it is at least principled. What I'd like to see instead is limiting the use of these APIs to situations where the site has a stable identifier for the user anyway. But that requires getting away from implementing auth with cookies as opaque blobs of data with unknown semantics, and moving to some kind of proper session support where the browser understands the semantics of a signed-in session, and it's made clear to users when they're signing in somewhere and where they're signed in right now. And then you can make a lot better tradeoffs with limiting the fingerprinting surface in the non-signed-in cases.)

trifurcate 3y ago

> "This page wants to send you notifications" or "this page wants to use the microphone" is understandable. "This page wants to read pixels from a canvas" isn't.

Yes, it is. Tor Browser already does this: https://www.bleepstatic.com/content/posts/2017/10/30/CanvasF...

That specific wording may be a touch too verbose for the average end user, but it's not impossible nor is it strange. Just include a note about how this is 99% likely a fingerprinting measure; option b) isn't so bad in this case. Of course, due to the nature of how fingerprinting works, the absolute breadth of features that would be gated behind something like this would be offputting.

I am also wary of what you suggested with gating this kind of fingerprinting to when the website has positively identified the user anyway; in a way, this seems to me even more valuable than fingerprint data without an associated "strong" identity.

ballenf 3y ago

Giving users the permissions would simply be a training exercise in "I have to say 'yes' or TikTok breaks". Like how Android worked a few years ago with the other permissions.


tveyben 3y ago

The user ‘Joe average’ does not use Tor, does not even know it exists - Tor is used by a completely different segment (of people with ‘above average’ IT skills…)

0xy 3y ago

Of course its main use is fingerprinting. Do you think WebRTC is instantiated for genuine reasons the majority of the time? That's real absurdity.

WebRTC is instantiated most often by ad networks and anti-fraud services.

Same thing with Chrome's fundamentally insecure AudioContext tracking scheme (yes, it's a tracking scheme), which is used by trackers 99% of the time. It provides audio latency information which is highly unique (why?).

Given Chrome's stated mission of secure APIs and their actions of implementing leaky APIs with zeal, I have reason enough to question their motives.

After all, AudioContext is abused heavily on Google's ad networks. Google knows this.

ciarlill 3y ago

> It provides audio latency information which is highly unique (why?).

As someone who has worked with WebAudio extensively, and has opened and read many issues in the bug tracker and read many of the proposals... this is just not as nefarious as you are making it seem. I don't disagree that this _can_ be abused by ad tracking networks but I do disagree with the premise that it was somehow an oversight of the spec or implementation which led to this (or even worse, intentional). Providing consistent audio behavior across a wide variety of platforms (Linux, OSX, Windows, Android) along with multiple versions of all those platforms and the myriad hardware in the actual devices is actually just pretty hard. The boring answer here is that to provide low latency audio to support things like games, a lot of decisions have to be made about what buffer sizes are appropriate for the underlying hardware and this is what ultimately exposes some information about audio latency on the system. Some of those decisions are limited by the audio APIs of the OS. Some are limited by the capabilities of the hardware. Some are workarounds for obscure bugs in either layer. The point is that, as with most software, compromises are made to support an API that people actually need or want to use to make stuff. I also don't think audio latency information is really "highly unique". There are only a handful of buffer sizes which are reasonable based on the desired sample rate and are mostly limited by the OS, meaning at best you can probably identify a person's OS via the AudioContext. Furthermore, I have seen API "improvements" and requests rejected outright due to possibly exposing fingerprinting information. Things that would be really useful to applications which are building audio-centric software won't be implemented because the team takes this issue seriously.
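To put rough numbers on the buffer-size argument: one buffer of N frames at sample rate R takes N / R seconds to play. The sketch below uses assumed, illustrative buffer sizes and sample rates (they are not taken from any particular browser, OS, or device) and counts how many distinct latency values such a small grid can produce.

```javascript
// Latency contributed by one audio buffer: frames / sampleRate seconds.
function bufferLatencyMs(bufferFrames, sampleRateHz) {
  return (bufferFrames / sampleRateHz) * 1000;
}

// Assumed, illustrative values (not taken from any specific platform).
const bufferSizes = [128, 256, 512, 1024];
const sampleRates = [44100, 48000];

// Collect the distinct latencies a script could observe from this small grid.
const distinctLatencies = new Set();
for (const rate of sampleRates) {
  for (const frames of bufferSizes) {
    distinctLatencies.add(bufferLatencyMs(frames, rate).toFixed(2));
  }
}

console.log([...distinctLatencies].sort((a, b) => a - b));
console.log(`${distinctLatencies.size} distinct values`);
```

Under these assumptions there are only eight possible values, which is at most three bits of identifying information on its own; how much a real device leaks depends on what the platform actually exposes.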


arein3 3y ago

Wow, that's really shitty on Google's part.

Datagenerator 3y ago

One alternative, LibreWolf, needs some more promotion; it has safer security defaults.

psychphysic 3y ago

Do you mean more websites use WebRTC for legitimate purposes than for fingerprinting? Or that more instances of it being activated are legitimate, or that more traffic is legitimate (probably true given the bandwidth needed for audio and video)?

But I suspect by the other two metrics it's correct to say most uses are to fingerprint.

ghayes 3y ago

Take a look at Firefox’s Fingerprinting Prevention feature. This includes a permission for canvas, as well as:

- Your timezone is reported to be UTC

- Not all fonts installed on your computer are available to webpages

- The browser window prefers to be set to a specific size

- Your browser reports a specific, common version number and operating system

- Your keyboard layout and language is disguised

- Your webcam and microphone capabilities are disguised

- The Media Statistics Web API reports misleading information

- Any Site-Specific Zoom settings are not applied

- The WebSpeech, Gamepad, Sensors, and Performance Web APIs are disabled

https://support.mozilla.org/en-US/kb/firefox-protection-agai...

fxtentacle 3y ago

It's because the developer of the browser needs fingerprinting for their ads.

I don't think Chrome accidentally exposed data that Google wanted.

IshKebab 3y ago

Please don't spread obviously untrue conspiracy theories.

The main reason is that it's really hard to avoid fingerprinting (while providing rich features like WebGL and WebRTC anyway).

A secondary reason is that web browsers started off from a position of leaking fingerprint data all over the place so there's not much incentive to care about it for new features.

You might be interested in this effort to reduce fingerprinting: https://developer.chrome.com/en/docs/privacy-sandbox/privacy...

(The real conspiracy is that Google added logins to Chrome specifically so that they don't have to rely on fingerprinting. They have a huge incentive to stop fingerprinting because it leaves them as the only entity that can track users.)

danielheath 3y ago

I thought the developer of the browser is the only ad provider that _doesn't_ need it (since they have other, better ways to get that intel which their competitors do not).

asdfghjkjhg 3y ago

they (google) did try.

that's the profile icon you see on your google-chrome UI.

but only fools use that feature.


supriyo-biswas 3y ago

The fly in the ointment with this theory is why Apple (or even Mozilla) would expose the same kind of information. Apple has only recently started experimenting with ads, and their ads are limited to the apps that they control.

The more benign explanation would be to allow developers to work around device-specific or browser-specific bugs.

(I'm aware Apple changes the GPU Model to "Apple GPU", however they do expose a ton of other properties that make it possible to fingerprint a device.)

jakear 3y ago

Apple devices are in fact fairly difficult to fingerprint. In my experiments [1] all instances of the same hardware model (on iOS, iPadOS, and macOS) give the same fingerprint, so the best a tracker can get is "uses iPhone 14". Better than nothing, but not terribly unique.

[1] fingrprintr.pages.dev


camyule 3y ago

Firefox do have a mechanism to limit the amount of data being leaked for fingerprinting, but it’s disabled by default: https://support.mozilla.org/en-US/kb/firefox-protection-agai...


threatofrain 3y ago

Continuing to push the browser to be a general app platform is the only way it can survive against the native experience, which is already eating into the enthusiasm for the web. It seems like the trend for consumer companies is to maybe launch first on the web for velocity but eventually migrate to native experiences.

I wonder to what degree we can enable hardware performance without leaking user data.

RobotToaster 3y ago

Isn't Mozilla's main source of income from Google?

madeofpalk 3y ago

> This shows how browser developers race to provide new features ignoring privacy impact.

I think it showed how, many years ago, browser vendors were naive about how this tech could be misused.

These days I think browser vendors are very much aware of it and will frequently block features or proposals that they feel compromise on privacy and/or could be used as a tracking vector, especially Firefox and Safari. Sort this list https://mozilla.github.io/standards-positions/ by Mozilla Position to see the reason they reject/refuse to implement standards and proposals.

neop1x 3y ago

For those who are unaware of how big of a problem fingerprinting is, there is an EFF website [1]. EU cookie policy is nothing compared to this. There are libraries like fingerprintjs [2] which can generate a pretty stable visitor ID.

If you change or alter some browser APIs in order to make your browser less unique, some payment processors' websites may stop working. And websites proxied through Cloudflare will constantly display the "Checking if the site connection is secure" page, sometimes in an infinite loop where even solving their captchas won't help.

[1] https://coveryourtracks.eff.org/

[2] https://fingerprint.com/
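As a rough illustration of how a stable visitor ID can fall out of ordinary browser attributes, here is a minimal sketch. It is not how fingerprintjs works internally: the attribute names and values are hypothetical, and FNV-1a is just a small dependency-free hash chosen for the example.

```javascript
// 32-bit FNV-1a hash over the UTF-16 code units of a string.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Combine attributes in a stable key order so the same device yields the same ID.
function visitorId(attributes) {
  const canonical = Object.keys(attributes)
    .sort()
    .map((key) => `${key}=${attributes[key]}`)
    .join("|");
  return fnv1a(canonical);
}

// Hypothetical attribute values, for illustration only.
const deviceA = { timezone: "UTC+1", gpu: "ExampleGPU 1000", fonts: 212, audioLatencyMs: 5.33 };
const deviceB = { ...deviceA, gpu: "ExampleGPU 2000" };

console.log(visitorId(deviceA), visitorId(deviceB));
```

No cookie is involved: as long as the attributes stay the same between visits, so does the ID, which is why clearing site data does not help against this kind of tracking.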

PetahNZ 3y ago

Come on, it's not their main usage... An intentional side effect maybe, but their main usage is clear.

0xy 3y ago

If something is used 99% of the time for tracking and 1% of the time for genuine useful reasons, it's safe to say it's a tracking mechanism.

Intent is irrelevant, the APIs are fundamentally insecure. Google directly benefits from this financially.

ivoras 3y ago

Of course it's not that simple.

In most parts of the world, if a person is in a public space, anyone can take a photo of that person, including shop owners. This photo could be considered a type of "fingerprint" for that person. The only important difference is that in some countries, you are not allowed to make money off of such photos.

The Internet is a lot like a big public space, and possibly worse - while you are using certain services (web pages or apps), it might be argued that you are actually "on premises" for that service provider.

The best we can do now is more and more education about what can go wrong with such data collection.

ajsnigrutin 3y ago

Yes, but taking photos is expensive, fingerprinting online is cheap. Also, there's a difference between taking a photo of the Eiffel Tower and a bunch of other tourists there (legal), and intentionally targeting and photographing an individual and creating a database of those photos (illegal in most countries).

TobyTheDog123 3y ago

TikTok changes this algorithm about once every three months. I've reverse-engineered it about two times, and have since given up and decided to run a headless browser to do it for me. I'd love to see some tool developed to automate solving this so I can sign requests in a more limited context (ala Cloudflare Workers / C@E)

nullpt_rs 3y ago

Author of the post here, if you have an older version of the script you're able to post or send over I'd love to take a look at it and see what changes they make and potentially automate the extraction.

TobyTheDog123 3y ago

Hey I'd love to:

1.0.0.200: https://hastebin.com/tudivadufa.apache

Unknown version: https://hastebin.com/jasuxineti.js

Some of these might have some console.logs (or curse words), but as a whole should be representative

moneywoes 3y ago

Are you able to scrape with a headless browser?

TobyTheDog123 3y ago

Yeah, I can get basic user information pretty reliably just from the initial page load.

I had a secondary use case of allowing users to sign-in in order to import the (verified/creator) users they follow, but quickly realized Apple wouldn't allow that data to be used (after the whole OG app ordeal), so I never had a real reason to follow up and crack it again.

thih9 3y ago

I've seen some of these techniques elsewhere; e.g. javascript-obfuscator supports replacing variable names with hex values [1] or transforming call structure into something more complex [2]. Bytecode generation is new to me; is there an existing JS obfuscation tool, preferably open source, that supports it?

[1]: https://github.com/javascript-obfuscator/javascript-obfuscat...

[2]: https://github.com/javascript-obfuscator/javascript-obfuscat...
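For anyone who has not seen these transforms, here is a hand-written sketch of what hex-style renaming and an indirected string lookup look like. This is not actual javascript-obfuscator output; the identifiers and the offset are made up to resemble the style.

```javascript
// Readable original.
function greet(name) {
  return "Hello, " + name + "!";
}

// Hand-obfuscated equivalent: hex-style identifiers, strings moved into an
// array, and string lookups routed through an indirection function.
const _0x3a1f = ["Hello, ", "!"];
const _0x5c2e = (_0x1b) => _0x3a1f[_0x1b - 0x10];
function _0x4d7a(_0x2e91) {
  return _0x5c2e(0x10) + _0x2e91 + _0x5c2e(0x11);
}

// Both versions build the same string.
console.log(greet("world") === _0x4d7a("world"));
```

The behavior is unchanged, and a reverser can undo this mechanically by inlining the string array and renaming identifiers, which is why this layer alone is considered weak.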

czx4f4bd 3y ago

Based on my previous research into this, the magic keywords to find this kind of thing on Google are "virtualization obfuscation" or "VM obfuscation".

rusty-jsyc is the main open source implementation I've found, though it hasn't been touched in a few years: https://jwillbold.com/posts/obfuscation/2019-06-16-the-secre... (GitHub: https://github.com/jwillbold/rusty-jsyc)

I think there are other implementations, but they're proprietary so I didn't look into them very much. There are lots of posts out there about reversing virtualization obfuscation, but not many about implementing it. Seems like most people who put the effort into implementing it tend to prefer selling it commercially (which I suppose makes sense).

hoosieree 3y ago

It's only for C, but Tigress[1] supports a ton of obfuscation types. Virtualization and JIT are very effective, especially when used together with control flow transforms like Split and Flatten.

Renaming variables or encoding them is fairly trivial to reverse.

[1] https://tigress.wtf/transformations.html

xchkr1337 3y ago

Compiling JS to bytecode is not that uncommon, there's a few anti-bot services that rely on it for obfuscation (like recaptcha or f5 shapesecurity) but so far I haven't seen any open source projects for obfuscating this way

0x008 3y ago

If I recall correctly: Electron apps can compile JavaScript with “Bytenode”, which produces some form of bytecode intended to be run in the V8 engine.

derefr 3y ago

FYI, most CAPTCHA and anti-DDoS services (e.g. Cloudflare) do something very similar, sending the user an obfuscated program implemented on top of an obfuscated JS VM, that they effectively have to execute as-is, in a real browser, to get back the correct results the gateway is looking for. This is done to prevent simple scraping scripts (the Scrapy type) from being used to scrape the site. If you want to do scraping, you have to pay the extra overhead of driving a real browser to do it. (And not even a headless one; they have tricks to detect that, too.)

antiviral 3y ago

This is excellent work.

It also shows how TikTok may be in violation of several US/EU privacy laws. I really wonder now who this data is shared with. Perhaps someone should bring this article to the FTC’s attention for further review.

mdaniel 3y ago

Well, I guess they can stand in line:

https://news.ycombinator.com/item?id=34112874 (TikTok banned on government devices)

https://news.ycombinator.com/item?id=34121201 (TikTok admits to spying on U.S. users)

KirillPanov 3y ago

Awesome, really awesome work. However:

> If that is something you are interested in, keep an eye out for the second part of this series :)

Your site is missing an RSS/Atom feed, so I can't do that. ::sad face::

CallMeMarc 3y ago

We're sharing the same fate apparently! Just added a PR to their repository to add some feeds, hope it gets merged soon.

https://github.com/nullpt-rs/blog/pull/1

nullpt_rs 3y ago

Thanks for the PR! You should now be able to access the feeds :)

https://www.nullpt.rs/feed.atom

https://www.nullpt.rs/feed.rss

https://www.nullpt.rs/feed.json

wiml 3y ago

Given that the beginning of the "weird string" has a magic number and a version field, I wonder if the point of this is not so much obfuscation as transpilation? The magic number corresponds to ASCII "HNOJ" "@?RC", or perhaps "JONH" "CR?@", which doesn't turn anything up on Google but it seems odd to include that redundant header if your main goal is minification or obfuscation.
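The two readings are just the same eight bytes decoded with each 4-byte word in opposite byte order. A small sketch, assuming the header bytes are exactly the ones implied by the ASCII strings quoted above (the hex string below is derived from those strings, not copied from the script):

```javascript
// Bytes implied by the ASCII reading "HNOJ" followed by "@?RC" (an assumption).
const headerHex = "484e4f4a403f5243";

// Decode a hex string into ASCII, reading each 4-byte word in the given byte order.
function decodeWords(hex, littleEndian) {
  const bytes = hex.match(/../g).map((pair) => parseInt(pair, 16));
  const words = [];
  for (let i = 0; i < bytes.length; i += 4) {
    const word = bytes.slice(i, i + 4);
    if (littleEndian) word.reverse();
    words.push(String.fromCharCode(...word));
  }
  return words;
}

console.log(decodeWords(headerHex, false)); // [ 'HNOJ', '@?RC' ]
console.log(decodeWords(headerHex, true));  // [ 'JONH', 'CR?@' ]
```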

amelius 3y ago

Can someone explain what VM they are talking about, where that VM is running, and what is running in it?

dbrueck 3y ago

It's a custom VM running inside their app, though calling it a VM might be a bit of a stretch because it doesn't appear to be a general purpose computing mechanism but more of a higher-level command processor.

It sounds like the forthcoming part 2 article will go into more depth.

hendrik127 3y ago

This seems to refer to a language virtual machine rather than an actual VM. Like the Java Virtual Machine (JVM)
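To make "language virtual machine" concrete, here is a toy stack-based bytecode interpreter in JavaScript. The opcodes and encoding are hypothetical and far simpler than anything in the article; the point is only that the program ships as an array of numbers and a small loop gives those numbers meaning.

```javascript
// Hypothetical opcodes for a toy stack machine (not TikTok's instruction set).
const OP = { PUSH: 0, ADD: 1, MUL: 2, HALT: 3 };

// Interpret a flat array of numbers as bytecode and return the top of the stack.
function run(bytecode) {
  const stack = [];
  let pc = 0; // program counter
  while (pc < bytecode.length) {
    const op = bytecode[pc++];
    switch (op) {
      case OP.PUSH:
        stack.push(bytecode[pc++]); // operand follows the opcode
        break;
      case OP.ADD:
        stack.push(stack.pop() + stack.pop());
        break;
      case OP.MUL:
        stack.push(stack.pop() * stack.pop());
        break;
      case OP.HALT:
        return stack.pop();
      default:
        throw new Error(`unknown opcode ${op} at ${pc - 1}`);
    }
  }
  return stack.pop();
}

// (2 + 3) * 4, expressed only as numbers: the logic lives in the data.
const program = [0, 2, 0, 3, 1, 0, 4, 2, 3];
console.log(run(program)); // 20
```

Someone reading the shipped code sees only the interpreter loop and an opaque array, so they have to recover the opcode meanings before they can read the actual program.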

Aperocky 3y ago

Isn't the same concept also used in YouTube? I believe a Python mock of the equivalent VM exists in youtube-dl.

linux2647 3y ago

IIRC not exactly. YouTube provides some arbitrary JavaScript that must be evaluated as a form of a challenge. It changes with every page request, but it’s just a set of math operations. It’s easier to evaluate the JS than to statically analyze it

mdaniel 3y ago

I recall that discussion recently, and thus just happen to have it handy:

a very, very specialized "regex" based JS evaluator that presumably did just enough to make the YT one run: https://github.com/ytdl-org/youtube-dl/blob/2021.12.17/youtu...

and its callsite: https://github.com/ytdl-org/youtube-dl/blob/2021.12.17/youtu...

So the short version is that I would not classify that as a VM, and I don't even believe it's obfuscated. Perhaps there are other extractors that do what you're describing, I didn't go looking

Alifatisk 3y ago

I never knew that TikTok was shipped with its own virtual machine!

But that explains the obvious subdomain vm.tiktok.com

llacb47 3y ago

Don’t think that’s what vm means there. The m is likely “maliva”, which is TikTok’s overseas (US/Europe) CDN.

born-jre 3y ago

Something hit me when reading this. You know how zkSNARK is touted as tech that will, in the future, allow creating apps that can work on a user's private data while preserving the user's privacy? Could it be used for the opposite, as an obfuscation technique: you encrypt the user's data inside a zk oracle on the user's side and send it to the server. You could reverse engineer what the inputs to the oracle are, but not what exactly it sends to the server?

renonce 3y ago

zkSNARK allows you to make a proof for a statement that some boolean expression is satisfiable, without leaking any information about how the expression can be satisfied. That helps prove something but not work on any data. The technique you described sounds more like homomorphic encryption, which currently is many orders of magnitude slower than native computation and lacks practical use.

born-jre 3y ago

What about something like this https://github.com/zkonduit/ezkl ?

mhasbini 3y ago

Deobfuscated script without the vm part: https://gist.github.com/mhasbini/f9269d230ed8eb6dfdbb1bd1be9...

lazyeye 3y ago

There needs to be a publicly funded charity that pays people to work full-time de-obfuscating all the major apps. This should be a well-resourced ongoing operation.

mdaniel 3y ago

I believe reversing for interoperability purposes is protected (at least here in the US), and I'd guess all reversing is "protected" if one doesn't share the resulting code (as with TFA), but I would bet that a crowdsourced setup like you're describing would run afoul of patent and copyright laws and ultimately the legal system is "he who has the most lawyers wins"

I have often wondered what the legal area is for sharing a Ghidra database that merely labels existing code, but I haven't looked into how much of the original binary gets packaged up with such a database

derefr 3y ago

That HTTP request is kind of hideous. All those extra parameters that have nothing to do with what the response will end up being, and which change often. Seems like a great way to toss out all your API-response edge-cache-ability.

kevincox 3y ago

With HTTPS you need to own the edge cache yourself and most will have options to ignore the headers and URL parameters that you want. That way they can log the tracking data and serve the cached data as if they were never there.
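A minimal sketch of the "ignore the URL parameters you want" idea: build the cache key from the URL with tracking-only parameters dropped. The parameter names and URLs are hypothetical, and a real edge cache would do this in its own configuration language rather than in application code.

```javascript
// Hypothetical names of tracking-only parameters the cache should ignore.
const IGNORED_PARAMS = new Set(["signature", "device_id", "session_token"]);

// Build a cache key from a URL with ignored parameters removed and the rest sorted.
function cacheKey(rawUrl) {
  const url = new URL(rawUrl);
  const kept = [...url.searchParams.entries()]
    .filter(([name]) => !IGNORED_PARAMS.has(name))
    .sort(([a], [b]) => a.localeCompare(b));
  const query = new URLSearchParams(kept).toString();
  return url.origin + url.pathname + (query ? "?" + query : "");
}

const keyA = cacheKey("https://api.example.com/feed?count=10&signature=abc&device_id=1");
const keyB = cacheKey("https://api.example.com/feed?device_id=2&signature=xyz&count=10");
console.log(keyA === keyB, keyA);
```

Both requests map to the same key, so the second can be served from cache while the full original URL is still available for logging.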

derefr 3y ago

This is mostly true — though keep in mind that corporate forward-proxy caches still work under strict TLS, by installing root CA certs through GPOs on corporate machines, that re-sign all connections.

More importantly, if you're talking to a browser, the browser's own cache is in play. It's not an edge cache, per se, but it's just as important as one, and acts very similar to one.

thecleaner3y ago

Can I conclude that TikTok implemented a custom VM in JavaScript? Any idea what it's used for, how many instructions it can process, and whether there are other comparable implementations?

Exuma3y ago

This article is 2 hours old and his Twitter is already changed?

mdaniel3y ago

Someone reported that he just had a typo in the twitter handle, IIRC an extra "r" at the end; FWIW, navigating up one level also has a link to the twitter handle and works just fine: https://twitter.com/nullpt_rs

deepzn3y ago

Looks like they are more active on Mastodon.

apienx3y ago

Solid case! Thanks for taking the time to write it up.

Those who care and have to use TikTok can probably add their own virtualization layer (and tolerate the hit in cost/performance).

chinathrow3y ago

No one has to use social media.

QuantumGood3y ago

Wouldn't an example be a job that requires it? Are you attempting a meta comment, and really mean something like "anyone can quit a job that requires social media usage"?

jesuspiece3y ago

They're trying to be unique and cool by denouncing the use of social media

frozencell3y ago

The hunt begins.

draw_down3y ago

> void 0 (a fancy obfuscated way of saying undefined)

Kind of. But it was possible at one point, maybe still is, to rebind `undefined` to some other value, causing trouble. `void` is an operator, a language keyword; it’s guaranteed to give you the true undefined value. (In other words, the value whose type is `undefined`.)

If you’re coding against an environment as adversarial as these people clearly believe they are, you’d go with `void` as well.
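A small illustration of the point, as a sketch rather than anything taken from the article: `undefined` is an ordinary identifier, so a local binding (here a function parameter) can shadow it, while `void 0` cannot be redirected. The function name `isMissing` is made up for this example.

```javascript
// The second parameter is deliberately named `undefined` to shadow the global.
function isMissing(value, undefined) {
  return {
    naive: value === undefined, // compares against whatever the caller passed in
    robust: value === void 0,   // `void` is an operator and always yields the real value
  };
}

const result = isMissing(void 0, "hijacked");
console.log(result.naive, result.robust);
```

With the shadowed binding, the naive check compares against the string `"hijacked"` and reports that the value is present, while the `void 0` check still gets it right.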

kerneloops3y ago

Another reason to use `void 0` is that "void 0" takes only 6 characters while "undefined" takes 9, saving some bandwidth. It is common practice for JavaScript minifiers to use this substitution.

marginalia_nu3y ago

Given it will be gzip-compressed in transport, does this really save a meaningful amount of bandwidth?

draw_down3y ago

It’s really more that there is no reason not to do it. Void is marginally safer as well as shorter, so any minifier/transpile step etc will make this substitution.

Kukumber3y ago

Nice use of low altitude satellites to track individuals and sniff telecoms all over the world

This decompiled object class also spies on the grid network, that's quite interesting and very clever

I never knew we could also lobby governments to push for some office and cloud software full of spyware, even France had to ban them! [1]

This TikTok app is very dangerous!

Of course /s

[1] - https://news.ycombinator.com/item?id=33686599

lazyeye3y ago

Yes it is.

What is the reason why China blocks all foreign social media apps within its own borders?

trasz33y ago

It doesn’t. It just requires them to follow the law, like other countries do. The problem comes from the fact that American companies are used to buying their way around the laws, and in this case they can’t.

lazyeye3y ago

Yes they do. You are simply lying.

noduerme3y ago

This is really awesome work.

dbrueck3y ago

If I'm being charitable, the obfuscation in the article has an out of whack cost/benefit ratio. If I'm being cynical, the obfuscation they are doing strays well into the realm of nefarious. :)

thrashh3y ago

People knock on obfuscation but everything in life is based on trust. Locks being breakable, the fruit stand in front of a shop being unprotected, fences being scalable. Everything is a cost/benefit

rowanG0773y ago

It's the curse of ideological purity you see in a lot of the tech sector. Most of these types are of the sort that either something is unbreakable or it's useless.

noduerme3y ago

Just as a fun aside, I was perusing some of my most ancient HN threads and came across this obfuscated monster which I totally forgot about. Pastebin links inside are still active ;)

https://news.ycombinator.com/item?id=3432800

jstanley3y ago

Wait, why is a casino protecting its so-called "intellectual property" legitimate and above-board, but TikTok doing the same is not?

margalabargala3y ago

I don't think OP was defending their own earlier work or otherwise exempting it from their assertion that all obfuscated code should be considered malicious.

jstanley3y ago

Having reread it, I think you might be right.

> it's only admissible for hiding machinery from the general public.

rnd03y ago

That's how I read it too. I had the feeling that the experience convinced the OP that it's not valid except in some circumstances.

noduerme3y ago

neodymiumphish3y ago

noduerme3y ago

im3w1l3y ago

> Number generation

Number generation is extremely important and it's also regulated. You don't put such a thing in the client obfuscated or not.

kevin_thibedeau3y ago

Because they're doing it on hardware that they control.

maria23y ago

White box crypto is kind of like obfuscation, but tries to make it impossible to extract the information.

awestroke3y ago

No, encryption is very different from obfuscation, even if the former is often used in the latter

xurukefi3y ago

krackers3y ago

There's also indistinguishability obfuscation which I recall recently had a breakthrough in terms of practical construction

bobleeswagger3y ago

> since it's never a valid security practice

Why not? It's just another tool in the security game.

mtnygard3y ago

noduerme3y ago

This, exactly. Thank you for putting it so succinctly.

ViViDboarder3y ago

codedokode3y ago

This shows how browser developers race to provide new features ignoring privacy impact.

I don't understand why features that allow fingerprinting (reading back canvas pixels or GPU buffers) are not hidden behind a permission.
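To make the concern concrete, here is a minimal sketch of why reading pixels back is enough for fingerprinting: small rendering differences between devices change the bytes, and hashing the bytes gives a compact identifier. The hash here is plain FNV-1a, chosen only for brevity; it is an assumption for illustration and not what TikTok or any specific tracker uses. In a browser the bytes could come from something like `canvas.getContext("2d").getImageData(0, 0, w, h).data`; below they are simulated.

```javascript
// 32-bit FNV-1a over a byte array, returned as 8 hex digits.
function fnv1a(bytes) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (const byte of bytes) {
    hash ^= byte;
    hash = Math.imul(hash, 16777619); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Simulated RGBA pixels from two devices that differ in a single channel value.
const deviceA = new Uint8Array([12, 200, 33, 255]);
const deviceB = new Uint8Array([12, 200, 34, 255]);
console.log(fnv1a(deviceA), fnv1a(deviceB));
```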

jsnell3y ago

trifurcate3y ago

> "This page wants to send you notifications" or "this page wants to use the microphone" is understandable. "This page wants to read pixels from a canvas" isn't.

Yes, it is. Tor Browser already does this: https://www.bleepstatic.com/content/posts/2017/10/30/CanvasF...

ballenf3y ago

Giving users the permissions would simply be a training exercise in "I have to say 'yes' or TikTok breaks". Like how Android worked a few years ago with the other permissions.

tveyben3y ago

The user ‘Joe average’ does not use Tor, does not even know it exists - Tor is used by a completely different segment (of people with ‘above average’ IT skills…)

0xy3y ago

Of course its main use is fingerprinting. Do you think WebRTC is instantiated for genuine reasons the majority of the time? That's real absurdity.

WebRTC is instantiated most often by ad networks and anti-fraud services.

Given Chrome's stated mission of secure APIs and their actions of implementing leaky APIs with zeal, I have reason enough to question their motives.

After all, AudioContext is abused heavily on Google's ad networks. Google knows this.

ciarlill3y ago

> It provides audio latency information which is highly unique (why?).

arein33y ago

Wow, that's really shitty on Google's part.

Datagenerator3y ago

One alternative, LibreWolf, needs some more promotion; it has safer security settings by default.

psychphysic3y ago

But I suspect by the other two metrics it's correct to say most uses are to fingerprint.

ghayes3y ago

Take a look at Firefox’s Fingerprinting Prevention feature. This includes a permission for canvas, as well as:

- Your timezone is reported to be UTC

- Not all fonts installed on your computer are available to webpages

- The browser window prefers to be set to a specific size

- Your browser reports a specific, common version number and operating system

- Your keyboard layout and language is disguised

- Your webcam and microphone capabilities are disguised

- The Media Statistics Web API reports misleading information

- Any Site-Specific Zoom settings are not applied

- The WebSpeech, Gamepad, Sensors, and Performance Web APIs are disabled

https://support.mozilla.org/en-US/kb/firefox-protection-agai...

fxtentacle3y ago

It's because the developer of the browser needs fingerprinting for their ads.

I don't think Chrome accidentally exposed data that Google wanted.

IshKebab3y ago

Please don't spread obviously untrue conspiracy theories.

The main reason is that it's really hard to avoid fingerprinting (while providing rich features like WebGL and WebRTC anyway).

A secondary reason is that web browsers started off from a position of leaking fingerprint data all over the place so there's not much incentive to care about it for new features.

You might be interested in this effort to reduce fingerprinting: https://developer.chrome.com/en/docs/privacy-sandbox/privacy...

danielheath3y ago

I thought the developer of the browser is the only ad provider that _doesn't_ need it (since they have other, better ways to get that intel which their competitors do not).

asdfghjkjhg3y ago

they (google) did try.

that's the profile icon you see on your google-chrome UI.

but only fools use that feature.

supriyo-biswas3y ago

The more benign explanation would be to allow developers to work around device-specific or browser-specific bugs.

(I'm aware Apple changes the GPU Model to "Apple GPU", however they do expose a ton of other properties that make it possible to fingerprint a device.)

jakear3y ago

[1] fingrprintr.pages.dev

camyule3y ago

Firefox does have a mechanism to limit the amount of data being leaked for fingerprinting, but it’s disabled by default: https://support.mozilla.org/en-US/kb/firefox-protection-agai...

threatofrain3y ago

I wonder to what degree we can enable hardware performance without leaking user data.

RobotToaster3y ago

Isn't Mozilla's main source of income from Google?

madeofpalk3y ago

> This shows how browser developers race to provide new features ignoring privacy impact.

I think it showed how many years ago browser vendors were naive with understanding how this tech could be misused.

neop1x3y ago

[1] https://coveryourtracks.eff.org/

[2] https://fingerprint.com/

PetahNZ3y ago

Come on, it's not their main usage... An intentional side effect maybe, but their main usage is clear.

0xy3y ago

If something is used 99% of the time for tracking and 1% of the time for genuine useful reasons, it's safe to say it's a tracking mechanism.

Intent is irrelevant, the APIs are fundamentally insecure. Google directly benefits from this financially.

ivoras3y ago

Of course it's not that simple.

The best we can do now is more and more education about what can go wrong with such data collection.

ajsnigrutin3y ago

TobyTheDog1233y ago

nullpt_rs3y ago

TobyTheDog1233y ago

Hey I'd love to:

1.0.0.200: https://hastebin.com/tudivadufa.apache

Unknown version: https://hastebin.com/jasuxineti.js

Some of these might have some console.logs (or curse words), but as a whole should be representative

moneywoes3y ago

Are you able to scrape with a headless browser?

TobyTheDog1233y ago

Yeah, I can get basic user information pretty reliably just from the initial page load.

thih93y ago

[1]: https://github.com/javascript-obfuscator/javascript-obfuscat...

[2]: https://github.com/javascript-obfuscator/javascript-obfuscat...

czx4f4bd3y ago

Based on my previous research into this, the magic keywords to find this kind of thing on Google are "virtualization obfuscation" or "VM obfuscation".

hoosieree3y ago

It's only for C, but Tigress[1] supports a ton of obfuscation types. Virtualization and JIT are very effective, especially when used together with control flow transforms like Split and Flatten.

Renaming variables or encoding them is fairly trivial to reverse.

[1] https://tigress.wtf/transformations.html
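For anyone unfamiliar with the Flatten transform mentioned above, here is a hand-written sketch of the idea in JavaScript (Tigress itself targets C, and this is not its output). A structured loop is turned into a dispatcher over numbered states, so the original control flow is no longer visible in the code's shape. The function names and state numbering are made up for this example.

```javascript
// Original, readable version.
function sumTo(n) {
  let total = 0;
  for (let i = 1; i <= n; i++) total += i;
  return total;
}

// Flattened version: every basic block becomes a case in one dispatch loop.
function sumToFlattened(n) {
  let state = 0;
  let total;
  let i;
  while (true) {
    switch (state) {
      case 0: total = 0; i = 1; state = 1; break;   // loop initialization
      case 1: state = i <= n ? 2 : 3; break;        // loop condition
      case 2: total += i; i++; state = 1; break;    // loop body and increment
      case 3: return total;                         // exit block
    }
  }
}

console.log(sumTo(10), sumToFlattened(10)); // prints 55 55
```

Real obfuscators usually also shuffle or encode the state values, which is what makes the result tedious to untangle by hand.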

xchkr13373y ago

0x0083y ago

If I recall correctly: Electron can compile JavaScript to “ByteNode” which is some form of byte code intended to be run in the V8 engine.

derefr3y ago

antiviral3y ago

This is excellent work.

mdaniel3y ago

Well, I guess they can stand in line:

https://news.ycombinator.com/item?id=34112874 (TikTok banned on government devices)

https://news.ycombinator.com/item?id=34121201 (TikTok admits to spying on U.S. users)

KirillPanov3y ago

Awesome, really awesome work. However:

> If that is something you are interested in, keep an eye out for the second part of this series :)

Your site is missing an RSS/Atom feed, so I can't do that. ::sad face::

CallMeMarc3y ago

We're sharing the same fate apparently! Just added a PR to their repository to add some feeds, hope it gets merged soon.

https://github.com/nullpt-rs/blog/pull/1

nullpt_rs3y ago

Thanks for the PR! You should now be able to access the feeds :)

https://www.nullpt.rs/feed.atom

https://www.nullpt.rs/feed.rss

https://www.nullpt.rs/feed.json

wiml3y ago

amelius3y ago

Can someone explain what VM they are talking about, where that VM is running, and what is running in it?

dbrueck3y ago

It sounds like the forthcoming part 2 article will go into more depth.

hendrik1273y ago

This seems to refer to a language virtual machine rather than an actual VM. Like the Java Virtual Machine (JVM)
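To illustrate what a "language virtual machine" means in this context, here is a toy stack-based interpreter written in JavaScript. It is only a sketch of the general technique: the opcodes and their numbering are invented for this example and have nothing to do with TikTok's actual instruction set.

```javascript
// Hypothetical instruction set for a tiny stack machine.
const OP = { PUSH: 0, ADD: 1, MUL: 2, HALT: 3 };

// Fetch-decode-execute loop over a flat array of opcodes and operands.
function run(bytecode) {
  const stack = [];
  let pc = 0; // program counter
  while (pc < bytecode.length) {
    const op = bytecode[pc++];
    switch (op) {
      case OP.PUSH:
        stack.push(bytecode[pc++]); // operand follows the opcode
        break;
      case OP.ADD: {
        const b = stack.pop();
        const a = stack.pop();
        stack.push(a + b);
        break;
      }
      case OP.MUL: {
        const b = stack.pop();
        const a = stack.pop();
        stack.push(a * b);
        break;
      }
      case OP.HALT:
        return stack.pop();
      default:
        throw new Error(`unknown opcode ${op}`);
    }
  }
  return stack.pop();
}

// Encodes (2 + 3) * 4.
const program = [OP.PUSH, 2, OP.PUSH, 3, OP.ADD, OP.PUSH, 4, OP.MUL, OP.HALT];
console.log(run(program)); // prints 20
```

In VM-based obfuscation the real logic ships only as the bytecode array, so a reader has to understand the interpreter before the program itself becomes legible.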

Aperocky3y ago

Isn't the same concept also used in YouTube? I believe a Python mock of the equivalent VM exists in youtube-dl.

linux26473y ago

mdaniel3y ago

I recall that discussion recently, and thus just happen to have it handy:

a very, very specialized "regex" based JS evaluator that presumably did just enough to make the YT one run: https://github.com/ytdl-org/youtube-dl/blob/2021.12.17/youtu...

and its callsite: https://github.com/ytdl-org/youtube-dl/blob/2021.12.17/youtu...

So the short version is that I would not classify that as a VM, and I don't even believe it's obfuscated. Perhaps there are other extractors that do what you're describing, I didn't go looking

Alifatisk3y ago

I never knew that TikTok was shipped with its own virtual machine!

But that explains the obvious subdomain vm.tiktok.com

llacb473y ago

Don’t think that’s what vm means there. The m is likely “maliva”, which is tiktok’s overseas (US/europe) CDN.

born-jre3y ago

renonce3y ago
