My question is: how do you find all those algorithms? Wikipedia never really feels like a good introductory or discovery resource. In particular, some problems have been thoroughly studied by scholars, but you don't find them because you haven't found the keywords that will steer Google in the right direction. And I am not a researcher, so I don't keep tabs on a single domain; I'm a jack of all trades and work on a wide set of things.
The general setting for this type of algorithm is that you have a large number of states and you want to find the state that maximizes a particular score. In the TSP, the states are paths and the score is the (negative of the) path length. The idea is to define a Markov chain whose stationary distribution is proportional to the score. This means that if you jump from state to state according to the probabilities defined by the Markov chain, eventually the probability of ending up at a given state is proportional to that state's score.
In the case of the TSP, the states can be represented as permutations of the cities, and the neighbors of a state are obtained by swapping the positions of two cities. The probability of moving to a neighbor is higher if the neighboring path is shorter than the current one.
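As a minimal sketch of that idea (Metropolis-style acceptance at a fixed temperature; the helper names, the step count, and the temperature are illustrative assumptions, not from the comment above):

```python
import math
import random

def tour_length(tour, dist):
    """Total length of the closed tour under the distance matrix."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def metropolis_tsp(dist, steps=20000, temp=1.0):
    """Random walk over permutations: propose a two-city swap, always accept
    improvements, accept worse tours with probability exp(-delta / temp)."""
    n = len(dist)
    tour = list(range(n))
    random.shuffle(tour)
    cur_len = tour_length(tour, dist)
    best, best_len = list(tour), cur_len
    for _ in range(steps):
        i, j = random.sample(range(n), 2)       # pick two distinct positions
        tour[i], tour[j] = tour[j], tour[i]     # propose a swap
        new_len = tour_length(tour, dist)
        if new_len <= cur_len or random.random() < math.exp((cur_len - new_len) / temp):
            cur_len = new_len                   # accept the move
            if cur_len < best_len:
                best, best_len = list(tour), cur_len
        else:
            tour[i], tour[j] = tour[j], tour[i] # reject: undo the swap
    return best, best_len
```

Holding the temperature fixed gives a plain Metropolis walk; annealing the temperature toward zero turns this into simulated annealing.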
I know a place.
From the GitHub repo Coding Interview University[0] there is a link to a Jupyter notebook[1] on various ways to solve the traveling salesman problem - it is a very good, detailed resource.
[0] https://github.com/jwasham/coding-interview-university
[1] https://nbviewer.jupyter.org/url/norvig.com/ipython/TSP.ipyn...
Knowing the right keywords and roughly how they fit together is a huge component of domain expertise. Building general background knowledge through experience and readings goes a long way for this reason.
At first it's novel; then at some point it appears in books dedicated to the cause, e.g. books specific to polynomial-time approximation schemes. Look at the references in Wikipedia. They often point to a larger body of work than the wiki can possibly hope to cover.
When reading a paper I would just scan the abstract, conclusion, introduction, related work section (possibly in that order) to see if it’s relevant.
For finding papers I like semantic scholar and Google scholar.
https://github.com/nraynaud/webgcode/blob/gh-pages/webapp/cn...
I guess I decided that some order was better than random.
I'll try to answer any questions about the post (technical or otherwise).
Edit: added more details about me.
identify(anonymous_id, new_canonical_id)
you merge the entire set of anonymous ids associated with find(anonymous_id)
with the set of anonymous ids associated with new_canonical_id?

A classic example is pre- and post-signup behavior for a single user. When a user first lands on a page, they will be anonymous and lack a canonical identity. They may come from specific referrers (search, ad, social media, direct), land on a specific page, or engage with certain parts of the site. All of these actions are tracked and stored using an anonymous id. After the user creates an account and is assigned a canonical id (via the `identify` API call), we still want to associate all the previously tracked data with the canonical identity. This allows our users to perform analyses using events and data points from before and after identification.
> merge the entire set of anonymous ids
In the previous example the "set of anonymous ids" is just a single ID. There are use cases where a user may already have a canonical identity but we want to change/update that canonical id. In this case, we merge all the data associated with both canonical identities (the set of anonymous ids associated with the canonical user and the set of ids associated with the new canonical identity) and create a single combined user with a cohesive view of all actions on our customer's site/app etc.
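One plausible way to back `identify` with a union-find, sketched here with hypothetical names (`IdentityStore`, `parent`) that are not from the actual product:

```python
class IdentityStore:
    """Toy union-find keyed by user-id strings (illustrative only)."""

    def __init__(self):
        self.parent = {}  # uid -> parent uid; roots point to themselves

    def find(self, uid):
        """Path-compressing find; unseen ids become singleton sets."""
        self.parent.setdefault(uid, uid)
        while self.parent[uid] != uid:
            self.parent[uid] = self.parent[self.parent[uid]]  # compress
            uid = self.parent[uid]
        return uid

    def identify(self, anonymous_id, canonical_id):
        """Merge the anonymous user's set into the canonical identity's set."""
        self.parent[self.find(anonymous_id)] = self.find(canonical_id)
```

After `identify("anon-1", "user-42")`, both ids resolve to the same root, so events tracked under either id belong to one user.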
> most hashes ... are essentially one way functions, ideally preserving no information about the input at all
You are correct, an individual hash does not preserve any information about the node, nor does it approximate rank in the union-find graph. We only get the effect in aggregate when many unions are performed and we repeatedly hash and select the lower value as the new root.
The hash for a given node in the graph is deterministic and fixed. It is effectively a random value. But the hash for a tree of nodes (aka the hash of the tree's root node) decreases with each union operation. This is because we select the root with the lower hash to become the root of the combined tree every union operation. As a result, the node with the lowest hash is the root of each tree. With more union operations, larger trees will tend to have lower and lower hashes - effectively approximating a rank.
Put another (more handwave-y) way, we essentially roll the dice and get a fixed random number (hash) for each node in the tree. Since the tree is assigned the lowest random number contained within it, larger trees will probably have lower values than smaller trees (more dice rolls to get a lower number).
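As a hedged illustration of union-by-minimum-hash (the function names and the SHA-256 choice are assumptions, not the actual implementation):

```python
import hashlib

def node_hash(uid):
    """Deterministic pseudo-random value per node: the fixed 'dice roll'."""
    return int.from_bytes(hashlib.sha256(uid.encode()).digest()[:8], "big")

def find(parent, uid):
    """Path-compressing find over a dict-based forest."""
    parent.setdefault(uid, uid)
    while parent[uid] != uid:
        parent[uid] = parent[parent[uid]]
        uid = parent[uid]
    return uid

def union_by_min_hash(parent, a, b):
    """Union two trees, keeping the root with the lower hash as the new root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra == rb:
        return
    if node_hash(ra) <= node_hash(rb):
        parent[rb] = ra
    else:
        parent[ra] = rb
```

Each union keeps the smaller of the two root hashes, so a tree's root hash only ever decreases as it absorbs more nodes.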
1. an appropriate privacy and deanonymization level to each data item in a structure (corporate privacy, trade secret, account ID)
2. an appropriate privacy level to a tuple of data items (i.e., PII; name, ID, birthdate)
3. an appropriate security context for each class of end users (support, engineering, marketing), determining how much data, and at what degree of deanonymization, each class may use.
Be careful with tracking users; it may fall foul of the GDPR. If your company uses analytics like this, it can become almost impossible to remove personally identifiable tracking information from your website or mobile app platform.
This means it becomes a nightmare to remove user tracking information from every system and its logs. This is one of the reasons most US solutions are ill-suited for Europe. Hopefully the US can really start respecting privacy and build systems that put privacy first.
Given that the most admired companies are built on invasion of privacy (Facebook, Google, Amazon, Netflix), I doubt there is the will to remove privacy-invading technologies from the core. I see some efforts by Apple, but the larger app ecosystem still relies on trading privacy for free apps or utilities, and that will be hard to change.
This is already the case! Interestingly enough, our identity tracking mechanisms actually make it easier to delete users and purge _all_ of their data in the same way it makes unifying all their data for analysis easier in the first place.
When a GDPR request comes in through our API, we search for the matching user in our database. If we find a matching user, we look in our Identity system ('s union-find data structure) to find all the other user_ids associated with that canonical identity. Then we go through and delete all the data for every constituent user. This is essentially a reverse lookup (find the set of user_ids for a canonical user_id/identity) and is easy to execute.
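A toy sketch of that reverse lookup plus deletion (all names here are hypothetical, not the real API, and a production system would index roots rather than scan every id):

```python
def find(parent, uid):
    """Path-compressing find over a dict-based union-find forest."""
    parent.setdefault(uid, uid)
    while parent[uid] != uid:
        parent[uid] = parent[parent[uid]]
        uid = parent[uid]
    return uid

def delete_user_data(parent, events, uid):
    """Purge every event tracked under any id merged into uid's identity."""
    root = find(parent, uid)
    # Reverse lookup: every id whose set resolves to the same root.
    doomed = {other for other in list(parent) if find(parent, other) == root}
    doomed.add(uid)
    for i in doomed:
        events.pop(i, None)  # delete that id's tracked events
    return doomed
```

Because the union-find already groups all of a user's ids under one root, the deletion path reuses the same structure that powers the analysis path.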