My question is: how do you find all those algorithms? Wikipedia never really feels like a good introductory or discovery resource. In particular, some problems have been thoroughly studied by scholars, but you don't find them because you haven't found the keywords that will steer Google in the right direction. And I am not a researcher, so I don't keep tabs on a single domain; I'm a jack of all trades and work on a wide set of things.
The general setting for this type of algorithm is that you have a large number of states and you want to find the state that maximizes a particular score. In the TSP, the states are paths and the score is the (negative of the) path length. The idea is to define a Markov chain whose stationary distribution is proportional to the score. This means that if you jump from state to state according to the probabilities defined by the Markov chain, eventually the probability of ending up at a given state is proportional to that state's score.
In the case of the TSP, the states can be represented as permutations of the cities, and the neighbors of a state are obtained by swapping the positions of two cities. The probability of moving to a neighbor is higher if the neighboring path is shorter than the current one.
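As a minimal sketch of that idea (Metropolis-style acceptance at a fixed temperature; the helper names, the step count, and the temperature are illustrative assumptions, not from the comment above):

```python
import math
import random

def tour_length(tour, dist):
    """Total length of the closed tour under the distance matrix."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def metropolis_tsp(dist, steps=20000, temp=1.0):
    """Random walk over permutations: propose a two-city swap, always accept
    improvements, accept worse tours with probability exp(-delta / temp)."""
    n = len(dist)
    tour = list(range(n))
    random.shuffle(tour)
    cur_len = tour_length(tour, dist)
    best, best_len = list(tour), cur_len
    for _ in range(steps):
        i, j = random.sample(range(n), 2)       # pick two distinct positions
        tour[i], tour[j] = tour[j], tour[i]     # propose a swap
        new_len = tour_length(tour, dist)
        if new_len <= cur_len or random.random() < math.exp((cur_len - new_len) / temp):
            cur_len = new_len                   # accept the move
            if cur_len < best_len:
                best, best_len = list(tour), cur_len
        else:
            tour[i], tour[j] = tour[j], tour[i] # reject: undo the swap
    return best, best_len
```

Holding the temperature fixed gives a plain Metropolis walk; annealing the temperature toward zero turns this into simulated annealing.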
I know a place.
From the GitHub repo Coding Interview University[0] there is a link to a Jupyter notebook[1] on various ways to solve the traveling salesman problem - it is a very good, detailed resource.
[0] https://github.com/jwasham/coding-interview-university
[1] https://nbviewer.jupyter.org/url/norvig.com/ipython/TSP.ipyn...
Knowing the right keywords and roughly how they fit together is a huge component of domain expertise. Building general background knowledge through experience and readings goes a long way for this reason.
At first it's novel; then at some point it appears in books dedicated to the cause, e.g. books specific to polynomial-time approximation schemes. Look at the references in Wikipedia. They often point to a larger body of work than the wiki can possibly hope to cover.
When reading a paper I would just scan the abstract, conclusion, introduction, related work section (possibly in that order) to see if it’s relevant.
For finding papers I like semantic scholar and Google scholar.
https://github.com/nraynaud/webgcode/blob/gh-pages/webapp/cn...
I guess I decided that some order was better than random.
I'll try to answer any questions about the post (technical or otherwise).
Edit: added more details about me.
identify(anonymous_id, new_canonical_id)
you merge the entire set of anonymous ids associated with find(anonymous_id)
with the set of anonymous ids associated with new_canonical_id?

A classic example is pre- and post-signup behavior for a single user. When a user first lands on a page, they will be anonymous and lack a canonical identity. They may come from specific referrers (search, ad, social media, direct), land on a specific page, or engage with certain parts of the site. All of these actions are tracked and stored using an anonymous id. After the user creates an account and is assigned a canonical id (via the `identify` API call), we still want to associate all the previously tracked data with the canonical identity. This allows our users to perform analyses using events and data points from before and after identification.
> merge the entire set of anonymous ids
In the previous example the "set of anonymous ids" is just a single ID. There are use cases where a user may already have a canonical identity but we want to change/update that canonical id. In this case, we merge all the data associated with both canonical identities (the set of anonymous ids associated with the canonical user and the set of ids associated with the new canonical identity) and create a single combined user with a cohesive view of all actions on our customer's site/app etc.
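One plausible way to back `identify` with a union-find, sketched here with hypothetical names (`IdentityStore`, `parent`) that are not from the actual product:

```python
class IdentityStore:
    """Toy union-find keyed by user-id strings (illustrative only)."""

    def __init__(self):
        self.parent = {}  # uid -> parent uid; roots point to themselves

    def find(self, uid):
        """Path-compressing find; unseen ids become singleton sets."""
        self.parent.setdefault(uid, uid)
        while self.parent[uid] != uid:
            self.parent[uid] = self.parent[self.parent[uid]]  # compress
            uid = self.parent[uid]
        return uid

    def identify(self, anonymous_id, canonical_id):
        """Merge the anonymous user's set into the canonical identity's set."""
        self.parent[self.find(anonymous_id)] = self.find(canonical_id)
```

After `identify("anon-1", "user-42")`, both ids resolve to the same root, so events tracked under either id belong to one user.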
> most hashes ... are essentially one way functions, ideally preserving no information about the input at all
You are correct, an individual hash does not preserve any information about the node, nor does it approximate rank in the union-find graph. We only get the effect in aggregate when many unions are performed and we repeatedly hash and select the lower value as the new root.
The hash for a given node in the graph is deterministic and fixed. It is effectively a random value. But the hash for a tree of nodes (aka the hash of the tree's root node) decreases with each union operation. This is because we select the root with the lower hash to become the root of the combined tree every union operation. As a result, the node with the lowest hash is the root of each tree. With more union operations, larger trees will tend to have lower and lower hashes - effectively approximating a rank.
Put another (more handwave-y) way, we essentially roll the dice and get a fixed random number (hash) for each node in the tree. Since the tree is assigned the lowest random number contained within it, larger trees will probably have lower values than smaller trees (more dice rolls to get a lower number).
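As a hedged illustration of union-by-minimum-hash (the function names and the SHA-256 choice are assumptions, not the actual implementation):

```python
import hashlib

def node_hash(uid):
    """Deterministic pseudo-random value per node: the fixed 'dice roll'."""
    return int.from_bytes(hashlib.sha256(uid.encode()).digest()[:8], "big")

def find(parent, uid):
    """Path-compressing find over a dict-based forest."""
    parent.setdefault(uid, uid)
    while parent[uid] != uid:
        parent[uid] = parent[parent[uid]]
        uid = parent[uid]
    return uid

def union_by_min_hash(parent, a, b):
    """Union two trees, keeping the root with the lower hash as the new root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra == rb:
        return
    if node_hash(ra) <= node_hash(rb):
        parent[rb] = ra
    else:
        parent[ra] = rb
```

Each union keeps the smaller of the two root hashes, so a tree's root hash only ever decreases as it absorbs more nodes.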
1. an appropriate privacy and deanonymization level to each data item in a structure (corporate privacy, trade secret, account ID)
2. an appropriate privacy level to a tuple of data items (i.e., PII; name, ID, birthdate)
3. an appropriate security context for each class of end users (support, engineering, marketing), determining how much data, and at what degree of deanonymization, each class may use.
Be careful with tracking users; it may fall foul of the GDPR. If your company uses analytics like this, it can become almost impossible to remove personally identifiable tracking information from your website or mobile app platform.
This means it becomes a nightmare to remove user tracking information from every system and its logs. This is one of the reasons most US solutions are ill-suited for Europe. Hopefully the US can really start respecting privacy and build systems that put privacy first.
Given that the most admired companies are built on invasion of privacy (Facebook, Google, Amazon, Netflix), I doubt there is the will to remove privacy-invading technologies from the core. I see some efforts by Apple, but the larger app ecosystem still relies on trading privacy for free apps or utilities, and that will be hard to change.
This is already the case! Interestingly enough, our identity tracking mechanisms actually make it easier to delete users and purge _all_ of their data in the same way it makes unifying all their data for analysis easier in the first place.
When a GDPR request comes in through our API, we search for the matching user in our database. If we find a matching user, we look in our Identity system ('s union-find data structure) to find all the other user_ids associated with that canonical identity. Then we go through and delete all the data for every constituent user. This is essentially a reverse lookup (find the set of user_ids for a canonical user_id/identity) and is easy to execute.
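A toy sketch of that reverse lookup plus deletion (all names here are hypothetical, not the real API, and a production system would index roots rather than scan every id):

```python
def find(parent, uid):
    """Path-compressing find over a dict-based union-find forest."""
    parent.setdefault(uid, uid)
    while parent[uid] != uid:
        parent[uid] = parent[parent[uid]]
        uid = parent[uid]
    return uid

def delete_user_data(parent, events, uid):
    """Purge every event tracked under any id merged into uid's identity."""
    root = find(parent, uid)
    # Reverse lookup: every id whose set resolves to the same root.
    doomed = {other for other in list(parent) if find(parent, other) == root}
    doomed.add(uid)
    for i in doomed:
        events.pop(i, None)  # delete that id's tracked events
    return doomed
```

Because the union-find already groups all of a user's ids under one root, the deletion path reuses the same structure that powers the analysis path.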