Part of the problem is that there are still no good criteria available to define anonymity. Concepts like differential privacy are a step in the right direction but they still provide room for error, and in many cases they are either too restrictive (transformed data is not useful anymore) or too lax (transformed data is useful but can be easily re-identified).
Society is a tapestry of bullshit, and low-level swindling is generally tolerated or quickly forgotten. Thus, there's nothing to prod the unprincipled in charge to do the right thing. As long as something seems to be good (anonymized, in this case), and problems can be hidden behind the corporate veil long enough, the unwritten rule is to half-ass security solutions because, well, security is boring and there are other things to devote company time and resources to (things that will advance upper management).
Security measures, especially those that protect the users, don't make money. At best, they're insurance against the fallout that might occur when it's revealed that your company has been silently screwing people over. Like most human beings, businesses often put off serious consideration of the future in order to enjoy quick and immediate gain.
I wouldn't put it past most companies to screw up an approach like differential privacy. Not enough people actually care that much.
This is why the government has to make regulations with teeth in this space (of course, the government could be the "unprincipled in charge" you referred to).
Lots of companies are content to stop at "our data can't be linked back to a person's identity", which doesn't prevent building a uniquely-identifying user profile (e.g. via browser fingerprinting, plus enough metadata to associate a user's computer and phone accounts). Even if they do better than that, it's typically "our data is not uniquely identifying in isolation", which still isn't enough. If your differential privacy model says that these four pieces of data have a specificity of 10,000 possible individuals, that's a good start. But if someone with an individual's PII and three of those keys comes looking, they can still narrow down information about the fourth value from your aggregates.
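To make the narrowing concrete, here's a minimal sketch with made-up keys and counts (the attribute names and numbers are my own assumptions, not from any real dataset): an aggregate keyed by four quasi-identifiers looks safe bucket-by-bucket, but an attacker who already knows three of the keys can condition on them and read off a distribution over the fourth.

```python
# Hypothetical aggregate: counts of individuals per
# (zip, age band, gender, diagnosis). Each bucket is large in
# isolation, but filtering on three known keys isolates the fourth.
aggregate = {
    ("90210", "30-39", "F", "diabetes"): 12,
    ("90210", "30-39", "F", "none"): 9800,
    ("90210", "30-39", "M", "diabetes"): 7500,
}

def narrow_fourth_key(aggregate, known_zip, known_age, known_gender):
    """Return the distribution over the unknown fourth attribute,
    given the three quasi-identifiers the attacker already has."""
    return {
        diagnosis: count
        for (z, a, g, diagnosis), count in aggregate.items()
        if (z, a, g) == (known_zip, known_age, known_gender)
    }

print(narrow_fourth_key(aggregate, "90210", "30-39", "F"))
# The target is probably "none", but if she also appears in some
# diabetes-care dataset, only 12 candidates remain in this bucket.
```

The point isn't the toy numbers; it's that aggregates conditioned on known side information stop being aggregates.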
And even if no one screws up, what happens when someone queries a half dozen differential datasets for different subsets of a uniquely identifying key? It's something like the file-drawer problem, where one researcher hiding bad data is malicious, but a dozen studies failing to coordinate produces the same result innocently. If outright failures to anonymize become rarer, cross-dataset approaches become more rewarding.
https://fpf.org/wp-content/uploads/2017/06/FPF_Visual-Guide-...
You keep data because data is economically valuable, but even when you care enough to implement techniques that depend on the right invariants, you still fall short of doing it well, both because of scale and because nobody wants to spend the time refining the techniques. It also means that somewhere, somebody may have a technique that, given enough pieces of data, can reverse your transformation.
If companies were required to aggregate information in this way and throw away their logs, perhaps leaks would be much less risky for their users.
Today this might seem far-fetched, but it could come to pass in the future, when people raised in this environment and able to understand the implications and technical aspects come to political power.
The main take-away from the talk - and in fact all the talks I saw on the same day - was that while DP is touted as a silver bullet and the new hotness, in reality it cannot protect against the battery of information-theoretic attacks advertisers have been aware of for a couple of decades, and intelligence agencies must have been using for a lot longer. Hiding information is really hard. Cross-correlating data across different sets, even if each set in itself contains nothing but weak proxies, remains a powerful deanonymisation technique.
After all, if you have huge pool of people and dozens or even hundreds of unique subgroups, the Venn-diagram-like intersection of just a handful will carve out a small and very specific population.
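A quick sketch of that Venn-diagram intersection, with made-up subgroups (the proxies and sizes are invented for illustration): each set alone matches 10,000 people out of a million, but assuming rough independence, membership in all three narrows the pool to about one person.

```python
import random

random.seed(1)
population = range(1_000_000)

# Three weak proxies, each matching 1% of the pool on its own.
drives_a_red_car = set(random.sample(population, 10_000))
shops_at_store_x = set(random.sample(population, 10_000))
jogs_at_6am      = set(random.sample(population, 10_000))

# Independence predicts 0.01^3 * 1,000,000 = roughly one person.
candidates = drives_a_red_car & shops_at_store_x & jogs_at_6am
print(len(candidates))
```

Real proxies are correlated rather than independent, which usually makes the carving even sharper, not duller.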
There's a lot of privacy snakeoil out there and even large govt departments fall for it.
https://pursuit.unimelb.edu.au/articles/the-simple-process-o...
It's possible I'm misreading, but your paper seems to focus on the very anonymization techniques diff privacy was invented to improve on, specifically because these kinds of attacks exist. While I agree it's no silver bullet, the reason is because it's too strong (it's hard to get useful results while providing such powerful guarantees) rather than not strong enough.
I've found the introduction to this textbook on it to be useful and very approachable if others are interested: https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
After spending three years working on privacy technologies, I'm convinced that anonymization of high-dimensional datasets (say, more than 1000 bits of information entropy per individual) is simply not possible for information-theoretic reasons; the best we can do for such data is pseudonymization or deletion.
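The back-of-the-envelope arithmetic behind that claim: about 33 bits of entropy are enough to single out one individual among roughly eight billion people, so a record carrying 1000 bits is unique by an absurd margin.

```python
import math

# Bits needed to uniquely identify one person on Earth.
world_population = 8_000_000_000
bits_to_identify = math.log2(world_population)
print(round(bits_to_identify, 1))  # ≈ 32.9
```

With ~33 bits sufficing, a 1000-bit record has roughly 967 bits to spare; you'd have to destroy almost all of its information content before it stopped being identifying, at which point it's no longer useful data.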
I want to be better equipped to respond to this slowly emerging "DP is a silver bullet" meme and your response implies that you'd have actual research to back the position up.
Also, it's not a magical solution. Here's one of the issues from the linked paper (edited for clarity):
"The proponents of differential privacy have always maintained that the setting of the [trade-off between privacy loss (ε) and accuracy] is a policy question, not a technical one. [...] To date, the Census committee has set the values of ε far higher than those envisioned by the creators of differential privacy. (In their contemporaneous writings, differential privacy’s creators clearly imply that they expected values of ε that were “much less than one.”)
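The ε trade-off the quote describes is easy to see concretely. Below is a sketch of the standard Laplace mechanism (noise with scale sensitivity/ε added to a count); the specific counts and ε values are my own illustrative choices. Small ε means strong privacy and noisy answers; the large ε values the Census reportedly chose give much more accurate, and much less private, releases.

```python
import math
import random

def laplace_mechanism(true_count, epsilon, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity/epsilon
    (inverse-CDF sampling of a Laplace(0, scale) variate)."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

random.seed(0)
for eps in (0.1, 1.0, 10.0):
    samples = [round(laplace_mechanism(1000, eps)) for _ in range(5)]
    print(f"epsilon={eps}: {samples}")
```

At ε = 0.1 the answers wander tens of counts away from the truth; at ε = 10 they barely move, which is exactly why "how big is ε" is a policy fight and not a technical detail.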
us census, RIP
One of the leaks they talk about was from Experian, a credit reporting agency. Not only would this approach work poorly for them, it wouldn't be legal (they need to be able to back up any claims they make about people, which requires going back to the source data).
For freeways, lots of small segments, and fuzzing of timestamps to co-mingle users. Where there's a stoplight, snap the intersection crossing time to the (estimated) green light for everyone in the queue.
The anonymity would come from breaking up both requests and observed telemetry to fragments too small to tie back to a single user or session (and thus form a pattern; I hope).
Do NOT record end-times, only an intended route. Do NOT associate that movement to any particular user or persistent session (ideally in memory on the mobile device only, not saved: though it could save favorite routes locally). Packages of transition times between various freeway exits would generally help add to anonymity.
That would also be part of generally improving the UI for the user. The application on the device should be making most of the decisions, by asking about the traffic in a given region on a grid. I also want it to show me (the driver) the data (heatmap) on the rejected routes so I know what isn't a good option.
https://www.hhs.gov/hipaa/for-professionals/privacy/special-...
https://www.deccanchronicle.com/technology/in-other-news/201...
There's tons of PHI on the internet. Your local hospital's online medical chart, your insurance company's bill-pay, etc...