If you dig through the original paper, the key conclusion is this line:
“For 7 out of 15 comparisons, we found no significant difference between the accuracy of features developed on the control dataset vs. those developed on some version of the synthesized data; that is, the result of the test was False.”
So, on the tests they developed, the proposed method doesn't work 8 times out of 15…
"for 7 out of 15 comparisons, we found no significant difference" could mean all sorts of things. It could mean that 7 comparisons were perfect and 8 were complete garbage, as you suggest. Or it could mean that 7 comparisons were perfect and 8 had differences that were statistically significant, but the magnitudes of the differences were small enough that the results would still have been perfectly adequate for practical application.
In concrete terms: Let's say the synthetic data lets me build a binary classifier that helps with a business problem and has an F1 score of about 0.8, but with access to the real data I could have gotten an F1 of around 0.85. In that case, I'd happily take the synthetic data. As someone who's trying to solve business problems, it would be downright irresponsible of me to reject something that's better than what I currently have on the grounds that it's still less than some unattainable ideal.
I mean I have formed an association specifically with the MIT brand now, so this type of work coming out of there doesn't surprise me. I couldn't tell you exactly what has led to this association though.
When we examined the confidence intervals for the remaining 8 tests, we found that for half of them, the mean accuracy for features written over synthesized data was higher than for those written on the control dataset.
In other words, for 4 out of the remaining 8 cases, the models on synthetic data performed better.
But I want to comment that it's worked for us. Sequence-to-sequence learning can reproduce every kind of iid and non-iid data we've ever looked at.
The real question is how safe/anonymous is it really?
If it gets down to correctly modeling the probability of colon cancer diagnosis by age, sex and ZIP code, and also the correct distribution of ages by ZIP code, then that'll be a potential problem in counties that only have one male 87-year-old.
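That re-identification worry can be checked mechanically by counting how many records are unique on the quasi-identifiers, a standard k-anonymity-style check. This is not from the paper; the toy records below are purely illustrative.

```python
from collections import Counter

# Count records that are unique on the quasi-identifiers (age, sex, ZIP).
# If a generator faithfully models P(diagnosis | age, sex, ZIP), any value
# it reports for a unique combination is effectively that one person's.
records = [
    (34, "F", "02139"),
    (34, "F", "02139"),
    (87, "M", "59001"),  # the only 87-year-old male in this ZIP
    (52, "M", "02139"),
    (52, "M", "02139"),
]

counts = Counter(records)
unique = [r for r, n in counts.items() if n == 1]
print(unique)
```

A record that appears once is its own equivalence class (k = 1), which is exactly the "one male 87-year-old in the county" case.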
- It looks like the work of the data scientists will be limited to the extent of the modeling already done by recursive conditional parameter aggregation. (edit: So why not just ship that model and adapt it instead of using it to generate data?)
- Its "validation" appears to be doubly proxied - i.e. the normal performance measures we use are themselves a proxy, and now we're comparing those against these performance measures derived from models built out of the data generated by these models. I'm not inclined to trust a validation that is so removed.
Can anyone explain this well?
Peeling back the mystery a bit, what is happening is:
1. From each child table upwards, model each column as a simple parametric distribution (e.g. Gaussian) plus a covariance matrix.
2. Given those child-table distribution parameters, pass them back as row values to their respective parent tables.
What you end up with is a "flattened" version of each parent table that carries the information (in an "information theoretic" sense) of all child relations. Sampling from distributions is straightforward. The stats methods are outlined in section 3 of the paper.
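The two steps above can be sketched in miniature. This is a minimal illustration of the flattening idea only, assuming Gaussian columns and a toy one-to-many schema; it is not the paper's implementation (which also applies copula transforms and models covariances).

```python
import math

def gaussian_params(values):
    """Fit a Gaussian to a list of numbers: return (mean, std)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(var)

def flatten(parent_rows, children_by_parent, child_columns):
    """Step 1: fit a distribution per child column, per parent.
    Step 2: pass the fitted parameters back up as extra parent columns."""
    flat = []
    for pid, row in parent_rows.items():
        extended = dict(row)
        child_rows = children_by_parent.get(pid, [])
        for col in child_columns:
            mu, sigma = gaussian_params([c[col] for c in child_rows])
            extended[col + "_mu"] = mu
            extended[col + "_sigma"] = sigma
        flat.append(extended)
    return flat

# Toy schema: each parent row owns several child rows.
parents = {1: {"region": "A"}, 2: {"region": "B"}}
children = {1: [{"amount": 10.0}, {"amount": 14.0}],
            2: [{"amount": 3.0}, {"amount": 5.0}, {"amount": 4.0}]}
flat = flatten(parents, children, ["amount"])
print(flat)
```

Each parent row now summarizes its children as distribution parameters, which is the "flattened" table the comment describes.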
Things of note:
- The paper makes heavy use of Copula transformations to normalize data whenever it passes around the distribution parameters.
- It deals with missing values by adding something like a dummy column.
- The key insight is that columns must be represented by parameterized distributions, but they don't have to be Gaussian. The Kolmogorov-Smirnov test is used to choose the "best fit" CDF to model.
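The best-fit selection can be sketched with `scipy.stats`: fit each candidate family, score it with the KS statistic, and keep the lowest. The candidate families here are my own illustrative choice, not the paper's exact list.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# A skewed, clearly non-Gaussian column.
column = rng.gamma(shape=2.0, scale=3.0, size=2000)

candidates = {
    "norm": stats.norm,
    "expon": stats.expon,
    "gamma": stats.gamma,
}

best_name, best_stat = None, np.inf
for name, dist in candidates.items():
    params = dist.fit(column)                       # MLE fit of the family
    ks_stat, _ = stats.kstest(column, name, args=params)
    if ks_stat < best_stat:                         # lower KS = better fit
        best_name, best_stat = name, ks_stat

print(best_name, round(best_stat, 4))
```

On this column the gamma family should win, since the data was drawn from it; on real columns the winner is whichever family the KS statistic ranks best.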
To your question about the role of the data scientists: they are using the resulting simulations to solve more complex tasks. The goal of the experiment was to see how well the sample data would perform against Kaggle competitions. So I guess the idea was that if winners were indistinguishable, the simple/hierarchical distributions would be considered robust enough for complex tasks. In the end, I'm sure shipping the underlying model is preferable for consumers.
Table modeling: While column distributions are picked using the KS-test, the covariance matrix calculation first normalizes the column distributions. Assuming that is reasonable, there is a claim that "this model contains all the information about the original table in a compact way", but it doesn't account for possible multi-dimensional relationships in the data. It only looks at a series of projections to 2D. Can a d-dimensional dataset (in practice) be effectively summarized by the set of projections onto the d(d-1)/2 two-dimensional subspaces? That's one kind of summary, but I'm unsure whether it is adequate for practical modeling work, especially if folks try to apply high-dimensional techniques (DL?) to this. (edit: I feel reasonably sure it isn't adequate. If a column ends up being bi-modal, for example, even that gets lost in translation in this approach?)
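The bimodality worry in the edit is easy to demonstrate: fit a single Gaussian to a bimodal column and the synthetic data puts most of its mass in the valley between the modes. Toy numbers, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
# Bimodal column: two well-separated modes at 0 and 10.
real = np.concatenate([rng.normal(0, 1, 5000), rng.normal(10, 1, 5000)])

# The "best fit" single Gaussian has mean ~5 and a large std...
mu, sigma = real.mean(), real.std()
synthetic = rng.normal(mu, sigma, 10000)

# ...so the synthetic column places lots of mass in the valley (4, 6),
# where the real column has almost no data.
valley_real = np.mean((real > 4) & (real < 6))
valley_syn = np.mean((synthetic > 4) & (synthetic < 6))
print(valley_real, valley_syn)
```

A family chosen by the KS test would do better than this forced Gaussian, but any single unimodal parametric family has the same qualitative failure on a bimodal column.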
Crowdsourced validations: The synthetic sets were generated for already available public datasets. It isn't clear from the paper how any bias resulting from prior familiarity with the public datasets would be accounted for in the study concluding equivalence.
Privacy claims: This is a bit unclear. The "apply random noise" technique seems to suggest something similar to differential privacy, but makes no mention of it. If not DP, what definition of "privacy" is being used here? (I'm ok that proving their algorithm to be privacy safe according to a chosen definition of privacy may be out of scope of the paper.)
(Edit2: I can't help the feeling I have that this paper is an elaborate April fool's joke released early ;)
As you note, the Kolmogorov-Smirnov test is used to choose the "best fit" CDFs. The set of CDFs is then used to generate a random vector, which after a covariance adjustment becomes a synthetic datapoint.
The step that can ruin the synthetic data is exactly this "best fit" CDF selection, since the original distribution is not necessarily well approximated by any of the well-known distributions.
At the same time, the "best fit" CDFs are responsible for anonymizing the results. So if you overfit and stick to the original data too closely, you lose anonymity and capture the original data's bias. But if you approximate with a parametric distribution, you introduce a distribution bias.
So the solution provides a tradeoff between anonymity and "best fit" corruption of the data.
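The generation step being discussed can be sketched as a Gaussian-copula-style sampler: draw a correlated standard-normal vector, map it to uniforms, then push each coordinate through the inverse of that column's fitted CDF. The distributions and correlation below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Pretend these are the per-column "best fit" CDFs chosen by the KS test.
col_dists = [stats.gamma(a=2.0, scale=3.0), stats.norm(loc=50, scale=10)]

# Correlation of the columns after the normal-score (copula) transform.
corr = np.array([[1.0, 0.6],
                 [0.6, 1.0]])

# 1) correlated standard-normal draws (the "covariance adjustment")
z = rng.multivariate_normal(mean=[0, 0], cov=corr, size=1000)
# 2) normal CDF -> uniforms, preserving the dependence structure
u = stats.norm.cdf(z)
# 3) inverse of each column's fitted CDF -> synthetic rows with the
#    chosen marginals
synthetic = np.column_stack([d.ppf(u[:, i]) for i, d in enumerate(col_dists)])

print(synthetic.shape)
```

The tradeoff in the comment shows up here directly: the marginals are exactly the fitted families (distribution bias), and nothing beyond the pairwise correlation survives.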
I thought the Zillow blogpost [1] was a nice intro (and I'm a sucker for Seinfeld references), and it demonstrates the sensitivity-to-threshold value in a way the original academic authors never did.
[1]: https://www.zillow.com/data-science/double-dip-holdout-set/
We had two requirements for the synthetic data. From the paper: “This synthetic data must meet two requirements:
1. it must somewhat resemble the original data statistically, to ensure realism and keep problems engaging for data scientists.
2. it must also formally and structurally resemble the original data, so that any software written on top of it can be reused.”
Our goal was as follows:
* Provide synthetic data to users - data scientists similar to the ones who engage on Kaggle.
* Have them do feature engineering and provide us the software that created those features. Feature engineering is a process of ideation and requires human intuition. So being able to have many people work on it simultaneously was important to us. But it is impossible to give real data to everyone.
* They submit this software and we execute it on the real data, train a model and produce predictions for test data.
* In essence, their work is being evaluated on the real data - by the data holder - us.
The tests we performed:
* We gave 3 groups different versions of synthetic data (and in some cases added noise to it).
* We gave a 4th group the real data.
* We did not tell the users that they were not working on real data.
* All groups wrote feature engineering software looking at the data they got.
* We took their software, executed it on the real data, and evaluated their accuracy in terms of the predictive goal.
* We did this for 5 datasets.
* Our goal was to see whether the team that had access to the real data came up with better features. With 5 datasets and 3 comparisons per dataset, we had 15 tests.
Results:
* In 7 of those we found no significant difference.
* In 4 we found the features written by users looking at synthetic dataset were, in fact, better performing than the features generated by users looking at real dataset.
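Each of the 15 comparisons boils down to asking whether two samples of accuracy scores differ significantly. The paper's exact test isn't specified in this thread, so the choice of a Welch two-sample t-test below, and the numbers, are purely illustrative.

```python
import numpy as np
from scipy import stats

# Accuracy per submission for one dataset: control group (real data)
# vs. one synthetic-data group. Illustrative numbers only.
control = np.array([0.81, 0.79, 0.83, 0.80, 0.82])
synthetic = np.array([0.80, 0.78, 0.84, 0.81, 0.79])

t_stat, p_value = stats.ttest_ind(control, synthetic, equal_var=False)
significant = p_value < 0.05
print(round(p_value, 3), significant)
```

A non-significant result like this one is what "no significant difference" means for a single comparison; the study ran 15 such comparisons.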
What can we conclude:
* Our goal was to enable crowdsourcing of feature engineering by giving the crowd synthetic data, gather the software they write on top of the synthetic data (not their conclusions) and assemble a machine learning model.
* We found that this is feasible.
* While the synthetic data captures as many correlations as possible, in general the requirement is only that it be good enough that a user working on it does not get confused: they can roughly understand the relationships in the data, intuit features, write software, and debug. That is, they may conclude, inaccurately, based on the dataset they are looking at, that one feature is better for predictions than another, and that is ok. Since we are able to get many contributions simultaneously, the features one user misses can be generated by others.
* We think this methodology will work only for crowdsourcing feature engineering - a key bottleneck in the development of predictive models.
Although, I suppose that if the data was already anonymized to the best of your ability, and then this was run on top of that as an additional layer of protection, that might be okay.
https://en.wikipedia.org/wiki/Gibbs_sampling
Generating tuples (rows) by Gibbs sampling allows generation of samples from the joint distribution, which in turn preserves all correlations, conditional probabilities, etc. This can be done by starting at an original tuple chosen at random and then repeatedly mutating the tuple by overwriting one of its fields (columns). To overwrite, one selects another random tuple that 'matches' the current one at all positions other than the column selected for overwriting. The match might need to be relaxed from an exact match to a 'close' match.
If the conditional distribution for some conditioning event has very low entropy, one would need to fuzz the original data to preserve privacy, but this comes at the expense of distorting the correlations and conditionals.
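The tuple-mutation procedure described above can be sketched as follows. With exact matching (as here) the chain can only ever visit tuples already in the table, which is exactly why the match must be relaxed to a 'close' match, and why fuzzing is needed, before the output is useful as synthetic data. This sketch is my own illustration, not from any paper.

```python
import random

def gibbs_synthesize(table, n_steps, seed=0):
    """Gibbs-style walk over tuples: repeatedly resample one column from
    the empirical conditional given all the other columns."""
    rng = random.Random(seed)
    ncols = len(table[0])
    current = list(rng.choice(table))      # start at a random original tuple
    for _ in range(n_steps):
        col = rng.randrange(ncols)
        # Tuples agreeing with `current` everywhere except `col`
        # (exact match; a real implementation would relax this).
        donors = [t for t in table
                  if all(t[j] == current[j] for j in range(ncols) if j != col)]
        current[col] = rng.choice(donors)[col]
    return tuple(current)

table = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("A", 1)]
sample = gibbs_synthesize(table, n_steps=50)
print(sample)
```

Note that the donor set always contains at least the current tuple itself, so the chain never gets stuck; it just cannot leave the observed support without relaxed matching.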
Are you having any trouble accessing this link?
It's hard to compare to this paper, because this paper's privacy claims appear to be heuristic, not formal. This isn't necessarily bad, since existing approaches for constructing synthetic data in a differentially private way are still not very practical. But heuristics necessarily lack provable privacy guarantees, so there's no proof that something very bad privacy-wise can't happen with sufficiently clever processing of the synthetic data.
The claim is too bold, and I would reject this paper. They should clarify that the data is good enough for linear regression, not claim there is no difference between real and synthetic data.
- https://dspace.mit.edu/handle/1721.1/109616#files-area
- https://pdfs.semanticscholar.org/64ad/643e8084486ca7d3312ed4...
I wonder if this is the technique behind Numerai.