If you dig through the original paper, the key conclusion is this line:
“For 7 out of 15 comparisons, we found no significant difference between the accuracy of features developed on the control dataset vs. those developed on some version of the synthesized data; that is, the result of the test was False.”
So, on the tests they developed, the proposed method doesn't work 8 times out of 15…
"for 7 out of 15 comparisons, we found no significant difference" could mean all sorts of things. It could mean that 7 comparisons were perfect and 8 were complete garbage, as you suggest. Or it could mean that 7 comparisons were perfect and 8 had differences that were statistically significant, but the magnitudes of the differences were small enough that the results would still have been perfectly adequate for practical application.
In concrete terms: Let's say the synthetic data lets me build a binary classifier that helps with a business problem and has an F1 score of about 0.8, but with access to the real data I could have gotten an F1 of around 0.85. In that case, I'd happily take the synthetic data. As someone who's trying to solve business problems, it would be downright irresponsible of me to reject something that's better than what I currently have on the grounds that it's still less than some unattainable ideal.
I mean I have formed an association specifically with the MIT brand now, so this type of work coming out of there doesn't surprise me. I couldn't tell you exactly what has led to this association though.
When we examined the confidence intervals for the remaining 8 tests, we found that for half of them, the mean accuracy for features written over synthesized data was higher than for those written on the control dataset.
In other words, for 4 out of the remaining 8 cases, the models on synthetic data performed better.
But I want to comment that it's worked for us. Sequence-to-sequence learning can reproduce every kind of iid and non-iid data we've ever looked at.
The real question is how safe/anonymous is it really?
If it gets down to correctly modeling the probability of colon cancer diagnosis by age, sex and ZIP code, and also the correct distribution of ages by ZIP code, then that'll be a potential problem in counties that only have one male 87-year-old.
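That re-identification worry can be checked mechanically by counting how many records are unique on the quasi-identifiers, a standard k-anonymity-style check. This is not from the paper; the toy records below are purely illustrative.

```python
from collections import Counter

# Count records that are unique on the quasi-identifiers (age, sex, ZIP).
# If a generator faithfully models P(diagnosis | age, sex, ZIP), any value
# it reports for a unique combination is effectively that one person's.
records = [
    (34, "F", "02139"),
    (34, "F", "02139"),
    (87, "M", "59001"),  # the only 87-year-old male in this ZIP
    (52, "M", "02139"),
    (52, "M", "02139"),
]

counts = Counter(records)
unique = [r for r, n in counts.items() if n == 1]
print(unique)
```

A record that appears once is its own equivalence class (k = 1), which is exactly the "one male 87-year-old in the county" case.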
- It looks like the work of the data scientists will be limited to the extent of the modeling already done by recursive conditional parameter aggregation. (edit: So why not just ship that model and adapt it instead of using it to generate data?)
- Its "validation" appears to be doubly proxied - i.e. the normal performance measures we use are themselves a proxy, and now we're comparing those against these performance measures derived from models built out of the data generated by these models. I'm not inclined to trust a validation that is so removed.
Can anyone explain this well?
Peeling back the mystery a bit, what is happening is:
1. From each child table upwards, model each column as a simple parametric distribution (e.g. Gaussian) plus a covariance matrix.
2. Given those child-table distribution parameters, pass them back as row values to their respective parent tables.
What you end up with is a "flattened" version of each parent table that carries the information (in an "information theoretic" sense) of all child relations. Sampling from distributions is straightforward. The stats methods are outlined in section 3 of the paper.
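The two steps above can be sketched in miniature. This is a minimal illustration of the flattening idea only, assuming Gaussian columns and a toy one-to-many schema; it is not the paper's implementation (which also applies copula transforms and models covariances).

```python
import math

def gaussian_params(values):
    """Fit a Gaussian to a list of numbers: return (mean, std)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(var)

def flatten(parent_rows, children_by_parent, child_columns):
    """Step 1: fit a distribution per child column, per parent.
    Step 2: pass the fitted parameters back up as extra parent columns."""
    flat = []
    for pid, row in parent_rows.items():
        extended = dict(row)
        child_rows = children_by_parent.get(pid, [])
        for col in child_columns:
            mu, sigma = gaussian_params([c[col] for c in child_rows])
            extended[col + "_mu"] = mu
            extended[col + "_sigma"] = sigma
        flat.append(extended)
    return flat

# Toy schema: each parent row owns several child rows.
parents = {1: {"region": "A"}, 2: {"region": "B"}}
children = {1: [{"amount": 10.0}, {"amount": 14.0}],
            2: [{"amount": 3.0}, {"amount": 5.0}, {"amount": 4.0}]}
flat = flatten(parents, children, ["amount"])
print(flat)
```

Each parent row now summarizes its children as distribution parameters, which is the "flattened" table the comment describes.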
Things of note:
- The paper makes heavy use of Copula transformations to normalize data whenever it passes around the distribution parameters.
- It deals with missing values by adding something like a dummy column.
- The key insight is that columns must be represented by parameterized distributions, but they don't have to be Gaussian. The Kolmogorov-Smirnov test is used to choose the "best fit" CDF to model.
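The best-fit selection can be sketched with `scipy.stats`: fit each candidate family, score it with the KS statistic, and keep the lowest. The candidate families here are my own illustrative choice, not the paper's exact list.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# A skewed, clearly non-Gaussian column.
column = rng.gamma(shape=2.0, scale=3.0, size=2000)

candidates = {
    "norm": stats.norm,
    "expon": stats.expon,
    "gamma": stats.gamma,
}

best_name, best_stat = None, np.inf
for name, dist in candidates.items():
    params = dist.fit(column)                       # MLE fit of the family
    ks_stat, _ = stats.kstest(column, name, args=params)
    if ks_stat < best_stat:                         # lower KS = better fit
        best_name, best_stat = name, ks_stat

print(best_name, round(best_stat, 4))
```

On this column the gamma family should win, since the data was drawn from it; on real columns the winner is whichever family the KS statistic ranks best.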
To your question about the role of the data scientists: they are using the resulting simulations to solve more complex tasks. The goal of the experiment was to see how well the sample data would perform against Kaggle competitions. So I guess the idea was that if winners were indistinguishable, the simple/hierarchical distributions would be considered robust enough for complex tasks. In the end, I'm sure shipping the underlying model is preferable for consumers.
Table modeling: While column distributions are picked using the KS-test, the covariance matrix calculation first normalizes the column distributions. Assuming that is reasonable, there is a claim that "this model contains all the information about the original table in a compact way", but it doesn't account for possible multi-dimensional relationships in the data. It only looks at a series of projections to 2D. Can a d-dimensional dataset (in practice) be effectively summarized by the set of projections onto the d(d-1)/2 two-dimensional subspaces? That's one kind of summary, but I'm unsure whether it is adequate for practical modeling work, especially if folks try to apply high-dimensional techniques (DL?) to this. (edit: I feel reasonably sure it isn't adequate. If a column ends up being bi-modal, for example, even that gets lost in translation in this approach?)
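The bimodality worry in the edit is easy to demonstrate: fit a single Gaussian to a bimodal column and the synthetic data puts most of its mass in the valley between the modes. Toy numbers, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
# Bimodal column: two well-separated modes at 0 and 10.
real = np.concatenate([rng.normal(0, 1, 5000), rng.normal(10, 1, 5000)])

# The "best fit" single Gaussian has mean ~5 and a large std...
mu, sigma = real.mean(), real.std()
synthetic = rng.normal(mu, sigma, 10000)

# ...so the synthetic column places lots of mass in the valley (4, 6),
# where the real column has almost no data.
valley_real = np.mean((real > 4) & (real < 6))
valley_syn = np.mean((synthetic > 4) & (synthetic < 6))
print(valley_real, valley_syn)
```

A family chosen by the KS test would do better than this forced Gaussian, but any single unimodal parametric family has the same qualitative failure on a bimodal column.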
Crowdsourced validations: The synthetic sets were generated for already available public datasets. It isn't clear from the paper how any bias resulting from prior familiarity with the public datasets would be accounted for in the study concluding equivalence.
Privacy claims: This is a bit unclear. The "apply random noise" technique seems to suggest something similar to differential privacy, but makes no mention of it. If not DP, what definition of "privacy" is being used here? (I'm ok that proving their algorithm to be privacy safe according to a chosen definition of privacy may be out of scope of the paper.)
(Edit2: I can't help the feeling I have that this paper is an elaborate April fool's joke released early ;)
As you note, the Kolmogorov-Smirnov test is used to choose the "best fit" CDFs. The set of CDFs is then used to generate a random vector, which after a covariance adjustment becomes a synthetic datapoint.
The step that can ruin the synthetic data is exactly this "best fit" CDF selection, since the original distribution is not necessarily well approximated by any of the well-known distributions.
At the same time, the "best fit" CDFs are responsible for anonymizing the results. So if you overfit and stick to the original data too closely, you lose anonymity and capture the original data's bias. But if you approximate with a parametric distribution, you introduce a distribution bias.
So the solution provides a tradeoff between anonymity and "best fit" corruption of the data.
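The generation step being discussed can be sketched as a Gaussian-copula-style sampler: draw a correlated standard-normal vector, map it to uniforms, then push each coordinate through the inverse of that column's fitted CDF. The distributions and correlation below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Pretend these are the per-column "best fit" CDFs chosen by the KS test.
col_dists = [stats.gamma(a=2.0, scale=3.0), stats.norm(loc=50, scale=10)]

# Correlation of the columns after the normal-score (copula) transform.
corr = np.array([[1.0, 0.6],
                 [0.6, 1.0]])

# 1) correlated standard-normal draws (the "covariance adjustment")
z = rng.multivariate_normal(mean=[0, 0], cov=corr, size=1000)
# 2) normal CDF -> uniforms, preserving the dependence structure
u = stats.norm.cdf(z)
# 3) inverse of each column's fitted CDF -> synthetic rows with the
#    chosen marginals
synthetic = np.column_stack([d.ppf(u[:, i]) for i, d in enumerate(col_dists)])

print(synthetic.shape)
```

The tradeoff in the comment shows up here directly: the marginals are exactly the fitted families (distribution bias), and nothing beyond the pairwise correlation survives.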
I thought the Zillow blogpost [1] was a nice intro (and I'm a sucker for Seinfeld references), and it demonstrates the sensitivity-to-threshold value in a way the original academic authors never did.
[1]: https://www.zillow.com/data-science/double-dip-holdout-set/
We had two requirements for the synthetic data. From the paper: “This synthetic data must meet two requirements:
1. it must somewhat resemble the original data statistically, to ensure realism and keep problems engaging for data scientists.
2. it must also formally and structurally resemble the original data, so that any software written on top of it can be reused.”
Our goal was as follows:
* Provide synthetic data to users - data scientists similar to the ones who engage on Kaggle.
* Have them do feature engineering and provide us the software that created those features. Feature engineering is a process of ideation and requires human intuition. So being able to have many people work on it simultaneously was important to us. But it is impossible to give real data to everyone.
* They submit this software and we execute it on the real data, train a model and produce predictions for test data.
* In essence, their work is being evaluated on the real data - by the data holder - us.
The tests we performed:
* We gave 3 groups different versions of synthetic data (and in some cases added noise to it).
* We gave a 4th group the real data.
* We did not tell the users that they were not working on real data.
* All groups wrote feature engineering software looking at the data they got.
* We took their software, executed it on the real data, and evaluated their accuracy in terms of the predictive goal.
* We did this for 5 datasets.
* Our goal was to see whether the team that had access to the real data came up with better features. With 5 datasets and 3 comparisons per dataset, we had 15 tests.
Results:
* In 7 of those we found no significant difference.
* In 4 we found the features written by users looking at synthetic dataset were, in fact, better performing than the features generated by users looking at real dataset.
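Each of the 15 comparisons boils down to asking whether two samples of accuracy scores differ significantly. The paper's exact test isn't specified in this thread, so the choice of a Welch two-sample t-test below, and the numbers, are purely illustrative.

```python
import numpy as np
from scipy import stats

# Accuracy per submission for one dataset: control group (real data)
# vs. one synthetic-data group. Illustrative numbers only.
control = np.array([0.81, 0.79, 0.83, 0.80, 0.82])
synthetic = np.array([0.80, 0.78, 0.84, 0.81, 0.79])

t_stat, p_value = stats.ttest_ind(control, synthetic, equal_var=False)
significant = p_value < 0.05
print(round(p_value, 3), significant)
```

A non-significant result like this one is what "no significant difference" means for a single comparison; the study ran 15 such comparisons.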
What can we conclude:
* Our goal was to enable crowdsourcing of feature engineering by giving the crowd synthetic data, gather the software they write on top of the synthetic data (not their conclusions) and assemble a machine learning model.
* We found that this is feasible.
* While the synthetic data captures as many correlations as possible, in general the requirement is only that it be good enough that a user working on it does not get confused: they can roughly understand the relationships in the data, intuit features, write software, and debug. That is, they may conclude, inaccurately, based on the dataset they are looking at, that one feature is better for predictions than another, and that is ok. Since we are able to get many contributions simultaneously, the features one user misses can be generated by others.
* We think this methodology will work only for crowdsourcing feature engineering - a key bottleneck in the development of predictive models.
Although, I suppose that if the data was already anonymized to the best of your ability, and then this was run on top of that as an additional layer of protection, that might be okay.
https://en.wikipedia.org/wiki/Gibbs_sampling
Generating tuples (rows) by Gibbs sampling allows generation of samples from the joint distribution, which in turn preserves all correlations, conditional probabilities, etc. This can be done by starting at an original tuple chosen at random and then repeatedly mutating the tuple by overwriting one of its fields (columns). To overwrite, one selects another random tuple that 'matches' the current one at all positions other than the column selected for overwriting. The match might need to be relaxed from an exact match to a 'close' match.
If the conditional distribution for some conditioning event has very low entropy, one would need to fuzz the original data to preserve privacy, but this comes at the expense of distorting the correlations and conditionals.
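The tuple-mutation procedure described above can be sketched as follows. With exact matching (as here) the chain can only ever visit tuples already in the table, which is exactly why the match must be relaxed to a 'close' match, and why fuzzing is needed, before the output is useful as synthetic data. This sketch is my own illustration, not from any paper.

```python
import random

def gibbs_synthesize(table, n_steps, seed=0):
    """Gibbs-style walk over tuples: repeatedly resample one column from
    the empirical conditional given all the other columns."""
    rng = random.Random(seed)
    ncols = len(table[0])
    current = list(rng.choice(table))      # start at a random original tuple
    for _ in range(n_steps):
        col = rng.randrange(ncols)
        # Tuples agreeing with `current` everywhere except `col`
        # (exact match; a real implementation would relax this).
        donors = [t for t in table
                  if all(t[j] == current[j] for j in range(ncols) if j != col)]
        current[col] = rng.choice(donors)[col]
    return tuple(current)

table = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("A", 1)]
sample = gibbs_synthesize(table, n_steps=50)
print(sample)
```

Note that the donor set always contains at least the current tuple itself, so the chain never gets stuck; it just cannot leave the observed support without relaxed matching.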
Are you having any trouble accessing this link?
It's hard to compare to this paper, because this paper's privacy claims appear to be heuristic, not formal. This isn't necessarily bad, since existing approaches for constructing synthetic data in a differentially private way are still not very practical. But heuristics necessarily lack provable privacy guarantees, so there's no proof that something very bad privacy-wise can't happen with sufficiently clever processing of the synthetic data.
The claim is too bold, and I would reject this paper. They should clarify that the data is good enough for linear regression, not claim there is no difference between real and synthetic data.
- https://dspace.mit.edu/handle/1721.1/109616#files-area
- https://pdfs.semanticscholar.org/64ad/643e8084486ca7d3312ed4...
I wonder if this is the technique behind Numerai.