We had two requirements for the synthetic data. From the paper, “This synthetic data must meet two requirements:
1. it must somewhat resemble the original data statistically, to ensure realism and keep problems engaging for data scientists.
2. it must also formally and structurally resemble the original data, so that any software written on top of it can be reused.”
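To make these two requirements concrete, here is a minimal sketch of what checking them might look like. This is a hypothetical mini-check, not the paper's actual method: structural resemblance is checked as "same columns with the same value types," and statistical resemblance is crudely approximated as "per-column means within a tolerance." All table contents below are made up.

```python
import statistics

# Made-up "real" and "synthetic" tables, represented as lists of row dicts.
real = [{"age": 34, "income": 72000}, {"age": 51, "income": 64000},
        {"age": 29, "income": 58000}]
synthetic = [{"age": 33, "income": 70500}, {"age": 48, "income": 66000},
             {"age": 31, "income": 60000}]

def same_schema(a, b):
    """Structural resemblance: identical column names and value types."""
    cols_a = {k: type(v) for k, v in a[0].items()}
    cols_b = {k: type(v) for k, v in b[0].items()}
    return cols_a == cols_b

def means_close(a, b, rel_tol=0.10):
    """Statistical resemblance (crude proxy): per-column means within 10%."""
    for col in a[0]:
        ma = statistics.mean(row[col] for row in a)
        mb = statistics.mean(row[col] for row in b)
        if abs(ma - mb) > rel_tol * abs(ma):
            return False
    return True

print(same_schema(real, synthetic), means_close(real, synthetic))
```

A real resemblance check would compare full distributions and correlations, not just means, but the two-part shape of the check mirrors the two requirements above.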
Our goal was as follows:
* Provide synthetic data to users - data scientists similar to those who engage on Kaggle.
* Have them do feature engineering and send us the software that created those features. Feature engineering is a process of ideation that requires human intuition, so it was important to us that many people could work on it simultaneously - but we could not give the real data to everyone.
* They submit this software; we execute it on the real data, train a model, and produce predictions for the test data.
* In essence, their work is evaluated on the real data - by us, the data holder.
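The workflow above can be sketched as a simple contract between contributor and data holder. The `featurize` function and the row layout are hypothetical, not the paper's API: the contributor writes feature logic against synthetic rows, and the data holder later runs that same function on the real rows it never shared.

```python
# Contributor's side: feature logic developed while looking at synthetic data.
def featurize(row):
    return {
        "income_per_year_of_age": row["income"] / row["age"],
        "is_senior": row["age"] >= 50,
    }

# Data holder's side: the private real data (made up here) never leaves the
# holder; only the submitted code touches it.
real_rows = [{"age": 34, "income": 72000}, {"age": 51, "income": 64000}]

# Execute the submitted software on the real data, producing a feature
# matrix that the holder can use to train a model.
features = [featurize(r) for r in real_rows]
print(features[1]["is_senior"])  # feature computed on the 51-year-old row
```

The key property is that only code crosses the boundary between contributor and data holder - never the data itself, and never the contributor's conclusions about it.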
The tests we performed:
* We gave 3 groups different versions of synthetic data (in some cases with noise added to it).
* We gave a 4th group the real data.
* We did not tell the users that they were not working on real data.
* All groups wrote feature engineering software looking at the data they got.
* We took their software, executed it on the real data, and evaluated their accuracy in terms of the predictive goal.
* We did this for 5 datasets.
* Our goal was to see whether the team with access to the real data came up with better features. With 5 datasets and 3 comparisons per dataset, we had 15 tests.
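One way to decide whether a synthetic-data group differs significantly from the real-data group is a bootstrap test on the difference in mean accuracy. This is an illustrative sketch, not necessarily the paper's exact test, and the accuracy numbers below are made up.

```python
import random
import statistics

# Made-up per-submission accuracies for one dataset: one group that saw
# synthetic data vs. the group that saw real data.
synthetic_group = [0.81, 0.79, 0.84, 0.80, 0.83]
real_group = [0.82, 0.80, 0.78, 0.83, 0.81]

def bootstrap_diff_ci(a, b, n_boot=5000, seed=0):
    """95% bootstrap confidence interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = [rng.choice(a) for _ in a]
        resample_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

lo, hi = bootstrap_diff_ci(synthetic_group, real_group)
# If the interval contains 0, the comparison counts as
# "no significant difference" between the two groups.
print(lo <= 0 <= hi)
```

Running one such comparison per synthetic-data group per dataset yields the 3 × 5 = 15 tests described above.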
Results:
* In 7 of those tests we found no significant difference.
* In 4, the features written by users looking at the synthetic data in fact performed better than the features written by users looking at the real data.
What can we conclude?
* Our goal was to enable crowdsourcing of feature engineering: give the crowd synthetic data, gather the software they write on top of it (not their conclusions), and assemble a machine learning model.
* We found that this is feasible.
* While the synthetic data captures as many correlations as possible, the real requirement is weaker: it must be good enough that a user working on it does not get confused - they can roughly understand the relationships in the data, intuit features, write software, and debug it. It is fine if their judgment of which feature predicts better is inaccurate for the real data, because we gather many contributions simultaneously, and the features one user misses can be generated by others.
* We think this methodology will work only for crowdsourcing feature engineering - a key bottleneck in the development of predictive models.
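The "features one user misses can be generated by others" point has a simple mechanical counterpart on the data holder's side: union the feature columns produced by every submitted featurizer. A toy sketch, with hypothetical featurizer names and made-up feature logic:

```python
# Two contributors, each submitting a featurizer with different ideas.
def featurizer_a(row):
    # Crude income-magnitude feature: number of digits in the income.
    return {"income_digits": len(str(row["income"]))}

def featurizer_b(row):
    return {"is_senior": row["age"] >= 50}

def merge_features(row, featurizers):
    """Union the feature dicts produced by all submitted featurizers."""
    out = {}
    for f in featurizers:
        out.update(f(row))
    return out

row = {"age": 51, "income": 64000}
print(merge_features(row, [featurizer_a, featurizer_b]))
```

The merged row carries both features even though neither contributor wrote both, which is why individual blind spots on the synthetic data matter less than they would for a single analyst.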