This is addressed [in the full complaint][1]. The data scientist was told that
Frank really did have 4 million users, and that the scientist only needed to generate this "synthetic data" as a way to "anonymize" the "real" data. In other words, the scientist was duped:
> JAVICE told Scientist-1 [...] that she had a database of approximately 4 million
> people and wanted to create a database of anonymized data that mirrored the
> statistical properties of the original database (the “Synthetic Data Set”).
>
> [After JAVICE sends Scientist-1 the data], Scientist-1 understood that the data
> available via the Access Link Email -
> **a data set of approximately 142,000 people** (emphasis added) -
> was a random sample of a larger database which contained data for approximately
> 4 million people. In fact, that data represented every Frank user who had at
> least started a FAFSA.

[1]: https://www.justice.gov/usao-sdny/press-release/file/1577861...