Lots of data typically means streams of data, which means processes running 24/7 moving data and files around. Streams, connectivity, and processes can all cut out periodically, so you need logic to detect the outage, reconcile and backfill the gap, and restart the processes. You will also need some data QA, since it is perfectly normal to get 'extra' data, either as duplicates or as metadata bleeding into the content.
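A minimal sketch of the gap-detection and dedup side of that, assuming a simple in-memory feed of (timestamp, payload) records; the record shape, the 2-minute gap threshold, and the function names are all illustrative, not any particular pipeline's API:

```python
import hashlib
from datetime import datetime, timedelta

# Hypothetical records: (timestamp, payload) tuples from a 24/7 feed.
records = [
    (datetime(2024, 1, 1, 0, 0), "event A"),
    (datetime(2024, 1, 1, 0, 1), "event B"),
    (datetime(2024, 1, 1, 0, 1), "event B"),   # duplicate after a reconnect
    (datetime(2024, 1, 1, 0, 7), "event C"),   # 6-minute hole in the feed
]

MAX_GAP = timedelta(minutes=2)  # assumed tolerance for this feed

def dedupe(records):
    """Drop exact duplicates by hashing timestamp + payload."""
    seen, out = set(), []
    for ts, payload in records:
        key = hashlib.sha256(f"{ts.isoformat()}|{payload}".encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append((ts, payload))
    return out

def find_gaps(records, max_gap=MAX_GAP):
    """Return (start, end) windows where the stream went quiet."""
    gaps = []
    for (t0, _), (t1, _) in zip(records, records[1:]):
        if t1 - t0 > max_gap:
            gaps.append((t0, t1))
    return gaps

clean = dedupe(records)
gaps = find_gaps(clean)
# Each gap window is what you would hand to a backfill job to re-request.
```

In practice the dedup key set would live in a database or a bounded cache rather than process memory, but the shape of the problem is the same.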
If your data is from disparate sources, then you may need to normalize timestamps across records from different sources; you may also be dealing with different languages, identical tokens that mean different things depending on the source, different formatting of numeric fields, etc.
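The timestamp part alone can be fiddly: one source sends ISO 8601 with an offset, another sends local time in a regional format with no zone at all. A hedged sketch, where the feed names, formats, and the assumed time zone for the naive feed are all made up for illustration:

```python
from datetime import datetime, timezone, timedelta

# Hypothetical per-source formats; real feeds would need their own entries.
SOURCE_FORMATS = {
    "feed_a": "%Y-%m-%dT%H:%M:%S%z",   # ISO 8601 with UTC offset
    "feed_b": "%d/%m/%Y %H:%M",        # naive local time, zone assumed below
}
FEED_B_TZ = timezone(timedelta(hours=2))  # assumption: feed_b reports UTC+2

def normalize_ts(raw, source):
    """Parse a source-specific timestamp and convert it to aware UTC."""
    dt = datetime.strptime(raw, SOURCE_FORMATS[source])
    if dt.tzinfo is None:              # naive timestamps get the feed's zone
        dt = dt.replace(tzinfo=FEED_B_TZ)
    return dt.astimezone(timezone.utc)

a = normalize_ts("2024-06-01T12:00:00+0200", "feed_a")
b = normalize_ts("01/06/2024 14:00", "feed_b")
# Both records now compare and sort correctly on a single UTC axis.
```

The same normalize-at-ingest pattern applies to numeric formats and token disambiguation: map each source's quirks to one canonical representation before anything downstream sees the data.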
This is an incomplete list; the GP probably has a more exhaustive list of problem types...