
0 points · dghlsakjg · 7d ago · 0 comments

Your understanding of the datasets I helped create seems at odds with my experience actually creating the datasets. Do you have some insider experience or knowledge with dataset curation and creation for voice assistants that contradicts my own?

The guideline is that the newer your model, the more likely it is to have diverse voice recognition datasets, since that solves the earlier problems caused by non-representative data. The trend is moving towards better recognition for outliers. The models are fed training data that is very specific, and not at all just whatever recordings they have collected in an S3 bucket. Given the amount of post-recording work, diarization, and QA we had to do on every single recording, I can’t imagine wanting to YOLO in bulk data.


vineyardmike · 6d ago

You're missing the point. No one cares about the datasets you've created in a commercial context.

The effort being discussed is a volunteer effort among a community of tech enthusiasts, who are disproportionately privacy-oriented vs the average person. This will undoubtedly skew towards middle-aged male audiences, and will be extra-selective against children. It's a best-effort collection: they're probably not turning anyone away, and it's only what they can get. They're (AFAIK) not paying anyone to collect underrepresented demographics.

dghlsakjg · OP · 6d ago

Ah.

I thought you were talking about voice assistants in general. My mistake.
