That paper does roughly what my comment above proposed: start with as large a dataset as they can get, then filter it down to a much smaller dataset focused on a specific task, one that is still larger than all of English Wikipedia.