Most people learn by doing rather than reading (myself included), so just pick a project and start building.
This is the journey I took:
- Set up a Hadoop cluster from scratch (start with 4 nodes on VirtualBox - a quick liveness check is sketched just after this list)
- Write software to crawl and store data on every single torrent. (I don't know why I picked torrents; it was just interesting at the time.) Pick a single topic, then scale it as far as you can - a bare-bones crawl loop is also sketched below.
(Can I store 100,000 torrent files? Can I crawl 200 websites every 5 minutes? Can I index every single file inside each torrent - whoops, I have 500,000,000 rows now; can I distribute that across a cluster? Can I upgrade the cluster without downtime? Can I swap Hadoop and HBase out for Cassandra? Can I do that with no downtime? Why aren't all these CPUs being utilised? How can I use Redis as a distributed cache? Now the whole system is running - can I scale it 2x, 5x, 10x? What happens if I randomly kill a node?)
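For the liveness check mentioned above: once the VirtualBox nodes are up, a tiny script can confirm the cluster is actually healthy. A minimal sketch, assuming a hypothetical NameNode host called `hadoop-master`, querying its built-in JMX endpoint (port 9870 on Hadoop 3.x, 50070 on 2.x):

```python
import json
import urllib.request

# "hadoop-master" is a hypothetical hostname; point this at your
# NameNode's web UI (port 9870 on Hadoop 3.x, 50070 on 2.x).
NAMENODE = "http://hadoop-master:9870"

def live_datanodes() -> int:
    """Count live DataNodes via the NameNode's JMX endpoint."""
    url = NAMENODE + "/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"
    with urllib.request.urlopen(url) as resp:
        beans = json.load(resp)["beans"]
    return beans[0]["NumLiveDataNodes"]

if __name__ == "__main__":
    # On a 4-node VirtualBox cluster (1 master + 3 workers), expect 3.
    print("live datanodes:", live_datanodes())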
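And the crawler can start equally small. A sketch assuming a hypothetical listing page at example.org with direct .torrent links; note it keys files by the SHA-1 of the whole file as a simplification, whereas the real infohash is the SHA-1 of just the bencoded info dict:

```python
import hashlib
import pathlib
import re
from urllib.parse import urljoin

import requests

SEED_PAGE = "https://example.org/latest-torrents"  # hypothetical listing page
STORE = pathlib.Path("torrents")
STORE.mkdir(exist_ok=True)

def crawl_once() -> None:
    """Fetch one listing page and store every .torrent it links to."""
    html = requests.get(SEED_PAGE, timeout=10).text
    for href in re.findall(r'href="([^"]+\.torrent)"', html):
        blob = requests.get(urljoin(SEED_PAGE, href), timeout=10).content
        # Simplification: key by SHA-1 of the whole file. The real
        # infohash is the SHA-1 of the bencoded info dict only.
        key = hashlib.sha1(blob).hexdigest()
        (STORE / f"{key}.torrent").write_bytes(blob)

if __name__ == "__main__":
    crawl_once()  # run every 5 minutes, then scale out the fetching
```

Start with local files, and swapping the `write_bytes` call for an HBase or Cassandra write is exactly the kind of upgrade the questions above are about.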
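On the Redis-as-a-distributed-cache question, the usual starting point is the cache-aside pattern: check Redis first, fall back to the slow store, then cache the result with a TTL. A sketch with redis-py, where the host name and the `slow_store_lookup` stub are placeholders for your own setup:

```python
import json

import redis

r = redis.Redis(host="redis-1", port=6379)  # hypothetical cache node

def slow_store_lookup(infohash: str) -> list[str]:
    # Stand-in for the real HBase/Cassandra query.
    return [f"{infohash}/README.txt"]

def files_for_torrent(infohash: str) -> list[str]:
    """Cache-aside: try Redis, fall back to the slow store, cache with a TTL."""
    cached = r.get(f"files:{infohash}")
    if cached is not None:
        return json.loads(cached)           # cache hit
    files = slow_store_lookup(infohash)     # cache miss: hit the real store
    r.setex(f"files:{infohash}", 300, json.dumps(files))  # expire in 5 min
    return files
```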
Just pick a single project - Astronomy Data, Weather Data, Planes in the air, open IoT sensors, IRC chat, Free Satellite Data, Twitter streams - pick a data source that interests you, and then your exercise is to scale it as far as you can. This is an exercise in engineering, not data science, not pure research; the goal is scale.
As you build this you'll do research and discover which technologies scale better for reads versus writes, which consistency guarantees they offer, and what querying abilities each gives you.
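As a taste of those consistency trade-offs: Cassandra lets you choose, per query, how many replicas must answer before it returns. A sketch using the DataStax cassandra-driver, with hypothetical node names and a hypothetical `torrents` keyspace:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical node names and keyspace.
cluster = Cluster(["cass-1", "cass-2", "cass-3"])
session = cluster.connect("torrents")

# ONE: any single replica may answer - fast, but possibly stale.
fast_read = SimpleStatement(
    "SELECT path FROM files WHERE infohash = %s",
    consistency_level=ConsistencyLevel.ONE,
)

# QUORUM: a majority of replicas must agree - slower, but consistent
# as long as the writes also used QUORUM.
safe_read = SimpleStatement(
    "SELECT path FROM files WHERE infohash = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

rows = session.execute(safe_read, ("deadbeef",))
```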
Sure, you could read all of this up front, but unless you apply it, much of it won't stick.