The term, at least to me, does not necessarily refer to storing large amounts of data. At Bloomberg we process a large amount of data in realtime. When you focus on processing realtime market data, the hardware topology looks different than when you are, say, running a map-reduce across a huge number of nodes. You often have a single point of entry for a particular stream (with redundancy, of course), but the trick is then to distribute the data efficiently to the many places that require it, with as near zero latency as possible. Obviously there is also a need for longer-term storage and instantaneous retrieval of billions of data points, but the biggest focus is near-zero latency from point of entry to point of consumption: distributing to all internal nodes/apps and to many thousands of sites in 180+ countries, with a large number of "nines" of availability.
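To make the fan-out pattern concrete, here is a toy sketch: a single point of entry per stream, pushing each tick to every registered consumer. All the names (`FeedFanout`, the callbacks) are mine for illustration; a real distribution layer would be multicast over dedicated network links, not in-process Python callbacks.

```python
from collections import defaultdict

class FeedFanout:
    """Toy single-entry-point fan-out: one ingest path per stream,
    and every registered consumer sees each tick.

    Illustrative only -- not how any real market-data plant works.
    """

    def __init__(self):
        self._subscribers = defaultdict(list)  # stream name -> callbacks

    def subscribe(self, stream, callback):
        self._subscribers[stream].append(callback)

    def publish(self, stream, tick):
        # Single point of entry for the stream; distribute to all
        # consumers that asked for it.
        for callback in self._subscribers[stream]:
            callback(tick)

# Usage: two downstream apps consuming the same stream.
received_a, received_b = [], []
fanout = FeedFanout()
fanout.subscribe("NASDAQ", received_a.append)
fanout.subscribe("NASDAQ", received_b.append)
fanout.publish("NASDAQ", {"symbol": "AAPL", "price": 180.25})
```

The interesting engineering is everything this sketch leaves out: redundancy at the entry point, and keeping the per-hop latency of that `publish` loop near zero at global scale.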
The biggest challenge is that data feeds originate in nearly all of those countries and also need to be distributed efficiently to every other country. (e.g., NASDAQ data originates in the US and reaches around the globe, and the same is true for realtime feeds on the opposite side of the globe in the Middle East, India, Singapore, Hong Kong, Tokyo, etc.) The public Internet is not reliable from a latency point of view, so coupled with the required hardware is the required network: we operate one of the largest private networks in the world.
edit: Also, from a processing point of view, we have had great success speeding up complex algorithms that would normally take minutes to run across huge compute clusters, bringing them down to seconds by porting them to run on large GPU clusters. Certain things are definitely suited to GPUs, but the approach still feels pretty foreign to most programmers, and it is hard for companies to decide to jump into that kind of project. You're starting to see more specialized use of GPUs, or of slower-clock-but-massively-parallel compute devices, for a wider variety of tasks. (e.g., http://gigaom2.files.wordpress.com/2011/07/facebook-tilera-w...)
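A hypothetical illustration of what "suited for GPUs" means: work where each output element depends only on its own input, so thousands of GPU threads can each compute one element independently. Sketched here in plain Python (a real port would be a CUDA/OpenCL kernel); the function names and the cubic-polynomial "kernel" are made up for the example.

```python
def kernel(x):
    # Per-element work with no dependence on neighboring elements --
    # the shape of computation that maps well onto thousands of
    # GPU threads running the same instruction on different data.
    return 3.0 * x * x + 2.0 * x + 1.0

def run_data_parallel(inputs):
    # On a GPU, each index i would be one thread; here we just loop.
    out = [0.0] * len(inputs)
    for i in range(len(inputs)):      # conceptually: thread index i
        out[i] = kernel(inputs[i])
    return out

# Contrast: a loop-carried dependency, where each step needs the
# previous step's result, does NOT decompose this way and tends to
# stay serial no matter how many cores you throw at it.
def run_sequential_dependency(inputs):
    acc = 0.0
    for x in inputs:
        acc = kernel(acc + x)         # depends on the prior iteration
    return acc
```

Deciding which of your minutes-long algorithms have the first shape rather than the second is, in my experience, most of the work of such a port.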