We're collecting about 2-3 million records a week, and expect to accumulate about a year's worth of data (roughly 100-150 million records) in due time.
I'd really like some advice on techniques for storing and processing this data. We'd like to be able to answer queries similar to:
(1) For a given location, who was near that location (within a specified distance) over a specified period of time?
(2) Which locations are near each other?
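To make query (1) concrete, here is a rough brute-force sketch of what we mean. It assumes each record carries an identifier, a latitude/longitude pair, and a timestamp; the field names (`who`, `lat`, `lon`, `when`), the function names, and the sample records are made up for illustration. It scans every record, which is exactly the part we expect won't scale.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def who_was_near(records, lat, lon, radius_km, start, end):
    """Brute-force scan: ids seen within radius_km of (lat, lon) during [start, end]."""
    return {
        r["who"]
        for r in records
        if start <= r["when"] <= end
        and haversine_km(lat, lon, r["lat"], r["lon"]) <= radius_km
    }


# Hypothetical sample records, only to show the shape of the query.
records = [
    {"who": "a", "lat": 40.7128, "lon": -74.0060, "when": datetime(2024, 1, 1, 12)},
    {"who": "b", "lat": 40.7130, "lon": -74.0050, "when": datetime(2024, 1, 2, 9)},
    {"who": "c", "lat": 34.0522, "lon": -118.2437, "when": datetime(2024, 1, 1, 12)},
    {"who": "d", "lat": 40.7128, "lon": -74.0060, "when": datetime(2024, 3, 1, 12)},
]

nearby = who_was_near(records, 40.7128, -74.0060, 1.0,
                      datetime(2024, 1, 1), datetime(2024, 1, 31))
print(sorted(nearby))  # ['a', 'b']
```

Record "c" is dropped for being too far away and "d" for falling outside the time window. Query (2) would be the same distance test applied between pairs of locations.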
That's the general idea. We don't need a real-time response, but what are good databases (or other data storage software) for this? I've come across people talking about k-d trees; do they work at this scale? What kind of hardware do we need? I'm hoping to get pointers towards general strategies. How should we store this data? Does it even make sense to store it all in a database? Which data/software/packages lend themselves well to distance/radius calculations?
We're most familiar with Python/Linux, and we would prefer to stay away from Java and to use open source/free software. We're new to all this, so pointers to books and papers would also be useful. Any and all advice would be greatly appreciated.