Thanks in advance.
- An exabyte-scale storage engine. Nothing too exotic here technically, and a few companies have built one, but the design needs to address continuous data corruption, continuous hardware failure, geo-federation, etc.
- A real-time database kernel that supports very high throughput for mixed workloads. No production kernel of this type currently exists, though several people working on closed-source databases understand the necessary computer science in principle; the academic literature is far behind the state of the art. Gracefully shifting load and transients between servers while under full load, at many millions of writes per second, is not trivial.
- Native discrete topology operators. Necessary for geospatial analytics, sensor coverages, etc. If you can do this natively in the database kernel, it makes the second requirement easier to achieve, since you generally don't need secondary indexing (a sketch of the idea follows this list).
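To make that last point concrete, here is a minimal Python sketch (illustrative names, not any particular engine's API) of one way a kernel can serve spatial predicates straight off the primary index: store rows ordered by a Z-order (Morton) key, so a bounding-box query becomes an ordinary key-range scan.

    # Interleave the low `bits` bits of x and y into one Morton code;
    # x bits land in even positions, y bits in odd positions.
    def interleave_bits(x: int, y: int, bits: int = 16) -> int:
        z = 0
        for i in range(bits):
            z |= ((x >> i) & 1) << (2 * i)
            z |= ((y >> i) & 1) << (2 * i + 1)
        return z

    # Quantize lon/lat onto a 2^bits grid and interleave into one sortable key.
    def morton_key(lon: float, lat: float, bits: int = 16) -> int:
        qx = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
        qy = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
        return interleave_bits(qx, qy, bits)

    # Coarse key range covering a bounding box. A real kernel would split
    # the hull into tighter sub-ranges, but even the hull turns a spatial
    # predicate into a primary-index scan -- no secondary index needed.
    def bbox_key_hull(min_lon, min_lat, max_lon, max_lat, bits=16):
        return (morton_key(min_lon, min_lat, bits),
                morton_key(max_lon, max_lat, bits))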
Any solution even halfway toward the general solution would be viable. The potential value of such a system is hard to overestimate. Companies have paid half a million dollars for the output of a single analytic query over tens of trillions of IoT records; the differentiator was that it was possible to execute such a query at all.
It is extremely high-end, polymathic computer science, but seriously valuable if you can make a credible dent in it. And unlike some advanced topics in computer science, there are no epic unsolved theoretical problems in the way, though some of the relevant computer science may be unpublished.
A good example of an IoT data analytics problem is analyzing a petabyte of drone sensor data, which for a large drone amounts to a few flights' worth. Typical raw sources tend to be some combination of hyperspectral imaging/video and LIDAR, or RF probability functions (e.g. mobile), or all of the above, because you are fusing multiple sources to reduce the uncertainty of your analysis.
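The fusion payoff is easy to see in miniature. Here's a toy inverse-variance fusion of two independent estimates of the same quantity in Python (the numbers below are made up):

    # Inverse-variance weighting: the fused variance is always smaller
    # than either input's, which is the whole point of fusing sources.
    def fuse(mean_a, var_a, mean_b, var_b):
        w_a, w_b = 1.0 / var_a, 1.0 / var_b
        fused_mean = (w_a * mean_a + w_b * mean_b) / (w_a + w_b)
        fused_var = 1.0 / (w_a + w_b)
        return fused_mean, fused_var

    # e.g. a LIDAR height of 10.2 m (var 0.04) fused with a photogrammetric
    # height of 10.5 m (var 0.09) gives ~10.29 m with var ~0.028 -- tighter
    # than either source alone.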
FWIW, the "tens of trillions" of IoT records I mentioned was a real-world example from one of the most famous financial companies. It was a spatial analytic on a polygon model, and a classic IoT data model. If KX had solved that particular analysis problem, they would have used it.
There's been an explosion in raster data in just the last five years. Cheap satellites, cheap drones, and cheap platforms have really turned the firehose to "high". And the resolution of scientific data -- climate and weather model output, astronomy data, particle physics data, seismic data, etc. -- just keeps going up.
This is an area of massive growth for which no existing solution is quite adequate. If I had the time and/or the cash, I'd be diving in head-first.
His company provides sensor systems for civil engineering projects. A single large bridge can have sensor packs every 50m or so - per beam. The amount of sensor data coming in for a single municipality or region is already staggering. Vibration and stress analytics are required on a daily basis.
The final requirement, one you didn't mention, is that this setup should be fairly low-maintenance. If you need a team of rocket scientists just to operate it and keep it from falling over, the cost structure will be unsustainable.
A service that provided all this in a platform with sane APIs and good BI integration should be making tons of money.
During development, or if you want to keep the raw data, you can split the stream with bash process substitution, e.g.:

    collect | tee >(store) >(anotherAnalyzer) | analyze | report
ain't that the truth
One of the points I've tried to make at various companies (we've worked at the same one before) is that streaming solutions and batch solutions need to be fused into a single execution engine.
A streaming system on its own (operating on temporal windows) is not nearly as useful as one that can be joined to a storage engine with data at rest. It also needs to be disk-based so windows can be large (which most people do not want to take on), and extremely parallel and efficient.
Thousands of requests per second per server is not even in the right ballpark (which is where a lot of current execution engines sit today). Operating at line rate is generally table stakes, IMO. The operations on the stream should be parallelized automatically, up to petabytes a day of input; humans don't have the necessary context to do the partitioning up front, especially with streams that change.
The issue (and I've tried to come up with designs to address this, though not in practice) is that co-locating data at rest with data that is moving through the system is a tricky problem, especially with complicated joins.
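To sketch the co-location idea in Python (a toy single-process stand-in for a cluster; all names illustrative): if the table at rest and the stream are partitioned by the same join key, the join stays local to each shard.

    N_PARTITIONS = 8

    def part(key) -> int:
        return hash(key) % N_PARTITIONS

    # Data at rest, sharded by the join key.
    table_shards = [dict() for _ in range(N_PARTITIONS)]

    def load_row(key, row):
        table_shards[part(key)][key] = row

    # Moving data, routed to the shard that already holds its join partner,
    # so the join needs no network hop. What the sketch hides -- and what
    # makes this hard -- is keeping it true while shards split and rebalance.
    def on_event(key, event):
        row = table_shards[part(key)].get(key)
        return {**row, **event} if row is not None else None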
They can be the same engine (and should be), but traditional database engines tend to have a problem with streaming queries, since they just repeatedly execute the query against every new record. The queries are expressible, just not efficient. There is room to innovate in this space, but most people building these engines solve the parallelism problem either naively or not at all.
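A minimal Python illustration of the difference (illustrative names): re-execution does O(n) work per record, while incremental maintenance of the same aggregate applies only the delta.

    from collections import defaultdict

    # What "repeatedly executing the query" amounts to: re-scan everything
    # seen so far on each new record -- O(n) per record.
    def rerun_group_avg(history, record):
        history.append(record)
        totals, counts = defaultdict(float), defaultdict(int)
        for key, value in history:
            totals[key] += value
            counts[key] += 1
        return {k: totals[k] / counts[k] for k in totals}

    # Incremental maintenance of the same result: O(1) per record,
    # because only the affected group changes.
    class IncrementalAvg:
        def __init__(self):
            self.totals = defaultdict(float)
            self.counts = defaultdict(int)

        def update(self, record):
            key, value = record
            self.totals[key] += value
            self.counts[key] += 1
            return key, self.totals[key] / self.counts[key]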
There's also the problem of driving this computation to the edge. I have a solution for that which no one else is doing, but I have not yet met a company willing to take on this level of effort.
All the points you make about the kernel are apt, as are the points about the distribution algorithms. Also, the protocols used aren't Nash-safe, so at scale most of these systems become an operational juggling act under pressure.
None of the streaming systems I know of understand enough about the underlying data to gracefully rebalance and co-locate, since they all tend to embody the map/reduce paradigm, which is oblivious to the underlying data distribution, at least in current practice.
There is available computer science to solve all of these issues; I think some of the spatial algorithms out there can also be applied to the streaming space, especially in join evaluation.
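As a strawman of what distribution-aware routing might look like (Python; the sampling cadence and hot-key threshold are made-up knobs), contrast this with the static hash(key) % n that map/reduce-style systems bake in:

    from collections import Counter

    class SkewAwarePartitioner:
        """Hash routing, except keys observed to be hot are spread across
        all partitions (downstream operators must merge the partials)."""

        def __init__(self, n_partitions: int, hot_fraction: float = 0.05):
            self.n = n_partitions
            self.hot_fraction = hot_fraction
            self.freq = Counter()
            self.seen = 0
            self.hot = set()

        def route(self, key) -> int:
            self.freq[key] += 1
            self.seen += 1
            if self.seen % 10_000 == 0:  # periodically re-check the skew
                self.hot = {k for k, c in self.freq.most_common(16)
                            if c / self.seen > self.hot_fraction}
            if key in self.hot:
                # Spread a skewed key over every partition instead of
                # hammering hash(key) % n on a single worker.
                return (hash(key) + self.seen) % self.n
            return hash(key) % self.n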
> something I have a solution for in a way that no one is doing
Is the general direction of this something you can share? 30 years of database literature have accumulated a lot of knowledge. It'd be a bold claim to say there's something powerful yet non-obvious out there.
I'm just speculating, of course. I was sort of entertaining your idea and going back and forth between "a sounds-good fantasy" and "No, that would be awesome, why does this not exist?"
The data processing could become a commodity, but the architecture won't be -- it's highly tied to your specific industry.