I hope this doesn’t come across as confrontational; I find this inspiring.
My view of the CTO role is that it's focused on more long term efforts or the product generally (in more technical companies). So that's what I've been doing for much of this year. With the caveat that I'm also the founder and on the board so there is still about 10 hours a week of executive meetings/management, reviewing of other people's writing & product efforts and working with our biggest customers and prospects.
All that being said, I'm not just working on this because it's what I want to do and it's what excites me (which it does). I also think it's my highest point of leverage within the company. Very few people have the same view and in depth experience in this problem domain. Most of the people I know of that do are either founding and running competitive offerings or they're working on similar projects at Google, AWS, or Azure and getting paid significantly more than we can afford to pay a single engineer.
I think this is one of our most important efforts right now and the best way I know to make it successful is to be working on it in depth. I'm not the best programmer on the team or the smartest, but I have some depth of experience that gives me a clear vision on what it should be able to do and what tradeoffs we should be making.
My role on this effort is basically product person and tech lead. In this case the tech lead isn't the manager of the people on the team. We try to have multiple paths for advancement in engineering, and one of them is the tech-focused individual contributor track.
Of course, I view this project as necessary, but not sufficient for our overall success. That's why most of our engineering team is working on our legacy products and the continued forward development of the overall platform.
I built this for Morgan Stanley (the same place where kdb originally came from) and put it to use in some major high-volume/low-latency trading systems where a significant fraction of US equity trades execute every day:
I can choose whatever memory I have available for each topic/partition to do a query - or KSQL to transform topics.
edit: I should say that I can recognize a few benefits from columnar memory and lower level management with Rust, but it would be a huge infrastructure/tooling shift for a regular CRUD API shop or one that has already invested in any eventing system IMO.
Kafka is used as a distributed message queue. I.e. a set of apps produce opaque messages and send them to Kafka queues, while another set of apps consume these messages from Kafka queues.
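To make that concrete, here's a minimal sketch of those queue semantics in plain Python (illustrative only, not the real Kafka API; topic names and payloads are made up): producers append opaque byte messages to named topics, and each consumer tracks its own read offset.

```python
from collections import defaultdict

# Named topics, each an append-only log of opaque byte messages.
topics = defaultdict(list)

def produce(topic: str, message: bytes) -> None:
    topics[topic].append(message)

def consume(topic: str, offset: int, max_messages: int = 10):
    """Return (messages, next_offset); the broker never inspects payloads."""
    batch = topics[topic][offset : offset + max_messages]
    return batch, offset + len(batch)

produce("orders", b'{"id": 1}')
produce("orders", b'{"id": 2}')
batch, next_offset = consume("orders", 0)
```

The key point is that the broker treats every message as an uninterpreted blob; there's no notion of a per-series key or timestamp ordering, which is exactly where it differs from a TSDB.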
Time series databases are used for storing and querying time series data. There are no Kafka-like queues in time series databases. Every time series has a key, which uniquely identifies this time series, and a set of (timestamp, value) tuples usually ordered by timestamp. Time series key is usually composed of a name plus a set of ("key"=>"value") labels. These labels are used for filtering and grouping of time series data during queries. There are various time series databases optimized for various workloads. For example, VictoriaMetrics [1] is optimized for monitoring and observability of IT infrastructure, app-level metrics, industrial telemetry and other IoT cases.
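To illustrate the data model described above, here's a toy sketch in plain Python (metric names and labels are invented): a series key is a name plus a set of label pairs, each key maps to timestamp-ordered samples, and queries filter by label subsets.

```python
from collections import defaultdict

# A series key is a metric name plus a frozenset of (label, value) pairs.
def series_key(name, **labels):
    return (name, frozenset(labels.items()))

storage = defaultdict(list)  # key -> list of (timestamp, value), in insert order

def write(name, labels, timestamp, value):
    storage[series_key(name, **labels)].append((timestamp, value))

def query(name, **label_filter):
    """Return all series whose labels are a superset of the filter."""
    wanted = set(label_filter.items())
    return {
        key: samples
        for key, samples in storage.items()
        if key[0] == name and wanted <= key[1]
    }

write("cpu_usage", {"host": "a", "dc": "eu"}, 1, 0.5)
write("cpu_usage", {"host": "b", "dc": "us"}, 1, 0.7)
write("cpu_usage", {"host": "a", "dc": "eu"}, 2, 0.6)

# Filtering on dc="eu" selects exactly one of the two series.
result = query("cpu_usage", dc="eu")
```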
As you can see, there is little sense in comparing Kafka with time series databases :)
P.S. Apache Druid should be compared to ClickHouse [3] or similar analytical databases.
[1] https://valyala.medium.com/promql-tutorial-for-beginners-9ab...
Offtopic: I have been following your company and team closely during many Gophercons. Good luck to your team ... although a bit sad to see you go the Rust route :P
However, for large scale analytics on huge data sets where you're scanning all of it, we'll likely push you to EMR or something like that. The nice part is that those big data systems can execute directly against the Parquet files in object storage that form the basis of our durability.
This is part of the bet we're making: compatibility with a larger data processing ecosystem.
Almost all of our use cases fall within what you can compute in memory on a single system (which is pretty much up to 1TB).
Anyway, thanks for your work!
You will be able to do efficient range queries on any of these. Most use-cases will work best when you partition your data into time chunks, which commonly would be the time you inserted the samples into the database (created time in your case I guess).
It also defines rules for replication (push and pull), subscriptions to subsets of the data, and processing data as it arrives via a scripting engine.
Some of these features will arrive before others. Right now we want to make it work well for the data InfluxDB is currently good at (metrics, sensor data) and also work well for high cardinality data like events and tracing data.
We'll be publishing crates for parsing InfluxDB Line Protocol, reading InfluxDB TSM files (for conversion into Parquet and other formats), and client libraries as well.
The entire project itself will also be published as a crate. So you can use any part of the server in your own project.
https://grafana.com/blog/2020/07/29/how-blocks-storage-in-co...
I'm not totally sure how they index things, but I would guess that it's by time series with an inverted index style mapping for the metric and label data to underlying time series. This means they'll have the same problems with working with high cardinality data that I outlined in the blog post.
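Since that indexing scheme is a guess anyway, here's a toy sketch of what an inverted index over metric/label data looks like (plain Python, invented names): each (label, value) pair maps to a posting set of series IDs, and a lookup intersects posting sets. The cost of maintaining one posting set per distinct label value is exactly where high cardinality hurts.

```python
from collections import defaultdict

# Inverted index: (label, value) -> set of series ids holding that pair.
inverted = defaultdict(set)

def index_series(series_id, name, labels):
    inverted[("__name__", name)].add(series_id)
    for k, v in labels.items():
        inverted[(k, v)].add(series_id)

def lookup(name, **labels):
    """Intersect posting sets; one set per distinct label value must exist."""
    postings = [inverted[("__name__", name)]]
    postings += [inverted[kv] for kv in labels.items()]
    return set.intersection(*postings)

index_series(1, "cpu", {"host": "a", "region": "eu"})
index_series(2, "cpu", {"host": "b", "region": "eu"})
matches = lookup("cpu", host="a")
```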
InfluxDB aims to hit a broader audience over just metrics. We think the table model is great, particularly for event time series, which we want to be best in class for. A columnar database is better suited for analytics queries, and given the right structure is every bit as good for metrics queries.
InfluxData have a long history of writing (and rewriting) their own storage engines, so choosing to do it again is unsurprising. I guess this sort of hints that the current TSM/TSI have probably reached their performance and scalability limits and will be EOL before too long.
What I find interesting is that this project is already almost a year old and only has six contributors (two of whom look like external contractors). It seems more like a fun side project than the future core of the database that is supposed to be deployed into production next year.
The thing about this getting to production next year is that we're doing it in our cloud, which is a services based system where we can bring all sorts of operational tooling to bear. Out of band backups, usage of cloud native services, shadow serving, red/green deploys, and all sorts of things. Basically, it's easier to deploy a service to production once you've built a suite of operational tools to make it possible to do reliably while testing under production workloads that don't actually face the customer.
As for us rewriting the core of the database, that's true. But I think you're being unrealistic about what data systems look like at closed source SaaS providers as they advance through orders of magnitude of scale. Hint: they rewrite and redo their systems.
As for Grafana, MetricsTank was their first, Cortex wasn't developed there, and Loki and Tempo look like interesting projects.
None of those things has the exact same goal as InfluxDB. And InfluxDB isn't meant to be open source DataDog. That's not our thing. We want to be a platform for building time series applications across many use cases, some of which might be modern observability. It also doesn't preclude you from pairing InfluxDB with those other tools.
I don't think this is a fair criticism. Postgres only has seven core contributors, and InfluxDB is far less complex, targeting a much simpler use case.
For our cloud 1 customers, they'll be able to upgrade to our cloud 2 offering, but in the meantime, their existing installations get the same 24x7 coverage and service we've been providing for years.
As for how this will be deployed, it will be a seamless transition for our cloud customers when we do so. Data, monitoring and analytics companies replace their backend data planes multiple times over the course of their lifetime.
For our Enterprise customers, we'll provide an upgrade path, but not until this product is mature enough to ship as an on-premise binary, which will only get the chance to be upgraded a few times a year.
The only difference here is that we're doing it as open source. They always do theirs behind closed doors. I'm sure most of our users and many of our customers prefer our open source approach.
Your time series data doesn't even need to be touched, it'll "just work" after the upgrade.
The datasets I work with contain a few billion records with 5-10 fields each, almost all 64-bit longs. I originally started with CSV and soon switched to a binary format with fixed-size records (mmap'd), which gave great performance improvements, but the flexibility, the size gains from columnar compression, and the much greater performance of Arrow for queries that span a single column or a small number of them won me over.
For anyone who has to process even a few million records locally, I would highly recommend it.
Right now I'm writing tools in Python (Python!) to analyse several 100TB datasets in S3. Each dataset is made up of 1000+ 6GB parquet files (tables UNLOADed from AWS Redshift db). Parquet's columnar compression gives a 15x reduction in on-disk size. Parquet also stores chunk metadata at the end of each file, allowing reads to skip over most data that isn't relevant.
And once in memory, the Arrow format gives zero-copy compatibility with Numpy and Pandas.
If you try this with Python, make sure you use the latest 2.0.0 version of PyArrow [1]. Two other interesting libraries for manipulating PyArrow Tables and ChunkedArrays are fletcher [2] and graphique [3].
[1] I use: conda install -c conda-forge pyarrow python-snappy
You pay such a high overhead marshalling that data into an Arrow RecordBatch. Best thing ever is to work with the Parquet file and not even decompress the chunks that you don't need. Of course, this assumes that you're writing summary statistics as part of the metadata, which we plan to do.
https://www.dremio.com/webinars/apache-arrow-calcite-parquet...
They also have an OSS version in GitHub.
It's manual. What you get with Arrow is an efficient way to store structured data in a way that values for the same column (same dimension) are together on disk rather than having each record with all its fields together. So if you're storing say a dataset of users with a 64-bit user ID, an IP address, a timestamp, and a country code you'd define a Schema object as having these 4 columns with the size of each one (here 64/32/32/16 bits for example) and then you'd start writing your records block by block. A block is just a set of records and Arrow will mark the start and end of each block. Up to you to decide when to start and end a block, I use 100k entries per block but haven't played much with different values.
In pseudo-code it'd be something like this when reading just the user IDs:
  VectorSchemaRoot root = arrowReader.getVectorSchemaRoot();
  BigIntVector userIdVector = (BigIntVector) root.getVector("user_id"); // gets this 64-bit dimension from the schema
  // ... more vectors defined, one for each dimension
  List<ArrowBlock> blocks = arrowReader.getRecordBlocks();
  for (ArrowBlock block : blocks) { // go over all blocks
      arrowReader.loadRecordBatch(block); // *actually* reads the block
      for (int i = 0; i < root.getRowCount(); i++) { // row count of the loaded batch
          long userId = userIdVector.get(i); // offset is within the current block
          processUserId(userId);
      }
  }
The code in this example will only go over the user IDs, and will read them very quickly. So yes, you have to implement any querying capabilities yourself. In my case it was a simple set of queries like "get distribution of dimension X" where X can be a parameter, or "filter records where X < minX || X > maxX", also with parameters, etc. Just a handful in all. For a limited set of queries, rather than something like full SQL, this was perfect. I found this article very useful to get started: https://github.com/animeshtrivedi/blog/blob/master/post/2017...
This is the direction that QuestDB (www.questdb.io) has taken: columnar database, partitions by time and open source (Apache 2.0). It is written in zero-GC Java and C++, leveraging SIMD instructions. The live demo has been shown to HN recently, with sub-second queries over 1.6 billion rows: https://news.ycombinator.com/item?id=23616878
NB: I am a co-founder of questdb.
BTW, VictoriaMetrics [2] is a time series database built on top of ClickHouse architecture ideas [3], so it inherits high performance and scalability from ClickHouse, while providing simpler configuration and operation for typical production time series workloads.
[2] https://victoriametrics.com/
[3] https://valyala.medium.com/how-victoriametrics-makes-instant...
https://www.dremio.com/webinars/apache-arrow-calcite-parquet...
https://github.com/dremio/dremio-oss
If you have parquet on S3, using an engine like Dremio (or any engine based on arrow) can give you some impressive performance. Key innovations in OSS on data analytics/data lake:
- Arrow, a columnar in-memory format: https://arrow.apache.org/
- Gandiva, an LLVM-based execution kernel: https://arrow.apache.org/blog/2018/12/05/gandiva-donation/
- Arrow Flight, a wire protocol based on Arrow: https://arrow.apache.org/docs/format/Flight.html
- Project Nessie, a Git-like workflow for data lakes: https://github.com/projectnessie/nessie
Alternative solutions depend on your use case. If it's about querying S3 data, then Dremio/Athena/Presto/Spark are good.
I imagine he's happy where he is, but I hope there's some opportunity for InfluxDB to give credit and support for his great work.
[1]: https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...
Its architecture is dramatically different from InfluxDB's. You can compare the design goals. Read the post this thread refers to. I think you'll find it has very different goals than Postgres, an OLTP database, and Timescale, which is built on top of it.
[1] https://medium.com/@valyala/when-size-matters-benchmarking-v...
[2] https://medium.com/@valyala/high-cardinality-tsdb-benchmarks...
[3] https://medium.com/@valyala/insert-benchmarks-with-inch-infl...
Meanwhile, InfluxDB IOx has a very different set of goals than Postgres. It's not an OLTP (transactional) DB and never will be. It's firmly targeted at OLAP and real-time OLAP workloads.
That means we can do things like optimize for running on ephemeral storage with object storage as the persistence layer. It'll have fine grained control over replication, how data is partitioned in a cluster, and where data is indexed, queried, queued for writes and more. Push and pull replication, bulk transfer, and persistence with Parquet. This last bit means you get integration with other data processing and data warehousing tools with minimal effort.
It'll also support Arrow Flight which will give it great integration into the data science ecosystems in Python and R.
Right now, InfluxDB IOx is really too early to do any real comparison on actual operation. We're putting this out now so that people can see what we're doing, comment on it, and maybe even contribute. We think it's an interesting approach where no single item is completely novel, but the composition of everything together makes it an entirely unique offering in open source.
Edit: one other thing I forgot to mention. InfluxDB IOx is open source, Timescale isn't. For some that matters, for many it doesn't. Depends on your use case.
If you're used to writing SQL, TimescaleDB is much easier to write queries with, although once you get over the learning curve, both of the query languages in Influx seem very powerful.
One notable advantage of Influx is its integrations with other tools for ingest and visualization, and it seems like 2.0 is doubling down on that
1. TensorBase is highly hackable. If you know Rust and C, you can control the whole stack. This is obviously not the case for Apache Arrow and DataFusion (built on top of Arrow).
2. TensorBase uses whole-stage JIT optimization, which is faster (possibly hugely so in complex cases) than what Gandiva does. An expression-based compute kernel is far from providing top performance for an OLAP-like big data system.
3. TensorBase keeps some kinds of OLTP in mind (although at this early stage it is still OLAP). Users don't truly think in terms of OLTP versus OLAP; they just want all their queries to be fastest.
4. TensorBase is now APL v2 licensed. Enjoy hacking on it yourself!
ps: A recent write-up about TensorBase (including comparisons with some query engines and projects in Rust) can be seen in this presentation: https://tensorbase.io/2020/11/08/rustfest2020.html
Disclaimer: I am the author of TensorBase.
- InfluxDB IOx: 600K rows/sec
- VictoriaMetrics: 4M rows/sec
I.e. VictoriaMetrics outperforms InfluxDB IOx by more than 6x in this benchmark. I hope InfluxDB IOx performance will be improved over time, since it is written in Rust.
Recently we had an MSc project to add Parquet, which is a very good direction; couldn't agree more.
PS: I'm a rookie in this whole domain, so any pointers would be really helpful.
Will there be any type of transactional guarantees (ACID) using MVCC or similar?
Is the execution engine vectorised?
That being said, Arrow Flight will be a first class RPC mechanism, which makes it quite nice for data science stuff as you can get data in to a dataframe in Pandas or R in a few lines of code and with almost zero serialization/deserialization overhead.
This isn't meant to be a transactional system. More like a data processing system for data in object storage. I'm curious what your need is there for OLAP workloads, can you tell me more?
> As an added bonus, within the Rust set of Apache Arrow tools is DataFusion, a Rust native SQL query engine for Apache Arrow. Given that we’re building with DataFusion as the core, this means that InfluxDB IOx will support a subset of SQL out of the box
Apache Spark is a distributed compute platform, which does have some support for Arrow for interop purposes.
One of the things that led me to get involved in Arrow originally was to explore the idea of building something like Apache Spark based on Arrow (and Rust) and my latest prototype of that concept is in the Ballista project [1].
(If you don't know, don't guess.)