I bought into the TICK stack and planned on getting an enterprise support contract when going to production, but every interaction with InfluxData the company has felt a bit sleazy: pushing very hard toward the cloud offering, for example.
That’s bad enough, but the documentation and observability of the database itself are quite poor, and it’s trivially easy to “vanish” all your data and lock your instance up for hours or days just by changing a database’s retention policy. (Not making it much different.)
Now of course it’s not TICK at all. More like “TI”, as Kapacitor and Chronograf (alerting and dashboarding, respectively) are deprecated products, rolled into the main offering. On top of that, they completely changed the query language.
I have to say: pick something better if you can. TimescaleDB and Prometheus (which has its own storage engine, and can remote-write to backends like OpenTSDB) are promising.
There's a plugin for Telegraf that looks promising, but it hasn't been merged yet.
Is anyone else using TimescaleDB? If so, what do you use to push monitoring data to it?
Edit: I’m using Grafana but was considering checking out Apache Superset.
I set up 2.x for myself recently, and they have really done a lot of work. The OSS offering has most of the features that cloud/enterprise does. It was easy to set up -- they don't have any instructions for installing it in Kubernetes, and haven't updated their Helm charts for 2.x, but it took about 3 minutes to write a manifest (https://github.com/jrockway/jrock.us/tree/master/production/...) myself, which I prefer 99.9% of the time anyway.

The new query language is incredibly verbose, but all the steps I remember from Google's internal system -- align, delta, aggregate -- are possible. (I had to scratch my head a lot, though, to make it work. And I really am not able to reason about what operations it's doing, what's indexed or not indexed, why I ingest my data as rows but process it as columns, etc.)

The performance is good, and it worked well for my use case of pushing data from my Intranet of Stuff. Generally I like it, and I don't think they are being shady in any way. It's on my list of things to set up at work to collect various pieces of time series data outside of the Prometheus ecosystem (CI runtimes, etc.).
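As a sketch of that verbosity: here's roughly what "mean over 5-minute windows" looks like in Flux, with hypothetical bucket and measurement names, held in a Python string the way you'd pass it to the query API.

```python
# A hypothetical Flux query doing the filter -> window -> aggregate steps
# mentioned above. Bucket and measurement names are made up for illustration.
flux_query = """
from(bucket: "sensors")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "temperature")
  |> aggregateWindow(every: 5m, fn: mean)
  |> yield(name: "mean")
"""
```

The pipeline style makes each step explicit, which is part of why it reads as so much longer than the equivalent one-line GROUP BY in SQL.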
The reason I picked InfluxDB over TimescaleDB for my personal stuff is that InfluxDB has an HTTP API with built-in authentication. I already run a ton of HTTP services exposed to the Internet, and I understand them well. (Yup, I have SSO and rate limiting and all that stuff for my personal projects ;) I can give each of my devices an API key from the web interface, and I make an HTTP request to write data. Very simple.

(They have a client library, but honestly my main target is a BeagleBone, and it doesn't have enough memory to compile their client library. I've never seen "go build" run out of memory before, but their client makes that happen. I shouldn't develop on my IoT device, of course, but it's just easier because it has Emacs and gopls, and all the sensors connected to the right bus. It was easier to just make the API calls manually than to cross-compile on my workstation and push the release build to the actual device.)

TimescaleDB doesn't have that, because it's just Postgres. So I'd basically have to expose port 5432 to the world, create Postgres users for every device, generate a password, store that somewhere, etc. Then to ingest data, I'd have to connect to the database, tune my connection pool, retry failed requests manually, etc. Using HTTP gets me all that for free; I can just configure retries in Envoy.
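To make the "just an HTTP request" point concrete, here's a minimal sketch using only Python's standard library. The URL, org, bucket, and token are placeholders; the endpoint shape (POST to `/api/v2/write` with a `Token` authorization header and line protocol in the body) is the real v2 write API.

```python
import urllib.request

# Placeholder values: the real token comes from the InfluxDB UI, per device.
INFLUX_URL = "https://influx.example.com/api/v2/write?org=home&bucket=sensors&precision=s"
API_TOKEN = "device-token-goes-here"

def build_write_request(measurement, tags, fields, ts):
    """Format one point as line protocol and wrap it in a POST request."""
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    line = f"{measurement},{tag_str} {field_str} {ts}"
    return urllib.request.Request(
        INFLUX_URL,
        data=line.encode(),
        headers={"Authorization": f"Token {API_TOKEN}"},
        method="POST",
    )

req = build_write_request("temperature", {"room": "office"}, {"celsius": 21.5}, 1700000000)
# urllib.request.urlopen(req) would actually send it; retries can live in a
# proxy like Envoy instead of in the client.
```

No client library, no connection pool: any device that can make an HTTPS request can write a point.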
But... SQL queries are a lot easier to figure out than Flux queries, and I already have good tools for manipulating raw data in Postgres (DataGrip is my preferred method), so I think I will likely be revisiting TimescaleDB. Honestly, I'd pay for a managed offering right now if there were a button in the Google Cloud Console that said "Create Instance" and just added it to my GCP bill for 10% more than a normal Cloud SQL instance.
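A sketch of why the SQL side feels easier to figure out: a 5-minute average in TimescaleDB is one ordinary GROUP BY using its `time_bucket()` function. Table and column names here are hypothetical.

```python
# Hypothetical table/columns; time_bucket() is TimescaleDB's windowing helper,
# and everything else is plain Postgres SQL.
sql_query = """
SELECT time_bucket('5 minutes', time) AS bucket,
       avg(celsius) AS mean_temp
FROM temperature
WHERE time > now() - interval '1 hour'
GROUP BY bucket
ORDER BY bucket;
"""
```

Anything that speaks Postgres (DataGrip included) can run and explain this, which is a big part of the appeal.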
The industrial historians solve the same problems - collect data at nodes that might have intermittent connectivity, send to a centralized server/service that can handle lots of data, and allow users to plot it.
I wonder if we’ll start to see more open source monitoring on the factory floor. While it may be easy for an open source product to work as well as the industrial offerings feature-wise, maybe the vendors' real value is in the long-term support (usually close to a decade) and supported upgrade paths.
We've also updated all our tender documents for future projects to include a requirement that we can query metrics and logs through an API or direct DB access.
A recent project I worked on identified over 500 applications whose only use is to provide monitoring to a bespoke system or tool. This isn't uncommon at a university as different faculties and departments will buy "the best tool for XYZ" without ever asking IT if perhaps there is a tool that is almost as good that we already have.
As well as the things you identified, I suspect that there's just a lot of mistrust of open source in the industrial world - there's that whole thing of perceived value being directly proportional to product cost, plus commercial vendors also tend to at least offer training and tech support, even if they're not always the most helpful.
I am curious: in what regard does nothing come close to Grafana? I am currently paying a lot for Citect and Wonderware support across a couple dozen facilities.
That said, Grafana is a more mature product. I can't fault anyone for using InfluxDB just as a time series database and using Grafana for visualization and alerting.
The SQL-like language is similar enough to SQL that it’s confusing. And I’m still not sure why a dedicated time series database is supposed to be better than standard MySQL/Postgres with inserts. At least then you get a wider range of options for data types, indexing, and querying.
I’m likely still not in the correct mindset but not sure what I’m missing.
It's optimized for time series data, so queries and inserts may be faster and the storage requirements may be lower. Other than that, Postgres can probably do everything Influx can do, and more.
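As one concrete example of that optimization, in TimescaleDB you keep an ordinary Postgres table and `create_hypertable()` transparently partitions it into time-based chunks, which is what keeps inserts and time-range queries fast as the table grows. The schema below is hypothetical.

```python
# Hypothetical schema; create_hypertable() is the TimescaleDB call that turns
# a regular Postgres table into a chunked, time-partitioned hypertable.
ddl = """
CREATE TABLE metrics (
    time    TIMESTAMPTZ NOT NULL,
    device  TEXT        NOT NULL,
    value   DOUBLE PRECISION
);
SELECT create_hypertable('metrics', 'time');
"""
```

After that one call, inserts and queries look exactly like plain Postgres, so the wider range of data types and indexing options mentioned above still applies.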
Maybe not in this specific case, but in general Prometheus is my preferred TSDB sitting between Telegraf and Grafana.
If you need further scale-out there are options for federating Prometheus instances as well.
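For reference, federation in Prometheus is just a scrape job pointed at another server's `/federate` endpoint. A minimal fragment might look like this (the target name and match selector are hypothetical):

```yaml
# prometheus.yml fragment: one Prometheus pulling selected series
# from another instance's /federate endpoint.
scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="node"}'
    static_configs:
      - targets: ["prometheus-dc1:9090"]
```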
Trying to get a change merged into the codebase is a nightmare though. Especially if it's in a plug-in that isn't a money earner for InfluxDB.
The reason merges have been a problem is that historically there were only a couple of people on the project doing all of the code reviews, and new plugins are usually large chunks of new code that interact with products or protocols those reviewers aren't familiar with, so there is a steep learning curve just to properly review a contribution.
Last year we formed a new maintainers team that is a mix of InfluxData staff and community contributors who are working together to review and land code changes. This has significantly increased the rate they're getting through PRs but there's still a very large backlog to get through, plus new stuff coming in all the time.
Anybody who wants to see new code and plugins land in Telegraf faster can ask to join the maintainers team. You'll need to be familiar with the codebase and willing to work on any plugins or functionality that come in (a lot of plugins come from people building things for their own job/product).
You don't need to host InfluxDB and Grafana yourself. I would also consider gathering logs and traces to troubleshoot problems: straightforward with top-tier observability vendors, harder to do on your own.
Disclaimer: I'm employee of Sumo Logic.
The one thing I would add to this guide is enabling HTTPS for the whole stack, if you are transmitting over the public internet. Fortunately, it is quite straightforward (and free) with Let's Encrypt.
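For instance, if nginx is fronting the stack, certbot can obtain and install a Let's Encrypt certificate (and set up auto-renewal) in one command; the domain below is a placeholder.

```shell
# Hypothetical domain; certbot's nginx plugin edits the server block for you
# and schedules automatic renewal.
sudo certbot --nginx -d grafana.example.com
```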