ListenBrainz moves to TimescaleDB | Better HN

ListenBrainz moves to TimescaleDB | Better HN

96 comments

mfreed5y ago

Fun fact: TimescaleDB exists because we were using InfluxDB + Postgres for a previous IoT project and also found it unworkable (developer experience, query language, reliability, scalability and performance, operations, etc).

We first built TimescaleDB as "Postgres for time-series" for our own needs and then decided to open-source it for others. :-)

zitterbewegung5y ago

Have you thought of making a TimescaleDB app like the Postgres.app for macOS? Or could I use Postgres.app to make a TimescaleDB app?

avthar5y ago

It seems you can use the Postgres.app and install TimescaleDB on it

Here's some instructions on how to do so: https://github.com/slashdotdash/til/blob/master/postgres/ins...

akulkarni5y ago

Not sure if a coincidence, but someone just published this blog post today:

Installing Timescaledb on Mac OS X with Postgres.app https://prathamesh.tech/2020/07/23/installing-timescaledb-on...

rubyn00bie5y ago

You can mount like any PostgreSQL version, including extensions, in Postgres.app. I’ve used both Timescale and Agens with it, with zero problems.

jarym5y ago

Have you been following ZHeap and do you think Timescale will benefit from a storage engine like that (less write amplification)?

Lockerman5y ago

Timescale engineer here. I'm betting we'll see a nice win; we tend to see write-mostly workloads the UNDO shouldn't be too expensive, and the smaller tuple sizes should be nice. We've built Timescale to be compatible with custom storage engines, so it should work as a drop-in, though of course until we've tested it we won't be sure.

Another Timescale engineer here. As previously pointed out, zheap should work as a drop-in in TimescaleDB. In fact, I just tried it and it works. However, it currently requires an unmerged PR to work properly: https://github.com/timescale/timescaledb/pull/2082, as well as further testing.

Interesting read and thanks for sharing.

Not too long ago, I was asked to work on some analytics project and it required time-series data. I'm not a rockstar programmer and don't really know much about trends. So, I ended up googling and stumble upon InfluxDB. It felt like that right choice and I started playing with it. As the time passed, I realized that it might be a good software and I'm sure people love InfluxDB, but it wasn't the right choice for me. I didn't really like the docs, maybe its good now. And I had the same feeling about query syntax, it felt weird.

I moved to TimescaleDB and never looked back. I have it production for almost 2 months now. 20 tables and over 100Million writes/week. One of things I really liked was staging, I don't use docker and or anything fancy. I have bash script that and it runs on centos box and all timescale extension and postgres database are packaged together.

I was impressed by the timescale compression feature. I wasn't using it earlier because I had to be careful about what columns I need to segmentby. I would love to see some more features but I'm sure timescaledb team is already on it.

mfreed5y ago

Hey 120bits - thanks for the nice words!

What new/other features would you like to see? (Also feel free to join slack.timescale.com or reach out at mike (at) timescale.com)

to each his own - I find influxdb somewhat flaky but the best part about it is not having to write the atrocious sql queries I would need to to get the same kind of windowed aggregations. `group by time(1h)` and so on is pretty handy.

pgt5y ago

+1 on escaping measurement names. Quoting from their source code:

    def get_escaped_measurement_name(user_name): # ... comment omitted
        return '"\\"{}\\""'.format(user_name.replace('\\', '\\\\\\\\').replace('"', '\\"').replace('\n', '\\\\\\\\n'))

sgt5y ago

There are some edge cases in InfluxDB where escaping queries becomes a dark art.

hoseja5y ago

That looks somehow worse than regex in C literal.

contravariant5y ago

Although it seems about par for the course for regexp in elisp.

mfreed5y ago

Do you have a URL? Thanks!

ilogik5y ago

https://github.com/metabrainz/listenbrainz-server/blob/b0846...

iliekcomputers5y ago

Hey! I've been working on ListenBrainz [0] for the past 3-ish years. Happy to answer questions if anyone has any.

[0]: https://listenbrainz.org

paol5y ago

Excited to find out about the project. I regularly use MusicBrainz but didn't know about the sister projects.

I'll definitely be creating a ListenBrainz account. As a long time last.fm user I occasionally worry about the future of the platform. (There have been long stretches of time where it seems to have been in maintenance mode). You seem to support bulk importing last.fm data right?

iliekcomputers5y ago

Yep, we do support bulk import of last.fm data [0].

We also have a Spotify importer that automatically imports stuff from Spotify, if you use Spotify, I would definitely recommend setting that up.

We're a really small team (all volunteers), so we don't move with as much urgency as I'd like to, but we've been making slow but steady progress over the years.

If you find any rough edges, or have any feedback, I'd be happy to hear, my email is in the HN profile. :)

[0]: https://listenbrainz.org/profile/import

I read that whole page and I have no idea what the project does.

iliekcomputers5y ago

Yeah, we need to fix that landing page. It's basically an open repository of your music listening history.

dylz5y ago

It's last.fm.

RedShift15y ago

I'm in the process of moving from InfluxDB to TimescaleDB myself and can't wait to get rid of the hoops I have to jump through to get InfluxDB to answer some basic questions, mostly stemming from the fact that InfluxQL doesn't support boolean expressions. Something like 'SELECT MAX(temperature) > 10 FROM...' doesn't work.

akulkarni5y ago

Welcome to the TimescaleDB community! In case it is helpful, we wrote a tool to easily migrate from InfluxDB to TimescaleDB:

https://docs.timescale.com/latest/tutorials/outflux

brokentone5y ago

I'm personally working on migrating a DB I've had for tracking some economic time-series data from Influx to TimescaleDB, and I have to say Outflux is the most savage tool name ever.

mayhemchaos5y ago

Agreed -- every time I tried to query data from the command line I could not get anything to work. I would usually find a piece of working python code and adapt that to run the query instead.

100% opposite of what a CLI interface should be.

19965y ago

Speaking from my own experience, you may save yourself some future effort by moving directly to clickhouse.

Timescale brings its own issues. If your goal is performance, you will be better served by clickhouse.

nwmcsween5y ago

The issue I have with clickhouse is the codebase, it's an absolute behemoth and seemingly embeds musl libc? It also uses a huge amount of SIMD intrinsics for everything when SWAR or really nothing from my view looking in would have been better (memcpy, etc).

RedShift15y ago

Can you elaborate on those issues?

samhld5y ago

(InfluxData solution architect here) Boolean is supported. You query it in the WHERE clause. Try `SELECT MAX(temperature) FROM ... WHERE temperature > 10`. That said, I'm not sure why you'd run a query like that in InfluxQL as it's the same as `SELECT max(temperature)`. :).

RedShift15y ago

It's not the same thing. 'SELECT MAX(temperature) > 10 FROM ...' gets you all the datapoints, with a value of true or false. Moving it to the WHERE clause only gets you the datapoints where temperature is > 10. Yes, you can fill with 0 after the GROUP BY, but if all datapoints are less than 10, you get nothing back from the database. That's confusing: were all datapoints less than 10, or was there nothing registered in the time window I'm querying? Impossible to tell the difference. Not to mention some user interfaces just bailing: no data. I need to show a chart with a timeline of this condition being true or false.

Plus my actual use case is even more complex, not only do I need something like MAX(temperature) > 10, I need (MAX(temperature)) > 10 && (MAX(temperature) - MAX(dewpoint)) > 4.5).

willvarfar5y ago

Question for influx solution architect:

How do you delete points from a specific measurement in a specific retention policy?

kev0095y ago

Blame the customer. Nice solution, architect!

it's not that hard: `select max from (select max(temperature) from measurement_name group by time(1h)) where max > 10`

RedShift15y ago

Does not produce the desired result, see elsewhere in this thread for why this doesn't work

akulkarni5y ago

(TimescaleDB co-founder). Thanks for the kind words! I feel especially proud about the first point "openness" - this is something we strive for both technically and culturally.

For example, we have a pretty active Slack channel[0] where you can ask us anything. We've probably given away $$$$ of free support over the years ;-)

[0] https://slack.timescale.com/

We too started off with influx but it wasn't a good fit mainly due to use having issues with high cardinality. I don't know if this is still the case with current implementations, but what it boils down to is if your data is searchable by a "user_id" really look elsewhere. That might be an oversimplification but that's the gist of it.

I was fully ready to just roll my own partitioned table and gave TimescaleDB a shot. It worked well. There was a bug we ran into, but it was an existing one documented on github and was addressed pretty quickly.

I still like influx, and would use it again but beware of the cardinality issues.

If you have cardinality issues in InfluxDB, then just substitute InfluxDB with VictoriaMetrics :) [1]

[1] https://medium.com/@valyala/insert-benchmarks-with-inch-infl...

awinter-py5y ago

timescale is a postgres extension. 'postgres as a platform' is an interesting world to live in.

postgres built-in RBAC is clunky or people would be relying on it, but an ecosystem of postgres plugins could include cleaner or smaller versions of this feature.

Even things like complex migrations (github's gh-ost, for example) could exist as DB plugins.

decafninja5y ago

As someone who wants to pick up a time series DB to learn, what would be the best in terms of being the "industry standard"? InfluxDB? TimescaleDB?

I'm familiar with some basics of kdb and use it often in my day job, but from what I understand that isn't widely used outside of finance?

The following time series databases are popular right now:

* ClickHouse (this is a general-purpose OLAP database, but it is easy to adapt it to time series workloads)

* InfluxDB

* TimescaleDB

* M3DB

* Cortex

* VictoriaMetrics

The last three of these TSDBs support PromQL query language - the most practical query language for typical time series queries [1]. So I'd recommend starting from learning PromQL and then evaluating time series databases from the list above.

[1] https://medium.com/@valyala/promql-tutorial-for-beginners-9a...

akulkarni5y ago

(I work at TimescaleDB.)

If you are familiar with Postgres and/or SQL, then you may want to start with TimescaleDB. It's just Postgres for time-series. Full SQL, so it's possible to be productive instantly.

osigurdson5y ago

In some cases it is difficult to define the table columns up front. Instead, a few tables: Object, Property, Time and Value (example below) are defined which make it possible to create new items on the fly. This works reasonably well up to a few billion records in the value table. However it does end up taking a lot of space (covering indexes/requisite memory are required for performance). It would be great to see a Postgres compatible solution that solves this problem in a more optimal way than a stock RDMS.

Object objectId objectName other...

Property propertyId objectId FK propertyName other...

Time timeId time other..

Value timeId FK propertyId FK value

silvester235y ago

> If you ever write bad data to a measurement in InfluxDB, there is no way to change it

Correct me if I'm wrong, but I'm fairly certain you can just write data with the same timestamp again and it gets updated. Deleting is also easily possible.

iliekcomputers5y ago

ListenBrainz dev here. We wanted stuff like the ability to do stuff like "DELETE FROM measurement where field = blah" and that support didn't exist last time we looked. [0]

Another thing that's not mentioned in the post but was a pain point for us was that it's not easy to query for fields with "null" values [1].

I figure a lot of our pain might be because we're not as good at Influx as we are at PostgreSQL. We've been running MusicBrainz[2] for ~18 years on PostgreSQL, that knowledge will hopefully transfer over a little with Timescale.

[0]: https://github.com/influxdata/influxdb/issues/3210

[1]: https://github.com/influxdata/docs.influxdata.com/issues/717

[2]: https://musicbrainz.org

silvester235y ago

True, deleting by value does not seem to be possible. I can see how that would be painful if it's necessary for your use case.

sgt5y ago

You can't for example delete data from a time period to another, to the best of my knowledge.

This is actually pretty straightforward. For example,

`DELETE FROM "foo" WHERE time >= now() - 2d AND time < now() - 1d`

rweichler5y ago

Figured I'd use this as an opportunity to plug my own service: https://eqe.fm

Only works on jailbroken devices but it works well, has a local backup, and has been maintained (by me) for 2 years now.

Server costs are $2.50/mo, so this will stay up as long as I am alive.

jbmsf5y ago

I'd love to hear more about how your data ingestion works. I'm thinking of implementing TimescaleDB myself, but in my initial read of the docs, the focus seemed to be managing the database, not getting data into the database...

same way you'd insert data into postgres.

jbmsf5y ago

That's not really helpful. Let's assume you have a distributed system; you probably don't want all of your system components connecting directly to TimescaleDB. You also probably want to have some layer that implements queuing and handles back pressure if it can't insert into the database at the rate that events are coming in. You may want to batch insert data.

I'd assume that most anyone building a system like this at scale has to solve these problems; does everyone roll their own?

Has anyone used Prometheus as well as TimescaleDB in production and have thoughts to share on those, comparatively?

akulkarni5y ago

We have quite a few in our Slack channel: slack.timescale.com

Feel free to ask over there :)

(Btw - TimescaleDB is designed to work with Prometheus. You can see more here: https://github.com/timescale/timescale-prometheus)

I really want to love timescaledb, it's great.. except for the minor issue of not being able to back up.

https://github.com/timescale/timescaledb/issues/1835

akulkarni5y ago

TimescaleDB definitely supports backups :-)

Here is a page from our docs on how to perform Backup & Restore: https://docs.timescale.com/latest/using-timescaledb/backup

Not sure what's going on in that one Github issue, but we are looking into it.

justinclift5y ago

It seems to be affecting multiple people too. :(

k-rus5y ago

This issue is related to slightly confusing _warnings_ that the software prints out, it doesn't effect the _correctness_ of the backups.

The warnings are produced by COPY TO, which is used by pg_dump, since COPY TO doesn't copy chunks. It is not an issue for pg_dump, since it also do COPY TO on each chunk table.

Timescale engineer here - was part of discussion about this warning. We need to do another round and see how to remove this confusion.

akulkarni5y ago

Posted elsewhere, but also posting here for posterity:

That issue is now closed by the original author:

"Data is successfully dumped. also i can see the constraints, indexes are also copied successfully."

https://github.com/timescale/timescaledb/issues/1835

brightball5y ago

That seems like a pretty big deal.

Does a WAL backup approach work?

akulkarni5y ago

TimescaleDB offers a number of backup and restore options, including wal-e (WAL-based), pg_dump & pg_restore:

https://docs.timescale.com/latest/using-timescaledb/backup

There are hundreds of thousands of TimescaleDB databases in production so this is generally not an issue.

j / k navigate · click thread line to collapse