Are there huge differences in performance, features or search quality? At which scale does using Postgres for full text search still make sense?
IMO, the only time to reach for an outside system is when the data isn't being written to PG first (like log ingestion with elastic search) or when search is such a central part of your app that it mandates a separate dedicated system.
As in: it works well enough, and not having to add another piece of tech makes it a no-brainer. I've had zero support issues or customer complaints, and most of my applications use full-text search heavily.
The big advantage over other approaches: because it's SQL, living in the same database where I also store users and permissions, I can permission-limit my full-text searches.
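A minimal sketch of what that looks like, assuming a hypothetical schema with a `documents` table (carrying a precomputed `body_tsv` tsvector column) and a per-user `document_acl` table:

```sql
-- The FTS match and the permission check happen in one query,
-- so a user only ever sees hits they are allowed to see.
SELECT d.id, d.title
FROM documents d
JOIN document_acl a ON a.document_id = d.id
WHERE a.user_id = 42
  AND d.body_tsv @@ plainto_tsquery('english', 'quarterly report');
```

Doing this in an external search engine typically means either duplicating the ACL data into the index or post-filtering results, both of which are easy to get wrong.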
We recently tried loading our data into an Elasticsearch index for one particular use case (a weighted sum over the 20M rows based on an FTS criterion) where we felt Postgres was underperforming. On the same hardware, using all available RAM and CPUs, ES took 6s while PG took 0.7s.
So far, across the 30+ queries of our dashboard tool, we have yet to find a use case that Postgres didn't handle better than Lucene-based solutions.
"ZomboDB is a Postgres extension that enables efficient full-text searching via the use of indexes backed by Elasticsearch. In order to achieve this, ZomboDB implements Postgres' Access Method API.
In practical terms, a ZomboDB index appears to Postgres as no different than a standard btree index. As such, standard SQL commands are fully supported, including SELECT, BEGIN, COMMIT, ABORT, INSERT, UPDATE, DELETE, COPY, and VACUUM."
Feel free to email the mailing list (zombodb@googlegroups.com). I'd be happy to help answer any questions you might have
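A sketch of what using such an index looks like; the exact syntax has varied across ZomboDB releases (this follows the newer form), and the table and URL here are placeholders:

```sql
CREATE TABLE products (
    id    serial PRIMARY KEY,
    name  text,
    descr text
);

-- The index is backed by the Elasticsearch cluster at the given URL;
-- to Postgres it behaves like any other index.
CREATE INDEX idx_products_zdb ON products
    USING zombodb ((products.*))
    WITH (url = 'http://localhost:9200/');

-- ZomboDB's ==> operator sends the query through to Elasticsearch.
SELECT * FROM products WHERE products ==> 'descr:waterproof';
```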
The time came to replace our website search (tens of thousands of pages), and we decided to try rolling our own. Someone suggested ElasticSearch, and as I read through it, it seemed to do less than PostgreSQL. I still had the hard problems of (1) spidering the site and (2) converting all the file formats (.doc, .xls, .pdf, etc.).
I ended up just putting wget on a daily cron job to spider the site. Then I ran the saved files through a hodgepodge of scripts to extract the plain text and put it into PostgreSQL.
Once it's there, it's far easier to do the rest. Postgres has its own functions to search for matches, rank the matches, give you snippets, and even highlight the search words in the snippets. It's amazing.
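The functions mentioned map onto SQL roughly like this (a sketch, assuming a hypothetical `pages` table with the raw text in `body` and a precomputed tsvector in `body_tsv`):

```sql
-- @@ matches, ts_rank ranks, ts_headline produces a snippet with
-- the search terms highlighted.
SELECT title,
       ts_rank(body_tsv, q)            AS rank,
       ts_headline('english', body, q) AS snippet
FROM pages,
     to_tsquery('english', 'postgres & search') AS q
WHERE body_tsv @@ q
ORDER BY rank DESC
LIMIT 10;
```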
Searches run in a split second. Well, at first, when I was testing, they often took a few seconds. But the weird thing is that after go-live it ran faster. My best guess is that so many users caused Postgres to cache more and more of itself into RAM. The whole server is still using less than 1 GB though, and it's running Apache and Postgres for the website and all its apps.
In my experience performance is great if you're just doing text search, but if you combine that with other operators in the same SELECT it can be much slower than Elasticsearch since in many of those cases Postgres needs to fall back to a full table scan.
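For example (hypothetical table and column names), a GIN index covers the text-match clause on its own, but once you add an unrelated filter the planner has to combine strategies, and depending on selectivity it may decide a sequential scan is cheaper:

```sql
CREATE INDEX pages_body_tsv_idx ON pages USING gin (body_tsv);

-- Inspect the plan: the extra range condition can change
-- which access path the planner picks.
EXPLAIN
SELECT id
FROM pages
WHERE body_tsv @@ to_tsquery('english', 'invoice')
  AND created_at > now() - interval '30 days';
```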
The first issue concerns https://www.postgresql.org/docs/9.6/static/textsearch-parser...: The documentation says
> At present PostgreSQL provides just one built-in parser, which has been found to be useful for a wide range of applications
and it really means it: changing the behaviour of this component is not possible unless you write a completely different parser in C, which, while possible, is no fun.
We're using the full text feature over product data, and we have to work around the parser too eagerly detecting email addresses and URLs, which interferes with properly detecting brand names that contain some of these special characters.
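You can see the built-in parser's token classification with `ts_debug`; anything containing an `@` tends to come back with the `email` token type, which is the behaviour being worked around here (the sample strings are made up):

```sql
-- Shows, for each token, which token type the default parser assigned.
SELECT alias, token
FROM ts_debug('english', 'Contact sales@acme.de about the ACME-2000');
```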
The other problem is the compound support. A lot of our data is in German which like other languages likes to concatenate nouns.
For example, you'd absolutely want to find the term "Weisswürste" for the query "wurst" (note the compound, and the umlaut the plural adds to "wurst").
Traditionally, you do this using a dictionary and while Postgres has support for ispell and hunspell dictionaries, only hunspell has acceptable compound support, which in turn isn't supported by Postgres.
So we've ended up using a hacked ispell dictionary where we have to mark all known compounds which is annoying and error-prone.
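Registering such a custom dictionary looks roughly like this (a sketch; the dictionary name and file names are placeholders, and the `.dict`/`.affix` files must live in the server's `$SHAREDIR/tsearch_data` directory):

```sql
-- Register the hacked Ispell dictionary...
CREATE TEXT SEARCH DICTIONARY german_compound (
    TEMPLATE  = ispell,
    DictFile  = german_compound,
    AffFile   = german_compound,
    StopWords = german
);

-- ...and wire it into the text search configuration, falling back
-- to the stemmer for words the dictionary doesn't know.
ALTER TEXT SEARCH CONFIGURATION german
    ALTER MAPPING FOR asciiword, word
    WITH german_compound, german_stem;
```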
Also, once you have to use a dictionary, you end up with a further issue: loading the dictionary takes time, and due to the way Postgres currently works, it has to happen per connection. In our case, with the 20MB hacked German ispell dictionary, this takes ~0.5s, which is way too long.
The solution for this is to use a connection pooler in front of Postgres. This works fine but, of course, adds more overhead.
The other solution is http://pgxn.org/dist/shared_ispell/, but I've had multiple postmaster crashes due to corrupted shared memory (thank you, Postgres, for crashing instead of corrupting data) related to that extension, so I would not recommend this for production use.
Lucene, and by extension Elasticsearch, has much better built-in text analysis, so it could probably fix both the parser and the compound issue. But that would mean even more additional infrastructure, and probably performance problems too: we can't simply return all the FTS matches, we have to check each one for other reasons why it must not be shown, which hits the database again, and I'm wary of somehow pushing all that logic into ES as well.
This is why we currently live with the Postgres tsearch limitations. But sooner or later, we'll probably have to bite the bullet and move to a dedicated solution.
Postgres is really reliable, and I think a lot of the performance difference comes from robust transactions. For some use cases you can use both and replicate data or query one + the other in sequence.
http://www.sai.msu.su/~megera/postgres/talks/pgopen-2016-rum...
This makes migrating off Heroku for Postgres a PITA and requires downtime.
Everyone speaks about InnoDB and how performant and reliable it is... and multiple firms even use it as a KV-store (Uber/Pinterest/AWS) bypassing MySQL entirely. I have never heard much about storage engines in Postgres, why could this be so?
Wikipedia has a (stub) article on InnoDB, but nothing on Postgres' storage engines... just wondering why that is.
We (Pinterest, I wrote most of the MySQL automation) make heavy use of MySQL replication which is vastly simpler to manage than PG. All queries still flow through SQL and unlike PG, we can force whatever execution plan we need. We do lots of PK lookups, and InnoDB is really good at that. In InnoDB all the data is stored in the PK while in PG it is just a pointer.
This is just a consequence of the PK being a clustered index in InnoDB which has both pros and cons. One of the big cons is that all of the columns of the PK are implicitly added to every secondary index as the row identifier. That isn't a big problem if your PK is a single column int, but if it's multiple columns, that often results in unnecessary bloat in your secondary indexes. Ideally (as in, dare I say, MS SQL Server), you'd have the option of a clustered or non-clustered PK for your table so you could choose the optimal index structure for your workload on a per-table basis.
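For illustration, the T-SQL for the SQL Server option mentioned (table and column names are made up): the PK is declared `NONCLUSTERED`, and the clustered index goes on whichever column best fits the access pattern:

```sql
CREATE TABLE orders (
    order_id    int       NOT NULL,
    customer_id int       NOT NULL,
    placed_at   datetime2 NOT NULL,
    -- The PK enforces uniqueness but does not dictate row order...
    CONSTRAINT pk_orders PRIMARY KEY NONCLUSTERED (order_id)
);

-- ...so the physical clustering can follow the dominant query pattern
-- (e.g. range scans by time) instead.
CREATE CLUSTERED INDEX cx_orders_placed_at ON orders (placed_at);
```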
Being able to force the execution plan is more useful in MySQL than PostgreSQL because MySQL's optimizer is not very good at planning queries.
If you do a lot of PK lookups, then you don't need to force the execution plan.
https://www.postgresql.org/docs/9.5/static/postgres-fdw.html
For example, Citus Data provides a column store for Postgres via the fdw api.
Some earlier (2013) discussion on the same topic: https://wiki.postgresql.org/wiki/2013UnconfPluggableStorage
Here is as well some documentation on the matter: https://wiki.postgresql.org/wiki/HeapamRefactoring
Having "CREATE ACCESS METHOD [...] ON STORAGE|TABLE" to create a custom access method, or storage engine, and extending CREATE TABLE to be able to pass a storage method with the table definition could become a quite powerful combination. The main challenge is to come up with an interface solid enough to be able to handle problems related to MVCC, like VACUUM cleanup.
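PostgreSQL 9.6 already ships the index half of this command; the proposal above is essentially to extend it to table storage. A sketch of the existing form (the access method name and handler function here are hypothetical placeholders for a handler implemented in C):

```sql
-- 9.6 syntax for registering a custom index access method.
-- myam_handler would be a C function returning an index AM handler.
CREATE ACCESS METHOD myam TYPE INDEX HANDLER myam_handler;
```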
Because PG isn't designed around pluggable storage engines, it's not really practical to take the storage engine out and use it separately, and it doesn't make much sense to talk about the storage engine separately from the whole system.
Even then, there aren't a lot of published cases of people using these alternative access methods at scale yet. AFAIK, all of the large kv use-cases you've mentioned still go through traditional SQL queries. Despite the overhead of SQL parsing, it provides more control and visibility. The ecosystem around alternative access methods isn't nearly as mature.
What? Everyone speaks about how unreliable it is and how many major data corruption problems it has.
>I have never heard much about storage engines in Postgres, why could this be so?
Because they didn't take the approach of having multiple storage engines, they just made one that works and is not easily removed from the database.
InnoDB is, and always has been, a very reliable and durable storage engine with solid performance characteristics.
This one is huge for my company. Almost every single query of ours could use an index-only scan, but the planner would never choose to perform one because of the weirdness around partial indexes. We're expecting a several-x speedup once we upgrade to 9.6. All they need to improve now is a way to keep the visibility map up to date without relying on vacuums.
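A sketch of the pattern in question (hypothetical table): a partial index that covers all the columns a query needs, so that with an up-to-date visibility map the planner can answer it from the index alone:

```sql
-- Partial covering index: only rows matching the predicate are indexed.
CREATE INDEX orders_open_idx ON orders (customer_id, total)
WHERE status = 'open';

-- If every referenced column is in the index and the visibility map is
-- current, 9.6 can plan this as an index-only scan.
EXPLAIN
SELECT customer_id, total
FROM orders
WHERE status = 'open';
```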
I don't think that's that unlikely to change. There are two major avenues: writing it during HOT pruning (which is done on page accesses), and performing a "lower impact" vacuum on insert-only tables more regularly.
> but hopefully the changes in 9.6 will make it a non-issue on large tables.
You mean the freeze map? That doesn't really change the picture for regular vacuums, it changes how bad anti-wraparound vacuums are. The impact of the table vacuum itself is most of the time not that bad these days (due to the visibility map), what's annoying is usually the corresponding index scans. They have to scan the whole index, which is quite expensive.
Curious about this:
> parallelism can speed up big data queries by as much as 32 times faster
Why would it be only 32 times faster? The sky's the limit if there aren't major bottlenecks on the way.
For a sequential full table scan I could process about 2000 MB/s of data (only 125 MB/s was read from each SSD); I was limited by CPU power.
Anyway, the same query took about 25 minutes on PostgreSQL 9.5 and is now down to 2 minutes and 30 seconds. For comparison, SQL Server 2012 spent 7 minutes on the same dataset on the same hardware.
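For reference, the 9.6 parallelism is mainly controlled per query by one setting; a sketch (the table name is a placeholder, and the planner still caps workers based on table size and `max_worker_processes`):

```sql
-- Allow up to 8 parallel workers per Gather node for this session.
SET max_parallel_workers_per_gather = 8;

-- The plan should now show a Gather node over a Parallel Seq Scan.
EXPLAIN
SELECT count(*) FROM big_table;
```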
I realize I'm asking a stranger on the internet to do something for free for me. If you don't have time or inclination to do this, no worries, but it seems like you've got a nice setup to be able to play with this. I'm sure I'm not the only one curious to see such a comparison.
(You can probably get more memory level parallelism with random access, but your overall bandwidth will likely be lower... fully exploiting memory bandwidth is complicated and difficult to do for real applications).
I'm running pg on ec2 with a hot standby slave. I need the postgis extension but am not doing anything particularly esoteric. Ideally I'd like to have the certainty of aws handling backups for me.
I was researching moving to RDS today and would love to hear thoughts on whether it's a good general solution or not. What happens about downtime during upgrades or swapping instance sizes?
This is one of my favorite features of RDS: You can set a maintenance window and have the option to not have changes take effect until that window. So if I want to upgrade Postgres or change the instance size, I set it up and the downtime happens when I'm fast asleep and nobody is using the site.
I also think (but not 100% sure) that if you have Multi-AZ enabled, changes are done by upgrading the slave, failing over, and then upgrading the ex-master, so downtime is limited to the failover period.
One major issue is that you're restricted in what you can do with it; not all options are available. You can only use extensions that they provide. (I'm a bit fuzzy on this, but) changing the disk size made the service unavailable for ~30 minutes (proportional to the new disk size). You couldn't configure replication; replication only happens to the backup node. You couldn't even set up replication across regions.
The replication is kind of a bummer, because if you ever want to move your data (perhaps to a VM or outside of AWS) you'd need an outage. Also, if I remember correctly, there was no way to do a major version upgrade in place.
There was also another incident (it was caused by a bug, so hopefully it's been fixed and won't happen to anyone else). We had a cluster set up with a backup. One day, out of nowhere, the service stopped working and was unavailable for 1.5 hours. That was a pretty big issue because we used it for monitoring (Zabbix), so any outage makes us blind to other problems. It turned out that due to the bug, their backup routine mistakenly started the backup on the master server (normally it's supposed to run on the slave).
Sounds good!
I think more like the former -- as I recall, the recent articles have mostly been about specific work going on for the 9.6 release, prereleases of 9.6, and now the actual release of 9.6.
Well, due to the delayed 9.5 release (January 7th), there have been two this year ;)
http://blog.2ndquadrant.com/bdr-is-coming-to-postgresql-9-6/
TL;DR: It is not in mainline, but it does not need a patch anymore. You need to bring your own conflict resolution logic.
Index-only scans for partial indexes