
hashin 9y ago

Could you please elaborate? What issues were you facing?


As another user with nothing but negative experiences with Riak-CS in production, I thought I'd take a stab here. We had a 12-node cluster with ~10TB per node, fwiw. In no particular order:

- The restart times of the Riak process ranged from 10 minutes to 3+ hours, during which time the cluster was basically useless. Not a single suggestion from support sped up this process.

- Every single night from 0800 - 0900 UTC, the cluster would grind to a halt (as measured by canaries timing upload/download cycles). This continued even after we migrated all customer data and traffic off of the cluster.

- Riak-CS ships with garbage collection disabled despite it being a critical feature. I inherited a cluster that had been run for some months without gc enabled. Turning it on caused the cluster to catastrophically fail. Basho Support, over a period of close to a year, was unable to find a single solution that would get our cluster back to health. If our cluster were a house on a show like Hoarders, the garbage in it would be considered load bearing.

- We attempted to upgrade our way out of our un-garbage-collect-able mess, but the transfer crashed. Every. Single. Time.

- Even had transfers worked, all of the bloated manifests have to be copied in their entirety, so you can't gc the incoming data on the new cluster.

- Even while babying the cluster, it would become unusable at least once a month, requiring a restart of all nodes. The slowest node took 3+ hours to start, followed by another 3+ hours of transferring data. This was 6+ hours of system downtime every month.

- During these monthly episodes, we attempted to engage with support and try to debug the processes (we were a team of seasoned Erlang developers). We could attach Observer and/or use the REPL to grab stats, but not a single support resource was able or willing to engage.

- For giggles, once we had migrated all users off of the cluster, we attempted to let gc run. It never completed. Not once. We let this go on for a few months before nuking the entire cluster.

Now, I absolutely realize that we got ourselves into that mess by running the cluster without gc for an extended period. But in the grand scheme of things, this cluster wasn't storing a very large amount of data -- tens of TB spread over tens of millions of objects. Having the cluster get into a state where gc can never run and where this causes snowballing instability is unacceptable.

We switched to Ceph. We've never looked back.

ranman 9y ago

We didn't have any issues with lost data, but we had a lot of operational issues that didn't have clear fixes, primarily around TLS, migrations, and performance. We had to contact support for many of them because the documentation for various failure modes wasn't there.
