- The restart times of the Riak process ranged from 10 minutes to 3+ hours, during which time the cluster was basically useless. Not a single suggestion from support sped up this process.
- Every single night from 08:00 to 09:00 UTC, the cluster would grind to a halt (as measured by canaries timing upload/download cycles). This continued even after we migrated all customer data and traffic off of the cluster.
- Riak-CS ships with garbage collection disabled despite it being a critical feature. I inherited a cluster that had been run for some months without gc enabled. Turning it on caused the cluster to catastrophically fail. Basho Support, over a period of close to a year, was unable to find a single solution that would get our cluster back to health. If our cluster were a house on a show like Hoarders, the garbage in it would be considered load bearing.
- We attempted to upgrade our way out of our un-garbage-collect-able mess, but the transfer crashed. Every. Single. Time.
- Even had transfers worked, all of the bloated manifests would have had to be copied in their entirety, so you couldn't gc the incoming data on the new cluster.
- Even while babying the cluster, it would become unusable at least once a month, requiring a restart of all nodes. The slowest node took 3+ hours to start, followed by another 3+ hours of transferring data. This was 6+ hours of system downtime every month.
- During these monthly episodes, we attempted to engage with support and debug the processes (we were a team of seasoned Erlang developers). We could attach Observer and/or use the REPL to grab stats, but not a single support resource was able or willing to engage.
- For giggles, once we had migrated all users off of the cluster, we attempted to let gc run. It never completed. Not once. We let this go on for a few months before nuking the entire cluster.
Now, I absolutely realize that we got ourselves into that mess by running the cluster without gc for an extended period. But in the grand scheme of things, this cluster wasn't storing a very large amount of data -- tens of TB spread over tens of millions of objects. Having the cluster get into a state where gc can never run and where this causes snowballing instability is unacceptable.
We switched to Ceph. We've never looked back.
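
For anyone curious what a canary like the ones mentioned above might look like: a minimal sketch of a timed upload/download cycle, not the actual canary code used on this cluster. The `run_canary` name and the `put`/`get` callables are hypothetical stand-ins for whatever object-store client you use; here they are backed by an in-memory dict so the example runs on its own.

```python
import time

def run_canary(put, get, key="canary-object", payload=b"x" * 1024,
               clock=time.monotonic):
    """Upload a small object, read it back, and time the round trip.

    `put(key, data)` and `get(key)` are hypothetical callables standing in
    for a real object-store client.
    """
    start = clock()
    put(key, payload)
    fetched = get(key)
    elapsed = clock() - start
    # A healthy cycle returns the same bytes; `seconds` is what you graph/alert on.
    return {"ok": fetched == payload, "seconds": elapsed}

# Stand-in store: a plain dict instead of a real cluster.
store = {}
result = run_canary(store.__setitem__, store.__getitem__)
```

Run on a schedule and graphed, the `seconds` value is the kind of signal that makes a nightly stall like the one described above show up clearly.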