undefined | Better HN

0 pointshans_castorp5y ago0 comments

> If at this point you are still finding that SQLite is not as fast or faster than SQL Server, MySQL, et.al., then I would be very surprised.

What about large aggregation queries, that are parallelized by modern DBMS?

Does it still scale that well if you have many concurrent read and write transactions (e.g. on the same table)?

0 comments

bob10295y ago

We aren't running any reports on our databases like this. I would argue it is a bad practice in general to mix OLTP and OLAP workloads on a single database instance, regardless of the specific technology involved.

If we wanted to run an aggregate that could potentially impact live transactions, we would just copy the SQLite db to another server and perform the analysis there. We have some telemetry services which operate in this fashion. They go out to all of the SQLite databases, make a copy and then run analysis in another process (or on another machine).

I am not aware of any hosted SQL technology which is capable of magically interleaving large aggregate queries with live transactions and not having one or both impacted in some way. At the end of the day, you still have to go to disk on writes, and this must be serialized against reads for basic consistency reasons. After a certain point, this is kinda like trying to beat basic information theory with ever-more-complex compression schemes. I'd rather just accept the fundamental truth of the hardware/OS and have the least amount of overhead possible when engaging with it.

castorp5y ago

> At the end of the day, you still have to go to disk on writes, and this must be serialized against reads for basic consistency reasons.

No, absolutely not.

That's why modern databases use a thing called multi version concurrency control. You can run (multiple) queries on the same table that is updated by multiple transactions at the same time without one blocking the others (assuming the write transactions don't block each other). Of course they are fighting for I/O, but there is no need so serialize anything.

Mixing OLTP and OLAP becomes increasingly "normal" theses days as the capabilities of the database products and the hardware improve. With modern high-end hardware (hundreds of CPUs, a lot of SSDs, large RAM) this actually scales quite nicely .

hodgesrm5y ago

OLTP databases are optimized for mutable data. OLAP databases are optimized for immutable data. There's a big difference between appropriate data structures for each use case that has little to do with hardware capabilities.

OLAP databases tend to write columns in large blocks and apply sort orders to improve compression. This type of structure works well if you write the data once and read it many times. It's horrendously inefficient for concurrent updates to things like user session contexts. (Or even reading them for that matter.) You are better off using a row store with ACID transactions and relatively small pages.

The dichotomy has been visible for decades and shows no sign of disappearing, because the difference is mostly how you arrange and access data, not so much the hardware used.

tristor5y ago

"serialized" here doesn't really mean processed in serial, it means "serializable" in the context of database information theory. Databases have special concurrency control requirements in order to create hard guarantees on database consistency. You can process queries in parallel and still have a serializable result, because of transaction coordination. Doing this on one server is much easier than doing this across a cluster of servers.

So in your case, MVCC is what you're talking about, which is not the same level of consistency guarantee as serializable, rather it is based on snapshot isolation. Some database vendors consider them effectively the same isolation level because the anomalies associated to other common non-serializable isolation levels aren't typically present in most MVCC implementations, but there's a lot more complexity here than you are acknowledging.

Mixing OLTP and OLAP workloads on the same database is pretty much always a bad idea. This is why it's common practice to use ETL jobs to move data from an OLTP optimized database like Postgres or MySQL to a separate database for OLAP (which could be another MySQL or PG instance, or could be something like ClickHouse or another columnar database optimized for OLAP). Just because you /can/ do something, doesn't mean you /should/ do something...

2 more replies

b5n5y ago

> Mixing OLTP and OLAP becomes increasingly "normal"

Just because it's "normal" doesn't mean it's correct. Just because you can doesn't mean you should.

All hail bob1029:

> We aren't running any reports on our databases like this. I would argue it is a bad practice in general to mix OLTP and OLAP workloads on a single database instance, regardless of the specific technology involved.

unilynx5y ago

I think this is specifically because random reads are scaling much better than writes, even though you won't see it on standard "that many MB/s" benchmarks where read is 'just' a few multiples of the write performance.

Persisting a transaction to the database is still (and especially in MVCC): "send data write". "wait for write to be flushed". "toggle metadata bit to mark write as completed". "wait for bit to be completed" which still serialises transaction commits while reads can complete in parallel as fast as the device can handle.

Especially now that the reads and writes don't have to share the disk head, it makes sense for random reads to keep on scaling better than writes.

1 more reply

BFatts5y ago

I think we're seeing the difference between someone who is a programmer and someone who plays a programmer at work. :) Arguing that some modern, complex feature is "better than" a simpler system is crazy talk. Any time I can simplify a system vs. add complexity I will go simplicity. Adding in plug-ins and features goes a long way toward vendor lock-in that prevents your org from being agile when you have to swap systems because you run into limits or have catastrophic failures.

mumblemumble5y ago

> I am not aware of any hosted SQL technology which is capable of magically interleaving large aggregate queries with live transactions and not having one or both impacted in some way. At the end of the day, you still have to go to disk on writes, and this must be serialized against reads for basic consistency reasons.

I do sometimes wonder if dirty reads are what the business folks actually want.

Not necessarily unconstrained dirty reads. But if it were possible to say, "The statistics in your reports may only be accurate to (say) +/- x%," would that be good enough?

Going really philosophical, might they even make better decisions if they had less precision to work with? There are certainly plenty of studies that suggest that that's basically how it works when people are managing their investments.

nthj5y ago

My experience is that customers/executives will accept latency but not inaccuracy. In practice this may be the same thing, it just depends how you position it. “Reports are accurate but may be up to 5 minutes out of date” is a very easy sell to a corporate worker who logs in to check a dashboard once a month.

Primary/replica is probably the correct way to solve this. In some places, I have also shunted writes through SQS queues, which in practice protects us from a locking operation in one place impacting other operations in a customer-facing way. I don’t think this is strictly necessary but it is a nice technical guard against the sociological problem of a contractor merging code like that. They don’t feel the pain of locks gone bad because they (per contract) can’t be on call.

1 more reply

bob10295y ago

We effectively do dirty reads when we just go around and copy the databases for purposes of offline analysis. Our use cases for aggregation queries are all tolerant to incomplete recent data. Most of the time we are running analysis 24 hours after whatever event, so we never get into a situation where we can't answer important business questions because data hasn't been flushed to disk yet.

The fact that most of the stuff we care about is time-domain & causal means that we can typically leverage this basic ideology. Very rarely does a time-series aggregate query need to be consistent with the OLTP workloads in order to satisfy a business question.

j / k navigate · click thread line to collapse

0 comments

bob10295y ago

castorp5y ago

> At the end of the day, you still have to go to disk on writes, and this must be serialized against reads for basic consistency reasons.

No, absolutely not.

hodgesrm5y ago

The dichotomy has been visible for decades and shows no sign of disappearing, because the difference is mostly how you arrange and access data, not so much the hardware used.

tristor5y ago

2 more replies

b5n5y ago

> Mixing OLTP and OLAP becomes increasingly "normal"

Just because it's "normal" doesn't mean it's correct. Just because you can doesn't mean you should.

All hail bob1029:

unilynx5y ago

Especially now that the reads and writes don't have to share the disk head, it makes sense for random reads to keep on scaling better than writes.

1 more reply

BFatts5y ago

mumblemumble5y ago

I do sometimes wonder if dirty reads are what the business folks actually want.

Not necessarily unconstrained dirty reads. But if it were possible to say, "The statistics in your reports may only be accurate to (say) +/- x%," would that be good enough?

nthj5y ago

1 more reply

bob10295y ago

j / k navigate · click thread line to collapse