My point is that even though it's quadratic, it can still be fast for the inputs mentioned (dozens of kilobytes), so long as the constant is low. If you have a quadratic algorithm that takes 1 cycle per byte of input squared (or less, using wide registers), it will be pretty damn quick for most inputs. If you have a quadratic algorithm that takes 1 million cycles per byte of input squared (such as this one), that's a whole different story: the time the 1-cycle algorithm needs to process 1 megabyte would only get you through 1 kilobyte with the million-cycle one.
The point is that things like being efficient with memory access and using a sufficiently low-level (or JIT'ed) language can get you very far, and it's not really meaningful to dismiss an algorithm solely because it's quadratic.
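To make the arithmetic concrete, here's a rough back-of-the-envelope sketch. It assumes a simple cost model of c * n^2 cycles, decimal units (1 KB = 10^3 bytes, 1 MB = 10^6 bytes), and a 3 GHz clock picked purely for illustration; the function names are made up for this example.

```python
# Rough cost model: total cycles = c * n^2,
# where c is "cycles per byte of input squared".

KB = 10**3  # decimal units assumed
MB = 10**6

CLOCK_HZ = 3 * 10**9  # assumed 3 GHz clock, illustration only


def quadratic_cycles(n_bytes: int, cycles_per_byte_sq: int) -> int:
    """Cycles needed by a quadratic algorithm with the given constant."""
    return cycles_per_byte_sq * n_bytes ** 2


def seconds(cycles: int) -> float:
    """Convert a cycle count to seconds at the assumed clock rate."""
    return cycles / CLOCK_HZ


# 1 MB through the 1-cycle algorithm vs 1 KB through the million-cycle one.
fast_1mb = quadratic_cycles(MB, 1)
slow_1kb = quadratic_cycles(KB, 10**6)
print(fast_1mb == slow_1kb)  # same cycle budget in both cases

# "Dozens of kilobytes" (50 KB here) through each algorithm.
print(seconds(quadratic_cycles(50 * KB, 1)))
print(seconds(quadratic_cycles(50 * KB, 10**6)))
```

Both cases cost 10^12 cycles, which is where the megabyte-versus-kilobyte comparison comes from. The real constant for any given implementation has to be measured; this only shows how much it matters.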