Is there any good literature or patterns on supporting dumps in the tens of millions or larger?
I wrote a sheets plug-in that uses our cursor API to provide a full dump into a spreadsheet. Our professional services team is in love with it, so I bet they'd love generic data export capability.
The two main things you need are: 1. HTTP is a streaming protocol. You don't need to fully manifest a response in memory before you send it. If your framework forces that, bypass it for this particular call. (If you can't bypass it... your framework choice is now a problem for you.)
2. You presumably have some sort of JSON encoder in your language. As long as it doesn't have some sort of hard-coded "close the stream once we send this JSON" behavior (and if so, file a bug because a JSON encoder has no business doing that), all you have to do is ensure that the right bytes go out on the wire. You, again, don't have to fully manifest the reply in memory before you encode it. Something like:
stream.write("[")
needComma = False
for item in toBeSerialized:
    if needComma:
        stream.write(",")
    json.dump(item, stream)  # encode each item straight to the stream
    needComma = True
stream.write("]")
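The loop above can be wrapped in a streaming HTTP response. Here's a minimal sketch using Python's WSGI interface (stdlib only), where the app returns a generator so the server writes each chunk to the socket as it's produced; `stream_rows` is a made-up stand-in for a real DB cursor:

```python
import json

def stream_rows():
    # Made-up stand-in for a DB cursor; a real one would fetch rows lazily.
    for i in range(3):
        yield {"id": i}

def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "application/json")])
    def body():
        # Each yielded chunk goes straight out on the wire; the full
        # response is never assembled in memory.
        yield b"["
        needComma = False
        for item in stream_rows():
            if needComma:
                yield b","
            yield json.dumps(item).encode("utf-8")
            needComma = True
        yield b"]"
    return body()
```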
A lot of times when you're emitting gigabytes of JSON it's still in lots of little chunks where no individual chunk is a memory problem on its own, so doing something like this can be very memory-efficient, especially if "toBeSerialized" is itself something like a cursor coming out of a DB that isn't manifesting in memory either. (Newlines are also a good idea if your JSON encoder isn't emitting them already; they help debugging a lot for very little cost.)

JSON objects are more annoying; you may need to serialize one manually, key by key. Protip: whenever possible, use the JSON encoder in your language; there is no shame in using the JSON encoder to emit the strings corresponding to the keys of your object. Much like HTML, you need to be very careful writing things directly to the stream; it really should always go through the encoder. I even send constant strings through the encoder just to make the code look right.
The last little tidbit is that the HTTP software stack will tend to fight you on keeping long-lived connections open. There can be a lot of places with timeouts you may want to extend. If this gets too big you may need to do something other than HTTP. You may also need to consider detecting failures (hashing or something) and the ability to restart. (Although don't underestimate modern bandwidth and the speed at which you can iterate through SELECT-type queries; definitely check into the virtues of just retrying. 10GB/year of extra bandwidth and processing power is still cheaper than a developer even designing a solution to that problem, let alone implementing and testing it.)
Oh, and if you can use HTTP, be sure you're gzip'ing. It's dirt cheap nowadays on the CPU; only in extreme situations of bandwidth abundance and CPU shortage can it be worth skipping. My rule-of-thumb on JSON shrinking is about 15:1. CSVs don't quite shrink that much but they still shrink down pretty well.
Also, back in the day IBM had an XML logging format where every line was a separate XML fragment [2].

Most markup languages suffer from requiring a root element, which prevents efficient logging or streaming.