also, is there a good resource on how to backfill?
I'm not sure what you mean. Software has bugs, data has bugs, etc. Being able to fix a bug and rerun a solution is important in all areas of software; it has nothing to do with logs or analytics specifically (though data and data-model questions are usually important in those domains).
> also, is there a good resource on how to backfill?
Not really, because "backfill" means something different to everyone who holds data. As for which questions to ask first, I would start with "What do we do if a lot of our data shows up incorrect?" and "What do we do if a lot of our data goes missing?", and then solve the problems in your individual data stack that arise from those questions.
As an example, at a previous job our ETL/ELT system all started with a file showing up in an S3 bucket. The code that ingested the contents of those files occasionally had bugs that required reingesting all data processed by that version of the code. Having tools to identify (at the data level) what data was affected by a given bug, and then being able to delete that data from the datastore and reingest only the affected S3 files with a newer version of the ingestion code, made these types of bugs much easier to manage over time.
The reason I asked about resources is that I have data generated by a personal project. The initial data model was sloppy, so now I'm finding myself having to backfill to clean the data, and it's rather painful. I haven't come across anything that deals with the subject, so I'm just winging it on my own.
OP used "log stacks", but log stacks are just a specific flavor of event-based timeseries/analytics stacks. If you were to build a log-ingest and log-aggregation system, you'd just be building an ETL pipeline with a specific emphasis on logging.
> The reason I asked about resources is because I have data generated by a personal project. The initial data model was sloppy and so now I'm finding myself having to backfill to clean the data and it's rather painful. Though I haven't come across anything that deals with the subject so I'm just winging it on my own
Snowflake's stage system works similarly to what I'm describing. You can use S3 as a stage, and then load data from the stage into a table. If something bad happens to your data _in Snowflake_, you can just reload from the Stage (with an updated INSERT).
For more ad-hoc ELT/ETL systems (i.e. not all-in on Snowflake), you'd have to assess your own tooling and build it yourself. In general, when building ingest systems I try to document whatever I can per record. Meaning, each record in a store includes the version of the software that ingested it, and a reference to the rawest form of that data available (e.g. a JSON blob of the original event, or an S3 URL to that event's backing source). This lets you say "We identified a bug in the ingest layer at version 0.1.1; we need to reingest all that data with 0.2.0", then easily identify and remove the exact data affected by that bug (because you recorded 0.1.1 as part of the record itself), and then build a list of exactly which S3 files need to be reingested by 0.2.0.
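As a rough sketch of what that per-record metadata buys you (the `Record` shape, field names, versions, and URIs here are all hypothetical, not any real system's schema):

```python
from dataclasses import dataclass

# Hypothetical record shape: each row carries the version of the ingest
# code that produced it, plus a pointer back to its raw source file.
@dataclass(frozen=True)
class Record:
    payload: dict
    ingest_version: str   # e.g. "0.1.1"
    source_uri: str       # e.g. "s3://bucket/events/a.json"

def affected_sources(records, bad_version):
    """Raw source files that must be reingested because they were
    processed by a buggy ingest version."""
    return {r.source_uri for r in records if r.ingest_version == bad_version}

def purge_bad_records(records, bad_version):
    """Drop every record produced by the buggy version; the survivors
    stay put while the affected sources get reingested by fixed code."""
    return [r for r in records if r.ingest_version != bad_version]

store = [
    Record({"user": 1}, "0.1.1", "s3://bucket/a.json"),
    Record({"user": 2}, "0.2.0", "s3://bucket/b.json"),
    Record({"user": 3}, "0.1.1", "s3://bucket/a.json"),
]

print(sorted(affected_sources(store, "0.1.1")))  # files to reingest with 0.2.0
print(len(purge_bad_records(store, "0.1.1")))    # records untouched by the bug
```

In a real datastore the purge would be a `DELETE ... WHERE ingest_version = '0.1.1'` and the reingest list would feed whatever job runner you have, but the shape of the bookkeeping is the same.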
If you're comfortable expanding your dataset a bit to include that type of metadata, you'll save yourself a lot of time when bad things happen (which they will). It's always a trade-off of metadata/bloat/compute time vs. savings, though.
edit: I'll add that none of this matters if your dataset is small enough to be reimported from 0 in almost no time at all. If you could write a small script that just parses every file in S3 and inserts it into a database, and the time it would take to finish doesn't upset you, you're totally fine just doing that. What I described above is for when your data becomes so large that reimporting from 0 is basically impossible.
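A minimal sketch of that "small enough to reimport from 0" case, using local JSON files in place of S3 objects and an in-memory SQLite table (all file and column names here are made up for illustration):

```python
import json
import sqlite3
import tempfile
from pathlib import Path

def reimport_from_zero(conn, raw_dir):
    """Wipe the table and reload every raw file from scratch.
    This is the whole backfill strategy when the dataset is tiny."""
    conn.execute("DROP TABLE IF EXISTS events")
    conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
    for path in sorted(Path(raw_dir).glob("*.json")):
        for event in json.loads(path.read_text()):
            conn.execute(
                "INSERT INTO events (user_id, action) VALUES (?, ?)",
                (event["user_id"], event["action"]),
            )
    conn.commit()

# Stand-in for the S3 bucket: a temp directory of raw event files.
raw = Path(tempfile.mkdtemp())
(raw / "day1.json").write_text(json.dumps([{"user_id": 1, "action": "login"}]))
(raw / "day2.json").write_text(json.dumps([{"user_id": 2, "action": "logout"}]))

conn = sqlite3.connect(":memory:")
reimport_from_zero(conn, raw)
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # → 2
```

If that loop finishes in seconds, rerun it whenever the model or the ingest code changes; the versioned-metadata machinery above only earns its keep once this stops being feasible.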
In terms of resources, I'm not aware of a one-size-fits-all approach... the most basic would be to define up front what the playbook is for making backfills, and to test it once in a while if you don't get a natural opportunity to do so.
With a normalized and well-defined schema, such inconsistent data is impossible. I guess your point, then, is to have a well-defined process for resolving things when they go awry -- an important point that makes sense.