also, is there a good resource on how to backfill?
I'm not sure what you mean. Software has bugs, data has bugs, etc. Being able to fix a bug and rerun a solution is important in all areas of software; it has nothing to do with logs or analytics specifically (though data and data-model questions are usually important in those domains).
> also, is there a good resource on how to backfill?
Not really, because "backfill" means something different to everyone who holds data. As for which questions to ask first, I would start with "What do we do if a lot of our data shows up incorrect?" and "What do we do if a lot of our data goes missing?", and then solve the problems in your individual data stack that arise from those questions.
As an example, at a previous job our ETL/ELT system all started with a file showing up in an S3 bucket. The code that ingested the contents of those files occasionally had bugs that required reingesting all data processed by that version of the code. Having tools to identify (at the data level) what data was affected by a given bug, and then being able to delete that data from the datastore and reingest only the affected S3 files with a newer version of the ingestion code, made these types of bugs much easier to manage over time.
The reason I asked about resources is that I have data generated by a personal project. The initial data model was sloppy, so now I'm finding myself having to backfill to clean the data, and it's rather painful. I haven't come across anything that deals with the subject, so I'm just winging it on my own.
OP used "log stacks", but log stacks are just a specific flavor of event-based timeseries/analytics stacks. If you were to build a log-ingest and log-aggregation system, you'd just be building an ETL pipeline with a specific emphasis on logging.
> The reason I asked about resources is because I have data generated by a personal project. The initial data model was sloppy and so now I'm finding myself having to backfill to clean the data and it's rather painful. Though I haven't come across anything that deals with the subject so I'm just winging it on my own
Snowflake's stage system works similarly to what I'm describing. You can use S3 as a stage, and then load data from the stage into a table. If something bad happens to your data _in Snowflake_, you can just reload from the Stage (with an updated INSERT).
For more ad-hoc ELT/ETL systems (i.e. not all-in on Snowflake), you'd have to assess your own tooling and build it yourself. In general, when building ingest systems I try to document whatever I can per record. Meaning, each record in a store includes the version of the software that ingested it, and a reference to the rawest form of that data available (e.g. a JSON blob of the original event, or an S3 URL to that event's backing source). This lets you say "We identified a bug in the ingest layer at version 0.1.1; we need to reingest all that data with 0.2.0", then easily identify and remove the exact data affected by that bug (because you recorded 0.1.1 as part of the record itself), and then build a list of exactly which S3 files need to be reingested by 0.2.0.
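As a rough sketch of what that per-record metadata buys you (the `Record` shape, field names, versions, and URIs here are all hypothetical, not any real system's schema):

```python
from dataclasses import dataclass

# Hypothetical record shape: each row carries the version of the ingest
# code that produced it, plus a pointer back to its raw source file.
@dataclass(frozen=True)
class Record:
    payload: dict
    ingest_version: str   # e.g. "0.1.1"
    source_uri: str       # e.g. "s3://bucket/events/a.json"

def affected_sources(records, bad_version):
    """Raw source files that must be reingested because they were
    processed by a buggy ingest version."""
    return {r.source_uri for r in records if r.ingest_version == bad_version}

def purge_bad_records(records, bad_version):
    """Drop every record produced by the buggy version; the survivors
    stay put while the affected sources get reingested by fixed code."""
    return [r for r in records if r.ingest_version != bad_version]

store = [
    Record({"user": 1}, "0.1.1", "s3://bucket/a.json"),
    Record({"user": 2}, "0.2.0", "s3://bucket/b.json"),
    Record({"user": 3}, "0.1.1", "s3://bucket/a.json"),
]

print(sorted(affected_sources(store, "0.1.1")))  # files to reingest with 0.2.0
print(len(purge_bad_records(store, "0.1.1")))    # records untouched by the bug
```

In a real datastore the purge would be a `DELETE ... WHERE ingest_version = '0.1.1'` and the reingest list would feed whatever job runner you have, but the shape of the bookkeeping is the same.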
If you're comfortable expanding your dataset a bit to include that type of metadata, you'll save yourself a lot of time when bad things happen (which they will). It's always a trade-off of metadata/bloat/compute time vs. savings, though.
edit: I'll add that none of this matters if your dataset is small enough to be reimported from 0 in almost no time at all. If you could write a small script that just parses every file in S3 and inserts it into a database, and the time it would take to finish doesn't upset you, you're totally fine just doing that. What I described above is for when your data becomes so large that reimporting from 0 is basically impossible.
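A minimal sketch of that "small enough to reimport from 0" case, using local JSON files in place of S3 objects and an in-memory SQLite table (all file and column names here are made up for illustration):

```python
import json
import sqlite3
import tempfile
from pathlib import Path

def reimport_from_zero(conn, raw_dir):
    """Wipe the table and reload every raw file from scratch.
    This is the whole backfill strategy when the dataset is tiny."""
    conn.execute("DROP TABLE IF EXISTS events")
    conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
    for path in sorted(Path(raw_dir).glob("*.json")):
        for event in json.loads(path.read_text()):
            conn.execute(
                "INSERT INTO events (user_id, action) VALUES (?, ?)",
                (event["user_id"], event["action"]),
            )
    conn.commit()

# Stand-in for the S3 bucket: a temp directory of raw event files.
raw = Path(tempfile.mkdtemp())
(raw / "day1.json").write_text(json.dumps([{"user_id": 1, "action": "login"}]))
(raw / "day2.json").write_text(json.dumps([{"user_id": 2, "action": "logout"}]))

conn = sqlite3.connect(":memory:")
reimport_from_zero(conn, raw)
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # → 2
```

If that loop finishes in seconds, rerun it whenever the model or the ingest code changes; the versioned-metadata machinery above only earns its keep once this stops being feasible.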
In terms of resources, I'm not aware of a one-size-fits-all approach... the most basic would be to define up front what the playbook is for making backfills, and to test it once in a while if you don't get a natural opportunity to do so.
With a normalized and well-defined schema, such inconsistent data is impossible. I guess your point, then, is to have a well-defined process for resolving things when they go awry -- an important point that makes sense.