Is "data lake" just the new plural of "dataset"?
I'd say in like 95% of the cases where I've seen people talk about these things, they basically mean: shove everything into S3 and use that as the canonical source of truth for your data systems, rather than some OLAP system; instead, you build the OLAP system off S3.
More simply, I think of it as a term for a particular mindset about your ETL: always work on the source data. And source data is often messy and unstructured. It's a lot of potentially unstructured and underspecified bullshit. So S3 is pretty good storage for something like that, compared with datastores that have performance/usability cliffs around things like cardinality, fields that come and go, etc.
One advantage of this design I can see is that S3 is very "commodified" by this point (lots of alternative offerings) and integrates with nearly every pipeline, so your tools can perhaps be replaced more easily. S3 is more predictable and "low level" in that regard than something like a database, which comes with many more performance/availability considerations. Like in the example I gave, you could feasibly replace Athena with Trino, for instance, without disturbing too much beyond that system. You just need to re-ingest data from S3 for a single system. Whereas if you loaded and ETL'd all your data into a database like Redshift, you might be stuck with that forever even if you later decide it was a mistake. This isn't a hard truth (you might still be stuck with Athena), just an example of when this might be more flexible.
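To make the "swap the query engine, keep the raw data" point concrete, here's a toy sketch. Everything in it is made up for illustration: `RAW_STORE` is a plain dict standing in for an S3 bucket, and `EngineA` / `EngineB` are hypothetical stand-ins for two query systems (not Athena or Trino themselves). The only point is that both are built purely from the same raw objects, so replacing one means re-ingesting for that one system.

```python
from collections import defaultdict

# Stand-in for an object store: key -> raw record (hypothetical sample data).
RAW_STORE = {
    "logs/2023/01/01/a.json": {"src": "10.0.0.1", "event": "dns"},
    "logs/2023/01/01/b.json": {"src": "10.0.0.2", "event": "http"},
    # Fields can come and go in raw data; the store does not care.
    "logs/2023/01/02/c.json": {"src": "10.0.0.1", "event": "http", "extra": "new"},
}


class EngineA:
    """Hypothetical engine that keeps an index of object keys per event type."""

    def __init__(self):
        self.by_event = defaultdict(list)

    def ingest(self, store):
        for key, record in store.items():
            self.by_event[record["event"]].append(key)

    def count(self, event):
        return len(self.by_event[event])


class EngineB:
    """Hypothetical replacement engine with a different internal layout."""

    def __init__(self):
        self.rows = []

    def ingest(self, store):
        self.rows = sorted(store.items())

    def count(self, event):
        return sum(1 for _, record in self.rows if record["event"] == event)


def build(engine_cls, store):
    """Build a derived system from scratch using only the raw store."""
    engine = engine_cls()
    engine.ingest(store)
    return engine
```

Switching from `build(EngineA, RAW_STORE)` to `build(EngineB, RAW_STORE)` touches nothing upstream, which is roughly the flexibility argument above. A real migration obviously involves more than this (schemas, permissions, query dialects).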
As usual this isn't an absolute and there are things in-between. But this is generally the gist of it, I think. The "lake" naming is kind of weird but makes some amount of sense I think. It describes a mindset rather than any particular tech.
I am so excited, can't wait to see more!
We do something similar with VAST at https://vast.io. We’re still early, but especially live and retro detection of threat intel is what we are focusing on. Essentially operationalizing security content for detection and response, plus acquiring and extracting context of alerts and telemetry.
We have an experimental serverless deployment with Lambda and Fargate, but the majority of our users still colocate VAST with network sensors like Zeek and Suricata.
We’re running everything on top of Apache Arrow, and telemetry is now also stored as Parquet. The idea is to do everything with open standards to minimize vendor lock-in.
Honestly, I think at that point you’d be better off, and it’d be cheaper, going with a commercial security data lake.
Roughly this comes out to $1/(TB/day) for ingest compute costs, which is much cheaper than a commercial solution. We are also working on moving our Lambdas over to ARM for even better cost-efficiency.
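For a back-of-the-envelope feel for that number, here's a tiny sketch. It assumes the figure is read as roughly $1 of ingest compute per TB ingested (that reading is my assumption, not something stated above), and it ignores storage, query, and data transfer costs entirely. The function name and the 30-day default are just illustrative.

```python
def ingest_compute_cost(tb_per_day: float, days: int = 30, usd_per_tb: float = 1.0) -> float:
    """Estimate ingest compute cost in USD over a period.

    Assumes a flat rate of `usd_per_tb` dollars per TB ingested
    (an interpretation of the "$1/(TB/day)" figure, not a quoted price).
    """
    return tb_per_day * days * usd_per_tb


# Example: a steady 5 TB/day over a 30-day month.
monthly = ingest_compute_cost(5)
print(f"${monthly:.2f} per month for ingest compute")
```

Treat it as an order-of-magnitude check only; real bills depend on log formats, transformations, and how bursty the traffic is.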
[0] https://vector.dev/docs/reference/vrl/
[1] https://vector.dev/docs/setup/going-to-prod/sizing/#sizing
Based on a prior exploration of a security visibility stack on AWS at scale, this is likely a colossal operational expense.
Me: Oh cool, can I run it in my k8s cluster? <clicks link>
It: "designed specifically for AWS"
Me: disappointed and annoyed by title
Looking at that service diagram, "Powered by AWS services" seems more accurate.
In fairness to the project, I think that clickbait was just the submission title, I don't see that language in the GH page at all
Downvote and flag buttons. You only need a handful of alt accounts to get any comment you dislike [dead] or [flagged], and it only takes one successful submission of a yuppie clickbait article per alt.
> run in my k8s cluster
Those two don't really go together. ;)
Kidding aside, yeah we definitely leverage all the power of AWS services to give a completely serverless experience.
So currently, one would have to maintain the OpenFaaS control plane themselves, which takes away most of the benefits for an end user using serverless (would still have ops and no multi-tenant cost benefit).
It's a pretty complex diagram:
https://github.com/matanolabs/matano/blob/main/website/src/a...
I see the term zero-ops. But maintaining and debugging this pipeline is going to require some ops, even if you are not managing VMs.
The serverless data ingestion pipeline means you don't need to over-provision for ingestion in the write path (Logstash and Splunk forwarders are notorious for the related costs and ops in high-scale use cases). For reads, since Matano queries Iceberg tables backed by highly compressed Parquet files on object storage, you won't pay anything close to what you would for a database- or search-engine-based SIEM.
This means as a dev you don’t need to maintain a server, or even container image, you just deploy the code, which is less maintenance overhead and more scalable.
The diagram just shows how it interacts with the other components of the log pipeline.
https://docs.cloudquery.io/blog/our-open-source-journey-buil...
Our IDS solution outputs Zeek/Suricata info to S3 as dns.1234.log.gz, http.1234.log.gz, etc.
Can these files be handled automatically?
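Not speaking for the project, but as a generic sketch of what "handled automatically" could look like for names of that shape: pull the log type out of the object key and route on it. The pattern below assumes keys look like `<log_type>.<number>.log.gz` as in the examples above; the `parse_key` helper and the `zeek/` prefix are hypothetical, and this is not Matano's actual mechanism.

```python
import posixpath
import re

# Matches names like "dns.1234.log.gz": a log type, a numeric part, then ".log.gz".
# The meaning of the numeric part is not specified above, so it is kept as-is.
_KEY_PATTERN = re.compile(r"^(?P<log_type>[a-z0-9_]+)\.(?P<seq>\d+)\.log\.gz$")


def parse_key(key: str):
    """Return (log_type, seq) for a matching object key, or None otherwise."""
    name = posixpath.basename(key)  # drop any prefix such as "zeek/"
    match = _KEY_PATTERN.match(name)
    if match is None:
        return None
    return match.group("log_type"), int(match.group("seq"))


def group_by_log_type(keys):
    """Group object keys by log type, skipping keys that do not match."""
    groups = {}
    for key in keys:
        parsed = parse_key(key)
        if parsed is not None:
            groups.setdefault(parsed[0], []).append(key)
    return groups
```

Each group could then be pointed at whatever table or parser fits that log type; keys that don't match are left alone rather than guessed at.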
Feel free to join our Discord, happy to walk you through the steps and learn about your use case.
Although the only supported query service is currently Athena, we plan to integrate with popular vendors like Snowflake and Dremio. Thanks to the growing industry support for Iceberg, we believe vendor lock-in should be a story of the past for security data.
Anyway, he acknowledged Snowflake has Iceberg support by planning to integrate with Snowflake.