Is "data lake" just the new plural of "dataset"?
I am so excited, couldn't wait to see more!
We do something similar with VAST at https://vast.io. We’re still early, but our focus is live and retrospective matching of threat intelligence: essentially operationalizing security content for detection and response, plus acquiring and extracting context for alerts and telemetry.
We have an experimental serverless deployment with Lambda and Fargate, but the majority of our users still collocate VAST near network sensors like Zeek and Suricata.
We’re running everything on top of Apache Arrow, and telemetry is now also stored as Parquet. The idea is to do everything with open standards to minimize vendor lock-in.
Honestly, I think at that point you’d be better off, and spend less, going with a commercial security data lake.
Roughly, this comes out to $1/(TB/day) in ingest compute costs, which is much cheaper than a commercial solution. We are also working on moving our Lambdas over to ARM for even better cost efficiency.
[0] https://vector.dev/docs/reference/vrl/ [1] https://vector.dev/docs/setup/going-to-prod/sizing/#sizing
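To make that rate concrete, here's a back-of-envelope sketch. Only the $1/(TB/day) figure comes from the comment above; the daily volume and month length are illustrative assumptions:

```python
# Back-of-envelope ingest compute cost at the quoted ~$1 per TB/day rate.
# The 5 TB/day volume is an assumed example, not a measured workload.
tb_per_day = 5            # assumed daily ingest volume
cost_per_tb = 1.0         # $/TB rate quoted in the comment above
days_per_month = 30

monthly_ingest_cost = tb_per_day * cost_per_tb * days_per_month
print(monthly_ingest_cost)  # 150.0 -> ~$150/month of Lambda compute
```

At that scale, even a 10x error in the assumed volume keeps the compute bill in the low thousands per month.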
Having previously explored a security visibility stack on AWS at scale, I suspect this is a colossal operational expense.
Me: Oh cool, can I run it in my k8s cluster? <clicks link>
It: "designed specifically for AWS"
Me: disappointed and annoyed by title
Looking at that service diagram, "Powered by AWS services" seems more accurate.
In fairness to the project, I think that clickbait was just the submission title; I don't see that language on the GitHub page at all.
> run in my k8s cluster
Those two don't really go together. ;)
Kidding aside, yeah we definitely leverage all the power of AWS services to give a completely serverless experience.
It's a pretty complex diagram:
https://github.com/matanolabs/matano/blob/main/website/src/a...
I see the term zero-ops. But maintaining and debugging this pipeline is going to require some ops, even if you are not managing VMs.
This means that, as a dev, you don’t need to maintain a server or even a container image; you just deploy the code, which means less maintenance overhead and better scalability.
The diagram just shows how it interacts with the other components of the log pipeline.
https://docs.cloudquery.io/blog/our-open-source-journey-buil...
Our IDS solution outputs Zeek/Suricata data to S3 as dns.1234.log.gz, http.1234.log.gz, etc.
Can these files be handled automatically?
Feel free to join our Discord, happy to walk you through the steps and learn about your use case.
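For what it's worth, those gzipped Zeek TSV files are straightforward to normalize into records before they hit the lake. This is just a sketch of the general idea, not Matano's actual ingest code; the field layout below is a trimmed-down example of Zeek's dns.log:

```python
# Parse a gzipped Zeek TSV log (e.g. dns.1234.log.gz) into dicts, using the
# "#fields" header directive that Zeek writes at the top of each log.
import gzip
import io

# Simulate the gzipped object fetched from S3 (illustrative fields only).
raw = (
    "#fields\tts\tquery\tanswer\n"
    "1700000000.0\texample.com\t93.184.216.34\n"
)
blob = gzip.compress(raw.encode())

records = []
fields = None
with gzip.open(io.BytesIO(blob), "rt") as f:
    for line in f:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # Column names follow the "#fields" directive, tab-separated.
            fields = line.split("\t")[1:]
        elif not line.startswith("#"):
            records.append(dict(zip(fields, line.split("\t"))))

print(records[0]["query"])  # example.com
```

Suricata's eve.json is even simpler, since each line is already a standalone JSON object.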
Although the only supported query service is currently Athena, we plan to integrate with popular vendors like Snowflake and Dremio. Thanks to the growing industry support for Iceberg, we believe vendor lock-in should be a story of the past for security data.