Show HN: Open-source serverless security lake powered by Rust + Apache Iceberg

(github.com)

63 points · shaeqahmed · 3y ago · 48 comments


remram3y ago

I'm going to regret asking this, but what the hell is a "security lake"? A collection of audit logs?

shaeqahmedOP3y ago

It's a data lake in which you store security logs. That includes Cloud/SaaS audit logs, network security logs (Zeek, Suricata, Snort), VPN/Firewall logs, and more.

remram3y ago

But logs are structured and filtered by their relevance to security. In what way is that a "lake"?

Is "data lake" just the new plural of "dataset"?

aseipp3y ago

People tend to call them "lakes" because, I think, they are "unfiltered" and contain raw data objects and blobs, originally from the source system, unmodified. In a normal data warehouse system you ETL things and the final "load" step stores them in the warehouse, and then you use that as your source of truth. Your data warehouse might be Redshift on Amazon. In the "Data lake" case you instead load everything into something like S3, and then everything uses S3 as the source of truth -- including your query engine, which in this case might be Athena (also on Amazon). I won't go into Redshift vs Athena but if you're familiar with them, this should make sense.

I'd say in like 95% of the cases where I've seen people talking about these things, they basically mean: shove everything into S3 and use that as the canonical source of truth for your data systems, rather than some OLAP system; instead you build the OLAP system off S3.

More simply, I think of it like a term to describe a particular mindset concerning your ETL: always work on the source data. And source data is often messy and unstructured. It's a lot of potentially unstructured and underspecified bullshit. So S3 is pretty good storage for something like that versus datastores with performance/usability cliffs around things like cardinality, fields that come and go, etc...

One advantage of this design I can see is that S3 is very "commodified" by this point (lots of alternative offerings) and can be integrated into nearly every pipeline, and your tools can be replaced more easily, perhaps. S3 is more predictable and "low level" in that regard than something like a database, which comes with many more performance/availability considerations. Like in the example I gave, you could feasibly replace Athena with Trino for instance, without disturbing too much beyond that system. You just need to re-ingest data from S3 for a single system. Whereas if you loaded and ETL'd all your data into a database like Redshift, you might be stuck with that forever even if you later decide it was a mistake. This isn't a hard truth (you might still be stuck with Athena) but just an example of when this might be more flexible.

As usual this isn't an absolute and there are things in-between. But this is generally the gist of it, I think. The "lake" naming is kind of weird but makes some amount of sense I think. It describes a mindset rather than any particular tech.
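To make the "source of truth is the raw store" idea concrete, here is a minimal sketch in Python. It assumes a local directory standing in for an S3 bucket and JSON-lines files standing in for raw log objects; the function names (`land_raw`, `scan`, `count_by`, `failed_logins`) and the sample fields are made up for illustration. Nothing is transformed on the way in, and each "engine" applies its own view of the data at read time, so one can be swapped out without touching what is stored.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake: Path, source: str, lines: list[str]) -> Path:
    """Store raw lines unmodified under a per-source prefix (stand-in for an S3 key prefix)."""
    prefix = lake / source
    prefix.mkdir(parents=True, exist_ok=True)
    part = prefix / f"part-{len(list(prefix.iterdir())):05d}.jsonl"
    part.write_text("\n".join(lines) + "\n")
    return part


def scan(lake: Path, source: str):
    """Schema-on-read: parse at query time and tolerate messy input."""
    for part in sorted((lake / source).glob("*.jsonl")):
        for line in part.read_text().splitlines():
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # raw data is messy; skip lines that do not parse
            if isinstance(record, dict):
                yield record


def count_by(lake: Path, source: str, field: str) -> dict:
    """One hypothetical 'engine': group by a field that may come and go."""
    counts: dict = {}
    for record in scan(lake, source):
        key = record.get(field, "<missing>")
        counts[key] = counts.get(key, 0) + 1
    return counts


def failed_logins(lake: Path, source: str) -> list[dict]:
    """A second 'engine' reading the very same raw files with a different question."""
    return [
        r for r in scan(lake, source)
        if r.get("action") == "login" and r.get("outcome") == "failure"
    ]


lake = Path(tempfile.mkdtemp())
land_raw(lake, "vpn", [
    '{"user": "alice", "action": "login", "outcome": "failure"}',
    '{"user": "bob", "action": "login"}',   # the "outcome" field is absent here
    "not json at all",                       # unparseable line, kept in the lake anyway
])
print(count_by(lake, "vpn", "outcome"))
print(failed_logins(lake, "vpn"))
```

Both functions read the same untouched files, which is the property the comment describes: replacing one reader means re-reading the store, not migrating it.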


warent3y ago

data lake isn't a new term (relatively). I remember first hearing it when I worked at Google like 5 years ago, and the context was always referring to some enormous raw data store. Probably the term "lake" is supposed to evoke a sense of largeness and shapelessness. If you wanted to train a model, you would tap into a data lake which had up to petabytes.


AndrewKemendo3y ago

A single attack surface that makes it easier for adversaries to own your system

geodel3y ago

From GitHub it looks like an implementation of Random Buzzword Lake APIs in JavaScript.

I am so excited, couldn't wait to see more!

wizwit9993y ago

Haha! We use TypeScript for CDK infrastructure automation/deployment. Most of our runtime code is in Rust, some Kotlin, and Python for user-authored detections.

mavam3y ago

Great to see more data engineering in the direction of SecOps.

We do something similar with VAST at https://vast.io. We’re still early, but especially live and retro detection of threat intel is what we are focusing on. Essentially operationalizing security content for detection and response, plus acquiring and extracting context of alerts and telemetry.

We have an experimental serverless deployment with Lambda and Fargate, but the majority of our users still collocate VAST near network sensors like Zeek and Suricata.

We’re running everything on top of Apache Arrow, storage of telemetry is now also Parquet. The idea is to do everything with open standards to minimize vendor lock-in.

happyopossum3y ago

Serverless sounds cool for this at first, but what are the ingest/compute costs going to look like at a modest 20 TB/day? How about 100 TB, or 1 PB?

Honestly, I think at that point it would be better and cheaper to go with a commercial security data lake.

shaeqahmedOP3y ago

Matano is designed specifically for petabyte-scale security log analytics use cases, so performance and costs are a top priority. Our data pipeline borrows from Vector's Rust-based data transformation language [0] for maximal performance, with each parallel function invocation capable of processing upwards of 20MiB/s [1] thanks to auto-vectorization.

Roughly this comes out to $1/(TB/day) for ingest compute costs, which is much cheaper than a commercial solution. We are also working on moving our Lambdas over to ARM for even better cost-efficiency.

[0] https://vector.dev/docs/reference/vrl/ [1] https://vector.dev/docs/setup/going-to-prod/sizing/#sizing
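As a back-of-envelope check on those figures, here is a minimal sketch in Python. The 20 MiB/s per-invocation throughput comes from the comment above; everything else is an assumption for illustration: 1 GB of memory per function, and a per-GB-second price set to a placeholder in the neighbourhood of published on-demand Lambda rates. It covers function compute time only and ignores request fees, S3, Kafka, and query costs, so treat it as an estimate of one line item, not a bill.

```python
MIB = 1024 ** 2   # bytes per MiB
TB = 10 ** 12     # bytes per (decimal) terabyte


def invocation_seconds(bytes_in: float, mib_per_s: float = 20.0) -> float:
    """Total function-seconds needed to push bytes_in through at mib_per_s per invocation."""
    return bytes_in / (mib_per_s * MIB)


def ingest_compute_cost(
    tb_per_day: float,
    mib_per_s: float = 20.0,                 # throughput figure quoted in the thread
    memory_gb: float = 1.0,                  # assumption: memory configured per function
    usd_per_gb_second: float = 0.0000166667, # assumption: placeholder on-demand rate
) -> float:
    """Estimated daily compute spend in USD for the ingest functions alone."""
    seconds = invocation_seconds(tb_per_day * TB, mib_per_s)
    return seconds * memory_gb * usd_per_gb_second


for volume in (1, 20, 100, 1000):  # TB/day; 1000 TB/day is 1 PB/day
    print(f"{volume:>5} TB/day -> ${ingest_compute_cost(volume):,.2f}/day")
```

Under these assumptions the result lands a little under one dollar per terabyte, which is consistent with the rough figure quoted above. Because parallelism only changes how many invocations run at once, not the total function-seconds, the estimate scales linearly with volume.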

OrvalWintermute3y ago

> Roughly this comes out to $1/(TB/day) for ingest compute costs, which is much cheaper than a commercial solution. We are also working on moving our Lambdas over to ARM for even better cost-efficiency.

Given a prior exploration of a security visibility stack on AWS at scale, this is likely a colossal sized operational expense.

electromech3y ago

It: "powered by Rust + Apache Iceberg"

Me: Oh cool, can I run it in my k8s cluster? <clicks link>

It: "designed specifically for AWS"

Me: disappointed and annoyed by title

Looking at that service diagram, "Powered by AWS services" seems more accurate.

mdaniel3y ago

Related, "powered by Rust" I guess is some kind of shorthand way of saying "powered by Kotlin, TypeScript, Python, Rust, but mostly TypeScript"

In fairness to the project, I think that clickbait was just the submission title, I don't see that language in the GH page at all

aliqot3y ago

Over enough time on HN you start to notice patterns of things that are easy low-hanging upvotes. Of course we can't do anything with upvotes, but we can do things with clicks and usage metrics, and I'd imagine that HN is a considered market when it comes to introducing something to the greater tech community. A lot of us tend to have some degree of decision making capability and influence in our orgs.

throwaway0x7E63y ago

>Of course we can't do anything with upvotes

downvote and flag buttons. you only need a handful of alt accounts to get any comment you dislike [dead] or [flagged], and it only takes one successful submission of a yuppie clickbait article per alt

echelon3y ago

We need a solid open source and portable serverless platform. Using lambda gets you locked into lambda.

wizwit9993y ago

There is OpenFaaS (https://www.openfaas.com), but surprisingly there are no hosted OpenFaaS services.

electromech3y ago

There are lots. This is one: https://knative.dev/docs/

wizwit9993y ago

> serverless

> run in my k8s cluster

Those two don't really go together. ;)

Kidding aside, yeah we definitely leverage all the power of AWS services to give a completely serverless experience.

electromech3y ago

There are plenty of FOSS serverless options

https://knative.dev/docs/

https://krustlet.dev

https://www.openfaas.com/

wizwit9993y ago

Thanks, yes, I mentioned OpenFaaS in a sibling and find it very interesting as a possible standard, but realistically there isn't currently a way to deploy OpenFaaS compatible functions to hosted providers (I actually couldn't find anyone offering a hosted OpenFaaS service).

So currently, one would have to maintain the OpenFaaS control plane themselves, which takes away most of the benefits for an end user using serverless (would still have ops and no multi-tenant cost benefit).

gunapologist993y ago

Looks neat, but in what way is this serverless?

It's a pretty complex diagram:

https://github.com/matanolabs/matano/blob/main/website/src/a...

wizwit9993y ago

There are no servers to maintain in the entire architecture. We heavily use Lambda and even use MSK Serverless for Kafka.

georgyo3y ago

This sounds truly nightmarish and costly. Even moderate volumes of data are going to add up very quickly cost-wise.

I see the term zero-ops. But maintaining and debugging this pipeline is going to require some ops, even if you are not managing VMs.

shaeqahmedOP3y ago

Using and maintaining Matano is a fraction of the cost compared to popular non-serverless alternatives like ELK or Splunk. Matano is specifically designed for petabyte-scale security analytics use-cases that don't fit in a traditional SIEM.

The serverless data ingestion pipeline means you don't need to over-provision for ingestion (Logstash and Splunk Forwarders are notorious for related costs / ops in high scale use-cases) in the write path. For reads, since Matano queries Iceberg tables backed by highly-compressed Parquet files on object storage, you won't pay anything close to what you would for a database or search engine based SIEM.


thegagne3y ago

I am pretty sure what they are saying is that the Matano portion of it, which does the security log processing, deploys in a serverless fashion (Lambda I assume?).

This means as a dev you don’t need to maintain a server, or even container image, you just deploy the code, which is less maintenance overhead and more scalable.

The diagram just shows how it interacts with the other components of the log pipeline.

wizwit9993y ago

Thanks for explaining! Yep, that's what we mean. We use serverless services like S3, Lambda, Athena, MSK Serverless, Firehose so you don't have to maintain any servers.

yevpats3y ago

Good to see security moving to data engineering. Shameless plug: we are building similar stuff but for configurations here - https://github.com/cloudquery/cloudquery

https://docs.cloudquery.io/blog/our-open-source-journey-buil...

iJohnDoe3y ago

This sounds very appealing.

Our IDS solution outputs zeek/suricata info to s3 as dns.1234.log.gz, http.1234.log.gz, etc.

Can these files be handled automatically?

shaeqahmedOP3y ago

Yes, they would be handled automatically. Data ingestion is supported through S3 or Kafka, where files are picked up and ETL'd into structured Iceberg tables conforming to an ECS-like schema.

Feel free to join our Discord, happy to walk you through the steps and learn about your use case.
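For a feel of what "picked up and ETL'd into an ECS-like schema" could involve for a file such as dns.1234.log.gz, here is a minimal sketch in Python. It is not Matano's pipeline (which the thread says is built on a Rust-based transformation language); it only assumes Zeek's default tab-separated log layout with a `#fields` header, uses a trimmed-down subset of the `dns.log` columns, and maps them onto a few ECS-style nested field names. The helper names `parse_zeek_tsv` and `to_ecs_like` are hypothetical.

```python
import gzip
import io


def parse_zeek_tsv(stream):
    """Yield one dict per data row, taking column names from the '#fields' header line."""
    fields = []
    for raw in stream:
        line = raw.rstrip("\n")
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]
        elif line and not line.startswith("#"):
            yield dict(zip(fields, line.split("\t")))


def to_ecs_like(row: dict) -> dict:
    """Map flat Zeek columns onto ECS-style nested keys (illustrative subset only)."""
    query = row.get("query", "-")
    return {
        "@timestamp": float(row["ts"]),  # epoch seconds, left unconverted in this sketch
        "source": {"ip": row["id.orig_h"], "port": int(row["id.orig_p"])},
        "destination": {"ip": row["id.resp_h"], "port": int(row["id.resp_p"])},
        "network": {"transport": row["proto"]},
        "dns": {"question": {"name": None if query == "-" else query}},  # '-' means unset in Zeek
    }


# A tiny in-memory stand-in for a gzipped Zeek log object fetched from S3.
sample = (
    "#separator \\x09\n"
    "#fields\tts\tuid\tid.orig_h\tid.orig_p\tid.resp_h\tid.resp_p\tproto\tquery\n"
    "1667232000.000000\tCabc123\t10.0.0.5\t53211\t10.0.0.1\t53\tudp\texample.com\n"
    "#close\t2022-10-31-16-00-00\n"
)
blob = io.BytesIO(gzip.compress(sample.encode()))

with gzip.open(blob, "rt") as fh:
    events = [to_ecs_like(row) for row in parse_zeek_tsv(fh)]

print(events)
```

A real pipeline would also need per-log-type mappings (http, conn, and so on), type coercion for every column, and a write step into table storage, none of which is shown here.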

xenophonf3y ago

I've spent about two hours trying to deploy Matano, and it basically doesn't work as documented, if at all. I got as far as trying to bootstrap my AWS account before giving up. I love the idea of Matano, but this isn't even alpha-quality software at the moment.

chevman3y ago

Security lakes are very 2021, everyone moving to the security lakeHOUSE in 2023 broski!

sandGorgon3y ago

Is this an open-source Snowflake-for-security-logs?

shaeqahmedOP3y ago

It is similar, although Snowflake is more of a query engine whereas we are a cloud security data platform built on an open data model (Apache Iceberg). We help you ingest and normalize data from common security sources into a data lake and offer a serverless platform to deploy & run Python detections-as-code on these events in realtime.

Although the only supported query service is currently Athena, we plan to integrate with popular vendors like Snowflake and Dremio. Thanks to the growing industry support for Iceberg, we believe vendor lock-in should be a story of the past for security data.
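To illustrate what a Python "detection-as-code" rule over normalized events might look like, here is a minimal sketch. The function name `detect`, the `get_path` helper, the trusted network range, and the record layout are all assumptions made for illustration and should not be read as Matano's documented interface; the field names follow ECS-style conventions (`event.action`, `event.outcome`, `source.ip`).

```python
import ipaddress

# Hypothetical allow-list; a real rule would load this from configuration.
TRUSTED_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def get_path(record: dict, dotted: str, default=None):
    """Look up a nested key such as 'source.ip' without raising on missing fields."""
    current = record
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def detect(record: dict) -> bool:
    """Return True for a failed login that originates outside the trusted networks."""
    if get_path(record, "event.action") != "login":
        return False
    if get_path(record, "event.outcome") != "failure":
        return False
    ip = get_path(record, "source.ip")
    if ip is None:
        return False
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        return False  # malformed address: not enough information to alert
    return not any(address in network for network in TRUSTED_NETWORKS)


external = {"event": {"action": "login", "outcome": "failure"}, "source": {"ip": "203.0.113.7"}}
internal = {"event": {"action": "login", "outcome": "failure"}, "source": {"ip": "10.1.2.3"}}
print(detect(external), detect(internal))
```

Keeping the rule a pure function of one record makes it easy to unit test against sample events before it ever sees live data.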

crazyperson13y ago

Respectfully, this guy's description of Snowflake is very wrong - it's much more than a query engine. Snowflake already supports Iceberg format. If anything Snowflake is better described as similar to what OP is making (but for all data, not just security data): a cloud data platform that supports open data models (Apache Iceberg).

wizwit9993y ago

Zero karma account created just to post this comment, hmmm...

Anyway, he acknowledged Snowflake has Iceberg support by planning to integrate with Snowflake.

iJohnDoe3y ago

Thanks for the nice explanation.

