Since some number of persistent connections will get force-terminated on scale-down or node-replacement events...
Cilium and eBPF look like a pretty good solution to this, though, since you can then advertise your pods directly on the network and load balance across those instead of across every node.
There can be a difference, if your LoadBalancer-type service integration is well implemented. The externalTrafficPolicy knob determines whether all nodes should attract traffic from outside, or only the nodes that run pods backing the service. For example, MetalLB (which attracts traffic by announcing /32 routes over BGP to configured external peers) will do this correctly.
Within the cluster itself, only nodes which have pods backing a given service will be part of the iptables/ipvs/... Pod->Service->Pod mesh, so you won't end up with scenic routes anyway. Same for Pod->Pod networking, as these addresses are already clustered by host node.
https://kubernetes.io/docs/concepts/services-networking/serv...
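A minimal sketch of the knob being discussed, as a Service manifest built as a plain Python dict (the same shape you'd write in YAML). The service name, labels, and ports here are made up for illustration:

```python
# Hypothetical LoadBalancer Service. externalTrafficPolicy controls which
# nodes attract external traffic:
#   "Cluster" (default): every node attracts traffic and may add an extra hop.
#   "Local": only nodes with ready backing pods attract traffic, which also
#            preserves the client source IP.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "web"},               # hypothetical name
    "spec": {
        "type": "LoadBalancer",
        "selector": {"app": "web"},            # hypothetical pod label
        "ports": [{"port": 80, "targetPort": 8080}],
        "externalTrafficPolicy": "Local",
    },
}

print(service["spec"]["externalTrafficPolicy"])
```

With "Local", an implementation like MetalLB only announces the service address from nodes that actually have backing pods.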
May I ask what one might use in an AWS cloud environment to provide that load balancer within a Region?
Does IPv6 address any of these issues? It seems to me that IPv6 is capable of providing every component in the system its own globally routable address, identity (mTLS perhaps) and transparent encryption with no extra sidecars, eBPF pieces, etc.
Each "Service" object provides (by default; it can be disabled) a load-balanced IP address that uses kube-proxy as you described, a DNS A record pointing to said address, DNS SRV records pointing to actual direct connections (whether NodePorts or PodIP/port combinations), plus API access to get the same data out.
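The DNS names in question follow a fixed scheme, which a small sketch can make concrete (service "web" in namespace "prod" are made-up examples; the name format itself is the standard one kube-dns/CoreDNS publishes):

```python
# The A record resolves to the service's load-balanced ClusterIP; SRV records
# exist per named port and also carry the port number in their answer.
def service_a_name(svc, ns, zone="cluster.local"):
    return f"{svc}.{ns}.svc.{zone}"

def service_srv_name(port_name, proto, svc, ns, zone="cluster.local"):
    return f"_{port_name}._{proto}.{svc}.{ns}.svc.{zone}"

print(service_a_name("web", "prod"))
# web.prod.svc.cluster.local
print(service_srv_name("http", "tcp", "web", "prod"))
# _http._tcp.web.prod.svc.cluster.local
```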
There are even replacement kube-proxy implementations that route everything through F5 load-balancer boxes, but they are less well known.
If you didn’t have k8s and just used an autoscaling group of VMs you would have the same issue…
Per https://blog.dave.tf/post/new-kubernetes/ , the way that this was solved in Borg was:
> "Borg solves that complexity by fiat, decreeing that Thou Shalt Use Our Client Libraries For Everything, so there’s an obvious point at which to plug in arbitrarily fancy service discovery and load-balancing. "
Which seems like a better solution, even if it requires some reengineering of apps.
There are two projects that enable writing eBPF with Rust [1][2]. I'm sure there is an equivalent with nicer wrappers for C++.
But once you have more than one language or framework, you need to write more and more of these libraries. And what happens if it's not just HTTP? What if you need to speak Redis, MySQL, or some random binary protocol? Do you write client libraries for those too? Maybe a company like Google has the scale to do this, but most orgs do not. But even then, what if you have to run some vendor-supplied code that you don't even have source for?
I agree with you that shoving more of this into the kernel isn't desirable, but libraries aren't great. Been there, done that, don't feel like doing it again. I'd rather stick with sidecars.
The second aspect is that this can get extremely expensive if your applications are written in a wide number of language frameworks. That's obviously different at Google where the number of languages can be restricted and standardized.
But even then, you could also link a TCP library into your app. Why don't you?
There is a set of sanctioned languages, and that is about it.
Also, as an aside, I think WebAssembly has the potential to shift this back. In a world where libraries and programs are compiled to WebAssembly, it doesn't matter what their source language was, and as such, the client library based approach might swing back into vogue.
This probably dictates a lot of Google's famous "not invented here" behavior, but most organizations can't afford to just write their entire toolchain from scratch and need to use applications developed by third parties.
But that doesn't work when you're trying to sell enterprises the idea of 'just move your workloads to Kubernetes!'. :)
I like that approach. If you use client libraries, new RPC mechanisms are "free" to implement (until you need to troubleshoot upgrades). It's also an argument against statically linking.
For instance, if running services on the same machine, io_uring can probably be used? (I'm a noob at this.) eBPF for packet switching/forwarding between different hosts, etc.
Also, there's no need for it to be a .so (or .dll/.dylib): just some quick IPC to send messages around. It can actually be better. For one, if their app is still buffering messages, my app can exit while theirs still runs. Or for security reasons (or for not having to think about them), etc. So still statically linked, but processes talking to each other. (Granted, this does not always work for some special apps, like audio/video plugins, but I think it works fine for the case above.)
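The "statically linked processes talking over IPC" idea can be sketched in a few lines: instead of linking vendor code into the app, run it as a separate process and exchange length-prefixed JSON messages over a socket pair. The message shape here is invented for illustration:

```python
import json, os, socket, struct

def send_msg(sock, obj):
    # Length-prefixed JSON framing: 4-byte big-endian length, then the payload.
    data = json.dumps(obj).encode()
    sock.sendall(struct.pack("!I", len(data)) + data)

def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection")
        buf += chunk
    return buf

def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length))

parent_sock, child_sock = socket.socketpair()
if os.fork() == 0:                 # "their" process: a tiny echo service
    parent_sock.close()
    send_msg(child_sock, {"echo": recv_msg(child_sock)["hello"]})
    os._exit(0)                    # it can exit independently of ours
child_sock.close()
send_msg(parent_sock, {"hello": "world"})
reply = recv_msg(parent_sock)
os.wait()                          # reap the child process
print(reply)                       # {'echo': 'world'}
```

Either side can be restarted or upgraded without relinking the other, which is the point being made above.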
The sockets layer is becoming a facade that can guarantee additional things to applications compiled against it, and you've got dependency injection, so the application layer can be written agnostically and not care about any of those concerns at all.
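A minimal sketch of that facade-plus-injection shape, with all names hypothetical: the application depends only on an abstract dialer, and the platform injects whichever transport it wants (plain TCP, mTLS, a mesh proxy, or an in-memory fake for tests).

```python
from typing import Callable

class Connection:
    """Abstract transport facade the application codes against."""
    def send(self, data: bytes) -> None: ...
    def recv(self) -> bytes: ...

class FakeConnection(Connection):
    """In-memory transport; a real injected dialer might wrap TLS or a mesh."""
    def __init__(self):
        self.buf = b""
    def send(self, data):
        self.buf += data
    def recv(self):
        return self.buf

# The platform injects this; the app never constructs sockets itself.
Dialer = Callable[[str, int], Connection]

def fetch_greeting(dial: Dialer, host: str) -> bytes:
    conn = dial(host, 443)   # app code doesn't care which transport this is
    conn.send(b"hello")
    return conn.recv()

fake = FakeConnection()
greeting = fetch_greeting(lambda host, port: fake, "svc.internal")
print(greeting)              # b'hello'
```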
If some developers want to use some new language, they have to first put in the effort by a) demonstrating the business case for the new language and allocating resources to integrate it into the ecosystem, and b) porting all of the shared codebase to that new language.
This is basically the same set of languages people were writing 20 years ago and will probably be the same set of languages people will write in 20 years from now.
For that reason, I really do think that this is a temporary hack while client libraries are brought up to speed in popular languages. It is really easy to sell stuff with "just add another component to your house of cards to get feature X", but eventually it's all too much and you'll have to just edit your code.
I personally don't use service meshes. I have played with Istio but the code is legitimately awful, so the anecdotes of "I've never seen it work" make perfect sense to me. I have, in fact, never seen it work. (Read the xDS spec, then read Istio's implementation. Errors? Just throw them away! That's the core goal of the project, it seems. I wrote my own xDS implementation that ... handles errors and NACKs correctly. Wow, such an engineering marvel and so difficult...)
I do stick Envoy in front of things when it seems appropriate. For example, I'll put Envoy in front of a split frontend/backend application to provide one endpoint that serves both the frontend and backend. That way production is identical to your local development environment, avoiding surprises at the worst possible time. I also put it in front of applications that I don't feel like editing and rebuilding to get metrics and traces.
The one feature that I've been missing from service meshes, Kubernetes networking plugins, etc. is the ability to make all traffic leave the cluster through a single set of services, which can see the cleartext of TLS transactions. (I looked at Istio specifically, because it does have EgressGateways, but it's implemented at the TCP level and not the HTTP level. So you don't see outgoing URLs, just outgoing IP addresses. And if someone is exfiltrating data, you can't log that.) My biggest concern with running things in production is not so much internal security, though that is a big concern, but rather "is my cluster abusing someone else". That's the sort of thing that gets your cloud account shut down without appeal, and I feel like I don't have good tooling to stop that right now.
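The TCP-level vs HTTP-level distinction can be made concrete with a toy parser (hostnames and paths below are invented): an HTTP-aware egress gateway sees the request line and Host header, while TLS passthrough via CONNECT only ever shows host:port.

```python
def egress_log_entry(request_head: bytes) -> str:
    # Split off the request line, e.g. b"GET /path HTTP/1.1".
    line, _, rest = request_head.partition(b"\r\n")
    method, target, _ = line.split(b" ", 2)
    if method == b"CONNECT":
        # TLS passthrough: only the destination host:port is visible,
        # which is why a TCP-level egress gateway can't log URLs.
        return f"tunnel to {target.decode()}"
    headers = dict(
        h.split(b": ", 1) for h in rest.split(b"\r\n") if b": " in h
    )
    host = headers.get(b"Host", b"?").decode()
    return f"{method.decode()} http://{host}{target.decode()}"

http_entry = egress_log_entry(
    b"GET /report?user=42 HTTP/1.1\r\nHost: example.net\r\n\r\n"
)
tls_entry = egress_log_entry(b"CONNECT example.net:443 HTTP/1.1\r\n\r\n")
print(http_entry)  # GET http://example.net/report?user=42
print(tls_entry)   # tunnel to example.net:443
```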
Why not? AFAIK traces are sent from the instrumented app to some tracing backend, and a trace-id is carried over via an HTTP header from the entry point of the request until the last service that takes part in that request. Why a sidecar/mesh would break this?
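The propagation being described is just header copying, which a few lines can show. The header name is borrowed from the W3C Trace Context convention, but the value format here is simplified, not spec-compliant, and the service functions are hypothetical:

```python
import uuid

TRACE_HEADER = "traceparent"

def handle_request(inbound_headers: dict) -> dict:
    # Reuse the caller's trace id, or mint one at the entry point.
    # (Simplified id format; the real traceparent value has a fixed layout.)
    trace = inbound_headers.get(TRACE_HEADER) or f"00-{uuid.uuid4().hex}-01"
    # Every outbound call from this service carries the same trace id,
    # so the tracing backend can stitch the hops together. No sidecar needed.
    return {TRACE_HEADER: trace}

entry = handle_request({})           # entry point starts the trace
downstream = handle_request(entry)   # next hop propagates it unchanged
print(entry[TRACE_HEADER] == downstream[TRACE_HEADER])  # True
```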
Kinda semi-offtopic, but I am curious to know if anyone has used the identity part of a WireGuard setup for this purpose.
So say you have a bunch of machines all connected in a WireGuard VPN. And then instead of your application knowing host names or IP addresses as the primary identifier of other nodes, your application refers to other nodes by their WireGuard public key?
I use WireGuard but haven’t tried anything like that. Don’t know if it would be possible or sensible. Just thinking and wondering.
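A toy sketch of what "address by WireGuard public key" might look like (keys and tunnel addresses below are made up): keep a table mapping peer public keys to their tunnel IPs, which is roughly the association WireGuard itself maintains via each peer's AllowedIPs, and have the application resolve a pubkey instead of a hostname.

```python
# Hypothetical peer table: WireGuard public key -> tunnel address.
PEERS = {
    "xTIBA5rboUvnH4htodjb6e697QjLERt1NAB4mZqp8Dg=": "10.0.0.2",
    "TrMvSoP4jYQlY6RIzBgbssQqY3vxI2Pi+y71lOWWXX0=": "10.0.0.3",
}

def dial_peer(pubkey: str) -> str:
    """Resolve a peer's identity (its public key) to a routable address.

    The cryptographic guarantee comes from WireGuard itself: traffic to
    that tunnel IP can only be decrypted by the holder of the private key.
    """
    try:
        return PEERS[pubkey]
    except KeyError:
        raise LookupError("unknown peer; not in our WireGuard config") from None

print(dial_peer("xTIBA5rboUvnH4htodjb6e697QjLERt1NAB4mZqp8Dg="))  # 10.0.0.2
```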
So yeah, it's a model that can work. It's straightforward for us because we have a lot of granular control over what can get addressed where. It might be trickier if your network model is chaotic.
I long for the day where Kubernetes services, virtual machines, dedicated servers and developer machines can all securely talk to each other in some kind of service mesh, where security and firewalls can be implemented with "tags".
Tailscale seems to be pretty much this, but while it seems great for the dev/user-facing side of things (developer machine connectivity), it doesn't seem like it's suited for the service-to-service communication side? It would be nice to have one unified connectivity solution with identity-based security rather than, e.g., Consul Connect for services, Tailscale / WireGuard for dev machine connectivity, etc.
That's exactly what Scalable Group Tags (SGTs) are -
https://tools.ietf.org/id/draft-smith-kandula-sxp-07.html
Cisco implements this as a part of TrustSec
In addition, it can also be used to enforce policy based on service-specific keys/certificates.
There's earlier projects like CJDNS which provide pubkey-addressed networking, but they're limited in usability as they route based on a DHT.
Disclosure: founder of company which sells SaaS on top of Ziti FOSS.
* https://ziti.dev/blog/bootstrapping-trust-part-5-bootstrappi...
Are they implementing an HTTP/2-capable proxy in native kernel C code and making APIs to it accessible via eBPF?
We are doing both. Aspect 2) is currently done for HTTP visibility and we will be working on connection splicing and HTTP header mutation going forward.
Sidecars are often useful for platform-centric teams that would like to help manage something like secrets, mTLS, or traffic shaping in the case of Envoy. The team that's responsible for that just needs to maintain a single sidecar rather than all of the potential SDKs for every team.
Especially if you have specific sidecars that only work on specific infrastructure, for example a Vault sidecar that deals with secrets for your service via EKS IAM permissions, you suddenly can't start your service without a decent amount of mocking and feature flags. It's nice not to have to burden your client code with all of that.
Also, there is a decent amount of work being done on gRPC to speak XDS which also removes the need for the sidecar [0].
It depends on your environment and architecture combined with how fast you can move especially with third party components. Having the microservice be 'dumb' can save everything.
Could you please elaborate on this? I don't fully understand what you mean. In particular, I don't understand whether "It's nice not to have to burden your client code with all of that" applies to a setup with or without sidecars.
Being able to pick up something generic rather than something language-specific.
Not having to do process supervision (which includes handling monitoring and logs) within your application.
Not making the application lifecycle subservient to needs such as log shipping and request rerouting. People get sig traps wrong surprisingly often.
Which is not bad. But that area is also often misconfigured for supervision. And trapping signals remains mostly broken in all sidecars.
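For reference, the shape people usually get wrong can be sketched in a few lines: trap SIGTERM, let in-flight work finish, then stop, rather than ignoring the signal or dying mid-request. This is a minimal sketch, with a made-up job loop:

```python
import os
import signal

shutting_down = False

def _on_term(signum, frame):
    # Handlers should do almost nothing: just record that a drain was
    # requested and let the main loop act on it.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, _on_term)

def worker_loop(jobs):
    done = []
    for job in jobs:
        if shutting_down:   # checked between jobs, never mid-job
            break
        done.append(job)    # stand-in for real work
    return done

# Simulate the supervisor (or kubelet) sending SIGTERM before work starts:
os.kill(os.getpid(), signal.SIGTERM)
drained = worker_loop(["a", "b", "c"])
print(drained)
```

Common mistakes are installing no handler at all (so the process dies mid-request) or forwarding signals incorrectly when a sidecar or supervisor sits in front of the app.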
Yea, if you're getting TLS termination at the load balancer prior to k8s ingress then it's pretty nice.
This is not too different from wpa_supplicant, used by several operating systems for key management on wireless networks. The complicated key negotiation and authentication can remain in user space; the encryption with the negotiated key can be done in the kernel (kTLS) or, when eBPF can control both sides, it can even be done without TLS by encrypting with a network-level encapsulation format, so it works for non-TCP as well.
Hint: We are hiring.
The work is still in the early phases, so the exact form this will take has yet to be hammered out, but there's broad agreement that this functionality will be first-class in k8s in the future. If you want to keep running proxies for the other features they provide, great: they'll be able to use the certificates provided by k8s for identity. If you'd like to know more, come to one of the SIG Auth meetings :)
The current security model in Istio delivers a pod specific SPIFFE cert to only that pod and pod identity is conveyed via that certificate.
That feels like a whole bunch of eggs in 1 basket.
https://github.com/grpc/proposal/blob/master/A27-xds-global-...