Down at the bottom, in the 'things we probably will never do' bucket, is where IPv6 comes in the door.
Azure (for instance) is a fully IPv6-enabled fabric. Microsoft "get" IPv6. They are all over it. They understand it; it's baked into the DNA. So how come K8s people just kind of think "yea.. nah.. not right now"?
Because proxying IPv6 at the edge is really sucky. We should be using native IPv6, preserve e2e under whatever routing model we need for reliability, and gateway v4 through proxies in the longer term.
(serious Q btw)
The issue [2] has existed for over 3 years, so it's not a new suggestion.
Not to diminish the very real challenges in getting IPv6 implemented, but this is an interesting turn of phrase, especially because rolling out IPv6 would actually solve a whole class of problems (and I'm not even a particularly big advocate of the need for IPv6, since most things should still be NATed anyway).
(And especially considering parent's phrase "baked into the DNA" at Azure.)
I guess even today many people have problems getting more than a /64 in the office or home network (edit: most ISPs do support it, via the prefix delegation option in DHCPv6), so it's not frictionless in the dev environment.
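To make the prefix delegation point concrete, here's a small sketch (addresses are made-up documentation ranges, and the /56 delegation size is just one common assumption) of how much address space a single delegated prefix actually yields:

```python
import ipaddress

# Hypothetical numbers: suppose an ISP delegates a /56 via DHCPv6
# prefix delegation. That single delegation contains 256 /64s to
# hand out to downstream segments (one per VLAN, lab network, etc.).
delegated = ipaddress.ip_network("2001:db8:1234:ab00::/56")
subnets = list(delegated.subnets(new_prefix=64))

print(len(subnets))   # 256 usable /64s
print(subnets[0])     # 2001:db8:1234:ab00::/64
```

So the friction isn't a shortage of space; a single delegation is plenty for a dev environment, it's just that getting the delegation set up isn't always turnkey.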
Everything I've seen in their networking configuration screens and APIs appears to only allow IPv4 addresses.
Alas no. When I looked at k8s/Azure the IPv6 support was new.
My comment about IPv6 'baked into the DNA' of Microsoft is about Microsoft, not Azure. A lot of the work on privacy addresses, the deployment of Teredo, and the adoption of ULA addresses comes from people inside Microsoft, and they have been presenting recently at NANOG and the IETF on IPv6-only deployments on the Redmond campus.
All containers can communicate with all other containers without NAT.
All nodes can communicate with all containers (and vice-versa) without NAT.
The IP that a container sees itself as is the same IP that others see it as.
When using Docker by itself, you get into all sorts of complicated situations because most running containers have an IP address that's host-specific and not routable for any other machines. This makes networking across hosts a giant pain. Kubernetes takes that away by making things behave exactly how you'd hope they'd behave. My IP as I see it is reachable by anybody in the cluster who has it (policy permitting).
The simplicity of working in this networking model means there's a little more work for the networking infrastructure to handle: making sure that IPs are allocated without collision and that routes are known across many hosts. Several technologies exist to build these bridges, including old-school tech like BGP that has solved these exact problems for decades (see Calico/Canal).
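A minimal sketch of that bookkeeping (node names and CIDR ranges here are made up, and this is not any real CNI's implementation): carve one flat pod CIDR into non-overlapping per-node blocks, so pod IPs can't collide and each node needs only one route per peer node.

```python
import ipaddress

# Hypothetical cluster-wide pod range, split into one /24 per node.
cluster_cidr = ipaddress.ip_network("10.244.0.0/16")
per_node = cluster_cidr.subnets(new_prefix=24)

allocation = {node: next(per_node) for node in ["node-a", "node-b", "node-c"]}

for node, cidr in allocation.items():
    # Every peer installs a single route: "{cidr} via {node}".
    print(f"{node}: pod IPs allocated from {cidr}")
```

Whether those per-node routes are programmed by a cloud route table, static config, or BGP is exactly the pluggable part.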
Ultimately, there's no silver bullet. I'd recommend giving the k8s networking page a read. [1]
[1] https://kubernetes.io/docs/concepts/cluster-administration/n...
The main difference is that Kubernetes assumes that all IPs are routable and Docker does not. When using bridge networking, this means the admin must ensure routes are properly configured on the host for cross-host communication on Kubernetes.
Docker does not provide cross-host service discovery for bridge networking out of the box. This does not prevent admins from setting this up themselves.
For overlay networking solutions (e.g. Weave), the cross-host networking is handled for you and typically still even uses bridge networking to provide container connectivity, with service discovery also working cross-host.
ipvlan and macvlan are "underlay" solutions (i.e. attached directly to the host networking interfaces). For these it is expected that the admin has configured the networking and that containers on different hosts are routable. Service discovery should work across hosts with these solutions, but actual networking depends on how the host networking is set up, because the containers will be assigned IPs from the host's network and are bound to a particular host network interface.
When using ipvlan or macvlan (or overlay networking for that matter), Docker effectively makes the same assumptions as Kubernetes does for its networking.
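The difference in assumptions can be sketched in a few lines (subnets are made up; commands implied are Linux static routes, not anything Docker-specific):

```python
import ipaddress

# With Docker's default bridge, every host commonly uses the same
# docker0 range, so container IPs overlap across hosts and can't
# simply be routed between machines:
default_a = ipaddress.ip_network("172.17.0.0/16")
default_b = ipaddress.ip_network("172.17.0.0/16")
print(default_a.overlaps(default_b))  # True -- not cross-host routable

# Under the Kubernetes assumption, each host's bridge gets a distinct
# subnet, so the admin can add one static route per peer host and
# every container IP becomes reachable cluster-wide:
host_a = ipaddress.ip_network("172.17.1.0/24")
host_b = ipaddress.ip_network("172.17.2.0/24")
print(host_a.overlaps(host_b))        # False -- routable via host routes
```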
I notice that you conveniently left out the "ingress" component. Stuff in K8s talking with other K8s stuff is easy. Getting flows into K8s apps from outside the K8s network is amazingly clunky in its current state.
https://medium.com/google-cloud/understanding-kubernetes-net...
This is the first of a two-part series, the first dealing with pod networking and the second with services. I plan a third on ingress after kubecon. It's a little GKE-specific in the implementation details, and the whole thing is pluggable and can be configured in different ways (as the OP shows), but I think it covers the fundamentals pretty well.
* Part 1: https://medium.com/@ApsOps/an-illustrated-guide-to-kubernete...
* Part 2: https://medium.com/@ApsOps/an-illustrated-guide-to-kubernete...
> This is especially problematic where the connected next-hop e.g. switch is expecting frames from a specific mac from a specific port.
e.g.: if the host is attached to a managed switch with a strict security policy, macvlan would not work.
Obviously it needs a switch on the other side that can handle a huge, quickly changing ARP table. Also, if you have MAC address limiting (typical on edge switches), it's a non-starter.
If you do something like advertise a /32 for each container you can very quickly fill up TCAMs on your network hardware (in particular cheap top of rack switches that are pervasive in data centers).
The entire v4 internet is something like 600k prefixes right now, and the routers that can handle that many prefixes at line rate are irritatingly expensive. ToRs, as of a couple of years ago when I last tested this, would fall over at 1-10k prefixes.
So be careful when looking at BGP solutions because it's very easy to have a BGP topology that doesn't scale, despite it being the exchange protocol for the Internet.
Assuming everything is nice and hierarchical, you can easily aggregate an entire rack to a single prefix. Even the shitty ToR switches can usually handle a couple thousand prefixes, which should be plenty if done correctly.
Obviously you shouldn't be advertising /32s.
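The aggregation argument above can be demonstrated directly (addresses are made up; this just shows the arithmetic, not any particular BGP setup): a rack's worth of per-container /32s collapses into a single prefix, so the upstream gear carries one route instead of hundreds.

```python
import ipaddress

# 256 hypothetical per-container /32 routes in one rack's range.
per_container = [ipaddress.ip_network(f"10.1.0.{i}/32") for i in range(256)]

# Contiguous /32s collapse into a single covering prefix.
aggregated = list(ipaddress.collapse_addresses(per_container))

print(len(per_container))  # 256 TCAM entries if advertised individually
print(aggregated)          # [IPv4Network('10.1.0.0/24')] -- one entry
```

That 256-to-1 reduction is why a hierarchical addressing plan keeps even cheap ToR switches comfortably within their prefix limits.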
> The entire v4 internet is something like 600k prefixes right now ...
Just checked my edge routers and it looks like we're up to ~671k prefixes here, and that number is still increasing every day.
I read your comment as, "Don't use technology that you can misconfigure, because you can misconfigure it!". Well yeah, the same can be said with anything networking related.
Rolls right off the tongue, doesn't it?
To the extent your requirements match theirs, this could be a good alternative. The most significant in my mind is that it's meant to be used in conjunction with Envoy. Envoy itself has its own set of design tradeoffs as well.
For example, Lyft currently uses 'service-assigned EC2 instances'. Not hard to see how this starting point would influence the design. The Envoy/Istio model of proxy per pod also reflects this kind of workload partitioning. Obviously, a design for a small number of pods (each with their own proxy) per instance is going to be very different from one that needs to handle 100 pods (and their IPs), or more, per instance.
Another is that k8s network policy can't be applied since the 'Kubernetes Services see connections from a node’s source IP instead of the Pod’s source IP'. But I don't think this CNI is intended to work with any other network policy API enforcement mechanism. Romana (the project I work on) and the other CNI providers that use iptables to enforce network policy rely on seeing the pod's source IP.
Again, this might be fine if you're running Envoy. On the other hand, L3 filtering on the host might be important.
Also, this design requires that 'CNI plugins communicate with AWS networking APIs to provision network resources for Pods'. This may or may not be something you want your instances to do.
FWIW, Romana (the project I work on) lets you build clusters larger than 50 nodes without an overlay or more 'exotic networking techniques' or 'massive' complexity. It does this via simple route aggregation, using completely standard networking.
>"Unfortunately, AWS’s VPC product has a default maximum of 50 non-propagated routes per route table, which can be increased up to a hard limit of 100 routes at the cost of potentially reducing network performance."
Could someone explain why increasing from 50 to 100 non-propagated routes in a VPC results in network performance degradation?
> Lincoln Stoll’s k8s-vpcnet, and more recently, Amazon’s amazon-vpc-cni-k8s CNI stacks use Elastic Network Interfaces (ENIs) and secondary private IPs to achieve an overlay-free AWS VPC-native solutions for Kubernetes networking. While both of these solutions achieve the same base goal of drastically simplifying the network complexity of deploying Kubernetes at scale on AWS, they do not focus on minimizing network latency and kernel overhead as part of implementing a compliant networking stack.
https://github.com/aws/amazon-vpc-cni-k8s/blob/master/propos...
There they clearly state:
>"To run Kubernetes over AWS VPC, we would like to reach following additional goals:
Networking for Pods must support high throughput and availability, low latency and minimal jitter comparable to the characteristics a user would get from EC2 networking"