I was skeptical when we first deployed it but we've found it to be dependable and fast. We're running it in production on six CoreOS servers and 400-500 containers.
We did evaluate Project Calico initially, but some performance tests we found tipped the scales in favor of flannel. [3] That was about a year ago, though, so I don't know whether Calico has improved since then.
[1] https://github.com/coreos/flannel
[2] A Kubernetes pod is one or more related containers running on a single server
[3] http://www.slideshare.net/ArjanSchaaf/docker-network-perform...
https://www.projectcalico.org/canal-tigera/
https://coreos.com/blog/coreos-intel-calico-packet-extend-gi...
However, if you run it on AWS, it can automatically configure a bridge (cbr0) and set up the VPC routing table for you.
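For context, the VPC programming amounts to something like the following (a sketch; the route table and instance IDs are placeholders): each node owns a pod CIDR, and a VPC route sends that CIDR to the node's instance.

```
# Sketch of the per-node VPC route (IDs are placeholders):
# "send traffic for node A's pod subnet to node A's instance"
aws ec2 create-route \
    --route-table-id rtb-0123456789abcdef0 \
    --destination-cidr-block 10.244.1.0/24 \
    --instance-id i-0123456789abcdef0
```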
GKE (Google's managed Kubernetes on Google Cloud) also handles this automatically.
There's also experimental support for Flannel built into K8s, which can be enabled with a flag. Not sure if it's worth using.
The nice thing about this is that nothing has to happen for a new pod to be reachable. No /32 route distribution or BGP (or etcd) convergence, no VXLAN ID (VNID) distribution for the overlay. At some scale, route and/or VNID distribution is going to limit the speed at which new pods can be launched.
One other thing not mentioned in the blog post or in any of these comments is network policy and isolation. Kubernetes v1.3 includes the new network APIs that let you isolate namespaces. This can only be achieved with a backend network solution like Romana or Calico (and some others as well).
[1] romana.io
http://blog.kubernetes.io/2016/01/why-Kubernetes-doesnt-use-...
Most container network offerings (Calico, Flannel, Weave, etc.) ship with a CNI plugin.
Docker have not altered their network plugin API.
(I work on Weave Net, including the plugins for both Docker and CNI)
(--Network engineer who manages BGP for an ISP)
That's not required here.
If you give me 10 machines on an L2 domain, I can set up a private network on top of those 10 machines and advertise what IP is where to each of them by sharing a routing table... I can of course manually add routes saying a /24 is located on Host 1 and another /24 is on Host 2, or...
What better way to share a routing table than with a route distribution protocol of some sort?
So plop BIRD with BGP on all of the machines, peer 'em, and have them pull routes out of the Linux routing table and insert routes as necessary.
Now if I spin up a container on Host 1, I advertise a /32 for that IP, and Hosts 2-10 all know to forward packets for that /32 to Host 1. If I move that container or IP to Host 2, BGP announces it to all the other hosts and traffic starts flowing there instead.
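A minimal BIRD (1.x syntax) sketch of that setup -- the ASN and addresses are examples, and in practice you'd template the neighbor stanza per host or use a route reflector:

```
# /etc/bird.conf on Host 1 (sketch; ASN and addresses are examples)
router id 10.0.0.1;

protocol device { scan time 10; }

# Learn container /32s that appear in the Linux routing table,
# and install routes learned from peers back into it.
protocol kernel {
    scan time 10;
    learn;
    import all;
    export all;
}

# One iBGP session per peer host (repeat for Hosts 2-10).
protocol bgp host2 {
    local as 64512;              # private ASN
    neighbor 10.0.0.2 as 64512;
    import all;
    export all;
}
```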
There is no requirement that Calico (or BIRD rather) peer with any existing BGP infrastructure... you can of course do that, but there is no requirement.
Stop making BGP sound like it's some bad evil thing that's difficult to understand.
If you just need isolation I agree with you. But I actually find the Calico solution rather elegant when looking at the whole package.
(--System administrator who manages VXLAN on a public cloud)
1) enough addresses. Just enough. For everything. For everyone. Google-scale enough.
2) Good out-of-the box dynamic assignment of addresses.
And finally, optional integration with IPsec, which I get might in the end be over-engineered and under-used -- but wouldn't it be nice if you could just trust the network? You'd still have to bootstrap trust somehow, probably by running your own x509 CA -- but how nice to be able to flip open any book on networking from the 80s, replace the IPv4 addressing with IPv6, and just go ahead and use plain rsh and /etc/hosts.allow as your entire infrastructure for actually secure intra-cluster networking -- even across data centres and what not. [ed: and secure NFSv3! woo-hoo]
But anyway, have anyone actually done this? Does it work (for a meaningfully large value of work)?
The problem is that many cloud providers (ahem EC2) don't make this trivially easy like they should.
Another attractive alternative to Flannel is Weave [1], run in the simpler non-overlay mode. In this mode, it won't start an SDN, but will simply act as a bridge/route maintainer, similar to Flannel.
If you're running etcd or consul, I'm not sure you retain the right to call LSA flooding "complicated". It's simple compared to Raft!
BIRD supports OSPF, so if you'd like to import/export routes using OSPF you can.
BGP is the process by which ranges of IPs are claimed by routers. Is Calico really used by docker containers in this way?
Kubernetes enforces a specific rule: Each pod (a group of containers) must be allocated its own cluster-routable IP address. This vastly simplifies Docker setups: In a way, it containerizes the network, just like Docker containerizes processes. It's the only sane way to manage containers, in my opinion.
This system requires something that can hand out IPs and ensure that they're routable on every machine. That something can be implemented in different ways, ranging from extremely simple to rather complex. For example, you could have something that acts like a bridge and coordinates with other nodes to find available IPs, simply keeping the routing tables on the nodes themselves in sync with this shared database (Flannel can run in this mode). Or you could use an SDN-defined overlay network (e.g. Weave).
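In the simple bridge-plus-routes mode, what ends up on each node is just ordinary kernel routes, something like this (a sketch; the subnets and node IPs are made up, and the commands need root):

```
# Node A (172.16.0.1) owns pod subnet 10.244.1.0/24,
# Node B (172.16.0.2) owns pod subnet 10.244.2.0/24.

# On Node A: reach B's pods via B's host IP
ip route add 10.244.2.0/24 via 172.16.0.2

# On Node B: reach A's pods via A's host IP
ip route add 10.244.1.0/24 via 172.16.0.1
```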
Using a real routing protocol also immediately gives you access to traffic control, shaping, monitoring and redundancy tools, hardware support and knowledge that network administrators have been applying for years.
Sounds like this guy just found out about a cool new tool and decided to blog about it. BGP can be used on local networks, but it's total overkill for a Docker situation where all of your instances are likely in the same rack (often on the same machine!). If you don't even have an AS from ARIN/RIPE there's no reason to even touch this (as you pointed out, that's all the protocol is designed for: broadcasting to the public internet -- e.g. 'hey, I own this AS which has rights to this net-block, direct packets in this fashion please!'). Jeez.
I have no idea what the CPU overhead of running this is, but I'm sure it's not trivial, especially if the BGP daemon is tuned to retain any significant amount of the full BGP table (RAM/swap issues galore, I'd imagine). Granted, the article is titled 'one way'... which is empirically true, but it's a Rube Goldberg way of going about networking.
(n.b., OSPF is what people use for "BGP" within your own intranet, even when you have tens of thousands of boxes. It's called "Border Gateway Protocol" for a reason...)
(Sorry, I don't use docker so I can't actually make a constructive comment telling you what the canonical/right way of doing it is, but I can assure you, this is not it.)
Eh, BGP on local networks is common enough that it was novel maybe a decade ago. It's perfect for running on servers to announce /32 addresses upstream to ToR switches. OSPF is actually more heavyweight & complex since you have to carry link state, do neighbor elections, etc. Ref: NANOG talk in 2003: https://www.nanog.org/meetings/nanog29/presentations/miller....
You don't even need an AS from your RIR for BGP to be useful on internal networks, just pick one (or more) of the private ASNs and roll with it.
Current best practice for internal networks (on the SP side at least) is to use OSPF to bootstrap BGP by enabling IP reachability for p2p & loopback addresses. After that customer / service routes are carried by BGP, not by OSPF. This is because BGP doesn't spam link state and has better policy knobs. You get a double win because OSPF converges faster with fewer routes and if you have redundant links your BGP sessions won't flap because the loopback addresses stay reachable. Ref: NANOG talk in 2011: https://www.nanog.org/meetings/nanog53/presentations/Sunday/...
CPU/RAM overhead is insignificant with a bgpd like BIRD or Quagga. They work for internet-scale routing (currently over 610,000 prefixes) with trivial load. An Atom C-series with a few GB of RAM can deal with internet BGP routing updates (CPU becomes significant for packet forwarding, not so much maintaining routing tables).
I'll take a boring 10-year-old routing setup -- iBGP on a cluster of route reflectors running BIRD, with servers numbered out of a /24 per rack routed by a ToR L3 switch, and each service using BGP to announce a /32 -- over new and exciting L2 overlay networks any day. Troubleshooting is easier without having to work through different layers of encapsulation, deal with MTU issues, trust a novel control plane for whatever overlay network you're using, etc.
And you are categorically wrong. This is a very good way to do it.
This isn't container weirdness. This is because networks got stuck in 2008. We still don't have IPv6 SLAAC. Many of us made the jump to Layer 3 Clos fabrics, but stopped after that. My belief is that this is because AWS EC2, Google GCE, Azure Compute, and others consider this the gold standard.
IPv6 natively supports autoconfiguring multiple IPs per NIC/machine automagically. This is usually on by default as part of the privacy extensions, so in conjunction with SLAAC you can cycle through IPs quickly. It also makes multi-endpoint protocols relevant.
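On Linux, the relevant knobs are a couple of sysctls (shown here for eth0; these are the standard settings, but check your distro's defaults):

```
# Accept router advertisements (SLAAC), and prefer temporary
# "privacy" addresses over the stable EUI-64 address:
net.ipv6.conf.eth0.accept_ra = 1
net.ipv6.conf.eth0.use_tempaddr = 2
```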
Containers and bad networking because of the lack of an IP per container is a well-known problem; it's even touched on briefly in the Borg paper: "One IP address per machine complicates things. In Borg, all tasks on a machine use the single IP address of their host, and thus share the host's port space. This causes a number of difficulties: Borg must schedule ports as a resource; tasks must pre-declare how many ports they need, and be willing to be told which ones to use when they start; the Borglet must enforce port isolation; and the naming and RPC systems must handle ports as well as IP addresses."
Thanks to the advent of Linux namespaces, VMs, IPv6, and software-defined networking, Kubernetes can take a more user-friendly approach that eliminates these complications: every pod and service gets its own IP address, allowing developers to choose ports rather than requiring their software to adapt to the ones chosen by the infrastructure, and removing the infrastructure complexity of managing ports.
But, I ask, what's wrong with the Docker approach of rewriting ports? Reachability is our primary concern, and unfortunately BGP hasn't become the lingua franca for most networks ("The Cloud"). I actually think ILA (https://tools.ietf.org/html/draft-herbert-nvo3-ila-00#sectio...) / ILNP (RFC 6741) are the most interesting approaches here.
It requires that you rewrite the software trying to talk to that port, to make it aware that you've put the new port number in a special environment variable.
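Concretely: with port rewriting, the app can no longer assume the service's well-known port; it has to read whatever was injected. A tiny sketch -- the variable names BACKEND_HOST/BACKEND_PORT are hypothetical, since the actual names depend on how your links are wired up:

```python
import os

# Hypothetical env vars injected by the orchestrator; the app must
# consult them instead of hard-coding the service's well-known port.
def backend_address():
    host = os.environ.get("BACKEND_HOST", "127.0.0.1")
    port = int(os.environ.get("BACKEND_PORT", "5432"))
    return (host, port)

if __name__ == "__main__":
    os.environ["BACKEND_PORT"] = "31337"   # what the port mapping chose for us
    print(backend_address())               # ('127.0.0.1', 31337)
```

Every client of the service needs this kind of change, which is exactly the rewrite burden being described.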
What do you think of them?
You can deploy this on any machine (container or not) and have it always reachable from other members of the same network, which could be, e.g., servers on different providers (AWS, Azure, Digital Ocean, etc.).
(and maybe also that you are affiliated with them)
Sorry, I should have made that explicit. Since it's my own repo, and my profile's email address gives away that I'm part of Wormhole, I didn't think to make a statement in the comment; but you're right.
Thanks!
Stipulating that you need a routing protocol here (you don't, right? You can do proxy ARP, or some more modern equivalent of proxy ARP.), there's a whole family of routing protocols optimized for this scenario, of which OSPF is the best-known.
Opinions vary whether this is a real concern, or just a way for the networking team to maintain their relevance.
All Calico cares about is that routes are distributed across the various systems; it doesn't necessarily care how you do it (configure BIRD however you'd like).
BGP is surprisingly simple and easy to set up with BIRD. Setting up a route reflector with local hosts on the same L2 all peering with each other and suddenly you can route whatever IP's you want by announcing them to your peers.
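With BIRD, the route-reflector side is only a few lines more than a plain iBGP session (a sketch; the ASN and addresses are examples):

```
# On the route reflector: one stanza per client host.
protocol bgp client_host1 {
    local as 64512;
    neighbor 10.0.0.11 as 64512;
    rr client;    # reflect this client's routes to the other clients
    import all;
    export all;
}
```

Each host then needs only a single session to the reflector instead of a full mesh.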
Why do people think BGP is complicated?
Read your own paragraph before this question. Why do I need to run another process to exchange routes and configure a mesh or a route reflector? As an admin that's just another mess of processes and communication to worry about.
Just because BGP is easy for you does not mean it's easy for most server admins and devs without heavy networking backgrounds.
>A Linux container is a process, usually with its own filesystem attached to it so that its dependencies are isolated from your normal operating system. In the Docker universe we sometimes talk like it's a virtual machine, but fundamentally, it's just a process. Like any process, it can listen on a port (like 30000) and do networking.
A container isn't a process. It's an amalgamation of cgroups and namespaces. A container can have many processes. Hell, use systemd-nspawn on a volume that contains a linux distro and your container is basically the entire userspace of a full system.
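You can see the namespace side of this directly: every process's namespace memberships are symlinks under /proc/&lt;pid&gt;/ns, and two processes are "in the same container" roughly to the extent those links match (Linux only):

```shell
# List this shell's namespace memberships.
ls -l /proc/self/ns
# The link target identifies the namespace, e.g. pid:[4026531836];
# processes with the same inode share that namespace.
readlink /proc/self/ns/pid
```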
>But what do I do if I have another computer on the same network? How does that container know that 10.0.1.104 belongs to a container on my computer?
Well, BGP certainly isn't a hard requirement. Depending on how you've setup your network, if these are in the same subnet and can communicate via layer 2, you don't need any sort of routing.
>To me, this seems pretty nice. It means that you can easily interpret the packets coming in and out of your machine (and, because we love tcpdump, we want to be able to understand our network traffic). I think there are other advantages but I'm not sure what they are.
I'm not sure where the idea that calico/BGP are required to look at network traffic for containers on your machine came from. If there's network traffic on your machine, you can basically always capture it with tcpdump.
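For instance, on the host you can capture container traffic on the Docker bridge, or filter by the container's IP -- no BGP involved (a sketch; interface names vary by setup, and tcpdump needs root):

```
# Capture everything crossing the default Docker bridge:
tcpdump -ni docker0

# Or watch one container's traffic by its IP, across all interfaces:
tcpdump -ni any host 10.0.1.104
```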
> I find reading this networking stuff pretty difficult; more difficult than usual. For example, Docker also has a networking product they released recently. The webpage says they're doing "overlay networking". I don't know what that is, but it seems like you need etcd or consul or zookeeper. So the networking thing involves a distributed key-value store? Why do I need to have a distributed key-value store to do networking? There is probably a talk about this that I can watch but I don't understand it yet.
I think not understanding one of the major players in container networking at all is a good indication that it might not yet be time to personally write a blog post about container networking. Also absent is simple bridging.
Julia generally writes fantastic blogs, and I know she doesn't claim to be an expert on this subject and includes a disclaimer about how this is likely to be more wrong than usual. But I feel like there was a lot of room for additional research to produce a more accurate article. I understand the blog is mostly about what she has recently learned, and often leaves lots of questions unanswered... but this one has a lot of things that are answered, incorrectly :(
I think the best approach is to constructively comment and engage to improve the article. The important thing is to do this in a positive manner so that the author feels they have done something of value and started a conversation. It is surprisingly difficult to do this, and I've certainly failed on occasion, but it is definitely worth trying.
Honestly, I struggle with when to publish things a lot -- I practically never write about things I understand well, but I do usually write about things that I understand a little better than this post. Consider it an ongoing experiment :)
I really appreciate factual corrections like "a container isn't a process", and I think comment threads like this are a good place for that. I fixed up a few of the more egregious incorrect things.
I've been requesting this feature from the EC2 team at AWS for some time, to no avail. You can bind multiple interfaces (ENIs) to an instance (up to 6, I think, depending on the instance size), each with a separate IP address, but not multiple IPs to a single interface.
BGP, flannel, vxlan, etc. are IMO a waste of cycles and add needless complexity to what could otherwise be a very simple architecture.
https://www.zerotier.com/community/topic/67/zerotier-6plane-...
Full disclosure: this is ours.