I was skeptical when we first deployed it but we've found it to be dependable and fast. We're running it in production on six CoreOS servers and 400-500 containers.
We did evaluate Project Calico initially, but some performance tests we found tipped the scales in favor of flannel. [3] That was about a year ago, though, so I don't know whether Calico has improved since then.
[1] https://github.com/coreos/flannel
[2] A Kubernetes pod is one or more related containers running on a single server
[3] http://www.slideshare.net/ArjanSchaaf/docker-network-perform...
https://www.projectcalico.org/canal-tigera/
https://coreos.com/blog/coreos-intel-calico-packet-extend-gi...
However, if you run it on AWS, it can automatically configure a bridge (cbr0) and set up the VPC routing table for you.
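For context, the VPC programming amounts to something like the following (a sketch; the route table and instance IDs are placeholders): each node owns a pod CIDR, and a VPC route sends that CIDR to the node's instance.

```
# Sketch of the per-node VPC route (IDs are placeholders):
# "send traffic for node A's pod subnet to node A's instance"
aws ec2 create-route \
    --route-table-id rtb-0123456789abcdef0 \
    --destination-cidr-block 10.244.1.0/24 \
    --instance-id i-0123456789abcdef0
```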
GKE (Google's managed Kubernetes on Google Cloud) also handles this automatically.
There's also experimental support for Flannel built into K8s, which can be enabled with a flag. Not sure if it's worth using.
The nice thing about this is that nothing has to happen for a new pod to be reachable. No /32 route distribution or BGP (or etcd) convergence, no VXLAN ID (VNID) distribution for the overlay. At some scale, route and/or VNID distribution is going to limit the speed at which new pods can be launched.
One other thing not mentioned in the blog post or in any of these comments is network policy and isolation. Kubernetes v1.3 includes the new network APIs that let you isolate namespaces. This can only be achieved with a backend network solution like Romana or Calico (and some others as well).
[1] romana.io
http://blog.kubernetes.io/2016/01/why-Kubernetes-doesnt-use-...
Most container network offerings (Calico, Flannel, Weave, etc.) ship with a CNI plugin.
Docker have not altered their network plugin API.
(I work on Weave Net, including the plugins for both Docker and CNI)
(--Network engineer who manages BGP for an ISP)
That's not required here.
If you give me 10 machines on an L2 domain, I can set up a private network on top of those 10 machines and advertise what IP is where to each of them by sharing a routing table... I can of course manually add routes saying a /24 is located on Host 1 and another /24 is on Host 2, or...
What better way to share a routing table than with a route distribution protocol of some sort?
So plop BIRD with BGP on all of the machines, peer 'em, and have them pull routes out of the Linux routing table and insert routes as necessary.
Now if I spin up a container on Host 1, I advertise a /32 for that IP, and Hosts 2-10 all know to forward packets for that /32 to Host 1. If I move that container or IP to Host 2, BGP announces it to all the other hosts and traffic starts flowing there instead.
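A minimal BIRD (1.x syntax) sketch of that setup -- the ASN and addresses are examples, and in practice you'd template the neighbor stanza per host or use a route reflector:

```
# /etc/bird.conf on Host 1 (sketch; ASN and addresses are examples)
router id 10.0.0.1;

protocol device { scan time 10; }

# Learn container /32s that appear in the Linux routing table,
# and install routes learned from peers back into it.
protocol kernel {
    scan time 10;
    learn;
    import all;
    export all;
}

# One iBGP session per peer host (repeat for Hosts 2-10).
protocol bgp host2 {
    local as 64512;              # private ASN
    neighbor 10.0.0.2 as 64512;
    import all;
    export all;
}
```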
There is no requirement that Calico (or BIRD rather) peer with any existing BGP infrastructure... you can of course do that, but there is no requirement.
Stop making BGP sound like it's some bad evil thing that's difficult to understand.
If you just need isolation I agree with you. But I actually find the Calico solution rather elegant when looking at the whole package.
(--System administrator who manages VXLAN on a public cloud)
1) enough addresses. Just enough. For everything. For everyone. Google-scale enough.
2) Good out-of-the box dynamic assignment of addresses.
And finally, optional integration with IPsec, which I get might in the end be over-engineered and under-used -- but wouldn't it be nice if you could just trust the network? You'd still have to bootstrap trust somehow, probably by running your own x509 CA -- but how nice to be able to flip open any book on networking from the 80s, replace the IPv4 addressing with IPv6, and just go ahead and use plain rsh and /etc/hosts.allow as your entire infrastructure for actually secure intra-cluster networking -- even across data centres and what not. [ed: and secure NFSv3! woo-hoo]
But anyway, have anyone actually done this? Does it work (for a meaningfully large value of work)?
The problem is that many cloud providers (ahem EC2) don't make this trivially easy like they should.
Another attractive alternative to Flannel is Weave [1], run in the simpler non-overlay mode. In this mode, it won't start an SDN, but will simply act as a bridge/route maintainer, similar to Flannel.
If you're running etcd or consul, I'm not sure you retain the right to call LSA flooding "complicated". It's simple compared to Raft!
BIRD supports OSPF, so if you'd like to import/export routes using OSPF you can.
BGP is the process by which ranges of IPs are claimed by routers. Is Calico really used by docker containers in this way?
Kubernetes enforces a specific rule: Each pod (a group of containers) must be allocated its own cluster-routable IP address. This vastly simplifies Docker setups: In a way, it containerizes the network, just like Docker containerizes processes. It's the only sane way to manage containers, in my opinion.
This system requires something that can hand out IPs and ensure that they're routable on every machine. That something can be implemented in different ways, ranging from extremely simple to rather complex. For example, you could have something that acts like a bridge and coordinates with other nodes to find available IPs, simply keeping the routing tables on the nodes themselves in sync with this shared database (Flannel can run in this mode). Or you could use an SDN-defined overlay network (e.g. Weave).
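In the simple bridge-plus-routes mode, what ends up on each node is just ordinary kernel routes, something like this (a sketch; the subnets and node IPs are made up, and the commands need root):

```
# Node A (172.16.0.1) owns pod subnet 10.244.1.0/24,
# Node B (172.16.0.2) owns pod subnet 10.244.2.0/24.

# On Node A: reach B's pods via B's host IP
ip route add 10.244.2.0/24 via 172.16.0.2

# On Node B: reach A's pods via A's host IP
ip route add 10.244.1.0/24 via 172.16.0.1
```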
Using a real routing protocol also immediately gives you access to traffic control, shaping, monitoring and redundancy tools, hardware support and knowledge that network administrators have been applying for years.
Sounds like this guy just found out about a cool new tool and decided to blog about it. BGP can be used on local networks, but it's total overkill for a Docker situation where all of your instances are likely in the same rack (often on the same machine!). If you don't even have an AS from ARIN/RIPE there's no reason to even touch this (as you pointed out, that's all the protocol is designed for: broadcasting to the public internet -- e.g. 'hey, I own this AS which has rights to this net-block, direct packets in this fashion please!'). Jeez.
I have no idea what the CPU overhead of running this is, but I'm sure it's not trivial, especially if the BGP daemon is tuned to retain any significant amount of the full BGP table (RAM/swap issues galore, I'd imagine). Granted, the article is titled 'one way'... which is empirically true, but it's a Rube Goldberg way of going about networking.
(n.b., OSPF is what people use for "BGP" within your own intranet, even when you have tens of thousands of boxes. It's called "Border Gateway Protocol" for a reason...)
(Sorry, I don't use docker so I can't actually make a constructive comment telling you what the canonical/right way of doing it is, but I can assure you, this is not it.)
Eh, BGP on local networks is common enough that it was novel maybe a decade ago. It's perfect for running on servers to announce /32 addresses upstream to ToR switches. OSPF is actually more heavyweight & complex since you have to carry link state, do neighbor elections, etc. Ref: NANOG talk in 2003: https://www.nanog.org/meetings/nanog29/presentations/miller....
You don't even need an AS from your RIR for BGP to be useful on internal networks, just pick one (or more) of the private ASNs and roll with it.
Current best practice for internal networks (on the SP side at least) is to use OSPF to bootstrap BGP by enabling IP reachability for p2p & loopback addresses. After that customer / service routes are carried by BGP, not by OSPF. This is because BGP doesn't spam link state and has better policy knobs. You get a double win because OSPF converges faster with fewer routes and if you have redundant links your BGP sessions won't flap because the loopback addresses stay reachable. Ref: NANOG talk in 2011: https://www.nanog.org/meetings/nanog53/presentations/Sunday/...
CPU/RAM overhead is insignificant with a bgpd like BIRD or Quagga. They work for internet-scale routing (currently over 610,000 prefixes) with trivial load. An Atom C-series with a few GB of RAM can deal with internet BGP routing updates (CPU becomes significant for packet forwarding, not so much maintaining routing tables).
I'll take a boring 10-year-old routing setup -- iBGP on a cluster of route reflectors running BIRD, with servers numbered out of a /24 per rack routed by a ToR L3 switch, and each service using BGP to announce a /32 -- over new and exciting L2 overlay networks any day. Troubleshooting is easier without having to work through different layers of encapsulation, deal with MTU issues, trust a novel control plane for whatever overlay network you're using, etc.
And you are categorically wrong. This is a very good way to do it.
This isn't container weirdness. This is because networks got stuck in 2008. We still don't have IPv6 SLAAC. Many of us made the jump to Layer 3 Clos fabrics, but stopped after that. My belief is that this is because AWS EC2, Google GCE, Azure Compute, and others consider this the gold standard.
IPv6 natively supports autoconfiguring multiple IPs per NIC/machine automagically. This is usually on by default as part of the privacy extensions, so in conjunction with SLAAC you can cycle through IPs quickly. It also makes multi-endpoint protocols relevant.
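On Linux, the relevant knobs are a couple of sysctls (shown here for eth0; these are the standard settings, but check your distro's defaults):

```
# Accept router advertisements (SLAAC), and prefer temporary
# "privacy" addresses over the stable EUI-64 address:
net.ipv6.conf.eth0.accept_ra = 1
net.ipv6.conf.eth0.use_tempaddr = 2
```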
Containers and bad networking because of the lack of an IP per container is a well-known problem; it's even touched on briefly in the Borg paper: "One IP address per machine complicates things. In Borg, all tasks on a machine use the single IP address of their host, and thus share the host's port space. This causes a number of difficulties: Borg must schedule ports as a resource; tasks must pre-declare how many ports they need, and be willing to be told which ones to use when they start; the Borglet must enforce port isolation; and the naming and RPC systems must handle ports as well as IP addresses."
Thanks to the advent of Linux namespaces, VMs, IPv6, and software-defined networking, Kubernetes can take a more user-friendly approach that eliminates these complications: every pod and service gets its own IP address, allowing developers to choose ports rather than requiring their software to adapt to the ones chosen by the infrastructure, and removing the infrastructure complexity of managing ports.
But, I ask, what's wrong with the Docker approach of rewriting ports? Reachability is our primary concern, and unfortunately BGP hasn't become the lingua franca for most networks ("The Cloud"). I actually think ILA (https://tools.ietf.org/html/draft-herbert-nvo3-ila-00#sectio...) / ILNP (RFC 6741) are the most interesting approaches here.
It requires that you rewrite the software trying to talk to that port, to make it aware that you've put the new port number in a special environment variable.
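Concretely: with port rewriting, the app can no longer assume the service's well-known port; it has to read whatever was injected. A tiny sketch -- the variable names BACKEND_HOST/BACKEND_PORT are hypothetical, since the actual names depend on how your links are wired up:

```python
import os

# Hypothetical env vars injected by the orchestrator; the app must
# consult them instead of hard-coding the service's well-known port.
def backend_address():
    host = os.environ.get("BACKEND_HOST", "127.0.0.1")
    port = int(os.environ.get("BACKEND_PORT", "5432"))
    return (host, port)

if __name__ == "__main__":
    os.environ["BACKEND_PORT"] = "31337"   # what the port mapping chose for us
    print(backend_address())               # ('127.0.0.1', 31337)
```

Every client of the service needs this kind of change, which is exactly the rewrite burden being described.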
What do you think of them?
You can deploy this on any machine (container or not) and have it always reachable from other members of the same network, which could be, e.g., servers on different providers (AWS, Azure, Digital Ocean, etc.).
(and maybe also that you are affiliated with them)
Sorry, I should have made that explicit. Since it's my own repo, and my profile's email address gives away that I'm part of Wormhole, I didn't think to make a statement in the comment; but you're right.
Thanks!
Stipulating that you need a routing protocol here (you don't, right? You can do proxy ARP, or some more modern equivalent of proxy ARP.), there's a whole family of routing protocols optimized for this scenario, of which OSPF is the best-known.
Opinions vary whether this is a real concern, or just a way for the networking team to maintain their relevance.
All Calico cares about is that routes are distributed across the various systems; it doesn't necessarily care how you do it (configure BIRD however you'd like).
BGP is surprisingly simple and easy to set up with BIRD. Setting up a route reflector with local hosts on the same L2 all peering with each other and suddenly you can route whatever IP's you want by announcing them to your peers.
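With BIRD, the route-reflector side is only a few lines more than a plain iBGP session (a sketch; the ASN and addresses are examples):

```
# On the route reflector: one stanza per client host.
protocol bgp client_host1 {
    local as 64512;
    neighbor 10.0.0.11 as 64512;
    rr client;    # reflect this client's routes to the other clients
    import all;
    export all;
}
```

Each host then needs only a single session to the reflector instead of a full mesh.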
Why do people think BGP is complicated?
Read your own paragraph before this question. Why do I need to run another process to exchange routes and configure a mesh or a route reflector? As an admin that's just another mess of processes and communication to worry about.
Just because BGP is easy for you does not mean it's easy for most server admins and devs without heavy networking backgrounds.
>A Linux container is a process, usually with its own filesystem attached to it so that its dependencies are isolated from your normal operating system. In the Docker universe we sometimes talk like it's a virtual machine, but fundamentally, it's just a process. Like any process, it can listen on a port (like 30000) and do networking.
A container isn't a process. It's an amalgamation of cgroups and namespaces. A container can have many processes. Hell, use systemd-nspawn on a volume that contains a linux distro and your container is basically the entire userspace of a full system.
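You can see the namespace side of this directly: every process's namespace memberships are symlinks under /proc/&lt;pid&gt;/ns, and two processes are "in the same container" roughly to the extent those links match (Linux only):

```shell
# List this shell's namespace memberships.
ls -l /proc/self/ns
# The link target identifies the namespace, e.g. pid:[4026531836];
# processes with the same inode share that namespace.
readlink /proc/self/ns/pid
```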
>But what do I do if I have another computer on the same network? How does that container know that 10.0.1.104 belongs to a container on my computer?
Well, BGP certainly isn't a hard requirement. Depending on how you've setup your network, if these are in the same subnet and can communicate via layer 2, you don't need any sort of routing.
>To me, this seems pretty nice. It means that you can easily interpret the packets coming in and out of your machine (and, because we love tcpdump, we want to be able to understand our network traffic). I think there are other advantages but I'm not sure what they are.
I'm not sure where the idea that calico/BGP are required to look at network traffic for containers on your machine came from. If there's network traffic on your machine, you can basically always capture it with tcpdump.
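For instance, on the host you can capture container traffic on the Docker bridge, or filter by the container's IP -- no BGP involved (a sketch; interface names vary by setup, and tcpdump needs root):

```
# Capture everything crossing the default Docker bridge:
tcpdump -ni docker0

# Or watch one container's traffic by its IP, across all interfaces:
tcpdump -ni any host 10.0.1.104
```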
> I find reading this networking stuff pretty difficult; more difficult than usual. For example, Docker also has a networking product they released recently. The webpage says they're doing "overlay networking". I don't know what that is, but it seems like you need etcd or consul or zookeeper. So the networking thing involves a distributed key-value store? Why do I need to have a distributed key-value store to do networking? There is probably a talk about this that I can watch but I don't understand it yet.
I think not understanding one of the major players in container networking at all is a good indication that it might not yet be time to personally write a blog post about container networking. Also absent is simple bridging.
Julia generally writes fantastic blogs, and I know she doesn't claim to be an expert on this subject and includes a disclaimer about how this is likely to be more wrong than usual. But I feel like there was a lot of room for additional research to produce a more accurate article. I understand the blog is mostly about what she has recently learned, and often leaves lots of questions unanswered... but this one has a lot of things that are answered, incorrectly :(
I think the best approach is to constructively comment and engage to improve the article. The important thing is to do this in a positive manner so that the author feels they have done something of value and started a conversation. It is surprisingly difficult to do this, and I've certainly failed on occasion, but it is definitely worth trying.
Honestly, I struggle with when to publish things a lot -- I practically never write about things I understand well, but I do usually write about things that I understand a little better than this post. Consider it an ongoing experiment :)
I really appreciate factual corrections like "a container isn't a process", and I think comment threads like this are a good place for that. I fixed up a few of the more egregious incorrect things.
I've been requesting this feature from the EC2 team at AWS for some time, to no avail. You can bind multiple interfaces (ENIs) to an instance (up to 6, I think, depending on the instance size), each with a separate IP address, but not multiple IPs to a single interface.
BGP, flannel, vxlan, etc. are IMO a waste of cycles and add needless complexity to what could otherwise be a very simple architecture.
https://www.zerotier.com/community/topic/67/zerotier-6plane-...
Full disclosure: this is ours.