Thanks for that. The Hackernoon piece in particular is really worth the read. Two salient points/questions that jumped out at me after reading (these are really about the theory, not the practice):
-The fact that you have a bunch of capsules in each layer, each of which intrinsically performs a non-linear filter function (the particular squashing function shown as an image in that blog), seems like both a great asset (it looks like a 'meta-network') and also a potential problem: if you need to tune weights within these capsules by the derivative of such a composite function, it doesn't seem straightforward.
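For concreteness, the squashing non-linearity from the paper is v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||): direction is preserved, but length is shrunk toward 0 for short vectors and saturates toward 1 for long ones. A minimal pure-Python sketch (toy-sized, not an efficient implementation):

```python
import math

def squash(s):
    # CapsNet squashing function:
    #   v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)
    # Keeps the direction of s; maps its length into (0, 1).
    norm_sq = sum(x * x for x in s)
    norm = math.sqrt(norm_sq) or 1e-9  # guard against the zero vector
    scale = norm_sq / (1.0 + norm_sq) / norm
    return [scale * x for x in s]

# A long input vector keeps its direction but its length saturates near 1;
# a short one is crushed toward 0.
print(squash([10.0, 0.0, 0.0]))
print(squash([0.1, 0.0, 0.0]))
```

Note that despite looking exotic, this is still smooth almost everywhere, so the derivative of the composite function exists and backpropagation goes through it like any other activation; it is just bulkier to differentiate by hand than a ReLU.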
-The 'routing by agreement' feature is interesting, but I don't quite get why it is superior to max pooling. If the difference is simply that it down-weights links whose predictions disagree, rather than selecting only the strongest activation, one interesting analogy is that it seems a bit Hebbian, and related to a concept in unsupervised learning called STDP (spike-timing-dependent plasticity).
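To make the contrast with max pooling concrete, here is a rough sketch of the dynamic-routing loop (toy-sized and pure Python, not the paper's exact tensor implementation): coupling coefficients start uniform and are iteratively raised for lower capsules whose predictions agree (high dot product) with the current upper-capsule output, a soft assignment rather than a hard argmax.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squash(s):
    # CapsNet squashing: v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)
    norm_sq = dot(s, s)
    norm = math.sqrt(norm_sq) or 1e-9
    scale = norm_sq / (1.0 + norm_sq) / norm
    return [scale * x for x in s]

def route(u_hat, iterations=3):
    # u_hat[i][j] = prediction vector from lower capsule i for upper capsule j.
    # Routing by agreement: logits b[i][j] grow when prediction i->j agrees
    # with the current output v_j, so agreeing links are reinforced and
    # disagreeing links fade -- no single winner is hard-selected.
    n_in, n_out = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits
    for _ in range(iterations):
        c = [softmax(row) for row in b]       # couplings, softmax over outputs
        v = []
        for j in range(n_out):
            s = [sum(c[i][j] * u_hat[i][j][k] for i in range(n_in))
                 for k in range(dim)]
            v.append(squash(s))
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += dot(u_hat[i][j], v[j])  # agreement update
    return v, c

# Two lower capsules agree on upper capsule 0 and cancel out on capsule 1,
# so routing shifts their couplings toward capsule 0.
u_hat = [
    [[1.0, 0.0], [0.1, 0.0]],
    [[1.0, 0.0], [-0.1, 0.0]],
]
v, c = route(u_hat)
print(c)
```

The Hebbian flavour is visible in the `b[i][j] += dot(...)` line: couplings strengthen when pre- and post-capsule activity agree, which is exactly the "cells that fire together wire together" intuition.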