I suspect that there is no way to predict in advance the smallest model which can represent any given function. However, we should be able to at least develop an intuition for depth and breadth that is sufficient on any given problem. Classically, division circuits are large and complex, so I naively expect a NN built out of add, mul, and Heaviside(x)*x to also be large and complex.
As with many models, I suspect that this network really learned some other property that has almost-but-not-quite the same pattern as divisible-by-N.