If I understand correctly after skimming, one of the fundamental ideas behind this appears to be similar to the well-known Fast Multipole Method [1]. It’s also a tree-based approach where far-away points are aggregated into larger chunks?
In particular, their scheme uses exact bounds on energy differences (evaluated using a hierarchical tree, as in Barnes-Hut (BH)), but in such a way that no approximation is made. The tree is descended to whatever depth is needed to decide whether to accept or reject the MC move (which in the worst case could mean a brute-force sum over the whole lattice/system) -- this is different, I think, from BH and other multipole-inspired methods, which have a kind of "truncation" or "tolerance" parameter. A rough sketch of the decision loop is below.
[This also works well for systems where updates are local rather than global, which I think is a difference from some other spatial-partitioning schemes -- but I'm less conversant with that aspect.]
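To make that concrete, here is a minimal sketch of what such an exact accept/reject loop might look like. This is my own illustration in Python, not the authors' code: the node interface (`bounds()`, `children()`) and every name here are hypothetical stand-ins. The idea is that the Metropolis comparison dE < -ln(u)/beta can be decided exactly by refining rigorous interval bounds on dE until they pin down which side of the threshold it falls on.

```python
import math
import random

def metropolis_accept(tree_root, beta, rng=random):
    """Exact Metropolis accept/reject via lazily refined bounds on dE.

    `tree_root` is a hypothetical node object assumed to expose:
      - bounds()   -> (lo, hi): rigorous bounds on this subtree's
                      contribution to the energy difference dE;
                      exact (lo == hi) at a leaf
      - children() -> iterable of child nodes (empty at a leaf)
    """
    # Metropolis: accept iff u < exp(-beta * dE), u ~ Uniform(0, 1),
    # i.e. iff dE < t where t = -ln(u) / beta.  Draw u in (0, 1]
    # so log() is safe.
    u = 1.0 - rng.random()
    t = -math.log(u) / beta

    # Frontier of nodes whose dE contribution is currently summarized
    # by interval bounds rather than evaluated exactly.
    frontier = [tree_root]
    lo, hi = tree_root.bounds()

    # Refine only while the bounds cannot yet decide the comparison.
    while lo < t <= hi:
        # Expand the frontier node with the widest interval: swap its
        # bounds out of the running totals and its children's in.
        node = max(frontier, key=lambda n: n.bounds()[1] - n.bounds()[0])
        frontier.remove(node)
        n_lo, n_hi = node.bounds()
        lo, hi = lo - n_lo, hi - n_hi
        for child in node.children():
            frontier.append(child)
            c_lo, c_hi = child.bounds()
            lo, hi = lo + c_lo, hi + c_hi
        # Worst case: every leaf gets expanded (a brute-force sum),
        # but the decision is always exact -- no tolerance parameter.

    return hi < t  # dE provably below t -> accept; otherwise reject
```

Note the contrast with BH-style codes: the loop stops when the accept/reject decision is determined, not when some tolerance is reached, so the sampled distribution stays exact even though most moves never touch the deepest levels of the tree.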
Sarle, Warren S. (1994). Neural Networks and Statistical Models. Proceedings of the Nineteenth Annual SAS Users Group International Conference, April 1994.
If you want an in-depth discussion, these books relate ANNs to statistical methods:
Schürmann, Jürgen. (1996). Pattern Classification: A Unified View of Statistical and Neural Approaches.
Raudys, Sarunas. (2001). Statistical and Neural Classifiers: An Integrated Approach to Design.