[1] https://github.com/HigherOrderCO/HVM
From my understanding, the main problem there is compiling to (optimal) CUDA code and the CUDA runtime, not the language or the internal representation per se. CUDA is hard to debug, so some help may be warranted.
BTW, this HVM thing smells strange. The paper does not describe any experiments in which linear parallel speedups were achieved. What were these 16K cores? What were these tasks?