Whenever I see a speed boost for doing what is conceptually the same thing, I'm always curious where the fat was cut. What did we give up? You can dump the resulting assembly with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly, and a diff might be revealing.
My hunch is that the line from the tutorial: `@CFunction(transition = Transition.NO_TRANSITION)` makes all the difference. Explanation of NO_TRANSITION from [0]:
No prologue and epilogue is emitted. The C code must not block and must not call back to Java. Also, long running C code delays safepoints (and therefore garbage collection) of other threads until the call returns.
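For context, here is a minimal sketch of what such a binding looks like with the GraalVM Native Image SDK. The `cos` binding and class name are illustrative; this is a declaration fragment that only compiles against the `org.graalvm.nativeimage` SDK and links when built with native-image, so it can't run standalone:

```java
import org.graalvm.nativeimage.c.function.CFunction;
import org.graalvm.nativeimage.c.function.CFunction.Transition;

public class MathBindings {
    // NO_TRANSITION skips the Java-to-native thread-state bookkeeping:
    // the C code must not block or call back into Java, and it delays
    // safepoints (and thus GC) in other threads while it runs.
    @CFunction(transition = Transition.NO_TRANSITION)
    public static native double cos(double x);
}
```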
Which is probably great for BLAS-like calls. This lines up with my understanding from Cliff Click's great talk "Why is JNI Slow?"[1], which basically says that to be faster you need to make assumptions about what the native code could and couldn't do, and that developers would generally shoot themselves in the foot.
[0]: https://github.com/oracle/graal/blob/master/sdk/src/org.graa... [1]: https://www.youtube.com/watch?v=LoyBTqkSkZk
"JNI is slow", being the conventional wisdom, and knowing just how frequent the calls would be, people had ignored it as an option.
Then one of the devs who was most bothered by the bottleneck had an hour spare, threw the conventional wisdom out the window, dropped in JNI calls to a standard (highly optimised) library, and re-benchmarked. 40% performance boost. Further experiments found that "JNI is slow" isn't as true as conventional wisdom had it.
https://android.googlesource.com/platform/libcore/+/master/d...
EDIT: I forgot to mention @CriticalNative as well
https://android.googlesource.com/platform/libcore/+/master/d...
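On Android/ART, @CriticalNative makes a similar trade of safety for speed. A sketch, assuming a hypothetical `nativeAdd` function exported by an already-loaded native library (names are illustrative; this only compiles on Android, where `dalvik.annotation.optimization` is available):

```java
import dalvik.annotation.optimization.CriticalNative;

public class FastMath {
    // @CriticalNative methods must be static, take only primitive
    // arguments, and receive no JNIEnv/jclass, so ART can invoke the
    // native code with far less per-call bookkeeping than plain JNI.
    @CriticalNative
    static native int nativeAdd(int a, int b);
}
```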
GCC recognized extern "Java" in headers generated from class files. You could then call (gcj-compiled) Java classes from C++ as if they were native C++ classes, as well as implement Java "native" methods in natural C++.
The whole thing performed a lot better than JNI since it was, more or less, just using the standard platform calling conventions. Calling a native CNI method from Java had the same overhead as any regular Java virtual method call.
Ultimately, GCJ faded away because there wasn't a great deal of interest in native Java compilation back then, and there were too many compatibility challenges in the pre-OpenJDK days. But it's interesting to see many of its ideas coming back now in the form of Graal/GraalVM.
Most third-party commercial Java SDKs do have support for native compilation, especially in the embedded space.
Around 2009 GCJ suffered an exodus of developers to OpenJDK.
You can follow along here:
http://mail.openjdk.java.net/pipermail/panama-dev/
The same project is also adding support for writing vector code in Java (SSE, AVX etc).
I can say for a fact that Panama is not seriously targeting this space. We implement a ton of that native code today, working with C++ and actual Android. We also handle GPUs. Project Panama is only targeting C, and even then will only do it in a cross-platform, non-committal fashion. They aren't doing it the way they should in order to properly target native vectorized code.
We know this from experience, because this is all we do: https://github.com/deeplearning4j/deeplearning4j https://github.com/bytedeco/javacpp-presets
We tried seeing if we could get some of this work into the JDK, but their goals fundamentally compete with what it takes to get vector math to be fast. It's also not nearly as ambitious as it needs to be to handle real-world tensor workloads.
John Rose of Oracle:
Panama is not just about C headers. It is about building a framework in which any data+function schema of APIs can be efficiently plugged into the JVM. So it's not just C or C++ but protocol specs and persistent memory structures and on-disk formats and stuff not invented yet. We've been relentless about designing the framework down to essential functionality (memory access and procedure calls), not just our (second-)favorite language or compiler.
The important deliverable of Panama is therefore not Posix bindings, but rather a language-neutral memory layout-and-access mechanism, plus a language-neutral (initially ABI-compliant) subroutine invocation mechanism. The jextract tool grovels over ANSI C (soon C++) schemas and translates to the layouts and function calls, bound helpfully to Java APIs with unsurprising names. But the jextract tool is just the first plugin of many.
We do look forward to building more plugins for more metadata formats outside the Java ecosystem, such as what you are building.
In fact, I expect that, in the long run, we will not build all of the plugins, but that people who invent new data schemas (or even data+function schemas or languages) will consider using our tools (layouts, binder, metadata annotations) to integrate with Java, instead of the standard technique, which is to write a set of Java native functions from scratch, or (if you are very clever) with tooling. The binder pattern, in particular, seems to be a great way to spin repetitive code for accessing data structures of all sorts, not just C or Java. I hope it will be used, eventually, in preference to static protocol compilers. The JVM is very good at on-line optimization, even of freshly spun code, so it is a natural framework for building a binder.
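As a concrete illustration of that "layout-and-access plus ABI-compliant invocation" design: Panama's FFI side later shipped as the java.lang.foreign API (finalized in Java 22; a preview feature before that). A minimal sketch calling libc's strlen through a downcall handle, assuming a Java 22+ runtime:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.nio.charset.StandardCharsets;

public class StrlenDemo {
    // Bind C's size_t strlen(const char *s) as a Java MethodHandle.
    static long cStrlen(String s) throws Throwable {
        Linker linker = Linker.nativeLinker();
        MemorySegment addr = linker.defaultLookup().find("strlen").orElseThrow();
        MethodHandle strlen = linker.downcallHandle(
                addr, FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
        try (Arena arena = Arena.ofConfined()) {
            // Copy the string into off-heap memory as a NUL-terminated C string.
            byte[] bytes = (s + "\0").getBytes(StandardCharsets.UTF_8);
            MemorySegment cStr = arena.allocate(bytes.length);
            MemorySegment.copy(bytes, 0, cStr, ValueLayout.JAVA_BYTE, 0, bytes.length);
            return (long) strlen.invoke(cStr);
        }
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(cStrlen("hello, panama")); // prints 13
    }
}
```

No JNI stub, no hand-written glue: the FunctionDescriptor is the "layout" and the Linker supplies the ABI-compliant call.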
>They aren't doing it the way they should be in order to properly target native vectorized code.
Which is interesting since Intel is the one contributing the majority of the vector code changes.
https://twitter.com/sundararajan_a/status/101507363642677248...