For those who still think manual memory management, and especially malloc and free, is fast, take a look at the glibc implementation:
https://fossies.org/dox/glibc-2.21/malloc_8c_source.html#l02...
Also see this benchmark comparing several faster allocators: http://locklessinc.com/benchmarks_allocator.shtml
The extra work doesn't end at allocation. When you implement a concurrent system, you usually end up with corner cases around object lifetime changes, which you need to synchronize. If memory is only reclaimed once there are no more references to it, this extra synchronization step can be avoided. If you can't rely on that, you'll probably end up doing the synchronization yourself, for example with (atomic) reference counting.
Synchronization is very expensive, and it can quickly become the performance bottleneck for the whole system. On modern x86, you can do 5-20k floating-point operations in the time of one contended atomic sync op. A reference-count increment or decrement is one sync op; a simple mutex needs two of them.
The more CPU cores you have, the more synchronization (cache-coherence) traffic gets broadcast to all of them.
Words like "JIT" and "GC" seem to cause knee-jerk reactions in some developers, and so does manual memory management. It's not black and white; there are always trade-offs. I usually write low-level (firmware and kernel-driver) and high-performance code in C/C++/SIMD, code that may need to react in under a microsecond.
My message is simply: please be more open-minded.
Analyze where your code actually spends its execution time. You might be surprised how much of it goes to things like C++ streams, xprintfs, and memory allocation. Unfortunately, inter-core synchronization is more insidious: it isn't visible in benchmarks on small systems. Often you only start to see hints of the problem when running on more cores; with enough of them, synchronization is all your code is doing.