In my experience you can lock and unlock a Go mutex in around 130ns, do a CAS in around 60ns, and a fetch-and-increment in around 30ns. So building something atop the atomic primitives is potentially faster, but the difference is not so dramatic that you shouldn't try a simple implementation first.
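If you want to check numbers like these on your own machine, a rough sketch along these lines should do. The function names (mutexIncrements, casIncrements, addIncrements) are made up for illustration, and this only measures the uncontended, single-goroutine case, so expect the results to differ from the figures above depending on hardware, Go version, and contention.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// mutexIncrements bumps a counter n times, taking and releasing a mutex each time.
func mutexIncrements(n int) (int64, time.Duration) {
	var mu sync.Mutex
	var v int64
	start := time.Now()
	for i := 0; i < n; i++ {
		mu.Lock()
		v++
		mu.Unlock()
	}
	return v, time.Since(start)
}

// casIncrements bumps a counter n times using a load followed by compare-and-swap.
func casIncrements(n int) (int64, time.Duration) {
	var v int64
	start := time.Now()
	for i := 0; i < n; i++ {
		for {
			old := atomic.LoadInt64(&v)
			if atomic.CompareAndSwapInt64(&v, old, old+1) {
				break
			}
		}
	}
	return atomic.LoadInt64(&v), time.Since(start)
}

// addIncrements bumps a counter n times using atomic fetch-and-add.
func addIncrements(n int) (int64, time.Duration) {
	var v int64
	start := time.Now()
	for i := 0; i < n; i++ {
		atomic.AddInt64(&v, 1)
	}
	return atomic.LoadInt64(&v), time.Since(start)
}

// report prints the average cost of one operation in nanoseconds.
func report(name string, n int, elapsed time.Duration) {
	fmt.Printf("%-14s %.1f ns/op\n", name, float64(elapsed.Nanoseconds())/float64(n))
}

func main() {
	const n = 1000000 // iteration count, chosen arbitrarily
	_, d := mutexIncrements(n)
	report("mutex", n, d)
	_, d = casIncrements(n)
	report("cas", n, d)
	_, d = addIncrements(n)
	report("fetch-and-add", n, d)
}
```

For anything you intend to rely on, the standard testing package's benchmark support (go test -bench) is a better tool than a hand-rolled timing loop like this one.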
Lock-free structures typically have worse average performance, or even uniformly worse performance. The real difference is in how they behave under contention and how predictable they are. Lock freedom guarantees that at least one process makes progress, whereas with locks a process can acquire a lock and then get switched between cores or get GC paused or whatever, and everyone else waiting on that lock is stuck until it gets going again.
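To make the progress guarantee concrete, here is a minimal sketch of a CAS retry loop on a shared counter. The names casIncrement and runWorkers are hypothetical, as are the worker and iteration counts. The point is that a CAS here only fails because some other goroutine's CAS succeeded in the meantime, so somebody always moves forward even if any individual goroutine has to retry.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// casIncrement adds 1 to *addr with a CAS retry loop and reports how many
// attempts failed. Each failure means another goroutine changed the value
// between our load and our CAS, i.e. that goroutine made progress.
func casIncrement(addr *int64) (retries int) {
	for {
		old := atomic.LoadInt64(addr)
		if atomic.CompareAndSwapInt64(addr, old, old+1) {
			return retries
		}
		retries++
	}
}

// runWorkers starts the given number of goroutines, each incrementing a
// shared counter perWorker times, and returns the final count together with
// the total number of failed CAS attempts.
func runWorkers(workers, perWorker int) (total int64, retries int64) {
	var counter, retryCount int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				r := casIncrement(&counter)
				atomic.AddInt64(&retryCount, int64(r))
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&counter), atomic.LoadInt64(&retryCount)
}

func main() {
	total, retries := runWorkers(8, 100000)
	// total is always workers*perWorker; retries varies from run to run.
	fmt.Println("total:", total, "failed CAS attempts:", retries)
}
```

Nothing in this loop ever holds anything that another goroutine must wait for, so a goroutine that gets descheduled or paused mid-loop does not block the others. The flip side is the retry count: under heavy contention the wasted attempts are part of why average throughput can end up worse than a plain mutex.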