If the calling process is dynamically linked, that might slow fork() down a lot, since it has to copy the various page table setups, and then cost a tiny bit more in exec*() to tear them down. I'm not sure something like a shell has vfork() available as an option, but I saw major speed-ups when launching programs from Python using vfork vs. fork. Of course, a typical Python instance has many more .so's linked in than osh probably has.
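For anyone who wants to try the comparison themselves, here is a minimal sketch. Python does not expose vfork() directly, so this uses os.posix_spawn as a stand-in, on the assumption that the platform's libc implements it without a full page-table copy (true of many, but not all, libcs). The function names and the default run count are made up for illustration.

```python
import os
import sys
import time


def time_fork_exec(argv, runs=20):
    """Mean seconds per fork() + execv() + waitpid() cycle."""
    start = time.perf_counter()
    for _ in range(runs):
        pid = os.fork()
        if pid == 0:
            # Child: replace the process image. If exec fails, exit
            # immediately so the child never falls back into the loop.
            try:
                os.execv(argv[0], argv)
            finally:
                os._exit(127)
        os.waitpid(pid, 0)
    return (time.perf_counter() - start) / runs


def time_posix_spawn(argv, runs=20):
    """Mean seconds per posix_spawn() + waitpid() cycle."""
    start = time.perf_counter()
    for _ in range(runs):
        pid = os.posix_spawn(argv[0], argv, os.environ)
        os.waitpid(pid, 0)
    return (time.perf_counter() - start) / runs
```

Calling both with something like `[sys.executable, "-c", "pass"]` (or the full path to any small binary) from a parent that has imported a lot of extension modules should show whether the gap grows with the number of mapped libraries. Timings will be noisy, so more runs and a few repeats help.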
One could probably set up a simple linear regression to get a good estimate of the added cost per loaded .so on various OS-CPU combos, but I'm not aware of any write-up that does this. It'd be a good assignment for an OS class, though.
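The regression itself is only a few lines. A rough sketch, assuming you have already collected pairs of (number of loaded .so's, measured fork time) on one machine; `fit_line` is a hypothetical helper name, and this is plain ordinary least squares with nothing specific to any OS:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    xs: number of loaded shared objects per measurement
    ys: measured fork (or fork+exec) time per measurement
    Returns (slope, intercept); slope is the estimated cost per .so.
    """
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and the x/y cross term.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

The slope is then the per-.so estimate and the intercept is the fixed cost of the fork itself. Whether a straight line is actually a good model is something the measurements would have to show, since library size and number of mappings per library probably matter too.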