Modern CPUs execute micro-ops: simple, RISC-like internal instructions (e.g. load from a memory address into a register, add two registers, store a register to a memory address). The CPU's "decode" stage translates x86 instructions into micro-ops. The translation is often 1-to-1, but it can also be many-to-one (an x86 compare followed by a conditional jump can be fused into a single micro-op, a trick called macro-op fusion) or one-to-many (a single complex x86 instruction is cracked into several micro-ops).
On one CPU model, an x86 operation like a 256-bit add might translate into a single "256-bit add" micro-op. On another model that doesn't have a real 256-bit adder, the same operation might be translated into a series of micro-ops: a 128-bit add on the low half, a one-cycle wait for that add to finish, then a second 128-bit add on the high half that consumes the carry bit. The latency of the operation becomes 2 cycles instead of 1, but nothing else changes; the architectural result is identical.
Some x86 instructions are so complicated that they cannot be translated into a fixed-length series of micro-ops from a template. For example, integer division, square root, or the string-compare instructions are effectively loops with conditionals in them and don't run the same number of micro-ops every time. Intel implements these as small programs written in micro-ops, called microcode. The microcode lives in an on-chip ROM (with a small patchable RAM area), and when the decoder encounters such an instruction it runs the corresponding microcode routine. The OS doesn't need to help here; this is not emulation or software floating-point, it's just that the single instruction can take on the order of 200 clock cycles. (What this does to the out-of-order engine is another story.) Microcode can have bugs, and newer microcode updates, sent to the CPU at boot by the BIOS/UEFI and/or by the OS, patch them.
https://en.wikichip.org/wiki/macro-operation
https://en.wikichip.org/wiki/micro-operation
https://en.wikipedia.org/wiki/Microcode