The original Pentium, I believe, introduced a second pipeline that required a compiler to optimize for it to achieve maximum performance.
AMD actually made successful CPUs based on Berkeley RISC, similar to SPARC (they used register windows). The AMD K5 had this RISC CPU at its core. AMD bought NexGen and improved their RISC design for the K6 and then the Athlon.
Bob Colwell (mentioned elsewhere ITT) wrote a fascinating technical history of the P6: The Pentium Chronicles.
It's certainly not the same kind of OoO. They had register renaming¹, but only enough storage for a few renamed registers. And they didn't have any kind of scheduler.
The lack of a scheduler meant execution units still executed all instructions in program order. The only way you could get out-of-order execution was when instructions went down different pipelines. A floating point instruction could finish execution before a previous integer instruction even started, but you could never execute two floating point instructions out of order. Or two memory instructions, or two integer instructions.
The Pentium Pro, by contrast, had a full scheduler. Any instruction within the 40 μop reorder buffer could theoretically execute in any order, depending on when its dependencies were available.
Even on the later PowerPCs (like the 604) that could reorder instructions within an execution unit, the scheduling was still very limited. There was only a two-entry reservation station in front of each execution unit, and it would pick whichever entry was ready (and oldest). One entry could hold a blocked instruction for quite a while, while many later instructions passed it through the second entry.
And this two-entry reservation station scheme didn't even seem to work. The later PowerPC 750 (aka G3) and 7400 (aka G4) went back to single-entry reservation stations on every execution unit except for the load-store units (which stuck with two entries).
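To make the two-entry behaviour concrete, here is a minimal toy model of a single execution unit with a small reservation station. It is only a sketch of the "oldest ready entry issues" rule described above, not a model of any real PowerPC; the function name `simulate_station` and the `ready_at` input (the first cycle at which each instruction's operands are available) are made up for illustration.

```python
from collections import deque

def simulate_station(ready_at, entries=2):
    """Return the order in which instructions issue from one execution unit.

    ready_at[i] is the first cycle at which instruction i has its operands
    (a hypothetical input; real hardware discovers this dynamically).
    Instructions enter the station in program order, and each cycle the
    oldest ready entry, if any, issues.
    """
    pending = deque(range(len(ready_at)))  # not yet in the station
    station = []                           # entries, kept in program order
    issued = []
    cycle = 0
    while pending or station:
        # Fill free entries in program order.
        while pending and len(station) < entries:
            station.append(pending.popleft())
        # Issue the oldest entry whose operands are ready.
        for idx in station:
            if ready_at[idx] <= cycle:
                station.remove(idx)
                issued.append(idx)
                break
        cycle += 1
    return issued
```

With `ready_at = [5, 0, 0, 0]`, `simulate_station(ready_at, entries=2)` gives `[1, 2, 3, 0]`: the blocked instruction sits in one entry while the others pass through the second. With `entries=1` the result is `[0, 1, 2, 3]`, and with two blocked instructions at the front (`[5, 5, 0, 0]`) even the two-entry station degrades to program order.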
It's not until the PowerPC 970 (aka G5) that we see a PowerPC design with substantial reordering capabilities.
¹ Well, on the PowerPC 603, only the FPU had register renaming, but the POWER1 and all later PowerPCs had integer register renaming.
https://en.wikipedia.org/wiki/Tomasulo's_algorithm
Took a while until transistor budgets allowed it to be implemented in consumer microprocessors.
It wasn't a full pipeline, but large parts of the integer ALU and related circuitry were duplicated so that complex (time-consuming) instructions like multiply could directly follow each other without causing a pipeline bubble. Things were still essentially executed entirely in-order but the second MUL (or similar) could start before the first was complete, if it didn't depend upon the result of the first, and the Pentium line had a deeper pipeline than previous Intel chips to take most advantage of this.
The compiler optimisations, and similar manual code changes where the compiler wasn't bright enough, were to reduce the occurrence of instructions depending on the results of the instructions before, which would make the pipeline bubble come back as the subsequent instructions couldn't be started until the current one was complete. This was also a time when branch prediction became a major concern, and further compiler optimisations (and manual coding tricks) were used to help here too, because aborting a deep pipeline because of a branch (or just stalling the pipeline at the conditional branch point until the decision is made) causes quite a performance cost.
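A rough way to see why reordering helps is to count cycles on a toy in-order machine that issues one instruction per cycle and stalls whenever a source register is not ready yet. This is only a sketch: the latencies (3 cycles for `mul`, 1 for `add`), the register names, and the function `run_in_order` are all assumptions for illustration, not Pentium figures.

```python
def run_in_order(program, latency):
    """Cycle count for an in-order, single-issue, pipelined toy machine.

    program: list of (op, dest, sources) tuples.
    latency: dict mapping op -> cycles until its result is available
             (hypothetical values, not taken from any real CPU).
    """
    ready = {}   # register -> cycle at which its value becomes available
    cycle = 0    # earliest cycle at which the next instruction may issue
    for op, dest, srcs in program:
        # Stall until every source operand has been produced.
        start = max([cycle] + [ready.get(r, 0) for r in srcs])
        ready[dest] = start + latency[op]
        cycle = start + 1   # in-order: the next one issues no earlier
    return max(ready.values(), default=0)

LATENCY = {"mul": 3, "add": 1}  # illustrative only

# Each add immediately follows the mul it depends on.
dependent = [
    ("mul", "a", ("x", "y")),
    ("add", "b", ("a", "z")),
    ("mul", "c", ("p", "q")),
    ("add", "d", ("c", "r")),
]

# Same work, with the two independent muls issued back to back.
interleaved = [
    ("mul", "a", ("x", "y")),
    ("mul", "c", ("p", "q")),
    ("add", "b", ("a", "z")),
    ("add", "d", ("c", "r")),
]
```

Under these assumed latencies, `run_in_order(dependent, LATENCY)` is 8 cycles and `run_in_order(interleaved, LATENCY)` is 5: same instructions, but the second `mul` fills the bubble that the first `add` would otherwise wait in.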
As the CPU was not out of order, to execute two instructions per clock you had to pair them so that the second one was simple, and did not use the output of the first one. Existing code and most compilers around at the time were generally bad at this, but things like inner render loops in games could make a lot of use of it if you wrote them in assembly.
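The pairing idea can be sketched as a tiny checker. It models only the two conditions mentioned above (the second instruction is "simple" and does not read the first one's result); the real pairing rules had more conditions than this, and the `SIMPLE_OPS` set, the tuple format, and the function names are assumptions for illustration.

```python
# Hypothetical set of "simple" operations; not the real pairing table.
SIMPLE_OPS = {"mov", "add", "sub", "and", "or", "xor"}

def can_pair(first, second):
    """True if `second` may issue in the same clock as `first`.

    Instructions are (op, dest, sources) tuples. Only the two rules
    from the text are checked: second is simple, and it does not use
    the output of first.
    """
    _, dest1, _ = first
    op2, _, srcs2 = second
    return op2 in SIMPLE_OPS and dest1 not in srcs2

def count_issue_cycles(program):
    """Greedily pair adjacent instructions and count issue clocks."""
    cycles = 0
    i = 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            i += 2   # both instructions go in this clock
        else:
            i += 1   # instruction issues alone
        cycles += 1
    return cycles

# Load-then-use ordering: each add reads the register just written.
naive = [
    ("mov", "eax", ("mem1",)),
    ("add", "eax", ("eax", "ebx")),
    ("mov", "ecx", ("mem2",)),
    ("add", "ecx", ("ecx", "edx")),
]

# Hand-scheduled ordering: independent instructions sit next to each other.
scheduled = [
    ("mov", "eax", ("mem1",)),
    ("mov", "ecx", ("mem2",)),
    ("add", "eax", ("eax", "ebx")),
    ("add", "ecx", ("ecx", "edx")),
]
```

In this toy model `count_issue_cycles(naive)` is 3 and `count_issue_cycles(scheduled)` is 2, which is the kind of gain hand-written inner loops were chasing.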