Even for English the exact tweaking of line breaking and hyphenation is a problem that requires manual intervention from time to time. In mathematics research papers it’s not uncommon to see symbols that haven’t yet made it into Unicode. Look at the state of text on the web and you’ll encounter all these problems; even Google Docs gave in and now renders to a canvas.
PDF’s Unicode handling is indeed a big mess but it does have the ability to associate any glyph with an arbitrary Unicode string, for text extraction purposes, so there’s nothing to stop the program that generates the PDF from mapping the fi ligature glyph to the to-character string “fi”.