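Think of the hardware involved, and you'll understand a lot better why it takes so much longer to divide than to multiply. Both operations are done down at the Floating Point Unit (FPU) level, and even in the world of integer ALUs, the division circuit is a far busier place than the multiplication circuit. A multiplier can generate all of its partial products at once and sum them in a tree of adders, while a divider has to work out its quotient digits one after another, because each digit depends on the remainder left behind by the previous one. I would suspect this is only more painful in the world of floating point, as now the data isn't just a string of digits ordered from least to most significant, but is instead packed into the sign, exponent, and significand fields laid out by the IEEE 754 standard, so it has to be unpacked before the operation and normalized afterwards.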
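To make the difference concrete, here is a rough Python sketch of the two textbook integer algorithms: shift-and-add multiplication and restoring division. The function names and the 32-bit default width are my own choices for illustration, and real hardware typically uses faster schemes, so treat this as a model of the data dependencies rather than a description of any particular chip.

```python
def shift_add_multiply(a, b, width=32):
    """Unsigned multiply: every partial product depends only on the inputs."""
    # Each entry can be computed independently of the others, which is
    # why hardware can form them all at once and sum them in an adder tree.
    partials = [a << i for i in range(width) if (b >> i) & 1]
    return sum(partials)


def restoring_divide(dividend, divisor, width=32):
    """Unsigned restoring division: one quotient bit per iteration."""
    if divisor <= 0:
        raise ZeroDivisionError("divisor must be a positive integer")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in the given width")
    quotient = 0
    remainder = 0
    for i in range(width - 1, -1, -1):
        # Bring down the next dividend bit, most significant first.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        # The trial subtraction needs the remainder from the previous
        # iteration, so the steps cannot simply run side by side.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1 << i
    return quotient, remainder
```

The multiply builds its whole list of partial products straight from `a` and `b`, whereas the divide loop carries `remainder` from one iteration into the next. That serial chain is the part that is hard to speed up.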
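As for the round-off, it comes from the result having to fit into a fixed number of significand bits; whatever doesn't fit is lost. Strictly speaking, that is only truncation in IEEE 754's round-toward-zero mode. The default mode rounds to the nearest representable value, with ties going to the even significand.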
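You can check which way a particular quotient got rounded with a few lines of Python. This is a small sketch that assumes the interpreter's `float` is an IEEE 754 double in the default rounding mode, which holds on the usual platforms but is an assumption all the same; `rounding_direction` is a name I made up for the example.

```python
from fractions import Fraction


def rounding_direction(num, den):
    """Compare the floating-point quotient num/den with the exact value."""
    computed = Fraction(float(num) / float(den))  # exact value of the stored double
    exact = Fraction(num, den)                    # exact rational quotient
    if computed > exact:
        return "up"
    if computed < exact:
        return "down"
    return "exact"


print(rounding_direction(1, 10))  # stored result lies above the true 1/10
print(rounding_direction(1, 3))   # stored result lies below the true 1/3
print(rounding_direction(1, 4))   # 1/4 is representable, so nothing is lost
```

If the hardware simply chopped off the extra bits, a positive quotient could never come out larger than the true value, so the `1/10` case rules out plain truncation.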
Or were you asking about simulating floating point arithmetic using just integers?