Wonder why you don't seem to be getting the floating-point performance improvement on the Quadra that you should?
Do you do your work exclusively in the floating-point domain, or do you frequently need to convert to integer for outputting graphics, sound, or text?
While the Quadra is several times faster than the FX in floating-point addition, subtraction, multiplication, and division, it does not implement the 6888x FINTRZ instruction in hardware, so conversions from float to int take 5 to 7 times longer on the Quadra than on the FX. FINTRZ means "float-to-int, rounding toward zero", which is the rounding method specified by C. On the 68040, FINTRZ is instead emulated in software through a trap.
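For reference, the conversion in question is nothing more exotic than an ordinary C assignment from a floating-point value to an integer. The small sketch below shows the round-toward-zero behavior that C requires. The helper name c_float_to_long is made up for illustration and is not part of FloatToLong.a.

```c
/* Hypothetical helper: the plain C conversion that the timings call "l=x;". */
long c_float_to_long(long double x)
{
    long l;

    l = x;      /* C discards the fractional part, i.e. rounds toward zero */
    return l;
}
```

So 2.7 becomes 2 and -2.7 becomes -2. The result is only defined when the truncated value fits in a long.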
Below, we can see the relative timings:
*Timing*
                      FX     Quadra900  Quadra950
Instruction           time   time       time
l=x; /*FINTRZ*/       241    1624       1232
l=TruncXToLong(x);    631    118        71
x*=x;                 90     22         20
*Relative speed between processors with the same instruction (compare across)*
                      FX     Quadra900  Quadra950
Instruction           speed  speed      speed
l=x; /*FINTRZ*/       7X     1X         1.3X
l=TruncXToLong(x);    1X     5X         9X
x*=x;                 1X     4X         4.5X
This shows that while pure floating-point operations are 4X as fast on the Quadra as on the FX, conversion to integer is 7X slower.
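The relative speeds quoted in this note are just ratios of the raw timings. As a small sketch of that arithmetic, here it is in C. The array and function names are invented for illustration, and the numbers are copied from the *Timing* table in whatever units it uses.

```c
/* Column indices for the three machines in the *Timing* table. */
enum { FX = 0, Q900 = 1, Q950 = 2 };

/* Raw timings copied from the *Timing* table. */
const double fintrz_time[3] = { 241.0, 1624.0, 1232.0 }; /* l=x;               */
const double trunc_time[3]  = { 631.0,  118.0,   71.0 }; /* l=TruncXToLong(x); */
const double mul_time[3]    = {  90.0,   22.0,   20.0 }; /* x*=x;              */

/* How many times faster the second timing is than the first. */
double speedup(double slow_time, double fast_time)
{
    return slow_time / fast_time;
}
```

For example, speedup(fintrz_time[Q900], fintrz_time[FX]) is about 6.7, which is the "7X" in the FINTRZ row, and speedup(mul_time[FX], mul_time[Q900]) is about 4.1, which is the "4X" in the multiply row.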
Another way of looking at this is to compare relative instruction timings on the same processor. Below, we show relative speeds of floating-point and conversion operations on the same processor.
*Relative Speed on the same processor (compare down)*
                      FX     Quadra900  Quadra950
Instruction           speed  speed      speed
l=x; /*FINTRZ*/       2.6X   1X         1X
l=TruncXToLong(x);    1X     14X        17X
x*=x;                 7X     73X        61X
This says that 70 floating-point operations on the Quadra 900 take less time than one float-to-int conversion! You could probably invert a 3x3 matrix in that time!
Can you imagine????
The accompanying code (TruncXToLong, found in FloatToLong.a) will speed up float-to-int conversions on the Quadra by over an order of magnitude, making them only 5X slower than multiplies, rather than over 60X!
Unfortunately, though, this slows down the FX's conversion by a factor of 2.6, so you can't optimize for both the FX and the Quadra with the same code. The best solution, of course, would be to implement the FINTRZ instruction in hardware in the '040, but until that time the enclosed code will improve your demos.
Note that there are three procedures implemented in FloatToLong.a:
long RoundXToLong(long double x);
long TruncXToLong(long double x);
long FloorXToLong(long double x);
These correspond to the 68881/2 rounding modes of
• round to nearest
• round to zero
• round to -∞
The only rounding mode not used is "round to +∞", which would implement a CeilXToLong() function. If you need it, I'm sure you can figure out how to modify the enclosed code.
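If you want something to check results against, the four rounding behaviors can be written out in plain C on top of C's truncating conversion. This is only a reference sketch of the semantics, not the technique used in FloatToLong.a (which works by changing the FPU rounding mode), and the Ref-prefixed names are hypothetical. It assumes the input fits in a long and that "round to nearest" breaks ties toward the even integer, as IEEE arithmetic does.

```c
/* Round toward zero: this is what a C cast already does. */
long RefTruncXToLong(long double x)
{
    return (long)x;
}

/* Round toward minus infinity: step down if truncation went up. */
long RefFloorXToLong(long double x)
{
    long t = (long)x;
    return (t > x) ? t - 1 : t;
}

/* Round toward plus infinity: step up if truncation went down. */
long RefCeilXToLong(long double x)
{
    long t = (long)x;
    return (t < x) ? t + 1 : t;
}

/* Round to nearest, ties going to the even integer. */
long RefRoundXToLong(long double x)
{
    long f = RefFloorXToLong(x);
    long double frac = x - f;          /* always in the range [0, 1) */

    if (frac > 0.5L) return f + 1;
    if (frac < 0.5L) return f;
    return (f % 2 == 0) ? f : f + 1;   /* exact tie: pick the even neighbor */
}
```

For -2.7 these give -2 (trunc), -3 (floor), -2 (ceil), and -3 (round). Being plain C, the truncating casts here still go through the slow conversion on the Quadra, so this is for testing, not for speed.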
The enclosed code was written for MPW. It is trivial to convert it to THINK C.
Additionally, in THINK C, you can optimize it even further by using "fmove.s" instead of "fmove.x" if your data is single-precision "float" rather than extended-precision "long double". This works because THINK C passes float and double arguments without conversion (as provided for by ANSI), whereas MPW converts all floating-point arguments to extended. You can optimize it even more by putting it inline. Be careful about pulling the change of rounding mode out of the inner loop, though, because it affects all floating-point computations (multiplication, square root, etc.), not just float-to-int conversion.
None of the timing numbers have been gathered in an unbiased way. You'll probably get the same numbers if you do a tight loop executing the same instruction over 100,000 times, but who does that on a regular basis? Your mileage may vary. All I know is that when it is plugged into some of our 2D and 3D rendering code in ATG, it feels like it's rendering 3 times faster. It's hard to come by that kind of speed improvement with careful code and algorithm optimization. This one's a no-brainer.