Previous Table of Contents Next


Listing 63.2 1 L63-2.ASM

; unoptimized dot product; 17 cycles
        fld     [vec0+0]        ;starts & ends on cycle 0
        fmul    [vec1+0]        ;starts on cycle 1
        fld     [vec0+4]        ;starts & ends on cycle 2
        fmul    [vec1+4]        ;starts on cycle 3
        fld     [vec0+8]        ;starts & ends on cycle 4
        fmul    [vec1+8]        ;starts on cycle 5
                                ;stalls for cycles 6-7
        faddp   st(1),st(0)     ;starts on cycle 8
                                ;stalls for cycles 9-10
        faddp   st(1),st(0)     ;starts on cycle 11
                                ;stalls for cycles 12-14
        fstp    [dot]           ;starts on cycle 15,
                                ; ends on cycle 16

Listing 63.3 L63-3.ASM

;optimized dot product; 15 cycles
        fld     [vec0+0]        ;starts & ends on cycle 0
        fmul    [vec1+0]        ;starts on cycle 1
        fld     [vec0+4]        ;starts & ends on cycle 2
        fmul    [vec1+4]        ;starts on cycle 3
        fld     [vec0+8]        ;starts & ends on cycle 4
        fmul    [vec1+8]        ;starts on cycle 5
        fxch    st(1)           ;no cost
        faddp   st(2),st(0)     ;starts on cycle 6
                                ;stalls for cycles 7-8
        faddp   st(1),st(0)     ;starts on cycle 9
                                ;stalls for cycles 10-12
        fstp    [dot]           ;starts on cycle 13,
                                ; ends on cycle 14

The Cross Product

When last we looked at the cross product, we found that it’s handy for generating a vector that’s normal to two other vectors. The cross product is calculated as [u2v3-u3v2 u3v1-u1v3 u1v2-u2v1]. The theoretical minimum cycle count for the cross product is 21 cycles. Listing 63.4 shows a straightforward implementation that calculates each component of the result separately, losing 15 cycles to stalls.

Listing 63.4 L63-4.ASM

;unoptimized cross product; 36 cycles
        fld     [vec0+4]         ;starts & ends on cycle 0
        fmul    [vec1+8]         ;starts on cycle 1
        fld     [vec0+8]         ;starts & ends on cycle 2
        fmul    [vec1+4]         ;starts on cycle 3
                                 ;stalls for cycles 4-5
        fsubrp  st(1),st(0)      ;starts on cycle 6
                                 ;stalls for cycles 7-9
        fstp    [vec2+0]         ;starts on cycle 10,
                                 ; ends on cycle 11
        fld     [vec0+8]         ;starts & ends on cycle 12
        fmul    [vec1+0]         ;starts on cycle 13
        fld     [vec0+0]         ;starts & ends on cycle 14
        fmul    [vec1+8]         ;starts on cycle 15
                                 ;stalls for cycles 16-17
        fsubrp  st(1),st(0)      ;starts on cycle 18
                                 ;stalls for cycles 19-21
        fstp    [vec2+4]         ;starts on cycle 22,
                                 ; ends on cycle 23

        fld     [vec0+0]         ;starts & ends on cycle 24
        fmul    [vec1+4]         ;starts on cycle 25
        fld     [vec0+4]         ;starts & ends on cycle 26
        fmul    [vec1+0]         ;starts on cycle 27
                                 ;stalls for cycles 28-29
        fsubrp  st(1),st(0)      ;starts on cycle 30
                                 ;stalls for cycles 31-33
        fstp    [vec2+8]         ;starts on cycle 34,
                                 ; ends on cycle 35

We couldn’t get rid of many of the stalls in the dot product code because with six inputs and one output, it was impossible to interleave all the operations. However, the cross product, with three outputs, is much more amenable to optimization. In fact, three is the magic number; because we have three calculation streams and the latency of FADD, FSUB, and FMUL is 3 cycles, we can eliminate almost every single stall in the cross-product calculation, as shown in Listing 63.5. Listing 63.5 loses only one cycle to a stall, the cycle before the first FST; the relevant FSUB has just finished on the preceding cycle, so we run into the extra cycle of latency associated with FST. Listing 63.5 is more than 60 percent faster than Listing 63.4, a striking illustration of the power of properly managing the Pentium’s FP pipeline.

Listing 63.5 L63-5.ASM

;optimized cross product; 22 cycles
        fld       [vec0+4]        ;starts & ends on cycle 0
        fmul      [vec1+8]        ;starts on cycle 1
        fld       [vec0+8]        ;starts & ends on cycle 2
        fmul      [vec1+0]        ;starts on cycle 3
        fld       [vec0+0]        ;starts & ends on cycle 4
        fmul      [vec1+4]        ;starts on cycle 5
        fld       [vec0+8]        ;starts & ends on cycle 6
        fmul      [vec1+4]        ;starts on cycle 7
        fld       [vec0+0]        ;starts & ends on cycle 8
        fmul      [vec1+8]        ;starts on cycle 9
        fld       [vec0+4]        ;starts & ends on cycle 10
        fmul      [vec1+0]        ;starts on cycle 11
        fxch      st(2)           ;no cost
        fsubrp    st(5),st(0)     ;starts on cycle 12
        fsubrp    st(3),st(0)     ;starts on cycle 13
        fsubrp    st(1),st(0)     ;starts on cycle 14
        fxch      st(2)           ;no cost
                                  ;stalls for cycle 15
        fstp      [vec2+0]        ;starts on cycle 16,
                                      ; ends on cycle 17
        fstp      [vec2+4]        ;starts on cycle 18,
                                  ; ends on cycle 19
        fstp      [vec2+8]        ;starts on cycle 20,
                                  ; ends on cycle 21

Transformation

Transforming a point, for example from worldspace to viewspace, is one of the most heavily used FP operations in realtime 3-D. Conceptually, transformation is nothing more than three dot products and three additions, as I will discuss in Chapter 61. (Note that I’m talking about a subset of a general 4×4 transformation matrix, where the fourth row is always implicitly [0 0 0 1]. This limited form suffices for common transformations, and does 25 percent less work than a full 4×4 transformation.)

Transformation is calculated as:

-  -    -                -  -   -
 v1      m11 m12 m13 m14    u1
 v2  =   m21 m22 m23 m24    u2
 v3      m31 m32 m33 m34    u3
 1       0   0   0   1     1
-  -    -                -  -   -

or

v1 = m11u1 + m12u2 + m13u3 + m14
v2 = m21u1 + m22u2 + m23u3 + m24
v3 = m31u1 + m32u2 + m33u3 + m34.


Previous Table of Contents Next

Graphics Programming Black Book © 2001 Michael Abrash