-----------------------------------------------------------------
FLOATING-POINT DIVISION WITH OPTIONAL CHECKING
TO ENSURE FULL RESULT PRECISION
-----------------------------------------------------------------

Questions pertaining to documents and code related to the Intel FDIV software patch can be directed to software-support@intel.com or FAXed to (408)765-5165 attention FDIV PATCH.

CONTENTS
--------

REVISION HISTORY
OBJECTIVE
BACKGROUND
PATCH PROCESS STRATEGY
THE CORE ALGORITHM
RECOMPILATION
ISVs - END-USER MODIFICATIONS
PATCH IMPLEMENTATION
IMPLEMENTATION VARIATIONS
LIBRARIES
TECHNICAL NOTES
    Running the Patch on Processors Preceding the Intel486(tm) Processor
    Detecting the Floating-Point Unit
    Mnemonic Interpretation
    Scaling Factor
    Scaling Exceptions
    Precision Loss
    FPU Status Word
RELATED PROOFS
    Safety of the Logarithmic Instructions FYL2X and FYL2XP1
        FYL2X
        FYL2XP1
    Identifying Problematic Divisors
TESTING AND VALIDATION
CYCLE COST CONSIDERATIONS
APPENDIX A  Division with optional checking functions summary
APPENDIX B  Division with optional checking files summary
APPENDIX C  Patch code revision history

REVISION HISTORY
----------------

011395
------

Added revision history section to patch document.

Retitled the section "Running the Patch on an Intel386(tm) Processor" to "Running the Patch on Processors Preceding the Intel486(tm) Processor."

Added text indicating the need to check for the presence of a floating-point unit.

Added Table 2, Execution times of FDIV patch with memory operands, under CYCLE COST CONSIDERATIONS.

Added APPENDIX C containing patch code revision history.

OBJECTIVE
---------

The following document describes an Intel-approved software approach to floating-point division that utilizes proved software algorithms and existing hardware instructions. Using this approach overcomes the possibility of a reduction in precision due to a floating-point division flaw in some steppings of the Pentium(tm) processor.

The objectives of this approved approach are to provide a method for floating-point division that

1.  Ensures floating-point division result precision on all Intel386 processors and beyond, in all precision modes.

2.  Has been optimized for efficient performance on current and future Intel processors. This optimization has been accomplished through the hand-coding of assembly routines that include such techniques as the elimination of branching code and the avoidance of CPU stalls.

A number of software patches have been proposed that may be suited to avoiding a potential division flaw. Note that Intel's proposed software workaround, or patch, does not disable the floating-point unit on susceptible Pentium(tm) processors. Hand-coded and optimized assembly routines were developed that continue to utilize the hardware floating-point division operation, with additional operations executed only in the rare case that a given division is known to be susceptible to a floating-point division flaw.

The software patch presented is intended to be implemented at the compilation and software development levels. End-users should use recompiled code where available. Recompiled code with a patch such as the one that Intel is providing will allow the fastest patched executable speed. Patches that interrupt executables to override floating-point division operations with alternate solutions incur the added expense of interrupts. The disabling of the floating-point unit as a patch for the floating-point division flaw will slow all floating-point calculations, including those such as FADD that are unaffected by the FDIV flaw. The utilization of other patches, such as those that disable the floating-point unit, can be of use to those who must execute code that has not been coded or recompiled with incorporation of a software patch.

BACKGROUND
----------

Certain steppings of the Intel Pentium(tm) processors have exhibited a flaw in the floating-point unit that may result in some loss of precision in division results. This precision loss can manifest itself in bit positions 13 and beyond of the mantissa of a floating-point division result and may occur in any of the three (single, double, and extended) precisions, independent of rounding mode. The floating-point division flaw can affect floating-point division instructions such as FDIV and FDIVR as well as functions utilizing hardware division instructions including FPTAN, FPATAN, FPREM, and FPREM1. Because the flaw affects a maximum of 5 sparsely populated divisor value ranges out of 1024 possible ranges and particular combinations of operands, precision is only affected in approximately 1 of 9 billion randomly fed floating-point division operations.

Intel is determined to provide a safeguard against floating-point division inaccuracies expediently and on all processors. To accomplish these goals, a unique collaboration was formed between Intel and experts in the industry. Analysis and software patches for the Pentium(tm) processor floating-point division flaw have been devised at Intel utilizing expert input from Cleve Moler, Terje Mathisen, Tim Coe, and Peter Tang. Coe has been able to precisely simulate the FDIV flaw and provide proofs of correctness for the techniques described in this document.

Moler developed a software adjustment technique. He has implemented an FDIV patch in MATLAB and is currently verifying the result. Mathisen devised a table-driven check of floating-point divisor values, wrote an initial version of software patch code, and assisted Intel with instruction-level optimization in assembly code. Intel extended these techniques further to provide implementation flexibility to software developers and to minimize the clock count of the floating-point division precision correction.
The resulting workaround can be implemented by replacing each floating-point division instruction with a macro that expands in line.

PATCH PROCESS STRATEGY
----------------------

To avoid slowing the execution of already correct floating-point division operations, the software patch first asserts the possibility of an imprecise result before executing a correction.

The recommended software solution involves a code expansion of each FDIV-type instruction into a macro that includes a call to an error checking and adjustment routine, described in more detail later in this document. The routine includes several steps needed to eliminate any potential precision loss from the Pentium(tm) floating-point division flaw, as follows.

1.  Test a global flag to indicate whether or not a processor is flawed. If the processor is not a Pentium(tm) processor containing the floating-point division flaw, a normal floating-point hardware division is applied to the original operands. Otherwise,

2.  Perform an operand range check for those processors which contain the flaw.

    a.  If the range test indicates that a divisor is not in a susceptible numeric range, return the result of a normal floating-point hardware division applied to the original operands. Otherwise,

    b.  If the range test indicates that a divisor is in a susceptible numeric range,

        1.  Perform a software adjustment of the numerator and denominator.
        2.  Execute a hardware division with the adjusted operands.
        3.  Return the full precision result of the division.

The software patch should be generated and executed by default rather than under a specific option flag. In particular, "blended code," code targeted towards multiple processors, should incorporate the software fix.
This guarantees that even execution of applications not optimized for the Pentium(tm) processor will be protected against floating-point precision reduction should they unexpectedly be executed on a Pentium(tm) processor with the floating-point division flaw. An option should be provided to disable the patch during execution.

In order to accommodate the requests of implementors of the Intel software patch, several variations to the basic software correction and its implementation have been developed by Intel. Developers can choose a suggested option that is technically correct in their environment, minimally intrusive to their current production schedules, and allows for the fastest turnaround. If a developer requires a solution not encompassed by the existing patches, Intel can assist in developing appropriate techniques.

THE CORE ALGORITHM
------------------

The core of the division with optional checking process is accomplished through several steps and should not be modified. The core algorithm need only be executed on flawed Pentium(tm) processors. The existence of a Pentium(tm) processor with the floating-point division processor flaw is identified at run time by executing a floating-point division instruction with operands known to induce a loss of precision.

During the first part of the core algorithm, a range check is performed. Only divisors in identifiable ranges indicate division operations susceptible to a floating-point division result precision loss. An early proof indicated that a maximum of 5 out of 16 ranges, or 31% of ranges, of divisor values identify susceptible divisions. Tim Coe and Peter Tang have proved that there are a maximum of 5 out of 1024 ranges of divisor values that constitute susceptible divisions, or a potential of less than 1%. Peter Tang independently verified a proof of 5 susceptible numeric bands out of 128.
That proof currently supports the core algorithm as it was available at the start of patch code testing and validation.

Consider this representation of a normal divisor.

    +------+-----------+-----+--------+-------------------------+
    | sign |    exp    | 1.  |  1111  | 111 . . . . . . . . . . |
    +------+-----------+-----+--------+-------------------------+
                       |  |  |(RANGE) |                         |
                       |  |  +----------- mantissa -------------+
                       |  +-- zero if denormal

                              Figure 1.

RANGE refers to the 4 bits of the mantissa seen in the figure above. In order for a reduction in precision to possibly affect a floating-point division result, these 4 bits must be equal to the decimal values 1, 4, 7, 10, or 13. Furthermore, the subsequent three bits must all be ones (i.e. equal to a 3-bit value of 7 decimal).

Consider the floating-point number 14.999999. This number is known to be a divisor susceptible to a reduced precision division. Its single-precision hexadecimal value is 416FFFFF. Its binary representation can be seen in Figure 2.

    +------+-----------+-----+--------+-------------------------+
    |  0   | 1000 0010 | 1.  |  1101  | 111 1111 1111 1111 1111 |
    +------+-----------+-----+--------+-------------------------+
                       |  |  |(RANGE) |                         |
                       |  |  +----------- mantissa -------------+
                       |  +-- hidden bit
                              zero if denormal

                              Figure 2.

The first part of the range-check algorithm tests to see if the divisor in question is a denormal number. If the divisor value is found to be denormal, it is shifted 2^64 to the left to normalize the value before continuing with the range check process.

Next, the three bits following RANGE are masked. If any of those bits equals zero, the core algorithm executes a hardware floating-point division with the original operands, then exits. Using the Figure 2 example, 14.999999 is not denormal and the three bits following RANGE are all ones. Therefore, the algorithm continues.

Next, an efficient table-lookup scheme developed by Terje Mathisen is employed to detect a divisor whose RANGE value is 1, 4, 7, 10, or 13.
A table is initialized with 16 elements, as seen in Figure 3. Positions 1, 4, 7, 10, and 13 are set to one while the remaining positions are set to zero.

    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |0|1|0|0|1|0|0|1|0|0|1|0|0|1|0|0|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    element 0              element 15

                Figure 3.

When the RANGE value is used as an index into the table, a one or zero is returned. If a zero is returned, no precision reduction will occur in a division result, and the core algorithm executes a hardware floating-point division with the original operands, then exits. In the Figure 2 example, the 4 RANGE bits equal 13. Indexing 13 into the table (table[13]) returns a value of one, so the core algorithm is not yet complete.

At this point in the algorithm, it is known that the divisor falls into a numeric range that may be susceptible to the Pentium(tm) processor floating-point division flaw. Both numerator and divisor are multiplied by 15/16, which means that the floating-point division itself is multiplied by 1 ((15/16) / (15/16)). The scaling factor of 15/16 shifts the operand values of the floating-point division into ranges known to be immune to the floating-point division flaw. That is, 1 is shifted to 0, 4 to 3, 7 to 6, and so on. A hardware floating-point division with the scaled operands is performed and the algorithm exits.

In single and double precision modes, division results will have full precision and conform to IEEE standards. In extended modes, at most one bit of precision may be lost due to the extra floating-point operations on the operands.

RECOMPILATION
-------------

The Intel software patch can optimally be implemented at the compilation level. The workaround can be implemented in a compiler by following the guidelines under the section entitled PATCH IMPLEMENTATION. Compilers should be modified to generate a software patch in place of each originally generated FDIV-type instruction as described later.
Source-level applications can then be recompiled to incorporate the Intel workaround.

Though a compiler should generate patch code by default, an option should be available to disable patch code generation during compilation. When such an option is specified, a compiler should still be able to check for the presence of the floating-point division flaw and issue a warning if present.

ISVs - END-USER MODIFICATIONS
-----------------------------

The presented software solution can be validly applied at the source level; however, modifying an application at the source-code level in this way can be error-prone and catalyze problems in future generations of the application. Implementation of the division_with_checking routine directly within a source-level application is not as straightforward as it is during assembly code generation of an application. Scanning specifically for floating-point division symbols in a C program is a complicated process, for example. Multiple divisions in a statement or within conditional blocks prevent the automatic expansion of a macro in place of a normal floating-point division expression. The software patch should preferably be implemented at the compilation level, where operations have been broken into single independent instructions and immediately succeeding function calls can therefore be tolerated.

It is recommended that ISVs recompile their applications with compilers modified to generate the FDIV-instruction patch expansion. This will guarantee that application users receive the precision available from processors not containing the floating-point unit flaw with minimal impact on application execution speed.

Individuals with access to their own applications' low-level assembly source codes can incorporate the provided macros for floating-point division with checking much as a compiler would before code output.

If during a particular run an individual is certain that an application will not be affected by a reduction in floating-point division accuracy, the fdiv_chk_flag global variable can be turned off to maximize execution speed of that application.

PATCH IMPLEMENTATION
--------------------

A summary of some precision checking and restoring (division_with_checking) functions is provided in Appendix A. Refer to appropriate assembly source files for the actual Intel division_with_checking routines. Throughout the remainder of this document, the token division_with_checking is used to indicate any of the floating-point division patch routines provided by Intel.

The preferred method of retaining precision involves compiler inclusion of a conditionally executed division_with_checking routine at each FDIV instruction originally generated. This process can be expanded by compilers at a very late compilation phase so as to preserve compiler state and minimize the changes required to existing compilers. Assembly code developers can directly incorporate calls to the patch code in their routines.

This late-phase design simplifies the implementation process for compiler vendors by avoiding the disruption of the quality assurance process that would be caused by mid-compiler modifications. In addition, compiler optimization will proceed normally since a call will not be inserted in place of an FDIV in intermediate code. Insertion of a call during earlier phases of compilation could potentially turn off optimization around floating-point divisions.

In order to incur minimal cycle cost, the division_with_checking routines should be called only during execution of applications on individual processors known to exhibit the floating-point division flaw. To this end, the software patch process includes the use of conditional code enclosing the appropriate division_with_checking routines.

In the assembly code example presented in this section, the first operand of a two-operand instruction is the result destination. Consider an FDIVR (reverse division) instruction. The format used in section examples is

    FDIVR DEST, SRC         DEST <- SRC / DEST

The actual instruction used in this section's example will be

    opcode      mnemonic
    d8 fd       fdivr st, st(5)

The top of the stack will hold the result of a floating-point division of the fifth (zero-based) stack position by the top of the stack.

At the affected FDIV-type instruction, macro expansion and generation of a conditional block around the fdivr instruction should take place as seen below.

    if (fdiv_chk_flag == 1) {
        fdivr st, st(5)
    }
    else
        division_with_checking FDIV expansion

The fdiv_chk_flag global variable is a three-state variable whose value is set within the division_with_checking routines. It is initialized to 0 when declared, set to 1 on processors not requiring floating-point division with checking, or set to -1 when the executing Pentium(tm) processor exhibits the floating-point division flaw.

The value of fdiv_chk_flag is set by the function fdiv_detect during the first invocation of a division_with_checking routine. The fdiv_detect function stores a value to fdiv_chk_flag depending upon the status of the executing processor. It also returns the new value of fdiv_chk_flag to the calling function. The division_with_checking routine is called when fdiv_chk_flag is not equal to 1, and proceeds when fdiv_chk_flag is equal to -1.

            cmp   fdiv_chk_flag, $1     ; compare global to 1
            jne   L1                    ; if not 1 jump to L1
            fdivr st, st(5)             ; else do hw fdivr
            jmp   L2                    ; then jump to L2
    L1:     division_with_checking FDIVR expansion
                                        ; do fdivr w/checking
                                        ; during the first
                                        ; invocation, this
                                        ; routine also sets
                                        ; fdiv_chk_flag
    L2:

Next, insert a call to an appropriate division_with_checking routine (in this example, fdiv_r) within the conditional block in addition to the original FDIV-type instruction.
The code for the current example then resembles the sequence below.

            cmp   fdiv_chk_flag, $1     ; compare global to 1
            jne   L1                    ; if not 1 jump to L1
            fdivr st, st(5)             ; else do hw fdivr
            jmp   L2                    ; then jump to L2
    L1:     call  fdiv_r
    L2:

Note that the caller must save and restore condition codes and may need to save and restore register contents around calls to division_with_checking functions. The eflags register contents are destroyed by the division_with_checking routines, and the eax register contents might be overwritten as specified below.

When performing register-register divisions, as in the current example, the eax register is used to convey information to the division_with_checking procedure to be executed. Therefore, the caller must at times save and restore eax around invocations of division_with_checking functions. Alternatively, the eax information can be pushed onto the stack with simple modifications to the relevant division_with_checking routines.

For a division with register operands, one operand is taken from the top of the floating-point stack, and the stack position number of the second operand needs to be recorded in the eax register along with additional information about the intended floating-point division as pictured in Figure 4. Only the 6 lowest bits of the eax register are used for this initialization so potential operation in 16-bit mode is still valid.

-----------------------------------------------------------------

      5   4   3   2   1   0       Bit position
    +---+---+---+---+---+---+
    |   |   |   |   |   |   |
    +---+---+---+---+---+---+
    |           | |   |   |
    +-----------+ |   |   +-- pop (DIVP)
      Indicates   |   +-- reverse (DIVR)
      stack       +-- True  - result is at ST(bits 3 to 5)
      position        False - result is at ST(0)

Figure 4.
Register Initialization of eax for register-register FDIV patch
-----------------------------------------------------------------

Alternatively, when floating-point divisions involve memory operands, the associated division_with_checking routines expect that memory operands have been pushed onto the top of the user stack. This process avoids prefix overrides. Push memory instructions can be used to accomplish the memory operand setup on the user stack, eliminating the need for additional register assignment.

In the given example, the lowest six bits of eax should be set to 101010 (decimal 42). That is, refer to position 5 (101) in the stack for the second operand, the top of stack will hold the result (0), reverse division is specified (1), and no pop will be executed (0).

            cmp   fdiv_chk_flag, $1     ; compare global to 1
            jne   L1                    ; if not 1 jump to L1
            fdivr st, st(5)             ; else do hw fdivr
            jmp   L2                    ; then jump to L2
    L1:     push  eax                   ; save eax
            mov   eax, $42              ; load eax for fdiv_r
            call  fdiv_r                ; do div w/checking
            fstp  result                ; get the div result
            pop   eax                   ; restore eax
    L2:

The division_with_checking routines such as fdiv_r return the division result on the floating-point stack.

In UNIX format, the destination and source operands are reversed. Hence, the preceding example would be translated to the subsequent code.

    **      cmp   $1, fdiv_chk_flag     ; compare global to 1
            jne   L1                    ; if not 1 jump to L1
    **      fdivr %st(5), %st           ; else do hw fdivr
            jmp   L2                    ; then jump to L2
    L1:
    **      push  %eax                  ; save eax
    **      mov   $42, %eax             ; load eax for fdiv_r
            call  fdiv_r                ; do div w/checking
            fstp  result                ; get the div result
    **      pop   %eax                  ; restore eax
    L2:

Asterisks indicate instructions modified during translation to UNIX format.

IMPLEMENTATION VARIATIONS
-------------------------

The core algorithm that performs the divisions and assures accuracy should not be modified.
There are subtle end cases that must be accounted for to provide results equivalent to the FDIV operation executing on processors not containing the floating-point division flaw.

The previous sections describe the preferred software solution for overcoming the Pentium(tm) processor floating-point division flaw. This solution is likely to accommodate most environments. A variation of the preferred resolution may be necessary, as described in the following paragraphs.

If the fdiv_chk_flag global can be set in a program's startup code, the checking routines can be modified to eliminate the setting and testing of the fdiv_chk_flag global variable. This eliminates one mandatory call to a division_with_checking routine and an additional compare within checking routines on processors that are not susceptible to the floating-point division flaw.

There may be cases where testing a global variable is not practical and the first test of the software patch will be to see if the divisor falls into a problematic range.

Intel has developed 16-bit DOS, 32-bit DOS, and UNIX versions of the division_with_checking assembly routines. The code operates in extended-precision mode. The control word is saved and restored within the division_with_checking code. If it is known that the processor is always operating in 80-bit precision mode, the control word save and restore code can be deleted.

Performing the scaling and result adjustment for all floating-point divisions falling within susceptible ranges without regard to the presence of a floating-point division flaw on the executing processor is never recommended as this needlessly increases processing time.

LIBRARIES
---------

Code needs to be modified so that floating-point instructions are replaced with floating-point division macros that ensure full-precision division results. It is essential that libraries as well as hand-coded and compiler-generated code be made safe.
FDIV may occur in many library routines, especially the hyperbolic routines.

Other instructions including transcendentals that may be present in library code need to be addressed. These are currently known to include FPTAN, FPATAN, FPREM, and FPREM1. The logarithmic instructions FYL2X and FYL2XP1 are safe from the floating-point division flaw as is proved in the TECHNICAL NOTES section.

An implementation of FPTAN using a hardware division instruction as well as a 64-bit software version of FPATAN are available. Software implementations of FPREM and FPREM1 have also been developed.

TECHNICAL NOTES
---------------

Intel has developed 16-bit DOS, 32-bit DOS, and UNIX versions of the division_with_checking assembly routines. The code operates in extended-precision mode and is designed for Intel386(tm) processors and beyond.

Hardware behavior is not identically mimicked in the FDIV workaround code. This section includes explanations of technical details and a summary of the differences between straightforward hardware division and hardware division within the context of the proposed software workaround.

Running the Patch on Processors Preceding the Intel486(tm) Processor
--------------------------------------------------------------------

The Intel patch code has been specifically written to run on Intel486(tm) processors and beyond. It should be guaranteed that the fdiv_chk_flag global variable is set before attempting to execute any workaround code on processors earlier than the Intel486(tm).

The fdiv_detect routine does a check for the Pentium(tm) processor floating-point division flaw with a sample division and can be run on all processors in the Intel Architecture family having floating-point units. This routine initializes fdiv_chk_flag.

Prior to executing fdiv_detect, the presence of a floating-point unit must be established.
This cannot be established in the fdiv_detect routine as such a check would require the incorrect reinitialization of the floating-point unit when checking for the FDIV flaw.

Detecting the Floating-Point Unit
---------------------------------

Since some operating systems provide means of disabling the floating-point unit, applications need to be aware that the information they need is whether the OS has enabled the FPU, rather than whether the FPU exists.

Old 16-bit binaries typically handled the absence of an FPU with built-in emulators. Most 32-bit operating systems provide emulation capability so applications do not need to provide their own. Hence, if a user requests that the operating system turn off the floating-point unit on a 32-bit operating system, floating-point operations will be emulated by the 32-bit OS. Alternatively, if a user requests that the operating system turn off the floating-point unit on a 16-bit operating system, floating-point instructions will be skipped.

16-bit applications should continue to use the FINIT sequence to detect if the floating-point unit is present. For compatibility with older processors, the CPUID instruction should not be used to check for an FPU.

For 32-bit applications where most environments already provide FPU functionality by default, it is not necessary for applications to test for the presence of the FPU explicitly.

Mnemonic Interpretation
-----------------------

Mnemonics, opcodes, and their descriptions adhere to the Pentium(tm) Processor User's Manual. In particular, the mnemonic FDIVRP ST(x), ST represents the opcode DE F0+x and the mnemonic FDIVP ST(x), ST represents the opcode DE F8+x. Note that the UNIX assembler erroneously attaches each of these mnemonics to the other's opcode (e.g. FDIVRP ST(x), ST represents DE F8+x).
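
The two conventions can be tabulated in a short sketch. The function names are hypothetical and serve only to illustrate the mapping; the opcode values are the ones stated in this section.

```python
def fdivp_opcode(x):
    """Bytes for FDIVP ST(x), ST in the Intel manual convention (DE F8+x)."""
    if not 0 <= x <= 7:
        raise ValueError("stack position must be in the range 0..7")
    return bytes([0xDE, 0xF8 + x])


def fdivrp_opcode(x):
    """Bytes for FDIVRP ST(x), ST in the Intel manual convention (DE F0+x)."""
    if not 0 <= x <= 7:
        raise ValueError("stack position must be in the range 0..7")
    return bytes([0xDE, 0xF0 + x])


def unix_assembler_opcode(mnemonic, x):
    """Bytes produced for a mnemonic when the two meanings are swapped."""
    swapped = {"fdivp": fdivrp_opcode, "fdivrp": fdivp_opcode}
    return swapped[mnemonic](x)
```

When translating patch code between the two formats, comparing the emitted bytes rather than the mnemonics is one way to confirm that the intended division direction was preserved.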

Scaling Factor
--------------

The scaling factor of 15/16 was chosen to guarantee that an operand lying within one of the five flaw-susceptible ranges of numbers will be scaled to a safe region. This guarantee is trivially proven by testing the endpoints of the five known numeric bands. Refer to Statistical Analysis of Floating Point Flaw In the Pentium(tm) Processor (1994) (Sharangpani, Barton) for the boundaries of the potentially unsafe regions.

Scaling Exceptions
------------------

Because the scaling factor is less than one, it introduces the possibility of an underflow when the numerator is multiplied by it. If the result of the final division is to be either 32 or 64 bits, this can be addressed by performing the scaling in extended precision. Since extended precision has a minimum exponent of 2^-16382, no single or double-precision input operand has the possibility of becoming a denormal when multiplied. If the scaling factor were greater than one, a similar argument shows that overflow is not possible.

Unfortunately, the possibility of underflow persists for 80-bit operations with numbers having magnitude less than (16/15)*2^-16382. Masking the underflow exception while doing the scaling avoids the trap that would ordinarily occur. However, the underflow bit is sticky, and hence a spurious underflow would still be reflected.

Precision Loss
--------------

o Hardware division within the context of the FDIV software patch employs different algorithms than a simple hardware division. In order to avoid excessive performance degradation, a few variations in the resulting precision between the two division possibilities may be observed.

o Newton-Raphson methods are typically less precise due to final roundings. Because the precision-restoring algorithm in the FDIV and FPTAN patch routines introduces two additional floating-point operations to the computation of a division, the precision of the operation is reduced by 1 ULP (unit of least precision).
  This means that with the given algorithm, an 80-bit precision division result is reduced from 64 bits of mantissa precision to 63. By doing all scaling in extended precision and then dividing in either single or double-precision accuracy, results of full precision are produced in single and double-precision modes.

o When applied to small denormal numerators, the FDIV patch code may produce slightly different results in the least significant binary digits. This potentially occurs when the numerator is denormal, and hence has a reduced number of significant digits. For example, when an extended precision denormal has 6 leading zeros, that number only has 58 significant digits. When such a number is used in a division, the result will only have 58 significant digits.

o If the inputs to FDIV or FPTAN patch routines are not exactly representable as singles or doubles, the result may differ by up to 1 ULP. Exactly representable single and double operands will produce exact results.

o The FPATAN patch routine result may differ in extended precision by as much as 3 ULPs. For single and double, the FPATAN patch routine result precision may differ by as much as 1.5 ULPs.

FPU Status Word
---------------

o The assembly routines provided by Intel should not be called from code with exceptions unmasked where the values of the flags denormal, inexact, or underflow are utilized.

o Hardware division within the context of the FDIV software patch employs different algorithms than a simple hardware division. In order to avoid excessive performance degradation, a few variations in the resulting FPU status word between the two division possibilities may be observed.

o The inexact flag after scaling and a hardware division may not be the same as a hardware division of the original operands. Sometimes using the original operands results in an inexact exception while using the scaled operands does not, and vice versa.
o The denormal bit may be set differently after division within
the context of any of the patch routines.

o The FDIV patch may set the underflow flag for divisions by
denormals when underflow would not otherwise be set.

o The FDIV and FPTAN patch routines may set C1 differently when
called with precision control set to extended.

o The FDIV and FPTAN patch routines may set C1 differently if the
input operands were not exactly representable as singles or
doubles and the precision control is set to single or double,
respectively. Exactly representable single and double operands
will produce exact results.

o The FPREM and FPREM1 patch routines may not set C0, C1, and C3
identically to the hardware if the given instruction performs an
incomplete reduction.

o The patch code for FPATAN may set the values of C0, C2, and C3
differently than the hardware instruction. Similarly, the patch
for FPTAN may set C0 and C3 differently than the hardware. C0,
C2, and C3 are marked undefined for these instructions in the
reference manual, so proper existing code should not rely upon
specific values for them regardless.

RELATED PROOFS
--------------

Safety of the Logarithmic Instructions FYL2X and FYL2XP1
--------------------------------------------------------

Peter Tang has proved the immunity of FYL2X and FYL2XP1 from the
Pentium(tm) processor floating-point division flaw. Proofs
follow.

FYL2X
-----

The table-driven, polynomial-based algorithm for FYL2X employs
one division for arguments x in the range 7/8 < x < 9/8 and one
division for arguments in the range |x - 1| >= 1/8. That is,
0 < x <= 7/8 or x >= 9/8. The two divisions are used for argument
transformation. The division flaw does not impact this algorithm.

For 7/8 < x < 9/8, the division is correct, and therefore FYL2X
is unaffected for input arguments in this range. The reason is
that for x in this range, the transformation uses 1+x as the
denominator. This transformation is quite standard.
The bit pattern for 1+x in this range is either

    2^1 * 1.0000????....   or   1.111?????.....

Both bit patterns are safe denominators.

For |x - 1| >= 1/8, the denominator has a bit pattern of

    2^m * 1.b1 b2 b3 b4 b5 b6 b7 ....

where (b6 b7) = (1 0) or (0 1). The reason is that the
denominator is obtained by x+c where c is basically the leading
bits of x. Precisely, for

    x = 2^k * 1. b1 b2 b3 b4 b5 ? ? ? ? ? ? ...

we have

    c = 2^k * 1. b1 b2 b3 b4 b5 1 0 0 0 .... 0 0 0

Thus,

    x + c = 2^(k+1) * 1. b1 b2 b3 b4 b5 b6 b7 ? ? ?

where (b6 b7) is (1 0) or (0 1). To be more explicit, x + c is

      2^k * 1. b1 b2 b3 b4 b5 ?  ?  ?  ?  ?  ? ...
    + 2^k * 1. b1 b2 b3 b4 b5 1  0  0  0  0  0 ...
    ---------------------------------------------------
      2^k * 1 b1. b2 b3 b4 b5 0  ?  ?  ?  ?  ?  ? ...
    + 2^k       . 0  0  0  0  0  1  ?  ?  ?  ?  ? ...
    ---------------------------------------------------
      2^k * 1 b1. b2 b3 b4 b5 b6 b7 ?  ?  ?  ?  ?

where b6 and b7 cannot both be ones. (b7 == 1 implies the ?s
above are zeroes, making b6 = 0+0 = 0.)

Since b5, b6, and b7 must all be ones when the flaw is
encountered, this range is also safe.

FYL2XP1
-------

The algorithm is very much the same as FYL2X, and the impact of
division on it is also the same as it is on FYL2X.

For |x| < 1/8, a division is used where 2+x is a safe
denominator. For |x| >= 1/8, overwrite x by 1+x and again use x+c
as a denominator.

Identifying Problematic Divisors
--------------------------------

Tim Coe has proved the validity of checking the bit patterns
indicating divisors at risk described previously in this
document. Tim Coe and Peter Tang are currently preparing a formal
proof that will be submitted for publication in the near future.
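
As an illustration only, the bit-pattern check can be sketched in
C. This is not the Intel patch code. It assumes that an at-risk
divisor is one whose significand has the form 1.bbbb111111...,
where the four bits bbbb encode 1, 4, 7, 10, or 13 and the next
six bits are all ones. It inspects IEEE double operands rather
than the 80-bit operands handled by the patch routines, and the
names divisor_at_risk, band_low, and scale_15_16 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: the flaw-prone leading fraction bits are the
 * four-bit values 1, 4, 7, 10, and 13. */
static int nibble_is_suspect(unsigned nibble)
{
    return nibble == 1 || nibble == 4 || nibble == 7 ||
           nibble == 10 || nibble == 13;
}

/* Returns 1 if the top ten fraction bits of d match the assumed
 * at-risk pattern bbbb111111, otherwise 0.
 * Only meaningful for normal, finite doubles. */
int divisor_at_risk(double d)
{
    uint64_t bits;
    unsigned top10, nibble, ones;

    memcpy(&bits, &d, sizeof bits);           /* raw IEEE 754 bits  */
    top10  = (unsigned)((bits >> 42) & 0x3FFu); /* fraction bits 1-10 */
    nibble = top10 >> 6;                      /* fraction bits 1-4  */
    ones   = top10 & 0x3Fu;                   /* fraction bits 5-10 */
    return ones == 0x3Fu && nibble_is_suspect(nibble);
}

/* Low endpoint of the band for a given nibble: 1.bbbb111111000... */
double band_low(unsigned nibble)
{
    assert(nibble < 16);
    return 1.0 + nibble / 16.0 + 63.0 / 1024.0;
}

/* Multiply by the scaling factor 15/16 (exactly representable). */
double scale_15_16(double x)
{
    return x * 0.9375;
}
```

For example, divisor_at_risk(3145727.0) returns 1 for the divisor
of the widely reported 4195835/3145727 case, while
divisor_at_risk(scale_15_16(3145727.0)) returns 0. The same holds
for band_low(n) with n = 1, 4, 7, 10, and 13.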
A main thrust of the proof is to establish that the following two
digit sequences and P-D table accesses are the only paths to
addressing the flawed P-D entries:

For cases 1, 4, 7, 10, and 13 ==>

Cycle | Q digit  | P-D entry        | Minimum magnitude
      | selected | accessed         | of ignored partial
      |          |                  | remainder
--------------------------------------------------------
B-3   | -1 or -2 | no restriction   | no restriction
      |          |                  |
B-2   | -2       | maximum entry    | 125/512
      |          | for -2 digit     |
      |          |                  |
B-1   | +2       | flawed P-D entry | 14/64
      |          | less 1/8         |
      |          |                  |
B     | +2 ==> 0 | flawed P-D entry | 0
      |          |                  |

For cases 1, 7, and 13 ==>

Cycle | Q digit  | P-D entry        | Minimum magnitude
      | selected | accessed         | of ignored partial
      |          |                  | remainder
--------------------------------------------------------
B-3   | -1 or -2 | no restriction   | no restriction
      |          |                  |
B-2   | -1       | maximum entry    | 125/512
      |          | for -1 digit     |
      |          |                  |
B-1   | +2       | flawed P-D entry | 14/64
      |          | less 1/8         |
      |          |                  |
B     | +2 ==> 0 | flawed P-D entry | 0
      |          |                  |

The partial remainder has the form:

    P-D table <> ignored
    address   <> portion
    XXXX.XXXxxxxxxxxxxx...
            xxxxxxxxxxx...

    0 <= ignored < 1/4

Start from cycle B and work backwards. Either establish
algebraically that alternatives cannot occur, or assume some
alternative can occur and derive a contradiction. Progressing
backwards, establish that two ones, then three, then four, then
finally six ones in positions 2^(-5) to 2^(-10) in the divisor
are required to address the flawed P-D entry. Use preliminary
restrictions on the divisor to establish earlier entries in the
above tables, and then use these facts to establish tighter
restrictions on the divisor.

TESTING AND VALIDATION
----------------------

Intel has performed multiple levels of testing and validation
including

1. Core routines
2. Compiler builds
3. Random tests

The Intel FDIV software patch has been incorporated into Intel's
own compiler and tested extensively for correctness.
In addition to recompilation and execution of the entire
production compiler test suite, specific division tests were
designed using test vectors containing billions of random
division operand values. These random division tests were
recompiled with Intel's compiler and all subsequently executed
divisions completed without error, and with full precision to the
extent of the exceptions noted in this document. Compiler vendors
implementing Intel's software patch will perform additional
testing and validation of the procedure.

CYCLE COST CONSIDERATIONS
-------------------------

Performance impact from the software patch will be minimal. The
multiple steps proposed in the preceding sections are optimal as
they ensure absolute resolution of a possible FDIV flaw and
provide the best possible performance. The cost on systems
without the flaw is insignificant. On the other hand, performing
the range test at all times would waste processor cycles on
processors that do not exhibit the floating-point division flaw.

Table 1 includes execution times of tests run on various
processors. Software tests consisting of multiple divisions
performed with and without the register-register FDIV software
patch were compiled using a recent Intel Reference Compiler for
UNIX and timed on the systems listed in the table columns. Table
2 displays execution times of the FDIV patch using memory
operands. The testing and timing code included a loop like the
following.

for (i=0; i