Topic PII_BUG from CPU FAQ base

Please note the date of the message presented here! Information about addresses, phone numbers, organizations, and people is almost certainly out of date and has lost its practical value, though it has gained historical value, for the sake of which it is still kept...

— SU.HARDW.PC.CPU (2:5020/299) —————————————————————————————— SU.HARDW.PC.CPU —
From : Anton Rogov 2:5020/438.18 Tue 06 May 97 00:55
Subj : Pentium II math bug?
————————————————————————————————————————————————————————————————————————————————
Hello, All!

http://www.x86.org/secrets/Dan0411.html

Pentium II Math Bug?

------------------------------------------------------------------------

It would appear that there may be a bug in the floating point unit of the new
Pentium II Processor, as well as the current Pentium Pro Processor. Is it real?
Is it serious? It appears to be real. The observed behavior contradicts the IEEE
Floating Point Specifications, and Intel's printed documentation. However, I'm
not a numerical analyst, and therefore I'm not qualified to comment on its
seriousness or its implications. Instead, I'll present the facts herein, and
leave the determination to you.

The Facts

I received email from "Dan" who asked if I could reproduce what he thought was a
bug in the Pentium Pro processor. I wrote an assembly language program that
checked into the problem. I also ran the test on a Pentium-II processor that I
had recently bought at Fry's Electronics, an Intel Pentium Processor (P54C),
Intel Pentium Processor with MMX Technology (P55C), and an AMD K6. Sure enough,
I came to the same conclusion as Dan: it looks like a bug to me.

What do we call this bug?

These days, astronomers name new stars and comets by combining the discoverer's
name and some number. Why should microprocessor bugs be any different? In this
case, "Dan" is the discoverer of the bug, and 04-11 (1997) is the date on which
I got my first email about it. So I've named the bug "Dan-0411" after its
discoverer and the date he first reported it to me.

What is the bug, and what does it affect?

The bug relates to operations that convert floating point numbers into integer
numbers. Floating point numbers are stored inside the microprocessor in an
80-bit format. Integer numbers are stored in two different sizes. A short
integer is stored in 16 bits, and a long integer is stored in 32 bits. It is
often desirable to store the 80-bit floating point numbers as integer numbers.
Sometimes the converted number won't fit into the smaller integer format. This
is when the bug occurs.

The host software is supposed to be warned by the microprocessor when such a
floating point conversion error occurs; a specific error flag is supposed to be
set in a floating point status register. If the microprocessor fails to set this
flag, it would not be in compliance with the IEEE Floating Point Standards which
mandate such behavior. For the Dan-0411 bug, the Pentium II and Pentium Pro
processors fail to set this error flag in many cases.

This bug appears to affect 2^31 + 2^47 different floating point numbers. That's
approximately 140,739,635,839,000 different floating point numbers that result
in the incorrect behavior. The Pentium, Pentium with MMX Technology, and AMD K6
microprocessors do not appear to have this problem.
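
Taking the count as 2^31 + 2^47, the approximate figure quoted above can be
checked with a few lines of arithmetic. This is only a sanity check written in
Python; the variable names are illustrative and do not come from the article.

```python
# The two terms of the count, read as powers of two.
first_term = 2**31
second_term = 2**47

total = first_term + second_term

print(f"{total:,}")             # 140,739,635,838,976 (exact)
print(f"{round(total, -3):,}")  # 140,739,635,839,000 (nearest thousand)
```

The second term dominates: nearly all of the affected operands belong to the
2^47 group.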

It might be interesting to note that a launch failure of the Ariane 5 rocket,
which happened less than a minute into the launch, was traced to behavior around
an overflow condition (in this case, it was software, not hardware, that was the
problem). One of the computers on board had a floating point to integer
conversion that overflowed, but because the overflow was not handled by the
software the computer did a dump of its memory. Unfortunately, this memory dump
was interpreted by the rocket as instructions to its rocket nozzles.
Result--boom!

There is a stuffy but complete description of this story (which is actually
quite interesting) at http://www.math.ufl.edu/~cws/3114/ariane-siam.html

Why wasn't this bug detected before?

I'm not exactly sure why this bug wasn't detected sooner, but there are a few
clues that could help provide an explanation. There appears to be a bug in a
popular floating point test program. If Intel relied on this program, its bug
may have inadvertently allowed the Dan-0411 bug to slip by undetected. Professor
William Kahan of Berkeley has written a suite of floating point test programs in
the FORTRAN programming language. (Please refer to Dr. Kahan's home page at
http://http.cs.berkeley.edu/~wkahan.) These programs are commonly used to test
the Float-to-Integer Store instructions (FIST and FISTP). FORTRAN compilers may
have differences in how they handle bit-wise expressions. These compiler
differences could make this test behave differently as well. Technically, it
looks like Dr. Kahan's original intent was to use a bit-wise AND instead
of a logical AND in his FORTRAN source code; this is a potential
non-portability issue, as I'm not sure how AND is defined by the FORTRAN
standard. This "non-portable" code was discovered when Dan tried to convert Dr.
Kahan's FORTRAN source code to the C programming language -- which has separate
bit-wise and logical AND operators. Dan recognized Dr. Kahan's original intent
and used the proper bit-wise AND operator in his C source code. This is when the
bug appeared in the chip. So in the end, either a bug in the test software, or
in a FORTRAN compiler, may have hidden a bug in the chip.
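
To see why the choice of AND matters, here is a rough sketch in Python (not
Dr. Kahan's FORTRAN, and not Dan's C) of a "was the invalid flag set?" test
written both ways. The function names are hypothetical, and the logical variant
is only one assumed way a compiler might interpret AND on integers; the bit
positions are the standard x87 status word positions.

```python
INVALID = 0x0001    # IE bit of the x87 status word (bit 0)
PRECISION = 0x0020  # PE bit of the x87 status word (bit 5)

def missing_invalid_bitwise(status_word: int) -> bool:
    """True when the IE bit is NOT set, using a bit-wise AND."""
    return (status_word & INVALID) != INVALID

def missing_invalid_logical(status_word: int) -> bool:
    """The same test with AND treated as a logical (truth-value) operator."""
    return (bool(status_word) and bool(INVALID)) != bool(INVALID)

# A status word with only PE set: the overflow was not flagged as invalid.
faulty = PRECISION

print(missing_invalid_bitwise(faulty))  # True: the missing flag is caught
print(missing_invalid_logical(faulty))  # False: the missing flag is hidden
```

Under the logical reading, any nonzero status word counts as "true", so a
missing IE bit goes unnoticed whenever some other flag happens to be set.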

That's the end of the non-technical discussion. For further technical details,
continue reading.

------------------------------------------------------------------------

How did I get involved?

"Dan," who wants his full name to remain anonymous, sent me the following email
on April 11, 1997 (reprinted with permission):

Robert,

There seems to be a bug in the FIST[P] m16int and FIST[P] m32int
instructions for the P6 (Pentium Pro). Some (perhaps all) values
in the following ranges fail to set the IE (Invalid operation Exception)
flag as required for integer overflow.

FIST[P] m32int: [ c05e8000000000000001, c05e8000000080000000 ] (~-2^95)
FIST[P] m16int: [ c06e8000000000000001, c06e8000800000000000 ] (~-2^111)

(Number of failing mantissas = 2^31 + 2^47)

Example on P6 (Pentium Pro):
fcw = 0x37f
FIST[P] m16int c06e8000000000000001 -> 8000 (stored in memory)
FPU status word: B C3 TOP C2 C1 C0 ES SF PE UE OE ZE DE IE
                 0  0 000  0  0  0  0  0  1  0  0  0  0  0
***FAIL***

Example on P5 (Pentium):
fcw = 0x37f
FIST[P] m16int c06e8000000000000001 -> 8000 (stored in memory)
FPU status word: B C3 TOP C2 C1 C0 ES SF PE UE OE ZE DE IE
                 0  0 000  0  0  0  0  0  0  0  0  0  0  1

Prof. William Kahan at U.C. Berkeley wrote the following FORTRAN programs
to test floating-point to integer conversions:

http://HTTP.CS.Berkeley.EDU/~wkahan/tests/fistest2.lst
http://HTTP.CS.Berkeley.EDU/~wkahan/tests/fistest4.lst

The following line in the "fistest" programs is non-portable FORTRAN
and could prevent the P6 bug from being detected:

199 Li = ((kflag.AND.Invalid) .NE. Invalid) .OR. Li

-- Dan
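
The ranges in Dan's email can be unpacked with a small decoder. The following
is a sketch in Python, assuming the endpoints are written as 20 hex digits (the
width of an 80-bit extended-real value: 1 sign bit, a 15-bit exponent biased by
16383, and a 64-bit significand). The helper names are hypothetical.

```python
def decode_extended(hex_digits: str):
    """Split a 20-hex-digit x87 extended-real value into its fields."""
    if len(hex_digits) != 20:
        raise ValueError("an 80-bit value needs exactly 20 hex digits")
    raw = int(hex_digits, 16)
    sign = raw >> 79                           # 1 means negative
    exponent = ((raw >> 64) & 0x7FFF) - 16383  # remove the exponent bias
    significand = raw & 0xFFFFFFFFFFFFFFFF     # 64 bits, explicit integer bit
    return sign, exponent, significand

def significands_in_range(low_hex: str, high_hex: str) -> int:
    """Count the significands between two endpoints sharing one exponent."""
    _, low_exp, low_sig = decode_extended(low_hex)
    _, high_exp, high_sig = decode_extended(high_hex)
    if low_exp != high_exp:
        raise ValueError("endpoints must share the same exponent")
    return high_sig - low_sig + 1

sign, exponent, _ = decode_extended("c06e8000000000000001")
print(sign, exponent)  # 1 111

count_m32 = significands_in_range("c05e8000000000000001", "c05e8000000080000000")
count_m16 = significands_in_range("c06e8000000000000001", "c06e8000800000000000")
print(count_m32 == 2**31, count_m16 == 2**47)  # True True
```

The value represented is (-1)^sign * significand * 2^(exponent - 63), so the
first endpoint of the m16int range is just beyond -2^111 in magnitude.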

Dan wanted to make sure that there wasn't a bug in his C source code, or his C
compiler. That's when he contacted me. Dan wanted me to write assembly language
source code on his behalf. By writing in assembly language, the floating point
hardware may be tested directly and queried directly for its response without
the possible influence of compiler bugs and such.

Normally I don't get involved in debugging other people's problems or writing
source code on their behalf. But Dan was persistent. Within a day or two, Dan
had come up with some very concrete examples of the bug and instructions which I
could use as guidelines for reproducing it. I still wasn't convinced that I
wanted to be involved (not being a floating point expert). But after 10 days or
so, I finally became convinced, and that's when I wrote the first piece of
assembly language source code to detect the Dan-0411 bug.

The Nature of the Bug

This bug occurs when a large negative floating point number is stored to memory
in an integer format. Under normal operation, the largest negative integer is
stored in memory when a floating point number is too large to fit in the integer
format. The FPU Status Word indicates that an Invalid operand Exception (IE)
occurred (FSW.IE = 1).
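
For readers who want to pick a status word apart by hand, here is a small
Python sketch that splits a 16-bit FPU status word into its named fields. The
bit positions follow the standard x87 layout (IE in bit 0 up to B in bit 15);
the table and function names are my own and not part of FISTBUG.

```python
# (name, lowest bit, width) for each field of the x87 status word.
FSW_FIELDS = [
    ("IE", 0, 1), ("DE", 1, 1), ("ZE", 2, 1), ("OE", 3, 1),
    ("UE", 4, 1), ("PE", 5, 1), ("SF", 6, 1), ("ES", 7, 1),
    ("C0", 8, 1), ("C1", 9, 1), ("C2", 10, 1), ("TOP", 11, 3),
    ("C3", 14, 1), ("B", 15, 1),
]

def decode_status_word(fsw: int) -> dict:
    """Return a mapping from field name to field value."""
    return {name: (fsw >> shift) & ((1 << width) - 1)
            for name, shift, width in FSW_FIELDS}

def set_flags(fsw: int) -> list:
    """Names of all fields that are nonzero in the status word."""
    return [name for name, value in decode_status_word(fsw).items() if value]

print(set_flags(0x0020))  # ['PE']  only the precision flag: overflow missed
print(set_flags(0x0001))  # ['IE']  invalid flag set: overflow reported
```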

Stores of floating point numbers that overflow the "real number" format are
supposed to behave differently from stores that overflow the "integer number"
format. A real-format overflow sets the overflow flag (FSW.OE = 1), not the
Invalid operand Exception flag (FSW.IE). The Pentium Pro Family
Developer's Manual, Volume 2, section 7.8.4 makes this difference quite clear:

The FPU reports a floating-point numeric overflow exception (#O) whenever the
rounded result of an arithmetic instruction exceeds the largest allowable finite
value that will fit into the real format of the destination operand. For
example, if the destination format is extended-real (80 bits), overflow occurs
when the rounded result falls outside the unbiased range of -1.0 * 2^16384 to 1.0
* 2^16384 (exclusive). Numeric overflow can occur on arithmetic operations where
the result is stored in an FPU data register. It can also occur on store-real
operations (with the FST and FSTP instructions), where a within-range value in a
data register is stored in memory in a single- or double-real format. The
overflow threshold range for the single-real format is -1.0 * 2^128 to 1.0 *
2^128; the range for the double-real format is -1.0 * 2^1024 to 1.0 * 2^1024.

That explains how float-to-real overflows are supposed to be handled. But the
Pentium Pro manual is very specific by making a distinction between
float-to-real overflows and float-to-integer overflows. In fact, the very next
paragraph in the Pentium Pro manual describes the behavior for the exact
conditions exposed by Dan-0411.

The numeric overflow exception cannot occur when overflow occurs when storing
values in an integer or BCD integer format. Instead, the
invalid-arithmetic-operand exception is signaled.

As I said, this is the precise condition which is not being met by the Pentium
Pro and Pentium II microprocessors. The programs that demonstrate Dan-0411 will
set up these conditions and test whether or not the proper error condition codes
are set by the microprocessor.
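
The expected result of such a store can be modeled in a few lines. This is
only a sketch in Python of the documented behavior with all exceptions masked
and round-to-nearest selected (fcw = 0x37f); it is not the assembly source, and
expected_fist and the constants are hypothetical names.

```python
from fractions import Fraction

IE = 0x0001  # invalid operation flag (bit 0 of the status word)
PE = 0x0020  # precision (inexact) flag (bit 5 of the status word)

def expected_fist(value, bits):
    """Return (word stored in memory, status flags) for a FIST store."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    rounded = round(Fraction(value))      # round to nearest, ties to even
    if not lowest <= rounded <= highest:
        # Overflow: store the "integer indefinite" pattern and raise IE.
        return 1 << (bits - 1), IE
    flags = PE if rounded != value else 0
    return rounded & ((1 << bits) - 1), flags

# The operand from Dan's 16-bit example: significand 2^63 + 1, exponent 111.
operand = -(2**63 + 1) * 2**(111 - 63)

print(expected_fist(operand, 16) == (0x8000, IE))  # True
print(expected_fist(-32768, 16) == (0x8000, 0))    # True
```

Both stores put 8000 in memory, so only the IE flag tells an overflow apart
from a legitimate store of -32768. That is why a missing flag matters.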

Is this already a known bug?

Part of the process of disclosing this bug was ensuring that it hadn't already
been reported in any of Intel's errata documents. Thanks to Intel for providing
electronic versions of their errata for the Pentium and Pentium Pro
microprocessors, it's very easy to perform an electronic search to see if this
bug has been previously reported. Using this technique, I could not find any
documentation disclosing the Dan-0411 bug on either the Pentium or Pentium Pro
microprocessors.

The Source Code & Programs

I have provided one source code file, and two executable programs. In the case
of the executable programs, both are executable versions of the stand-alone
assembly language source code. The first program, FISTBUG.EXE, demonstrates the
bug in a very simple manner. All that appears on the screen is the simple
message:

*** Dan-0411 bug found. ***

- or -

Dan-0411 not found.

The second program, FISTBUGV.EXE, runs the same exact tests as the first, but is
much more verbose. This program shows the microprocessor stepping information
and itemized results. Each operand under test is printed to the screen, along
with pass/fail status for four different testing methods.

The Results

I ran this test on various Pentia and other microprocessors. For demonstration
purposes of this article, I will show the results of the Intel 486, Pentium
(P54C), Pentium with MMX Technology (P55C), AMD K6, Pentium Pro, and Pentium II
microprocessors. These results demonstrate that the bug is only present on the
Pentium Pro and Pentium II microprocessors. All other processors I tested did
not demonstrate the Dan-0411 bug.

Conclusion

After reading this, I'm sure that many people will work vigorously to verify or
refute my test results. For this reason, I've provided the source code along
with executable binaries that can be run in DOS or Windows. Since I'm not a
numerical analyst, you should draw your own conclusions or rely on the
conclusions of a qualified expert as to the significance of the Dan-0411 bug.
One thing I can say conclusively: the Pentium Pro and Pentium II processors
behave differently than their predecessors.

Send your feedback.

Tell me how significant you think this bug is. Send me your feedback. This might
help me understand the significance of this bug and how it might affect your
life. Please send mail to fistbug@x86.org.


------------------------------------------------------------------------

View results of FISTBUG

ftp://ftp.x86.org/source/fistbug/fistbug.res

Source Code Availability

View source code for FISTBUG.EXE and FISTBUGV.EXE
ftp://ftp.x86.org/source/fistbug/fistbug.asm
ftp://ftp.x86.org/source/fistbug/makefile

Executable Programs

Download FISTBUG.EXE and FISTBUGV.EXE binary executables.
ftp://ftp.x86.org/source/fistbug/fistbug.exe
ftp://ftp.x86.org/source/fistbug/fistbugv.exe
ftp://ftp.x86.org/source/fistbug/Dan0411x.ZIP

The Entire FISTBUG Archive

Download FISTBUG.ZIP archive. Archive contains source code, binary executables,
and my results.
ftp://ftp.x86.org/dloads/FISTBUG.ZIP

------------------------------------------------------------------------


(c) 1991-1997 x86 Monthly Digest and Robert Collins. PGP key available.

Make no mistake!
This web site is proud to provide superior information and service without any
affiliation to Intel Corporation.

"Intel Secrets", "What Intel doesn't want you to know" and anything with a
dropped e in it, are phrases that infuriate Intel Corporation.

Pentium, Intel, and the letter "I" are registered trademarks of Intel
Corporation. 386, 486, 586, P6, all other letters, and all other numbers are
not!
All other trademarks are those of their respective companies. See Trademarks and
Disclaimers for more info.

Robert Collins works somewhere in the United States of America. Robert may be
reached via email or telephone.

With best regards, Anton.

---
* Origin: -= Under construction =- (2:5020/438.18)
