Jump to content

User:Softtest123/Patents/Bounded floating point

From Wikipedia, the free encyclopedia
Purge

Apparatus for Calculating and Retaining a Bound on Error during Floating Point Operations and Methods Thereof[edit]

US Patent US9817662B2[edit]

I think my first inspiration for this patent, currently filed as a Provisional Patent Application, came from my personal discovery of floating point error mentioned in “Microsoft Calculator Challenge” and described in more detail in Appendix A (pp. 136-144) of my Ph.D. Dissertation.

Strangely enough, the first patent on floating point arithmetic (US 3037701 A, Sierra, 1962) states:

“The use of fixed decimal point arithmetic also has limitations in certain scientific computations because the total number of digits may be extremely large with respect to the total number of available digit receiving positions in the data receiving and storing registers or accumulators. Thus under some conditions, the major portion of the significant data digits may lie beyond the capacity of the registers. Therefore, the result obtained may have little meaning if not totally erroneous.”

The reason there is error in floating point arithmetic is that real numbers cannot be represented accurately with a finite number of digits. For instance, in decimal 1/3 is 0.3333.., ad infinitum. And in binary, decimal 0.1 is 0.000110011001100… forever. When we stuff this value into a fixed number of bits representing this value there will be an error. (See Patriot Missile Failure at Dhahran.)

The heart of my invention is the addition of a new field to the standard floating point format that provides information about the error in the current floating point representation, in particular, an upper bound on that error.

Bounded Floating Point Format
Bounded Floating Point Format

Where S, E, and T and the sign, exponent, and significand with widths 1,e, and t respectively, as described in floating point standards except that “t” is smaller to accommodate the bound field B (of width b). The bound field is further divided into subfields D and N which contain the number of lost bits of precision and the accumulated rounding error. The base value of a bounded floating point representation of a real number is the truncation of the result of the operation and the limit value can be computed from the base value offset by the value represented by the number of lost bits scaled by the exponent.

Further, the number of remaining significant digits are compared against a minimum required precision (optionally programmable) and if insufficient digits of precision remain, a new signaling NaN (loss of precision) is generated prohibiting further computations with an inaccurate floating point value. This is a "Fail-Safe" result.

This method of representation and the related computations provides many benefits.

One advantage is equality comparison. A subtract for equality comparison will return zero only when the remaining significant digits of the result are zero. (Zero is represented in the standard manner with B = 0.)

This new floating point functionality can operate side-by-side with existing standard floating point with easy conversion between the formats, though when converting from standard floating point to bounded floating point, some assumptions must be made about the number of significant digits of the standard floating point representation.

Conversion from external representation to bounded floating point allows for recording the error introduced by such conversion (for example, 0.1). Conversion to external representation has different options including presenting a value with only the significant digits displayed, as a range of values, or as an expected value plus or minus error bounds.

For more information, see the Bounded Floating Point website.