Floating point error mitigation


Floating point error arises because real numbers cannot, in general, be accurately represented in a fixed space. By definition, floating point error cannot be eliminated, and, at best, can only be managed.

H. M. Sierra noted in his 1956 patent "Floating Decimal Point Arithmetic Control Means for Calculator":

"Thus under some conditions, the major portion of the significant data digits may lie beyond the capacity of the registers. Therefore, the result obtained may have little meaning if not totally erroneous."

The first computer with floating point arithmetic, a relay machine developed by Zuse in 1936, was thus susceptible to floating point error. Early computers, however, with operation times measured in milliseconds, were incapable of solving large, complex problems and thus were seldom plagued by floating point error. Today, with supercomputer performance measured in petaflops (10^15 floating-point operations per second), floating point error is a major concern for computational problem solvers.

There are two types of floating point error: cancellation and rounding. Cancellation occurs when two nearly equal numbers are subtracted, and rounding occurs when significant bits cannot be saved and are rounded or truncated. Cancellation error is exponential relative to rounding error.
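Cancellation can be illustrated with a short Python sketch (the particular expressions are chosen for illustration; rewriting a formula to avoid subtracting nearly equal values is a standard remedy):

```python
import math

# Catastrophic cancellation: subtracting two nearly equal numbers
# discards most of the significant digits. For small x, cos(x) is
# so close to 1.0 that the difference 1 - cos(x) computed directly
# loses everything; the algebraically identical form 2*sin(x/2)**2
# involves no subtraction and keeps full precision.
x = 1e-8
naive = 1.0 - math.cos(x)            # cancellation: evaluates to 0.0
stable = 2.0 * math.sin(x / 2) ** 2  # close to the true value x**2/2

print(naive)
print(stable)
```

The true value is approximately x^2/2 = 5e-17; the naive form returns 0.0, a 100% relative error, while the rewritten form recovers it.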

The following sections describe the strengths and weaknesses of various means of mitigating floating point error.

Though not the primary focus of numerical analysis, numerical error analysis exists for the analysis and minimization of floating point rounding error. Numerical error analysis generally does not account for cancellation error.

Error analysis by Monte Carlo arithmetic is accomplished by repeatedly injecting small errors into an algorithm's data values and determining the relative effect on the results.

Extension of precision is the use of larger representations of real values. The IEEE 754 standard defines precision as the number of digits available to represent real numbers, with common formats including single precision (32 bits), double precision (64 bits), and quadruple precision (128 bits). While extension of precision makes the effects of error less likely or less important, the true accuracy of the results is still unknown.
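The benefit of a wider format can be seen by accumulating the same sum in single and double precision. A sketch in Python, which has only double-precision floats natively, so single precision is simulated here by rounding through `struct` (an illustrative device, not a standard technique name):

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Add 0.1 one hundred thousand times. The exact answer is 10000.
# 0.1 is not exactly representable in binary, and each addition
# rounds again, so error accumulates much faster in 32 bits.
single = 0.0
double = 0.0
for _ in range(100_000):
    single = f32(single + f32(0.1))
    double = double + 0.1

print(single)  # drifts visibly from 10000
print(double)  # agrees with 10000 to roughly 11 significant digits
```

Note that doubling the precision postpones the problem but does not answer how accurate the final double-precision result actually is, which is the limitation stated above.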

Variable length arithmetic represents numbers as a string of digits of variable length limited only by the memory available. Variable length arithmetic operations are considerably slower than fixed length format floating point instructions. When high performance is not a requirement, but high precision is, variable length arithmetic can prove useful, though the actual accuracy of the result may not be known.
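Python's standard library includes two such representations, which can serve as a sketch of the trade-off (exact rationals in `fractions`, user-chosen decimal precision in `decimal`):

```python
from fractions import Fraction
from decimal import Decimal, getcontext

# Exact rational arithmetic: 0.1 is stored as the ratio 1/10, whose
# numerator and denominator grow as needed, so repeated addition
# incurs no rounding at all.
total = sum([Fraction(1, 10)] * 100_000)
print(total == 10000)   # the sum is exact

# Decimal arithmetic with a user-selected number of significant
# digits, limited only by memory and time.
getcontext().prec = 50
sept = Decimal(1) / Decimal(7)
print(sept)             # 1/7 to 50 significant decimal digits
```

The exactness comes at a cost: each `Fraction` operation may allocate ever-larger integers, which is why such arithmetic is considerably slower than fixed-length hardware floating point.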
