IEEE Floating Point Data


3.1.2.3 IEEE Floating Point Data

FITS allows transmission of 32- and 64-bit floating point data within the FITS format using the IEEE (1985) standard. This Floating Point Agreement also applies to random groups records and to any extensions for which BITPIX is not explicitly restricted (e.g., BITPIX=8 for XTENSION= 'TABLE '). Values for BITPIX of -32 and -64 indicate IEEE single- and double-precision floating point data, respectively.

The text of the Floating Point Agreement (Wells and Grosbøl 1990) is as follows:

The Basic FITS, Random Groups and Generalized Extensions Agreements are revised to add IEEE-754 32- and 64-bit floating point numbers to the original set of FITS data types. BITPIX=-32 and BITPIX=-64 signify 32- and 64-bit IEEE floating point numbers; the absolute value of BITPIX is used for computing the sizes of data structures. The full IEEE set of number forms are allowed for FITS interchange, including all special values (e.g., the "not-a-number" cases). The order of the bytes is sign and exponent first, followed by the mantissa bytes in order of decreasing significance. The BLANK keyword is ignored by FITS readers when BITPIX=-32 or -64.

For a complete, precise description of the IEEE floating point format, refer to the IEEE standard. The following discussion is provided to help in the interpretation of floating point data.

An ordinary IEEE floating point number consists of three components: a sign, an exponent, and a fraction. For regular IEEE 32-bit floating point numbers, the sign is contained in bit 1, the exponent in bits 2-9, and the fraction in bits 10-32. The fraction has an implied binary point in front. The value is given by

value = (-1)^{sign} × 2^{(exponent-127)} × (1 + fraction). (3.3)

For regular IEEE 64-bit floating point numbers, the sign is contained in bit 1, the exponent in bits 2-12, and the fraction in bits 13-64. The fraction has an implied binary point in front. The value is given by

value = (-1)^{sign} × 2^{(exponent-1023)} × (1 + fraction). (3.4)
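Equations 3.3 and 3.4 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a reference decoder: the function name `decode_regular` and its `width` parameter are hypothetical, and the bit pattern is assumed to be passed as an unsigned integer with bit 1 as the most significant bit. Special exponent values are rejected rather than interpreted.

```python
def decode_regular(bits: int, width: int = 32) -> float:
    """Evaluate equation 3.3 (width=32) or 3.4 (width=64) for a regular number."""
    if width == 32:
        exp_bits, frac_bits, offset = 8, 23, 127
    elif width == 64:
        exp_bits, frac_bits, offset = 11, 52, 1023
    else:
        raise ValueError("width must be 32 or 64")

    sign = bits >> (width - 1)                          # bit 1
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    # The fraction field is an integer; dividing by 2**frac_bits puts the
    # implied binary point in front of it.
    fraction = (bits & ((1 << frac_bits) - 1)) / (1 << frac_bits)

    # An exponent field of all zeros or all ones marks a special value.
    if exponent == 0 or exponent == (1 << exp_bits) - 1:
        raise ValueError("special value, not a regular number")

    return (-1) ** sign * 2.0 ** (exponent - offset) * (1 + fraction)
```

For instance, `decode_regular(0xBF800000)` evaluates to -1.0: the sign bit is 1, the exponent field is 127, and the fraction is 0.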

Fraction bytes are in order of decreasing significance (i.e., the standard non-byte-swapped order).

For example, suppose the four bytes of a single precision number, written in hexadecimal, are 40400000. The sign bit is 0, the exponent bit pattern is 10000000 (128 decimal), and the fraction pattern is 1 followed by 22 zeros with a binary point in front, or 0.5 decimal. The entire number is interpreted as

(-1)^{0} × 2^{(128-127)} × 1.5 = 3. (3.5)
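The worked example can be cross-checked with Python's standard `struct` module, which is one convenient way (not the only one) to read IEEE values in the byte order FITS uses. The variable names below are illustrative. The format string `">f"` requests a big-endian 32-bit float, which matches the "sign and exponent first" byte order of the Agreement.

```python
import struct

raw = bytes.fromhex("40400000")        # four bytes, most significant first
value = struct.unpack(">f", raw)[0]    # ">" = big-endian, "f" = 32-bit IEEE float

# Split the same 32 bits into the three fields by hand.
bits = int.from_bytes(raw, "big")
sign = bits >> 31                      # bit 1
exponent = (bits >> 23) & 0xFF         # bits 2-9
fraction = (bits & 0x7FFFFF) / 2**23   # bits 10-32, binary point in front

# Reading the same bytes in byte-swapped (little-endian) order gives a
# different number, which is why the byte order matters.
swapped = struct.unpack("<f", raw)[0]

print(value, sign, exponent, fraction)  # 3.0 0 128 0.5
```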

The IEEE standard specifies in addition a variety of special exponent and fraction values in order to support the concepts of plus and minus infinity, plus and minus zero, "denormalized" numbers, and "not-a-number" (NaN). The BLANK keyword of the original FITS Agreement is thus unnecessary and should be omitted by FITS writers and ignored by FITS readers when BITPIX = -32 or -64 (the NaNs of the IEEE format will act as the blank). For denormalized numbers, 1 is not added to the fraction, and the offset subtracted from the exponent is one smaller than for regular numbers: 126 for single precision and 1022 for double precision. This convention allows IEEE floating point to represent numbers that are smaller in magnitude than the smallest regular value, although the number of significant digits decreases for smaller values. All of these special cases are fully accepted for FITS interchange. Appendix B lists the kind of value, regular or special, represented by all possible bit patterns.
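The special cases can be told apart by inspecting the exponent and fraction fields. The sketch below covers single precision only and relies on the IEEE 754 field encodings, which this section does not spell out: an exponent field of all ones marks infinity (fraction zero) or NaN (fraction nonzero), and an exponent field of zero marks zero (fraction zero) or a denormalized number (fraction nonzero). The function names are hypothetical.

```python
import math
import struct

def classify32(bits: int) -> str:
    """Name the kind of value held in a 32-bit IEEE pattern."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:
        return "infinity" if fraction == 0 else "NaN"
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    return "regular"

def decode_denormalized32(bits: int) -> float:
    """Value of a denormalized number: no implied 1, exponent offset 126."""
    sign = bits >> 31
    fraction = (bits & 0x7FFFFF) / 2**23
    return (-1) ** sign * 2.0 ** -126 * fraction

def is_blank(pixel: float) -> bool:
    """With BITPIX = -32 or -64, a NaN pixel plays the role of a blank."""
    return math.isnan(pixel)
```

As a usage example, `classify32(0x00000001)` returns `"denormalized"`, and `decode_denormalized32(0x00000001)` agrees with `struct.unpack(">f", bytes.fromhex("00000001"))[0]`, the smallest positive single precision value.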

The BSCALE and BZERO values should be applied by FITS readers if they differ from their default values of 1.0 and 0.0, respectively. However, scaling parameters should be used carefully with floating point values, because of the risk of generating overflows and underflows after scaling has been applied.
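A minimal sketch of the scaling step, assuming the usual FITS relation physical value = BZERO + BSCALE × stored value. The helper name `apply_scaling` is hypothetical, and Python floats are 64-bit, so the overflow and underflow shown here occur at double precision limits; with 32-bit data the limits are reached much sooner.

```python
import math

def apply_scaling(stored: float, bscale: float = 1.0, bzero: float = 0.0) -> float:
    """Return BZERO + BSCALE * stored, skipping the arithmetic for the defaults."""
    if bscale == 1.0 and bzero == 0.0:
        return stored
    return bzero + bscale * stored

ordinary = apply_scaling(2.0, bscale=3.0, bzero=1.0)   # 7.0
overflowed = apply_scaling(1e308, bscale=10.0)         # exceeds the double range
underflowed = apply_scaling(5e-324, bscale=0.1)        # too small to represent
blank = apply_scaling(float("nan"), bscale=2.0)        # a NaN pixel stays NaN

print(ordinary, math.isinf(overflowed), underflowed == 0.0, math.isnan(blank))
```

A reader that cares about such cases might check results with `math.isinf` after scaling, as above, rather than assume every scaled value is finite.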
