Summary

1 And in Conclusion $\dots$

The IEEE 754 standard defines a binary representation for floating point values using three fields.

The sign determines the sign of the number ($0$ for positive, $1$ for negative).
The exponent is stored in biased notation. For single-precision floating point numbers the bias is $-127$, which comes from $-(2^{8-1} - 1)$; for double-precision floating point numbers the bias is $-1023$, which comes from $-(2^{11-1} - 1)$. An exponent field of 00000000 represents a denormalized number (or zero, when the significand is also zero), and an exponent field of 11111111 represents NaN if the significand is non-zero, or infinity if the significand is zero.
The significand stores a fraction rather than an integer: it refers to the bits to the right of the leading “1” of the normalized value. For example, if a mantissa is 1.010011, its significand is 010011.

Figure 3 shows the bit breakdown for the single-precision (32-bit) representation. The leftmost bit is the MSB, and the rightmost bit is the LSB.
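To make the three fields concrete, here is a minimal Python sketch that pulls them out of a single-precision value. The helper name `float32_fields` is hypothetical (it is not part of these notes), and the sketch assumes the usual 1/8/23-bit single-precision layout described above.

```python
import struct

def float32_fields(x):
    """Split x (rounded to single precision) into its three bit fields."""
    # Pack as a big-endian 32-bit float, then reinterpret the same bytes
    # as an unsigned 32-bit integer so we can use shifts and masks.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit, the MSB
    exponent = (bits >> 23) & 0xFF   # 8 bits, still in biased form
    significand = bits & 0x7FFFFF    # 23 bits, ending at the LSB
    return sign, exponent, significand

sign, exponent, significand = float32_fields(-5.25)
# -5.25 = -1.0101_2 x 2^2, so the stored exponent is 2 + 127 = 129
print(sign, format(exponent, "08b"), format(significand, "023b"))
# prints: 1 10000001 01010000000000000000000
```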

For normalized floats:

$$\text{Value} = (-1)^{\text{Sign}} \times 2^{\text{Exp}+\text{Bias}} \times 1.\text{Significand}_2$$

For denormalized floats, including zero:

$$\text{Value} = (-1)^{\text{Sign}} \times 2^{\text{Exp}+\text{Bias}+1} \times 0.\text{Significand}_2$$
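The two formulas can be sketched in Python as follows. This is an illustrative decoder, not a reference implementation: the name `decode_float32` and the constant `BIAS` are assumptions made for this example, and the fields are passed in as plain integers.

```python
BIAS = -127  # single-precision bias, with the same sign convention as the formulas

def decode_float32(sign, exponent, significand):
    """Evaluate a single-precision float from its integer field values."""
    if exponent == 0xFF:
        # All-ones exponent: NaN if the significand is non-zero, else infinity.
        if significand != 0:
            return float("nan")
        return float("-inf") if sign else float("inf")
    fraction = significand / 2**23  # the value of 0.Significand_2
    if exponent == 0:
        # Denormalized (including zero): implicit leading 0, exponent Exp + Bias + 1.
        return (-1) ** sign * 2.0 ** (exponent + BIAS + 1) * fraction
    # Normalized: implicit leading 1.
    return (-1) ** sign * 2.0 ** (exponent + BIAS) * (1 + fraction)

print(decode_float32(1, 129, 0x280000))  # -5.25
print(decode_float32(0, 0, 1) == 2.0 ** -149)  # True: the smallest positive denormalized value
```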

Table 1 shows that the single-precision IEEE 754 exponent field has values from 0 to 255. When translating between binary and decimal floating point values, we must remember to apply the bias to the exponent.

2 Textbook Readings

P&H 3.5, 3.9

3 Additional References

IEEE 754 Simulator

4 Exercises

Check your knowledge!

4.1 Conceptual Review

Solution to Exercise 1

True. Floating point:

Provides support for a wide range of values. (Both very small and very large)
Helps programmers deal with errors in real arithmetic, because floating point can represent $+\infty$, $-\infty$, and $\text{NaN}$ (Not a Number).
Keeps high precision. Recall that precision is a count of the number of bits in a computer word used to represent a value. IEEE 754 allocates a majority of bits for the significand, allowing for the use of a combination of negative powers of two to represent fractions.
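As a small illustration of the special values mentioned above, the following Python sketch shows how infinities and NaN surface in ordinary arithmetic. Python floats are double precision rather than single precision, so the overflow threshold used here is specific to doubles.

```python
import math

inf = float("inf")
nan = float("nan")

print(1e308 * 10)       # inf: the product overflows the double-precision range
print(inf - inf)        # nan: the result is undefined, so it is flagged as NaN
print(nan == nan)       # False: NaN compares unequal to everything, itself included
print(math.isnan(nan))  # True: the reliable way to test for NaN
```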

Solution to Exercise 2

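False. Floating point reserves bit patterns for the infinities and for NaNs (every pattern with an all-ones exponent and a non-zero significand is a NaN), and $+0$ and $-0$ are two patterns for the same value. The total number of representable numbers is therefore lower than in Two’s Complement, where every bit combination maps to a unique integer value.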
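The gap can be counted directly. The sketch below assumes the single-precision layout (1 sign bit, 8 exponent bits, 23 significand bits); the variable names are chosen for this example only.

```python
EXP_BITS, SIG_BITS = 8, 23

total_patterns = 2 ** (1 + EXP_BITS + SIG_BITS)  # every 32-bit pattern
# NaN: exponent all ones, any non-zero significand, either sign.
nan_patterns = 2 * (2 ** SIG_BITS - 1)
infinity_patterns = 2                            # +inf and -inf
duplicate_zero = 1                               # +0 and -0 name the same value

distinct_finite = total_patterns - nan_patterns - infinity_patterns - duplicate_zero

print(total_patterns)   # 4294967296 distinct 32-bit two's complement integers
print(nan_patterns)     # 16777214
print(distinct_finite)  # 4278190079
```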

Solution to Exercise 3

True. The uneven spacing is due to the exponent representation of floating point numbers. There are a fixed number of bits in the significand. In IEEE 32-bit storage there are $23$ bits for the significand, which means the LSB represents $2^{-23}$ times 2 to the exponent. For example, if the exponent is zero (after allowing for the offset), the difference between two neighboring floats will be $2^{-23}$. If the exponent is $8$, the difference between two neighboring floats will be $2^{-15}$, because the mantissa is multiplied by $2^{8}$. Limited precision makes binary floating-point numbers discontinuous; there are gaps between them.
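The two gaps quoted above can be checked with a short Python sketch. The helper `next_float32_up` is hypothetical and assumes a positive, finite input: for such values, adding 1 to the 32-bit pattern yields the next larger single-precision float.

```python
import struct

def next_float32_up(x):
    """Return the next larger single-precision value after positive finite x."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return struct.unpack(">f", struct.pack(">I", bits + 1))[0]

# 1.0 has exponent 0 and 256.0 has exponent 8 (after removing the bias).
for x in (1.0, 256.0):
    gap = next_float32_up(x) - x
    print(x, gap)

print(next_float32_up(1.0) - 1.0 == 2.0 ** -23)      # True
print(next_float32_up(256.0) - 256.0 == 2.0 ** -15)  # True
```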

Solution to Exercise 4

False. Because of rounding errors, you can find Big and Small numbers such that: (Small + Big) + Big != Small + (Big + Big)

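FP approximates results because it has only a fixed number of significand bits (23 in single precision), so each intermediate sum is rounded to the nearest representable value.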
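One concrete pair of values can be checked in Python. Note the assumption: Python floats are double precision (52 stored significand bits), so `big` and `small` below are chosen for doubles, not for the 23-bit single-precision case; the rounding effect is the same.

```python
big = 2.0 ** 53 - 1   # largest odd integer a double represents exactly
small = 0.5

left = (small + big) + big   # small + big is a tie and rounds up to 2**53
right = small + (big + big)  # big + big is exact; adding 0.5 rounds back down

print(left == right)  # False
print(left - right)   # 2.0
```

The two groupings round at different points, so the final results land on different neighboring floats.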

Solution to Exercise 5

In scientific notation, a single non-zero digit is required before the radix point. Since the only non-zero digit in base 2 is 1, the normalized value will always start with a 1.