Talk:Single-precision floating-point format


Exponent/mantissa ranges are inconsistent with other sources

The C++ standard library, as an example, will report the min and max exponents as (-125,+128) and assume that the mantissa is in the range (0.5,1). This article assumes that the exponent is between (-126,127) and that the mantissa is in the range (1,2). Although either method gives the same answer, the one given in the article does not match standard usage. The article should make this clear. —Preceding unsigned comment added by 71.104.122.200 (talk) 03:12, 13 February 2008 (UTC)[reply]

The values in the article are those given in the IEEE 754 standard (1985), which the article calls out in the first paragraph. That is the standard for floating-point arithmetic. What clarification(s) would you like to see? mfc (talk) 20:36, 22 February 2008 (UTC)[reply]
Actually, it matters, because some cstdlib functions will return the raw exponent/mantissa. - Richard Cavell (talk) 05:42, 17 December 2010 (UTC)[reply]
Well the discrepancy seems to resolve itself when one looks up what std::numeric_limits<float>::min_exponent actually returns. Quoting from cppreference.com: "The value of std::numeric_limits<T>::min_exponent is the lowest negative number n such that r^(n-1), where r is std::numeric_limits<T>::radix, is a valid normalized value of the floating-point type T". Similarly for max_exponent. That seems to imply that one needs to shift (-125,+128) to (-126,127), which are precisely the smallest and largest exponents not used for representing special values in IEEE. The same definition holds for the macros FLT_MIN_EXP/FLT_MAX_EXP from <float.h> in C. InfoBroker2020 (talk) 10:13, 15 May 2021 (UTC)[reply]
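For anyone comparing the two conventions, a minimal C sketch (assuming an IEEE 754 float, so the <float.h> values below are the usual ones; std::numeric_limits<float> in C++ reports the same numbers):

    #include <float.h>
    #include <stdio.h>

    int main(void) {
        /* C/C++ model: significand in [0.5, 1), so the extreme exponents
           are reported as -125 and +128 on IEEE 754 systems. */
        printf("FLT_MIN_EXP = %d, FLT_MAX_EXP = %d\n", FLT_MIN_EXP, FLT_MAX_EXP);

        /* IEEE 754 model: significand in [1, 2), so the same limits appear
           as emin = -126 and emax = +127 (subtract 1 from the above). */
        printf("IEEE emin = %d, emax = %d\n", FLT_MIN_EXP - 1, FLT_MAX_EXP - 1);
        return 0;
    }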

When the implicit bit doesn't exist

Can someone check this for me:

"The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit (to the left of the binary point) with value 1 unless the exponent is stored with all zeros."

Is the "exponent is stored with all zeros" part correct? - Richard Cavell (talk) 05:43, 17 December 2010 (UTC)[reply]

Yes, that's correct -- if there were an implicit one bit in that case then the number could not have the value zero. mfc (talk) 14:19, 21 December 2010 (UTC)[reply]

So it's definitely the case that there is no implicit bit when, and only when, the exponent, and only the exponent, is exactly 0? - Richard Cavell (talk) 07:53, 24 December 2010 (UTC)[reply]
The original statement is correct. For IEEE754, there is no implicit bit for zero or any of the denormalized representations, all of which have zero exponents. For pre-754 formats, there might not be an implicit bit at all. DEC floating formats used an explicit most significant bit, as did most mainframes. There are also exceptional cases of no implicit bit for the various types of NaNs, and the two infinities: in those the exponent has all bits set. Here is a good summary. —EncMstr (talk) 18:11, 24 December 2010 (UTC)[reply]
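To make the rule above concrete, a small C sketch (illustrative only, assuming float is IEEE 754 binary32) that inspects the stored exponent field: the implicit leading 1 is present exactly when that field is neither all zeros (zero and subnormals) nor all ones (infinities and NaNs):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Classify a binary32 value by its stored exponent field. */
    static void classify(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);            /* reinterpret the 32 bits */
        uint32_t expfield = (bits >> 23) & 0xFFu;  /* 8-bit biased exponent   */
        uint32_t fraction = bits & 0x7FFFFFu;      /* 23 stored fraction bits */

        if (expfield == 0)
            printf("%g: exponent field all zeros -> no implicit 1 (zero or subnormal)\n", f);
        else if (expfield == 0xFFu)
            printf("%g: exponent field all ones -> infinity or NaN, no implicit 1\n", f);
        else
            printf("%g: normal, significand = 1 + %u/2^23 (implicit leading 1)\n",
                   f, (unsigned)fraction);
    }

    int main(void) {
        classify(0.15625f);  /* normal number          */
        classify(0.0f);      /* zero                   */
        classify(1e-40f);    /* stored as a subnormal  */
        return 0;
    }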

I made an edit around the topic of this talk section here which was reverted by Vincent Lefèvre with the edit summary "This is incorrect due to subnormals". I thought I had accounted for subnormals in my edit, but the edit summary Vincent left was a bit vague (this is fine, there isn't much space in an edit summary) so I just wanted some elaboration on what was wrong from them or someone else who happens to see this. My edit also somewhat changed the tone of the section I was editing, giving a more explanatory feel, so that may be reason to revert my edit as well. Here's a comparison from the original to my edit. sent by TheGnuGod (talk) at 12:46, 22 April 2024 (UTC)[reply]

@TheGnuGod: There were several issues. In particular, "all non-zero binary numbers start with a 1" is wrong for subnormals (but "start with" is rather ambiguous). Moreover, your changes did not follow the encyclopedic style (e.g. do not use "that's" or "we"). I've just edited the article to correct ambiguities and give some details; more details are available via internal links. Regards, — Vincent Lefèvre (talk) 13:59, 22 April 2024 (UTC)[reply]
Thanks for the clarification! I see that my edit didn't really feel very wikipedia-ish but it is just true that the first significant digit of all non-zero binary numbers is a 1. IEEE 754 doesn't start its mantissa with a 1 for all non-zero binary numbers though, so I can understand why that wouldn't make sense to have on the page. I think I will make one (hopefully) final edit to that paragraph to make it flow a bit nicer when you read it, but if I accidentally change its meaning please tell me or revert it. sent by TheGnuGod (talk) at 19:57, 24 April 2024 (UTC)[reply]
@TheGnuGod: The leading bit of the significand is 1 only for normal numbers. Subnormal numbers are nonzero numbers for which the leading bit is 0. Note that the IEEE 754 standard says: "A decimal or subnormal binary significand can also contain leading zeros, which are not significant." — Vincent Lefèvre (talk) 00:49, 26 April 2024 (UTC)[reply]
That is literally what I said. Your message and my message contain exactly the same information, I just didn't use the word subnormal. I accounted for the fact that subnormal numbers start the significand with zero(s) but (like IEEE said) that those zero(s) are not significant. Moreover, I wrote a whole sentence about that, but I suppose you didn't read it for some reason? Kind regards, I guess. sent by TheGnuGod (talk) at 16:10, 26 April 2024 (UTC)[reply]
@TheGnuGod: These zeros are not significant, but they are part of the significand, and this is what matters here, in a section dealing with the representation and encoding. — Vincent Lefèvre (talk) 21:45, 26 April 2024 (UTC)[reply]
I said that as well. You have now failed to read my message twice in a row, telling me I did not say something that I clearly did. I said "IEEE 754 doesn't start its mantissa with a 1 for all non-zero binary numbers though so I can understand why that wouldn't make sense to have in the article." Please read next time, it is quite helpful when communicating in text. sent by TheGnuGod (talk) at 09:36, 27 April 2024 (UTC)[reply]

Adding conversion to the main article?

The Section on converting Decimal to Binary32 is really useful, but hard to find tucked in here on the single-precision page. It would be nice if someone could expand it to the general case and put it on the main floating point page. — Preceding unsigned comment added by Canageek (talkcontribs) 20:38, 15 December 2011 (UTC)[reply]

2^0 = 1 or 0 for bits?

from article:
"consider 0.375, the fractional part of 12.375. To convert it into a binary fraction, multiply the fraction by 2, take the integer part and re-multiply new fraction by 2 until a fraction of zero is found or until the precision limit is reached which is 23 fraction digits for IEEE 754 binary32 format.

0.375 x 2 = 0.750 = 0 + 0.750 => b−1 = 0, the integer part represents the binary fraction digit. Re-multiply 0.750 by 2 to proceed

0.750 x 2 = 1.500 = 1 + 0.500 => b−2 = 1

0.500 x 2 = 1.000 = 1 + 0.000 => b−3 = 1, fraction = 0.000, terminate

We see that (0.375)10 can be exactly represented in binary as (0.011)2. Not all decimal fractions can be represented in a finite digit binary fraction. For example decimal 0.1 cannot be represented in binary exactly. So it is only approximated.

Therefore (12.375)10 = (12)10 + (0.375)10 = (1100)2 + (0.011)2 = (1100.011)2 "
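For what it's worth, the multiply-by-2 procedure quoted above is only a few lines of code; a minimal C sketch (illustrative, not from the article) applied to 0.375:

    #include <stdio.h>

    int main(void) {
        double fraction = 0.375;
        printf("0.");
        /* Multiply by 2; the integer part is the next binary fraction digit.
           Stop when nothing is left or after 23 digits (binary32 limit). */
        for (int i = 0; fraction != 0.0 && i < 23; i++) {
            fraction *= 2.0;
            int bit = (int)fraction;   /* 0 or 1 */
            putchar('0' + bit);
            fraction -= bit;
        }
        putchar('\n');                 /* prints 0.011 */
        return 0;
    }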

So 12 can be as if , or if ? — Preceding unsigned comment added by Versatranitsonlywaytofly (talkcontribs) 17:49, 2 March 2012 (UTC)[reply]
I think, for last digit and for any over Is that correct? For example, for last digit and and thats how we get 0. — Preceding unsigned comment added by Versatranitsonlywaytofly (talkcontribs) 18:09, 2 March 2012 (UTC)[reply]
Or if from article so . But then how to get 1, need another bit (say, in the end) for 1 like if 0 (digital) then 0 (decimal) and if 1 (digital) then +1 (decimal). — Preceding unsigned comment added by Versatranitsonlywaytofly (talkcontribs) 18:26, 2 March 2012 (UTC)[reply]
It might be irrelevant, but there almost no chance, that CPU using this stupid conversion. It decimal digit is 4 bits and there is 82 (multiply table of decimal numbers and 0) multiply gates for one decimal digit with over decimal digit (4 bits multiplication with 4 bits and 82 possible results so at least need 4*82=328 transistors - this is minimum, but real number might be much bigger). Also the same 82 gates for division, addition and subtraction, so 4*328=1312 gates for 4 basic operations (+,-,*,/). So intel 4004 CPU have 2300 transistors, this looks little bit not enough. If we talking about single precision (32 bits = 32/4=8 decimal digits), then need minimum 1312*8=10496 transistors. Intel 8086 has 3500 transistors and seems have 80 bits coprocessor 8087 which is another separate chip. If 32 bits is 8 decimal digits, then 80 bits is ten decimal digits, then thus need minimum 1312*10=13120 transistors. This means either 8087 have more than 13120 transistors or there 80 bits doesn't mean, that it have 10 decimal places computing units, but just can calculate in such precision and 8087 is chip of instructions how to calculate in such (80 bits) precision, but this 8087 coprocessor itself don't calculate anything. Thus then 1312<3500, but still too small number (even for coprocessor 8087, but maybe 8087 have more than 3500 transitors, maybe it have over 10000 transistors), maybe 4004 also have coprocessor (which maybe consist of many chips or 4004 is of limited functionality). — Preceding unsigned comment added by Versatranitsonlywaytofly (talkcontribs) 19:11, 2 March 2012 (UTC) correction: Intel 8086 have 29000 transistors and Intel 8008 have 3500 transistors. http://www.intel.com/technology/timeline.pdf[reply]
BTW, by my estimations CPU alone have power with single core 3Ghz to render nowadays graphics. There is hardest thing not shadows projection, but texturing (in 1998 year game Motocross Madness shadows from bike falls on curved surface of hill(s) and this is on nvidia Riva TNT card or Voodoo 2/3 and 300 MHz Pentium II; this is almmost the same as selfshadowing, because anyway there probably just different parts added together in nowadays games; in Tomb Raider Legend 1-5 fps [with FRAPS] difference with shadows or without at about average 20-40 fps, and lagging game even on best cards due to big animated water textures and many small but far bump mapping textures which don't use mipmaping if they used on many far objects, but if you come closer to wall they can't be any bigger and fps rate increases to 40-90 fps with next gen effects, which means bump mapping, which decreasing fps twice). Need 1000*1000=1000000 pixels output on screen about 30-60 frames per second. So need 30*1000000=3*10^7 operations. One multiplication of decimal number CPU doing in one cycle, so 3GHz CPU can make only 3*10^9 decimal number additions or subtraction or multiplications or divisions per second. Nowadays games have about 50000 triangles (or about 100000 vertexes) in scene, thus moving vertexes is not hard part and GPU here really unnecessary, because without GPU 2-3 times difference in fps only of Directx 2010 tutorials (or in 3D Studio Max 5 you can get 10 teapots or 10 spheres with more than million poligons each and with more than 25 fps, but without textures on single core 2-3 GHz processor and nVidia Riva TNT 2). So 30(fps)*10^5(vertexes)=3*10^6 vertexes/s. Here need rotation matrices which have few sine, cosine functions, but cosine you can get roughly result from 1 MB table, say or with Taylor series calculate and here about 10 addition and 10-20 multiplication operations (rising power is simple, because: b=a*a, c=b*b, d=c*c, so a^5=a*c, a^7=c*a, a^8=d). For any rotation there is enough 4 decimal digits or even 2 decimal digits (16 or 8 bits) precision. Thus about 10-30 operation for sine or cosine function calculation; so we have 3*10^6 (vertexes/s) * 30 (operations) * 4 (decimal digits) = 3.6*10^8 (operations) and our CPU at 3 GHz can do 3*10^9 such operations. Then in 3DMax we use texture wraping (texture putting on geometric object) cylindrical or spherical or projectional, whatever you like (I doubt spherical exist), and then this texture convert for 3D game to projectional to made it faster in realtime. Need to calculate rotation about Ox and Oz axises for each triangle and then project texture pixels onto triangle. Triangle have equation of [flat] plane. Need to find point of intersection of projectional ray [from texture pixel] and each triangle of mesh (3D object). This is not very hard, need one square root and about 10-15 addition and multiplication operations in total; with square root, say, about 20 operations. So need 10^6 (pixels) * 30 (fps) * 20 (operations) = 6*10^8 operations. So total 6*10^8 + 3.6*10^8 = 9.6*10^8 (operations) and it is less than 3*10^9 (operations) on 3GHz CPU (this is where you get 90-100 fps). There true that is many textures, but if object is far then mip-maping textures used (smaller versions of texture 1/4 pixels count of original and for very far objects 1/16 number of pixels of original and so on). So we still talking about roughly the same 10^6 pixels in scene which belongs to texture. So usually texture is about 512*512 and so for bump mapping. 
Bump mapping only is about the same as texturing but little bit harder because need dot product calculate of light and bump map vector (this is exactly 3 multiplication operations and 2 addition operations for each bump map texture pixel, so 5 operations in total, but to compare it with 20 operations in total for texturing it almost nothing). So when we have each pixel of texture projected onto geometry then need rotate this pixels to viewer and this is another about 5-20 operations for calculating cosine or sine function, but quaternions seems can make it faster but less precise. Also there still big chance, that all cosine or sine results are gotten from table. So in total we get 2*6*10^8 + 3.6*10^8 = 1.56*10^9 operations or about 60 fps at 3.1GHz CPU (or 30 fps with bump mapping). — Preceding unsigned comment added by Versatranitsonlywaytofly (talkcontribs) 20:17, 2 March 2012 (UTC)[reply]

Sorry I didn’t read your entire post but 2^0 = 1 by definition, which is not equal to zero. See Exponentiation#Arbitrary integer exponents (Zeroth power redirects). Is there anywhere that suggests it is zero? Vadmium (talk, contribs) 02:48, 3 March 2012 (UTC).[reply]

But this is there binary digits operations comes handy. To address memory (RAM or nand-flash) need generate code for get each next bit of file. For this need billions combinations and processor can't know each next bit code access number, so need use bits addition. Need to bits string add 1 bit each time. For example
    1001010
   +0000001=
    1001011, then
    1001011
   +0000001=
    1001100
and so on. By adding to string 0000000 each time 0000001 you will get all 2^7=128 results (only in the end of section (in case one file was removed and over inserted) need leave bits string in what section file information continues).
Because I don't know what somebody someone teaching in university or what standarts IEEE is for decimal digits conversion into binary, but there I don't see anyway how you can binary convert simple into decimal to output on screen. You can only emulate such process, but maybe even can't do that, because after trying to get, for example, from binary number 1001010 the digital number, then you know it is (say, you can't get 1 for integers, but only 0.999999 for real numbers or 1.000002). But if you rise power 2^6 binary and add 2^3 and 2^1 binary then you again will get same binary number. And if you want have table of outputing 2^1, 2^2, 2^3 ... 2^n then you need billions bytes of memory to output one number, which have about 10 decimal digits. Theory of binary computations used in article don't apply in practice. Or I am wrong? — Preceding unsigned comment added by Versatranitsonlywaytofly (talkcontribs) 09:26, 3 March 2012 (UTC)[reply]
Actually it is possible to calculate in binary form. But in the end still need some tables of numbers and need to have decimal digits adders. So everything we calculating in binary and then hard part is to convert it to decimal. Free pascal calculating in 64 bits precision (15-16 decimal digits). So to get such precision without emulation, need about 64 bits multiply or divide or add or subtract with 64 bits. So after billions calculation then need this 64 bits convert to decimal digits. Need 64 bits divide into 16 parts, each of 4 bits. Then after final result say, we get 64 bits string:
1010000000000000000000000000000000000000000000000000000001001011.
Then we dividing this string into 16 parts:
1010 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0100 1011.
Last 4 bits have number 1011 = 11. Last 4 bits can have 16 combinations (number from 0 to 15), but only 10 of them are meaningful (this 4 bits can mean or 0 or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9) when 15 converting into byte (4 bits for 0 or 1 and 4 bits for second number [of 15] from 0 to 9). But there need two decimal places, because 16 have two decimal places. Each of 16 combinations gives decimal number from 0 to 15. For first and for second decimal digit [of 15] 0 have the same 4 bits code for first number as for second number and the same must be for 1 (first decimal digit of 15 can be only 0 or 1; then we can add 1 to over second from end of 64 bits string 4 bits number which decoded becoming 8 or more bits number, but divisible by 4).
Now take a look at 4 bits from 66th bit to 60th bit, which is 0100. This 4 bits is number 0100 = 4, which in this position means 4×16 = 64. But maximum number of this bits can be, I am not sure, or 2^7=128 or 128+64+32+16=240. So this time table have 3 decimal digits and each of 4 bits, so need decode 0100 or any over combination second from end of 4 bits to number from 0 to 128 (or 240). You have in memory each of 4 bits code and they mean number from 0 to 128 (or to 240).
Now take a look at first 4 bits: 1010. This number is 2^63 + 2^61 = 11529215046068469760. This number is huge, but there is only 16 such huge numbers (like in all cases only 16 numbers and the smaller, the closer 4 bits are to end of 64 bits number). So for this 4 first bits (of 64 bits number) there is each code of 4 first bits (0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111). So in our case code 1010 of 4 first bits means number 11529215046068469760. If, say, code of 4 first bits is 1100, then this code means 2^63 + 2^62 = 13835058055282163712.
So now we can decode this bits string
1010000000000000000000000000000000000000000000000000000001001011
into decimal digits number: 11529215046068469760 + 64 + 11 = 11529215046068469835.
As you see for 3 add instructions we must use addition table of decimal numbers (we need to know how, for example 0101+1011 outputs decimal number from 0 to 18, so need ten variants of output 0101 with all over another codes for decimal number, then next coded decimal number with all over ten decimal number results; in over words 4 bits + 4 bits and output 8 bits, 4 bits for first number (0 or 1) and 4 bits for second number (from 0 to 9)), which must be integrated into CPU. But we don't must have decimal numbers multiplication and division and subtraction tables (and for this reason there need much less transistors, but more memory for storing numbers of each code).
BTW, interesting fact, that this (2^9 + 2^8 + 2^7 + 2^6 + 2^5 + 2^4 + 2^3 + 2^2 + 2^1 + 2^0 = 1023) calculation is correct:
So 0000000000=0 and 1111111111=1023. And we know, that there are 2^10 = 1024 variants (so from 0 to 1023). — Preceding unsigned comment added by Versatranitsonlywaytofly (talkcontribs) 17:06, 4 March 2012 (UTC)[reply]
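For the record, converting a binary integer to decimal digits for display needs no large per-code tables; the usual method is repeated division by 10, as in this small C sketch (using the 64-bit pattern discussed above, 0xA00000000000004B):

    #include <stdint.h>
    #include <stdio.h>

    /* Print a 64-bit binary value in decimal by repeated division by 10,
       collecting the remainders (least significant digit first). */
    static void print_decimal(uint64_t x) {
        char digits[21];
        int n = 0;
        do {
            digits[n++] = (char)('0' + x % 10);
            x /= 10;
        } while (x != 0);
        while (n > 0)
            putchar(digits[--n]);
        putchar('\n');
    }

    int main(void) {
        /* 1010 0000 ... 0100 1011 in binary */
        print_decimal(0xA00000000000004BULL);   /* prints 11529215046068469835 */
        return 0;
    }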

TL;DR 2^0=1 — Preceding unsigned comment added by 82.139.82.82 (talk) 10:45, 16 September 2015 (UTC)[reply]

PASCAL

PASCAL float types are "REAL" or additionally "LONGREAL" if there must be types that differ in precision, just like in C they are (long) float. Single, double etc. are names from the IEEE specs, and are not part of PASCAL the language, but machine-specific identifiers.

I didn't correct it in the main article to avoid confusion (especially since in some popular compilers REAL was typically mapped to softfpu, and single/double were mapped to the hardware fpu). 88.159.71.34 (talk) 12:04, 30 June 2012 (UTC)[reply]

Number of "bytes"[edit]

"Single-precision floating-point format is a computer number format that occupies 4 bytes (32 bits) in computer memory and represents a wide dynamic range of values by using a floating point."

Does this really mean 4 bytes? Won't 4 octets be less ambiguous? — Preceding unsigned comment added by 112.119.94.80 (talk) 21:20, 7 August 2012 (UTC)[reply]

It is a reasonable question. Byte is certainly much better known whereas octet is, as far as I know, mostly used in RFCs. Since IEEE 754 was standardized and adopted well after the ascension of the eight bit byte, it seems to me we serve our readers better by calling it a byte. I would not object to a note which specifies that a byte here means eight bit byte or octet. Maybe it is even clearer to omit bytes altogether, specifying only bits? —EncMstr (talk) 23:26, 7 August 2012 (UTC)[reply]
What would such a footnote add? You can already see it's a 32-bit value because it actually literally says so. I think the lead is crystal clear on this matter. — Preceding unsigned comment added by 82.139.82.82 (talk) 10:42, 16 September 2015 (UTC)[reply]

# of Significant Digits

For a single-precision floating point in decimal presentation, it was said that only 7 digits are significant, but this is not mentioned in the article, nor why it is so. Jackzhp (talk) 17:50, 17 March 2013 (UTC)[reply]

Precision

There are a few statements relating to precision that I think are incorrect (please see [1]):

 "This gives from 6 to 9 significant decimal digits precision (if a decimal string with at most 6 significant decimal digits is converted to IEEE 754 single precision and then converted back to the same number of significant decimal digits, then the final string should match the original; and if an IEEE 754 single precision is converted to a decimal string with at least 9 significant decimal digits and then converted back to single, then the final number must match the original[4])."

You don't really get 9 digits of precision; you only get up to 8. 9 digits is useful for "round-tripping" floats, but it is not precision.

 "some integers up to nine significant decimal digits can be converted to an IEEE 754 floating point value without loss of precision, but no more than nine significant decimal digits can be stored. As an example, the 32-bit integer 2,147,483,647 converts to 2,147,483,650 in IEEE 754 form."

Again, same comment on 9 digits. Also, why is this discussion limited to integers? And the example converts to 2,147,483,648, not 2,147,483,650.

 "total precision is 24 bits (equivalent to log10(2^24) ≈ 7.225 decimal digits)"

The simple "take the logarithm" approach does not apply to floating-point, so the precision is not simply ≈ 7.225.

References

--Behindthemath (talk) 22:01, 29 June 2016 (UTC)[reply]

7 digits is closer to the truth. Dicklyon (talk) 23:21, 29 June 2016 (UTC)[reply]
All 7-digit integers, and 8-digit integers up to 2^24 = 16777216, are exactly represented as IEEE single precision, but 2^24 + 1 = 16777217 falls between two representable values. I'd say that's close enough to 7.2 digits. Dicklyon (talk) 05:04, 30 June 2016 (UTC)[reply]
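That boundary is easy to check directly; a tiny C sketch (assuming float is IEEE 754 binary32):

    #include <stdio.h>

    int main(void) {
        float a = 16777216.0f;  /* 2^24, exactly representable    */
        float b = 16777217.0f;  /* 2^24 + 1, rounds back to 2^24  */
        printf("%.1f %.1f same value: %d\n", a, b, a == b);  /* same value: 1 */
        return 0;
    }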
For integers, yes, log10(2^24) makes sense. But not for floating-point values. Depending on the exponent, precision can vary from 6-8 digits Behindthemath (talk) 13:35, 30 June 2016 (UTC)[reply]
Note that 10001 is less precise than 10101 for its size even though they both have 5 bits. With 5 significant bits, 11111 is possible but not 100001. So a bit of precision is lost after the transition. We average 23 and 24 to get 23.5, and multiply it with log(2) to get 7.07. 7.22 is just what you get with all significant bits on, maximizing precision for size. Also, 2147483647 (I don't like these commas) converts to 2147484000 according to me, but it could be 2147480000, 2147483600, 2147483650 or 2147483648 depending on program(ming language) used. 24691358r (talk) 14:10, 27 June 2017 (UTC)[reply]
@24691358r: Please explain why/how 17 is less precise than 21 even though they both have 5 bits. Then, why the programming language matters: For a particular computer, the same hardware is processing the numbers, not a language. —EncMstr (talk) 16:10, 27 June 2017 (UTC)[reply]
In base 18 for example, 17 is single digit (h) but 21 is two-digit (13), and both have integer precision. Because 13 in base 18 starts at a higher digit, it has more precision. I can show example meaningful in decimal: 1001 (9, 1 significant) is less precise than 1101 (13, 2 significant). 4*log(2) is 1.20412, but it assumes precision of 4 bits at a size of 2^4 (10000). However, 1001 has a lower size and therefore lower significant precision. So we average 3 and 4 to get 3.5log(2) which is 1.0536. Same thing for single float which gives average precision of 7.07.
One language could show 6 significant digits, while another could show 8. That's the difference. 24691358r (talk) 19:20, 27 June 2017 (UTC)[reply]

I agree that the listed precision is wrong. The formula log(2^23), or about 6.92, gives the amount of decimal precision. That's log base ten of two to the power of mantissa digits. For example, if you go above 8192, you lose the ability to differentiate differences of 0.001 and 0.002, etc, because after 8192, the values go in increments of 1/512 rather than 1/1024. That's just under 7 digits of precision. We should edit the article to say 6 to 7 decimal digits, rather than 6 to 9, because single-precision floats simply cannot store 8 or 9 decimal digits of accuracy (if you're curious, you would need 30 bits of mantissa for 9 decimal digits, log(2^30) = 9.03...). Aaronfranke (talk) 09:41, 27 January 2018 (UTC)[reply]

However, it's also worth noting that if you include the implicit bit as precision, log(2^24) is about 7.22, which is still far closer to 7, but it does mean that you need 8 decimal digits to guarantee accurate conversions if you go binary -> decimal -> binary. You cannot, however, have all 8-digit decimal numbers survive decimal -> binary -> decimal. Aaronfranke (talk) 09:52, 27 January 2018 (UTC)[reply]
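The two round-trip directions discussed here have standard names in C's <float.h>, which is a compact way to state the 6-versus-9 split; a minimal sketch (FLT_DECIMAL_DIG requires C11):

    #include <float.h>
    #include <stdio.h>

    int main(void) {
        /* FLT_DIG (6): any decimal string with at most this many significant
           digits survives decimal -> binary32 -> decimal unchanged. */
        printf("FLT_DIG = %d\n", FLT_DIG);
    #ifdef FLT_DECIMAL_DIG
        /* FLT_DECIMAL_DIG (9): printing a binary32 with this many significant
           digits guarantees binary32 -> decimal -> binary32 round trips. */
        printf("FLT_DECIMAL_DIG = %d\n", FLT_DECIMAL_DIG);
    #endif
        return 0;
    }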


@Vincent Lefèvre: Thanks for the edits. I was wrong, but not about what you think. When I first edited it, I first thought the same as you, that up to 7 digits is representable. But then I realized that it wasn't talking about digits but significant digits, i.e. omitting trailing zeros. For example, 314,368,000,000 is a 12-digit integer with 6 significant digits (which is representable in IEEE single). And as the discussion in this section is exploring, it's not clear that there doesn't exist a decade in which some 7-significant-digit numbers are ambiguous. So I left it alone, but added the obvious caveat that some integers over the limit are not representable at all. But after seeing your edits and thinking of this reply, I realized that I hadn't read the other part of the claim carefully. There might be no significant loss of precision, but there is still some rounding. 2^26 + 36 = 67108900 = 0x4000024 = 2^2 × 0x1000009 is an 8-digit integer with 6 significant digits which is not exactly representable in single precision; it is rounded to 2^3 × 0x800004 = 67108896. 74.119.146.16 (talk) 14:21, 4 October 2019 (UTC)[reply]
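A quick check of that example (illustrative C sketch; in [2^26, 2^27) the spacing between consecutive binary32 values is 8):

    #include <stdio.h>

    int main(void) {
        float x = 67108900.0f;   /* 2^26 + 36: 8 digits, 6 significant digits */
        printf("%.1f\n", x);     /* prints 67108896.0 (nearest multiple of 8, ties to even) */
        return 0;
    }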

One issue here is that "significant digits" is an ambiguous measure of precision: the difference between two representable "3 significant digit" numbers compared to the magnitude of the number is variable, 1.01 to 1.02 gives a ratio of about 1/100, while 9.97 vs 9.98 gives a ratio of about 1/1000 (the link Behindthemath mentioned sort of touches on this, but instead of approaching it in terms of rounding error for an arbitrary value, it approaches the issue as "can it represent every number that has 7 significant digits"). Using precision ratio as the measurement, float32 has worst and best case ratios (excluding denormal and non-numeric values) of effectively 1/(2^23) to 1/(2^24) (so, 1.19e-7 to 5.96e-8), for the cases of significand (1)000... versus (1)111... (implicit 1 in parentheses). 7 significant figures has a precision ratio between 1e-6 to 1e-7, so float32 is almost always more precise (in terms of expected rounding error from higher precision) than 7 significant figures (float32's worst case of 1.19e-7 occasionally overlaps 7-digit's best case of 1e-7 on the number line).

The alternative "simple" calculation (in terms of rounding error) is to take log10 of the number of distinct representable values between 1 and 10 (because we chose to measure it against decimal), which spans a little over 3 exponents of float32: 1 to 2, 2 to 4, 4 to 8, and part of 8 to 16. Each full range of an exponent contains 2^23 representable values. This comes to log10((3 + (10 - 8) / (16 - 8)) * 2^23) = 7.4..., but other ranges between consecutive powers of 10 will overlap different ranges of float exponents and have slightly different answers. No, I don't have a source, its just math based on the definitions. Compwhiz797 (talk) 23:39, 8 April 2022 (UTC)[reply]

Ambiguous decimal-to-binary examples

The section on decimal-to-binary conversion is unclear on a few important points:

  1. How exactly is a fractional component translated into a sequence of bits?

    I was unable to figure out what was meant by "multiply the fraction by 2, take the integer part and repeat with the new fraction by 2" until a different source provided a clearer explanation. It appears to be copyrighted, so I won't repeat its content here. Basically, the conversion process boils down to:

    Or, to put it in JavaScript:

    let precision = 23;
    let fraction = 0.6875;
    let bits = 0;
    
    // Multiply by 2; the integer part is the next fraction bit.
    for(let i = 0; fraction && i < precision; ++i){
    	fraction *= 2;
    	const bit = Math.trunc(fraction); // 0 or 1
    	bits |= bit << (precision - 1 - i);
    	fraction -= bit;
    }
    
    bits === 0b10110000000000000000000; // 0.6875 = 0.1011 in binary, padded to 23 bits
    

    None of this was apparent from reading (and re-reading) the explained procedure.

  2. Is the above translation performed before or after normalisation?

    If after, why is normalisation not explained first?

  3. How is the exponent adjusted after normalising?

    Given a number like 0.00675, we end up with an exponent of -3 and a significand of 6.5 (110.1). According to the article, 110.1 should be normalised into 1.101, implying the exponent of -3 should be incremented by 2 to compensate (). Based on my own observations and tinkering, it appears the exponent should be decremented instead (), although my understanding is still shaky at best. Could anybody clarify?

Moreover, the section in question is poorly-structured and littered with redundant language. Conversely, the binary-to-decimal section is clear and concise, enough so that I was able to implement it in JavaScript using no other references. It might benefit the article to rewrite the decimal-to-binary section completely. --OmenBreeze (talk) 09:50, 23 February 2020 (UTC)[reply]

Both sections are very unclear, WP:OR, and even incorrect. They should be removed. In your JavaScript code, if you mean bytesToFloat32, then you're not converting to decimal, but to binary. That's very different. Vincent Lefèvre (talk) 21:27, 23 February 2020 (UTC)[reply]
then you're not converting to decimal, but to binary. That's very different — Do you mean the JavaScript snippet I pasted above? Because that's from my (unfinished) attempt at implementing the inverse of bytesToFloat32, which hasn't been pushed to GitHub yet. BTW, I should point out that these functions are purely an exercise to improve my own understanding of data formats – JavaScript already provides an interface for direct manipulation of a raw memory-buffer. --OmenBreeze (talk) 10:15, 24 February 2020 (UTC)[reply]
With bytesToFloat32, you're converting a sequence of bytes representing a binary32 value to a binary32 value. There's no decimal here. And there's no decimal in the inverse of bytesToFloat32. The only things you have here are sequences of bytes and binary32 values (which are, as the name indicates, in binary). Vincent Lefèvre (talk) 12:05, 24 February 2020 (UTC)[reply]
No, because JavaScript represents numbers internally as 64-bit, double-precision floating-point values. Bitwise arithmetic is limited to 32-bit values only, and bitwise operations implicitly discard the decimal half of a Number value. That's the extent to which JavaScript discriminates between fundamental number formats (sans the recently-added BigInt primitive, which addressed the lack of arbitrary-precision arithmetic in JavaScript).
In any case, you can see for yourself by copying the function body, removing the `export ` keyword, and pasting it into your browser's JavaScript console. Then run something like bytesToFloat32([0x41, 0xC8, 0x00, 0x00]);, taking into account that the result will resemble an integer (25) instead of a float (25.0). --OmenBreeze (talk) 13:12, 24 February 2020 (UTC)[reply]
Javascript numbers correspond to binary64 (note: this is binary, not decimal), but your function converts a binary32 encoding into a binary64 number, which is here also a binary32 number (binary32 is a subset of binary64). Anyway, this is an article about a floating-point format, not about Javascript specificities, and decimal is not involved here. Vincent Lefèvre (talk) 22:22, 24 February 2020 (UTC)[reply]
Yes, I'm aware that there's a precision loss; there's also a bytesToFloat64 function which performs the more accurate conversion. Both functions operate on numbers in the range of 0-255 (hence, "bytes") and encode a single- or double-precision value, respectively. Every value I've tested it with has yielded accurate results (see the repository's test suite). OmenBreeze (talk) 00:45, 25 February 2020 (UTC)[reply]
You're reading a binary32 number from its byte-encoding as a binary64 number. So, there is no precision loss. But I repeat that this has nothing to do with decimal as mentioned in the WP article, i.e. this is off-topic. Vincent Lefèvre (talk) 01:29, 25 February 2020 (UTC)[reply]

Please describe how negative numbers are represented

Is it ones complement, twos complement, or something else? 66.194.253.20 (talk) 13:01, 10 July 2020 (UTC)[reply]

Negative numbers are represented by turning the sign bit on. The IEEE representation uses the sign-and-magnitude approach for storing the significand. InfoBroker2020 (talk) 10:16, 15 May 2021 (UTC)[reply]
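A short C sketch showing this (illustrative, assuming float is IEEE 754 binary32): negating a value flips only bit 31 of the encoding, i.e. sign-and-magnitude rather than a complement representation:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float a = 12.375f, b = -12.375f;
        uint32_t ua, ub;
        memcpy(&ua, &a, sizeof ua);   /* view the raw encodings */
        memcpy(&ub, &b, sizeof ub);
        printf("%08X\n%08X\n", (unsigned)ua, (unsigned)ub);  /* 41460000, C1460000 */
        printf("only the sign bit differs: %d\n", (ua ^ ub) == 0x80000000u);
        return 0;
    }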

3.4 × 10³⁸

There have been some recent changes to the value of the largest representable number. These seem to have been undone. To discourage such changes, this post reports results of a simple test. Obviously, in the article proper, No Original Research.

In PostScript:

%!PS

/inf 2 2 -23 exp sub 2 127 exp mul def
(\ninf = ) =   inf =

() = () = () = () =
% Smallest x>0 such that x != 1+x
/tooBig 1e-07 def   /tooSmall 0 def   /guess -99 def
{
	tooBig tooSmall add 2 div  dup guess eq {pop exit} if
	/guess exch def
	guess 1 add 1 eq {/tooSmall} {guess = /tooBig} ifelse guess store
} loop  % Concludes that epsilon ~= 5.96047e-08 ~= 2^-24
(\nepsilon = ) =   tooBig =
/epsilonPlus1 tooBig 1 add def

() = () = () = () =
% 0 => largest possible float;
% 1 = boundary of what can be added to inf
% Multiplies by (1 + 2^-23)
[
	{3.40282e+38 {dup = epsilonPlus1 mul} loop}
	{1.01412e+31 {dup = inf 2 copy add eq = () = epsilonPlus1 mul} loop}
] 0 get exec  % Which test? Change integer.

Executed in Adobe Distiller Pro 11.0.23:

Processing prologue.ps...
Done processing prologue.ps.

inf = 
3.40282e+38




7.5e-08
6.25e-08
6.09375e-08
6.01563e-08
5.97656e-08
5.9668e-08
5.96191e-08
5.96069e-08
5.96054e-08
5.9605e-08
5.96048e-08
5.96047e-08
5.96047e-08
5.96047e-08
5.96047e-08
5.96047e-08

epsilon = 
5.96047e-08




3.40282e+38
3.40282e+38
3.40282e+38
3.40282e+38
3.40282e+38
3.40282e+38
3.40282e+38
3.40282e+38
3.40282e+38
%%[ Error: undefinedresult; OffendingCommand: mul ]%%

Stack:
1.0
3.40282e+38

Observe that PostScript’s single-precision can cope with 3.40282×10³⁸, but that is the limit. JDAWiseman (talk) 17:22, 12 July 2021 (UTC)[reply]
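For comparison outside PostScript, a minimal C sketch computing (2 − 2^−23) × 2^127 and checking it against FLT_MAX (assuming an IEEE 754 float; link with -lm for ldexpf):

    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* Largest finite binary32: significand 1.111...1 (24 ones) times 2^127. */
        float max = (2.0f - ldexpf(1.0f, -23)) * ldexpf(1.0f, 127);
        printf("computed: %.9e\n", max);        /* 3.402823466e+38 */
        printf("FLT_MAX:  %.9e\n", FLT_MAX);
        printf("equal: %d\n", max == FLT_MAX);  /* 1 */
        return 0;
    }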