Talk:Endianness/Archive 4

From Wikipedia, the free encyclopedia

"and properly handled"

I think a note needs to be made on how the problem has been handled. It's not the problem that has gone away, it's the binary data: modern cross-platform data exchange uses text formats such as XML and JSON, preventing programmers from ever encountering the issue in the first place. (...er, mostly. People still botch character set encoding.) In other cases, the framework handles it; e.g., Java writes data in big-endian order without the end programmer needing to understand the issue.
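To illustrate the point about text formats and framework-fixed byte order, here is a minimal Python sketch. It is not taken from the discussion: the sample value and variable names are assumptions, and Python's struct module stands in for the kind of framework support the comment attributes to Java.

```python
import json
import struct

value = 1025  # 0x00000401, an arbitrary sample value

# Text format: the number travels as decimal digits, so byte order never arises.
as_text = json.dumps({"value": value})

# Binary format with an explicit byte order, independent of the host CPU.
as_big = struct.pack(">I", value)     # ">" forces big-endian
as_little = struct.pack("<I", value)  # "<" forces little-endian

print(as_text)          # {"value": 1025}
print(as_big.hex())     # 00000401
print(as_little.hex())  # 01040000

# Reading back with the same explicit order recovers the value on any host.
assert json.loads(as_text)["value"] == value
assert struct.unpack(">I", as_big)[0] == value
assert struct.unpack("<I", as_little)[0] == value
```

The trouble only appears when a writer and a reader disagree about the order, which is what an explicit format character rules out.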

Typical programmers--not the guys designing Boost, not the guys writing the JVM, not the guys writing platform SDKs, but the other 90%--blow this kind of thing regularly when left to their own devices. It's just that there are fewer opportunities to foul it up anymore, as both the need for and the access to raw data have been reduced. No one parses a JPEG by themselves anymore unless they're the person writing the JPEG library.

Seems worth a note because the new methodology has led to a lot of technologies that have changed the way we think about data, but not sure how to do it without being antagonistic. Maybe "modern solutions" ? — Preceding unsigned comment added by 98.172.66.86 (talk) 17:22, 24 March 2015 (UTC)

Danny Cohen

I refer to Danny Cohen (1980-04-01). On Holy Wars and a Plea for Peace. IETF. IEN 137.

There I read on p. 2:

“There are two possible consistent orders. One is starting with the narrow end of each word (aka "LSB") as the Little-Endians do, or starting with the wide end (aka "MSB") as their rivals, the Big-Endians, do.”

IMHO, despite his great influence, Cohen is missing an important point. There is one and only one primary order in memory: the sequence of bytes (as I will call the smallest addressable unit in memory), and that order is defined by the address. This order has nothing to do with contents, or with a narrow or wide end; it can hold any kind of information. This primary order is often mapped to left and right, where left means low and right means high address. One should not mirror this, because one needs a single fixed coordinate system to which other coordinate systems can be related. (If you mirror this, as maybe the Hebrews do, you mirror everything and have explained nothing.)

Then there are other orders in computer science, which come with the type of data to be mapped to memory. One important (secondary) order is significance. Significance ranks one byte (or bit or digit) higher than another byte of the same field, and it comes into play when one wants to compare fields or do arithmetic with fields written in a positional number system. As everybody who takes a close look at this material agrees, significance can be mapped correlated or anti-correlated (or even un- or mixed-correlated) to the primary order, the address. (Since both significance and address are monotonic, correlated means that significance increases with the address, and anti-correlated means that significance decreases with the address.)

In his paper, Cohen consistently places the "MSB" to the left and the "LSB" to the right of the (paper) page, so that the Big-Endians have the low address to the left and the Little-Endians have it to the right. (This is indeed consistent with the shift operations, where Shift Right divides and Shift Left multiplies by a power of 2.)
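The correlated and anti-correlated mappings of significance to address can be made concrete with a short Python sketch. The sample value and the helper name describe are assumptions chosen for illustration; buffer offsets play the role of addresses.

```python
value = 0x0A0B0C0D  # sample 32-bit value; 0x0A is its most significant byte

# Anti-correlated (big-endian): significance falls as the offset rises.
big = value.to_bytes(4, "big")
# Correlated (little-endian): significance rises with the offset.
little = value.to_bytes(4, "little")

def describe(buf):
    """Pair each offset (the 'address') with the byte stored there."""
    return [(offset, hex(b)) for offset, b in enumerate(buf)]

print(describe(big))     # [(0, '0xa'), (1, '0xb'), (2, '0xc'), (3, '0xd')]
print(describe(little))  # [(0, '0xd'), (1, '0xc'), (2, '0xb'), (3, '0xa')]
```

In both cases the offsets run 0, 1, 2, 3; only the assignment of significance to those offsets differs.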

However, it is important to realize that every machine (or, if you like, every software package) anti-correlates significance with address when comparing text strings. A function such as strcmp() can hardly work in the correlated way, which would mean first finding the null terminator of both string operands and then comparing character by character from high to low address (and nobody would be interested in such a result, because all sorting conventions rank the first character highest). (Although terribly inefficient, this would do no harm if the only result of the comparison were equal/unequal. But for Binary Search, for example, to be applicable, a 3-way outcome (less than, equal, greater than) is imperative.) Instead the function compares "lexicographically": it compares the first characters of the two strings, and if they differ, this determines the result. If they are equal, the function compares the next two characters (in terms of address), and so on. This means that the first (lowest-address) character ranks highest and has the greatest significance, and that strcmp() operates in big-endian style. It does so among the Lilliputians and, I am pretty sure, among the Hebrews too, who read from right to left (which, by the way, has nothing to do with how they store their texts in memory). So I claim that every machine is big-endian with respect to text strings. Cohen says on p. 3:

“English text strings are stored in the same order, with the first character in C0 of W0, the next in C1 of W0, and so on.”

If the smallest addressable unit is bigger than a character (e.g. a word of 4 bytes), so that one is tempted to put more than one character into a word, then the ordering within that word may be different (which I do not want to discuss here). But if one maps one character to one byte (or to one word, but not to a quarter-word) to build a character string, this mapping is anti-correlated with the address (and big-endian) as soon as (3-way) string comparison comes into play.
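The comparison described above can be sketched in Python. This is an illustration under stated assumptions, not the C library's strcmp(): the function name compare_bytes is hypothetical, and it works on byte strings of known length instead of scanning for a null terminator. The last two calls show why the byte-wise order is said to be big-endian in style: it agrees with numeric order for big-endian images of unsigned integers, but not for these little-endian images.

```python
def compare_bytes(a: bytes, b: bytes) -> int:
    """Three-way lexicographic comparison: returns -1, 0 or 1."""
    for x, y in zip(a, b):  # walk from the lowest offset upwards
        if x != y:
            # The first differing position decides the result.
            return -1 if x < y else 1
    # All shared positions match: a proper prefix ranks lower.
    return (len(a) > len(b)) - (len(a) < len(b))

print(compare_bytes(b"apple", b"apricot"))  # -1 ('p' < 'r' at offset 2)
print(compare_bytes(b"abc", b"abc"))        # 0

small, large = 1, 256
# Big-endian images: byte-wise order agrees with numeric order.
print(compare_bytes(small.to_bytes(4, "big"), large.to_bytes(4, "big")))        # -1
# Little-endian images: byte-wise order disagrees with numeric order here.
print(compare_bytes(small.to_bytes(4, "little"), large.to_bytes(4, "little")))  # 1
```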

The fact that a machine instruction for lexicographical comparison of character fields is beyond the horizon of (many) Little-Endians can be taken as revealing, because (many) Big-Endians have one.

Now, there are also numerical fields and operations dealing with them. All computers that I have worked with (or heard of) adhere to some positional number system in which a single digit has a "significance" that is defined by its position within the number and that maps to its address. The mapping of numbers to memory defines the "type" of the instruction:

  • A big-endian instruction treats significance anti-correlated with the address: First byte ranks highest.
  • Little-endian treats significance correlated with the address: First byte ranks lowest.

A machine on which this holds for all numerical instructions is called a big-endian or, respectively, a little-endian machine; others may be called mixed-endian.
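The two definitions can be written out as positional weights. The following Python sketch is illustrative only: the function name decode and the two-byte sample are assumptions, and the result is cross-checked against the built-in int.from_bytes.

```python
def decode(buf: bytes, order: str) -> int:
    """Interpret buf as an unsigned integer in base 256."""
    n = len(buf)
    total = 0
    for address, byte in enumerate(buf):
        # Big-endian: the weight falls as the address rises (first byte ranks highest).
        # Little-endian: the weight rises with the address (first byte ranks lowest).
        exponent = (n - 1 - address) if order == "big" else address
        total += byte * 256 ** exponent
    return total

buf = bytes([0x12, 0x34])
print(decode(buf, "big"))     # 4660  (0x1234)
print(decode(buf, "little"))  # 13330 (0x3412)

# Cross-check against the standard library.
assert decode(buf, "big") == int.from_bytes(buf, "big")
assert decode(buf, "little") == int.from_bytes(buf, "little")
```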

This is very much in accordance with the article – and even with Cohen who says on p. 7:

“To the best of my knowledge only the Big-Endians of Blefuscu have built systems with a consistent order which works across chunk-boundaries, registers, instructions and memories. I failed to find a Little-Endians' system which is totally consistent.”

But he forgot to add: ... and this is impossible, because every machine (or user) compares character strings (texts) in big-endian style. On the other hand, if the notion of endianness is restricted to fixed-point numerics, then totally consistent Little-Endian systems may also exist.

In conclusion, I would propose referring to Danny Cohen's article as a delightful story and perhaps as the origin of the name, because I completely agree with its title. As a true reference on the matter in question, maybe Tanenbaum (Andrew S. Tanenbaum; Todd M. Austin (4 August 2012). Structured Computer Organization. Prentice Hall PTR. ISBN 978-0-13-291652-3.) should be used. But I do not have access to it. Anyone who does is welcome to comment. --Nomen4Omen (talk) 10:47, 27 March 2015 (UTC); updated --Nomen4Omen (talk) 12:34, 28 March 2015 (UTC)

I happen to have Structured Computer Organization (3rd ed.) beside me. I won't transcribe the whole section, but there are some choice quotes from section 2.2.3 Byte Ordering:
The bytes in a word can be numbered from left-to-right or right-to-left. At first it might seem that this choice is unimportant, but as we shall see shortly, it has major implications. ... The former system, where the numbering begins at the "big" (i.e., high-order) end is called a big endian computer, in contrast to the little endian of Fig 2-10(b). ... The term was first used in computer architecture in a delightful article by Cohen (1981).
It is important to understand that in both the big endian and little endian systems, a 32-bit integer with the numerical value of, say, 6, is represented by the bits 110 in the rightmost (low-order) 3 bits of a word and zeros in the leftmost 29 bits. In the big endian scheme, these bits are in byte 3 (or 7, or 11, etc.), whereas in the little endian scheme they are in bytes 0 (or 4, or 8, etc.) In both cases, the word containing this integer has address 0.
Pburka (talk) 23:42, 27 March 2015 (UTC)
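The quoted example of the 32-bit integer 6 can be checked with a small Python sketch. Packing three consecutive words is an assumption made here so that the offsets 3, 7, 11 and 0, 4, 8 from the quote become visible; the helper name nonzero_offsets is hypothetical.

```python
import struct

words = [6, 6, 6]  # three consecutive 32-bit words, each holding the value 6

big = struct.pack(">3I", *words)     # big-endian layout
little = struct.pack("<3I", *words)  # little-endian layout

def nonzero_offsets(buf):
    """Byte offsets that hold the low-order bits 110 (the only non-zero bytes)."""
    return [offset for offset, b in enumerate(buf) if b != 0]

print(nonzero_offsets(big))     # [3, 7, 11]
print(nonzero_offsets(little))  # [0, 4, 8]
```

In both layouts the words themselves start at offsets 0, 4 and 8; only the position of the low-order byte inside each word differs.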
It looks like you added some additional arguments after my initial response. Your claim that strings have endianness is, I think, novel. I'm not aware of any literature which supports this interpretation of the term. Your argument that lexicographical sorting points to the leftmost character being most significant is intriguing, but I'm not sure it's universally true, and, more importantly, I've never seen this argument in a published source. Pburka (talk) 15:18, 28 March 2015 (UTC)
Thank you very much for your comment. What I do NOT say is:
  1. Strings have endianness.
  2. It is universally true that leftmost character is most significant.
Let me properly point out what could be important for the article Endianness:
  1. Text strings can be compared lexicographically with 3-way outcome where first character ranks highest.
  2. This is a very important kind of comparison since there are implementations of it.
It is easy to verify that:
  1. there are 2 machine instructions in the IBM mainframes, one for operands of the same length and one for operands of different lengths.
  2. there are subroutines in the C libraries, such as strcmp and memcmp, which compare lexicographically. strcmp is a little more complete in the sense of the enWP article Lexicographical order, in that a string which is a proper prefix of the other compares less.
The latter article makes "mathematically" clear that the characters have "significance" depending on their position in the text string they belong to. It even uses the term significant, but not the term big-endian, although the comparison obviously is big-endian. Very probably this is because lexicographical order existed long before little-endian machines came onto the market and long before Danny Cohen's article. (By the way, the same is true for the article Positional notation.)
The importance of lexicographical comparison comes from its use when looking things up in dictionaries, telephone books and the like. (Although the latter may be a little more complicated than strcmp, in that they may or may not be case-sensitive, handle umlauts, etc., even then the comparison is basically big-endian.) --Nomen4Omen (talk) 17:12, 28 March 2015 (UTC)
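A dictionary-style lookup of the kind mentioned above might be sketched as follows in Python. The word lists and the helper name contains are assumptions; Python's built-in string ordering is lexicographic by code point, and str.casefold serves here as one simple stand-in for a case-insensitive collation.

```python
import bisect

# Lexicographic order: the first character ranks highest, and a proper prefix sorts first.
words = sorted(["cart", "car", "apple", "care"])
print(words)  # ['apple', 'car', 'care', 'cart']

def contains(sorted_words, target):
    """Binary search; it relies on the less/equal/greater ordering of the entries."""
    i = bisect.bisect_left(sorted_words, target)
    return i < len(sorted_words) and sorted_words[i] == target

print(contains(words, "care"))  # True
print(contains(words, "cars"))  # False

# Plain code-point order puts upper case before lower case.
print(sorted(["apple", "Banana"]))                    # ['Banana', 'apple']
# A case-insensitive key is still compared from the first character onwards.
print(sorted(["apple", "Banana"], key=str.casefold))  # ['apple', 'Banana']
```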
It sounds like we're in agreement in that case. Strings don't have endianness, so it's not relevant to this article. Pburka (talk) 18:42, 28 March 2015 (UTC)
Sorry, you misunderstand me. Just as numbers don't have endianness as long as you don't calculate with them or compare them for >=< (for example, if you only move them, they don't have endianness), so text strings do not have endianness as long as you don't compare them lexicographically for >=< (for example, if you only move them). In both cases, the operations bring left or right endianness to the data. Or, to put it more precisely, the operations and the data together define the left or right endianness. --Nomen4Omen (talk) 19:16, 28 March 2015 (UTC)
If you wish to introduce text into this article about endianness and strings I strongly advise you to find reliable sources which discuss this topic. The arguments you've presented, while interesting, are WP:SYNTHESIS and are not appropriate for a Wikipedia article. You don't need to convince us that you're right about endianness and strings: you need to show that it's been discussed in reliable sources. Pburka (talk) 19:59, 28 March 2015 (UTC)