Talk:UTF-16

UTF-16 and UCS-2 as one topic[edit]

The history of this page makes it look like there was never anything but a redirect to UCS here (at UTF-16), but I am fairly certain there was a separate entry for UTF-16 in the recent past.

I do not like the idea of redirecting to UCS. While UCS should mention and perhaps summarize what encodings it defines, I strongly feel that the widely-used UTF-8, UTF-16, and UTF-32 encodings should have their own entries, since they are not exclusively tied to the UCS (as UCS-2 and UCS-4 are) and since they require much prose to accurately explain. Therefore, I have replaced the UCS redirect with a full entry for UTF-16. --mjb 16:53, 13 October 2002

I no longer feel so strongly about having both encoding forms discussed in the same article. That's fine. However, I do have a problem with saying that they are just alternative names for the same encoding form. You can't change the definition of UCS-2 and UTF-16 in this way just because the names are often conflated; the formats are defined in standards, and there is a notable difference between them. Both points (that they're slightly different, and that UCS-2 is often mislabeled UTF-16 and vice-versa) should be mentioned. I've edited the article accordingly today. — mjb 23:37, 13 October 2005 (UTC)[reply]
I would also like to see UCS-2 more clearly separated from UTF-16 - they are quite different, and it's important to make it clear that UCS-2 is limited to just the 16-bit codepoint space defined in Unicode 1.x. This will become increasingly important with the adoption of GB18030 for use in mainland China, which requires characters defined in Unicode 3.0 that are beyond the 16-bit BMP space. — Richard Donkin 09:07, 18 November 2005 (UTC)[reply]
I wanted to know something about UTF-16. Talking about UCS-2 is confusing. - Some anonymous user
I agree strongly with the idea that this entire article as of Sept. 27, 2014 seems to be a discussion of UCS-2 and UTF-16. I've never heard of UCS-2 (nor do I care to learn about it, especially in an article about something that supersedes it). I came to this article to briefly learn the difference between UTF-8 and UTF-16, from a practical point of view. I found nothing useful. This article drones on and on about what UCS-2 was and how UTF-16 differs, and couldn't possibly be of interest except as a regurgitation of information easily found elsewhere, and only to the very few people who care about UCS-2. It's like taking 5 pages to explain the difference between a Bic lighter and a Zippo. Just unnecessary and of doubtful use to 99% of the people looking for understanding of what the differences between UTF-8, -16, and -32 are. Needs a complete rewrite from a modern perspective. I've read that UTF-8 is ubiquitous on the web; if so, why should we care about UTF-16 and especially UCS-2??? 72.172.1.40 (talk) 20:02, 27 September 2014 (UTC)[reply]
For what it is worth, UCS-2 is used to encode strings in JavaScript. See Wandschneider, Marc (2013). Learning Node.js: A hands-on guide to building Web applications in JavaScript. Upper Saddle River, NJ: Addison-Wesley. p. 29. ISBN 9780321910578. OCLC 857306812. Peaceray (talk) 19:07, 29 May 2015 (UTC)[reply]
Stop! Standard time! If you look into the standard you'll find

Where ECMAScript operations interpret String values, each element is interpreted as a single UTF-16 code unit.
However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so they may be ill-formed when interpreted as UTF-16 code unit sequences.


Now it does make sense to conflate UCS-2 and UTF-16 whenever you only care about the BMP (Basic Multilingual Plane, see this). However, in an encyclopedic article I expect both concise information and accuracy. Assarbad (talk) 09:55, 15 October 2015 (UTC)[reply]
.NET & the Windows API apparently use UCS-2 as well. ".NET uses UCS-2 because the Windows API uses UCS-2 (so when you use Visual Studi..." news.ycombinator.com. Retrieved 2015-05-29. Peaceray (talk) 19:19, 29 May 2015 (UTC)[reply]
Both Java and Windows may sometimes say they use UCS-2, but it is often unclear whether they really are limited to UCS-2. If the API does not actually do something special with non-BMP code points or with surrogate halves, then it "supports" UTF-16 just fine. An API that only treats slash, colon, null, and a few other code points specially, as the majority of Windows APIs do, therefore supports UTF-16. Officially filenames are UTF-16, so any API that passes a filename around as 16-bit units is handling UTF-16, no matter what the documentation says. Spitzak (talk) 04:54, 31 May 2015 (UTC)[reply]

UTF-16LE BOMs Away![edit]

Concerning the text explaining UTF-16LE and UTF-16BE, would it not be better that instead of saying,

A BOM at the beginning of UTF-16LE or UTF-16BE encoded data is not considered to be a BOM; it is part of the text itself.

we say something like,

No BOM is required at the beginning of UTF-16LE or UTF-16BE encoded data and, if present, it would not be understood as such, but instead be mistaken as part of the text itself.

--Chris 17:27, 12 January 2006 (UTC)[reply]

Clean-up?[edit]

Compare the clean, snappy introductory paragraph of the UTF-8 article to the confusing ramble that starts this one. I want to know the defining characteristics of UTF-16 and I don't want to know (at this stage) what other specifications might or might not be confused with it. Could someone who understands this topic consider doing a major rewrite?

A good start would be to have one article for UTF-16 and another article for UCS-2. The UTF-16 article could mention UCS-2 as an obsolete first attempt and the UCS-2 article could say that it is obsolete and is replaced by UTF-16. --137.108.145.11 17:02, 19 June 2006 (UTC)[reply]

I rewrote the introduction to hopefully make things clearer; please correct if you find technical errors. The rest of the article also needs some cleanup, which I may attempt. I disagree that UTF-16 and UCS-2 should be separate articles, as they are technically so similar. Dmeranda 15:12, 18 October 2006 (UTC)[reply]
Agreed; despite the different names they are essentially different versions of the same thing:
UCS-2 --> 16-bit Unicode format for Unicode versions <= 3.0
UTF-16 --> 16-bit Unicode format for Unicode versions >= 3.1
Plugwash 20:13, 18 October 2006 (UTC)[reply]

Surrogate Pair Example Wrong?[edit]

The example: 119070 (hex 1D11E) / musical G clef / D834 DD1E: the surrogate pair should be D874 DD1E for 1D11E. Can somebody verify that and change the example? —Preceding unsigned comment added by 85.216.46.173 (talk)

No, the surrogate pair in the article is correct (and, by the way, this incorrect "correction" has come up many times in the article's history before):
0x1D11E-0x10000=0x0D11E
0x0D11E=0b00001101000100011110
split the 20 bit number in half
0b0000110100 =0x0034
0b0100011110 =0x011E
add the surrogate bases
0x0034+0xD800=0xD834
0x011E+0xDC00=0xDD1E
-- Plugwash 18:39, 8 November 2006 (UTC)[reply]
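For anyone who wants to check this arithmetic mechanically, here is a minimal Python sketch of the same steps (the function name encode_surrogate_pair is just illustrative, not anything from a standard library):

  def encode_surrogate_pair(cp):
      # Only code points above the BMP (U+10000..U+10FFFF) need a surrogate pair.
      assert 0x10000 <= cp <= 0x10FFFF
      v = cp - 0x10000              # 20-bit value in 0x00000..0xFFFFF
      lead = 0xD800 + (v >> 10)     # top 10 bits plus the lead (high) surrogate base
      trail = 0xDC00 + (v & 0x3FF)  # bottom 10 bits plus the trail (low) surrogate base
      return lead, trail

  print([hex(u) for u in encode_surrogate_pair(0x1D11E)])  # ['0xd834', '0xdd1e']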

Decoding example[edit]

Could there be an example for decoding the surrogate pairs, similar in format to the encoding example procedure? Neilmsheldon 15:29, 27 December 2006 (UTC)[reply]
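In the same spirit as the encoding walk-through above, a minimal Python sketch of the reverse step (decode_surrogate_pair is an illustrative name; a real decoder would also have to reject unpaired surrogates):

  def decode_surrogate_pair(lead, trail):
      # Undo the encoding: take 10 bits from each code unit and add 0x10000 back.
      assert 0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF
      return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)

  print(hex(decode_surrogate_pair(0xD834, 0xDD1E)))  # 0x1d11e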

Java not UTF-16?[edit]

After reading through the documentation for the Java Virtual Machine (JVM) (See Java 5 JVM[1] section 4.5.7), it seems to me that Java does not use UTF-16 as claimed. Instead it uses a modified form of UTF-8, but where it still uses surrogate pairs for supplemental codepoints (each surrogate being UTF-8 encoded though); so it's a weird nonstandard UTF-8/UTF-16 mishmash. This is for the JVM, which is the "byte code". I don't know if Java (the language) exposes something more UTF-16 like than the underlying bytecode, but it seems clear that the bytecode does not use UTF-16. Can somebody more Java experienced than me please verify this and correct this article if necessary. - Dmeranda 06:00, 23 October 2007 (UTC)[reply]

The serialisation format and the bytecode format do indeed store strings in "modified UTF-8" but the strings stored in memory and manipulated by the application are UTF-16. Plugwash 09:25, 23 October 2007 (UTC)[reply]
Verified! (2012)[edit]

Current documentation states it uses UTF-8 internally, with one exception: it uses an 'invalid UTF-8' combination to mark the end of the string, so that strlen/strcmp (which depend on \00 (NUL) ending the string) still work. I'm not sure why this was done, since when thinking through that problem (I was seeing if there was a case where an 'ASCII' null might be embedded in a UTF-8 encoded string) I found that if it is a *valid* UTF-8 string, then it can't have a 0 byte except as a NUL, since each byte of even the longest encoding for a 32-bit value (unused, as a maximum of 4 bytes is required for full Unicode support) requires the high bit to be 1. A buggy UTF-8 implementation might try to rely on the fact that the first byte specifies the number of data bytes for the char -- and of each following byte, the top 2 bits are ignored (the spec says they must be 10). But data-wise they are ignored, so one could encode UTF-8 data improperly and still have it be decodable by a non-validating UTF-8 decoder -- but the same string might have an embedded NUL and cause problems.

I think people got the idea that Java was UTF-16 because they didn't have to call a conversion routine on Windows -- but that's because the version for Windows was built to do the conversion automatically. Astara Athenea (talk) 22:29, 23 January 2012 (UTC)[reply]

"Current documentation states it uses UTF-8 internally, with 1 exception" WHICH documentation do you think says that? The documentation for java.lang.string clearly states "A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String. " It is true that UTF-8 based formats are used for strings in some circumstances (serialisation, classfile formats etc) but the languages core string type has always has been based on a sequence of 16-bit quantities. When unicode was 16-bit these represented unicode characters directly, now they represent UTF-16 code units. Plugwash (talk) 23:27, 23 January 2012 (UTC)[reply]

Sorry for the delay -- I didn't see your question. I cited the text from the Java documentation (http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.4.7); the paragraph I quoted has since been updated. It now lists two differences between standard UTF-8 and the Java VM's format:

There are two differences between this format and the "standard" UTF-8 format. First, the null character (char)0 is encoded using the 2-byte format rather than the 1-byte format, so that modified UTF-8 strings never have embedded nulls. Second, only the 1-byte, 2-byte, and 3-byte formats of standard UTF-8 are used. The Java virtual machine does not recognize the four-byte format of standard UTF-8; it uses its own two-times-three-byte format instead.
For more information regarding the standard UTF-8 format, see Section 3.9 Unicode Encoding Forms of The Unicode Standard, Version 6.0.0. [1]
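To make the two quoted differences concrete, here is a small Python sketch (Python rather than Java, and modified_utf8 is a made-up helper, not part of any library) comparing standard UTF-8 with the modified form for U+0000 and for a supplementary character:

  import struct

  def modified_utf8(s):
      # Sketch of the class-file format described above: U+0000 becomes C0 80, and a
      # supplementary character becomes two 3-byte sequences, one per UTF-16 surrogate.
      out = bytearray()
      for ch in s:
          if ch == '\x00':
              out += b'\xc0\x80'
          elif ord(ch) > 0xFFFF:
              lead, trail = struct.unpack('>2H', ch.encode('utf-16-be'))
              out += chr(lead).encode('utf-8', 'surrogatepass')
              out += chr(trail).encode('utf-8', 'surrogatepass')
          else:
              out += ch.encode('utf-8')
      return bytes(out)

  print('\x00'.encode('utf-8').hex(), modified_utf8('\x00').hex())              # 00 vs c080
  print('\U0001D11E'.encode('utf-8').hex(), modified_utf8('\U0001D11E').hex())  # f09d849e vs eda0b4edb49e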

References

  1. ^ Lindholm, Tim; Yellin, Frank. "The Java Virtual Machine Specification, Java SE 7 Edition", section 4.4.7, p. 93. Oracle. http://docs.oracle.com/javase/specs/jvms/se7/jvms7.pdf. Retrieved 24 September 2012.

Astara Athenea (talk) 19:07, 24 September 2012 (UTC)[reply]

Windows: UCS-2 vs UTF-16[edit]

UTF-16 is the native internal representation of text in the Microsoft Windows NT/2000/XP/CE

Older Windows NT systems (prior to Windows 2000) only support UCS-2

That sounds like a contradiction. Besides, this blog indicates that UTF-16 wasn't really supported by Windows until XP: [2]

--Kokoro9 (talk) 12:44, 30 January 2008 (UTC)[reply]

I think surrogate support could be enabled in 2K, but I'm not positive on that. Also, IIRC even XP doesn't have surrogate support enabled by default. As with Java, Windows uses 16-bit Unicode quantities, but whether surrogates are supported depends on the version and the settings. The question is how best to express that succinctly in the introduction. Plugwash (talk) 13:05, 30 January 2008 (UTC)[reply]
I've found this:

Note: Windows 2000 introduced support for basic input, output, and simple sorting of supplementary characters. However, not all system components are compatible with supplementary characters. Also, supplementary characters are not supported in Windows 95/98/Me.

If you are developing a font or IME provider, note that pre-Windows XP operating systems disable supplementary character support by default. Windows XP and later systems enable supplementary characters by default.

Source
That seems to indicate Windows 2000 supports UTF-16 at some level. On the other hand, I think NT should be removed from the list of UTF-16 supporting OSs.--Kokoro9 (talk) 17:38, 30 January 2008 (UTC)[reply]

NT is a somewhat generic term that can refer to either the Windows NT series (3.51, 4.0, 2000, XP, etc.) or to specific versions that used NT in the product name (pretty much just 3.51 and 4.0). NT 3.51 and 4.0 were based on UCS-2. It would probably be more accurate to leave the "NT" out entirely, since the "2000" and "XP" stand on their own (not to mention that NT does not apply to CE). —Preceding unsigned comment added by 24.16.241.70 (talk) 09:37, 19 April 2010 (UTC)[reply]

Simply use NT-platform or NT-based and you'll be fine. But especially for a topic like UCS-2 versus UTF-16 it is of utmost importance to distinguish Windows versions. Windows XP, to my knowledge, introduced full Unicode support. Assarbad (talk) 09:58, 15 October 2015 (UTC)[reply]

Does what Windows calls Unicode include a BOM? Or is the endianness implicit?--87.162.6.159 (talk) 19:35, 2 May 2010 (UTC)[reply]

Windows will guess that a file is UTF-16LE if there is no BOM. However, this is now considered a bug (see bush hid the facts), and the lack of a BOM should instead be taken to indicate UTF-8 or a legacy encoding (which are easy to distinguish). Most Windows software inserts a BOM into all UTF-16 files. Spitzak (talk) 17:31, 3 May 2010 (UTC)[reply]
To be picky, the Windows OS itself doesn't usually interpret the BOM at all. Windows APIs generally accept UTF-16 parameters and the application is expected to figure out what encoding to use when reading text data. The BOM only applies to unmarked Unicode text, which the OS rarely deals with directly. XML files are an exception (the OS does process XML files directly), but the Unicode processing semantics for XML files is fairly well-specified. Individual applications on the Windows platform (including those that come with Windows such as Notepad) may have specific methods for detecting the encoding of a text file. It is probably most common to see applications save text files encoded in the system's default ANSI code page (not Unicode). In particular, Notepad will check for a BOM, and if there is no BOM, it will use a heuristic to try to guess the encoding. The heuristic has been known to make somewhat silly mistakes (incorrectly treating some short ANSI files as UTF-16, making them show up as nonsense Chinese characters), so the algorithm has been adjusted in recent versions of Windows to try to reduce the chance of this type of mistake. —Preceding unsigned comment added by 24.16.241.70 (talk) 07:38, 8 June 2010 (UTC)[reply]
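For what it is worth, the BOM check itself is trivial; the hard part is deciding what to do when there is no BOM. A minimal Python sketch of just the unambiguous prefix check (this is not Notepad's actual heuristic):

  def sniff_bom(data):
      # Return the encoding implied by a byte-order mark, or None if there is none.
      if data.startswith(b'\xef\xbb\xbf'):
          return 'utf-8'
      if data.startswith(b'\xff\xfe'):
          return 'utf-16-le'   # FF FE 00 00 would actually be the UTF-32-LE BOM
      if data.startswith(b'\xfe\xff'):
          return 'utf-16-be'
      return None              # no BOM: fall back to a heuristic or the default code page

  print(sniff_bom(b'\xff\xfe' + 'text'.encode('utf-16-le')))  # utf-16-le
  print(sniff_bom('text'.encode('utf-16-le')))                # None (this codec writes no BOM)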

UCS-2 use in SMS deserves mention[edit]

By far most SMSes are encoded in either the 7-bit GSM default alphabet or UCS-2, especially in GSM. In CDMA other encodings are also supported, such as a Japanese encoding and a Korean encoding, but those are minority shares.

Also see: Short message service — Preceding unsigned comment added by 92.70.2.16 (talk) 15:02, 23 July 2008 (UTC)[reply]

@Spitzak: @BIL: As of ETSI TS 123 038 V16.0.0 (2020-07) (3GPP TS 23.038 version 16.0.0 Release 16), which appears, from this 3GPP listing of the 23.XXX specifications, to be the latest revision, UCS-2 ("UCS2") appears to be the only flavor of Unicode-like encodings supported. However, this page from Twilio claims that:

These differences turn out not to matter in practice, because due to the lack of support for the UCS-2 encoding, in modern programming languages smartphones tend to just decode UCS-2 messages as UTF-16 Big Endian. This is good news, because it means in practice we can send non-BMP characters, such as Emoji characters, over SMS, despite the fact that the spec doesn't strictly allow it.

and the same Google search that found that seems to have found a bunch of other pages that speak of "UCS-2" and "UTF-16" in SMS as interchangeable, so perhaps, in practice, UTF-16 is used for at least some messages, even though the specification doesn't support it. Guy Harris (talk) 03:38, 11 February 2022 (UTC)[reply]
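As a concrete illustration of the Twilio claim (a Python sketch of what "decode the UCS-2 payload as UTF-16BE" looks like, not a statement about any particular handset):

  payload = b'\xd8\x3d\xde\x00\x00\x21'   # user data of a message labelled "UCS2"
  print(payload.decode('utf-16-be'))      # 😀! -- a strict UCS-2 reader would instead see two
                                          # reserved code points (U+D83D, U+DE00) followed by '!'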

UTF-16 and Windows[edit]

In Windows 7, UTF-16 is still only supported for a small set of functions (i.e. IME and some controls). It is not supported for filenames or most API functions. The SDK documentation does not speak of UTF-16; it just says Unicode and means UCS-2. The sources in the section "Use in major operating systems and environments" say just this, but the sentence tells us it supports UTF-16 natively, which is wrong. 217.235.6.183 (talk) 23:45, 29 January 2010 (UTC)[reply]

You probably think that if a non-BMP character returns 2 for wcslen then "UTF-16 is not supported". In fact this is the CORRECT answer. The ONLY place "support" is needed is in the display of strings, at which point you need to consider more than one word in order to choose the right glyph. Note that to display even BMP characters correctly you need to consider more than one word, due to combining characters, kerning, etc.
Anybody claiming that strlen should return a value other than 2 for a non-BMP character in UTF-16 (or any value other than 4 for a non-BMP character in UTF-8) is WRONG. Go learn, like actually try to write some software, so you can know that measuring memory in anything other than fixed-size units is USELESS. The fact that your ancient documentation says "characters" is because it was written when text was ASCII only and bytes and "characters" were the same thing. Your misconception is KILLING I18N with obscenely complex and useless APIs and this needs to be stopped!!! Spitzak (talk) 04:37, 30 January 2010 (UTC)[reply]
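To put numbers on this exchange (a Python sketch; Python's own len() counts code points, so the code-unit counts are taken from the encoded forms):

  s = '\U0001D11E'                        # one code point outside the BMP
  print(len(s.encode('utf-16-le')) // 2)  # 2 -- UTF-16 code units, what a wcslen-style API reports
  print(len(s.encode('utf-8')))           # 4 -- UTF-8 code units (bytes)
  print(len(s))                           # 1 -- code points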
Please keep the discussion to the content and don't assume my programming skills. Have you bothered to read the sources given in this section? Try to pass a filename with surrogates to a Windows 7 API function... 217.235.27.234 (talk) 14:41, 30 January 2010 (UTC)[reply]
A quick check seems to reveal that Windows on FAT32 and NTFS will create files with surrogate halves in their names using CreateFile, and returns the same strings when the directory is listed. It is true that a lot of I/O does not render them correctly (i.e. it renders them as two UCS-2 characters) but I fully expected that and do not consider it a failure. Exactly what are you saying it does wrong? —Preceding unsigned comment added by Spitzak (talkcontribs) 06:06, 3 February 2010 (UTC)[reply]
You're also welcome to write software that will decode the names of all the files in a directory as floating point numbers. There seems to be some confusion here between character sets and encodings. The concept applicable to specifying the rules for legal file names on an OS or file storage system is the character set. How a file's name is represented in the directory involves an encoding, of course, but at the API level you're dealing with characters from a character set. —Largo Plazo (talk) 02:36, 27 March 2010 (UTC)[reply]
By character set you mean the abstract concept of characters, right? Assigned to a given code point using an encoding and visualized through glyphs, or ...?
From how I read your response and the remark you responded to, you're talking about the same thing exactly, but using different terms. I presume by character set you mean the sum of possible characters which take the visual form of glyphs when displayed, right? The encoding, however, provides code points (to stick with Unicode terminology) assigned each character. So the use case is storage (not just on disk) versus display as you pointed out correctly. But I don't follow on the last statement. Why would an API taking a file name care for the character? Isn't it sufficient at this layer to count the number of, say, 16-bit unsigned words after normalization? The important aspect being: after normalization. Assarbad (talk) 10:18, 15 October 2015 (UTC)[reply]

Different parts of Windows have different levels of support for Unicode. The file system essentially ignores the encoding of the file names. In fact, the file system pretty much only cares that the file names don't use null, back-slash, or forward-slash (maybe colon). Higher-level parts of the OS add checks for other things like *, ?, |, etc. In addition, the file system has only a very simple concept of uppercase/lowercase that does not change from locale to locale. All of these limitations are due to the fact that the file system is solving a specific problem (putting a bunch of bits on the disk with a name that is another bunch of bits -- any higher meaning for the name or contents of a file is up to the user). At the other extreme, the Uniscribe rendering engine does all kinds of tricks to make things like Hebrew and Arabic come out right (combining ligatures, bidirectional text rendering, etc.). To make a long story short, the parts of Windows that actually need to deal with UTF-16 can correctly deal with it. Other parts don't care, and simply store and transmit 16-bit values. To me, that sounds like the right way to deal with it. So I think the page is fully accurate in indicating that Windows supports UTF-16. —Preceding unsigned comment added by 24.16.241.70 (talk) 09:30, 19 April 2010 (UTC)[reply]

Can someone provide a citation or example as to how one can create invalid UTF-8 registry or file names, and how that matters to the API itself? Neither the filesystem nor the registry does anything in UTF-8, and they don't use UTF-8 in any way, shape, or form. Therefore I don't see how it even needs to enter the discussion (nor can I see how Windows could or would do anything incorrectly here). If nobody can source this, then that content should be removed. Billyoneal (talk) 22:09, 9 December 2010 (UTC)[reply]

A malicious program can actually change the bytes on the disk and make the registry entry names be invalid UTF-8. Apparently there are ways to achieve this other than writing bytes on the disk, but I don't know them. The Windows UTF-16 to UTF-8 translator is unable to produce these byte arrangements, and is therefore unable to modify or delete those registry entries. Spitzak (talk) 18:49, 10 December 2010 (UTC)[reply]
I think you are confusing some things here. Both the regular and the Native NT API refer to the registry only in terms of UTF-16 strings, any UTF-16 specific issues would not cause a difference between the APIs. There is a well-known trick where the native API allows key names that include the null (U+0000) char but the Win32 API and GUI tools do not, thus allowing the creation of hard to access/delete keys, however this has nothing to do with UTF-16. Finally manipulating the undocumented on-disk registry file format can create all kinds of nonsense that the kernel cannot deal with, if that file format uses UTF-8 for some strings (a quick check shows that it might) creating an invalid file with invalid or non-canonical UTF-8 could cause problems, but they would not be related to UTF-16. 94.145.87.181 (talk) 00:03, 26 December 2010 (UTC)[reply]

First, up to and including Windows XP, removing a non-BMP character from the system edit control required pressing backspace twice. So it wasn't fully supported in Windows XP. This bug was fixed in Vista or Windows 7, though.

Second, it isn't fully supported even in Windows 7. Two examples I can come up with right now:

  • Writing UCS-2 to the console is supported if the font has the appropriate characters, but UTF-16 non-BMP characters are displayed as two 'invalid' characters (two squares, two question marks, whatever...) independently of the font.
  • You can't find out which non-BMP characters are available in a font. GetFontUnicodeRanges, which is supposed to do this, returns the ranges as 16-bit values.

I'm sure there are more. 82.166.249.66 (talk) 17:46, 26 December 2011 (UTC)[reply]

First reference to apache xalan UTF-16 surrogate pair bug[edit]

The first reference to the Apache Xalan bug [3] seems broken to me. I'm very interested in that particular problem; could someone fix the link? What was it all about? What was the problem? It is really unclear how it is right now. —Preceding unsigned comment added by 160.85.232.196 (talk) 12:52, 20 April 2010 (UTC) Never mind, found the bug; the URL was missing an ampersand (&). Fixed it: [4][reply]

Document changes for lead and first half[edit]

The reorganization and changes attempted to improve the following points. Comments or additional suggestions are welcome.

  • The lede for the article only described UTF-16. Since it is now a joint article about both UTF-16 and UCS-2, it should include both methods.
  • The first paragraph used far too much jargon, in my opinion. Immediately using Unicode terms like "BMP" and "plane" and so forth without more background doesn't illuminate the topic. I moved most of these terms to later in the document and provided enlarged descriptions.
  • The encoding section of the article only described the method for larger values from U+10000 and up. The simple mapping of UCS-2 and UTF-16 for the original 16-bit values was never mentioned.
  • The section about byte order encoding had several issues. I think these occurred more from confusing writing than actual errors.
    • First, it implied the byte order issue was related to the use of 4-byte values in UTF-16.
    • Second, the explanation about the BOM value reverses the cause and effect. The value was chosen for other reasons, then designated as an invisible character because it was already the BOM, to hide it when not discarded.
    • Next, the discussion recommends various behaviors for using this feature, which isn't appropriate for an encyclopedia article.
    • Finally, it doesn't clearly explain the purpose of the encoding schemes, since it misleadingly states that endianness is "implicit" when it is part of the label. I think the original meaning referred to the absence of a BOM in the file for those cases, but it wasn't clear.

StephenMacmanus (talk) 12:00, 13 November 2010 (UTC)[reply]

I do not think these changes are an improvement.
  • The lead indeed talked about both UTF-16 and UCS-2, in two paragraphs, rather than mangling them together.
  • References to documents and all history removed from lead.
  • Obviously not all 65536 UCS-2 characters are the same; the values for surrogate halves changed. The later paragraph saying that only "assigned" characters match is more accurate.
  • UTF-16BE/LE require that the leading BOM be treated as the original nbsp, not "ignored". Endianness definitely is "implicit", and a backwards BOM in theory should be the character 0xFFFE (though it might signal software that something is wrong).
  • If there is documentation that the use as BOM was chosen before the nbsp behavior, it should have a reference, and add it to the BOM page as well.
  • Errors in software rarely require "conversion"; it is in fact disagreement between software about character counts that causes bugs. The old description of errors was better.
  • The description of the encoding is incredibly long and confusing. The previous math-based one was much better. Don't talk about "planes", please; saying "subtract 0x10000" is about a million times clearer than "lower the plane number by 1".

Spitzak (talk) 20:27, 15 November 2010 (UTC)[reply]

StephenMacmanus raises some good points, but I agree with Spitzak that the previous version was much easier to read. I prefer the older lede to the new one, and the older discussion of the encoding for characters beyond BMP is better than the new one. "BMP" is in fact defined before use in the old version. -- Elphion (talk) 02:50, 16 November 2010 (UTC)[reply]

Python UTF-8 decoder?[edit]

The article says 'the UTF-8 decoder to "Unicode" produces correct UTF-16' - that seems to be either wrong or something that could use a longer explanation. Was 'the UTF-16 encoder from "Unicode" produces correct UTF-16' meant? Kiilerix (talk) 15:05, 30 July 2011 (UTC)[reply]

When compiled for 16-bit strings, a "Unicode" string cannot possibly contain actual Unicode code points, since there are more than 2^16 of them. Experiments with the converter from UTF-8 to "Unicode" reveal that a 4-byte UTF-8 sequence representing a character > U+FFFF turns into two "characters" in the "Unicode" string that correctly match the UTF-16 code units for this character. However, I believe in most cases Python "Unicode" strings are really UCS-2; in particular, the simple encode API means that encoders cannot look at more than one code unit at a time. On Unix, where Python is compiled for 32-bit strings, the encoder from "Unicode" to UTF-16 does in fact produce surrogate pairs. Spitzak (talk) 20:41, 1 August 2011 (UTC)[reply]
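For the record, in current Python 3 the narrow/wide build distinction no longer exists (strings always hold code points), and the round trip behaves like the second case described above; a quick sketch:

  s = b'\xf0\x9d\x84\x9e'.decode('utf-8')   # the 4-byte UTF-8 sequence for U+1D11E
  print(len(s), hex(ord(s)))                # 1 0x1d11e -- one code point, not two code units
  print(s.encode('utf-16-le').hex())        # 34d81edd -- a correct surrogate pair on output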

UTF-16 considered harmful[edit]

85.53.51.15 (talk · contribs · WHOIS) added a good-faith blurb to the lead saying that "experts" recommend not using UTF-16. When I reverted that essentially unsourced claim, 85.etc responded with another good-faith addition to the lead [5] with the following edit comment: Better wording. If you guys really care about Unicode please do not just remove this text if you do not like it. Instead try to make it suitable in Wikipedia so we can let this message get to people.

Although the newer blurb is an improvement -- it talks about a specific group that makes this recommendation, rather than the nebulous "experts" -- it provides no evidence that this group is a Reliable Source, a concept that lies at the root of WP. The opinion expressed by the group remains an unauthoritative one, and in any event it does [added: not] belong in the lead; so I have reverted this one too.

Moreover, the IP is under the misapprehension that WP's purpose is to "let this message get to people". WP is not a forum, and not a soap box: advocacy is not the purpose of these articles. There are good reasons for using UTF-16 in some circumstances, good reasons for using UTF-8 in others. Discussing those is fair game, and we already link to an article comparing the encodings. But WP is not the place to push one over the other.

And we do care about Unicode. It is in no danger of being replaced. The availability of three major encodings suitable in various ways to various tasks is a strength, not a weakness. Most of what the manifesto (their word) at UTF-8 Everywhere rails against is the incredibly broken implementation of Unicode in Windows. They have some reasonable advice for dealing with that, but it is not a cogent argument against using UTF-16. The choice of encoding is best left to the application developers, who have a better understanding of the problem at hand.

-- Elphion (talk) 21:07, 15 May 2012 (UTC)[reply]

I believe you didn't actually read it, since the manifesto doesn't say much about the broken implementation of Unicode in Windows. You must be confusing it with the "UTF-16 considered harmful" topic on SE. What the manifesto argues is that the diversity of different encodings, in particular five (UTF-16/32 come in two flavors) major encodings for Unicode, is a weakness. Especially for multi-platform code. In fact, how is it a strength? Having different interfaces is never a strength, it's a headache (think of multiple AC power plugs, multiple USB connectors, differences in power grid voltages and frequencies, etc...) bungalo (talk) 12:36, 22 June 2012 (UTC)[reply]

I am never particularly impressed by the argument that "you did not agree with me, so you must not have read what I wrote". I did read the manifesto and, except for the cogent advice for dealing with Windows, found it full of nonsense, on a par with "having different interfaces is never a strength" -- an argument that taken to its logical conclusion would have us all using the same computer language, or all speaking English, as Eleanor Roosevelt once proposed in a fit of cultural blindness. A healthy, robust program (or culture) deals with the situation it finds, rather than trying to make the world over in its own image. -- Elphion (talk) 13:35, 22 June 2012 (UTC)[reply]
Excuse me, where have I said "you did not agree with me, so you must not have read what I wrote"? Don't put words into my mouth that I haven't said.
I said that your claim that "Most of what the manifesto [...] rails against is the incredibly broken implementation of Unicode in Windows" is totally false, since there is only one item that talks about this ("UTF-16 is often misused as a fixed-width encoding, even by Windows native programs themselves ..."). Claiming that one sentence is the majority of a 10-page essay is a distortion of reality, and assuming you are a rational person (and not a troll) the only explanation I could come up with is that you didn't actually read the whole thing. I would be happy if it turns out that I was mistaken, but in that case you'll have to take your words back.
"found it full of nonsense"
Heh? Which nonsense? I find it factually correct. I would appreciate if you could write some constructive criticism so you (or I) could forward it to the authors.
"an argument that taken to its logical conclusion would have us all using the same computer language"
The implication is false, because computer languages are not interfaces.
"or all speaking English"
Yet one more hasty conclusion. On the global scale English is the international language, and for a good reason: not everyone who wants to communicate with the rest of the world can learn all the languages out there, so there is a silent agreement that there must be *some* international language. Right, the choice of the international language is subject to dispute, but that's irrelevant to the fact that some language must be chosen. It is also applicable on the local scale: people of each culture communicate in some specific language. No communication would be possible if each person spoke her own invented language.
"A healthy, robust program (or culture) deals with the situation it finds"
This is exactly what the manifesto proposes—a way to deal with the current situation of the diversity of encodings.
bungalo (talk) 14:34, 22 June 2012 (UTC)[reply]
Note: Btw, I'm not 85.53.51.15 (talk · contribs · WHOIS) and I'm not protecting her edit. bungalo (talk) 14:38, 22 June 2012 (UTC)[reply]

Ahem: "I believe you didn't actually read it." (Was that a "hasty conclusion" too?) The clear implication is that, since I didn't take from the article the same conclusion you did, I must not have read it. This is never a good argument. If that's not what you meant, then don't write it that way. I have in fact read the article, and did so when considering 85.etc's edit.

The manifesto is not a bad article, but it is informed from start to finish by the broken implementation of wide characters in Windows (and I agree about that). It covers this in far more than "one item", and the shadow of dealing with Windows lies over the entire article and many of its recommendations. I have no problem with that. But the article does more than suggest "a way to deal with the current situation of the diversity of encodings" -- it has a clear agenda, recommending in almost religious terms (consider even the title) that you shun encodings other than UTF-8. The arguments it advances for that are not strong, especially since there is well-vetted open source code for handling all the encodings.

And computer languages certainly are interfaces, ones I am particularly grateful for. Dealing with computers in 0s and 1s is not my forte.

-- Elphion (talk) 16:21, 22 June 2012 (UTC)[reply]

Difficulty of converting from UCS-2 to UTF-16[edit]

I removed a sentence that was added saying "all the code that handles text has to be changed to handle variable-length characters".

This is not true, and this misconception is far more common when discussing UTF-8, where it is equally untrue and has led to a great deal of wasted effort converting UTF-8 to other forms to "avoid" the actually non-existent problem.

As far as I can tell the chief misconception is that strlen() or indexing will somehow "not work" if the index is not "number of characters". And for some reason the obvious solution of indexing using the fixed-size code units is dismissed as though it violates the laws of physics.

There seems to be the idea that somehow having a number that could point into the "middle" of a letter is some horrendous problem, perhaps causing your computer to explode or maybe the universe to end. Most programmers would consider it pretty trivial and obvious to write code to "find the next word" or "find the Nth word" or "find the number of words in this string" while using an indexing scheme that allows them to point into the "middle" of a word. But for some reason they turn into complete morons when presented with UTF-8 or UTF-16 and literally believe that it is impossible.

The other misconception is that somehow this "number of characters" is so very useful and important that it makes sense to rewrite everything so that strlen returns this value. This is horribly misguided. This value is ambiguous (when you start working with combining characters, invisible ones, and ambiguous ideas about characters in various languages) and is really useless for anything. It certainly does not determine the length of a displayed string, unless you restrict the character set so much that you are certainly not using non-BMP characters anyway. The number of code units, however, is very useful, as it is directly translated into the amount of memory needed to store the string.

In fact on Windows all equivalents of strlen return the number of code units. The changes were limited to code for rendering strings, and fixes to text editor widgets. The Windows file system was updated to store UTF-16 from UCS-2 with *NO* changes, and it is quite possible to create files with invalid UTF-16 names and use them.

This same observation also applies to the use of UTF-8. Here the misconception that "everything has to be rewritten" is even more pervasive. I don't know how to stop it, but I and obviously several others have to continuously edit to keep this misinformation out of Wikipedia. Spitzak (talk) 23:25, 23 September 2012 (UTC)[reply]
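For the record, here is how short the supposedly impossible operations are when strings are indexed by UTF-16 code units (a Python sketch over a list of 16-bit values; the function names are made up):

  def is_lead(u):  return 0xD800 <= u <= 0xDBFF
  def is_trail(u): return 0xDC00 <= u <= 0xDFFF

  def count_code_points(units):
      # A code point starts at every unit that is not a trail surrogate.
      return sum(1 for u in units if not is_trail(u))

  def next_code_point(units, i):
      # From any index, return the index where the following code point begins.
      return i + (2 if is_lead(units[i]) else 1)

  units = [0x0041, 0xD834, 0xDD1E, 0x0042]   # "A", U+1D11E, "B" as UTF-16 code units
  print(count_code_points(units))            # 3
  print(next_code_point(units, 1))           # 3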

Encoding of D800-DFFF code points in UTF-16[edit]

The article says, "It is possible to unambiguously encode them [i.e. code points D800-DFFF] in UTF-16, as long as they are not in a lead + trail pair, by using a code unit equal to the code point."

This doesn't make sense to me. It seems to be saying that you just drop in a code unit like D800 or DC00 and software should understand that you intend it to be a standalone code point rather than part of a surrogate pair. I suppose if this results in an invalid code pair being formed (such as a lead unit without a valid trail unit following), then the decoder could fall back to treating it as a standalone code unit, but what if you want to encode point D800 followed immediately by point DC00? Wouldn't any decoder treat the sequence D800 DC00 as a valid surrogate pair rather than as a pair of illegal code points? If so, then the statement that UTF-16 can "unambiguously encode them" (i.e. all values in the range D800-DFFF) is not true.

If I have misunderstood the point here, please clarify it in the article, because it isn't making sense to me the way it's written. — Preceding unsigned comment added by 69.25.143.32 (talk) 18:09, 4 October 2012 (UTC)[reply]

Never mind, I figured it out. "as long as they are not in a lead + trail pair" is intended to mean "as long as you don't have a D800-DBFF unit followed by a DC00-DFFF unit" (which would be interpreted as a surrogate pair). I had previously interpreted "as long as it doesn't look like a legal surrogate pair" to mean "as long as you don't try to shoehorn the illegal code point into a 20-bit value and package it up as a surrogate pair", which wouldn't work. I'll adjust the article's wording to be more clear. — Preceding unsigned comment added by 69.25.143.32 (talk) 20:03, 4 October 2012 (UTC)[reply]
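A quick Python illustration of the ambiguity discussed above (the surrogatepass error handler is needed because the strict codec refuses lone surrogates):

  lone = '\ud800\udc00'                              # two separate code points U+D800, U+DC00
  encoded = lone.encode('utf-16-le', 'surrogatepass')
  print(encoded.hex())                               # 00d800dc
  print(ascii(encoded.decode('utf-16-le')))          # '\U00010000' -- read back as one code point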

1st reference[edit]

The 1st reference (math) is not actually a reference; it should be a note instead. — Preceding unsigned comment added by Tharos (talkcontribs) 14:23, 23 August 2013 (UTC)[reply]

UTF-16 transformation algorithm[edit]

In my revert [6] of Zilkane's edits, I got one too many 0s in the edit comment. It should read: and "0x100000" is incorrect. The point is that the value subtracted from the code point is 0x10000 (with 4 zeros), not 0x100000 (with 5 zeros). This converts a value in the range 0x1,0000..0x10,FFFF (the code points above BMP) monotonically into a value in the contiguous 20-bit range 0x0,0000..0xF,FFFF, which is then divided into two 10-bit values to turn into surrogates. -- Elphion (talk) 16:23, 6 September 2013 (UTC)[reply]

I want to edit the UTF-16 convert formula:
U' = yyyyyyyyyyxxxxxxxxxx // U - 0x10000
W1 = 110110yyyyyyyyyy // 0xD800 + yyyyyyyyyy
W2 = 110111xxxxxxxxxx // 0xDC00 + xxxxxxxxxx
to
U ∈ [U+010000, U+10FFFF]
U' = ⑲⑱⑰⑯⑮⑭⑬⑫⑪⑩ ⑨⑧⑦⑥⑤④③②①⓪ // U - 0x10000
W₁ = 110110⑲⑱ ⑰⑯⑮⑭⑬⑫⑪⑩ // 0xD800 + ⑲⑱⑰⑯⑮⑭⑬⑫⑪⑩
W₂ = 110111⑨⑧ ⑦⑥⑤④③②①⓪ // 0xDC00 + ⑨⑧⑦⑥⑤④③②①⓪
what do you think? Diguage (talk) 14:36, 8 May 2022 (UTC)[reply]
I find the current version more readable, though it could use some highlighting and spacing and/or punctuation (e.g. with vertical bars). -- Elphion (talk) 17:08, 8 May 2022 (UTC)[reply]
This formula is misleading at best. The yyyyyyyyyy value in its first appearance is a different string of bits than in the second and third appearances. 85.250.226.28 (talk) 21:54, 29 June 2023 (UTC)[reply]
No, yyyyyyyyyy signifies the same 10-bit sequence in all appearances. -- Elphion (talk) 14:19, 30 June 2023 (UTC)[reply]

What does "U+" mean?[edit]

In the section title "Code points U+0000 to U+D7FF and U+E000 to U+FFFF" and in the text therein, example: "The first plane (code points U+0000 to U+FFFF) contains...", a "U+" notation is used. This notation is also encountered in many other places, both inside and outside Wikipedia. I've never seen a meaning assigned. What does "U+" mean? - MarkFilipak (talk) 16:30, 22 July 2014 (UTC)[reply]

"U+" is the standard prefix to indicate that what follows is the hexadecimal representation of a Unicode codepoint, i.e., the hex number of that "character" in Unicode. See Unicode. -- Elphion (talk) 17:02, 22 July 2014 (UTC)[reply]

This article is incoherent. It needs a complete revision.[edit]

The lede talks mostly about UCS-2. UCS-2 is OBSOLETE! Perhaps the article means UCS as described in the latest ISO 10646 revision??? I do NOT have a solid understanding of UTF-16, but if my math is correct: 0x0000 to 0xD7FF plus 0xE000 to 0xFFFF is 55,296 + 8,192 = 63,488; doesn't this mean that there are 2,048 code points from the BMP which are NOT addressable in UTF-16? (65,536 - 63,488 = 2,048). In the description it says: "The code points in each plane have the hexadecimal values xx0000 to xxFFFF, where xx is a hex value from 00 to 10, signifying the plane to which the values belong." Code points have "values" different from their code point values?!?!?! Wow, somebody really is inarticulate. Let's agree that any binary representation of a "character" is dependent on some system of rules, i.e. a standard. We should also (maybe? I hope!) be able to agree that "code page" is ambiguous, and IMPLEMENTATION DEPENDENT (I can't speak to whether the term exists in any of the ISO/Unicode standards, but it is clear that different standards (ANSI, Microsoft, DBCS, etc.) define 'code page' differently). So, using a common (but poorly defined) term in this article needs to be done with much more caution. I really don't understand why UCS is used as the basis of comparison for this article on UTF-16. Shouldn't Unicode 6.x be THE basis? I would suggest that if there is need for an article on UCS, then there should be an article on UCS, or at least a comparison article between it and Unicode. Afaics, this article lacks any mention of non-character Unicode code points as well as text directionality, composition, glyphs, and importantly, graphemes. If UTF-16 is similar to UCS then this is a serious deficiency. It also seems to me that Microsoft is a huge part of the reason UTF-16 persists, and how it is implemented in various MS products & platforms should be addressed here. Also, I see virtually NOTHING on UTF-8, which is (it seems) much more common on the internet. All in all, this article does not present a clear picture of what "standard" UTF-16 is, and what the difference is between that and what its common implementations do. — Preceding unsigned comment added by 72.172.1.40 (talk) 20:15, 28 September 2014 (UTC)[reply]

I have already tried to delete any mention of "planes" but it got reverted, so I gave up. I agree that planes are irrelevant to describing UTF-16. You are correct that there are 2048 points that can't be put in UTF-16, and this causes more trouble than some people are willing to admit, because they don't want to admit the design is stupid. Spitzak (talk) 21:45, 28 September 2014 (UTC)[reply]

Example for 軔 is wrong[edit]

The last example seems to be wrong. It lists the symbol "軔", which is the codepoint U+8ED4[1][2] (JavaScript: "軔".charCodeAt(0).toString(16) and Python 3: hex(ord("軔")) agree), but the example says it is the codepoint U+2F9DE.[3] It looks similar, but it isn't the same codepoint.

panzi (talk) 23:48, 25 October 2014 (UTC)[reply]

You're quite right. U+2F9DE is a compatibility ideograph that decomposes to U+8ED4 軔 on normalization. As any software might (and presumably has in this case) silently apply normalization and therefore convert U+2F9DE to U+8ED4 it would be best not to use a compatibility ideograph for the example, but use a unified ideograph. BabelStone (talk) 12:32, 26 October 2014 (UTC)[reply]
OK. I just grabbed something from our Wikibooks UNICODE pages in the range of bit patterns I wanted to demonstrate. Thanks for catching the wrong bit 0xD bit pattern, too! —[AlanM1(talk)]— 02:40, 29 October 2014 (UTC)[reply]
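For anyone checking this later, the silent conversion described above is easy to reproduce (a Python sketch using the standard unicodedata module):

  import unicodedata

  compat = '\U0002F9DE'                   # CJK compatibility ideograph U+2F9DE
  nfc = unicodedata.normalize('NFC', compat)
  print(hex(ord(nfc)))                    # 0x8ed4 -- normalization replaces it with
                                          # the unified ideograph 軔 (U+8ED4)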

Surrogate terminology[edit]

At present, the article refers to the halves of a surrogate pair as "lead" and "trail", and says that "high" and "low" was the terminology in "previous versions of the Unicode Standard". This appears to be incorrect; Unicode v7.0 calls them "high" and "low", and so does v6, which was (specifically v6.1) the current version when the change was made, in April 2012. What the Standard does say is:

Sometimes high-surrogate code units are referred to as leading surrogates. Low-surrogate code units are then referred to as trailing surrogates. This is analogous to usage in UTF-8, which has leading bytes and trailing bytes.

Now, an earlier edit (October 2011) suggests that the change was made (both in Unicode and here) to avoid confusion over "high" surrogates having lower code-point numbers than "low" surrogates. That's a laudable goal, but the Unicode Standard doesn't say that was the intention, and so we need a citation. And in any case, Unicode still plainly uses "high" and "low" as the primary terms. (Note also that the alternatives are given as "leading" and "trailing".) I'm going to go change the article accordingly, but I'd be very interested to see if sources can be found arguing for the alternative names (or maybe even use of them in ISO 10646?). -- Perey (talk) 12:49, 3 November 2014 (UTC)[reply]

more lucid explanation for those who don't specialise in this subject[edit]

Message for Babelstone,

I see you have just deleted my revisions for this page (2015-01-01).

I added them because the previous state of the page was very unclear ... except to those who already understand how UTF encoding works.

There are quite a few things to say about UTF-16, but the most important thing is to give a lucid, unambiguous explanation to someone who doesn't yet know how it works.

It's difficult to understand, for example, how the fact of "invalid surrogate pairs" could be deemed so unworthy of interest. But actually explaining it fleshes out as required the previous excessively concise explanation of how these pairs work.

... in my opinion

MikeRodent (talk) 18:35, 1 January 2015 (UTC)[reply]

The article is indeed not entirely lucid (particularly the section "Code points U+010000 to U+10FFFF"), but it would help get things started if you gave specific examples of things *you* don't find lucid, and suggestions for how to change them. -- Elphion (talk) 23:31, 1 January 2015 (UTC)[reply]
E.g., I would start the section "Code points U+010000 to U+10FFFF" by saying that codepoints U+010000 and above are encoded with two 16-bit codes, which are chosen from a reserved range (so that these codes cannot be confused with 16-bit codes for BMP codepoints) and that the first and second 16-bit codes are themselves chosen from separate ranges so that you can always tell which is supposed to be first and which second (thus making synchronization straightforward). Thus surrogates are legal only in pairs, and any high surrogate not immediately followed by a low surrogate is an error, and any low surrogate not immediately preceded by a high surrogate is an error. (But rephrased lucidly :-) -- Elphion (talk) 23:39, 1 January 2015 (UTC)[reply]
I agree with Babelstone that your edits as they stand do not improve the readability of the article. -- Elphion (talk) 23:45, 1 January 2015 (UTC)[reply]
Yes, sorry mate, but I do in fact think that what I've put is pretty lucid, and much much more lucid than what you have suggested above in para "E.g., I would start...". The thing is, and it is an important point, I am not in this specialism and have struggled to understand the UTF-16 scheme, as described by numerous "intro to UTF 16" texts, including this one, despite the fact that the reality of it isn't that hard to grasp if explained as one would explain to an intelligent 12-year-old. It is absolutely abusive if pages of Wikip are hijacked by people who are specialists in a given domain and are not prepared to accept that the primary purpose of an encyclopedia is to inform the intelligent-enough lay reader of the basics of a given topic.
As far as this particular subject, UTF-16, is concerned, I actually think much of what is currently on the page should be on a subsidiary page, dealing with the more abstruse discussions pertaining to the encoding scheme. It is in fact quite absurd and annoying that, until my edit yesterday, there was no explanation given whatever as to why the UTF-16 scheme, strange as it is, might make sense (i.e. BMP code points stored with one 16-bit code unit at the same time as no chance of confusing BMP with non-BMP code point encodings...). MikeRodent (talk) 16:14, 2 January 2015 (UTC)[reply]

Message for Spitzak

Another attempt to delete my revisions for this page. Presumably you didn't have the wit or the courtesy to bother reading this discussion here. I don't know how to get it into the heads of people who spend their entire lives thinking and breathing I.T.: WIKIPEDIA IS NOT SOLELY FOR SPECIALISTS. The primary purpose of an article on UTF 16 is to explain UTF 16 to people who don't understand what it is. My changes which I made in December attempt to do this, though I had the courtesy (considerably greater than yours or Babelstone's I may add) not to delete anything anyone had previously written. MikeRodent (talk) 20:53, 8 January 2015 (UTC)[reply]

Per Bold, Revert, Discuss you should discuss changes that do not have consensus of other editors rather than engage in edit warring. As they stand your changes are not an improvement, and introduce a fictitious term "wydes". BabelStone (talk) 21:17, 8 January 2015 (UTC)[reply]
You should delete the text this replaces rather than make two explanations of the same thing, if you feel your explanation is better. And stop putting "wydes" in, as I have searched and the term is not used. "wchar" is used but does not mean 16-bit units. Spitzak (talk) 01:58, 9 January 2015 (UTC)[reply]

The "How this encoding works and why it came about" section recently added by MikeRodent is still not an improvement. The whole section is unsourced and reads like a personal essay. In particular, phrases such as "In the official jargon these pairs of 16-bit units were baptised with the unhelpful term "surrogate pairs"", "this smacks somewhat of Eurocentrism", "The "planes" of Unicode can be imagined as shelves in a bookcase", "At first sight UTF-16's representation strikes one as odd" are unencyclopedic in tone and not entirely neutral in point of view. BabelStone (talk) 22:36, 12 January 2015 (UTC)[reply]

Ah, but who are you to decide what is "unencyclopaedic" and what is not? I'd be interested if you, with your close interest in UTF-16, could indeed edit my new text to provide suitable sources. At least my change massively improves the approachability: it explains the why, the how and the history, and does not confuse things in a totally unjustifiable way with all that irrelevant nonsense about UCS-2. The term "surrogate pairs" is indeed a totally unhelpful term: if you deny this, please be so kind as to deny it in rational terms. The phrase used by a previous contributor, "the most commonly used characters", is the offensive, non-neutral phrase, not what I said about it. If you disagree, please be so kind as to argue the case. RATIONALLY. And yes, of course UTF-16's encoding scheme is bizarre. Precisely why someone with half a brain needs to explain WHY it came about.
Finally, can all those who want to preserve the mysteries of UTF-16 encoding as a sort of druidic, sacerdotal secret, which shall be known only to initiates, please be good enough to explain why you wish to do so, and also acknowledge that several people have complained about the inadequacies of this page, presumably indicating that thousands, maybe hundreds of thousands, have visited this page and actually gone away understanding even less about UTF-16 encoding than they did before they got here. Just leave the gist of my stuff alone please: it is BETTER than what went before it. MikeRodent (talk) 00:57, 13 January 2015 (UTC)[reply]
"... agree strongly with the idea that this entire article as of Sept. 27, 2014 seems to be a discussion of UCS-2 and UTF-16. I've never heard of UCS-2 (nor do I care to learn about it, especially in an article about something that supercedes it). I came to this article to briefly learn the difference between UTF-8 and UTF-16, from a practical pov. I found nothing useful." [extract from first article of "Talk" above] MikeRodent (talk) 01:38, 13 January 2015 (UTC)[reply]
(from your buddy Spitzak) "You are correct that there are 2048 points that can't be put in UTF-16 and this causes more trouble than some people are willing to admit because they don't want to admit the design is stupid." Spitzak (talk) 21:45, 28 September 2014 (UTC)[reply]
ummm... I rest my case praps? MikeRodent (talk) 03:15, 13 January 2015 (UTC)[reply]

You are not being entirely fair. I have agreed explicitly (and others implicitly) that there are problems with the article. Believe me, we are not trying to keep this mysterious.

The addition of "how this encoding came about" is a step forward, for you are quite right that the UTF-16 encoding is not the sort of thing one would arrive at naturally, given a blank check to "encode a lot of characters". I agree that a historical explanation sets the stage better for understanding how it works. (But a historical approach necessarily brings in UCS-2 and UCS-4; these are far from "irrelevant".)

This would, however, be a much better step if it were historically accurate. The real problem is that there were two competing approaches, representing different ideas of what was important. ISO and the Unicode Consortium (the latter representing primarily computer manufacturers) agreed at first to expand the encoding space from 2^8 to 2^16, and attempted to keep the two developing standards in synch (as to which value encoded which character). It was quickly apparent to ISO that 2^16 would not be sufficient, and they extended UCS-2 (two bytes per character to cover the BMP) to UCS-4 (four bytes per character, for a 31-bit encoding space), all of whose codes were the same length to preserve the common programming paradigm that characters should have a fixed width.

But the consortium fought this tooth and nail for years, primarily because (a) 4 bytes per character wasted a lot of memory and disk space, and (b) Microsoft was already heavily invested in 2 bytes per character. Objection (a) was partially addressed by the UTF-8 encoding scheme, which can address the full UCS-4 31-bit space; but this was resisted both because of the variable-width encoding and because it requires up to 3 bytes even to cover the BMP. Finally, under pressure from the Chinese to expand beyond 2^16, and from manufacturers to retain a 2-byte orientation, IEEE and the consortium agreed to limit the Unicode space to the codepoints encodable via UTF-16 -- which was invented at that point to address both (a) and (b) -- and the two standards effectively merged. The Unicode space is now limited to the so-called "21-bit" space addressable by UTF-16, with three official encodings (UTF-8, -16, and -32, each with its own pluses and minuses). So it was essentially a compromise of convenience.

Specific points:

The BMP does in fact contain the vast majority of the volume of characters used today: this is not Eurocentric, but true for all the world's major languages.

"Surrogate pairs" is not an unhelpful term, and we must explain what it means. The term must be understood to read any of the Unicode documentation, and it's a perfectly reasonable technical term -- you have to call them something, preferably something concise and unambiguous.

Knuth's term "wyde" has not survived, despite his prestige. We move on.

"Encyclopedic" is determined by consensus, and I suspect the consensus is against you here. We have all learned that our deathless prose will be edited by others (undoubtedly to its disadvantage!), and you should learn that too if you wish to retain your equanimity.  :-)

-- Elphion (talk) 06:31, 13 January 2015 (UTC)[reply]

thank you for arguing your position clearly and courteously. It means that I can courteously respond.
that stuff about the IEEE and the Unicode Consortium does not belong in an introductory section, IMHO. But why not include it below in a section called "History of emergence of the encoding"? For those interested in precise historical aspects that would be great. Concerning my addition, you say "if it were historically accurate"... but my account is historically accurate and perfectly consistent with your explanation. What you perhaps mean is that I have not "told the full story". I'd agree with that, and it is an inherent characteristic of the process of teaching stuff to those who would learn.
in the same vein, there is no need whatever to confuse people with UCS-4 and UCS-2 when they are struggling to understand the oddity of UTF-16, which is, um, the title of the Wikipedia article! Again, these details belong in a subsidiary section of the article, or even (gasp) in a separate article about UCS-4 and UCS-2. It'd be interesting to see how many visitors such an article would get.
the term "surrogate pair" is inexplicably complicated and, again, distracts from the central task of explaining to the uninitiated how UTF-16 encoding works. I have now changed this to "unhelpfully complicated". Yes, of course you need to know the term when reading the Unicode documentation. But first its function as a barrier to understanding for the average reader must be addressed. I do this.
the reason I mentioned Knuth's term "wyde" is because one of the things which I failed to understand about UTF-16 until I did understand is what the attempts at explanation were on about: the basic unit of computer data storage is the 8-bit byte, so when texts were talking about "16-bit units", did they in fact mean pairs of bytes, or what? If you don't happen to have a degree in Comp Sci and don't have others to explain it to you this is not immediately obvious. Shame that Knuth's term didn't catch on. Perhaps the reason it didn't was partly because there was no need once UTF-8 attained primacy.
my deathless prose can be edited, as I am already learning :->... but I simply call on the other habitual editors of this page to do this in a spirit of understanding the needs of those who do not yet understand UTF-16 encoding, who presumably constitute 98% of those who visit the page! MikeRodent (talk) 09:15, 13 January 2015 (UTC)[reply]

I have rewritten the first section along the lines I suggested. (Also added UCS-2 back to the lead, since UCS-2 redirects to this article; it was decided long ago that UCS-2 and UCS-4 would best be covered in UTF-16 and UTF-32, since there's little else to say about them.) More references would be good, but they should be easy to find. The Unicode FAQ has most of it. I don't understand why you find the phrase "surrogate pair" to be "inexplicably complicated"; it's just a pair of surrogates. It's not the term that's complicated, it's the encoding. Having unambiguous terminology is a good thing, not a bad thing. Maybe the explanation I added will help. -- Elphion (talk) 00:10, 17 January 2015 (UTC)[reply]

Completely unacceptable rewrite, sorry.
You may not find the term "surrogate pair" unhelpfully confusing ... but non-specialists do. I have the feeling that you may perhaps be feeling defensive about the people who decided on this term. If this influences your writing, this in itself constitutes a "neutrality deficit". I've added "perhaps".
It definitely is confusing for a beginner who doesn't happen to have their own personal IT consultant at their elbow. It is for such people that Wikipedia exists.
Please answer this point before attempting to edit (rather than rewrite) my introductory section again: 98% of those who come here can be presumed to be baffled by UTF-16 and what it's about. They need an introductory section which explains things simply. Really as simply as possible.
I don't give a monkey's whether UCS-2 diverts here or what decisions were or were not "decided long ago"... There is in fact an article on Universal Character Set ... obviously UCS-2 and UCS-4 should direct there. I don't know how to implement this: please be so kind as to do it for me. And/or expand on the "stub" section I have just created.
What you know in your heart of hearts is that very few people are going to bother reading either such a section or such an article. They're not going to be interested, and you have no right at all to complicate their legitimate desire to learn how UTF-16 works on the motive that "Oh, but they have to know what UCS-2 is and how the IEEE and the UC battled it out." NO. THEY. DON'T.
By all means edit my stuff if you find it not neutral enough. But I repeat: the primary function of this introductory section is to explain the basics, in simple terms. You have no right to make it confusing for non-IT-experts in the way your rewrite did. MikeRodent (talk) 11:16, 17 January 2015 (UTC)[reply]
I've reverted your changes back to the revision by Elphion, which is far superior. Your unencyclopedic tone of writing and non-neutral point of view do not belong on Wikipedia. If you consider it unacceptable for people to rewrite your text then you would be better off writing a blog rather than joining a collaborative project such as Wikipedia. And by the way, it is considered extremely rude to edit other people's comments with your own inline ad-hominem attacks. BabelStone (talk) 11:26, 17 January 2015 (UTC)[reply]
It doesn't take that much intelligence to realise that my inline comments were in fact a measured response to your spontaneous discourteous actions and comments. Discourtesy of the kind manifested by yourself doesn't belong in Wikipedia. For example, you have just chosen not to address a single one of my carefully argued points, many of them taking up other contributors' previous objections to this article. If we're talking about what is rude, this is rudeness.
Rather than talking insultingly about "blogs" you need to look at your own conduct and your own clearly demonstrated intent to keep the introduction to this article more difficult to understand than it needs to be. All my actions over the past couple of weeks have been directed at addressing this. It still needs to be addressed with Elphion's complete rewrite (rather than edit) of what I wrote. He makes the explanation too confusing for most people visiting the page.
I noted from the link to your profile that you have made many contributions to Wikipedia. I presume there is a mechanism known to you in Wikipedia to squash the contributions of people who persistently disagree with more long-standing editors. What a shame that the long-standing editors of this particular article are seeing fit to fight tooth and nail to prevent a simple explanation for ordinary users of Wikipedia. MikeRodent (talk) 11:45, 17 January 2015 (UTC)[reply]
I have reverted MikeRodent's "inline comments" to BabelStone's comments, because they violate the Wikipedia talk page guidelines on editing others' comments. We do not put words in others' mouths (or keyboards) here. RossPatterson (talk) 00:58, 23 January 2015 (UTC)[reply]

Single byte surrogates[edit]

Why not use D8-DF start bytes as surrogates?

0000000000-000000FFFF=        xxxx
0000010000-0000FFFFFF=    D8xxxxxx
0001000000-0001FFFFFF=    D9xxxxxx
0002000000-0002FFFFFF=    DAxxxxxx
0003000000-0003FFFFFF=    DBxxxxxx
0004000000-0004FFFFFF=    DCxxxxxx
0005000000-0005FFFFFF=    DDxxxxxx
0006000000-0006FFFFFF=    DExxxxxx
0007000000-FFFFFFFFFF=DFxxxxxxxxxx

PiotrGrochowski000 (talk) 08:02, 7 March 2015 (UTC)[reply]

On Wikipedia we describe what is, not fantasize about what could have been. As such, your comment is out of scope here (see WP:TALKNO). BabelStone (talk) 11:33, 7 March 2015 (UTC)[reply]

This would create an issue that UTF-16 and UTF-8 both avoid: a code unit could match a single character while being a part of a longer coded character. For example the sequence 'D800 0020' in your encoding is a single character. So is '0020'. Consequently, if we search for the character '0020', we would get a positive match, even though it's not in there. 2604:6000:9981:7600:F0E0:89C2:B327:B147 (talk) 13:57, 9 March 2020 (UTC)[reply]
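To make the objection above concrete, here is a minimal sketch in Python. The encoder encode_hypothetical is invented for illustration and only covers the first two rows of the proposed table; it is not any real standard. Searching for the encoded form of U+0020 finds a false positive inside the encoding of a single larger code point, something the disjoint code-unit ranges of UTF-8 and UTF-16 are designed to prevent.

# Hypothetical encoder for the proposal above (first two rows only) -- not a real standard.
def encode_hypothetical(cp: int) -> bytes:
    if cp <= 0xFFFF:
        return cp.to_bytes(2, "big")            # plain 2-byte unit
    if cp <= 0xFFFFFF:
        return b"\xD8" + cp.to_bytes(3, "big")  # lead byte D8 + 24-bit value
    raise ValueError("not covered by this sketch")

haystack = encode_hypothetical(0x010020)  # one character, encoded as D8 01 00 20
needle   = encode_hypothetical(0x0020)    # the space character, encoded as 00 20
print(needle in haystack)                 # True -- a false match: U+0020 is not in the text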

IEEE v.s. ISO confusion?[edit]

The History section mentions IEEE, but I’m not aware of IEEE ever being involved in Unicode or UTF-16. Could the author have been confused with ISO? ISO/IEC 10646 is related to Unicode.

I also think this is a mistake. The claim should be supported by a reference, or removed. Jsbien (talk) 08:43, 22 May 2019 (UTC)[reply]

Yes, IEEE is a mistake. I have replaced IEEE with ISO/IEC JTC 1/SC 2 which is the ISO/IEC subcommittee responsible for the development of the Universal Coded Character Set. BabelStone (talk) 11:05, 22 May 2019 (UTC)[reply]

biased wording about the hole[edit]

The article says

"They neglected to define a method of encoding these code points, thus leading to a "hole" in the set of possible code points, a source of some difficulty when dealing with Unicode. This mistake was not made with UTF-8."

This implies that the hole was a "mistake". The alternative to a hole would have been making it so UCS-2 and UTF-16 encoded the same code point in different ways which would have made things even more confusing given that the purpose of UTF-16 was clearly to be a successor to UCS-2. 130.88.154.105 (talk) 14:55, 23 June 2015 (UTC)[reply]

UTF-8 uses a different method of encoding the code points 128-255 than 8-bit encodings did, and I don't think anybody is "confused"; I'm pretty certain that if those codes were impossible it would be considered a mistake. There is absolutely no reason UTF-16 could not have been designed to handle a continuous sequence of code points except that the programmers were lazy. Since they had to ditch simple sorting order anyway, I think it would have been OK to put these 2048 code points at the very end of the possible code sequences. Spitzak (talk) 02:01, 24 June 2015 (UTC)[reply]
But several of those were already assigned characters. They chose a range that wasn't already occupied. -- Elphion (talk) 02:29, 24 June 2015 (UTC)[reply]
Of course they chose a range unoccupied by characters. That does not mean that you should be unable to encode those code points. If that was allowed they could have chosen *every* unassigned code point and said they all were to encode code points greater than 0xFFFF.Spitzak (talk) 21:20, 24 June 2015 (UTC)[reply]
I wasn't arguing whether the surrogates should be encodable, just pointing out why the use of a non-terminal range was not a "mistake". It was in fact critically important that they chose a range that didn't alter existing UCS-2 assignments, since otherwise several important manufacturers would never have bought into UTF-16 (and therefore into any form of Unicode for codepoints above U+FFFF). At this point the only way I can think of to make the surrogates encodable in a standard way would be to use one of the reserved codepoints below U+FFFF as an escape character, signifying that the next codepoint is to be taken literally, even if it is a surrogate or otherwise reserved. The standards people don't seem to be convinced that this is important (i.e., they feel that one shouldn't be encoding such codepoints anyway), any more than, say, overlong UTF-8 encodings should ever be preserved when those are found in the input message. This isn't strictly a UTF-16 issue; it boils down to how to retain strictly illegal encodings in any of the UTF schemes in a way that is portable and acceptable to the standard (so that standards-conforming decoders don't fail on that input). -- Elphion (talk) 16:28, 25 June 2015 (UTC)[reply]

Seemingly contradictory statement[edit]

When, under the section entitled Byte order encoding schemes, the article says that

"If the endian architecture of the decoder matches that of the encoder, the decoder detects the 0xFEFF value, but an opposite-endian decoder interprets the BOM as the non-character value U+FFFE reserved for this purpose. This incorrect result provides a hint to perform byte-swapping for the remaining values. If the BOM is missing, RFC 2781 says that big-endian encoding should be assumed."

and then follows with

(In practice, due to Windows using little-endian order by default, many applications also assume little-endian encoding by default.)

isn't the article contradicting itself? An application can't assume two defaults. Shouldn't the word 'also' above be omitted?

-- SourceMath (talk) 13:30, 25 June 2015 (UTC)[reply]

The article means "many applications similarly assume little-endian encoding by default." You're right, the "also" construction can be misinterpreted. -- Elphion (talk) 16:27, 25 June 2015 (UTC)[reply]
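For readers who want to see the byte-order logic described above in code, here is a rough sketch in Python; the helper name detect_utf16_codec and the default parameter are illustrative choices, not taken from any particular implementation. A BOM selects the byte order; in its absence a caller-supplied default applies, big-endian per RFC 2781 or little-endian as many Windows-oriented applications assume.

import codecs

def detect_utf16_codec(data: bytes, default: str = "utf-16-be") -> str:
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        return "utf-16-be"
    # No BOM: RFC 2781 says assume big-endian; many applications assume
    # little-endian instead, hence the caller-supplied default.
    return default

data = "A".encode("utf-16-le")                   # bytes 41 00, no BOM
print(detect_utf16_codec(data))                  # 'utf-16-be' -- the RFC default, wrong for this data
print(data.decode(detect_utf16_codec(data, default="utf-16-le")))  # 'A'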

languages missing from UCS-2[edit]

Which languages exist in utf-16 that don't exist in ucs-2? It is implied by the article that there are some missing, but it isn't clearly stated anywhere. 84.94.119.60 (talk) 12:01, 23 August 2015 (UTC)[reply]

I wrote a short note about this in the article, and a link to the main article answering this question, Plane (Unicode). --BIL (talk) 18:07, 23 August 2015 (UTC)[reply]
UCS-2 is obsolete. It was written in 1990, 25 years ago. Comparing it to UTF-16 SHOULD be an academic question (in a perfect world). Or perhaps I'm wrong? WHY is there so much included here about it? Anything added to Unicode since then will not have any UCS-2 representation, and most Unicode code points have been created since then. But if there is a good reason that so much space is used to describe this ancient standard here, it should be explained. I'd go so far as to remove all but one mention of UCS-2 from this article, leaving one in the lede since some people seem to think UCS-2 is a synonym for UTF-16 and using it to link to a UCS-2 article. 216.96.79.108 (talk) 13:30, 24 September 2015 (UTC)[reply]
Why is there "so much" about UCS-2 here? (There's really not very much.) Because UCS-2 redirects here, and this is the natural place to describe it, since it is the basis on which UTF-16 was built. It's fairly clear from the article that UCS-2 is obsolete. We can say that explicitly, say by adding the following at the end of the first paragraph of the History section: "UCS-2 represents an early, incomplete, and now obsolete version of UTF-16 that is limited to the codepoints that can be represented in 2 bytes." -- Elphion (talk) 18:08, 24 September 2015 (UTC)[reply]

Jargon[edit]

There should be some clear explanation of the system used to designate the code points. As the article stands, the reader is assumed to be familiar with hexadecimal as well as common computer-science representations of hexadecimal numbers. U+ indicates that the number following is a hexadecimally encoded Unicode code point, written U+hhhh or U+hhhhh, where each h represents one hexadecimal digit (0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F) and the value lies in the range 0 to 1 114 111 (decimal), i.e. 0 to 10FFFF (hexadecimal). In several computer languages, the symbols 0x are prepended to a hexadecimal number to distinguish it from a decimal number. Thus 0 = 0x0000, but 10 = 0x000A or sometimes 0xA (the leading zeroes are often removed). This is useful to distinguish a number such as 100 from 0x100, since 0x100 is the hexadecimal representation of 256. It is possibly outside the scope of this article to also mention that Unicode code points are not necessarily unique representations, and that there may be several ways to specify a character or glyph which have identical meaning. Parts of the specification originated by historical accident and are based on various pre-existing ("pre-electronic-age") national and international standards (examples include ANSI, ASCII, and Simplified Chinese). So character (code point) order is not always the same as alphabetical order in the language which uses those code points. Code points may also represent symbols as well as meta-data without a glyph representation; an example is U+0082, which means that a line break is allowed at that location in the text but has no glyph associated with it. 216.96.79.108 (talk) 13:15, 24 September 2015 (UTC)[reply]

We should mention at the beginning of the Description section that U+hex is the standard way of referring to Unicode codepoints, and provide a link to Unicode#Architecture and terminology, where this is described in more detail (and perhaps to Code point and Hexadecimal as well). Most of the rest of what you mention is discussed elsewhere, and is not appropriate for this article, which is specifically about the encoding. -- Elphion (talk) 18:17, 24 September 2015 (UTC)[reply]
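As a small illustration of the notation in question (Python is used here purely as a convenient calculator; nothing below is specific to this article):

cp = 0x20AC                  # hexadecimal literal; 8364 in decimal
print(chr(cp))               # '€' -- the character at that code point
print(f"U+{cp:04X}")         # 'U+20AC' -- the standard Unicode notation
print(0x100 == 256)          # True: 0x100 is not one hundred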

Surrogate halves[edit]

Can the surrogate halves (0xD800 to 0xDFFF) be encoded in UTF-32? 108.66.235.240 (talk) 18:22, 25 September 2016 (UTC)[reply]

No. Surrogate code points are only used in UTF-16, and are invalid in UTF-32 and UTF-8. BabelStone (talk) 18:39, 25 September 2016 (UTC)[reply]
They can technically be encoded in UTF-32 and UTF-8, but that's not valid according to the Unicode standard. Single surrogate halves are also not really valid. Read the UTF-16#U+D800 to U+DFFF section of the article, which says single surrogate halves are often tolerated by decoders. --BIL (talk) 20:51, 25 September 2016 (UTC)[reply]
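As a concrete illustration of the point above, here Python's codecs stand in for one standards-conforming implementation: the strict UTF-32 encoder rejects a lone surrogate, and only the non-standard 'surrogatepass' error handler will emit those bytes.

lone = "\ud800"                      # a lone lead (high) surrogate
try:
    lone.encode("utf-32-le")
except UnicodeEncodeError as err:
    print("rejected:", err.reason)   # 'surrogates not allowed'

# Tolerated only with a non-standard error handler; the result is not valid UTF-32.
print(lone.encode("utf-32-le", "surrogatepass").hex())   # '00d80000'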

0 vs 0x0000[edit]

Nobody writes hexadecimal 0 as just 0. 108.65.81.120 (talk) 00:54, 3 October 2016 (UTC)[reply]

please make ucs-2 article[edit]

Please make a UCS-2 article; I think covering them in one article is wrong. 1) UCS-2 is a constant-length encoding while UTF-16 is a variable-length encoding, which makes UTF-16 more complicated and slower to process; for example, cutting a string after character number 100, or getting character number 100, should take nearly 100 times longer for a UTF-16 string than for a UCS-2 string! 2) Maybe there are some additional letters in the 2 bytes of UCS-2 compared to the 2 bytes of a UTF-16 string that is wrongly divided simply after every 2 bytes (constantly, instead of variably after 2 or 4 bytes), because some code(s) in the first 2 bytes of a UTF-16 character have to be used to show that there are an additional 2 bytes for the current character, while those special code(s) in the first 2 bytes of UTF-16 could instead have been used for additional letter(s) in UCS-2. --Qdinar (talk) 18:04, 13 July 2017 (UTC)[reply]

Those special codes that show the presence of an additional 2 bytes are D800-DFFF, and they are reserved by Unicode for that purpose, so they do not encode characters and cannot be used for private use. But, I think, technically they can be used by programmers for private use if they use a constant-length encoding. --Qdinar (talk) 19:14, 13 July 2017 (UTC)[reply]
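To illustrate the indexing point raised above (cutting or fetching the 100th character), here is a sketch in Python; the function name nth_codepoint_offset is made up for this example. In UTF-16 you cannot jump straight to the Nth code point, because a lead surrogate in D800-DBFF signals that the code point occupies two 16-bit units, so a scan from the start is needed; in a fixed-width encoding such as UCS-2 or UTF-32 the offset would simply be N.

def nth_codepoint_offset(units, n):
    """Code-unit offset of the n-th code point (0-based) in a UTF-16 unit sequence."""
    offset = 0
    for _ in range(n):
        # a lead surrogate means this code point occupies two 16-bit units
        offset += 2 if 0xD800 <= units[offset] <= 0xDBFF else 1
    return offset

# "A" (U+0041), "𝄞" (U+1D11E, one surrogate pair), "B" (U+0042)
units = [0x0041, 0xD834, 0xDD1E, 0x0042]
print(nth_codepoint_offset(units, 2))   # 3 -- "B" starts at code unit 3, not at index 2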

some info: http://justsolve.archiveteam.org/wiki/UCS-2 --Qdinar (talk) 18:14, 13 July 2017 (UTC)[reply]

some info: https://web.archive.org/web/20060114213239/http://www.unicode.org/faq/basic_q.html#23 --Qdinar (talk) 18:27, 13 July 2017 (UTC)[reply]

I see that maybe I was wrong. The "U+0000 to U+D7FF and U+E000 to U+FFFF" section of the article and this map https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane explain that UCS-2 is mostly the same as UTF-16. --Qdinar (talk) 19:08, 13 July 2017 (UTC)[reply]

UCS-2 is supported by many programs, among them:

mysql : https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-ucs2.html

php : http://php.net/manual/en/mbstring.supported-encodings.php

--Qdinar (talk) 21:28, 13 July 2017 (UTC)[reply]

I have made changes to the article. --Qdinar (talk) 23:13, 13 July 2017 (UTC)[reply]

Unpaired surrogates[edit]

"It is possible to unambiguously encode an unpaired surrogate..." This paragraph has been mentioned two times already on this talk page and it still doesn't make sense to me. I pondered it several times and I tried to find a way to actually do this, but I failed. Either (A) I'm just too stupid to get it, or (B) this paragraph confuses "encode" and "decode" or "code units" and "code points" or uses the same term in different meanings, or (C) this is actually wrong. Sure, you can interpret one half of a surrogate pair as a UTF-8 character, but it would be wrong. Is this what is meant? See also https://unicode.org/faq/utf_bom.html#utf8-4 . The way it is written now sounds like it is somehow magically possible to squeeze a value >= 0x10000 into 16 bit. Please clarify or remove this paragraph. --213.68.42.195 (talk) 12:04, 9 March 2020 (UTC)[reply]

You didn't quote the sentence in its entirety. The paragraph as written in the article is indeed factually correct and uses its terms properly. The encoding is indeed unambiguous when the surrogate isn't followed/preceded by a matching surrogate. E.g. the sequence of UTF-16 *code units* '0020 D800 0020' can be unambiguously interpreted as the sequence of *codepoints* U+0020, U+D800, U+0020. Granted, however, that such an encoding is deemed invalid UTF-16 by the Unicode standard. This is not a theoretical exercise though. A lot of software that uses UCS-2 or "UTF-16" internally will happily accept, store, and transmit such sequences. 2604:6000:9981:7600:F0E0:89C2:B327:B147 (talk) 13:37, 9 March 2020 (UTC)[reply]
I can see the point that perhaps "decode" is a better word than "encode". Imagine a really bad encoder that turns every character into the same code unit (1, for instance). I guess you could say that encoder is not ambiguous: it knows exactly what to do with every character (turn it into 1). But the *decoder* for this encoding has a serious problem and is ambiguous (it could turn a 1 into any character). In any case the point was that a high surrogate not followed by a low, or a low not preceded by a high, can be turned into a UTF-16 code unit in a way that a decoder cannot possibly think it is some other character (it is also an obvious encoding; there are an infinite number of un-obvious encodings, such as swapping the high and low values when writing code units). Spitzak (talk) 19:18, 9 March 2020 (UTC)[reply]
Then I guess I over-interpreted the term "unambiguous". The actual meaning being that the output of such an encoding is wrong, but it is unambiguously wrong? But how do you encode an unpaired surrogate in UTF16? A surrogate already is the encoding, which only makes sense in UTF16. I still think this is confusing and deserves a clarification. How about a full round-trip example (input -> encoding -> interpretation by decoder)? 193.16.224.9 (talk) 10:20, 5 June 2020 (UTC)[reply]
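A full round-trip along the lines requested, using Python's codecs as one concrete implementation: the non-standard 'surrogatepass' error handler stands in for the tolerant decoders mentioned in the article, while a strict decoder rejects the same bytes.

units = [0x0020, 0xD800, 0x0020]                          # space, lone lead surrogate, space
raw = b"".join(u.to_bytes(2, "little") for u in units)    # bytes 20 00 00 d8 20 00

decoded = raw.decode("utf-16-le", "surrogatepass")        # tolerant decoder
print([hex(ord(c)) for c in decoded])                     # ['0x20', '0xd800', '0x20'] -- unambiguous

try:
    raw.decode("utf-16-le")                               # strict decoder
except UnicodeDecodeError as err:
    print("strict decoder rejects it:", err.reason)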

Less how-very-convenient example needed[edit]

<rant> I have to agree with MikeRodent above that this must be one of the least comprehensible articles on Wikipedia. The UCS stuff is just noise to most readers, who want to understand how UTF-16 works, not relive every rut in the road that got us where we are now, so it should be moved out to the UCS article. I come to this article with some computing experience, so I shudder to think what the uninitiated make of it. Whatever happened to WP:Think of the reader? </rant>

Anyway, with that off my chest, what I really want to understand is how code-points in the upper half of the BMP are encoded. The article has two examples from the lower half and then conveniently skips over the upper half into the next plane. Can we have an example please? How about U+FFEE HALFWIDTH WHITE CIRCLE? --John Maynard Friedman (talk) 20:15, 22 October 2020 (UTC)[reply]

All valid codepoints in the BMP are coded as two bytes in UTF-16; U+FFEE becomes EE FF (little endian) or FF EE (big endian). -- Elphion (talk) 02:20, 23 October 2020 (UTC)[reply]
Messier and messier, as Alice might have said. No wonder this system never gained traction. Which end of the egg does Microsoft have up?
So that means that there is no flag (such as top bit set or clear) to signal the presence or absence of a second byte pair? I had guessed as much, not that the article was any help. Nor does it explain adequately how that [p or a] is identified at decode time. --John Maynard Friedman (talk) 09:01, 23 October 2020 (UTC)[reply]
The "flag" (such as it is) is that the 2-byte codepoint value lies in the leading surrogate range (0xD800 to 0xDBFF). This means that the next two bytes will lie in the trailing surrogate range (0xDC00 to 0xDFFF), and the four bytes together encode a codepoint above the BMP. It's not ideal (it developed for historical reasons as a compromise to save a mountain of existing code that is still with us), but neither is it rocket science: the tests are not hard to perform, and the encoding/decoding of the non-BMP codepoints is not more complicated than UTF-8. UTF-8 has the distinct advantage that it is endian invariant, but endianness is a much older issue that Unicode had to accommodate. -- Elphion (talk) 15:06, 23 October 2020 (UTC)[reply]

The History section[edit]

I edited the History section to attempt to answer some of the issues above. But the last paragraph in that section is still pretty impenetrable, and refers to "surrogates", which have not been introduced yet. How much of that paragraph needs to be preserved? How could it be worded better? -- Elphion (talk) 17:13, 3 November 2020 (UTC)[reply]

I know that most articles have the history up front, but I suggest that there is a strong case in this article to have it last. In most other cases, it is difficult to explain where we are today without first explaining how we got here, but in this case I believe that the history casts more shadow than it does light. Having it at the end means that readers have more of a helicopter view and can better understand it. It also solves the surrogate problem! --John Maynard Friedman (talk) 17:28, 3 November 2020 (UTC)[reply]
Of course it would also mean that the Description section couldn't just dive straight in with this sort of strange statement: When reading the first 16 bits of a character encoding, a direct 16-bit character would occupy the lower (0x0000 to 0xD7FF) range or the upper (0xE000 to 0xFFFF) range. If it's a 32-bit character encoding (with room for 20-bit character codes), the unused interval in between is used to signal that another 2 bytes should be read. What??? Wait up!! Let's get back to first principles: it should say something like "a 16-bit string is capable of encoding 2^16 (65,536) code-points and no more, but the number of world letters, symbols, signs and emojis that need to be encoded is many times more than that. UTF-16 resolves this issue by using just ten bits for character encoding and reserving the remaining six bits to indicate where those characters are to be found". Or something like that? --John Maynard Friedman (talk) 17:45, 3 November 2020 (UTC)[reply]

I agree that passage is not clear. It's now been removed. We will continue to plug away at this. -- Elphion (talk) 18:21, 3 November 2020 (UTC)[reply]

Mojibake of surrogate characters[edit]

Just a private question: Anyone here who might be able to help me with my question at WP:RD/C#Mojibake of surrogate characters? ◅ Sebastian 12:19, 21 June 2023 (UTC)[reply]

This question, while unanswered, has been archived at WP:Reference desk/Archives/Computing/2023 June 15#Mojibake of surrogate characters. ◅ Sebastian 12:14, 11 August 2023 (UTC)[reply]

Coverage of surrogates[edit]

Currently, surrogate pairs redirects here, while surrogate pair redirects to Universal Character Set characters. I feel the topic fits better here, but the latter redirect has the advantage that it goes to a dedicated section. Where should surrogates best be covered? A dedicated section here would encompass more than half of this article; that would probably be too big. IMHO, they are important enough to have an article on their own, in which case much of this article would be moved there. Any opinions? ◅ Sebastian 12:19, 21 June 2023 (UTC)[reply]

Makes sense to me. The UCS article is pretty bloated with more than one section about surrogates. Spitzak (talk) 14:52, 21 June 2023 (UTC)[reply]
Thanks, Spitzak. Would you or anyone else help me with the creation of the new article? ◅ Sebastian 10:54, 11 August 2023 (UTC)[reply]

What operator to use for the supplementary code conversion?[edit]

Currently, the formula for the supplementary code conversion at Code points from U+010000 to U+10FFFF contains two arithmetic operators: + and -. While the latter is used in the correct arithmetic sense, the former is apparently used in the sense of a text concatenation. (Or in the sense of some preschool kids who think that 1+2=12.) Should we use a dedicated operator from Concatenation#Syntax, or use plain text? (Presumably, the intention might have been to emulate overloading, but that's very confusing here, where all operands can be seen as integers. Possibly there was some intention to express this through selective use of 0x, but that's quite idiosyncratic.) ◅ Sebastian 13:39, 11 August 2023 (UTC)[reply]

the former is apparently used in the sense of a text concatenation No, just ordinary addition. 0xD800 = 0b1101100000000000 and 0xDC00 = 0b1101110000000000, so the values to the left of the comment are the result of adding those two hex values to yyyyyyyyyy and xxxxxxxxxx, respectively. Guy Harris (talk) 20:10, 11 August 2023 (UTC)[reply]
You're right, thanks! I missed that the values without 0x are meant to be 0b. That's already so in the source, but still it may not be the best presentation for an encyclopedic article, as it's not immediately obvious to all readers. ◅ Sebastian 14:58, 12 August 2023 (UTC)[reply]
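For what it's worth, the calculation reads unambiguously when written out as code. This is a short sketch in Python; the function name is mine, but the arithmetic is exactly the standard UTF-16 conversion discussed above: both plus signs are ordinary integer addition, and the binary forms of 0xD800 and 0xDC00 simply leave the low ten bits of each code unit free for the payload.

def encode_supplementary(cp):
    """Surrogate pair (W1, W2) for a code point in U+10000..U+10FFFF."""
    u = cp - 0x10000               # 20-bit value U'
    w1 = 0xD800 + (u >> 10)        # lead surrogate: 110110 followed by the top 10 bits
    w2 = 0xDC00 + (u & 0x3FF)      # trail surrogate: 110111 followed by the low 10 bits
    return w1, w2

print([hex(w) for w in encode_supplementary(0x10437)])   # ['0xd801', '0xdc37']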

Why comments?[edit]

Headline inserted 12 August 2023

Oh, and what's the point of the double slashes? Presumably, they stand for comments, but - with the rest of the formulas correct - the appropriate sign would be simply =. Note that the // are rather idiosyncratic as well, as they're not in the source. ◅ Sebastian 17:01, 11 August 2023 (UTC)[reply]

Yes, the point of the double slashes is to introduce comments, C++/C99-and-later-style. The comments are there to indicate the formula used to calculate the values to the left of the comments; the items to the left of the comments and the right of the equal signs are the bit representations of U', W1, and W2. And, yes, showing them as {name} = {bit representation} = {formula} might also work. Guy Harris (talk) 20:14, 11 August 2023 (UTC)[reply]
Of course double slashes are used by some languages for comments. But that's not the question here. My question is: What's the point? Why use an unintroduced syntax (which is probably not known to all readers) to “illustrate” something that can be expressed with a generally known sign like the equal sign? And even if every reader knew the meaning of //: If an explanation itself needs a (meta-) comment, then it's not a good explanation. ◅ Sebastian 14:14, 12 August 2023 (UTC)[reply]