Talk:C0 and C1 control codes

does anyone here have access to ISO/IEC-6429

and if so, can they check the codes in the C1 table (particularly the 3 not identified by Unicode) against it? Plugwash 02:34, 23 January 2006 (UTC)

ECMA-48, the European version of this standard, is available online. --Random832 23:32, 1 July 2007 (UTC)

Supposedly ECMA-48 is identical (and is available for free). The ISO (and ANSI) documents all cost money. Tedickey (talk) 10:23, 10 March 2008 (UTC)[reply]

Is "String Terminator" abbreviated "SI"?

Control code 0x9C is listed as:

0x9C SI ST String Terminator

However, SI is the abbreviation for:

0x0F SI Shift In

Is the SI in String Terminator supposed to be ST?

24.234.114.35 21:34, 4 May 2007 (UTC)[reply]

Fixed, source RFC 1345 says ST. --217.184.142.52 (talk) 19:52, 16 June 2008 (UTC)[reply]

C1 not derived from/used in ISO/IEC 8859-n

The C1 codes were included in the ISO-8859-n series of encodings [...].

I think this is wrong if ISO-8859-n means ISO/IEC 8859. I only have access to draft versions of ISO/IEC 8859, but they explicitly say (C1 code points) use is outside the scope of ISO/IEC 8859; it is specified in other International Standards, for example ISO/IEC 6429., see here. --Abdull 08:10, 8 June 2007 (UTC)[reply]

There is a subtle but important difference between ISO/IEC 8859-1 and the IANA charset ISO-8859-1. One is an incomplete standard without control codes; the other adds them in to make a usable standard. Plugwash 21:42, 1 July 2007 (UTC)

CUA stuff

A few of the entries describe the use of a control key as a shortcut in many Windows programs and CUA X11 programs. For example: "In many programs, a keyboard input of Ctrl-Y is a "redo" command to undo the last Ctrl-Z undo command."

That's true, but the fact that Microsoft, when porting their Office software from the Mac to their own OS, used control keystrokes as a substitute for the missing command key has nothing to do with the meaning of any control character as a C0 control code.

Even if I'm completely wrong, I can't imagine how the undo/redo meanings of ^Z/^Y could be relevant while the clipboard meanings of ^X/^C/^V, the file command meanings of ^N/^O/^S, the select-all meaning of ^A, the find-related meanings of ^F/^G/^R, etc. are not. --75.36.140.83 07:36, 24 September 2007 (UTC)

RFC 1345

Do we really need to include the RFC 1345 acronyms? Aside from some limited usage in a UNIX utility, I haven't come across any evidence that they saw use elsewhere. Caerwine Caer’s whines 22:32, 16 June 2008 (UTC)

I'd tend to agree - though deciding whether to remove them would take some investigation Tedickey (talk) 00:43, 17 June 2008 (UTC)[reply]

Backspace

The comments about backspace and its linked topic do not mention its use for underlining and bold. The comment in the table is rather crowded, but rather than a blanket "deprecated", the point should be made that while composition of characters is not generally supported in terminals, underline and bold generally are. Tedickey (talk) 12:19, 19 June 2008 (UTC)


I think the description of Backspace is incorrect. This character does not have different uses for input and output (just like the CR or ESC characters, for example): it always moves the cursor leftwards, so the phrase "To provide disambiguation between the two potential uses of backspace" makes no sense.

A more precise description could be one in the same style of CR or ESC characters, for example:

Move the cursor one position leftwards. The Backspace key on a keyboard will send this character that is usually used to delete the character to the left of the cursor; to do that the three character sequence BS SPACE BS (0x08 0x20 0x08) is used. In early computer technology, where a character once printed could not be erased, the backspace was sometimes used to generate combinations of two characters, like à that could be produced using the three character sequence a BS ` (0x61 0x08 0x60), the method to print underline or overstrike characters combining _ or - with any character, or the standard method in APL programming language to create new operators combining two existing operators, like / BS - Aacini (talk) 05:35, 2 November 2008 (UTC)[reply]

agree Tedickey (talk) 18:44, 2 November 2008 (UTC)[reply]
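
To make the two behaviours described above concrete, here is a minimal sketch in Python. It is only an illustration under simplifying assumptions: the helper names `render_overstrike` and `render_screen` are hypothetical, each models a single line, and neither reflects how any particular terminal is implemented. The first models a printing device where a struck glyph cannot be erased; the second models a display where a new glyph replaces the old one.

```python
BS = "\b"  # Backspace, 0x08


def render_overstrike(data: str) -> list[set[str]]:
    """Model a printing terminal: every glyph struck on a column stays there."""
    columns: list[set[str]] = []
    pos = 0
    for ch in data:
        if ch == BS:
            pos = max(pos - 1, 0)  # move left, never past the margin
            continue
        while len(columns) <= pos:
            columns.append(set())
        if ch != " ":  # a space advances the carriage but prints nothing
            columns[pos].add(ch)
        pos += 1
    return columns


def render_screen(data: str) -> str:
    """Model a video terminal: a new glyph replaces the one already in the cell."""
    cells: list[str] = []
    pos = 0
    for ch in data:
        if ch == BS:
            pos = max(pos - 1, 0)
            continue
        if pos == len(cells):
            cells.append(ch)
        else:
            cells[pos] = ch
        pos += 1
    return "".join(cells)
```

With these assumptions, `a BS `` leaves both glyphs on one column of the printing model, while `BS SPACE BS` blanks the previous cell on the display model.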

This article is not about all control characters

Just a friendly reminder. This article is not about every possible usage of a control character, nor even about usage on every system where 00HEX–1FHEX are control characters. This is about a specific set of control characters, the C0 and C1 sets as defined by ISO/IEC 2022. Some of those meanings are generalized, so while instances where an application or system further defines their usage are relevant, a use which is totally unrelated to the character as defined in ISO/IEC 2022 belongs in either a separate article or in control character. Caerwine Caer’s whines 02:58, 12 July 2008 (UTC)

unclear lines

The section C1 (ISO 8859 and Unicode) will become clearer if "if being used in an environment where 8-bit characters are not supported or where these octets are being used instead to add additional graphics characters" is removed. Also, I have passed a '+' outside the parentheses in a table column label. —Preceding unsigned comment added by 122.169.5.54 (talk) 08:46, 12 January 2010 (UTC)[reply]

The sentence could be broken up, but removing it would lose the hint for why 7-bit controls are useful. (Sending 2 bytes instead of 1 is not necessarily a good thing). Tedickey (talk) 09:33, 12 January 2010 (UTC)[reply]

C1 (ISO 8859 and Unicode)

I renamed the heading "C1 (ISO 8859 and Unicode)" as "C1 set" since C1 is not defined in either ISO 8859 or Unicode. C0 and C1 can be used in ISO 8859 or Unicode text, but they don't define C0 or C1. — Preceding unsigned comment added by 88.112.175.168 (talk) 10:06, 27 September 2011 (UTC)[reply]

And so what is «C0 Controls and Basic Latin» and «C1 Controls and Latin-1 Supplement» in Unicode standard?
  1. http://www.unicode.org/charts/PDF/U0000.pdf
  2. http://www.unicode.org/charts/PDF/U0080.pdf — Preceding unsigned comment added by 84.97.14.22 (talk) 06:27, 19 July 2012 (UTC)[reply]
ECMA-35 and ECMA-48 define the use of C0/C1 for ISO-8859-1. Without a document such as that for Unicode (or UTF-8), all the documents that you have mentioned do is to show pictures of the codes that are mapped from ISO-8859-1; the C0/C1 behavior has not been specified. A reliable source on the matter would not leave leeway for guessing what might be meant TEDickey (talk) 08:16, 19 July 2012 (UTC)[reply]
I just want to say that the Unicode standard
  1. recognizes those values as control characters,
  2. gives their range and aliases,
  3. as characters, implicitly attributes to them a byte sequence depending on the UTF in use.
Maybe you just want to say that Unicode does not specify the exact behavior of each control character.
Additionally, a link can be established to Unicode control characters.
In The Unicode Standard, Version 6.1 page 23, they say: Basic Type control is «Usage defined by protocols or standards outside the Unicode Standard», and classifies them as category Cc with status abstract character.
And they add «Control Codes. Sixty-five code points (U+0000..U+001F and U+007F..U+009F) are defined specifically as control codes, for compatibility with the C0 and C1 control codes of the ISO/IEC 2022 framework. A few of these control codes are given specific interpretations by the Unicode Standard. (See Section 16.1, Control Codes.)»
§16.1 is in page 544 for C0.
In page 545 an additional semantic is clarified for at least eleven of them «Specification of Control Code Semantics» — Preceding unsigned comment added by 84.97.14.22 (talk) 11:18, 19 July 2012 (UTC)[reply]
But that's the point: the paragraph as written states that Unicode "provides" these codes, but it is in a context (and no clarification is made there) to point out that Unicode provides no definition of their behavior. The C1 codes without being translated would be illegal in UTF-8 encoding (because the values in 128-159 are continuation bytes). Without clarification, the paragraph is misleading. The word "provides" is inappropriate in this context - "assigns" would be more idiomatic, and corresponds to the sources you indicate TEDickey (talk) 22:32, 19 July 2012 (UTC)[reply]
C1 is not illegal in UTF-8. U+0085 (NEL / Next Line) is encoded as C2 85 in UTF8. I found this document which suggests that:
I don't know if that claim is true. But I tested a number of terminal emulators, and GNU Screen and Mosh were the only terminal emulators I tested that supported C2 85 as a newline character. --Hirsutism (talk) 21:07, 11 October 2012 (UTC)[reply]
Screen isn't a terminal emulator; nor is mosh - they're applications which use terminals and rely upon those to provide a lot of the functionality associated with a terminal emulator. TEDickey (talk) 21:31, 11 October 2012 (UTC)[reply]
Yes, Mosh does do terminal emulation. See here: "... the opportunity to build a clean UTF-8 terminal emulator from scratch ...". Mosh significantly reinterprets control characters and escape sequences, before sending them to the final terminal emulator. -Hirsutism (talk) 22:36, 11 October 2012 (UTC)[reply]
I'm aware of the opinion of its developer(s), but since it relies on the terminal (and ncurses) for the functionality, it's like screen - a translator which isn't a complete terminal emulator. You're not likely to find an authoritative source which agrees with that opinion. TEDickey (talk) 22:56, 11 October 2012 (UTC)[reply]
We're getting stuck in a side-tangent here. The precise definition of "terminal emulator" isn't important for this Wikipedia page. What matters here is: Putty + Mosh recognize NEL (encoded as C2 85) as a newline character. Even this empirical evidence is a side-tangent... the main discussion is about whether the Unicode spec fully recognizes NEL (or other C1 characters). --Hirsutism (talk) 15:28, 12 October 2012 (UTC)[reply]
Sure. But your suggested source isn't what one might term authoritative, due to several simple errors. For example, on the paragraph following the one you're interested in, he states

Since VT100 (that uses C1 extensively)...

which is incorrect. Scanning quickly, I see other errors. If you're simply stating that you can find someone agreeing with your point, that's easily done of course (google is your friend). TEDickey (talk) 23:03, 12 October 2012 (UTC)[reply]
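
The UTF-8 point raised in this thread can be checked directly. The following is a small Python sketch, with `is_valid_utf8` being a hypothetical helper name: a C1 code point such as U+0085 (NEL) is representable in UTF-8 as the two bytes C2 85, whereas the bare byte 0x85 on its own is a continuation byte and does not decode.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# U+0085 (NEL) as a code point, encoded to UTF-8.
nel_encoded = "\u0085".encode("utf-8")

# Every C1 code point U+0080..U+009F encodes as 0xC2 followed by the same value.
c1_encodings = {cp: chr(cp).encode("utf-8") for cp in range(0x80, 0xA0)}
```

This says nothing about whether a given terminal acts on the decoded character; it only shows that both statements above (raw 0x80-0x9F bytes are invalid as UTF-8, and C1 code points are still encodable) hold at the same time.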

Octal

Would anyone object were we to add Octal to the table also? We already have decimal and hex. Maratrean (talk) 08:16, 29 October 2011 (UTC)[reply]

Octal is wonderful, but hasn't its time passed? An extra column would be quite confusing, so why add it? There are probably lots of people who really have no interest in octal, so I think a good reason for adding it would be needed. Johnuniq (talk) 09:10, 29 October 2011 (UTC)[reply]
I object too. Of course, octal is derived from hex (or decimal), so it would just be a dependent addition (derivable). Of course one can add: so is decimal - all right. Only, decimal is used directly nowadays (e.g. when entering by keyboard). Someone else could argue: hey, let's add UTF-8, UTF-16, and such. So I do object. -DePiep (talk) 22:14, 30 October 2011 (UTC)


The 'C' column includes many missing entries. In the language 'C' it is ordinary to use octal escape sequences to express and enter these missing entries. Why not fill out the missing entries in the C column in octal - such as '\003' - solves the OP, completes the column, and provides a reference to programmers wishing to use the control codes under discussion. — Preceding unsigned comment added by 92.21.236.161 (talk) 00:20, 5 February 2015 (UTC)[reply]
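
As a rough illustration of what such entries would look like, here is a sketch in Python (chosen only for convenience; `c_octal_escape` is a hypothetical helper, not something from the article). It formats a code value as a three-digit octal escape of the kind C accepts in string and character literals. Python string literals happen to accept the same `\ooo` form.

```python
def c_octal_escape(code: int) -> str:
    """Format a single-byte code value as a three-digit octal escape, e.g. \\003."""
    if not 0 <= code <= 0xFF:
        raise ValueError("expected a single-byte value")
    return "\\{:03o}".format(code)


# Octal escapes for the whole C0 range, keyed by code value.
c0_octal = {code: c_octal_escape(code) for code in range(0x20)}
```

For example, ETX (0x03) comes out as `\003` and ESC (0x1B) as `\033`.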

7F

7F is delete. Which control code operates this? Kg pwn (talk) 22:55, 14 June 2012 (UTC)[reply]

In Unix, it's sometimes referred to as "Ctrl-?" or "^?"... AnonMoos (talk) 05:25, 15 June 2012 (UTC)[reply]
Yeah, but is it like... C2... or something — Preceding unsigned comment added by Kg pwn (talk • contribs) 19:25, 1 August 2012 (UTC)

Neither - ECMA-35 / ISO-2022 make SPACE and DELETE special cases (not control characters, and not a member of C0/C1). The positions used for those in the 128-255 range are printable characters, by the way. TEDickey (talk) 23:55, 1 August 2012 (UTC)[reply]

Restructuration

I suggest restructuring this article as follows:

  • Principles
    (why control codes)
  • History
    (main dates)
  • Interoperability
    • Main standards interoperability issues
      utf-8, windows-1252, etc.
    • Main protocols and applications
      terminal, file text, unix, videotext, etc
  • Code assignations
    • C0 set
    • C1 set
  • Example of sequence using control code — Preceding unsigned comment added by 84.97.14.22 (talk) 17:25, 19 July 2012 (UTC)[reply]

Various standards

http://www.itscj.ipsj.or.jp/ISO-IR/2-6.htm — Preceding unsigned comment added by 77.198.9.102 (talk) 23:21, 24 July 2012 (UTC)[reply]

^X links

These links are all circular, or point to articles about usage of shortcut combinations on Windows, which has nothing to do with control codes. I recommend reverting the addition of them. Spitzak (talk) 05:20, 21 September 2013 (UTC)

I partially agree with your observation, but not with your conclusion.
I deliberately put the links in because semantically there is a difference between a control character given in notation ^X (specifies a key combination with Ctrl, not a specific function - associated functions are operating system and application specific), a control character given in notation \x (specific formatting to some programming languages), named control characters distinguished by function (Linefeed, Tabulator, Bell, Null) or named control characters distinguished by code (NUL, ETX, etc.) in specific standards like ASCII etc.
While not being circular, at present some of the links have the same target (which often does not reflect the above semantics correctly), but this is a problem of sub-optimal target linking in redirects rather than a problem of adding local links to the terms as is. We will have to retarget some redirects and restructure some articles to create semantically more correct link targets, but this won't happen overnight. However, we will create awareness of this "unevenness" only by starting to incorporate the links - over time, this will create a momentum which will help to shift the targets to be more semantically correct. If we don't add the links, neither the semantic differences nor the structure will become apparent to most users, so changes in this area would happen only randomly and without a clear direction rather than systematically following some overall structure.
--Matthiaspaul (talk) 11:12, 21 September 2013 (UTC)[reply]
The ^X notation actually indicates the character with the value of an ASCII 'X' xor'd with 0x40. Although often the same, it is not a symbol for the key sequence. For instance, ^@ means a character that is more likely produced by typing ctrl+space. In any case I think links leading to discussion of Windows shortcuts are wrong; these shortcuts are processed directly from keyboard input and at no point is a C0/C1 control code ever used. Spitzak (talk) 01:52, 29 May 2014 (UTC)
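
The xor rule described above can be sketched in a few lines of Python. The function names `caret_to_code` and `code_to_caret` are hypothetical and exist only for this illustration; treating a lower-case letter after the caret as its upper-case form is an assumption of the sketch, not part of the rule.

```python
def caret_to_code(notation: str) -> int:
    """Map caret notation such as '^X' to the code value it denotes."""
    if len(notation) != 2 or notation[0] != "^":
        raise ValueError("expected a caret followed by one character")
    ch = notation[1].upper()  # assumption: accept '^x' as a spelling of '^X'
    if not 0x3F <= ord(ch) <= 0x5F:
        raise ValueError("character is outside the '?'..'_' range")
    return ord(ch) ^ 0x40


def code_to_caret(code: int) -> str:
    """Map a C0 code (0x00-0x1F) or DEL (0x7F) to caret notation."""
    if not (0x00 <= code <= 0x1F or code == 0x7F):
        raise ValueError("not a C0 control code or DEL")
    return "^" + chr(code ^ 0x40)
```

Under this rule ^X is 0x18 (CAN), ^@ is 0x00 (NUL) regardless of which keys produce it, and ^? is 0x7F (DEL), which matches the "Ctrl-?" naming mentioned in the 7F thread.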

Purpose

What this article doesn't really make clear is why C0 and C1 are in Unicode. The use of U+2400 ... U+243F is immediately obvious, and I guess it makes some sense to reserve NUL, TAB, CR and LF.

But what are you supposed to do when you encounter SI? Obviously you aren't meant to switch to a different character set, because if people wanted to encode a character not in Unicode they'd use a PUA character. Maybe it's part of a quoted string of bytes to send to some machine for which SI does make sense? No, because then you'd use the visual representation ␏.

If you find BEL, are you supposed to sound a bell? Of course not. A Unicode text is just that, text, not a string of instructions to do something. Even when displayed, it tends to be scrollable and no bell moment exists. And you wouldn't want to allow text to ring bells anyway. Again, for quoted bytes there's the visual representation.

What about SOH? Again, meaningless in text unless quoted. Most of these control codes are useless as part of text. Insofar as they make sense at all, it's as formatting, which isn't within the Unicode scope, but within things like HTML and CSS, or whatever format your word processor uses. The only reason it makes sense to reserve NUL, TAB, CR and LF is the sheer ubiquity of simple file formats (we call them text files, but they do contain formatting in addition to text) and memory representations of strings that need these.

So the question is, what is the purpose of the C0 and C1 control codes? — Preceding unsigned comment added by 82.139.81.0 (talk) 18:44, 28 May 2014 (UTC)[reply]

They're in Unicode to preserve compatibility with ASCII etc. character sets. AnonMoos (talk) 03:36, 7 February 2015 (UTC)[reply]
C1 comes from ISO-6429 (aka ECMA-48), and ISO-2022 (aka ECMA-35). It is not so much for compatibility (since the Unicode standard merely lists the names without attempting to describe functionality) as because ISO 10646 grew out of the standardization work for the older encodings. Because Unicode does not describe functionality, it does not standardize C0/C1; it merely makes a few assumptions, relying upon those other documents as the relevant standards. TEDickey (talk) 12:05, 7 February 2015 (UTC)

sources discussing smtp rather than ISO 10646

The given sources are discussing smtp rather than ISO 10646 as such:

The following is a draft for an RFC updating SMTP to allow and encourage use of ISO 10646 (now DIS, of course).

and without a more suitable supplementary source, the statements do not match the source TEDickey (talk) 23:55, 7 April 2015 (UTC)[reply]

If you read this paragraph:
In Internet messages, the dynamic compaction method (compaction method 5) is used, the initial state being G=32, P=32, R=32, with each octet specifying a value of C. (Translated into normal English, that sentence means: "The text is in 8-bit Latin-1 until we get to the first HOP, if any!") Transitions to other character sets, represented by rows and, in some cases, planes, is done with a sequence that begins with the HOP ("High Octet Preset") code (decimal 129). The SGCI ("Single Graphic Character Introducer") is not used (i.e. we use "level 1" of method 5).
It's pretty clear to me it is discussing how the ISO 10646 draft is applied to SMTP. It's not introducing HOP or SGCI itself, it is pulling them from the draft. It would be great if someone could find old ISO 10646 drafts and we could quote them instead, but even in the absence of copies of those old drafts, I don't think there is any other plausible interpretation of this paragraph. SJK (talk) 12:23, 9 April 2015 (UTC)[reply]

Without the said draft, you cannot distinguish the interpretation which you wish to make from an equally plausible one that refers to some ISO-2022 feature which is commented upon as not being in ISO 10646. As such, your commentary in the topic amounts to original research. As I said, you need a supplementary source to provide the information rather than interpreting TEDickey (talk) 00:43, 10 April 2015 (UTC)[reply]

Please see Ken Whistler, Formal Name Aliases for Control Characters, L2/11-281, Unicode Consortium, July 20, 2011, which explains the situation much better than my previous reference did:

Notes Regarding Omissions

I have deliberately omitted three control code names and their abbreviations
which occur in one (obsolete) RFC, but which are an artifact of early
unapproved drafts of 10646. To wit:

0080 PADDING CHARACTER (PAD)
0081 HIGH OCTET PRESET (HOP)
0099 SINGLE GRAPHIC CHARACTER INTRODUCER (SGC)

Those 3 were proposed (on spec) in early drafts of 10646, for what became
a failed architectural direction for 10646. They would be completely forgotten
now except for the persistent (and pernicious) RFC that lists them without
indicating their failed status. Nobody has ever implemented them, so they
are nothing more than character encoding curiosities.

So this reference justifies my inference as correct. I will replace my prior reference with this one. SJK (talk) 10:52, 10 April 2015 (UTC)[reply]

Missing information

These control codes had names in Unicode 1.0 but these names were later removed. The article should explain when and why.

10646-1 forbids the use of C1 controls, requiring an ESC Fe sequence instead. The article should detail when and why this came about and whether or not it is still in force in Unicode. — Preceding unsigned comment added by 82.139.82.82 (talk) 03:22, 6 September 2015 (UTC)

That (ESC Fe) was made obsolete a long time ago, and removed. See this for example. TEDickey (talk) 12:55, 6 September 2015 (UTC)[reply]
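
For readers unfamiliar with the term, an ESC Fe sequence is the two-byte, 7-bit form of a C1 control: ESC (0x1B) followed by a byte in the range 0x40-0x5F, which stands for the C1 code 0x40 higher. A minimal Python sketch of that mapping follows; the function names are hypothetical and the sketch covers only the byte arithmetic, not any standard's rules about when either form is permitted.

```python
ESC = 0x1B


def c1_to_esc_fe(code: int) -> bytes:
    """Return the 7-bit ESC Fe form of an 8-bit C1 control code (0x80-0x9F)."""
    if not 0x80 <= code <= 0x9F:
        raise ValueError("not a C1 control code")
    return bytes([ESC, code - 0x40])


def esc_fe_to_c1(seq: bytes) -> int:
    """Return the 8-bit C1 control code denoted by a two-byte ESC Fe sequence."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError("not an ESC Fe sequence")
    return seq[1] + 0x40
```

For instance, CSI (0x9B) corresponds to `ESC [` and ST (0x9C) to `ESC \`.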

merge vs deletion

While it's interesting that Unicode has a subset of C0/C1 codes, deleting most of the content of this topic to replace it by a redirect to a summary paragraph should have some discussion involving the editors who've been maintaining the page. TEDickey (talk) 08:28, 4 August 2016 (UTC)[reply]

C1 control pictures

Why are there no C1 control pictures in the UCS? 1234qwer1234qwer4 (talk) 15:19, 2 June 2019 (UTC)[reply]

For instance this? Likely disinterest on the part of the committee members who were not involved in software development TEDickey (talk) 16:25, 2 June 2019 (UTC)[reply]
The Unicode Public General Mail List is probably a better place to ask this question. Google "c1 control pictures" site:unicode.org to see the discussions that have already taken place. If your question is "Why do C0 controls get pictures but not C1 controls?" then the short answer is compatibility with a legacy encoding that had C0 control pictures. DRMcCreedy (talk) 16:31, 2 June 2019 (UTC)[reply]
Actually, asking on a mailing list can get mixed results. If I wanted to know, I'd ask Frank. Either way, unless someone points to a mail-archive discussing the relevant issues, the best you'd get would be a primary source (unsuitable for topic development). TEDickey (talk) 19:15, 2 June 2019 (UTC)[reply]

What do C0 and C1 mean? Where did they come from? Are there also C2, C3, or did these exist?

I'd like to see the article explain the origin of the terms "C0" and "C1" and answer all these questions. --RokerHRO (talk) 16:25, 14 April 2020 (UTC)

JSON_streaming#Record_separator-delimited_JSON

I'd like to add a link to JSON streaming#Record separator-delimited JSON but I am unsure where it would fit best. --RokerHRO (talk) 22:40, 5 March 2021 (UTC)[reply]

Perhaps in the rightmost column of the table in C0 and C1 control codes#Basic ASCII control codes - there's a big box for FS/GS/RS/US, mentioning various uses of those control characters. Guy Harris (talk) 22:59, 5 March 2021 (UTC)[reply]

State machines

This text in C0 codes is certainly anachronistic and arguably simply wrong:

  • This large number of codes was desirable at the time, as multi-byte controls would require implementation of a state machine in the terminal, which was very difficult with contemporary electronics and mechanical terminals

State machines per se were neither difficult nor expensive. Shift states were required for existing coding systems such as Baudot, and were significantly less complex than the shift registers already needed for sending and receiving serial communication.

A state machine that could interpret VT-100 style escape sequences however would have been prohibitive in 1964.

The prime reason for avoiding shift states (or state machines in general) was to cope better with unreliable transmission, though I don't have a citation for that.

To describe 32 as a "large number" is laughable compared with the hundreds of controls that are implemented as sequences of bytes by typical terminal emulators.

Bitwise interpretation of ASCII codes
Maybe this table might be useful in an article, once we've figured out which article.

bits                meaning
0000000, 1111111    no action; ignored
00_____             Controls
__00___             Transmission controls, affecting DCEs
__01___             Layout controls, driving the motors in printers
__10___             Terminal controls, including shift states and device-specific functions
__11___             File format markers
01_____             Digits & punctuation
1______             Letters
_0_____             Upper-case
_1_____             Lower-case

Although ASCII was designed as a coding system for transmission, unlike previous coding systems it could also function as an encoding for computation, with each printable character fitting into a single machine word ("byte", as we would know it today). This meant that there needed to be in excess of 64 codes, dictating a minimum of 7 bits.

As only around 80-90 graphic characters were envisaged, it would have seemed foolhardy to "skimp" on control codes; clearly at least 16 would be useful.

As there are broadly 4 classes of control codes, and a need for at least 5 transmission controls and 6 format controls, it made logical sense to reserve 4 groups of 8 codes, or 32 in all.

The eventual ASCII standard included codes that deviated from this simple arrangement, but this initial framework is still plain to see.

Martin Kealey (talk) 03:04, 13 August 2022 (UTC)[reply]
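
The bit table in the comment above can be read as a small decision procedure. The sketch below, in Python, follows the table exactly as posted; it is an illustration of that proposed framework only, and as the comment itself notes, the eventual ASCII standard deviates from it in places. The name `classify` and the label strings are hypothetical.

```python
# Labels for the four control groups, selected by bits 4 and 3 of a 7-bit code.
CONTROL_GROUPS = (
    "transmission controls",  # __00___
    "layout controls",        # __01___
    "terminal controls",      # __10___
    "file format markers",    # __11___
)


def classify(code: int) -> str:
    """Classify a 7-bit code according to the bit table posted above."""
    if not 0 <= code <= 0x7F:
        raise ValueError("expected a 7-bit code")
    if code in (0b0000000, 0b1111111):
        return "no action; ignored"
    if code >> 5 == 0b00:  # 00_____ : controls
        return CONTROL_GROUPS[(code >> 3) & 0b11]
    if code >> 5 == 0b01:  # 01_____ : digits & punctuation
        return "digits & punctuation"
    # 1______ : letters; bit 5 separates upper-case from lower-case
    return "lower-case letters" if code & 0b0100000 else "upper-case letters"
```

Under this reading LF (0x0A) falls in the layout group, DC1 (0x11) in the terminal group, and FS (0x1C) among the file format markers.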