User:Wtrmute/UTF-8

From Wikipedia, the free encyclopedia

The regular language UTF-8[edit]

When deciding whether a UTF-8 string is valid, one must look for whether the byte classes (marked by the prefixes as designed by Thompson, etc) follow one another in a correct way, but one must also guard against overlong forms. Finally, in principle, one must also reject the encoding of the surrogate pair code points, though this is generally not done on purpose so the UTF-8 may encode invalid UTF-16 strings reversibly ("WTF-8"). The resulting language, even applying these restrictions, is still regular, as the below table shows:

Bits of
code point
First
code point
Last
code point
Bytes in
sequence
Regular expression
  7 U+0000 U+007F 1 [\x00-\x7F]
11 U+0080 U+07FF 2 [\xC2-\xDF][\x80-\xBF]
16 U+0800 U+FFFF 3 \xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEF][\x80-\xBF][\x80-\xBF]
21 U+10000 U+10FFFF 4 \xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]|[\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]|\xF4[\x80-\x8F][\x80-\xBF][\x80-\xBF]

An UTF-8 string is a sequence of any of the byte sequences above, and therefore a valid UTF-8 string (more properly, WTF-8 string) matches the following language:

/^([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEF][\x80-\xBF][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]|[\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]|\xF4[\x80-\x8F][\x80-\xBF][\x80-\xBF])*$/

If we want to refuse the codification of Surrogate Pairs, then the 3-byte forms have to be split:

\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC][\x80-\xBF][\x80-\xBF]|\xED[\x80-\x9F][\x80-\xBF]|[\xEE-\xEF][\X80-\xBF][\X80-\xBF]

The language of all UTF-8 strings, as per RFC 3629, is therefore:

/^([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC][\x80-\xBF][\x80-\xBF]|\xED[\x80-\x9F][\x80-\xBF]|[\xEE-\xEF][\X80-\xBF][\X80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]|[\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]|\xF4[\x80-\x8F][\x80-\xBF][\x80-\xBF])*$/

Finally, in CESU-8, all four-byte sequences are replaced by equivalent six-byte sequences:

/^([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC][\x80-\xBF][\x80-\xBF]|\xED[\x80-\x9F][\x80-\xBF]|[\xEE-\xEF][\x80-\xBF][\x80-\xBF]|\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\x8F])*$/

A lot of this complexity was generated to accomodate UTF-16 and its surrogate pair mechanism. If this were not the case, the language would be quite a bit simpler.