Wikipedia talk:WikiProject Typography/Unicode

Initial discussion

This discussion was moved here from Wikipedia talk:WikiProject Typography#Unicode tables.

Hi folks. Some articles contain table grids of Unicode glyphs. Each grid row contains information on several Unicode code points. Each grid cell contains several sub-elements for a particular code point. See, for example, Letterlike Symbols. I find this layout extremely confusing and hard to use. I would like to change it to a more linear table, with one code point per row. Something roughly like the table in Miscellaneous Symbols. Comments? Objections? —DragonHawk (talk|hist) 18:34, 10 October 2008 (UTC)

I like the presentation used in the Miscellaneous Symbols article, but not the implementation. When the point is to show the shape of a character, we should not depend on the correct font being installed on the user's computer, since if his font is incorrect he will be misled into thinking it is correct. Instead, we should use graphics to draw the character. I haven't figured out how to embed a graphic in a table, so I made the table into a graphic in OCR-A font. John Sauter (talk) 03:46, 28 October 2008 (UTC)

Since writing the above paragraph I have figured out how to include the shape of a character in a table. Briefly, I use Inkscape to create a 1000-point character, convert it to a path, save the .svg file, upload it to Wikimedia, and reference it using the Image: prefix, with a size of 10 pixels. If the reader clicks on the graphic, he sees a very detailed representation. Here is an example: . John Sauter (talk) 12:53, 3 November 2008 (UTC)
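
As a rough illustration of the last step described above, the image reference can be built like this. This is only a sketch: the function name and the file name are hypothetical, and it assumes the standard `[[Image:...|10px]]` form of the markup.

```python
def image_wikitext(file_name: str, width_px: int = 10) -> str:
    """Return MediaWiki markup that embeds an uploaded file at a given pixel width."""
    if width_px <= 0:
        raise ValueError("width_px must be positive")
    return f"[[Image:{file_name}|{width_px}px]]"


# Hypothetical file name, used purely for illustration.
print(image_wikitext("Example glyph.svg"))
```

Keeping the width in one place makes it easy to change the thumbnail size for a whole table later.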

Some of the Unicode tables do provide separately rendered images of each glyph. When available, I think such should be included, for the same reasons you identify. However, I think we should also include the "as is" Unicode character, so that browsers which can render it natively do so. That also makes it possible to use the articles as a copy-and-paste reference (like Charmap). · I'll work on a table design that incorporates everything, and post here with progress. —DragonHawk (talk|hist) 15:24, 3 November 2008 (UTC)

Okay, here's my first pass at it, using Letterlike Symbols as a pilot case. I think this new layout is much better than the old layout, both for the reader and the editor. With one codepoint per row, it's easier for the eye to correlate fields and the current character. It can also be a sortable table. For the editor, the markup is much cleaner and easier to work with. · It does tend to leave a lot of whitespace to the right of the table on wider screens. I may eventually try tackling that with CSS columns. One thing at a time. :) · I did the conversion using a purpose-written Perl script. Hopefully, I can re-use it for additional conversions, making them almost easy. · Suggestions, comments, commendations, condemnations? —DragonHawk (talk|hist) 05:55, 4 November 2008 (UTC)
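
For anyone who wants to experiment with the one-code-point-per-row layout, here is a minimal sketch in Python. It is not the Perl script mentioned above, which is not shown here; the function name and column set are assumptions, and the Image column is left out because it depends on which image files exist.

```python
import unicodedata


def unicode_table(first: int, last: int) -> str:
    """Build a sortable wikitable with one named code point per row."""
    lines = ['{| class="wikitable sortable"', "! Char !! Name !! Hex !! Decimal"]
    for cp in range(first, last + 1):
        # Code points without a name in this Python's Unicode data are skipped.
        name = unicodedata.name(chr(cp), None)
        if name is None:
            continue
        lines.append("|-")
        # Note: characters that are special in wikitext (such as "|") would
        # need escaping; this sketch does not handle them.
        lines.append(f"| {chr(cp)} || {name} || {cp:04X} || {cp}")
    lines.append("|}")
    return "\n".join(lines)


# First few code points of the Letterlike Symbols block.
print(unicode_table(0x2100, 0x2105))
```

Because the names come from Python's `unicodedata` module, the output reflects whichever Unicode version that Python build ships with.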

I do like your new layout, but I think it would be better still if the Char and Image columns were described in more detail. Perhaps "The Char column shows how your browser renders the character; for obscure characters you might see a box containing the hexadecimal code in small type, question marks, or nothing at all. The Image column shows the character rendered using the (something) font." If you are using several different fonts to image all the characters, that should be explained.

Also, I think the Hex column should be formatted as "U+(hex number)" or "\u(hex number)" since that is how these numbers will be used. Similarly, the decimal column could be "&u(decimal number);". I think that change would also allow the Hex column to sort. See OCR-A Regular characters for an example of a sortable Hex column. John Sauter (talk) 15:38, 5 November 2008 (UTC) John Sauter (talk) 16:34, 28 November 2008 (UTC)

I just realized I never responded to this. Point by point:

I plan on coming up with some kind of boilerplate description of the table columns, to cover what you mention, and more. Probably as a template, to save effort and keep consistency. But I planned on waiting until all conversions were done. Something might come up.
I didn't render those images myself; I just used images already existing on Wikipedia. Worrying about the correctness or source of those images is outside the scope of my effort. If you want to attack that aspect, please do! :)
- It would be really nice to have SVGs of all the Unicode characters, but I'm not sure of the copyright issues around that. I think we'd need to use a source font that was GFDL or CC compatible. I expect you know more about this than I do.

I doubt I know more than you, but my opinion is that an image of a single character, or a meaningless list of characters like abcdefghijklmnopqrstuvwxyz, cannot be copyrighted since there is no "creative content" beyond the shape of the character. At least in the United States, fonts cannot be copyrighted. However, keep in mind that the concept of "image of a Unicode character" is flawed, since a character does not correspond to a particular shape. You need both a character and a font to get a shape. John Sauter (talk) 16:39, 1 December 2008 (UTC)

- If you do end up doing work on this, I would suggest a file name format of U+xxxx.svg, to be compatible with existing images of that format.

Unfortunately, different vendors have placed glyphs at different code points, at least with OCR-A. That is why I have chosen to use the character's name rather than its code point when creating the image. John Sauter (talk) 16:30, 1 December 2008 (UTC)

Syntax such as "U+", "\u", "&u", etc., is specific to the context. HTML isn't the same as Perl, etc. That's why I went with no prefix for decimal, and the WP:MOSNUM recommendation for hex. The values are universal.
However, you're right in that the 0x prefix breaks table column sorting. I didn't even realize that. I've adjusted Letterlike Symbols to just give the four character hexadecimal value, without any prefix, and it sorts properly now. The table makes it clear these are hex, so no prefix is needed. And no-prefix is also more universal.

Make sense? Anyone else have anything they'd like to say? I'll start attacking more pages Real Soon Now, if there are no objections. —DragonHawk (talk|hist) 06:39, 1 December 2008 (UTC)
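
To make the point about context-specific syntax concrete, the sketch below prints one code point in several notations. The function and key names are made up for illustration. Note that HTML numeric character references take the form `&#...;` (decimal) or `&#x...;` (hexadecimal), and that the `\u` form shown here only covers code points up to FFFF.

```python
def notations(ch: str) -> dict:
    """Return several common spellings of a single character's code point."""
    cp = ord(ch)
    result = {
        "plain_hex": f"{cp:04X}",       # bare value, no context implied
        "plain_decimal": str(cp),       # bare value, no context implied
        "unicode": f"U+{cp:04X}",       # Unicode standard notation
        "html_decimal": f"&#{cp};",     # HTML numeric character reference
        "html_hex": f"&#x{cp:X};",      # HTML hexadecimal character reference
    }
    if cp <= 0xFFFF:
        # Four-digit backslash escape as used in several programming languages.
        result["backslash_u"] = f"\\u{cp:04X}"
    return result


for label, text in notations("\u2103").items():
    print(f"{label:14} {text}")
```

The bare hex and decimal values are the only entries that carry no assumption about where the number will be pasted.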

Continuation of discussion

(This part of the discussion took place after it was moved to a separate page by DragonHawk on 22 June 2009.)

Hex prefix

I continue to disagree with the removal of the U+ prefix from the hex column. It does not prevent sorting (though other prefixes do) and it is not terribly difficult to copy just the hex digits to another context when doing copy and paste. In favor of keeping the U+ prefix is that it is used by the Unicode standard to designate a Unicode character. John Sauter (talk) 05:07, 22 June 2009 (UTC)

Sorry, I didn't realize you (or anyone else) felt that strongly about it. Originally you had just said the numbers should be prefixed, so that seemed a lot less directed, especially when some of the suggested prefixes implied a context like HTML, or prefixes for decimal expressions. If it will sort properly, I think it's reasonable to go with what the Unicode standard uses. • I'd like to see thoughts from more people, just on general principles, but this is probably sufficiently esoteric for that to be unlikely. —DragonHawk (talk|hist) 11:59, 22 June 2009 (UTC)
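
One way to see why a constant "U+" prefix need not disturb sorting is the sketch below. It assumes the column is compared as plain text, which may not match how the sortable-table script actually detects column types, and the function name is hypothetical. A fixed prefix followed by uppercase hex digits of equal width sorts textually in the same order as the numeric values.

```python
def sorts_like_numbers(code_points, width=4):
    """Check whether text order of 'U+' labels matches numeric order."""
    labels = [f"U+{cp:0{width}X}" for cp in code_points]
    by_text = sorted(labels)
    by_value = [f"U+{cp:0{width}X}" for cp in sorted(code_points)]
    return by_text == by_value


# Zero-padded to four digits: text order and numeric order agree.
print(sorts_like_numbers([0x214F, 0x2100, 0x00A9, 0x2103]))

# Without padding, "U+2100" sorts before "U+A9" as text, so the orders differ.
print(sorts_like_numbers([0xA9, 0x2100], width=1))
```

The same caveat applies to code points above FFFF: a table mixing four-digit and five-digit values would need wider padding for a plain text sort to stay in numeric order.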

Font choice

In addition, there is the problem of choosing a font. Some Unicode characters are so obscure that probably nobody would know whether they had been rendered using Bitstream Vera Serif or Century Schoolbook L, but a standard for describing Unicode characters must deal in a reasonable way with all Unicode characters, not just the obscure ones. I suggest that the default font for displaying a Unicode character should be the FreeSerif font distributed with OpenOffice. It seems to match the images published in the Unicode standard reasonably well, and it contains quite a lot of characters. Of course, for the obscure characters we are lucky to find any font which contains the character; I am only suggesting that FreeSerif be the display font when it is a reasonable choice. John Sauter (talk) 05:07, 22 June 2009 (UTC)

Just to be clear, here: Are you referring to the font specified in CSS for rendering literal Unicode characters from the HTML from Wikipedia, or are you referring to the font we should use to render the sample glyph images? —DragonHawk (talk|hist) 12:03, 22 June 2009 (UTC)

I was referring to the font we should use to render the sample glyph images. The font used to render characters in an article should be the choice of the article writer based on the needs of the subject, whether the characters are rendered using Unicode notation or not. However, if an article writer gets frustrated because his characters do not appear in most people's browsers, this page would be a good place to give him advice on how to fix that problem. John Sauter (talk) 13:14, 22 June 2009 (UTC)

That's what I thought, I just wanted to make sure. :) And I agree completely. Font choice in articles isn't something we should dictate here. I also like the idea of using a free font for the sample renderings; I think it's good for Wikipedia to be as "free as possible", regardless of the legal status of font copyright. • If you think FreeSerif is the font to use, I'm perfectly willing to take your word for it. :) • Using one font for most would also let us dodge the issue of different fonts for different characters. We could name each image something like FreeSerif_U+1234.svg. Other fonts may place different glyphs there, but there's only one FreeSerif. —DragonHawk (talk|hist) 02:11, 24 June 2009 (UTC)
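
A small sketch of the naming scheme suggested above, in case it helps settle the exact format. The function name is hypothetical, and the choice of at least four uppercase hex digits is an assumption modelled on the FreeSerif_U+1234.svg example.

```python
def glyph_image_name(code_point: int, font: str = "FreeSerif") -> str:
    """Return an image file name combining the font name and the code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    return f"{font}_U+{code_point:04X}.svg"


print(glyph_image_name(0x1234))
```

Putting the font name in the file name keeps images from different fonts apart even when vendors disagree about which glyph belongs at a code point.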

Standard Explanation

A wiki is a great place to polish an explanation. Let's discuss the standard explanation here, and put the consensus result in the article. Here is my suggestion, to get things started. "The Char column shows how the Unicode character would look in your browser when in an article. You can copy-and-paste the character from there. The Image column shows the standard image of the character, for comparison. Name is the Unicode name of the character. Hex and Decimal are the numeric code point for the character in the Unicode character set. The "U+" prefix in the Hex column is from the Unicode standard. See Wikipedia:WikiProject_Typography/Unicode for further details." John Sauter (talk) 21:22, 22 June 2009 (UTC)

Good idea! I copied that to a section in the projectpage, and started tweaking. I took out some mentions of Unicode, since that's implied by context. I took out the cut-and-paste note. Anyone who wants to do that should be able to figure it out on their own; the design docs mention it so other Wikipedians know to maintain the feature. I turned the part about U+ into a wikilink. And so on.

I'm still not sure how to actually present this legend in articles, though. I imagine we could put it in a template and include the template before/after every Unicode table. That would be clear and obvious, which is good, but it might be ugly and waste space, too. Perhaps some kind of JavaScript to collapse the legend by default? (No JavaScript just means it is expanded by default.) Another option would be a wikilink to the legend, perhaps from the table headers, or just before/after the table. But I'm not sure where to keep the target of the wikilink (WP:namespace, page name, etc.). I'd be interested in hearing your (or others') thoughts on this. —DragonHawk (talk|hist) 01:44, 24 June 2009 (UTC)

Perhaps the information could be presented as a sidebar. I've done that with images in the OCR-A article, but I'm not sure how to do it with text. John Sauter (talk) 02:03, 24 June 2009 (UTC)

Image Column

I agree that if no image of a Unicode character exists, this column must be left blank. However, I think that should almost never be the case, and that much useful information is lost when no image is available. We should encourage the table author to find or create a suitable image, perhaps by giving some suggestions, or even tutorials, in this page. For example, when I needed images of the OCR-A characters, I rendered the characters at 1000 points, then traced the outlines to produce svg files. Another method would be to trace the outline of the sample characters in the Unicode standard. John Sauter (talk) 02:03, 24 June 2009 (UTC)

Agreed on should-always-have-an-image, and adding docs to the project page that explain how-to to editors. • It should be possible to script the process of rendering and tracing each character in Inkscape. That would let us auto-generate a few thousand glyphs to start with. I'll ask around my local Linux User Group, see if anyone there knows anything about Inkscape scripting. —DragonHawk (talk|hist) 02:16, 24 June 2009 (UTC)
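
As one possible starting point for auto-generating glyph images, here is a sketch that skips Inkscape entirely and reads the outline straight from the font file. This is a different technique from the render-and-trace process discussed above. It assumes the third-party fontTools library is installed, and the function names and the choice of the font's ascent and descent for the image height are illustrative assumptions.

```python
def svg_document(path_data: str, width: int, ascent: int, descent: int) -> str:
    """Wrap glyph path data in a minimal SVG document.

    Font outlines have the y axis pointing up and the baseline at y = 0,
    so the path is flipped and shifted down by the ascent.
    """
    height = ascent - descent  # descent is normally negative
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width} {height}">'
        f'<path transform="translate(0 {ascent}) scale(1 -1)" d="{path_data}"/>'
        "</svg>"
    )


def glyph_to_svg(font_path: str, code_point: int) -> str:
    """Return an SVG document outlining one glyph from a font file."""
    # Imported here so the helper above works without fontTools installed.
    from fontTools.pens.svgPathPen import SVGPathPen
    from fontTools.ttLib import TTFont

    font = TTFont(font_path)
    cmap = font.getBestCmap() or {}
    glyph_name = cmap.get(code_point)
    if glyph_name is None:
        raise KeyError(f"font has no glyph for U+{code_point:04X}")

    glyph_set = font.getGlyphSet()
    pen = SVGPathPen(glyph_set)
    glyph_set[glyph_name].draw(pen)  # records the outline as SVG path commands

    hhea = font["hhea"]
    return svg_document(
        pen.getCommands(), glyph_set[glyph_name].width, hhea.ascent, hhea.descent
    )
```

Calling `glyph_to_svg` with the path to a local copy of a font and a code point such as `0x2103` would return the SVG text, which could then be saved and uploaded. Looping over the code points in the font's character map would produce the batch of images in one pass.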