A directory of resources in the field of technical communication.


16 found.


Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems. Unicode consists of a repertoire of more than 100,000 characters, a set of code charts for visual reference, an encoding methodology, and a set of standard character encodings.



The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be? I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff.

Spolsky, Joel. Joel on Software (2003). Articles>Language>Standards>Unicode
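
As a rough illustration of why the Content-Type declaration matters, the sketch below reads the charset parameter from a header value and uses it to decode the page bytes. It is not taken from the article; the helper names and the sample body are hypothetical, and it assumes only Python's standard `email.message` module.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type):
    """Decode bytes using whatever charset the Content-Type declares."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# Hypothetical page body, saved as ISO-8859-1 (one byte per character here).
body = "café".encode("iso-8859-1")
page = decode_body(body, "text/html; charset=ISO-8859-1")
```

Falling back to UTF-8 when no charset is declared is a choice made for this sketch, not a rule from the article.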


Accent Folding for Auto-Complete

A common assumption about internationalization is that every user fits into a single locale like “English, United States” or “French, France.” It’s a hangover from the PC days when just getting the computer to display the right squiggly bits was a big deal. One byte equaled one character, no exceptions, and you could only load one language’s alphabet at a time. This was fine because it was better than nothing, and because users spent most of their time with documents they or their coworkers produced themselves. Today users deal with data from everywhere, in multiple languages and locales, all the time. The locale I prefer is only loosely correlated with the locales I expect applications to process.

Bueno, Carlos. List Apart, A (2010). Articles>Web Design>Localization>Unicode
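
One common way to approximate accent folding, sketched here in Python rather than in the article's own code, is to decompose each character and drop the combining marks before comparing. The function names and the sample names are hypothetical.

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks and case so 'José' and 'jose' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns 0 for base characters and non-zero for accents.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def autocomplete(prefix, candidates):
    """Return candidates whose folded form starts with the folded prefix."""
    key = fold_accents(prefix)
    return [c for c in candidates if fold_accents(c).startswith(key)]


matches = autocomplete("jose", ["José", "Joseph", "Jürgen"])
```

Folding is applied only to the comparison key, so the suggestions shown to the user keep their original accents.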


Character Matters

Documents are made of characters; XML documents are made of Unicode characters. Compared with SGML, we now have potentially one million characters where SGML provides only a hundred, but on the other hand we have lost the option of defining our own SDATA entities. This presents us with two challenges. The first is: how can we validate that a document, an element, or an attribute contains only those characters that we know how to process, render, sort, seek, hyphenate, capitalise, pronounce...? How can we tell a typesetter which character set he has to find a font for? XML Schema provides a simple way of restricting the set of valid characters in an attribute or a simple element to a regular expression, which can use some of the Unicode character properties, such as the block a character is defined in (like Basic Latin or Latin Extended-B) or the General Category (like Uppercase Letter or Math Symbol), but you can't use that in mixed content, as is typical in text markup.

van Wijk, Diederik Gerth. IDEAlliance (2004). Articles>Information Design>XML>Unicode
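
The same kind of check can be sketched outside XML Schema. The Python below tests a string against a set of General Category codes and against the Basic Latin range. It is an illustration only, not the paper's method: the function names are hypothetical, and because the standard library exposes no block names, the block test is written as a plain code point range.

```python
import unicodedata


def all_in_categories(text, allowed):
    """True if every character's General Category code is in `allowed`."""
    return all(unicodedata.category(ch) in allowed for ch in text)


def in_basic_latin(text):
    """True if every character lies in the Basic Latin block (U+0000..U+007F)."""
    return all(ord(ch) <= 0x7F for ch in text)


# "Lu" is Uppercase Letter and "Sm" is Math Symbol.
upper_only = all_in_categories("XML", {"Lu"})
mixed_case = all_in_categories("Xml", {"Lu"})
math_only = all_in_categories("\u2211+", {"Sm"})  # N-ARY SUMMATION and PLUS SIGN
```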


Demystifying Unicode

The concept of the Unicode character set began in 1987, thanks to Joe Becker from Xerox and Mark Davis from Apple. The following year, Becker, Davis, and Lee Collins (currently of Xerox; formerly of Apple) began investigating the design and soon made the case for Han unification to ANSI and ISO. Unicode is, indeed, based on the historic evolution of the Chinese character set (Han). Several people from various high-tech companies began holding bimonthly meetings in 1989. By the end of 1990, an initial, full-review draft was created. In 1991, the group became the Unicode Consortium, a non-profit organization incorporated as Unicode, Inc. Version 1.0 became available to the public for the first time in 1992.

Vine, Andrea and Bill Hall. SDL International (1998). Articles>Language>Localization>Unicode



Finding the right fonts and resolving font issues to properly display your content.

i18nGurus.com. Resources>Directories>Typography>Unicode


Guide to the Unicode Standard

This document is mainly intended for “ordinary” people who read the Unicode standard in order to get information about some particular characters or character processing issues that are important to them. The standard, though available online, is difficult to use without some help, and you can easily miss essential information when looking up things in it.

Tampereen Teknillinen Yliopisto (2005). Articles>Language>Localization>Unicode


i18n Gurus

An open directory of links to internationalization (i18n) resources and related material.

i18nGurus.com. Organizations>Language>Localization>Unicode


Understanding Bidirectional (BIDI) Text in Unicode

A little-understood corner of Unicode is its handling of bidirectional text (the spec is a little dry). While English is read left to right, plenty of scripts (notably Arabic and Hebrew) are read from right to left. When only a single direction of text is used in a document, it's fairly straightforward, but when texts with different directions are mixed in one document, some difficulty arises in determining direction. This document attempts to explain how bidirectional text in Unicode works and what this means for the web. In the Unicode standard, characters have a representational order in memory (which English speakers tend to think of as left to right, but is really start-to-finish in a file), which the bidirectional algorithm then operates on to determine the display characteristics.

Henderson, Cal. Iamcal (2009). Articles>Language>Localization>Unicode
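
To make the starting point of the bidirectional algorithm concrete, the sketch below lists the bidirectional class of each character in logical (memory) order. It does not implement the algorithm itself and is not from the article; the function names are hypothetical, and the Hebrew sample is written with escapes so the source reads in one direction.

```python
import unicodedata


def bidi_classes(text):
    """Return the bidirectional class of each character, in logical order."""
    return [unicodedata.bidirectional(ch) for ch in text]


def has_mixed_direction(text):
    """True if the text holds both strong left-to-right and right-to-left characters."""
    classes = set(bidi_classes(text))
    # "L" is strong left-to-right; "R" and "AL" are strong right-to-left.
    return "L" in classes and bool(classes & {"R", "AL"})


# Latin word, space, then the Hebrew word shalom (U+05E9 U+05DC U+05D5 U+05DD).
sample = "abc \u05e9\u05dc\u05d5\u05dd"
classes = bidi_classes(sample)
```

The string is stored start to finish regardless of script; only the display order changes once the algorithm runs over these classes.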



Unicode resources and information.

i18nGurus.com. Resources>Language>Localization>Unicode


The Unicode Character Database

The site lists most of the code points in different tables, ordered by block, category, bidirectional value, or some of the additional properties defined in the original UCD. Everything is plain HTML, and the Unicode version of each character is shown. The ordering by category is especially helpful for finding characters from different blocks.

Auer, Juergen. SQL und XML (2004). (German) Resources>Language>Standards>Unicode


Unicode Consortium Technical Report on Unicode Security Considerations

Unicode Technical Report #36 on Unicode Security Considerations "describes some of the security considerations that programmers, system analysts, standards developers, and users should take into account [when using the Unicode Standard], and provides specific recommendations to reduce the risk of problems."

Cover Pages (2005). Articles>Language>Security>Unicode
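
One of the risks in this area is a string that mixes look-alike letters from different scripts. The sketch below is a crude heuristic only, not a recommendation from the report: it guesses each letter's script from the first word of its Unicode character name, which is an assumption that holds for common Latin and Cyrillic letters but is no substitute for real script data. The function names are hypothetical.

```python
import unicodedata


def scripts_used(text):
    """Guess the scripts in `text` from the first word of each letter's name."""
    found = set()
    for ch in text:
        if ch.isalpha():
            # Pass a default so unnamed code points do not raise ValueError.
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found


def looks_mixed_script(text):
    """True if the letters appear to come from more than one script."""
    return len(scripts_used(text)) > 1


genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A
```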


Unicode: Ein paar Anmerkungen

ISO 10646 is a character set that maps characters to binary code numbers. Unicode, by contrast, assigns the 2- or 4-byte code to the same characters but supplements the definition with character properties, implementation rules, and notes. Unicode is a private organization of various commercial companies, academic institutions, and user groups. ISO (International Standards Organisation, a sub-organization of the UN) and Unicode have worked together since 1991 to avoid discrepancies between ISO 10646 and Unicode.

Transcom. (German) Articles>Language>Localization>Unicode


Unicode: Making the Web Safe for Furriners

I think that Internet and World Wide Web are capitalized because they are proper names. Many names are capitalized common nouns: the White House, the Ninth Circle of Hell, the Heritage Foundation, the Civil War. I've heard arguments for lowercasing Internet and World Wide Web from people who compare them to things like the telephone system, but lowercase is certainly not the predominant style for these terms. At least 90 percent of the time, they're capitalized, and I don't think you should ignore actual use completely.

Ivey, Keith C. Editorial Eye, The (2003). Articles>Language>Standards>Unicode


Use the Unicode Database to Find Characters for XML Documents

The Unicode Consortium is dedicated to maintaining a character set that allows computers to deal with the vast array of human writing systems. When you think of computers that manage such a large and complex data set, you think databases, and this is precisely what the consortium provides for computer access to versions of the Unicode standard. The Unicode Character Database comprises files that present detailed information for each character and class of character. The strong tie between XML and Unicode means this database is very valuable to XML developers and authors. In this article Uche Ogbuji introduces the Unicode Character Database and shows how XML developers can put it to use.

Ogbuji, Uche. IBM (2006). Articles>Language>Localization>Unicode
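
Python's standard `unicodedata` module is built from the Unicode Character Database, so a small lookup can be sketched with it. This is an illustration, not the article's code; the `describe` helper and its dictionary keys are hypothetical.

```python
import unicodedata


def describe(ch):
    """Collect a few database properties and an XML character reference."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        # Numeric character reference usable in any XML document.
        "xml_reference": f"&#x{ord(ch):X};",
    }


# Find a character by its official name, then describe it.
copyright_sign = unicodedata.lookup("COPYRIGHT SIGN")
info = describe(copyright_sign)
```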


UTF-8: Documents With a Lot of Character

Did you ever build a webpage in Homesite without encoding the HTML entities? Then, probably when the client had a look at it, all the German umlaut characters looked awkward on a Mac? And did you figure out why? It's because of the charsets and the encoding of the characters in the saved file!

Opitz, Pascal. Content with Style (2005). Design>Web Design>Localization>Unicode
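
The effect can be reproduced in a few lines. The sketch below, which assumes a hypothetical sample string and uses Python's built-in codecs, saves the same text in two encodings and then reads the bytes back with the wrong one.

```python
text = "Käse für Müller"

utf8_bytes = text.encode("utf-8")       # each umlaut becomes two bytes
latin1_bytes = text.encode("latin-1")   # each umlaut is a single byte

# Reading Latin-1 bytes with the Mac OS Roman table gives the wrong glyphs.
seen_on_mac = latin1_bytes.decode("mac_roman")

# Reading UTF-8 bytes as Latin-1 turns each umlaut into two characters.
mojibake = utf8_bytes.decode("latin-1")

# Declaring and using the same encoding on both ends round-trips cleanly.
round_trip = utf8_bytes.decode("utf-8")
```

The bytes in the file never change; only the table used to interpret them does, which is why the declared charset has to match the encoding the file was saved in.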


Walking Backwards: Supporting Non-Western Languages on the Web

IBM is apparently building Hebrew support into the Mozilla project, but AOL/Netscape has as yet not said a word about its plans, if any, for including the BiDi support code in the upcoming Netscape 6.

Forbes, Shoshannah L. List Apart, A (2000). Design>Web Design>Localization>Unicode
