A directory of resources in the field of technical communication.

van Wijk, Diederik Gerth



1.
#33798

Character Matters

Documents are made of characters; XML documents are made of Unicode characters. Compared with SGML, we now potentially have one million characters where SGML provided only a hundred, but on the other hand we have lost the option of defining our own SDATA entities. This poses two challenges. The first is: how can we validate that a document, an element, or an attribute contains only those characters that we know how to process, render, sort, search for, hyphenate, capitalise, pronounce...? How can we tell a typesetter for which character set he has to find a font? XML Schema provides a simple way of restricting the set of valid characters in an attribute or a simple element with a regular expression that can use some of the Unicode character properties, such as the block a character is defined in (for example Basic Latin or Latin Extended-B) or its General Category (for example Uppercase Letter or Math Symbol), but this cannot be used in mixed content, which is typical of text markup.
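The kind of check described in the abstract can be approximated outside a schema language. The following is a minimal sketch in Python, not the method of the article: it tests every character of a text run against a set of allowed Unicode blocks and General Categories. The policy itself (`ALLOWED_BLOCKS`, `ALLOWED_CATEGORIES`) and the function names are hypothetical and chosen only for illustration. Because it works on plain strings, it could in principle be applied to the text nodes of mixed content as well as to attribute values.

```python
import unicodedata

# Hypothetical policy: blocks we claim to support, as code point ranges.
ALLOWED_BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Latin Extended-B": (0x0180, 0x024F),
}

# Hypothetical policy: General Categories we know how to process.
# Lu = Uppercase Letter, Ll = Lowercase Letter, Zs = Space Separator,
# Po = Other Punctuation, Sm = Math Symbol.
ALLOWED_CATEGORIES = {"Lu", "Ll", "Zs", "Po", "Sm"}


def in_allowed_block(ch):
    """Return True if the character's code point lies in an allowed block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ALLOWED_BLOCKS.values())


def unsupported_characters(text):
    """List (index, character, category) for every character outside the policy."""
    problems = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        if not (in_allowed_block(ch) and cat in ALLOWED_CATEGORIES):
            problems.append((i, ch, cat))
    return problems


if __name__ == "__main__":
    # Each reported tuple names the position, the character and its category.
    for sample in ["Hello, World", "na\u00efve", "A1"]:
        print(repr(sample), unsupported_characters(sample))
```

An empty result means every character falls inside the declared policy; anything else tells a typesetter or processor exactly which characters need attention.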

van Wijk, Diederik Gerth. IDEAlliance (2004). Articles>Information Design>XML>Unicode