
 Character Matters http://www.idealliance.org/proceedings/xml04/abstracts/paper54.html
van Wijk, Diederik Gerth IDEAlliance 2004
Abstract: Documents are made of characters; XML documents are made of Unicode characters. Compared with SGML, we now have potentially one million characters where SGML provided only a hundred, but on the other hand we have lost the option of defining our own SDATA entities. This confronts us with two challenges.
The first is: how can we validate that a document, an element, or an attribute contains only those characters that we know how to process, that is, how to render, sort, search, hyphenate, capitalise, pronounce...? How can we tell a typesetter which character set he has to find a font for? XML Schema provides a simple way of restricting the set of valid characters in an attribute or a simple element with a regular expression, which can use some of the Unicode character properties, such as the block a character is defined in (for example Basic Latin or Latin Extended-B) or its General Category (for example Uppercase Letter or Math Symbol). That mechanism cannot be used in mixed content, however, which is typical of text markup.
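
As an illustration of the XML Schema mechanism mentioned in the abstract, the following is a minimal sketch and is not taken from the paper itself. The type names `LatinText` and `UpperOrMath` and the element names `title` and `formulaKey` are hypothetical. The escapes `\p{IsBasicLatin}`, `\p{IsLatinExtended-B}`, `\p{Lu}` and `\p{Sm}` are the XML Schema regular-expression forms for Unicode blocks and General Categories.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical type: only characters from the Basic Latin
       and Latin Extended-B blocks are allowed. -->
  <xs:simpleType name="LatinText">
    <xs:restriction base="xs:string">
      <xs:pattern value="[\p{IsBasicLatin}\p{IsLatinExtended-B}]*"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Hypothetical type: only characters whose General Category is
       Lu (Uppercase Letter) or Sm (Math Symbol) are allowed. -->
  <xs:simpleType name="UpperOrMath">
    <xs:restriction base="xs:string">
      <xs:pattern value="[\p{Lu}\p{Sm}]+"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Both elements have simple content, so the patterns apply.
       An element with mixed content (text interleaved with child
       elements) cannot be constrained this way. -->
  <xs:element name="title" type="LatinText"/>
  <xs:element name="formulaKey" type="UpperOrMath"/>

</xs:schema>
```

Under this sketch, `<title>Hello</title>` would validate, while a `title` containing a Greek or Cyrillic letter would be rejected because that character lies outside both listed blocks.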