The SGML Handbook

Characters and Character Sets

Computer systems vary greatly in the sets of characters they make available for use in electronic documents. This variety enables users with widely different needs to find computer systems suitable to their work, but it also complicates the interchange of documents among systems; hence the need for a chapter on this topic in these Guidelines.

Four character-set problems arise for the encoder of electronic texts:
- selecting a character set to use in creating, processing, or storing the electronic text
- preparing documents for interchange so that the characters within them are not corrupted in transit
- encoding characters which are not provided by the character set available on the computer system in use
- signaling shifts from one character set to another, e.g. from the Latin alphabet to Greek and back, or to a special symbol character set and back

This chapter describes the recommended solutions to these problems, in enough detail to satisfy the needs of most users. More detail and more technical information can be found in chapters , , , and .

Local Character Sets

No single character set is prescribed for use in TEI-encoded documents. Users may use any character set available to them, subject to the character set restrictions imposed by the SGML declaration. It is recommended that the character set used be documented by one or more writing system declarations (WSD), on which see below. In most cases, a predefined writing system declaration should be suitable. For writing system declarations provided as part of the TEI Guidelines, see chapter .

In general, it is most convenient to use a character set readily available on one's computer system, though for special purposes it may be preferable to customize the character set using software specialized for the purpose. Whether to use the usual character set or create a custom set depends on the documents being encoded, the staff support and tools available for customizing the character set, the user's technical facility, etc. A choice must be made between the perceived relative convenience of living with an existing character set and the effort of modifying it to suit one's documents more closely. Care should be taken, in particular where multi-byte character sets are used, that the syntactic distinctions required by an SGML processor are preserved. Where a suitable character set is defined by a national or international standard, local customization should implement the standard rather than inventing yet another non-standard character set. The choice must be made by each individual according to individual circumstances; no general recommendations are made here as to whether locally customized character sets should be used. In principle, however, for local processing, encoders should use whatever character set they find convenient.

Characters Available Locally

When the characters in a text exist in the local character set, the appropriate character codes should be used to represent them. Virtually all computer systems provide at least the 52 letters of the standard Latin alphabet, the ten decimal digits, and some basic punctuation marks.

Other characters, such as Latin characters with diacritics (e.g. ä or é) or non-Latin characters (e.g. Greek, Hebrew, Arabic, Cyrillic), are less universally provided. Oriental scripts pose particular problems because of the size of their character repertoires. If the local character set provides an ä, however, there is normally no reason not to use it where that character appears in the text, unless the electronic text is to be moved frequently among machines, in which case one may wish to restrict the electronic text to characters known to translate well among machines. (For more information on moving characters among machines, see section below.)

As noted above, full use of a local character set may require that the SGML declaration be modified to define all the characters used as legal SGML characters.

Characters Not Available Locally

Characters not available in the local character set should usually be encoded using SGML entity references. In SGML terms, an entity (described in more detail in section ) is any named string of characters, from a single character to an entire system file. Entities are included in SGML documents by entity references. Lists of suggested names for all the characters and printers' symbols used by most modern European languages have been published by ISO and others. The most widely used such entity set is to be found in Annex D to ISO 8879; it is also reproduced or summarized in most SGML textbooks, notably Charles F. Goldfarb, The SGML Handbook (Oxford: Clarendon Press, 1990). A list of some frequently used standard entity names may be found in chapter . Extensive entity sets are being developed by the TEI, and others are being documented in the fascicles of ISO/TR 9573: Technical Report: Information processing --- SGML support facilities --- Techniques for using SGML ([Geneva]: ISO, 1988 et seq.).

For example, the standard entity name for the character ä is auml. A reference to the entity gives the name of the entity, preceded by an ampersand and followed by a semicolon: &auml;. (Strictly speaking the semicolon is not always required; for details see any full treatment of SGML.) Consider the following German sentence:

Trotz dieser langen Tradition sekundäranalytischer Ansätze wird man die Bilanz tatsächlich durchgeführter sekundäranalytischer Arbeiten aber in wesentlichen Punkten als unbefriedigend empfinden müssen.

Using the standard names for a-umlaut and u-umlaut (auml and uuml), one could transcribe this sentence thus:

Trotz dieser langen Tradition sekund&auml;ranalytischer Ans&auml;tze wird man die Bilanz tats&auml;chlich durchgef&uuml;hrter sekund&auml;ranalytischer Arbeiten aber in wesentlichen Punkten als unbefriedigend empfinden m&uuml;ssen.
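The substitution of entity references for characters is mechanical and can be sketched in a few lines of Python. This is a minimal illustration, not part of the Guidelines: the function name to_entity_references is hypothetical, and the name table in Python's html.entities module is assumed to stand in for a real ISO 8879 entity set (its names for ä and ü, auml and uuml, agree with those used here; other names have not been checked). Literal ampersands in the input are assumed to be absent.

```python
from html.entities import codepoint2name


def to_entity_references(text):
    """Replace every non-ASCII character with a named entity reference."""
    pieces = []
    for ch in text:
        if ord(ch) < 128:
            # Characters in the 7-bit range are passed through unchanged.
            pieces.append(ch)
            continue
        name = codepoint2name.get(ord(ch))
        if name is None:
            # No name in the stand-in table: refuse rather than guess.
            raise ValueError(f"no entity name known for {ch!r}")
        pieces.append(f"&{name};")
    return "".join(pieces)


print(to_entity_references("tatsächlich durchgeführter"))
# prints: tats&auml;chlich durchgef&uuml;hrter
```

Raising an error for a character with no known name, rather than dropping or approximating it, keeps the encoder aware of every character for which a non-standard entity would have to be declared.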

As noted above, standard entity names have been defined for most characters used by languages written in the Latin alphabet, and for some other scripts, including Greek, Cyrillic, Coptic, and the International Phonetic Alphabet. A useful subset of these may be found in chapter .

Before an entity can be referred to, it must be declared. Standard public entity names can be declared en masse, by including within the DTD subset of the SGML document a reference to the standard public entity which declares them. The German document quoted above, for example, would have the following lines, or their equivalent, in its DTD subset:

<!ENTITY % ISOLat1 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN">
%ISOLat1;

Where no standard entity name exists, or where the standard name is felt unsuitable for some reason, the encoder may declare non-standard entities, using the normal SGML syntax. The replacement string for such entities may vary for different purposes, as in the following extended example.

In transcribing a manuscript, it might be desirable to distinguish among three distinct forms of the letter r. In the transcript, each of these forms will be encoded by an entity reference, for example: &r1;, &r2;, and &r3;. Entity declarations must then be provided, within the DTD subset of the SGML document, to define these entities and specify a substitute string.

One possible set of declarations would give each entity a replacement string that simply flags each occurrence with a number in brackets, to indicate which form of r appears in the manuscript.

Another (imaginary) set of declarations might use some printer-specific codes to produce visually distinct versions of the three letters in output. The replacement strings for each entity would then contain character references, for example &#27;, indicating the decimal value (or code point) of the character to be generated by the SGML processor. The assumption is that the sequence of bytes generated will drive a printer to produce a distinctive form of the letter. Such printer instructions are of their nature highly machine- and software-dependent; by allowing them to be confined to a single place (the entity declarations), SGML makes it easier to adapt such device-specific information to new environments.

Alternatively, the declarations can give all three forms the same appearance in the output, for instance by giving all three entities the plain letter r as their replacement string.

And finally, to ensure that the SGML output uses the same entity references for these characters as does the SGML input, one could declare each entity with a replacement string that reproduces the entity reference itself.
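The effect of swapping one set of declarations for another can be simulated outside an SGML parser. The following Python sketch is an illustration under stated assumptions: every replacement string in it is hypothetical (the printer codes in particular are invented), the function expand is not part of any SGML tool, and the simple pattern used to find references ignores the finer points of SGML syntax.

```python
import re

# Hypothetical replacement strings, one dictionary per set of declarations.
DECLARATION_SETS = {
    # Flag each form with a number in brackets.
    "flagged": {"r1": "r[1]", "r2": "r[2]", "r3": "r[3]"},
    # Invented printer codes; "\x1b" is the character with decimal value 27.
    "printer": {"r1": "\x1b[1mr\x1b[0m", "r2": "\x1b[2mr\x1b[0m", "r3": "\x1b[3mr\x1b[0m"},
    # Give all three forms the same appearance.
    "uniform": {"r1": "r", "r2": "r", "r3": "r"},
    # Reproduce the reference itself, so output matches input.
    "identity": {"r1": "&r1;", "r2": "&r2;", "r3": "&r3;"},
}

ENTITY_REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand(text, declarations):
    """Replace each entity reference by its declared replacement string."""
    def replacement(match):
        name = match.group(1)
        if name not in declarations:
            # An entity must be declared before it can be referred to.
            raise KeyError(f"entity {name!r} referenced but not declared")
        return declarations[name]
    # Replaced text is not rescanned, so the identity set cannot loop.
    return ENTITY_REFERENCE.sub(replacement, text)


sample = "wo&r1;d, wo&r2;d, wo&r3;d"
print(expand(sample, DECLARATION_SETS["flagged"]))
# prints: wor[1]d, wor[2]d, wor[3]d
```

Only the dictionary passed to expand changes between runs; the transcript itself is untouched, which is the point of confining device-specific information to the declarations.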

Wherever locally defined entities are used for the representation of characters in the text, whether to record presentational variants as in this example or to represent special symbols not present in standard character sets or public entity sets, it is recommended that the entities be documented in a writing system declaration; for details see chapter .

For transcriptions in scripts not supported by the local character set, entity references may prove unwieldy. In such cases, it is also possible to transliterate the material from its original script into the script of the local character set; like a customized local character set, a transliteration scheme should be documented with a writing system declaration. To avoid information loss, a reversible transliteration scheme (i.e. one in which it is possible to reconstruct the original writing from the transliteration) should be preferred; wherever possible, standard schemes should be preferred to ad hoc schemes. Reversible transliteration schemes are defined by national and international standards too numerous to list here; another useful source is ALA-LC Romanization Tables: Transliteration Schemes for Non-Roman Scripts, approved by the Library of Congress and the American Library Association, tables compiled and edited by Randall K. Barry (Washington: Library of Congress, 1991). Many of these schemes, however, use diacritics which are themselves not always available in standard electronic character sets and thus may require careful adaptation for use in electronic work.
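What makes a scheme reversible can be shown with a toy case. The Python sketch below is illustrative only: the two mappings cover a handful of Greek letters, are invented for this purpose, and are not Beta code or any standard scheme. It assumes one-character-to-one-character schemes and input containing only characters covered by the scheme.

```python
# Two hypothetical one-character schemes for a handful of Greek letters.
REVERSIBLE = {"α": "a", "δ": "d", "ε": "e", "η": "h", "ο": "o", "ω": "w"}
LOSSY = {"α": "a", "δ": "d", "ε": "e", "η": "e", "ο": "o", "ω": "o"}


def is_reversible(scheme):
    """True when no two source characters share a transliteration."""
    targets = list(scheme.values())
    return len(targets) == len(set(targets))


def transliterate(text, scheme):
    """Map each character through the scheme; others pass through unchanged."""
    return "".join(scheme.get(ch, ch) for ch in text)


def restore(text, scheme):
    """Invert a reversible scheme to reconstruct the original writing."""
    inverse = {target: source for source, target in scheme.items()}
    return "".join(inverse.get(ch, ch) for ch in text)


print(is_reversible(REVERSIBLE), is_reversible(LOSSY))
# prints: True False
print(restore(transliterate("ωδη", REVERSIBLE), REVERSIBLE))
# prints: ωδη
```

Under the lossy mapping the distinct strings ωδη and οδε both become ode, so the original writing cannot be reconstructed from the transliteration.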

For example, the start of the Iliad of Homer could be transcribed using the Beta code transcription developed for ancient Greek by the Thesaurus Linguæ Græcæ. (See Thesaurus Linguæ Græcæ, Beta Manual (Irvine: TLG, [1988]); see also Luci Berkowitz and Karl A. Squitier, Thesaurus Linguæ Græcæ Canon of Greek Authors and Works, 2nd edition (Oxford: Oxford University Press, 1986).)

Some standard transliteration schemes for commonly encoded languages are included in chapter .

Shifting Among Character Sets

Many documents contain material from more than one language: loan words, quotations from foreign languages, etc. Since languages use a variety of writing systems, which in turn use a variety of character repertoires, shifts in language frequently go hand in hand with shifts in character repertoire and writing system. Since language change is frequently of importance in processing a document, even when no character set shift is needed, the encoding scheme defined here provides a global attribute lang to make it possible to mark language shifts explicitly. This attribute may also be used to trigger character-set shifting by application programs.

Some languages use more than one writing system. For example, some Slavic languages may be written either in the Latin or in the Cyrillic alphabet; some Turkic languages in Cyrillic, Latin, or Arabic script. In such cases, each writing system must be treated separately, as a separate language. Each distinct value of the lang attribute, therefore, represents a single natural language, a single writing system, and a single coded character set.

Each value used for the lang attribute must correspond to a writing system declaration suitable for the character set and writing system being used. The values should where possible be taken from the standard two- and three-letter language codes defined by the international standard ISO 639 (ISO 639: 1988, Code for the representation of names of languages ([Geneva]: International Organization for Standardization, 1988)). The most recent version of this standard supplies three-letter codes as well as the earlier system of two-letter codes. Either may be used in TEI documents. When more than one writing system is used for the same language in the same document, suffixes should be added to the values from ISO 639. A selection of standard language codes, as well as a number of standard writing system declarations, are provided in chapter ; other writing system declarations may be provided locally.

Like any global attribute, the lang attribute may be used on any element in the SGML document. To mark a technical term, for example, as being in a particular language, one may simply specify the appropriate language on the term tag (for which see ):

<p> ... But then only will there be good ground of hope for the further advance of knowledge, when there shall be received and gathered together into natural history a variety of experiments, which are of no use in themselves, but simply serve to discover causes and axioms, which I call <term lang="la">Experimenta lucifera</term>, experiments of light, to distinguish them from those which I call <term lang="la">fructifera</term>, experiments of fruit.</p>

<p>Now experiments of this kind have one admirable property and condition: they never miss or fail. ... </p>

The form in which materials in different writing systems are processed or displayed is not specified by these Guidelines; if appropriate characters are not available locally, application software may choose to display the material in an appropriate transliteration, as a series of entity references, or in other forms. If the local system requires explicit escape sequences or locking-shift control functions to signal shifts to alternate character sets, these escape sequences may be embedded directly in the content of the appropriate element or may be supplied by the application software upon recognition of the language shift. However, escape sequences are vulnerable to loss or misinterpretation when documents are sent from site to site. For this reason, the method of embedding them directly in the document, while allowed within TEI-conformant interchange, is not recommended for general use. Standard methods for character-set shifting are defined in ISO 2022: 1986, Information processing --- ISO 7-bit and 8-bit coded character sets --- Code extension techniques, 3d ed. ([Geneva]: International Organization for Standardization, 1986). A standard set of control functions, including methods of specifying script direction (left-right, right-left, top-down, etc.), is defined by ISO 6429: 1992, Information processing --- Control functions for 7-bit and 8-bit coded character sets ([Geneva]: ISO, 1992). These and related standards are recommended to the notice of those designing systems to handle multiple character sets.

Character Set Problems in Interchange

Electronic texts may be exchanged over electronic networks, through exchange of magnetic media (e.g. disk or tape), or by other means. In every case except the transmission of storage media from one machine to another machine of the same hardware type running the same operating system and using the same coded character set, the characters are subject to translation and interpretation, and hence to misinterpretation and corruption, by utility software working somewhere on the interchange path. Network gateways, tape-reading software, and disk utilities routinely translate from one character set to another before passing the data on. If the utility errs in identifying the character set, or if several utilities translate back and forth among character sets using non-reversible translations, the chances are good that characters will be garbled and information lost.

At this time (1994), it is recommended that in texts subject to blind interchange (that is, interchange between parties who do not or cannot make explicit agreements over the character set to be used in interchange) no characters should be used other than those in the international reference version (IRV) of ISO 646 (ISO 646: 1991, Information processing --- ISO 7-bit coded character set for information interchange ([Geneva]: International Organization for Standardization, 1991)). Other characters should be represented by SGML entity references. For scripts (such as those of East Asia) which require a character repertoire larger than 255 characters, regional proposals for character and entity sets to be used in non-negotiated interchange must be developed and published internationally.

In some cases, even the characters in ISO 646 IRV are corrupted in transit; where interchange takes place over links not reliable for the full IRV character set, the characters subject to misinterpretation and corruption should be replaced by standard entities. At present, the characters least susceptible to loss or misinterpretation in transit among systems are those shown below, which represent a subset of the characters in IRV and may thus be called the ISO 646 subset.

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
" % & ' ( ) * + , - . / : ; < = > ? _

In interchange over any transmission link, the transmitted document should contain only those characters which safely survive transmission over the link; others should be represented with entity references, or with transliterations, as described above.
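Before a document is sent, it can be checked for characters that fall outside the ISO 646 subset listed above. The following Python sketch is one possible check, not a prescribed procedure: the names ISO_646_SUBSET and unsafe_characters are hypothetical, and whitespace (spaces and line ends) is assumed to be safe, since the printed list cannot show it.

```python
import string

# The ISO 646 subset as listed in this chapter: letters, digits,
# and nineteen punctuation marks.
ISO_646_SUBSET = set(
    string.ascii_letters + string.digits + "\"%&'()*+,-./:;<=>?_"
)


def unsafe_characters(text):
    """Return the distinct characters outside the subset, in order of first use."""
    found = []
    for ch in text:
        if ch.isspace():
            # Assumption: whitespace survives transmission.
            continue
        if ch not in ISO_646_SUBSET and ch not in found:
            found.append(ch)
    return found


print(unsafe_characters("Was kostet das? 5 $ [ca.]"))
# prints: ['$', '[', ']']
```

Each character reported would then be replaced by an entity reference or a transliteration before transmission.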

In blind interchange by means of magnetic or optical media, it is recommended that the document be encoded using some well documented and widely used standard character set.

In blind interchange over networks, it is recommended that the transmitted document contain only characters known to travel safely over the networks involved. In the most general case, those characters are the ISO 646 subset given above.

The problem of character data loss in document interchange is in principle a temporary one; eventually, it should disappear as improvements are made in network handling of character data and as more networks support larger character sets such as ISO 10646 (International Organization for Standardization and International Electrotechnical Commission, ISO/IEC 10646-1: 1993, Information technology --- Universal Multiple-Octet Coded Character Set (UCS) --- Part 1: Architecture and Basic Multilingual Plane ([Geneva]: International Organization for Standardization, 1993)). The Basic Multilingual Plane of ISO 10646 is identical to the sixteen-bit character set known as Unicode (The Unicode Consortium, The Unicode Standard: Worldwide Character Encoding Version 1.0, Volume 1 (Reading, Mass.: Addison-Wesley, 1991)). Version 1.0 of Unicode has been superseded by later work incorporated into ISO/IEC 10646; these changes are reflected in the official published standard, and will be documented in version 1.1 of Unicode. The need for backward compatibility with older systems is likely, however, to ensure that care must be taken in blind interchange for the foreseeable future.

For further discussion of negotiated and non-negotiated interchange, see chapter .

The Writing System Declaration

Each language and writing system used in a document should be documented in a Writing System Declaration (WSD), which specifies:
- a formal name for the writing system and language, for use as a lang attribute value on elements encoded in it
- a specification for the meaning of each character available in the writing system

The characters available in a writing system may be specified in the WSD for that writing system in one or more of the following ways:
- by reference to an international, national, or TEI-registered coded character set or entity set
- by reference to such a standard followed by formal declaration of all exceptions
- by providing a formal declaration for each character used

Individual characters within a WSD are formally declared, where necessary, by providing the following information:
- the unique code used to represent the character
- special properties such as whether the character is a diacritic mark or not
- brief textual description of the character
- standard or local entity name used for the character in interchange
- other standard identifiers for the character, if available, such as its code in the Universal Character Set of ISO 10646 or Unicode and its standard number in the Association for Font Information Interchange registry
- optionally, some specification of a suitable graphic rendition for the character in a suitable notation (e.g. graphic image, Metafont program, etc.)
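The information given for an individual character can be pictured as a simple record. The Python sketch below is only a model of that list, not the WSD document type itself: the class CharacterDeclaration, its field names, and the helper describe are all hypothetical, and treating a nonzero Unicode combining class as the test for a diacritic mark is an assumption made for the example.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass
class CharacterDeclaration:
    code: str                          # unique code representing the character
    is_diacritic: bool                 # special property: diacritic mark or not
    description: str                   # brief textual description
    entity_name: str                   # entity name used in interchange
    ucs_code: Optional[str] = None     # identifier in ISO 10646 / Unicode
    afii_number: Optional[str] = None  # AFII registry number, if known
    rendition: Optional[str] = None    # optional graphic rendition


def describe(char, entity_name):
    """Build a declaration for one character from Python's Unicode data."""
    return CharacterDeclaration(
        code=char,
        # Assumption: combining characters are the diacritic marks.
        is_diacritic=unicodedata.combining(char) != 0,
        description=unicodedata.name(char).lower(),
        entity_name=entity_name,
        ucs_code=f"U+{ord(char):04X}",
    )


declaration = describe("ä", "auml")
print(declaration.ucs_code, declaration.description)
# prints: U+00E4 latin small letter a with diaeresis
```

The optional fields default to None, mirroring the "if available" and "optionally" wording of the list.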

The writing system declaration is one of a set of auxiliary documents which provide documentation relevant to the processing of TEI texts. Auxiliary documents are themselves SGML documents, for which document type declarations are provided. The DTD for the Writing System Declaration is discussed in detail in chapter . Standard Writing System Declarations are provided in chapter .