Computer systems vary greatly in the sets of characters they make available for use in electronic documents. This variety enables users with widely different needs to find computer systems suitable to their work, but it also complicates the interchange of documents among systems; hence the need for a chapter on this topic in these Guidelines.
Four character-set problems arise for the encoder of electronic
texts:
No single character set is prescribed for use in TEI-encoded
documents. Users may use any character set available to them, subject
to the character set restrictions imposed by the SGML declaration. It
is recommended that the character set used be documented by one or more
In general, it is most convenient to use a character set readily
available on one's computer system, though for special purposes it may
be preferable to customize the character set using software specialized
for the purpose. Whether to use the usual character set or create a
custom set depends on the documents being encoded, the staff support and
tools available for customizing the character set, the user's technical
facility, etc. A choice must be made between the perceived relative
convenience of living with an existing character set and the effort of
modifying it to suit one's documents more closely. Care should be
taken, in particular where multi-byte character sets are used, that the
syntactic distinctions required by an SGML processor are preserved.
Where a suitable character set is defined by a national or international
standard, local customization should implement the standard rather
than inventing yet another non-standard character set. The
choice must be made by each individual according to individual
circumstances; no general recommendations are made here as to whether
locally customized character sets should be used. In principle, however,
for local processing, encoders should use whatever character set they
find convenient.
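The caution about multi-byte character sets can be made concrete with a small sketch. It assumes a byte-oriented SGML processor that treats the bytes for `<`, `>`, and `&` as markup delimiters, and it uses Python's `iso2022_jp` codec purely as one example of a multi-byte encoding. The names `SGML_DELIMITERS`, `find_delimiter_collisions`, and `first_colliding_kanji` are invented for this illustration and belong to no SGML toolkit.

```python
# ASSUMPTION: a naive processor scans the byte stream for these three
# delimiter bytes without decoding multi-byte characters first.
SGML_DELIMITERS = b"<>&"


def find_delimiter_collisions(text, encoding):
    """Return the characters of `text` whose encoded form contains a byte
    that a byte-oriented parser would mistake for an SGML delimiter."""
    collisions = []
    for ch in text:
        if ch in "<>&":
            continue  # genuine delimiters are not collisions
        encoded = ch.encode(encoding)
        if any(byte in SGML_DELIMITERS for byte in encoded):
            collisions.append(ch)
    return collisions


def first_colliding_kanji(encoding="iso2022_jp"):
    """Search the CJK ideograph range for the first character that the
    given encoding can represent and that collides with a delimiter."""
    for code_point in range(0x4E00, 0x9FA6):
        ch = chr(code_point)
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            continue  # not in this encoding's repertoire
        if find_delimiter_collisions(ch, encoding):
            return ch
    return None
```

A processor that decodes the multi-byte stream before looking for delimiters is not affected; the sketch only shows why the syntactic distinctions need checking when it does not.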
When the characters in a text exist in the local character set, the
appropriate character codes should be used to represent them. Virtually
all computer systems provide at least the 52 letters (26 upper-case and 26 lower-case) of the standard
Latin alphabet, the ten decimal digits, and some basic punctuation
marks.
Other characters, such as Latin characters with diacritics (e.g.
ä or é) or non-Latin characters (e.g. Greek, Hebrew, Arabic,
Cyrillic) are less universally provided. Oriental scripts pose
particular problems because of the size of their character repertoires.
If the local character set provides an
As noted above, full use of a local character set may require that
the SGML declaration be modified to define all the characters used as
legal SGML characters.
Characters not available in the local character set should usually be
encoded using SGML
For example, the standard entity name for the character
As noted above, standard entity names have been defined for most
characters used by languages written in the Latin alphabet, and for some
other scripts, including Greek, Cyrillic, Coptic, and the International
Phonetic Alphabet. A useful subset of these may be found in chapter
Before an entity can be referred to, it must be declared. Standard
public entity names can be declared en masse, by including within the
DTD subset of the SGML document a reference to the standard public
entity which declares them. The German document quoted above, for
example, would have the following lines, or their equivalent, in its
DTD subset:
Where no standard entity name exists, or where the standard name is
felt unsuitable for some reason, the encoder may declare non-standard
entities, using the normal SGML syntax.
The replacement string for such entities may vary for different
purposes, as in the following extended example.
In transcribing a manuscript, it might be desirable to distinguish
among three distinct forms of the letter
One possible set of declarations would be as follows:
Another (imaginary) set of declarations might use some
printer-specific codes to produce visually distinct versions of the
three letters in output:
Alternatively, the declarations can give all three forms the same
appearance in the output:
And finally, to ensure that the SGML output uses the same entity
references for these characters as does the SGML input, one could use
the following declarations.
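The idea that one set of entity names can carry different replacement strings for different purposes may be sketched outside SGML as a lookup table chosen at processing time. The following Python fragment is an illustration only: the entity names `r1`, `r2`, and `r3` are borrowed from the references that appear later in this chapter, while `ENTITY_TABLES`, `expand_entities`, and the table names are hypothetical. It does not reproduce the SGML declarations themselves.

```python
import re

# Hypothetical replacement tables for three forms of one letter.
# "uniform" gives all three forms the same appearance in the output;
# "roundtrip" reproduces the entity references unchanged.
ENTITY_TABLES = {
    "uniform": {"r1": "r", "r2": "r", "r3": "r"},
    "roundtrip": {"r1": "&r1;", "r2": "&r2;", "r3": "&r3;"},
}

# An entity reference: ampersand, a name, and a closing semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand_entities(text, table):
    """Replace each entity reference found in `table` by its replacement
    string; references to undeclared names are left untouched."""
    def replace(match):
        return table.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(replace, text)
```

Switching tables changes the output without touching the transcription, which is the point of recording the variants as entities in the first place.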
Wherever locally defined entities are used for the representation of
characters in the text, whether to record presentational variants as
in this example or to represent special symbols not present in standard
character sets or public entity sets, it is recommended that the
entities be documented in a writing system declaration; for details
see chapter
For transcriptions in scripts not supported by the local character
set, entity references may prove unwieldy. In such cases, it is also
possible to transliterate the material from its original script into
the script of the local character set; like a customized local
character set, a transliteration scheme should be documented with a
writing system declaration. To avoid information loss, a
For example, using the Beta code transcription developed for ancient
Greek by the Thesaurus Linguæ Græcæ,
Some standard transliteration schemes for commonly encoded languages
are included in chapter
Many documents contain material from more than one language: loan
words, quotations from foreign languages, etc. Since languages use a
variety of
Some languages use more than one writing system. For example, some
Slavic languages may be written either in the Latin or in the Cyrillic
alphabet; some Turkic languages in Cyrillic, Latin, or Arabic script.
In such cases, each writing system must be treated separately, as a
separate
Each value used for the
Like any global attribute, the
Now experiments of this kind have one admirable
property and condition: they never miss or fail. ...
The form in which materials in different writing systems are
processed or displayed is not specified by these Guidelines; if
appropriate characters are not available locally, application software
may choose to display the material in an appropriate transliteration, as
a series of entity references, or in other forms. If the local system
requires explicit escape sequences or locking-shift control functions to
signal shifts to alternate character sets, these escape sequences may be
embedded directly in the content of the appropriate element or may be
supplied by the application software upon recognition of the language
shift. However, escape sequences are vulnerable to loss or
misinterpretation when documents are sent from site to site. For this
reason, the method of embedding them directly in the document, while
allowed within TEI-conformant interchange, is not recommended for
general use.
Electronic texts may be exchanged over electronic networks, through
exchange of magnetic media (e.g. disk or tape), or by other means. In
every case except the transmission of storage media from one machine to
another machine of the same hardware type running the same operating
system and using the same coded character set, the characters are
subject to translation and interpretation, and hence to
misinterpretation and corruption, by utility software working somewhere
on the interchange path. Network gateways, tape-reading software, and
disk utilities routinely translate from one character set to another
before passing the data on. If the utility errs in identifying the
character set, or if several utilities translate back and forth among
character sets using non-reversible translations, the chances are good
that characters will be garbled and information lost.
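Both failure modes described above can be reproduced in a few lines. The sketch below is illustrative only: `transfer` and `lossy_round_trip` are hypothetical names, and the pairing of Latin-1 with code page 437 is simply one example of a misidentified character set.

```python
def transfer(text, sender_encoding, receiver_encoding):
    """Simulate a utility that receives bytes written in one character set
    and interprets them as though they were written in another."""
    return text.encode(sender_encoding).decode(receiver_encoding)


def lossy_round_trip(text, narrow_encoding="ascii"):
    """Simulate a non-reversible translation through a smaller character
    set: characters it lacks are replaced and cannot be recovered."""
    narrowed = text.encode(narrow_encoding, errors="replace")
    return narrowed.decode(narrow_encoding)


original = "sekundäranalytischer"

# Misidentified character set: same bytes, different interpretation.
garbled = transfer(original, "latin-1", "cp437")

# Non-reversible translation: the a-umlaut is lost for good.
flattened = lossy_round_trip(original)
```

In the first case the information is still present in the bytes and can be recovered if the error is noticed; in the second it is gone.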
At this time (1994), it is recommended that in texts subject to
In some cases, even the characters in ISO 646 IRV are corrupted in
transit; where interchange takes place over links not reliable for the
full IRV character set, the characters subject to misinterpretation and
corruption should be replaced by standard entities. At present, the
characters least susceptible to loss or misinterpretation in transit
among systems are those shown below, which represent a subset of the
characters in IRV and may thus be called the
In interchange over any transmission link, the transmitted document
should contain only those characters which safely survive transmission
over the link; others should be represented with entity references, or
with transliterations, as described above.
In blind interchange by means of magnetic or optical media, it is
recommended that the document be encoded using some well documented and
widely used standard character set.
In blind interchange over networks, it is recommended that the
transmitted document contain only characters known to travel safely over
the networks involved. In the most general case, those characters are
the ISO 646 subset given above.
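Before a document is sent, it can be checked mechanically for characters outside the repertoire known to be safe. The sketch below is a minimal illustration under a stated assumption: printable ASCII is used as a stand-in for the safe repertoire, which is broader than the ISO 646 subset these Guidelines recommend, so the caller should pass the actual subset in use. `DEFAULT_SAFE` and `unsafe_characters` are hypothetical names.

```python
import string

# ASSUMPTION: printable ASCII plus space and newline stands in for the
# safe repertoire; replace it with the subset that applies to the link.
DEFAULT_SAFE = frozenset(
    string.ascii_letters + string.digits + string.punctuation + " \n"
)


def unsafe_characters(text, safe=DEFAULT_SAFE):
    """Return, in code-point order, the distinct characters of `text`
    that fall outside the safe repertoire and so need to be replaced by
    entity references or transliterations before transmission."""
    return sorted({ch for ch in text if ch not in safe})
```

An empty result means that the document, as far as this check can tell, contains only characters from the chosen repertoire.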
The problem of character data loss in document interchange is in
principle a temporary one; eventually, it should disappear as
improvements are made in network handling of character data and as more
networks support larger character sets such as ISO 10646.
For further discussion of negotiated and non-negotiated interchange,
see chapter
Each language and writing system used in a document should be
documented in a Writing System Declaration (WSD), which specifies:
The writing system declaration is one of a set of
This chapter describes the recommended solutions to these problems, in
enough detail to satisfy the needs of most users. More detail and more
technical information can be found in chapters
Trotz dieser langen Tradition
sekundäranalytischer
Ansätze wird man die Bilanz tatsächlich
durchgeführter sekundäranalytischer Arbeiten
aber in wesentlichen Punkten als unbefriedigend
empfinden müssen.
Using the standard names for a-umlaut and u-umlaut, one
could transcribe this sentence thus:
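A minimal sketch of that substitution, written in Python rather than SGML, is given here for illustration. It assumes that the entity names in Python's `html.entities` table coincide, for the Latin-1 letters concerned, with the standard names (`auml`, `uuml`); the function name `to_entity_references` is hypothetical.

```python
from html.entities import codepoint2name


def to_entity_references(text):
    """Replace every character outside the 7-bit range by a named entity
    reference. Characters with no known name raise an error, since the
    encoder would have to declare a non-standard entity for them."""
    pieces = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 128:
            pieces.append(ch)
        elif code_point in codepoint2name:
            pieces.append("&" + codepoint2name[code_point] + ";")
        else:
            raise ValueError("no entity name known for %r" % ch)
    return "".join(pieces)


transcribed = to_entity_references("sekundäranalytischer Ansätze")
```

With the assumption above, `transcribed` holds `sekund&auml;ranalytischer Ans&auml;tze`.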
&, &r2;, and &r3;. Entity declarations must then be
provided, within the DTD subset of the SGML document, to define these
entities and specify a substitute string.
&#27;, indicating the decimal value (or
The characters available in a writing system may be specified in the
WSD for that writing system in one or more of the following ways:
Individual characters within a WSD are formally declared, where
necessary, by providing the following information: