The construction of pointers for lexical encoding

=========================================================================
Date: Mon, 2 Dec 1991 11:46:15 GMT
Reply-To: Donald A Spaeth
Sender: "TEI-L: Text Encoding Initiative public discussion list"
From: Donald A Spaeth
Subject: Manuscript characters

Thanks to Harry, Michael and Peter for their responses. It all sounds embarrassingly straightforward. On reflection, entities are the answer for minims, too, e.g. &minim.

Michael suggests that the use of multiple symbols doesn't appear to be ambiguity. The problem arises when the transcriber is unable to identify a character which may take several forms, some of which are similar to the forms other characters may take. The current process is to scan the page by eye, looking for other forms which look the same. This procedure is time-consuming, and it falls down when no known word seems to contain the ambiguous form in the appropriate location. The example provided for encoding readings seems to assume the need to choose between a limited number of seen possibilities, rather than possibilities that are unperceived because of difficulties reading the letter.

It would make sense to automate the search for letters described above, possibly tied to a dictionary. Entities could be used to represent different sets of topographical features. A similar procedure would be followed to disambiguate minims, e.g. to help decide whether &minim&minim&minim represented 'ui', 'ni', 'm', etc., by pulling out all possibilities from a dictionary.

Donald Spaeth
Glasgow University
=========================================================================
Date: Mon, 2 Dec 1991 12:01:47 MST
Reply-To: Terry Langendoen
Sender: "TEI-L: Text Encoding Initiative public discussion list"
From: Terry Langendoen
Subject: Lexical structure encoding

The construction of pointers for lexical encoding
Author: D. Terence Langendoen
Version for circulation on TEI-L file server on December 2, 1991.
Revision of material contained in TEI AI1 W9, section entitled The construction of entity definitions for lexical encoding.
DTL 1 December 1991 wrote first draft 0
DTL 2 December 1991 revisions and corrections 1

Introduction

Background

In TEI AI1 W9, Eanass Fahmy and I proposed a method for constructing and using entities for representing the morphological structures of the words in a document. At the Myrdal meeting, Steven DeRose suggested that using pointers would provide for more control over validation, make it easier to make changes and corrections to marked-up texts, and speed up parsing and processing generally. This suggestion met with approval at the meeting, and I agreed to recast the method in terms of pointers. While I have, in the interests of time, limited the domain of application to the same sets of feature structures, features and values as in TEI AI1 W9, I have chosen to use a two-character, rather than a one-character, coding scheme for feature names and values, in anticipation of the need to extend the method to a larger subset of the features and values in TEI AI1 W2. I also represent the person and number features associated with verbs as part of a feature structure which is the value of a feature named agreement, as in TEI AI1 W2.

Various assumptions about certain elements and attributes

In this document, the following elements and attributes have been renamed.
1. f.struct becomes fs
2. feature becomes f
3. atomic becomes atm
4. xref becomes x
5. f.s.or becomes or
6. target becomes t

The following new tags and attributes are used.
1. fs.lib
2. f.lib
3. any
4. default
5. unknown
6. not.applicable
7. no.claim
8. element, an attribute of x, and, not, and or, which takes as its values the names of elements.

I propose fs.lib and f.lib to group together in libraries sets of predefined fs and f elements, with ids that can be pointed to from the text or from within the libraries. I propose any, default, unknown, not.applicable and no.claim as empty tags to designate the predefined values for underspecification spelled out in TEI AI1 W2.

Conventions for forming ID values

Two-character coding scheme for feature names and values

First I give a two-character code for each of the feature names and values in TEI AI1 W9, Coding of word-level features and values, Coding of recurrent features and values, and Feature and value codes for single categories.

feature name or value        code
category (part-of-speech)    ps
  noun                       nn
  pronoun                    pn
  adjective                  aj
  article                    ar
  adverb                     av
  verb                       vb
  preposition                pp
  coordinator                cd
  subordinator               sb
  particle                   pt
  interjection               ij
  punctuation                pu
person                       pe
  first                      r1
  second                     r2
  third                      r3
number                       nu
  singular                   sg
  plural                     pl
gender                       ge
  feminine                   fe
  masculine                  ma
  neuter                     ne
case                         ca
  nominative                 nm
  genitive                   gt
  dative                     dt
  accusative                 ac
degree                       de
  positive                   g1
  comparative                g2
  superlative                g3
initial-capital              il
  minus                      c0
  plus                       c1
common                       co
  minus                      m0
  plus                       m1
possessive                   po
  minus                      s0
  plus                       s1
agreement                    ag
  first singular             r1sg
  second singular            r2sg
  third singular             r3sg
  first plural               r1pl
  second plural              r2pl
  third plural               r3pl
tense                        te
  present                    pr
  past                       pa
mood                         md
  indicative                 in
  imperative                 im
  subjunctive                sj
verb-form                    vf
  finite                     fi
  infinitive                 nf
  gerund                     gr
  participle                 pc
auxiliary                    au
  minus                      x0
  plus                       x1
orientation                  or
  open                       op
  close                      cl
  matched                    mt
  unary                      un

Second, the underspecification tags are given the following codes.

feature name or value        code
any                          00
default                      01
unknown                      97
not.applicable               98
no.claim                     99

Formation of identifiers

Identifiers for f elements

Identifiers for f elements are formed as in TEI AI1 W2 by joining the code for the name and the code for the value, but without the hyphen. For example, the f whose name is number and whose value is singular receives the identifier nusg, as in the following illustration.

<![ CDATA [ <f id=nusg name=number><atm>singular ]]
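The joining rule can be sketched in a few lines of Python. This is only an illustration: the tables `NAME_CODES` and `VALUE_CODES` hold a handful of entries from the coding scheme above, and the function name `f_id` is invented here rather than taken from any TEI document.

```python
# Hypothetical helper; only a few codes from the scheme above are included.
NAME_CODES = {"number": "nu", "person": "pe", "case": "ca"}
VALUE_CODES = {
    "singular": "sg", "plural": "pl",
    "first": "r1", "second": "r2", "third": "r3",
    "nominative": "nm", "dative": "dt",
}

def f_id(name: str, value: str) -> str:
    """Join the name code and the value code, with no hyphen."""
    return NAME_CODES[name] + VALUE_CODES[value]

print(f_id("number", "singular"))  # nusg
print(f_id("person", "third"))     # per3
```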

Here is another example, this time using an underspecified value.

<![ CDATA [ <f id=nu01 name=number><default> ]]

Identifiers for fs elements

Identifiers for fs elements are formed as in TEI AI1 W2 by joining the codes for the values of the enclosed f elements. However, if an underspecified value is to appear in the fs identifier, it is represented by the code for the feature name followed by the second digit of the value code. For example, the fs that designates a singular common noun with no initial capital and which is understood by default as third person (such as antiquity) can receive the identifier nnsgm1ilc0, as in the following illustration. I assume that no coding for the default person value is necessary in this situation.

<![ CDATA [ <fs id=nnsgm1ilc0> <x f t=psnn> <x f t=nusg> <x f t=com1> <x f t=ilc0> <x f t=pe01> </fs> ]]

As the preceding illustration shows, the fs encloses x elements which point to the f elements that it comprises. The f value is the value of the element attribute of x. The intended semantics is that the f elements that are pointed to are copied (except for their id attributes) into the positions of the x elements. The x elements could be given their own id attributes if desired.
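The copy semantics can be sketched as follows. The dictionary `F_LIB` is a hypothetical stand-in for an f library, filled with the five f elements pointed to in the illustration; because the library is keyed by id and the entries themselves carry no id, copying an entry automatically leaves the id behind.

```python
import copy

# Hypothetical f library keyed by id; contents follow the illustration above.
F_LIB = {
    "psnn": {"name": "category", "value": "noun"},
    "nusg": {"name": "number", "value": "singular"},
    "com1": {"name": "common", "value": "plus"},
    "ilc0": {"name": "initial-capital", "value": "minus"},
    "pe01": {"name": "person", "value": "default"},
}

def expand_fs(targets):
    """Replace each pointer target by a copy of the f it points to."""
    return [copy.deepcopy(F_LIB[t]) for t in targets]

fs = expand_fs(["psnn", "nusg", "com1", "ilc0", "pe01"])
for f in fs:
    print(f["name"], "=", f["value"])
```

Working on copies means that a later change to one expanded fs does not alter the library entry shared by every other fs that points to it.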

The situation with f elements which have fs elements as values is somewhat more complicated. Suppose that an fs has an f whose name is agreement and whose value is another fs, which encloses two f elements, one whose name is person with the value third and the other whose name is number with the value singular. The outer fs will have an identifier with the sequence r3sg in it, and will contain an x which points to an f with the identifier agr3sg. The latter f will contain an x pointing to an fs with the identifier r3sg, and this fs will contain two x elements, one pointing to an f with the identifier per3 and the other pointing to an f with the identifier nusg. The situation is illustrated as follows.

<![ CDATA [
<fs id=...r3sg...>
  <x f t=...>
  ...
  <x f t=agr3sg> ----------------+
  <x f t=...>                    |
  ...                            |
</fs>                            |
                                 |
<f id=agr3sg name=agreement> <---+
  <x fs t=r3sg> ----------+
                          |
<fs id=r3sg> <------------+
  <x f t=per3> -------------+
  <x f t=nusg> -------------x---+
</fs>                       |   |
                            |   |
<f id=per3 name=person> <---+   |
  <atm>third                    |
                                |
<f id=nusg name=number> <-------+
  <atm>singular
]]
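The chain of pointers in this illustration can be followed recursively. The sketch below assumes a single hypothetical library `LIB` that holds both the f and the fs entries; the tuple layout and the function name `resolve` are invented for the example.

```python
# Hypothetical combined library for the agreement example.
LIB = {
    # f entries: ("f", name, value); a value of ("x", id) is a pointer to an fs
    "agr3sg": ("f", "agreement", ("x", "r3sg")),
    "per3": ("f", "person", "third"),
    "nusg": ("f", "number", "singular"),
    # fs entries: ("fs", [ids of the f entries pointed to])
    "r3sg": ("fs", ["per3", "nusg"]),
}

def resolve(target):
    """Follow pointers until only names and atomic values remain."""
    entry = LIB[target]
    if entry[0] == "fs":
        return {"fs": [resolve(t) for t in entry[1]]}
    _, name, value = entry
    if isinstance(value, tuple):  # the value is a pointer to an fs
        value = resolve(value[1])
    return {"name": name, "value": value}

print(resolve("agr3sg"))
```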

For example, if the structure to be encoded is that of a present tense, indicative, auxiliary verb showing third person singular number agreement, such as has or is, the incomplete fs in the preceding illustration could be fleshed out as follows.

<![ CDATA [ <fs id=vbr3sgprinx1> <x f t=psvb> <x f t=agr3sg> <x f t=tepr> <x f t=mdin> <x f t=vf01> <x f t=aux1> </fs> ]]

Identifiers for ambiguous structures

Suppose one wished to mark up an occurrence of thinner as ambiguous between an interpretation as a singular noun and a comparative adjective. One way would be as a disjunction of two xs with its own identifier formed by concatenation from the identifiers for the xs, as in the following illustration; the xs in turn point to the fss which make up the choice.

<![ CDATA [ <or x id=nnsgajg2> <x fs t=nnsg> <x fs t=ajg2> </or> ]]

A more elaborate example is provided by fast, which one might wish to mark up as a singular noun, a verb of some sort, an adjective or an adverb. A possible encoding of this structure is the following.

<![ CDATA [ <or x id=nnsgvbvf9ajg1avg1> <x fs t=nnsg> <x fs t=vbvf9> <x fs t=ajg1> <x fs t=avg1> </or> ]]
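Since the identifier of such an or is simply the concatenation of the identifiers it points to, the markup can be generated mechanically. A minimal sketch, with the function name `or_markup` invented for the purpose:

```python
def or_markup(targets):
    """Build an or of x pointers whose id concatenates the target ids."""
    ident = "".join(targets)
    body = " ".join(f"<x fs t={t}>" for t in targets)
    return f"<or x id={ident}> {body} </or>"

print(or_markup(["nnsg", "ajg2"]))
print(or_markup(["nnsg", "vbvf9", "ajg1", "avg1"]))
```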

Next suppose one wished to mark up an occurrence of was as showing either first or third person agreement. The or can be used to group either atm or f, the latter requiring us to repeat the name of the f. Since we have not seen the need to provide identifiers for atms (though it is certainly possible to do so), we will take the latter option, despite its wordiness. First, we construct the value of the agreement f.

<![ CDATA [ <fs id=r1r3sg> <or x id=per1per3> <x f t=per1> <x f t=per3> </or> <x f t=nusg> </fs> ]]

The agreement f in turn looks as follows.

<![ CDATA [ <f id=agr1r3sg name=agreement> <x fs t=r1r3sg> ]]

A possible encoding of the structure for was would then be the following.

<![ CDATA [ <fs id=vbr1r3sgpainx1> <x f t=psvb> <x f t=agr1r3sg> <x f t=tepa> <x f t=mdin> <x f t=vf01> <x f t=aux1> </fs> ]]

Next, we consider how an encoding for have as a present-tense inflected form can be constructed. We assume that the agreement structure represents a choice between any person and plural number or any number and either first or second person. The first of these agreement structures has the following form.

<![ CDATA [ <fs id=pe0pl> <x f t=pe00> <x f t=nupl> </fs> ]]

The second structure has the following form.

<![ CDATA [ <fs id=r1r2nu0> <or x id=per1per2> <x f t=per1> <x f t=per2> </or> <x f t=nu00> </fs> ]]
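The way these fs identifiers are assembled from the identifiers of their f elements can be sketched as below. The function names are invented, and the sketch makes one simplifying assumption that is not in the text: the members of an enclosed or are passed in the same flat list as the other f identifiers.

```python
UNDERSPECIFIED = {"00", "01", "97", "98", "99"}

def fs_id_part(f_ident, show_underspecified=True):
    """Contribution of one f identifier (name code + value code) to an fs id."""
    name, value = f_ident[:2], f_ident[2:]
    if value in UNDERSPECIFIED:
        # name code plus the second digit of the value, or nothing at all
        return name + value[1] if show_underspecified else ""
    return value

def fs_id(f_idents, show_underspecified=True):
    return "".join(fs_id_part(i, show_underspecified) for i in f_idents)

print(fs_id(["pe00", "nupl"]))          # pe0pl
print(fs_id(["per1", "per2", "nu00"]))  # r1r2nu0
```

Calling `fs_id` with `show_underspecified=False` on the f identifiers of the earlier example for has or is drops the default verb-form value and yields vbr3sgprinx1.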

The agreement structure, then, is an or, as follows.

<![ CDATA [ <or x id=pe0plr1r2nu0> <x fs t=pe0pl> <x fs t=r1r2nu0> </or> ]]

It is pointed to by an agreement f as follows.

<![ CDATA [ <f id=agpe0plr1r2nu0 name=agreement> <x or t=pe0plr1r2nu0> ]]

The entire structure is as follows.

<![ CDATA [ <fs id=vbpe0plr1r2nu0prinx1> <x f t=psvb> <x f t=agpe0plr1r2nu0> <x f t=tepr> <x f t=mdin> <x f t=vf01> <x f t=aux1> </fs> ]]

Finally, we consider the encoding of have as ambiguous between an indicative, imperative, subjunctive and infinitive form. As an indicative form, it has the structure just given. As an imperative form, it can be assumed to show second person, any number agreement. As a subjunctive form, it can be assumed to show agreement with any person and number. As an infinitive form, agreement is not applicable. We give here just the encoding of the outer or. The details of the encoding can actually be inferred from the structure of the identifiers. The character . is used to separate the codes understood to be joined by and, and the identifier for an and ends in that character.

<![ CDATA [
<fs id=vb.pe0plr1r2nu0prin.r2nu0prim.pe0nu0prsj.ag8te8md8nf.x1>
  <x f t=psvb>
  <or and id=pe0plr1r2nu0prin.r2nu0prim.pe0nu0prsj.ag8te8md8nf.>
    <and x id=pe0plr1r2nu0prin.>
      <x f t=agpe0plr1r2nu0> <x f t=tepr> <x f t=mdin> <x f t=vf01>
    </and>
    <and x id=r2nu0prim.>
      <x f t=agr2nu0> <x f t=tepr> <x f t=mdim> <x f t=vf01>
    </and>
    <and x id=pe0nu0prsj.>
      <x f t=agpe0nu0> <x f t=tepr> <x f t=mdsj> <x f t=vf01>
    </and>
    <and x id=ag8te8md8nf.>
      <x f t=ag98> <x f t=te98> <x f t=md98> <x f t=vfnf>
    </and>
  </or>
  <x f t=aux1>
</fs>
]]

To avoid the use of and in cases like this, one could require the use of an f named, say, inflection (code if), whose value is an x which points to a fs which encloses pointers to the various fs which are represented by inflectional morphology, such as agreement, tense, mood, verb-form, number, person, etc. Schematically, the result would look as follows.

Systematic use of the inflection f would significantly change the other encoding proposals made here, however.

Libraries of fs and f elements

The fs elements, and the or elements that group pointers to them, can be collected together in one place, either in an external feature-structure library or internally within an fs.lib. The f elements, and the or elements that group pointers to them, can be collected in a similar way, the internal collection occurring within an f.lib.

Sample text markup

Here is a markup of the first sentence of Mary Robinson, Thoughts on the Condition of Women, corresponding to what appears in TEI AI1 W9, Sample document instance showing lexical markup with implied word lemmatization. In the following illustration, the lemmatization is made explicit by means of s elements. The hilited and lb elements have been omitted.

<![ CDATA [
<s id=w1>Custom<x fs t=nnsgm1c1></s>
<s id=w2>, <x fs t=puun></s>
<s id=w3>from <x fs t=pp></s>
<s id=w4>the <x fs t=ar></s>
<s id=w5>earlie&st; <x fs t=avg3></s>
<s id=w6>periods; <x fs t=nnplm1c0></s>
<s id=w7>of <x fs t=pp></s>
<s id=w8>antiquity<x fs t=nnsgm1c0></s>
<s id=w9>, <x fs t=puun></s>
<s id=w10>has <x fs t=vbr3sgprinx1></s>
<s id=w11>endeavored <x fs t=vbpapcx0></s>
<s id=w12>to <x fs t=sb></s>
<s id=w13>place <x fs t=vbnfx0></s>
<s id=w14>the <x fs t=ar></s>
<s id=w15>female <x fs t=ajde8></s>
<s id=w16>mind <x fs t=nnsgm1c0></s>
<s id=w17>in <x fs t=pp></s>
<s id=w18>the <x fs t=ar></s>
<s id=w19>subordinate <x fs t=ajde8></s>
<s id=w20>rank <x fs t=nnsgm1c0></s>
<s id=w21>of <x fs t=pp></s>
<s id=w22>intellectual <x fs t=ajde8></s>
<s id=w23>sociability<x fs t=nnsgm1c0></s>
<s id=w24>. <x fs t=puun></s>
]]
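Because the identifiers are built from fixed two-character codes, markup of this kind can be read back mechanically. The following sketch extracts the pointer targets with a regular expression and decodes them. The tables are partial and the names `parse` and `decode` are invented; the decoder assumes that a feature-name code followed by a single digit marks an underspecified value, in line with the identifier conventions above.

```python
import re

# Partial tables drawn from the coding scheme; enough for the excerpt below.
VALUES = {
    "nn": "noun", "aj": "adjective", "vb": "verb", "pu": "punctuation",
    "sg": "singular", "pl": "plural", "r3": "third",
    "m1": "common plus", "c0": "initial-capital minus",
    "c1": "initial-capital plus", "pr": "present", "in": "indicative",
    "x1": "auxiliary plus", "un": "unary",
}
NAMES = {"de": "degree"}
UNDERSPEC = {"0": "any", "1": "default", "7": "unknown",
             "8": "not.applicable", "9": "no.claim"}

def parse(text):
    """Return (id, word, target) for each s element."""
    return re.findall(r"<s id=(\w+)>(.*?)\s*<x fs t=(\w+)></s>", text)

def decode(ident):
    """Split an fs identifier into its codes and look each one up."""
    out, i = [], 0
    while i < len(ident):
        code = ident[i:i + 2]
        if code in NAMES and ident[i + 2:i + 3] in UNDERSPEC:
            out.append(f"{NAMES[code]}: {UNDERSPEC[ident[i + 2]]}")
            i += 3
        else:
            out.append(VALUES[code])
            i += 2
    return out

SAMPLE = ("<s id=w1>Custom<x fs t=nnsgm1c1></s> <s id=w2>, <x fs t=puun></s> "
          "<s id=w10>has <x fs t=vbr3sgprinx1></s> "
          "<s id=w15>female <x fs t=ajde8></s>")
for ident, word, target in parse(SAMPLE):
    print(ident, word, decode(target))
```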

The fs.lib and f.lib remain to be provided.

List of noncompound feature names and values by code

feature name or value        code
any                          00
default                      01
unknown                      97
not.applicable               98
no.claim                     99
accusative                   ac
agreement                    ag
adjective                    aj
article                      ar
auxiliary                    au
adverb                       av
initial-capital minus        c0
initial-capital plus         c1
case                         ca
coordinator                  cd
close                        cl
common                       co
degree                       de
dative                       dt
feminine                     fe
finite                       fi
positive                     g1
comparative                  g2
superlative                  g3
gender                       ge
gerund                       gr
genitive                     gt
interjection                 ij
initial-capital              il
imperative                   im
indicative                   in
common minus                 m0
common plus                  m1
masculine                    ma
mood                         md
matched                      mt
neuter                       ne
infinitive                   nf
nominative                   nm
noun                         nn
number                       nu
open                         op
orientation                  or
past                         pa
participle                   pc
person                       pe
plural                       pl
pronoun                      pn
possessive                   po
preposition                  pp
present                      pr
category (part-of-speech)    ps
particle                     pt
punctuation                  pu
first                        r1
second                       r2
third                        r3
possessive minus             s0
possessive plus              s1
subordinator                 sb
singular                     sg
subjunctive                  sj
tense                        te
unary                        un
verb                         vb
verb-form                    vf
auxiliary minus              x0
auxiliary plus               x1
=========================================================================
Date: Wed, 4 Dec 1991 15:55:00 GMT
Reply-To: Peter Flynn
Sender: "TEI-L: Text Encoding Initiative public discussion list"
From: Peter Flynn
Subject: Notes on P1

Text Encoding Initiative
Submission on the Guidelines (P1) from the CURIA Project

These comments pertain both to the structure and to the content of TEI P1. Readers should bear in mind that the majority of participants in the CURIA project will have no (or very little) previous experience of SGML, and that some of these comments are made with this factor in mind. [For asterisked comments see end note.]

A. Structure:

1. The division of the document into four major sections would be helpful (apart from the usual prefatorial matter): a. an overview of SGML, specifically its capabilities and applications; b. a technical introduction to SGML and the TEI proposals c. the TEI guideline proposal text (P2) d.
a TEI tutorial, structured to follow the guidelines, with copious examples. Sections (a) and (b) would need to be kept short. This structure would be of immense help to projects wishing to start using the proposals, who need both instructional material for hands-on participants as well as overview material for managers and funding bodies, who may need to be persuaded to use SGML. The content of sections 1-3 serve as a good model for the overview at (a), but needs extensive, gentler, rewording, with the technical details moved to (b). * 2. An annexe giving details of all known software which supports or purports to support SGML, both in the way of tools, and as applications. Clearly this has to be "at the time of writing" for a paper edition, but an electronically-retrievable version could be updated as a separate document. Pointers to publicly-available files demonstrating some applications of the TEI proposals would also be useful. 3. The index should be expanded to distinguish (typographically) between topics, mentions _en passant_, and actual tag names. 4. The document should be available electronically, both in SGML encoded form and as a file in some few other popular formats for the benefit of those interested parties and intending participants who are not yet equipped with SGML-sensitive software. (Much of this has of course already come up, doubtless more than once :-) B. Content: 1. The Poughkeepsie Principles (1.2) should be emphasized more, and it should be made clearer whether the TEI proposals are a subset, superset or cross-set of SGML. 2. The terminology is naturally complex, and requires more careful explanation, perhaps using typographic separation for the explanatory matter. The _locus classicus_ is #PCDATA, which means precisely the opposite of what the computer-literate reader would expect! Having said that, the majority of section 2 is a good exposition of the material at the technical level. 3. 
One view expressed was that requirements for the use of CONCUR are likely to be extensive. This feature may therefore need some closer study and explanation. 4. Typo p23 in machine text 9. It was felt that crystals are likely to assume a greater importance than the document betrays (5.3.12). 10. A similar argument applies to the use of CONCUR (5.6) [v.s] 11. In view of the recent development of SGML-sensitive browsing software (eg WWW) which can be used to further identify tagged items (particularly names, places and dates), the section (5.7) on such links needs substantial expansion. * 12. Annotation, such as the recording of physical marks not part of the text, needs more explanation. 13. The inclusion of the character set tables is wonderful. What about including the new 256-character set for TeX devised at the Cork conference? (as TeX seems to be fairly widely-used with SGML). I have a draft copy of this if needed. * ----------------- In the cooperative spirit of the Initiative, some resources may be available at the present author's site (UCC) for work in some of the above areas. Possible topics are asterisked. ///Peter Flynn Computer Centre University College Cork, Ireland +353 21 276871 x2609 +353 21 277194 (fax) ========================================================================= Date: Thu, 5 Dec 1991 09:19:46 +0100 Reply-To: Stig Johansson Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Stig Johansson Subject: timealignment I appreciate the work that has been done to align the timeline and alignment. The fact is, however, that the timeline and alignment remain distinct. What was alignment is now 'corresp.grp' and the timeline is renamed 'align'. In spite of this, the concept of the timeline is fundamentally the same. I will have nothing to say about 'corresp.grp' and will focus on the timeline. The most fundamental aspect of speech, as compared with writing, is that it evolves in time. 
If we need three tags (timeline, pointer, point) to cope with this specifically, it should be well worth it. It is possible that the concept could be generalized to handle 'any one-dimensional structure'. I would, however, prefer to hang on to the timeline idea. If it can be successfully applied to the representation of speech (which I am not yet convinced of), perhaps we can see later whether the notion can be generalised to cover other one- dimensional structures that text encoders might wish to represent. The timeline formalizes the representation of simultaneous events in a nice manner. Each event is related to a common scale outside the text. In existing schemes we find many ways of representing simultaneous events (by vertical alignment, devices like asterisks and parentheses, etc). The lack of agreement on coding is one problem - another is the frequent ambiguity and lack of explicitness. The timeline concept is attractive because it can be used flexibly. Some encoders may only wish to use the timeline to handle the specific problem of overlapping speech. Others may want to extend it to handle the duration of utterances, pauses, and events in general. The amount of specification with respect to time may vary, ranging from relative ordering to indications of absolute time. The main problems, as I see it, are (1) how we encode the text in the first place, with references by pointers and 'start' and 'end' attributes of elements to the points in the timeline and (2) how we actually make use of the encoded text and its associated timeline (for displaying simultaneous events, printing out the text, etc). I realize that the new alignment mechanism can do what the timeline did. I still prefer 'timeline' to 'align type=temporal'. The difference between the names 'point' and 'loc' is trivial. If 'timeline' is kept, so should 'point'. (But if there is a general feeling that we should go for 'align' and 'loc', I will not insist on the original names.) 
One major difference between the timeline proposal and the new alignment proposal is that 'xref' (an already existing tag) is used for 'pointer'. But we surely would like to be able to identify the 'xrefs' that point to the timeline, as opposed to other 'xrefs'. So what we get would be 'xref type=time' instead of 'pointer'. What is the difference, apart from an increase in bulk? In our proposal we recommend 'pointer' for cases which could not be handled by 'start' and 'end' attributes of the elements. The new alignment proposal seems to involve the use of a lot more 'xrefs' than we had pointers. Is there a sufficient gain in preciseness of encoding to justify this? Finally, the biggest difference between the timeline proposal and the new alignment mechanism. In the latter the points (locs) contain references summarizing the places in the text which refer to a particular point (whereas our 'point' is an empty element). This seems like a heavy apparatus to reveal what could be done by a search. More important, the references in a particular point (loc) would not summarize everything going on at a particular time (because in a lot of cases there would only be references to the timeline for the start and end of events). And we feel that points should be empty not least because they represent points in time (here I quote And Rosta, the member of our working group who is the main originator of the timeline idea). In general, I am not as pleased with our proposals as the above may indicate. It is significant that Brian MacWhinney, in a recent brief comment on our working paper, could not really see the usefulness of the proposals. Whatever solution is chosen, the encoding must be as simple as possible and must be relatable to encoding practice within existing schemes. 
Oslo, 5 Dec -91 Stig Johansson ========================================================================= Date: Thu, 5 Dec 1991 18:14:35 MST Reply-To: Terry Langendoen Sender: "TEI-L: Text Encoding Initiative public discussion list" From: Terry Langendoen Subject: comments on Stig Johansson's posting on timeline This is in partial reply to Stig Johansson's posting of 5 December on temporal alignment. It would not be at all difficult to separate out temporal alignment from one-dimensional spatial alignment, calling the first timeline, as in the Spoken Language Working Group's proposal (in AI2 W1), and the second, say, spaceline. We could then eliminate the type attribute from what I called align and deal with the fact that the two tags have nearly if not totally identical content models and attribute lists by means of a parameter entity, such as %align;. Regarding the use of xref for pointer, I was following up on a recommendation that came out of the Myrdal meeting to reduce the number of distinct tags whose function is to point. To deal with the problem of identifying what a particular xref is pointing to, I suggested adding an entity attribute, whose value is the name of the entity being pointed to. Thus an xref pointing to a u could look as follows: , or, if the values of element were given in the DTD as a fixed list, simply . My suggestion to treat loc not as an empty tag (as pointer is in AI2 W1) was based on the idea that it would be useful (but not mandatory) to be able to point from the timeline to the text as well as from the text to the timeline. In fact, the ability to point from the timeline to the text (as in the original alignment mechanism in P1) means that you can eliminate all temporal reference from the text. All (and I do mean all) of the alignment can be handled by pointers from the timeline to the text. I am prepared to show this in detail with the toy examples that we were using in Myrdal, and if there is interest I will post an example. 
(It will be offered for publication in P2 or the subsequent tutorials.) The same is true with spatial alignment; it is possible to indicate line breaks and all the other stuff that can seriously clutter up a document simply with pointers from spaceline.

And Rosta and I have been quietly conversing in the background, and I think I have been able to resolve all of our differences, except for what I take to be his ontological objection to loc not being an empty tag. The other outstanding difference, namely my elimination of the since attribute, I have since been persuaded to restore, and will post a revised DTD in which essentially all of the features of the original timeline proposal appear, plus some others.

Stig's final paragraph is tantalizing. While Brian MacWhinney's demurral may be significant it is not particularly helpful. What would he like that we haven't provided? And Stig, why are you not so pleased with your own proposals?
=========================================================================
Date: Tue, 10 Dec 1991 11:24:45 MST
Reply-To: Terry Langendoen
Sender: "TEI-L: Text Encoding Initiative public discussion list"
From: Terry Langendoen
Subject: working paper on feature structure logic

The logic of feature structures
Author: D. Terence Langendoen
Version for circulation on TEI-L file server on December 11, 1991.
DTL 9-11 December 1991 wrote document 0

Logic of feature structures

Compound forms of fs, f and atm

Steven Zepp and I came up with a set of conjunctive (grouping) and disjunctive (alternation) tags at the feature-structure (fs), feature (f), and atom (atm) levels, for which we suggest the following tag set. fs.grp fs.alt f.grp f.alt atm.grp atm.alt

The alt suffix is my idea. Steven and I were using or in our discussions, but I think alt is better as it suggests both alternation and alternatives.

The content model for each pair of tags is the same, so that we could, for economy and elegance, use the parameter entities %fs.cmpd;, %f.cmpd; and %atm.cmpd; for these pairs. To cover the tagset at each level, we could use the parameter entities %fs.family; (covering fs.grp, fs.alt and fs), %f.family; (covering f.grp, f.alt and f), and %atm.family; (covering atm.grp, atm.alt and atm).

In the case of fs.grp and fs.alt, the content is two or more occurrences of any of the following tags. fs.grp fs.alt fs xref The xref should point to an fs.grp, fs.alt, or fs. The possibility of xref limits SGML validation, see section .

The content of f.grp and f.alt is analogous to that of fs.grp, consisting of two or more occurrences of any of the following tags. f.grp f.alt f xref The xref should point to an f.grp, f.alt, or f; see also .

The content models for the four compound tags all provide for unbounded nesting. However, there is no need for subgrouping and subalternation at the atomic level, nor do we anticipate any need for pointers at this level. Hence the content model for atm.grp and atm.alt can simply be two or more occurrences of atm.

Noncompound forms of fs, f and atm

The content model for fs should be one or more occurrences of the following tags. f.grp f.alt f xref The xref should point to an f.grp, f.alt, or f; see also .

The content model for f should be one or more occurrences of the following tags.
fs.grp
fs.alt
fs
xref (this tag points to fs.grp, fs.alt, or fs; see also .)
atm.grp
atm.alt
atm
plus
minus
any
none (this tag replaces not.applicable)
default
no.claim
all
some
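The content models described so far amount to an allowed set of child tags plus a minimum count, which can be checked with a small table. The sketch below is an informal paraphrase of the prose, not a DTD; the names `CONTENT` and `valid_children` are invented. It looks only at tag names, so it cannot tell what an xref points to, which is the limit on validation mentioned above.

```python
FS_LEVEL = {"fs.grp", "fs.alt", "fs", "xref"}
F_LEVEL = {"f.grp", "f.alt", "f", "xref"}
F_VALUES = FS_LEVEL | {
    "atm.grp", "atm.alt", "atm", "plus", "minus",
    "any", "none", "default", "no.claim", "all", "some",
}

# parent tag -> (allowed child tags, minimum number of children)
CONTENT = {
    "fs.grp": (FS_LEVEL, 2), "fs.alt": (FS_LEVEL, 2),
    "f.grp": (F_LEVEL, 2), "f.alt": (F_LEVEL, 2),
    "atm.grp": ({"atm"}, 2), "atm.alt": ({"atm"}, 2),
    "fs": (F_LEVEL, 1),
    "f": (F_VALUES, 1),
}

def valid_children(parent, children):
    """Check child tag names against the content model of the parent."""
    allowed, minimum = CONTENT[parent]
    return len(children) >= minimum and all(c in allowed for c in children)

print(valid_children("fs.alt", ["fs", "xref"]))     # True
print(valid_children("atm.grp", ["atm"]))           # False: needs two or more
print(valid_children("atm.alt", ["atm", "xref"]))   # False: no pointers here
```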

Treating the underspecification values any, none, default, and no.claim as tags was suggested in my posting on lexical encoding. I suggest adding all and some for completeness. all means that all legal values are present; some indicates that some legal values are present. some differs from no.claim in that the latter includes the possibility that no legal values are present.

Following a suggestion of Gary Simons, in addition to the following tags: atm.grp atm.alt atm it may be desirable to add the following feature-value tagset. unit.grp unit.alt unit

The distinction between atm and unit is that the range of possible values for atm is assumed to be fixed, though perhaps large, whereas the range of possible values for unit is assumed not to be fixed. If this idea is implemented, the values for atm would be specified as values of an attribute that we can call value. The values of value would be entered as CDATA. On the other hand, the values for unit would be specified as content and entered as parsed character data (#PCDATA), just as the values for atm are now specified. In the following, I assume that this refinement is not made, and that the unit tagset is not defined.

Negation and its elimination

The recommendations in TEI P1 included f.s.not for expressing a negative feature value, which could be used recursively, like f.s.or and f.s.and. Steven and I concluded that for representational purposes, negation need not be recursive and could therefore be expressed as the value of an attribute at the atm and fs levels (that is, all tags in the atm and fs tagsets). We propose to name the relevant attribute domain and to permit its values to be self and cmp, with the default value being self.

Negation and its elimination from atm levels

Suppose that we wish to represent the lexical structure of a particular word in a document as a member of the word-class noun and as not having dative case. Using the domain attribute on atm, we could represent this structure as follows.

<! [ CDATA [ <fs> <f name=word-class><atm>noun <f name=case><atm cmp>dative </fs> ]]

Let us assume that the feature-structure declaration lists the following possible values for an atm occurring in an f whose name is case, when that f occurs in an fs together with an f whose name is word-class and whose atm has the content noun.
nominative genitive dative accusative instrumental
Then the preceding structure could be represented without the use of the cmp value of the domain attribute, as in the following; we say that the former structure has been reduced to the latter.

<! [ CDATA [ <fs> <f name=word-class><atm>noun <f name=case> <atm.alt> <atm>nominative <atm>genitive <atm>accusative <atm>instrumental </atm.alt> </fs> ]]

Under the same conditions, the first of the following two structures can be reduced to the second.

<! [ CDATA [ <fs> <f name=word-class><atm>noun <f name=case> <atm.alt cmp> <atm>nominative <atm>genitive </atm.alt> </fs> ]]

<! [ CDATA [ <fs> <f name=word-class><atm>noun <f name=case> <atm.alt> <atm>dative <atm>accusative <atm>instrumental </atm.alt> </fs> ]]

The two preceding structures are also equivalent to the following structure, though the latter is not a reduced structure, since the cmp value has not been eliminated.

<! [ CDATA [ <fs> <f name=word-class><atm>noun <f name=case> <atm.grp> <atm cmp>nominative <atm cmp>genitive </atm.grp> </fs> ]]>

Structures containing atm.grp cmp, such as the following, cannot in general be reduced.

<! [ CDATA [ <fs> <f name=word-class><atm>noun <f name=case> <atm.grp cmp> <atm>nominative <atm>genitive </atm.grp> </fs> ]]>

To say that a particular noun does not have both nominative and genitive case is not to say that it has any particular case or cases or alternation or grouping of cases.

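We assume that the domain attribute is also defined for the other nonstructured feature-value tags, with the following reductions.

<! [ CDATA [ <minus cmp> = <plus> ]]> <! [ CDATA [ <plus cmp> = <minus> ]]> <! [ CDATA [ <any cmp> = <none> ]]> <! [ CDATA [ <none cmp> = <any> ]]> <! [ CDATA [ <no.claim cmp> = <no.claim> ]]> <! [ CDATA [ <default cmp> = <atm>non-default-value-n (n = 1) ]]> <! [ CDATA [ <default cmp> = <atm.alt> <atm>non-default-value-1 ... <atm>non-default-value-n </atm.alt> (n > 1) ]]>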

On the other hand, the following nonstructured feature-value tags have no reduced versions.

<! [ CDATA [ <all cmp> ]]> <! [ CDATA [ <some cmp> ]]>

These specifications, however, should be considered equivalent, as any feature specified as not having all (possible) values can be understood as not having some particular value, and conversely.

Negation and its elimination from fs levels

Suppose that we wish to represent an agreement structure for a verb structure as not having both third person and singular number specifications. This could be done by providing the cmp value of the domain attribute on the fs which occurs as the value of an f, inside a larger fs, as follows.

<! [ CDATA [ <fs> <f name=word-class><atm>verb <f name=agreement> <fs cmp> <f name=person><atm>third <f name=number><atm>singular </fs> </fs> ]]>

To reduce the structure above, we first distribute the cmp value among the enclosed f elements, and enclose them all within f.alt, as follows.

<! [ CDATA [ <fs> <f name=word-class><atm>verb <f name=agreement> <fs> <f.alt> <f name=person><atm cmp>third <f name=number><atm cmp>singular </f.alt> </fs> </fs> ]]>

Second, we eliminate the cmp from the atms in accordance with the feature-structure declaration, as for example in the following structure.

<! [ CDATA [ <fs> <f name=word-class><atm>verb <f name=agreement> <fs> <f.alt> <f name=person> <atm.alt> <atm>first <atm>second </atm.alt> <f name=number><atm>plural </f.alt> </fs> </fs> ]]>

On the other hand, suppose that we wish to represent a lexical entry as not being a third person, singular number verb. This could be done by providing the cmp value of the domain attribute on the containing fs of the original agreement structure above and eliminating the cmp value from the contained fs. The resulting structure is shown below.

<! [ CDATA [ <fs cmp> <f name=word-class><atm>verb <f name=agreement> <fs> <f name=person><atm>third <f name=number><atm>singular </fs> </fs> ]]>

After a series of steps, this structure could be reduced to the structure shown below.

<! [ CDATA [ <fs> <f.alt> <f name=word-class> <atm.alt> <atm>noun <atm>adjective ... </atm.alt> <f name=agreement> <fs> <f.alt> <f name=person> <atm.alt> <atm>first <atm>second </atm.alt> <f name=number><atm>plural </f.alt> </fs> </f.alt> </fs> ]]>

If fs cmp contains f.alt, the latter must be converted to f.grp; conversely, if fs cmp contains f.grp, the latter must be converted to f.alt.

Finally, fs.grp cmp and fs.alt cmp are governed by the following equivalences.

<! [ CDATA [
<fs.alt cmp>             <fs.grp>
  <fs> ... </fs>           <fs cmp> ... </fs>
  <fs> ... </fs>     =     <fs cmp> ... </fs>
  ...                      ...
</fs.alt cmp>            </fs.grp>
]]>

<! [ CDATA [
<fs.grp cmp>             <fs.alt>
  <fs> ... </fs>           <fs cmp> ... </fs>
  <fs> ... </fs>     =     <fs cmp> ... </fs>
  ...                      ...
</fs.grp cmp>            </fs.alt>
]]>

The use of fs.lib and f.lib

We assume, following the recommendations of the Baltimore meeting of the AI1 group, that fss can appear anywhere within a text as an inclusion exception. When they do so, it is probably advisable not to use xref to point from within an fs to predefined substructures, as SGML validation is not possible. That is, SGML cannot enforce the restriction that xref must point to a member of the fs or f family, since the target attribute is specified as IDREF, which can occur on essentially any tag.

We suggest wherever possible that members of the fs family be gathered together into a separate file or subdocument, or designated part of the main document (say, as a daughter of tei.1) identified as fs.lib, and that these elements be pointed to at appropriate places within the text by means of xref.
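Since SGML itself cannot check that an xref lands on a member of the fs or f family, an application could perform that check after parsing. The following is only a sketch under stated assumptions: the library is modelled as a plain dictionary from ID to tag name, the family memberships (`FS_FAMILY`, `F_FAMILY`) are guesses based on the tag names used in this note, and `check_xrefs` is a hypothetical helper, not part of any TEI tool.

```python
# Assumed family memberships, inferred from the tag names in this note.
FS_FAMILY = {"fs", "fs.alt", "fs.grp"}
F_FAMILY = {"f", "f.alt", "f.grp"}


def check_xrefs(library, targets, allowed=FS_FAMILY | F_FAMILY):
    """Report xref targets that are missing or point outside the allowed tags.

    library: dict mapping an element ID to its tag name (hypothetical model
             of the contents of fs.lib / f.lib plus any other ID-bearing tags).
    targets: iterable of IDs used as xref targets.
    Returns a list of human-readable problem descriptions (empty if all is well).
    """
    problems = []
    for target in targets:
        tag = library.get(target)
        if tag is None:
            problems.append(f"{target}: no element with this ID")
        elif tag not in allowed:
            problems.append(f"{target}: points to <{tag}>, not an fs or f family member")
    return problems


if __name__ == "__main__":
    lib = {"fs1": "fs", "f7": "f.alt", "p3": "p"}
    for line in check_xrefs(lib, ["fs1", "f7", "p3", "zz"]):
        print(line)
```

Passing `allowed=FS_FAMILY` would restrict the check to pointers into fs.lib only.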

The grp suffix is on analogy with the suffix in corresp.grp. It is neutral between list, set and and, and thus seems appropriate if we have only one grouping set of tags instead of three, as in the original TEI P1 recommendations (namely f.s.and, f.list, and f.set). The alt suffix is my idea. Steven and I were using or in our discussions, but I think alt is better as it suggests both alternation and alternatives.

The content model for fs should be one or more occurrences of the following tags: f.grp, f.alt, f, xref. The xref should point to an f.grp, f.alt, or f.

Following a suggestion of Gary Simons, we may wish to add the following feature-value tagset, which we can refer to as %unit.family;: unit.grp, unit.alt, unit.

<! [ CDATA [ <fs> <f name=word-class><atm>noun <f name=case><atm cmp>dative </fs> ]]>

Let us assume that the feature-structure declaration lists the following possible values for an atm that is the content of an f whose name is case, where that f occurs in an fs together with another f whose name is word-class and whose atm content is noun: nominative, genitive, dative, accusative, instrumental. Then the structure in the example above could be represented without the use of the cmp value of the domain attribute, as in the example below; we say that the first structure has been reduced to the second.

<! [ CDATA [ <fs> <f name=word-class><atm>noun <f name=case> <atm.alt> <atm>nominative <atm>genitive <atm>accusative <atm>instrumental </atm.alt> </fs> ]]>

Under the same conditions, the first of the following two structures can be reduced to the second.

<! [ CDATA [ <fs> <f name=word-class><atm>noun <f name=case> <atm.alt cmp> <atm>nominative <atm>genitive </atm.alt> </fs> ]]> <! [ CDATA [ <fs> <f name=word-class><atm>noun <f name=case> <atm.alt> <atm>dative <atm>accusative <atm>instrumental </atm.alt> </fs> ]]>
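The reduction of atm cmp and atm.alt cmp amounts to taking a complement against the declared value list. A minimal sketch follows, assuming the declaration is available as an ordered Python list; `CASE_VALUES`, `reduce_atm_cmp` and `render` are hypothetical names introduced only for illustration.

```python
# Hypothetical stand-in for the value list that the feature-structure
# declaration supplies for f name=case inside a noun fs (values from the text).
CASE_VALUES = ["nominative", "genitive", "dative", "accusative", "instrumental"]


def reduce_atm_cmp(negated, declared):
    """Return the declared values that are NOT negated, in declaration order.

    negated:  values appearing under <atm cmp> or <atm.alt cmp>.
    declared: all values the declaration permits for this feature.
    """
    negated = set(negated)
    unknown = negated - set(declared)
    if unknown:
        raise ValueError(f"values not in declaration: {sorted(unknown)}")
    return [value for value in declared if value not in negated]


def render(values):
    """Render one value as <atm>, several as an <atm.alt> alternation."""
    if len(values) == 1:
        return f"<atm>{values[0]}"
    return "<atm.alt> " + " ".join(f"<atm>{v}" for v in values) + " </atm.alt>"


if __name__ == "__main__":
    # <atm cmp>dative
    print(render(reduce_atm_cmp(["dative"], CASE_VALUES)))
    # <atm.alt cmp> <atm>nominative <atm>genitive </atm.alt>
    print(render(reduce_atm_cmp(["nominative", "genitive"], CASE_VALUES)))
```

The sketch does not decide what to emit when every declared value is negated; that case is left open here, as it is in the text.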

The two preceding structures are also equivalent to the structure in the following example, though the latter is not a reduced structure, since the cmp value has not been eliminated.

<! [ CDATA [ <fs> <f name=word-class><atm>noun <f name=case> <atm.grp> <atm cmp>nominative <atm cmp>genitive </atm.grp> </fs> ]]>

Structures containing atm.grp cmp, such as in the following example, cannot in general be reduced.

<! [ CDATA [ <fs> <f name=word-class><atm>noun <f name=case> <atm.grp cmp> <atm>nominative <atm>genitive </atm.grp> </fs> ]]>

To say that a particular noun does not have both nominative and genitive case is not to say that it has any particular case or cases or alternation or grouping of cases.

We assume that the domain attribute is also defined for the other nonstructured feature-value tags, with the reductions given in the following examples.

<! [ CDATA [ <minus cmp> = <plus> ]]> <! [ CDATA [ <plus cmp> = <minus> ]]> <! [ CDATA [ <any cmp> = <none> ]]> <! [ CDATA [ <none cmp> = <any> ]]> <! [ CDATA [ <no.claim cmp> = <no.claim> ]]> <! [ CDATA [ <default cmp> = <atm>non-default-value-n (n = 1) ]]> <! [ CDATA [ <default cmp> = <atm.alt> <atm>non-default-value-1 ... <atm>non-default-value-n </atm.alt> (n > 1) ]]>
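These reductions are simple enough to tabulate. The sketch below is an illustration only: `SIMPLE_REDUCTIONS` and `reduce_simple_cmp` are hypothetical names, and the declared value list and default value are assumed to be supplied by the caller from the feature-structure declaration.

```python
# Fixed reductions taken from the equivalences above.
SIMPLE_REDUCTIONS = {
    "minus": "<plus>",
    "plus": "<minus>",
    "any": "<none>",
    "none": "<any>",
    "no.claim": "<no.claim>",
}


def reduce_simple_cmp(tag, declared=None, default=None):
    """Reduce <tag cmp> for a nonstructured feature-value tag.

    Returns the reduced markup as a string, or None for tags (all, some)
    that have no easily computed reduced version.
    declared/default are only needed for tag == "default".
    """
    if tag in SIMPLE_REDUCTIONS:
        return SIMPLE_REDUCTIONS[tag]
    if tag == "default":
        if declared is None or default is None:
            raise ValueError("default cmp needs the declared values and the default")
        others = [value for value in declared if value != default]
        if not others:
            raise ValueError("no non-default value is declared")
        if len(others) == 1:          # the n = 1 case
            return f"<atm>{others[0]}"
        # the n > 1 case: an alternation of all non-default values
        return "<atm.alt> " + " ".join(f"<atm>{v}" for v in others) + " </atm.alt>"
    if tag in ("all", "some"):
        return None
    raise ValueError(f"unknown tag: {tag}")


if __name__ == "__main__":
    print(reduce_simple_cmp("plus"))
    print(reduce_simple_cmp("default", ["singular", "plural"], "singular"))
```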

On the other hand, the nonstructured feature-value tags in the following two examples have no easily computed reduced versions, except in special cases.

<! [ CDATA [ <all cmp> ]]> <! [ CDATA [ <some cmp> ]]>

Negation and its elimination from fs levels

Suppose that we wish to represent an agreement structure for a verb structure as not having both third person and singular number specifications. This could be done by providing an fs cmp as the value of an f inside a larger fs, as in the following example.

<! [ CDATA [ <fs> <f name=word-class><atm>verb <f name=agreement> <fs cmp> <f name=person><atm>third <f name=number><atm>singular </fs> </fs> ]]>

To reduce the structure above, we first replace the fs cmp by an fs.alt which encloses two fss. The number of enclosed fss is equal to the number of f elements enclosed by the original fs cmp; if that tag encloses only one f, then it is simply replaced by an fs. In the first of these fss, we replace the value of the first f by its complement, and the value of the other f by no.claim; in the second, we do the opposite. The result is shown below.

<! [ CDATA [ <fs> <f name=word-class><atm>verb <f name=agreement> <fs.alt> <fs> <f name=person><atm cmp>third <f name=number><no.claim> </fs> <fs> <f name=person><no.claim> <f name=number><atm cmp>singular </fs> </fs.alt> </fs> ]]>

Second, we eliminate the cmp from the atms in accordance with the feature-structure declaration, and strengthen, if possible, the no.claim values. Let us assume that the FSD specifies first, second, and third as the possible values of f name=person; and singular and plural as the possible values of f name=number. Let us also assume that all combinations of these feature-value pairs are possible. Then the representation in the preceding example can be reduced to that in the following example.

<! [ CDATA [ <fs> <f name=word-class><atm>verb <f name=agreement> <fs.alt> <fs> <f name=person> <atm.alt> <atm>first <atm>second </atm.alt> <f name=number><any> </fs> <fs> <f name=person><any> <f name=number><atm>plural </fs> </fs.alt> </fs> ]]>
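The two reduction steps just described can be combined in a short sketch. It assumes, as the text does, that every combination of feature values is possible, so that each no.claim may be strengthened to any. The dictionary `FSD` and the function `reduce_fs_cmp` are hypothetical stand-ins for the feature-structure declaration and the reduction procedure.

```python
# Assumed declaration contents, taken from the values named in the text.
FSD = {
    "person": ["first", "second", "third"],
    "number": ["singular", "plural"],
}


def reduce_fs_cmp(features, fsd):
    """Reduce an fs cmp holding atomic features to the members of an fs.alt.

    features: list of (feature name, atomic value) pairs inside the fs cmp.
    fsd:      dict mapping each feature name to its declared values.
    Returns one dict per alternative fs. In the i-th alternative, the i-th
    feature carries the complement of its value (a list of atoms), and every
    other feature carries "any" (no.claim strengthened, which is only valid
    if all value combinations are possible).
    """
    alternatives = []
    for i, (_, negated_value) in enumerate(features):
        fs = {}
        for j, (name, _) in enumerate(features):
            if i == j:
                fs[name] = [v for v in fsd[name] if v != negated_value]
            else:
                fs[name] = "any"
        alternatives.append(fs)
    return alternatives


if __name__ == "__main__":
    for fs in reduce_fs_cmp([("person", "third"), ("number", "singular")], FSD):
        print(fs)
```

With a single feature inside the fs cmp the function returns one alternative, matching the remark that the fs cmp is then simply replaced by an fs.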

On the other hand, suppose that we wish to represent a lexical entry as not being a third person, singular number verb. This could be done by replacing the containing fs in the original agreement example above by fs cmp and replacing the contained fs cmp by fs. The resulting structure is shown in the following example.

<! [ CDATA [ <fs cmp> <f name=word-class><atm>verb <f name=agreement> <fs> <f name=person><atm>third <f name=number><atm>singular </fs> </fs> ]]>

On the assumption that the FSD specifies that f name=agreement has values only in fss which contain f name=word-class whose value is verb, and assuming the list of possible word-class (there called category) values in TEI AI1 W9, this structure could be reduced to the structure shown in the following example.

<! [ CDATA [ <fs.alt> <fs> <f name=word-class> <atm.alt> <atm>adjective <atm>adverb <atm>article <atm>coordinator <atm>interjection <atm>noun <atm>particle <atm>preposition <atm>pronoun <atm>punctuation <atm>subordinator </atm.alt> <f name=agreement><none> </fs> <fs> <f name=word-class><atm>verb <f name=agreement> <fs.alt> <fs> <f name=person> <atm.alt> <atm>first <atm>second </atm.alt> <f name=number><any> </fs> <fs> <f name=person><any> <f name=number><atm>plural </fs> </fs.alt> </fs> </fs.alt> ]]>

Further, if fs cmp contains f.alt, the latter must be converted to f.grp; conversely, if fs cmp contains f.grp, the latter must be converted to f.alt.
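The swap of alt and grp under cmp can be pictured as pushing a negation down a small tree. The following is a rough sketch that abstracts away from the actual tags: nodes are plain tuples, `"alt"` and `"grp"` stand for any of the .alt and .grp tags, and a leaf carries a flag saying whether it is marked cmp. The representation and the name `negate` are assumptions made for illustration.

```python
def negate(node):
    """Push a cmp marking down through alt/grp nodes.

    A node is one of:
      ("leaf", label, is_cmp)   an atomic item, possibly marked cmp
      ("alt", [children])       an alternation
      ("grp", [children])       a grouping
    Negating an alternation yields a grouping of negated children, and
    negating a grouping yields an alternation of negated children.
    """
    kind = node[0]
    if kind == "leaf":
        _, label, is_cmp = node
        return ("leaf", label, not is_cmp)
    if kind in ("alt", "grp"):
        swapped = "grp" if kind == "alt" else "alt"
        return (swapped, [negate(child) for child in node[1]])
    raise ValueError(f"unknown node kind: {kind}")


if __name__ == "__main__":
    tree = ("alt", [("leaf", "a", False), ("leaf", "b", False)])
    print(negate(tree))
```

Applying `negate` twice returns the original tree, which is a quick sanity check on any hand-worked conversion.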

Finally, fs.grp cmp and fs.alt cmp are governed by the following equivalences, which are analogues of De Morgan's laws.

<! [ CDATA [
<fs.alt cmp>             <fs.grp>
  <fs> ... </fs>           <fs cmp> ... </fs>
  <fs> ... </fs>     =     <fs cmp> ... </fs>
  ...                      ...
</fs.alt cmp>            </fs.grp>
]]>

<! [ CDATA [
<fs.grp cmp>             <fs.alt>
  <fs> ... </fs>           <fs cmp> ... </fs>
  <fs> ... </fs>     =     <fs cmp> ... </fs>
  ...                      ...
</fs.grp cmp>            </fs.alt>
]]>

The use of fs.lib and f.lib

Similarly, we suggest that individual features be gathered together into a separate file, subdocument or part of the main document, identified as f.lib, and that these elements be pointed to at appropriate places within the elements occurring in the fs.lib. Moreover, whenever a feature value is a member of the fs family, it can be replaced by a pointer to the appropriate element in fs.lib.

=========================================================================
Date: Sun, 15 Dec 1991 06:11:54 CST
Reply-To: "Eric Johnson DSU, Madison, SD 57042"
Sender: "TEI-L: Text Encoding Initiative public discussion list"
From: "Eric Johnson DSU, Madison, SD 57042"
Subject: CALL FOR PAPERS

SIXTH INTERNATIONAL CONFERENCE on SYMBOLIC and LOGICAL COMPUTING
DAKOTA STATE UNIVERSITY
MADISON, SOUTH DAKOTA
OCTOBER 15 - 16, 1992

ICEBOL6, the Sixth International Conference on Symbolic and Logical Computing, is designed for teachers, scholars, and programmers who want to meet to exchange ideas about computer programming for non-numeric applications -- especially those in the humanities. In addition to a focus on SNOBOL4, SPITBOL, and Icon, ICEBOL6 invites presentations on textual and logical processing in a variety of programming languages such as Prolog and C. Topics of discussion will include artificial intelligence and expert systems, and a wide range of analyses of texts in English and other natural languages. Parallel tracks of concurrent sessions are planned. ICEBOL's coffee breaks, social hours, lunches, and banquet will provide a series of opportunities for participants to meet and informally exchange information.

CALL FOR PAPERS

Abstracts (250-750 words) of proposed papers to be read at ICEBOL6 are invited in any area of non-numeric programming. Planned sessions include the following:

    analysis of texts (including bibliography, concordance, and index generation)
    artificial intelligence and expert systems
    computational linguistics
    computer languages and compilers designed for non-numeric processing
    electronic texts and encoding
    grammar and style checkers
    linguistic and lexical analysis (including parsing and machine translation)
    music analysis
    preparation of texts for publishing

Papers must be in English and may not exceed twenty minutes reading time. Abstracts should be received by March 1, 1992. Notification of acceptance (based on recommendations of readers) will follow promptly. Papers will be published in ICEBOL6 Proceedings.

Presentations at previous ICEBOL conferences were made by Paul Abrahams (ACM President), Gene Amdahl (Andor Systems), Robert Dewar (New York University), Mark Emmer (Catspaw, Inc.), James Gimpel (Lehigh), Ralph Griswold (Arizona), Susan Hockey (Oxford), Nancy Ide (Vassar) and many others. Copies of the ICEBOL5 Proceedings are available.

FOR FURTHER INFORMATION

All correspondence including abstracts of proposed papers as well as requests for registration materials may be sent to:

    Eric Johnson
    ICEBOL Director
    114 Beadle Hall
    Dakota State University
    Madison, SD 57042 U.S.A.

Inquiries, abstracts, and correspondence are encouraged via electronic mail, and they may be sent to Eric Johnson at: ERIC@SDNET.BITNET or johnsone@dsuvax.dsu.edu