The Divergloss XML

Chusslove Illich

<caslav.ilic@gmx.net>

Abstract

This document informally describes the Divergloss XML format. It can be read as a tutorial and usage guide to Divergloss, with some design rationales provided along the way.

Warning

Divergloss is work in progress, and should be considered experimental at this moment.

Table of Contents

1. Design Considerations
2. Basics
3. Top Structure
4. Concepts
5. Terms
6. Key Definitions
7. Text Markup
A. Distribution
B. Acknowledgments

1. Design Considerations

Divergloss, short for "Diversity Glossary", is an XML format for handling glossary data: succinct descriptions connected with terms which name certain concepts. A necessary question to answer up front is: why yet another XML format for glossaries? For example, there is the TBX format ("TermBase eXchange") by LISA, used widely in the localization industry and submitted for adoption as an ISO standard. Another option is the glossary document type of the Docbook format. To answer this, we must consider the goals behind Divergloss:

Producers of glossaries should not need in-depth understanding of linguistics or markup languages. The glossary data comes from "ordinary" fields, rather than e.g. linguistic research or natural language AI.
Advanced glossary features are hidden when not needed, rather than requiring a lot of scaffolding for simple glossaries. This relates to both the format as such, and tools needed to validate and produce it.
The format should be easily readable and editable by humans. While special GUI applications can present glossary data in a nicer form for user consumption, the plain XML source should make intuitive sense.
The same glossary may be used in various contexts within the same language. There may be several terms naming the same concept, not only synonymously, but according to the user's environment. The same holds for the concept descriptions.
Glossaries should evolve by contributions of many people, and be updated according to various external sources of glossary data. A form of version control on the file level is assumed, but some intrinsic support for this process should be provided as well.

Conversely, Divergloss does not place much importance on the following:

Wide generality, with features for custom specialization according to the field and desired levels of representation depth. Instead, common needs are supported by many built-in tags in a rather flat hierarchy.
Strict adherence to the XML data representation practices, when these would impair readability and editability. For example, some records may be encoded as special strings, rather than long sequences of nodes and subnodes.

In essence, Divergloss favors simplicity and ease of getting involved, both in terms of cognitive effort and of needed tools, at the expense of generality and formal correctness as viewed from a pure data-encoding standpoint. This is in contrast with the aforementioned TBX, which is intended as the be-all and end-all of glossary formats, and in practice is mostly handled through features of dedicated glossary or CAT (Computer-Aided Translation) tools. The Docbook glossary format, on the other hand, while offering similar simplicity, is too lightweight to cover all envisaged uses of Divergloss.

It is, however, possible that Divergloss could be made an instance of TBX, i.e. one of its XCS ("eXtensible Constraint Specification") customizations. This would facilitate the conversion of Divergloss into TBX glossaries, for use by existing CAT tools. Some sort of conversion to TBX is, of course, possible in any case, but no effort to that end has been undertaken as of yet.

2. Basics

Divergloss strives for language neutrality. There is no limitation on how many languages a single glossary document may support, nor on the precise fashion in which it does so. This need is highly dependent on locale and field of use: a glossary of pre-Columbian civilizations for American users may get away with being fully monolingual, a glossary of oceanic freight for German users may state English terms too, while a glossary of astronomy for Tunisian users may need to be fully Arabic-French bilingual (both descriptions and terms).

Within each language, Divergloss supports another level of diversity: in different environments, e.g. within two large companies doing similar business, the same concepts may be named differently. An outsider may thus consider the terms synonymous, but a user within one of those environments should be primarily presented with the term native to it. "Fully" synonymous terms, in this sense, are those with both the same language and the same environment.

Aside from language and environment, each description and term can be equipped with a variety of supplementary data. For example, the description may have an additional comment on the source of the text, while a base term may come with a few declensions which would be hard for the user to guess by the general rules of the language.

Bearing the previous passages in mind, the entries in Divergloss documents are organized by concepts, not a priori linked to any of the terms. Instead, each concept is referenced by a unique key, and contains any number of descriptions and terms. Here is the simplest example of a glossary with two terms in it:

<?xml version="1.0" encoding="UTF-8"?>
<glossary id="cosmogloss" lang="en">

  <metadata>
    <title>A Short Cosmic Glossary</title>
  </metadata>

  <keydefs>
    <languages>
      <language id="en">
        <name>English</name>
        <shortname>En.</shortname>
      </language>
    </languages>
  </keydefs>

  <concepts>

    <concept id="blackhole">
      <desc>
        The leftover core of a super massive star after a supernova,
        that exerts a tremendous gravitational pull.</desc>
      <term>black hole</term>
    </concept>

    <concept id="quasar">
      <desc>
        A distant energy source which gives off vast amounts of
        radiation, including radio waves and X-rays.</desc>
      <term>quasar</term>
    </concept>

  </concepts>

</glossary>

The glossary top node states the unique glossary identifier, by the id attribute. The default language used in the glossary is given by the lang attribute, which applies to all text in the document where not locally overridden. The metadata node contains general info about the glossary, like title, description, etc. All keys in a Divergloss glossary, such as the language values of the lang attributes, are defined by the keydefs node.

Glossary entries are grouped under the concepts node. Each concept node has an id attribute, a key which uniquely defines the concept. Each concept can contain any number of descriptions and terms, defined by desc and term nodes; each may override the default language using the lang attribute.

Let us now add terms in a few other languages:

<concept id="blackhole">
  <desc>
    The leftover core of a super massive star after a supernova,
    that exerts a tremendous gravitational pull.</desc>
  <term>black hole</term>
  <term lang="fr">trou noir</term>
  <term lang="de">Schwarzes Loch</term>
</concept>

The term without a lang attribute is in English, which was stated as the default language, whereas the two other terms override the language to French and German. In a fully multilingual case, description nodes too could be stated in several languages in the same way.
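
The inheritance of the default language can be sketched in a few lines of Python. This is only an illustration built on the standard xml.etree.ElementTree module, not on the packaged dg module; the function name effective_terms is made up for the example, and the glossary is trimmed to a single concept.

```python
import xml.etree.ElementTree as ET

GLOSSARY = """\
<glossary id="cosmogloss" lang="en">
  <metadata><title>A Short Cosmic Glossary</title></metadata>
  <concepts>
    <concept id="blackhole">
      <term>black hole</term>
      <term lang="fr">trou noir</term>
      <term lang="de">Schwarzes Loch</term>
    </concept>
  </concepts>
</glossary>
"""

def effective_terms(xml_text):
    """Map each concept key to a list of (language, term) pairs."""
    root = ET.fromstring(xml_text)
    default_lang = root.get("lang")
    result = {}
    for concept in root.iter("concept"):
        pairs = []
        for term in concept.findall("term"):
            # A local lang attribute overrides the glossary-wide default.
            lang = term.get("lang", default_lang)
            pairs.append((lang, " ".join(term.text.split())))
        result[concept.get("id")] = pairs
    return result

print(effective_terms(GLOSSARY))
```

A real client would also have to honor lang attributes on desc nodes and on the extended term nodes described later.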

How about different environments? Here's a tentative example from a computer glossary:

<concept id="directory">
  <desc>
    An entity in a file system which contains a group of files
    and other directories.</desc>
  <term>directory</term>
  <term env="mac">folder</term>
  <term env="amiga">drawer</term>
</concept>

Here, the first term, as before, defines no environment. The other two terms define an environment by use of the env attribute. The value of the env attribute is not text that may be shown to the user, but an environment key. It must be defined elsewhere in the document, together with the proper environment name (see the environments node). As with language, it depends on the client how the environment information will be used when presenting concepts to the user. For example, if the user is reading the glossary on an Amiga, the application may show "drawer" as the primary term.
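
How a client might choose the primary term is left open by the format. One possible policy, sketched here with the standard xml.etree.ElementTree module, is to prefer the term native to the user's environment and otherwise fall back to the first term that names no environment. The function name primary_term and the fallback rule are assumptions of this example, and a default env on the glossary node is not considered.

```python
import xml.etree.ElementTree as ET

CONCEPT = """\
<concept id="directory">
  <term>directory</term>
  <term env="mac">folder</term>
  <term env="amiga">drawer</term>
</concept>
"""

def primary_term(concept, user_env=None):
    """Return the term native to user_env, else the first term without env."""
    fallback = None
    for term in concept.findall("term"):
        envs = term.get("env", "").split()  # env may hold several keys
        if user_env is not None and user_env in envs:
            return term.text.strip()
        if not envs and fallback is None:
            fallback = term.text.strip()
    return fallback

concept = ET.fromstring(CONCEPT)
print(primary_term(concept, "amiga"))
print(primary_term(concept, "unix"))
```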

In the last example, the description text itself mentioned the word "directories". If the user is using an Amiga, shouldn't they see "drawers" instead? When this occurs in a multi-environment glossary, the descriptions too can be specialized by environment. Furthermore, the terms should be properly cross-referenced, so that the client may allow the user to click on the term in the description and go to the corresponding concept. Putting it all together, we get:

<concept id="directory">
  <desc>
    An entity in a <ref c="filesystem">file system</ref> which contains
    a group of <ref c="file">files</ref> and other directories.</desc>
  <desc env="mac">
    An entity in a <ref c="filesystem">file system</ref> which contains
    a group of <ref c="file">files</ref> and other folders.</desc>
  <desc env="amiga">
    An entity in a <ref c="filesystem">file system</ref> which contains
    a group of <ref c="file">files</ref> and other drawers.</desc>
  <term>directory</term>
  <term env="mac">folder</term>
  <term env="amiga">drawer</term>
</concept>

The ref node points to a concept key by its c attribute. If descriptions are expected to be duplicated frequently only because terms differ by environment, the terse embedded selection (see Text Markup) can be used instead.
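
As an illustration of what a client can do with such markup, the following sketch extracts the plain text of a description and the cross-references found in it. It uses only the standard xml.etree.ElementTree module; the helper names plain_text and references are invented for the example.

```python
import xml.etree.ElementTree as ET

DESC = """\
<desc>
  An entity in a <ref c="filesystem">file system</ref> which contains
  a group of <ref c="file">files</ref> and other directories.</desc>
"""

def plain_text(node):
    """Text of the node with markup dropped and whitespace collapsed."""
    return " ".join("".join(node.itertext()).split())

def references(node):
    """List of (concept key, referencing phrase) pairs, in document order."""
    return [(ref.get("c"), plain_text(ref)) for ref in node.iter("ref")]

desc = ET.fromstring(DESC)
print(plain_text(desc))
print(references(desc))
```

The list of references is what a client would need in order to turn each phrase into a link to the corresponding concept.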

Descriptions and terms may also credit the person who added them into the glossary. This person is an editor of the glossary, not necessarily the one who wrote the description, or coined the term. The editor's credit is assigned using the by attribute, for example:

<desc by="hal">
  An instruction by a superior factor which, if improperly formulated,
  can cause a sentient being to undertake unethical actions.</desc>
<term by="hal">order</term>

where "hal" is the person key, possibly made out of the person's initials. These keys and corresponding editors' real names and contact data are defined within the editors node.

3. Top Structure

A Divergloss document is divided into the following top-level sections:

<glossary id="glosskey" lang="xx" env="yyyy">
  <metadata>
    ...
  </metadata>

  <keydefs>
    ...
  </keydefs>

  <concepts>
    ...
  </concepts>
</glossary>

A particular combination of values of the id, lang, and env attributes should uniquely determine the glossary within the ecosystem of other published Divergloss glossaries. The lang and env attributes are not mandatory; if provided, their values apply to all subnodes where meaningful and not locally overridden.

The metadata node provides information about the glossary in general. These include the following child nodes:

title: The title of the glossary.
desc: The description of the glossary.
version: The release version. There are no constraints or recommendations on the versioning scheme.
date: The release date. The format is YYYY-MM-DD regardless of the languages of the glossary; it is the client's duty to format it for presentation according to the user's locale.

All child nodes except the title are optional. The title and desc can have the attributes of language (lang) and environment (env), and can be repeated for unique combinations of those. The values of these attributes are keys defined by the keydefs node.

The main body of the glossary, concepts and terms naming them, is given by the concepts node. This node is mandatory -- not much point in a glossary without a single concept.

Within metadata and concepts, many keyword-valued attributes may be used, which denote global data in the glossary: languages, environments, editors, topics, etc. These keywords are defined by the keydefs node, where the user-presentable info on them is also stated. This node is not mandatory, but will be needed for all but the simplest glossaries.

The keydefs and concepts nodes can actually appear more than once, each instance containing the data as described previously. In this way the document can be chunked into files, such that each file still remains valid XML. E.g. files containing groups of concepts, categorized in some way, can all start with their own concepts root element.

4. Concepts

Concept nodes reside within the concepts node of the glossary:

<glossary id="glosskey" lang="en">
  ...
  <concepts>

    <concept id="ckey1">
      ...
    </concept>

    <concept id="ckey2">
      ...
    </concept>

    ...

  </concepts>
  ...
</glossary>

The ordering of concepts is not important, and each concept key, given by id attribute, must be unique. The keys are best chosen mnemonically, for easier crossreferencing.
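
Since concepts may be spread over several concepts nodes, a uniqueness check has to look across all of them. A minimal sketch of such a check, using the standard library only (the function name duplicate_concept_keys is hypothetical, and this is not how the packaged tools are known to do it):

```python
import xml.etree.ElementTree as ET
from collections import Counter

GLOSSARY = """\
<glossary id="glosskey" lang="en">
  <metadata><title>Example</title></metadata>
  <concepts>
    <concept id="ckey1"><term>first</term></concept>
    <concept id="ckey2"><term>second</term></concept>
  </concepts>
  <concepts>
    <concept id="ckey1"><term>first again</term></concept>
  </concepts>
</glossary>
"""

def duplicate_concept_keys(root):
    """Return the sorted concept keys that are defined more than once."""
    counts = Counter(concept.get("id") for concept in root.iter("concept"))
    return sorted(key for key, count in counts.items() if count > 1)

duplicates = duplicate_concept_keys(ET.fromstring(GLOSSARY))
print(duplicates)
```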

The concept node may have the following optional attributes:

topic: A list of topic keys under which this concept may be grouped. The topic keys and names are defined within the topics node.
level: Usage level of the concept, as in basic, intermediate, advanced, etc. among the users of the terminology described by the glossary. The level value is a key, defined with corresponding level name by the levels node.
related: Reference to closely related concepts, given by a list of concept keys.
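
Because all three attributes hold keys, a tool can check that every key used is actually defined somewhere in the document. The sketch below shows one way to do this with the standard xml.etree.ElementTree module; the function name undefined_keys and the sample glossary (with key definitions trimmed to their names) are assumptions of the example.

```python
import xml.etree.ElementTree as ET

GLOSSARY = """\
<glossary id="spacegloss" lang="en">
  <metadata><title>Example</title></metadata>
  <keydefs>
    <topics>
      <topic id="apollo"><name>The Apollo Program</name></topic>
    </topics>
    <levels>
      <level id="newbie"><name>Newcomers</name></level>
    </levels>
  </keydefs>
  <concepts>
    <concept id="saturnv"
             topic="apollo spacerace" level="newbie" related="n1 proton">
      <term>Saturn V</term>
    </concept>
    <concept id="n1"><term>N1</term></concept>
  </concepts>
</glossary>
"""

def undefined_keys(root):
    """Return sorted (concept key, attribute, missing key) triples."""
    defined = {
        "topic": {node.get("id") for node in root.iter("topic")},
        "level": {node.get("id") for node in root.iter("level")},
        "related": {node.get("id") for node in root.iter("concept")},
    }
    problems = []
    for concept in root.iter("concept"):
        for attr, known in defined.items():
            # List-valued attributes separate their keys by whitespace.
            for key in concept.get(attr, "").split():
                if key not in known:
                    problems.append((concept.get("id"), attr, key))
    return sorted(problems)

print(undefined_keys(ET.fromstring(GLOSSARY)))
```

Here the topic key spacerace and the related concept proton are reported, since neither is defined in the sample.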

An example of a concept node equipped with several attributes:

<concept id="saturnv"
         topic="apollo spacerace" level="newbie" related="n1 proton">
  <desc>
    A multistage liquid-fuel expendable rocket used by NASA's Apollo
    and Skylab programs. Popularly known as the Moon Rocket.</desc>
  <term>Saturn V</term>
</concept>

All child nodes of the concept may have the attributes lang, env, and by. The client should use these attributes to decide, possibly based on the execution environment, how and which information to present to the user.

Furthermore, description and term nodes may have the src attribute, which unlike the by attribute, defines the source of description or term: another glossary, publication in the field, etc. The source value is a key, with the source data behind it defined by the sources node.

There can be one or several description nodes, or even none. When two descriptions have the same values of the lang and env attributes, they are to be considered different takes on the same explanation. Similarly, the concept may be named by one or several terms (with term nodes being detailed in their own section); but there may also be no terms, for concepts which are as yet unnamed. The following is a valid concept definition:

<concept id="alexq" topic="arch">
  <desc>The quality without a name.</desc>
</concept>

Other than descriptions and terms, the concept node can have the following child nodes:

details: Reference to external information about the concept, explaining it in more detail. The source of the information is pointed to using two attributes: the rel attribute states the relative path, while the root attribute is a key identifying the root of the path. The root keys and related names and data retrieval instructions are defined in the extroots node. The text content of the node, if non-empty, is a free-form remark (only for special cases, generally not necessary).
media: Non-text resource of value to the conveyance of the concept. This could be, for example, an image of the embodiment of the concept. The media file is pointed to with attributes just like for the details node. The text content is the caption of the data, which can be empty, but should be provided nevertheless.
origin: Information on when, where, how, and by whom the concept was originally formulated, introduced, or demonstrated.
comment: Editor's comment on the concept. For example, doubts on the accuracy or wording of the description, topic qualification, etc. Several editors may add their own comments.

5. Terms

Each concept may be named by several terms, given by the term child nodes of the concept node. In the simplest case of a traditional glossary, these terms would be synonymous. In a Divergloss glossary, clients should consider as synonymous only those terms with equal language and environment attributes.

All attributes to the term node are optional, and are as follows:

lang: The language of the term. The value is a list of language codes, as defined by the languages node.
env: The environment in which this term is used. The value is a list of environment keys, as defined by the environments node.
by: The editor who added the term into the glossary. Value is a key defined by the editors node.
src: The source of the term: an organization, publication, another glossary, a person. The value is a list of source keys, as defined by the sources node.
gr: Any grammatical categories to which the term may belong. These can be, for example, gender for a noun, aspect for a verb, etc. The value is a list of category keys, defined by the grammar node.
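
The synonymy rule stated at the start of this section can be sketched as grouping terms by their effective language and environment. The example below treats both attributes as whitespace-separated lists of keys and compares them as sets, which is an interpretation rather than something the format spells out. The function name synonym_groups and the extra terms "catalog" and "Verzeichnis" are made up for illustration.

```python
import xml.etree.ElementTree as ET

CONCEPT = """\
<concept id="directory">
  <term>directory</term>
  <term>catalog</term>
  <term env="mac">folder</term>
  <term lang="de">Verzeichnis</term>
</concept>
"""

def synonym_groups(concept, default_lang="", default_env=""):
    """Group terms that share the same effective language and environment."""
    groups = {}
    for term in concept.findall("term"):
        # Missing attributes inherit the defaults passed in by the caller.
        lang = frozenset(term.get("lang", default_lang).split())
        env = frozenset(term.get("env", default_env).split())
        groups.setdefault((lang, env), []).append(" ".join(term.text.split()))
    return list(groups.values())

groups = synonym_groups(ET.fromstring(CONCEPT), default_lang="en")
print(groups)
```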

Terms may sometimes need additional information, in which case the extended eterm node is used. It has the same attributes as the ordinary term node, but branches into child nodes, where the nominal form of the term is stated by the nom child node:

<eterm>
  <nom>phenomenon</nom>
  ...
</eterm>

Other, optional child nodes of the extended term include:

stem: Especially for inflected languages, it may be useful to know the stem of the nominal form of the term. Clients may use it for processing user queries into the glossary. Only one stem node is allowed.
decl: A particular declension of the term: cases, genders, moods, etc. The declension category is stated by the mandatory gr attribute (like for the term node itself), which holds one of the grammar keys defined by the grammar node. There can be as many declension nodes as needed. Several declensions may be of the same grammar category, which means that they are all acceptable variations of that category.
origin: Text describing the history behind the term. Optional attributes are by (the editor who added the text), src as the key of the source from where the information was obtained, as well as lang and env when differing from that of the term. There can be more than one origin node, possibly offering alternative views.
comment: Editor's comment on the term. Optional attributes are by, stating the editor's key, and lang and env, in case the language or environment of the comment are different from that of the term. There can be several comments, by the same or different editors.
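
A client could collect the nominal form, stem, and declensions of an extended term roughly as follows. This is a sketch using the standard xml.etree.ElementTree module; the function name read_eterm and the shape of the returned dictionary are choices of the example, not part of the format.

```python
import xml.etree.ElementTree as ET

ETERM = """\
<eterm>
  <nom>phenomenon</nom>
  <decl gr="plu">phenomena</decl>
  <stem>phenomen</stem>
</eterm>
"""

def read_eterm(node):
    """Collect the nominal form, the stem, and declensions by grammar key."""
    declensions = {}
    for decl in node.findall("decl"):
        # Several declensions may share a grammar key, so keep a list.
        declensions.setdefault(decl.get("gr"), []).append(decl.text.strip())
    stem = node.findtext("stem")
    return {
        "nom": node.findtext("nom").strip(),
        "stem": stem.strip() if stem is not None else None,
        "decl": declensions,
    }

eterm = read_eterm(ET.fromstring(ETERM))
print(eterm)
```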

An example of a term with some of the extended data:

<eterm>
  <nom>phenomenon</nom>
  <decl gr="plu">phenomena</decl>
  <stem>phenomen</stem>
  <origin src="dictcom">
    From Greek "phainómenon", over Late Latin "phaenomenon",
    to appear.</origin>
</eterm>

6. Key Definitions

All the various keys used in concepts and terms are collected and defined within this node, by sections:

<keydefs>
  <languages>
    ...
  </languages>
  <environments>
    ...
  </environments>
  ...
</keydefs>

The keys themselves are always defined by the id attribute of the respective key definition node.

Each node that has a text value, within any of the key definition sections, may be equipped with lang and env attributes, and repeated for different combinations of them. Clients should use this info to select the text to present to the user as a description behind a particular key.

The key definition sections are as follows:

languages

The languages used within the glossary, as applied by the lang attribute. The definition of a language provides its full and short name:

<languages>
  <language id="en">
    <name>English</name>
    <shortname>En.</shortname>
  </language>
  <language id="fr">
    <name>French</name>
    <shortname>Fr.</shortname>
  </language>
  ...
</languages>

Language identifiers should follow the codes from ISO 639, when available. This is important for relating the language to a system locale, such that language-dependent processing (e.g. alphabetical sorting) may be correctly performed.

environments

Usage environments for text content, as applied by the env attribute. For each environment the full and short name are given, and the description of the environment:

<environments>
  <environment id="unix">
    <name>Unix</name>
    <shortname>U.</shortname>
    <desc>
      A computer operating system originally developed in 1969
      by a group of AT&amp;T employees...</desc>
  </environment>
  ...
</environments>

If the environment is also one of the concepts, note the difference between the description here and the concept description: environment's description may provide info on the terminology aspects of the environment (if none are needed, the environment description may just point to the concept by a ref node).

An environment may specify terminology-wise close environments: if a term is not defined in the present environment, another from a close environment can be used as if it were its own. Close environments are specified by a list of environment keys in the closeto attribute of the environment node. The list order matters: the first environment is considered the closest, etc.
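
The fallback through close environments might be implemented along these lines. The sketch assumes that only directly listed close environments are tried (no transitive closure), and that a term without any environment serves as the last resort; neither rule is stated by the format. The environment key osx and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

KEYDEFS = """\
<keydefs>
  <environments>
    <environment id="mac"><name>Macintosh</name></environment>
    <environment id="amiga"><name>Amiga</name></environment>
    <environment id="osx" closeto="mac"><name>Mac OS X</name></environment>
  </environments>
</keydefs>
"""

CONCEPT = """\
<concept id="directory">
  <term>directory</term>
  <term env="mac">folder</term>
  <term env="amiga">drawer</term>
</concept>
"""

def close_environments(keydefs):
    """Map each environment key to its ordered list of close environments."""
    return {env.get("id"): env.get("closeto", "").split()
            for env in keydefs.iter("environment")}

def term_for_env(concept, env, closeto):
    """Pick a term for env, trying close environments in their stated order."""
    terms = [(term.get("env", "").split(), " ".join(term.text.split()))
             for term in concept.findall("term")]
    for candidate in [env] + closeto.get(env, []):
        for envs, text in terms:
            if candidate in envs:
                return text
    # Last resort: the first term that names no environment, if any.
    for envs, text in terms:
        if not envs:
            return text
    return None

closeto = close_environments(ET.fromstring(KEYDEFS))
concept = ET.fromstring(CONCEPT)
print(term_for_env(concept, "osx", closeto))
```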

Clients may sometimes need to pick one environment among others, or to order them in a certain way. Two additional attributes may be specified to guide clients in this. The meta attribute states that the environment is not a true environment (e.g. it may be an umbrella for several environments), and takes one of the truth values 1|y|yes|t|true. The weight attribute specifies the environment's priority, whose interpretation depends on the use case, and takes numbers from 0 to 9 as values (0 is the default).

editors

People who are, or were at one point, adding and modifying the content of the glossary. These keys are applied using the by attribute. An editor definition contains the name and short name (usually initials), email address, affiliation, and description:

<editors>
  <editor id="hjjr">
    <name>Henry Jones, Jr.</name>
    <shortname>IJ</shortname>
    <email>henry.jones.jr@barnett.edu</email>
    <affiliation>
      Barnett College, visiting professor</affiliation>
    <desc>
      Dr. Jones is an eminent archaeologist, who teaches at
      Barnett College in New York...</desc>
  </editor>
  ...
</editors>

Email address, affiliation and description are optional.

sources

Sources which the editors use to assemble the glossary, as applied by the src attribute. A source can be just about anything: a publication, an institution, a person, etc. Each source defines its full and short name, description, email address, and a URL:

<sources>
  <source id="wp">
    <name>Wikipedia, the Free Encyclopedia</name>
    <shortname>Wp.</shortname>
    <url>http://en.wikipedia.org</url>
    <desc>
      A free, multilingual, open content encyclopedia project
      operated by the non-profit Wikimedia Foundation...</desc>
  </source>
  <source id="jbl">
    <name>J. Bigshot Linguist</name>
    <shortname>JBL</shortname>
    <email>jbl@allknow.edu</email>
    <desc>
      Esteemed and prolific originator and commentator of many
      of the terms found within this glossary...</desc>
  </source>
  ...
</sources>

Email address and URL are optional.

topics

The topics to which the concepts belong, as applied by the topic attribute of the concept. The definition contains the full and short name, and a description:

<topics>
  <topic id="apollo">
    <name>The Apollo Program</name>
    <shortname>Apollo</shortname>
    <desc>
      The Apollo program was a human spaceflight program
      undertaken by NASA during the years...</desc>
  </topic>
  ...
</topics>

levels

Usage levels applied to the concepts by the level attribute. They are defined by the full and short name, and a description:

<levels>
  <level id="basic">
    <name>Basic Concepts</name>
    <shortname>basic</shortname>
    <desc>
      The concepts that every user should know about.</desc>
  </level>
  ...
</levels>

The description is optional.

grammar

Grammar categories for terms and declensions, as given by their gr attributes. Each is defined by the full and short name, and a description:

<grammar>
  <gramm id="pl">
    <name>plural</name>
    <shortname>pl.</shortname>
    <desc>The plural form of the word.</desc>
  </gramm>
  ...
</grammar>

The description is optional.

extroots

External locations of more detailed information on the concept, as referenced by the details child node of a concept. An external root is defined by its full and short name, description, the URL root to which relative paths are appended (given by the rel attribute in concepts), and a URL for manual browsing:

<extroots>
  <extroot id="rloc">
    <name>Local files</name>
    <shortname>loc.</shortname>
    <rooturl>file:///usr/share/thisgloss/data</rooturl>
    <desc>Files on local disk, installed by this glossary.</desc>
  </extroot>
  <extroot id="rwp">
    <name>Wikipedia</name>
    <shortname>Wp.</shortname>
    <rooturl>http://en.wikipedia.org/wiki</rooturl>
    <browseurl>http://en.wikipedia.org</browseurl>
    <desc>Links to articles on Wikipedia.</desc>
  </extroot>
  ...
</extroots>

The URL for manual browsing is not mandatory.
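
Resolving a details node against its external root could look like the sketch below. The exact joining rule is an assumption (here, a single slash between the root URL and the relative path); the details node with rel="Black_hole" and both function names are invented for the example.

```python
import xml.etree.ElementTree as ET

EXTROOTS = """\
<extroots>
  <extroot id="rwp">
    <name>Wikipedia</name>
    <shortname>Wp.</shortname>
    <rooturl>http://en.wikipedia.org/wiki</rooturl>
  </extroot>
</extroots>
"""

DETAILS = '<details root="rwp" rel="Black_hole"/>'

def root_urls(extroots):
    """Map each external root key to its root URL."""
    return {node.get("id"): node.findtext("rooturl").strip()
            for node in extroots.iter("extroot")}

def details_url(details, roots):
    """Append the relative path of a details node to its root URL."""
    root = roots[details.get("root")]
    # Assumption: exactly one slash separates the root and the path.
    return root.rstrip("/") + "/" + details.get("rel").lstrip("/")

url = details_url(ET.fromstring(DETAILS), root_urls(ET.fromstring(EXTROOTS)))
print(url)
```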

7. Text Markup

Some nodes may contain larger bodies of text, where additional markup is advantageous (e.g. referencing). Such nodes are desc, comment, origin, etc. The following markup can be applied within text contents of such nodes:

ref: A reference to a concept defined within the glossary. It wraps a phrase indicative of the concept, and points to a concept using the c attribute (the value being the key of the concept).
em: An emphasis on a word or a phrase.
ol: A word or a phrase in a language other than that of the surrounding text. Must have a lang attribute stating the language of the phrase. An optional attribute is wl, which if present indicates that the short language name should be formatted together with the phrase (its value must be one of 1|y|yes|t|true).
link: Link to an external resource. The URL of the resource is given by the url attribute, which is mandatory.

Although glossary texts should be kept short and to the point, sometimes the text content could still be long enough to warrant splitting into several paragraphs, and other higher level groupings. All nodes which could reasonably benefit from such structure have an l* variant, which contain structured text content:

<ldesc>
  <para>
    A huge cloud which is thought to surround our solar system and
    reach over halfway to the nearest star.</para>
  <para>
    Comets originate in the Oort cloud.</para>
</ldesc>

Such nodes are: ldesc, lcomment, and lorigin. These can be used everywhere instead of their simpler counterparts. For the moment, the only structuring elements are paragraphs (the para nodes), but more may be introduced in the future.

Sometimes a lot of text may need duplicating due to a single phrase in it differing across environments, typically in description nodes -- see an earlier example. To prevent this duplication, clients will support special embedded text selection by environment. Using embedded selection, the mentioned example can be rewritten as:

<concept id="directory">
  <desc>
    An entity in a <ref c="filesystem">file system</ref> which
    contains a group of <ref c="file">files</ref> and other
    ~directories|mac:folders|amiga:drawers~.</desc>
  <term>directory</term>
  <term env="mac">folder</term>
  <term env="amiga">drawer</term>
</concept>

i.e. the embedded selector is of the form ~env1:phrase1|env2:phrase2|...~, where if one of the environment keys is empty or omitted (as in the example), that phrase inherits the surrounding text's environment. Instead of a single environment key, a whitespace separated list can also be given. The tilde character (~) cannot be a part of ordinary text by itself, but it can be escaped by doubling it (~~). This kind of special-form selection is unusual by XML standards, but has been introduced due to being more human-readable and editable in the running text than e.g. a selection node with subnodes per environment: <select><for env="env1">phrase1</for><for env="env2">phrase2</for>...</select>.
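
To make the selector rules concrete, here is a minimal resolver written with the standard re module. It is a sketch of the rules as described above, not the parser used by the packaged tools: the function name resolve_selectors is hypothetical, it works on text that has already been extracted from the XML, and it does not handle a phrase that itself contains a colon or a vertical bar.

```python
import re

# Either the "~~" escape, or one selector body enclosed in single tildes.
_SELECTOR = re.compile(r"~~|~([^~]*)~")

def resolve_selectors(text, env=None):
    """Replace each embedded selector with the phrase matching env."""
    def pick(match):
        body = match.group(1)
        if body is None:
            return "~"  # "~~" stands for a literal tilde
        default = None
        for alternative in body.split("|"):
            keys, colon, phrase = alternative.partition(":")
            if not colon:
                # No key given: the whole alternative is the phrase.
                keys, phrase = "", alternative
            if not keys.strip():
                # Empty or omitted key: phrase for the surrounding environment.
                if default is None:
                    default = phrase
            elif env is not None and env in keys.split():
                return phrase
        return default if default is not None else ""
    return _SELECTOR.sub(pick, text)

TEXT = "a group of files and other ~directories|mac:folders|amiga:drawers~."
print(resolve_selectors(TEXT, "amiga"))
print(resolve_selectors(TEXT))
```

Falling back to the key-less phrase when the requested environment is not listed is a choice made for this example.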

A. Distribution

Divergloss is distributed in a package which, aside from the format definition and documentation, contains command-line tools for processing Divergloss glossaries into various end-user formats, and requires minimal installation fuss. This enables users to quickly start writing glossary data and putting it to practical use.

Since Divergloss is still early in its development, the best place to get the package from is GitHub, by cloning the Git repository:

$ git clone https://github.com/caslav-ilic/divergloss.git

This will create the directory divergloss/ with the complete repository. In it there will be a README file with short setup instructions. The repository can later be updated to the newest version at any time by issuing:

$ cd divergloss/
$ git pull

In the package there is a Python module, dg, which provides easy access to glossary content and to functionality frequently needed for manipulating glossary data. While e.g. XSLT is very succinct for straightforward mappings of XML data, building glossary outputs (among other things) may be much more demanding than that, and therefore more easily tackled with a general-purpose programming language such as Python. Not least, Python's ease of use and rich variety of modules make any special processing of glossaries that much more viable.

The packaged dgproc.py script is one immediate user of the dg module. It operates by pushing Divergloss files through sieves, which build outputs and perform other operations on the glossary. In the basic mode, when run with the glossary file as the single argument, dgproc.py will validate the glossary, reporting also the problems not discoverable by DTD validation. If the glossary file is gloss.xml, then executing:

$ dgproc.py gloss.xml

will give no output if the glossary is technically valid.

The list of applicable sieves may be seen by issuing the --list-sieves (-S) option. For example, if the glossary contains terms in English (en) and German (de), the html-bidict sieve may be used to create an embeddable HTML dictionary table, with collapsible concept descriptions:

$ dgproc.py html-bidict gloss.xml -solang:en -stlang:de -sfile:gloss.html

where -s... options issue sieve parameters. Or, to create a TBX glossary file for use in tools that can make use of it (e.g. a translation editor may automatically issue terminology recommendations):

$ dgproc.py tbx gloss.xml -sfile:gloss.tbx

The list of parameters for each sieve may be seen by following the sieve name with the --help-sieves (-H) option. Each sieve is described in more detail in the dg.sieve module documentation contained in the package.

B. Acknowledgments

The following sources were used when making up the examples:

Wikipedia, the Free Encyclopedia.
StarChild, NASA's learning center for young astronomers.
barnettcollege.com, the official website for the development of a freeware point&click adventure "Indiana Jones and The Fountain of Youth", by Screen 7.