Abstract
This document informally describes the Divergloss XML format. It can be read as a tutorial and usage guide to Divergloss, with some design rationales provided along the way.
Divergloss is a work in progress and should be considered experimental at this time.
Divergloss, short for "Diversity Glossary", is an instance of XML for handling glossary data -- succinct descriptions connected with terms which name certain concepts. A necessary question to answer up front is, why yet another XML format for glossaries? For example, there is the TBX format ("TermBase eXchange") by LISA, used widely in the localization industry and submitted for adoption as an ISO standard. Another option is the glossary document type of the DocBook format. To answer this, we must consider the goals behind Divergloss:
Producers of glossaries should not need in-depth understanding of linguistics or markup languages. The glossary data comes from "ordinary" fields, rather than e.g. linguistic research or natural language AI.
Advanced glossary features are hidden when not needed, rather than requiring a lot of scaffolding for simple glossaries. This relates to both the format as such, and tools needed to validate and produce it.
The format should be easy for humans to read and edit. While special GUI applications can present glossary data in a nicer form for user consumption, the plain XML source should make intuitive sense.
The same glossary may be used in various contexts within the same language. There may be several terms naming the same concept, not only synonymously, but according to the user's environment. The same holds for the concept descriptions.
Glossaries should evolve by contributions of many people, and be updated according to various external sources of glossary data. A form of version control on the file level is assumed, but some intrinsic support for this process should be provided as well.
Conversely, Divergloss does not place much importance on the following:
Wide generality, with features for custom specialization according to the field and desired levels of representation depth. Instead, common needs are supported by many built-in tags in a rather flat hierarchy.
Strict adherence to the XML data representation practices, when these would impair readability and editability. For example, some records may be encoded as special strings, rather than long sequences of nodes and subnodes.
In essence, Divergloss favors simplicity and ease of getting involved, both in terms of cognitive effort and needed tools, at the expense of generality and formal correctness as viewed from a pure data-encoding standpoint. This is in contrast with the aforementioned TBX, which is intended as the end-all of glossary formats, and in practice is mostly handled through features of dedicated glossary or CAT (Computer-Aided Translation) tools. The DocBook glossary format, on the other hand, while offering similar simplicity, is too lightweight to cover all envisaged uses of Divergloss.
It is, however, possible that Divergloss could be made an instance of TBX, i.e. one of its XCS ("eXtensible Constraint Specification") customizations. This would facilitate the conversion of Divergloss into TBX glossaries, for use by existing CAT tools. Some sort of conversion to TBX is, of course, possible in any case, but no effort to that end has been undertaken as of yet.
Divergloss strives for language neutrality. There is no limitation on how many languages a single glossary document may support, nor on precisely how it supports them. This need is highly dependent on locale and field of use: a glossary of pre-Columbian civilizations for American users may get away with being fully monolingual, a glossary of oceanic freight for German users may state English terms too, while a glossary of astronomy for Tunisian users may need to be fully Arabic-French bilingual (both descriptions and terms).
Within each language, Divergloss supports another level of diversity: in different environments, e.g. within two large companies doing similar business, the same concepts may be named differently. An outsider may thus consider them synonymous, but a user within one of those environments should be primarily presented with the term native to it. "Fully" synonymous terms, in this sense, share both the language and the environment.
Aside from language and environment, each description and term can be equipped with a myriad of supplementary data. For example, the description may have an additional comment on the source of the text, while a base term may come with a few declensions which would be hard for the user to guess by general rules of the language.
Bearing the previous passages in mind, the entries in Divergloss documents are organized by concepts, not a priori linked to any of the terms. Instead, each concept is referenced by a unique key, and contains any number of descriptions and terms. Here is the simplest example of a glossary with two terms in it:
<?xml version="1.0" encoding="UTF-8"?>
<glossary id="cosmogloss" lang="en">
  <metadata>
    <title>A Short Cosmic Glossary</title>
  </metadata>
  <keydefs>
    <languages>
      <language id="en">
        <name>English</name>
        <shortname>En.</shortname>
      </language>
    </languages>
  </keydefs>
  <concepts>
    <concept id="blackhole">
      <desc>
        The leftover core of a super massive star after a supernova,
        that exerts a tremendous gravitational pull.</desc>
      <term>black hole</term>
    </concept>
    <concept id="quasar">
      <desc>
        A distant energy source which gives off vast amounts of radiation,
        including radio waves and X-rays.</desc>
      <term>quasar</term>
    </concept>
  </concepts>
</glossary>
The glossary
top node states the unique glossary identifier, by the id
attribute. The default language used in the glossary is given by lang
attribute, which applies to all text in the document where not locally overridden. The metadata
node contains general info about the glossary, like title, description, etc. All keys in a Divergloss glossary, such as the language values of the lang
attributes, are defined by the keydefs
node.
Glossary entries are grouped under the concepts
node. Each concept
node has an id
attribute, a key which uniquely defines the concept. Each concept can contain any number of descriptions and terms, defined by desc
and term
nodes; each may override the default language using the lang
attribute.
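As a rough illustration of this layout, the following Python sketch reads a trimmed copy of the example with the standard xml.etree.ElementTree module and resolves the language of each description and term against the glossary default. It does not use the dg module shipped with Divergloss; the names list_concepts and _text are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example glossary, with one overriding lang attribute.
GLOSSARY = """\
<glossary id="cosmogloss" lang="en">
  <concepts>
    <concept id="blackhole">
      <desc>The leftover core of a super massive star after a supernova.</desc>
      <term>black hole</term>
      <term lang="fr">trou noir</term>
    </concept>
    <concept id="quasar">
      <desc>A distant energy source.</desc>
      <term>quasar</term>
    </concept>
  </concepts>
</glossary>
"""

def _text(node):
    """Return the node's text, nested markup included, with whitespace collapsed."""
    return " ".join("".join(node.itertext()).split())

def list_concepts(xml_text):
    """Map each concept key to its (language, text) terms and descriptions."""
    root = ET.fromstring(xml_text)
    default_lang = root.get("lang")
    concepts = {}
    for concept in root.iter("concept"):
        concepts[concept.get("id")] = {
            "terms": [(t.get("lang", default_lang), _text(t))
                      for t in concept.findall("term")],
            "descs": [(d.get("lang", default_lang), _text(d))
                      for d in concept.findall("desc")],
        }
    return concepts

concepts = list_concepts(GLOSSARY)
print(concepts["blackhole"]["terms"])
```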
Let us now add terms in a few other languages:
<concept id="blackhole">
  <desc>
    The leftover core of a super massive star after a supernova,
    that exerts a tremendous gravitational pull.</desc>
  <term>black hole</term>
  <term lang="fr">trou noir</term>
  <term lang="de">Schwarzes Loch</term>
</concept>
The term without a lang
attribute is in English, which was stated as the default language, whereas the two other terms override the language to French and German. In a fully multilingual case, description nodes too could be stated in several languages in the same way.
How about different environments? Here's a tentative example from a computer glossary:
<concept id="directory">
  <desc>
    An entity in a file system which contains a group of files
    and other directories.</desc>
  <term>directory</term>
  <term env="mac">folder</term>
  <term env="amiga">drawer</term>
</concept>
Here, the first term, as before, defines no environment. The other two terms define an environment using the env
attribute. The value of the env
attribute is not a text that may be shown to the user, but an environment key. It must be defined elsewhere in the document, together with the proper environment name (see the environments
node). As with language, it depends on the client how the environment information will be used when presenting concepts to the user. For example, if the user is reading the glossary on an Amiga, the application may show "drawer" as the primary term.
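Since the selection policy is left to the client, the following Python sketch shows only one possible choice: prefer a term tagged with the user's environment, and otherwise fall back to the first term that states no environment. The function name primary_term is hypothetical, and the sketch uses the standard xml.etree.ElementTree module rather than the dg module.

```python
import xml.etree.ElementTree as ET

CONCEPT = """\
<concept id="directory">
  <term>directory</term>
  <term env="mac">folder</term>
  <term env="amiga">drawer</term>
</concept>
"""

def primary_term(concept, user_env=None):
    """Pick a term for user_env, else the first term without an env attribute."""
    fallback = None
    for term in concept.findall("term"):
        envs = term.get("env", "").split()
        if user_env in envs:
            return term.text
        if not envs and fallback is None:
            fallback = term.text
    return fallback

concept = ET.fromstring(CONCEPT)
print(primary_term(concept, "amiga"))
```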
In the last example, the description text itself mentioned the word "directories". If the user is using an Amiga, shouldn't they see "drawers" instead? When this occurs in a multi-environment glossary, the descriptions too can be specialized by environment. Furthermore, the terms should be properly cross-referenced, so that the client may allow the user to click on the term in the description and go to the corresponding concept. Putting it all together, we get:
<concept id="directory">
  <desc>
    An entity in a <ref c="filesystem">file system</ref> which contains
    a group of <ref c="file">files</ref> and other directories.</desc>
  <desc env="mac">
    An entity in a <ref c="filesystem">file system</ref> which contains
    a group of <ref c="file">files</ref> and other folders.</desc>
  <desc env="amiga">
    An entity in a <ref c="filesystem">file system</ref> which contains
    a group of <ref c="file">files</ref> and other drawers.</desc>
  <term>directory</term>
  <term env="mac">folder</term>
  <term env="amiga">drawer</term>
</concept>
The reference node tag ref
points to a concept key by its c
attribute. If descriptions would frequently be duplicated only because the terms differ by environment, the terse embedded selection can be used instead.
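Because every reference must point at an existing concept key, a consistency check is easy to sketch. The Python snippet below, using only the standard xml.etree.ElementTree module, lists reference targets that name no concept in the document. The sample glossary and the function name dangling_refs are invented for the illustration; the validation that dgproc.py performs may work differently.

```python
import xml.etree.ElementTree as ET

# Made-up sample: "filesystem" is referenced but never defined.
GLOSSARY = """\
<glossary id="compgloss" lang="en">
  <concepts>
    <concept id="file">
      <desc>A named block of data in a <ref c="filesystem">file system</ref>.</desc>
      <term>file</term>
    </concept>
    <concept id="directory">
      <desc>An entity which contains a group of <ref c="file">files</ref>.</desc>
      <term>directory</term>
    </concept>
  </concepts>
</glossary>
"""

def dangling_refs(root):
    """Return the sorted concept keys that ref nodes point to but nobody defines."""
    known = {c.get("id") for c in root.iter("concept")}
    used = {r.get("c") for r in root.iter("ref") if r.get("c") is not None}
    return sorted(used - known)

missing = dangling_refs(ET.fromstring(GLOSSARY))
print(missing)
```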
Descriptions and terms may also credit the person who added them into the glossary. This person is an editor of the glossary, not necessarily the one who wrote the description, or coined the term. The editor's credit is assigned using the by
attribute, for example:
<desc by="hal">
  An instruction by a superior factor which, if improperly formulated,
  can cause a sentient being to undertake unethical actions.</desc>
<term by="hal">order</term>
where "hal"
is the person key, possibly made out of the person's initials. These keys and corresponding editors' real names and contact data are defined within the editors
node.
A Divergloss document is divided into the following top-level sections:
<glossary id="glosskey" lang="xx" env="yyyy">
  <metadata>
    ...
  </metadata>
  <keydefs>
    ...
  </keydefs>
  <concepts>
    ...
  </concepts>
</glossary>
A particular combination of values of the id
, lang
, and env
attributes should uniquely determine the glossary within the ecosystem of other published Divergloss glossaries. The lang
and env
attributes are not mandatory; if provided, their values apply to all subnodes where meaningful and not locally overridden.
The metadata
node provides information about the glossary in general. It may contain the following child nodes:
title
The title of the glossary.
desc
The description of the glossary.
version
The release version. There are no constraints or recommendations on the versioning scheme.
date
The release date. The format is YYYY-MM-DD regardless of the languages of the glossary; it is the client's duty to format it for presentation according to the user's locale.
All child nodes except the title
are optional. The title
and desc
can have the attributes of language (lang
) and environment (env
), and can be repeated for unique combinations of those. The values of these attributes are keys defined by the keydefs
node.
The main body of the glossary, concepts and terms naming them, is given by the concepts
node. This node is mandatory -- not much point in a glossary without a single concept.
Within metadata and concepts, many keyword-valued attributes may be used, which denote global data in the glossary: languages, environments, editors, topics, etc. These keywords are defined by the keydefs
node, where also the user-presentable info on them is stated. This node is not mandatory, but will be needed for all but the simplest glossaries.
keydefs
and concepts
nodes can actually appear more than once, each instance containing the data as described previously. In this way the document can be chunked into files, such that each file remains valid XML. E.g. files containing groups of concepts, categorized in some way, can all start with their own concepts
root element.
Concept nodes reside within the concepts
node of the glossary:
<glossary id="glosskey" lang="en">
  ...
  <concepts>
    <concept id="ckey1">
      ...
    </concept>
    <concept id="ckey2">
      ...
    </concept>
    ...
  </concepts>
  ...
</glossary>
The ordering of concepts is not important, and each concept key, given by the id
attribute, must be unique. The keys are best chosen mnemonically, for easier crossreferencing.
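The uniqueness requirement can be checked with a few lines of Python. The sketch below counts concept keys across all concepts nodes of a document, since that node may appear more than once. It relies only on the standard library, and the function name duplicate_concept_keys is an assumption, not part of the dg module.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample with the key "ckey1" defined twice.
GLOSSARY = """\
<glossary id="glosskey" lang="en">
  <concepts>
    <concept id="ckey1"><term>first</term></concept>
    <concept id="ckey2"><term>second</term></concept>
  </concepts>
  <concepts>
    <concept id="ckey1"><term>first again</term></concept>
  </concepts>
</glossary>
"""

def duplicate_concept_keys(root):
    """Return the sorted concept keys that occur more than once."""
    counts = Counter(c.get("id") for c in root.iter("concept") if c.get("id"))
    return sorted(key for key, n in counts.items() if n > 1)

duplicates = duplicate_concept_keys(ET.fromstring(GLOSSARY))
print(duplicates)
```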
The concept node may have the following optional attributes:
topic
A list of topic keys under which this concept may be grouped. The topic keys and names are defined within the topics
node.
level
Usage level of the concept, as in basic, intermediate, advanced, etc. among the users of the terminology described by the glossary. The level value is a key, defined with corresponding level name by the levels
node.
related
Reference to closely related concepts, given by a list of concept keys.
An example of a concept node equipped with several attributes:
<concept id="saturnv" topic="apollo spacerace" level="newbie"
         related="n1 proton">
  <desc>
    A multistage liquid-fuel expendable rocket used by NASA's Apollo
    and Skylab programs. Popularly known as the Moon Rocket.</desc>
  <term>Saturn V</term>
</concept>
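List-valued attributes such as topic and related appear in the example as whitespace-separated keys. Assuming that reading, the following Python sketch builds a topic index and reports related keys that name no concept in the document. The helper names are hypothetical and the second concept is added only to make the output interesting.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

CONCEPTS = """\
<concepts>
  <concept id="saturnv" topic="apollo spacerace" level="newbie" related="n1 proton">
    <term>Saturn V</term>
  </concept>
  <concept id="n1" topic="spacerace">
    <term>N1</term>
  </concept>
</concepts>
"""

def keys(node, attribute):
    """Split a list-valued attribute into keys (assumed whitespace-separated)."""
    return node.get(attribute, "").split()

def concepts_by_topic(root):
    """Map each topic key to the concept keys grouped under it."""
    index = defaultdict(list)
    for concept in root.iter("concept"):
        for topic in keys(concept, "topic"):
            index[topic].append(concept.get("id"))
    return dict(index)

def unknown_related(root):
    """Map concept keys to related keys that are not defined in this document."""
    known = {c.get("id") for c in root.iter("concept")}
    result = {}
    for concept in root.iter("concept"):
        unknown = [k for k in keys(concept, "related") if k not in known]
        if unknown:
            result[concept.get("id")] = unknown
    return result

root = ET.fromstring(CONCEPTS)
by_topic = concepts_by_topic(root)
unknown = unknown_related(root)
print(by_topic)
print(unknown)
```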
All child nodes of the concept may have the attributes lang
, env
, and by
. The client should use these attributes to decide, possibly based on the execution environment, how and which information to present to the user.
Furthermore, description and term nodes may have the src
attribute, which unlike the by
attribute, defines the source of description or term: another glossary, publication in the field, etc. The source value is a key, with the source data behind it defined by the sources
node.
There can be one or several description nodes, or even none. When two descriptions have the same values of lang
and env
attributes, they are to be considered as different takes on the same explanation. Similarly, the concept may be named by one or several terms (with term nodes being detailed in their own section); but there may also be no terms, for concepts which are as yet unnamed. The following is a valid concept definition:
<concept id="alexq" topic="arch">
  <desc>The quality without a name.</desc>
</concept>
Other than descriptions and terms, a concept node can have the following child nodes:
details
Reference to external information about the concept, explaining the concept in more detail. The source of the information is pointed to using two attributes: the rel
attribute states the relative path, while the root
attribute is a key identifying the root of the path. The root keys and related names and data retrieval instructions are defined in the extroots
node. The text content of the node, if non-empty, is a free-form remark (only for special cases, generally not necessary).
media
Non-text resource of value to the conveyance of the concept. This could be, for example, an image of the embodiment of the concept. The media file is pointed to with attributes just like for the details
node. The text content is the caption of the data, which can be empty, but should be provided nevertheless.
origin
Information on when, where, how, and by whom the concept was originally formulated, introduced, or demonstrated.
comment
Editor's comment on the concept. For example, doubts on the accuracy or wording of the description, topic qualification, etc. Several editors may add their own comments.
Each concept may be named by several terms, given by term
child nodes of the concept
node. In the simplest case of a traditional glossary, these terms would be synonymous. In a Divergloss glossary, clients should consider as synonymous only those terms with equal language and environment attributes.
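The synonymy rule can be sketched as a grouping of terms by their effective language and environment. The Python below is one possible reading of that rule: values missing on a term are inherited from defaults passed by the caller, and attribute values are compared as sets of keys. The terms "catalog" and "Verzeichnis" are added purely for illustration, and synonym_groups is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

CONCEPT = """\
<concept id="directory">
  <term>directory</term>
  <term>catalog</term>
  <term env="mac">folder</term>
  <term lang="de">Verzeichnis</term>
</concept>
"""

def synonym_groups(concept, default_lang=None, default_env=None):
    """Group terms that share both language and environment."""
    groups = defaultdict(list)
    for term in concept.findall("term"):
        lang = frozenset((term.get("lang") or default_lang or "").split())
        env = frozenset((term.get("env") or default_env or "").split())
        groups[(lang, env)].append(term.text)
    return list(groups.values())

groups = synonym_groups(ET.fromstring(CONCEPT), default_lang="en")
print(groups)
```

Only "directory" and "catalog" end up in the same group here; the other two terms name the same concept but are not synonyms in the Divergloss sense.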
All attributes to the term
node are optional, and are as follows:
lang
The language of the term. The value is a list of language codes, as defined by the languages
node.
env
The environment in which this term is used. The value is a list of environment keys, as defined by the environments
node.
by
The editor who added the term into the glossary. Value is a key defined by the editors
node.
src
The source of the term: an organization, publication, another glossary, a person. The value is a list of source keys, as defined by the sources
node.
gr
Any grammatical categories to which the term may belong. These can be, for example, gender for a noun, aspect for a verb, etc. The value is a list of category keys, defined by the grammar
node.
Terms may sometimes need additional information, in which case the extended eterm
node is used. It has the same attributes as the ordinary term
node, but branches into child nodes, where the nominal form of the term is stated by the nom
child node:
<eterm>
  <nom>phenomenon</nom>
  ...
</eterm>
Other, optional child nodes of the extended term include:
stem
Especially for inflected languages, it may be useful to know the stem of the nominal form of the term. Clients may use it for processing user queries into the glossary. Only one stem node is allowed.
decl
A particular declension of the term: cases, genders, moods, etc. The declension category is stated by the mandatory gr
attribute (like for the term
node itself), which holds one of the grammar keys defined by the grammar
node. There can be as many declension nodes as needed. Several declensions may be of the same grammar category, which means that they are all an acceptable variation of that category.
origin
Text describing the history behind the term. Optional attributes are by
(the editor who added the text), src
as the key of the source from where the information was obtained, as well as lang
and env
when differing from that of the term. There can be more than one origin node, possibly offering alternative views.
comment
Editor's comment on the term. Optional attributes are by
, stating the editor's key, and lang
and env
, in case the language or environment of the comment differs from that of the term. There can be several comments, by the same or different editors.
An example of a term with some of the extended data:
<eterm>
  <nom>phenomenon</nom>
  <decl gr="plu">phenomena</decl>
  <stem>phenomen</stem>
  <origin src="dictcom">
    From Greek "phainómenon", over Late Latin "phaenomenon",
    to appear.</origin>
</eterm>
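To show how a client might put the stem and declensions to work when processing user queries, here is a small Python sketch. Matching a query by exact form or by stem prefix is an assumption made for the example; the format itself does not prescribe any matching rule, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

ETERM = """\
<eterm>
  <nom>phenomenon</nom>
  <decl gr="plu">phenomena</decl>
  <stem>phenomen</stem>
</eterm>
"""

def term_forms(eterm):
    """Collect the nominal form, the stem, and declensions keyed by grammar key."""
    forms = {"nom": eterm.findtext("nom"),
             "stem": eterm.findtext("stem"),
             "decl": {}}
    for decl in eterm.findall("decl"):
        forms["decl"].setdefault(decl.get("gr"), []).append(decl.text)
    return forms

def matches(eterm, query):
    """True if query equals a known form or starts with the stem (assumed rule)."""
    forms = term_forms(eterm)
    query = query.lower()
    known = [forms["nom"]] + [d for ds in forms["decl"].values() for d in ds]
    if any(query == form.lower() for form in known if form):
        return True
    return bool(forms["stem"]) and query.startswith(forms["stem"].lower())

eterm = ET.fromstring(ETERM)
print(term_forms(eterm))
print(matches(eterm, "Phenomena"))
```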
All the various keys used in concepts and terms are collected and defined within this node, by sections:
<keydefs>
  <languages>
    ...
  </languages>
  <environments>
    ...
  </environments>
  ...
</keydefs>
The keys themselves are always defined by the id
attribute of the respective key definition node.
Each node that has a text value, within any of the key definition sections, may be equipped with lang
and env
attributes, and repeated for different combinations of them. Clients should use this info to select the text to present to the user as a description behind a particular key.
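A client-side lookup of the text behind a key could look like the Python sketch below. It picks the name in the requested language and falls back to the glossary's default language. The French name is added for the example, key_name is a hypothetical helper, and environment-based selection is left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET

KEYDEFS = """\
<keydefs>
  <languages>
    <language id="en">
      <name>English</name>
      <name lang="fr">anglais</name>
      <shortname>En.</shortname>
    </language>
  </languages>
</keydefs>
"""

def key_name(keydefs, section, key, lang, default_lang):
    """Return the name of a key in lang, else in default_lang, else None."""
    section_node = keydefs.find(section)
    if section_node is None:
        return None
    for definition in section_node:
        if definition.get("id") != key:
            continue
        names = {n.get("lang", default_lang): n.text
                 for n in definition.findall("name")}
        return names.get(lang, names.get(default_lang))
    return None

keydefs = ET.fromstring(KEYDEFS)
print(key_name(keydefs, "languages", "en", "fr", "en"))
```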
The key definition sections are as follows:
languages
The languages used within the glossary, as applied by the lang
attribute. The definition of a language provides its full and short name:
<languages>
  <language id="en">
    <name>English</name>
    <shortname>En.</shortname>
  </language>
  <language id="fr">
    <name>French</name>
    <shortname>Fr.</shortname>
  </language>
  ...
</languages>
Language identifiers should follow the codes from ISO 639, when available. This is important for relating the language to a system locale, such that language-dependent processing (e.g. alphabetical sorting) may be correctly performed.
environments
Usage environments for text content, as applied by the env
attribute. For each environment the full and short name are given, and the description of the environment:
<environments>
  <environment id="unix">
    <name>Unix</name>
    <shortname>U.</shortname>
    <desc>
      A computer operating system originally developed in 1969
      by a group of AT&amp;T employees...</desc>
  </environment>
  ...
</environments>
If the environment is also one of the concepts, note the difference between the description here and the concept description: environment's description may provide info on the terminology aspects of the environment (if none are needed, the environment description may just point to the concept by a ref
node).
An environment may specify terminology-wise close environments: if a term is not defined in the present environment, another from a close environment can be used as if it were its own. Close environments are specified by a list of environment keys in the closeto
attribute of the environment
node. The list order matters: the first environment is considered the closest, etc.
Clients may sometimes need to pick one environment among others, or to order them in a certain way. Two additional attributes may be specified to guide clients in this. The meta
attribute states that the environment is not a true environment (e.g. it may be an umbrella for several environments), and takes one of the truth values 1|y|yes|t|true
. The weight
attribute specifies environment's priority, in a case-dependent sense, and takes numbers from 0 to 9 as values (0 is default).
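The closeto, meta, and weight attributes can be read and applied along the lines of the Python sketch below. The "linux" and "posix" environments are invented for the example, and the fallback is deliberately not transitive (only the directly listed close environments are tried, in order), since the text does not say whether closeness chains. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET

ENVIRONMENTS = """\
<environments>
  <environment id="unix"><name>Unix</name></environment>
  <environment id="linux" closeto="unix" weight="5"><name>Linux</name></environment>
  <environment id="posix" meta="yes"><name>POSIX-like systems</name></environment>
</environments>
"""

TRUE_VALUES = {"1", "y", "yes", "t", "true"}

def read_environments(root):
    """Map each environment key to its closeto list, meta flag, and weight."""
    envs = {}
    for node in root.iter("environment"):
        envs[node.get("id")] = {
            "closeto": node.get("closeto", "").split(),
            "meta": node.get("meta", "") in TRUE_VALUES,
            "weight": int(node.get("weight", "0")),  # 0 is the default
        }
    return envs

def term_for(terms_by_env, env, envs):
    """Return the term for env, else the term of its closest defined neighbour."""
    for candidate in [env] + envs.get(env, {}).get("closeto", []):
        if candidate in terms_by_env:
            return terms_by_env[candidate]
    return None

envs = read_environments(ET.fromstring(ENVIRONMENTS))
terms = {"unix": "directory", "mac": "folder"}
print(term_for(terms, "linux", envs))
```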
editors
People who are, or were at one point, adding and modifying the content of the glossary. These keys are applied using the by
attribute. An editor definition contains the name and short name (usually initials), email address, affiliation, and description:
<editors>
  <editor id="hjjr">
    <name>Henry Jones, Jr.</name>
    <shortname>IJ</shortname>
    <email>henry.jones.jr@barnett.edu</email>
    <affiliation>
      Barnett College, visiting professor</affiliation>
    <desc>
      Dr. Jones is an eminent archaeologist, who teaches at
      Barnett College in New York...</desc>
  </editor>
  ...
</editors>
Email address, affiliation and description are optional.
sources
Sources which the editors use to assemble the glossary, as applied by the src
attribute. A source can be just about anything: a publication, an institution, a person, etc. Each source defines its full and short name, description, email address, and a URL:
<sources>
  <source id="wp">
    <name>Wikipedia, the Free Encyclopedia</name>
    <shortname>Wp.</shortname>
    <url>http://en.wikipedia.org</url>
    <desc>
      A free, multilingual, open content encyclopedia project
      operated by the non-profit Wikimedia Foundation...</desc>
  </source>
  <source id="jbl">
    <name>J. Bigshot Linguist</name>
    <shortname>JBL</shortname>
    <email>jbl@allknow.edu</email>
    <desc>
      Esteemed and prolific originator and commentator of many
      of the terms found within this glossary...</desc>
  </source>
  ...
</sources>
Email address and URL are optional.
topics
The topics to which the concepts belong, as applied by the topic
attribute of the concept. The definition contains the full and short name, and a description:
<topics>
  <topic id="apollo">
    <name>The Apollo Program</name>
    <shortname>Apollo</shortname>
    <desc>
      The Apollo program was a human spaceflight program
      undertaken by NASA during the years...</desc>
  </topic>
  ...
</topics>
levels
Usage levels applied to the concepts by the level
attribute. They are defined by the full and short name, and a description:
<levels>
  <level id="basic">
    <name>Basic Concepts</name>
    <shortname>basic</shortname>
    <desc>
      The concepts that every user should know about.</desc>
  </level>
  ...
</levels>
The description is optional.
grammar
Grammar categories for terms and declensions, as given by their gr
attributes. Each is defined by the full and short name, and a description:
<grammar>
  <gramm id="pl">
    <name>plural</name>
    <shortname>pl.</shortname>
    <desc>The plural form of the word.</desc>
  </gramm>
  ...
</grammar>
The description is optional.
extroots
External locations of more detailed information on the concept, as provided by the details
child node of a concept. An external root is defined by its full and short name, description, the URL root to which relative paths are appended (given by the rel
attribute in concepts), and a URL for manual browsing:
<extroots>
  <extroot id="rloc">
    <name>Local files</name>
    <shortname>loc.</shortname>
    <rooturl>file:///usr/share/thisgloss/data</rooturl>
    <desc>Files on local disk, installed by this glossary.</desc>
  </extroot>
  <extroot id="rwp">
    <name>Wikipedia</name>
    <shortname>Wp.</shortname>
    <rooturl>http://en.wikipedia.org/wiki</rooturl>
    <browseurl>http://en.wikipedia.org</browseurl>
    <desc>Links to articles on Wikipedia.</desc>
  </extroot>
  ...
</extroots>
The URL for manual browsing is not mandatory.
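How a details node resolves against an external root can be sketched in Python as follows. The text only says that relative paths are appended to the root URL, so joining the two with a single slash is an assumption, as are the rel value "Saturn_V" and the function name details_urls.

```python
import xml.etree.ElementTree as ET

GLOSSARY = """\
<glossary id="glosskey" lang="en">
  <keydefs>
    <extroots>
      <extroot id="rwp">
        <name>Wikipedia</name>
        <rooturl>http://en.wikipedia.org/wiki</rooturl>
      </extroot>
    </extroots>
  </keydefs>
  <concepts>
    <concept id="saturnv">
      <term>Saturn V</term>
      <details root="rwp" rel="Saturn_V"/>
    </concept>
  </concepts>
</glossary>
"""

def details_urls(root):
    """Map concept keys to full URLs built from root URL plus relative path."""
    roots = {r.get("id"): r.findtext("rooturl", "").strip()
             for r in root.iter("extroot")}
    urls = {}
    for concept in root.iter("concept"):
        for details in concept.findall("details"):
            base = roots.get(details.get("root"))
            if not base:
                continue  # unknown root key: nothing to resolve against
            rel = details.get("rel", "")
            url = base.rstrip("/") + "/" + rel.lstrip("/")
            urls.setdefault(concept.get("id"), []).append(url)
    return urls

urls = details_urls(ET.fromstring(GLOSSARY))
print(urls)
```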
Some nodes may contain larger bodies of text, where additional markup is advantageous (e.g. referencing). Such nodes are desc
, comment
, origin
, etc. The following markup can be applied within text contents of such nodes:
ref
A reference to a concept defined within the glossary. It wraps a phrase indicative of the concept, and points to a concept using the c
attribute (the value being the key of the concept).
em
An emphasis on a word or a phrase.
ol
A word or a phrase in another language, as opposed to that of the text. Must have a lang
attribute stating the language of the phrase. An optional attribute is wl
, which if present indicates that the short language name should be formatted together with the phrase (its value must be one of 1|y|yes|t|true
).
link
Link to an external resource. The URL of the resource is given by the url
attribute, which is mandatory.
Although glossary texts should be kept short and to the point, sometimes the text content could still be long enough to warrant splitting into several paragraphs, and other higher level groupings. All nodes which could reasonably benefit from such structure have an l*
variant, which contains structured text content:
<ldesc>
  <para>
    A huge cloud which is thought to surround our solar system
    and reach over halfway to the nearest star.</para>
  <para>
    Comets originate in the Oort cloud.</para>
</ldesc>
Such nodes are: ldesc
, lcomment
, and lorigin
. These can be used everywhere instead of their simpler counterparts. For the moment, the only structuring elements are paragraphs (the para
nodes), but more may be introduced in the future.
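A client can treat the plain and structured variants uniformly by reducing both to a list of paragraphs. The Python sketch below does this with the standard xml.etree.ElementTree module; the function name paragraphs is hypothetical.

```python
import xml.etree.ElementTree as ET

STRUCTURED = {"ldesc", "lcomment", "lorigin"}

LDESC = """\
<ldesc>
  <para>
    A huge cloud which is thought to surround our solar system
    and reach over halfway to the nearest star.</para>
  <para>
    Comets originate in the Oort cloud.</para>
</ldesc>
"""

DESC = "<desc> Comets originate in the <em>Oort cloud</em>.</desc>"

def paragraphs(node):
    """Return the node's text as a list of whitespace-normalized paragraphs."""
    parts = node.findall("para") if node.tag in STRUCTURED else [node]
    return [" ".join("".join(p.itertext()).split()) for p in parts]

long_form = paragraphs(ET.fromstring(LDESC))
short_form = paragraphs(ET.fromstring(DESC))
print(long_form)
print(short_form)
```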
Sometimes a lot of text may need duplicating due to a single phrase in it differing across environments, typically in description nodes -- see an earlier example. To prevent this duplication, clients will support special embedded text selection by environment. Using embedded selection, the mentioned example can be rewritten as:
<concept id="directory">
  <desc>
    An entity in a <ref c="filesystem">file system</ref> which contains
    a group of <ref c="file">files</ref> and other
    ~directories|mac:folders|amiga:drawers~.</desc>
  <term>directory</term>
  <term env="mac">folder</term>
  <term env="amiga">drawer</term>
</concept>
i.e. the embedded selector is of the form ~env1:phrase1|env2:phrase2|...~
, where if one of the environment keys is empty or omitted (as in the example), that phrase inherits the surrounding text's environment. Instead of a single environment key, a whitespace separated list can also be given. The tilde character (~) cannot be a part of ordinary text by itself, but it can be escaped by doubling it (~~). This kind of special-form selection is unusual by XML standards, but has been introduced due to being more human-readable and editable in the running text than e.g. a selection node with subnodes per environment: <select><for env="env1">phrase1</for><for env="env2">phrase2</for>...</select>
.
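The selector syntax is simple enough to resolve with a regular expression. The following Python sketch is one possible client-side reading of the rules above, not the implementation in the dg module: the first colon in an alternative is taken to separate environment keys from the phrase, an alternative without a key is the default, and a doubled tilde yields a literal one. The function name resolve_selectors is hypothetical.

```python
import re

# Either an escaped tilde, or a selector body enclosed in single tildes.
_SELECTOR = re.compile(r"~~|~((?:[^~]|~~)*)~")

def resolve_selectors(text, env=None):
    """Replace each embedded selector with the phrase chosen for env."""
    def replace(match):
        body = match.group(1)
        if body is None:  # the "~~" escape matched, not a selector
            return "~"
        default = ""
        for alternative in body.split("|"):
            env_keys, colon, phrase = alternative.partition(":")
            if not colon:  # no key given: the whole alternative is the phrase
                env_keys, phrase = "", alternative
            phrase = phrase.replace("~~", "~")
            if env is not None and env in env_keys.split():
                return phrase
            if not env_keys.strip():
                default = phrase
        return default
    return _SELECTOR.sub(replace, text)

TEXT = "files and other ~directories|mac:folders|amiga:drawers~."
print(resolve_selectors(TEXT, "amiga"))
print(resolve_selectors(TEXT))
```

With these rules an unknown environment falls back to the phrase without a key, so a reader in an environment the glossary does not list would see "directories".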
Divergloss is distributed in a package which, aside from the format definition and documentation, contains command-line tools for processing Divergloss glossaries into various end-user formats, and requires minimum installation fuss. This enables users to quickly start writing glossary data and putting it to practical use.
Since Divergloss is still early in development, the best place to get the package from is GitHub, by cloning the Git repository:
$ git clone https://github.com/caslav-ilic/divergloss.git
This will create the directory divergloss/
with the complete repository. In it there will be the README
file with short setup instructions. The repository can later always be updated to the newest version by issuing:
$ cd divergloss/
$ git pull
In the package there is a Python module, dg
, which provides easy access to glossary content and functionality frequently needed for manipulating glossary data. While e.g. XSLT is very succinct for straightforward mappings of XML data, building glossary outputs (among other things) may be much more demanding than that, and therefore more easily tackled with a general purpose programming language such as Python. Not the least is Python's ease of use and rich variety of modules, which makes any special processing of glossaries that much more viable.
The packaged dgproc.py script is one immediate user of the dg
module. It operates by pushing Divergloss files through sieves, which build outputs and perform other operations on the glossary. In the basic mode, when run with the glossary file as the single argument, dgproc.py will validate the glossary, reporting also the problems not discoverable by DTD validation. If the glossary file is gloss.xml
, then executing:
$ dgproc.py gloss.xml
will give no output if the glossary is technically valid.
The list of applicable sieves may be seen by issuing the --list-sieves
(-S
option. For example, if the glossary contains terms in English (en
) and German (de
), the html-bidict
sieve may be used to create an embeddable HTML dictionary table, with collapsible concept descriptions:
$ dgproc.py html-bidict gloss.xml -solang:en -stlang:de -sfile:gloss.html
where -s...
options issue sieve parameters. Or, to create a TBX glossary file for use in tools that can make use of it (e.g. a translation editor may automatically issue terminology recommendations):
$ dgproc.py tbx gloss.xml -sfile:gloss.tbx
The list of parameters for each sieve may be seen by following the sieve name with the --help-sieves
(-H
) option. Each sieve is described in more detail in the dg.sieve
module documentation contained in the package.
The following sources were used when making up the examples:
Wikipedia, the Free Encyclopedia.
StarChild, NASA's learning center for young astronomers.
barnettcollege.com, the official website for the development of a freeware point&click adventure "Indiana Jones and The Fountain of Youth", by Screen 7.