The Zorba XQuery engine implements the XQuery and XPath Full Text 1.0 specification that, among other things, adds the ability to use a thesaurus for text-matching via the thesaurus option. For example, the query:
let $x := <msg>affluent man</msg> return $x contains text "wealthy" using thesaurus default
returns true because $x contains "affluent", which the thesaurus identifies as a synonym of "wealthy".
The initial implementation of the thesaurus option uses the WordNet lexical database, version 3.0.
The stock WordNet database files are plain ASCII text files. In many ways this is very convenient for portability, grep-ability, vi-ability, etc. However, the sum total of the files is approximately 27MB (which is quite large) and accessing the database would be inefficient since the files would have to be parsed for every access.
Instead, the database files are compiled into a single binary file that is 6MB and can be efficiently accessed from Zorba using mmap(2) with no parsing of the data. The only caveat of the binary format is that it is endian-dependent, i.e., a binary file created on a computer having a little-endian CPU won't work on a computer having a big-endian CPU.
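Because the binary thesaurus is endian-dependent, it can help to check a machine's byte order before copying a compiled file to it. The following is a minimal shell sketch and not part of Zorba: the function name host_endianness is hypothetical, and it assumes an od that supports the POSIX -t u2 format.

```shell
# Print the byte order of the current host: "little", "big", or "unknown".
host_endianness() {
  # Emit the two bytes 0x01 0x00 and read them back as one unsigned 16-bit integer.
  # A little-endian CPU reads the value 1; a big-endian CPU reads 256.
  value=$(printf '\001\000' | od -An -tu2 | tr -d ' \n')
  case "$value" in
    1)   echo little ;;
    256) echo big ;;
    *)   echo unknown ;;
  esac
}

host_endianness
```

If the machine that compiles the thesaurus and the machine that runs Zorba report different values, compile the thesaurus again on the target machine.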
To download and install the WordNet database on a Unix-like system, follow these steps:
1. Download the WordNet database files (WNdb-3.0.tar.gz).
2. Unpack the files into a directory of your choosing, e.g., /usr/local/wordnet-3.0/dict.
3. Compile the files in the dict directory into a Zorba-compatible binary thesaurus as described below.

To compile the WordNet database files, use the zt-wn-compile script found in the scripts subdirectory of the Zorba distribution. (Note: this script is written in Perl.) The usage message is:
zt-wn-compile [-v] wordnet_dict_dir [thesaurus_file]
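- The -v option specifies verbose output.
- wordnet_dict_dir is the path of the WordNet dict directory.
- thesaurus_file is the name of the resulting binary file. If omitted, it defaults to wordnet-en.zth ("en" for English and "zth" for "Zorba Thesaurus file").

For example: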
zt-wn-compile -v /usr/local/wordnet-3.0/dict
Move the wordnet-en.zth file to a location of your choosing.
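The unpack and compile steps can be sketched as a short shell script. This is an illustration under assumptions, not an official installer: the variable names WN_TARBALL and WN_PREFIX and the function unpack_wordnet are hypothetical, the tarball is assumed to unpack into a top-level dict directory, and zt-wn-compile is assumed to be on your PATH.

```shell
# Assumed locations; adjust both to your system.
WN_TARBALL=${WN_TARBALL:-WNdb-3.0.tar.gz}
WN_PREFIX=${WN_PREFIX:-/usr/local/wordnet-3.0}

# Unpack a WordNet tarball.
#   $1 = tarball, $2 = directory that should end up containing dict/
unpack_wordnet() {
  mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Run the steps only when the tarball has actually been downloaded.
if [ -f "$WN_TARBALL" ]; then
  unpack_wordnet "$WN_TARBALL" "$WN_PREFIX"
  # Compile the dict directory; with no second argument the
  # result is named wordnet-en.zth.
  zt-wn-compile -v "$WN_PREFIX/dict"
fi
```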
Alternatively, you can download a precompiled WordNet database from here.
To use a thesaurus, you must tell the Zorba engine where it is via one or more thesaurus mappings. A mapping maps a symbolic URI to the URI of an actual thesaurus. A mapping is of the form:
from_uri:=[implementation|]to_uri
For example:
http://wordnet.princeton.edu:=wordnet|/usr/local/zorba/thesauri/wordnet-en.zth
says that the symbolic URI http://wordnet.princeton.edu maps to the WordNet implementation having a database file at the given path. Once a mapping is established for a symbolic URI, it can be used in a query:
let $x := <msg>affluent man</msg> return $x contains text "wealthy" using thesaurus at "http://wordnet.princeton.edu"
If the implementation is omitted, it defaults to wordnet. As a special case, the from_uri can be default or ##default to allow for specifying the default thesaurus, as was done for the first example on this page.
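To make the three parts of a mapping concrete, here is a small shell sketch that splits a mapping string into its from_uri, implementation, and to_uri. It only illustrates the format described above and is not Zorba's own parser; the function name parse_mapping is hypothetical.

```shell
# Split "from_uri:=[implementation|]to_uri" into its parts, one per line.
parse_mapping() {
  mapping=$1
  case "$mapping" in
    *:=*) ;;
    *) echo "not a mapping: $mapping" >&2; return 1 ;;
  esac
  from_uri=${mapping%%:=*}   # everything before the first ":="
  rest=${mapping#*:=}        # everything after the first ":="
  case "$rest" in
    *"|"*)
      implementation=${rest%%"|"*}
      to_uri=${rest#*"|"}
      ;;
    *)
      implementation=wordnet # the default when no implementation is given
      to_uri=$rest
      ;;
  esac
  printf '%s\n%s\n%s\n' "$from_uri" "$implementation" "$to_uri"
}

parse_mapping 'http://wordnet.princeton.edu:=wordnet|/usr/local/zorba/thesauri/wordnet-en.zth'
parse_mapping 'default:=/usr/local/zorba/thesauri/wordnet-en.zth'
```

Both calls report wordnet as the implementation: the first because it is given explicitly, the second because it is omitted.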
To specify the location of the thesaurus to Zorba from the command line, use one or more --thesaurus options:
zorba --thesaurus default:=/usr/local/zorba/thesauri/wordnet-en.zth ...
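When more than one mapping is given, and especially when a mapping names an implementation, the quoting matters. The sketch below assembles two --thesaurus options without running Zorba. The variable name ZTH is hypothetical, and mapping the same file under both default and a symbolic URI is an assumption made for illustration.

```shell
# Assumed location of the compiled thesaurus (the path used in the examples above).
ZTH=/usr/local/zorba/thesauri/wordnet-en.zth

# Collect the options as positional parameters. Each mapping is quoted so the
# shell passes it as a single argument; an unquoted "|" would start a pipeline.
set -- \
  --thesaurus "default:=$ZTH" \
  --thesaurus "http://wordnet.princeton.edu:=wordnet|$ZTH"

# Show the resulting arguments, one per line. To use them, place "$@" after
# the zorba command name, followed by the rest of your usual command line.
printf '%s\n' "$@"
```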
Using the WordNet database, Zorba supports all of the thesaurus relationships specified by [ISO 2788] and [ANSI/NISO Z39.19-2005] with the exceptions of "HN" (history note) and "X SN" (see scope note for).
These relationships are:
Rel. | Meaning | WordNet Rel. |
---|---|---|
BT | broader term | hypernym |
BTG | broader term generic | hypernym |
BTI | broader term instance | instance hypernym |
BTP | broader term partitive | part meronym |
NT | narrower term | hyponym |
NTG | narrower term generic | hyponym |
NTI | narrower term instance | instance hyponym |
NTP | narrower term partitive | part holonym |
RT | related term | also see |
SN | scope note | n/a |
TT | top term | hypernym |
UF | non-preferred term | n/a |
USE | preferred term | n/a |
and can be used in a query like:
let $x := <msg>breakfast of champions</msg> return $x contains text "meal" using thesaurus at "http://wordnet.princeton.edu" relationship "NT"
that returns true because $x contains "breakfast", which the thesaurus identifies as a "narrower term" (NT) of "meal."
Note that you can specify relationships either by their abbreviation or their meaning. Relationships are case-insensitive. The above query is equivalent to:
let $x := <msg>breakfast of champions</msg> return $x contains text "meal" using thesaurus at "http://wordnet.princeton.edu" relationship "narrower term"
Since Zorba's thesaurus is implemented using WordNet, the [ISO 2788] relationships map to the WordNet relationships shown in the "WordNet Rel." column. WordNet relationships are explained in the next section.
In addition to the [ISO 2788] and [ANSI/NISO Z39.19-2005] relationships, Zorba also supports all of the relationships offered by WordNet. These relationships are:
Relationship | Meaning |
---|---|
also see | A word that is related to another, e.g., for "varnished" (furniture) one should also see "finished." |
antonym | A word opposite in meaning to another, e.g., "light" is an antonym for "heavy." |
attribute | A noun for which adjectives express values, e.g., "weight" is an attribute for which the adjectives "light" and "heavy" express values. |
cause | A verb that causes another, e.g., "show" is a cause of "see." |
derivationally related form | A word that is derived from a root word, e.g., "metric" is a derivationally related form of "meter." |
derived from adjective | An adverb that is derived from an adjective, e.g., "correctly" is derived from the adjective "correct." |
entailment | A verb that presupposes another, e.g., "snoring" entails "sleeping." |
hypernym | A word with a broad meaning that more specific words fall under, e.g., "meal" is a hypernym of "breakfast." |
hyponym | A word of more specific meaning than a general term applicable to it, e.g., "breakfast" is a hyponym of "meal." |
instance hypernym | A word that denotes a category of some specific instance, e.g., "author" is an instance hypernym of "Asimov." |
instance hyponym | A term that denotes a specific instance of some general category, e.g., "Asimov" is an instance hyponym of "author." |
member holonym | A word that denotes a collection of individuals, e.g., "faculty" is a member holonym of "professor." |
member meronym | A word that denotes a member of a larger group, e.g., a "person" is a member meronym of a "crowd." |
part holonym | A word that denotes a larger whole comprised of some part, e.g., "car" is a part holonym of "engine." |
part meronym | A word that denotes a part of a larger whole, e.g., an "engine" is a part meronym of a "car." |
participle of verb | An adjective that is the participle of some verb, e.g., "breaking" is the participle of the verb "break." |
pertainym | An adjective that classifies its noun, e.g., "musical" is a pertainym in "musical instrument." |
similar to | Similar, though not necessarily interchangeable, adjectives. For example, "shiny" is similar to "bright", but they have subtle differences. |
substance holonym | A word that denotes a larger whole containing some constituent substance, e.g., "bread" is a substance holonym of "flour." |
substance meronym | A word that denotes a constituent substance of some larger whole, e.g., "flour" is a substance meronym of "bread." |
verb group | A verb that is a member of a group of similar verbs, e.g., "live" is in the verb group of "dwell", "live", "inhabit", etc. |
If no levels are specified in a query, Zorba defaults to 2 levels for the WordNet implementation. The rationale can be found here.
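To override the default, a query can state the number of levels explicitly using the range syntax of the XQuery and XPath Full Text 1.0 thesaurus option (exactly N levels, at least N levels, at most N levels, or from N to M levels). The sketch below writes such a query to a file. The file name levels.xq is hypothetical, and the query assumes the http://wordnet.princeton.edu mapping shown earlier is in place.

```shell
# Write a query that follows "narrower term" links up to 3 levels deep
# instead of relying on the default number of levels.
cat > levels.xq <<'EOF'
let $x := <msg>breakfast of champions</msg>
return $x contains text "meal"
  using thesaurus at "http://wordnet.princeton.edu"
  relationship "NT" at most 3 levels
EOF

# Print the query that was written.
cat levels.xq
```

Pass the query to zorba together with the matching --thesaurus mapping.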