Parametric Search Appliance
The XML schema for entities is below. Optional elements are shown in square brackets; elements that may repeat are followed by ellipses. The syntax is largely compatible with the Google Search Appliance's Entity Recognition XML format, but see here for differences.
<?xml version="1.0"?>
<instances>
  <instance>
    <name>Counties</name>
    [<case_sensitive>N</case_sensitive>]
    [<apply_case>as_is</apply_case>]
    [<store_term_or_name>term</store_term_or_name>]
    [<store_regex_or_name>regex</store_regex_or_name>]
    [<pattern>(?:[[:upper:]]\w+\s+)+County</pattern>
    ...]
    [<term>Adams County</term>
    ...]
  </instance>
  ...
</instances>
The root element is <instances>, which contains one or more
entities, each defined in an <instance> element. Each
<instance> has the following children:
<name> (required) - The name of the entity. Like
Parametric Fields, an entity name must be composed solely of 1 to
29 alphanumerics or underscores (with the first character
alphabetic), and the name must not be a SQL keyword.
<case_sensitive> (optional) - Whether <term>s
match case-sensitively or not; a Y or N value.
The default if unspecified is N.
<apply_case> (optional) - How to transform the case of
text matches, before storing the entity. One of the following
values:
as_is - Leave text as-is; no transformation
lowercase - Lower-case the match
uppercase - Upper-case the match
titlecase - Title-case the match: capitalize the
first letter of each word
titlecase_first_word - Title-case just the first word
The default if unspecified is as_is. Note that only
matches stored from document text are affected:
<term> matches when <store_term_or_name> is
name or term_tag, and <pattern> matches when
<store_regex_or_name> is name, are not modified.
This allows mixed-case <term> values - e.g. McDuff
- to retain their custom-specified case when stored, while still
canonicalizing the possibly-variant cases of <pattern>
matches in text, when both are specified for the same entity.
<store_term_or_name> (optional) - What to store as the
entity for <term> matches. One of the following values:
term - Store the text matched; this is the default
if unspecified. Useful if knowing which <term> matched
is significant; e.g. when looking for a list of cities, and
search results will be Grouped By city.
name - Store the entity <name> value.
Useful when just the existence of the entity matters, i.e.
all the terms are synonymous. (E.g. an entity named
Water with terms water, H2O and
dihydrogen monoxide, and any occurrence should be
stored as Water.)
term_tag - Store the <term> value. Useful
if the specific term matters, and it should be saved with the
same case as in the <term>, not the text. E.g. if a
custom-case <term> like McDuff is set, it may
match Mcduff, MCDUFF etc. in the text - the
McDuff case variant is stored.
<store_regex_or_name> (optional) - What to store as the
entity for <pattern> matches. One of the following values:
regex - Store the text matched; this is the default
if unspecified.
name - Store the entity <name> value.
Useful if just the existence of the entity matters; e.g. the
<pattern>s are looking for credit-card or phone
numbers, and the exact digits do not matter, just the fact
that the document contains a credit-card or phone number.
regex_tagged_as_first_group - Store the text
matched by the first parenthetical capture group of the
<pattern>. For example, the pattern "Mr\. (\w+)"
could be used with regex_tagged_as_first_group to store
just the last name found, without the "Mr." title.
Note that REX syntax uses the \P and \F
operators to indicate what part of the expression to store,
and does not support capture groups; thus
regex_tagged_as_first_group is not valid for REX
<pattern>s.
<pattern> (optional; zero or more occurrences) - A
regular expression (regex) to match entities in document text.
The default syntax is that of Google's RE2 library. REX syntax
may also be used, by preceding the expression with \<rex\>.
To store just part of the text matched, use a parenthetical
capture group in the expression and set
<store_regex_or_name> to
regex_tagged_as_first_group; or use a REX expression with
the \P and \F operators.
Note: On some platforms, RE2 syntax is not supported, and
REX syntax must be used. These platforms will give the
error message "REX: RE2 not supported on this platform"
when uploading an entity file containing RE2 <pattern>s.
(RE2 is supported on Windows and on Linux 2.6 and later,
except i686-unknown-linux2.6.17-64-32.)
RE2 syntax is documented at
https://github.com/google/re2/wiki/Syntax.
<term> (optional; zero or more occurrences) - A term to
find as an entity in document text. The term is searched for
exactly, as a phrase (no quotes needed). It is matched
case-insensitively, unless <case_sensitive> is set to
Y.
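As an illustrative sketch of the <term> storage options described above, the file below defines two entities. The entity name Surnames is hypothetical, the Water and McDuff values come from the examples in the text, and the comments assume that standard XML comments are accepted in an entity file.

```xml
<?xml version="1.0"?>
<instances>
  <!-- All terms are treated as synonyms: any match is stored as "Water" -->
  <instance>
    <name>Water</name>
    <store_term_or_name>name</store_term_or_name>
    <term>water</term>
    <term>H2O</term>
    <term>dihydrogen monoxide</term>
  </instance>
  <!-- Hypothetical entity: matches Mcduff, MCDUFF etc. case-insensitively,
       but stores the case given in the <term> itself (McDuff) -->
  <instance>
    <name>Surnames</name>
    <case_sensitive>N</case_sensitive>
    <store_term_or_name>term_tag</store_term_or_name>
    <term>McDuff</term>
  </instance>
</instances>
```

Because both entities store a value taken from the file rather than from the document text, <apply_case> would have no effect on either of them.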
Note that more than one entity may be defined in a file, since the
<instance> element defining an entity may occur repeatedly.
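A file with more than one entity might look like the following sketch. The Counties pattern and term are taken from the schema above, and the "Mr\. (\w+)" pattern from the <store_regex_or_name> description; the entity name LastNames, the choice of titlecase, and the use of XML comments are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<instances>
  <!-- First entity: a pattern plus a literal term.
       With titlecase, a text match such as "adams county"
       would be stored as "Adams County". -->
  <instance>
    <name>Counties</name>
    <apply_case>titlecase</apply_case>
    <pattern>(?:[[:upper:]]\w+\s+)+County</pattern>
    <term>Adams County</term>
  </instance>
  <!-- Second entity (hypothetical name): an RE2 pattern that stores
       only the first capture group, i.e. the name without "Mr." -->
  <instance>
    <name>LastNames</name>
    <store_regex_or_name>regex_tagged_as_first_group</store_regex_or_name>
    <pattern>Mr\. (\w+)</pattern>
  </instance>
</instances>
```

Each <instance> is independent, so options such as <apply_case> set on one entity do not carry over to the next.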