KWJP - User guide


The corpus query language will be largely familiar to users of Korpusomat.pl. A full description of the query language is given in its documentation. This guide is limited to a brief description of the available attributes, with examples of their use.

The basic unit of search in the corpus is the segment, which roughly corresponds to a word. In the query language, a single segment is written in square brackets, which enclose the conditions that matching segments must satisfy. Some annotation layers, however, make it possible to refer to units that span several segments, and such units are written in angle brackets. This applies mainly to the named entity layer and the syntactic layer (phrases).

The most common queries concern text forms and lemmas (dictionary forms):

[orth="koty"] — text form of the segment; the result will be all occurrences of the word koty,
[orth_lc="koty"] — case-insensitive text form of the segment; the result may contain e.g. koty, KOTY, Koty,
[lemma="kot"] or [base="kot"] — lemma of the segment; the result will be all occurrences of inflectional forms of the lexeme kot.
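The three queries above share one shape: an attribute name, an equals sign and a quoted value inside square brackets. As a minimal sketch, the hypothetical Python helper below (segment is an illustrative name, not part of KWJP or Korpusomat) assembles such query strings so they can be pasted into the search field.

```python
def segment(attribute: str, value: str) -> str:
    """Return a single-segment query such as [orth="koty"]."""
    return f'[{attribute}="{value}"]'


# The three basic queries from the list above.
queries = {
    "exact text form": segment("orth", "koty"),
    "case-insensitive text form": segment("orth_lc", "koty"),
    "lemma": segment("lemma", "kot"),
}

for label, query in queries.items():
    print(f"{label}: {query}")
```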

Queries involving morphosyntactic interpretations are also possible:

[pos="praet"] — all segments assigned to the grammatical class praet, which corresponds to the past forms of verbs; a full list of grammatical classes can be found in this table and in the Query Builder,
[feat="m2"] — all segments whose grammatical category values include m2 (the so-called masculine animate gender); a full list of grammatical categories and their values can be found in this table and in the Query Builder,
[tag="subst:sg:acc:m2"] — full morphosyntactic tag; the result will be all masculine animate nouns in singular number and accusative case.
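A full tag such as subst:sg:acc:m2 is a colon-separated sequence: the grammatical class first, then the category values. The sketch below splits a tag into those parts and rebuilds them as separate pos and feat conditions joined with &. It assumes that several feat conditions may be combined inside one pair of square brackets, which this guide does not state explicitly, and the function name tag_to_conditions is hypothetical. The resulting query is looser than the tag query, since it does not fix the order or the number of values.

```python
def tag_to_conditions(tag: str) -> str:
    """Turn a full tag into a query with one condition per tag component."""
    # The first component is the grammatical class, the rest are category values.
    pos, *values = tag.split(":")
    conditions = [f'pos="{pos}"'] + [f'feat="{value}"' for value in values]
    # Assumption: conditions inside one segment are joined with " & ".
    return "[" + " & ".join(conditions) + "]"


print(tag_to_conditions("subst:sg:acc:m2"))
```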

The corpus search engine also indexes partial syntactic information from the dependency parser. This makes it possible to formulate queries that refer not only to the morphosyntactic features of a segment, but also to the features of its direct head and to the label of the dependency edge leading to that segment.

[deprel="subj"] — the label of the edge leading to a given segment, usually determining the syntactic function of a given segment in an utterance or construction; in the example, the label subj refers to the function of the subject; all possible values of this attribute can be found in the documentation of the Polish Dependency Bank and in the Query Builder,
[head.lemma="wskoczyć"] — lemma of the direct head of the searched segment; the query result will be all direct dependents of any form of the verb wskoczyć,
[head.pos="prep"] — grammatical class of the head of the searched segment (in this case: preposition),
[head.feat="acc"] — one of the values of the grammatical category of the head of the searched segment (in this case: accusative case),
[head.position="left"] — position (left or right side) of the head in the linear order of the utterance,
[head.distance="5"] — distance (counted in segments) of the head from the searched segment.
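The dependency attributes become more useful when several of them constrain the same segment. The sketch below builds such a query from a dictionary of conditions and checks the two attributes whose values are restricted by the descriptions above: head.position (left or right) and head.distance (a number of segments). The helper name segment_query is hypothetical, and joining head.* conditions with & inside one segment is an assumption.

```python
def segment_query(conditions: dict) -> str:
    """Build a one-segment query from attribute/value pairs joined with '&'."""
    position = conditions.get("head.position")
    if position is not None and position not in {"left", "right"}:
        raise ValueError(f"head.position must be 'left' or 'right', got {position!r}")

    distance = conditions.get("head.distance")
    if distance is not None and not str(distance).isdigit():
        raise ValueError(f"head.distance must be a whole number, got {distance!r}")

    # Dictionaries keep insertion order, so conditions appear as they were given.
    body = " & ".join(f'{attr}="{value}"' for attr, value in conditions.items())
    return f"[{body}]"


# Subjects whose direct head is a form of "wskoczyć" placed on the right side.
print(segment_query({
    "deprel": "subj",
    "head.lemma": "wskoczyć",
    "head.position": "right",
}))
```

Validating the values before building the string catches slips such as head.position="before" without a round trip to the search engine.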

An annotation layer that is not available in Korpusomat.pl, but has been introduced in KWJP, is the constituent layer. Constituents (phrases) are strings of segments that perform certain syntactic functions in an utterance. Queries for constituents are enclosed in angle brackets and use the c attribute, whose value specifies the constituent type. For example, the query <c="AdjP" /> will return all occurrences of adjective phrases, while the query <c="NumP" /> will return all numeral phrases. All possible values for this attribute can be found in the Query Builder.

Searching for all phrases of a given type in the entire corpus is rarely useful to researchers. Much more interesting is the possibility of combining the constituent layer with other annotation layers in a single query. For example, the query <c="NP" /> containing [deprel="subj" & lemma="kot"] (combining information from the constituent, dependency and morphosyntactic layers) will find all noun phrases that contain a segment which is a subject and an inflectional form of the lexeme kot.
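The combined query above consists of three parts: a constituent query, the keyword containing and a segment query. As a minimal sketch, the hypothetical helpers below (constituent and containing are illustrative names) assemble exactly that string. Only the containing operator shown in this guide is used.

```python
def constituent(phrase_type: str) -> str:
    """Return a constituent query such as <c="NP" />."""
    return f'<c="{phrase_type}" />'


def containing(outer: str, inner: str) -> str:
    """Restrict the outer query to matches that contain the inner query."""
    return f"{outer} containing {inner}"


# Noun phrases that contain a subject which is a form of the lexeme "kot".
query = containing(constituent("NP"), '[deprel="subj" & lemma="kot"]')
print(query)
```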

The last annotation layer consists of named entities. As in the case of phrases, named entities may consist of more than one segment, so queries involving this layer are enclosed in angle brackets and use the ne attribute, whose value specifies the type of named entity. For example, <ne="persName" /> will return all person names, and <ne="geogName" /> will return all geographical names.
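The bracket type alone tells which kind of unit a simple query targets: square brackets for a segment, angle brackets with c for a constituent, and angle brackets with ne for a named entity. The sketch below is a rough classifier for single-unit queries written in the forms shown in this guide. It is not a parser for the full query language, and the function name describe is hypothetical.

```python
import re

# Span queries as written in this guide: <c="..." /> or <ne="..." />.
SPAN = re.compile(r'^<(c|ne)="([^"]+)"\s*/>$')
# Rough check only: anything that starts with "[" and ends with "]".
SEGMENT = re.compile(r'^\[.+\]$')


def describe(query: str) -> str:
    """Say which kind of unit a simple query refers to."""
    query = query.strip()
    span = SPAN.match(query)
    if span:
        layer = "constituent" if span.group(1) == "c" else "named entity"
        return f"{layer} of type {span.group(2)}"
    if SEGMENT.match(query):
        return "segment query"
    return "unrecognised"


for example in ('[lemma="kot"]', '<c="AdjP" />', '<ne="persName" />'):
    print(example, "->", describe(example))
```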