RECORD
TimeEL: Recognition of Temporal Expressions in Greek texts
Year: | 2009 | ||||
---|---|---|---|---|---|
Authors: | Προκόπης Προκοπίδης; Ελίνα Δεσύπρη; Χάρης Παπαγεωργίου; G. Markopoulos | ||||
Book title: | Proceedings of the 9th International Conference on Greek Linguistics | ||||
Abstract: | In this paper we present work on the development of TimeEL, a rule-based
software module that performs recognition of TIMEXes in Greek texts.
In the first section of the paper, we provide details on the corpus
we have used for development and evaluation purposes. The corpus,
which amounts to 26.5K tokens and 1.7K sentences, comprises financial
web documents, sport-related articles and transcribed documentaries
about political affairs. Three postgraduate linguistics students
compiled an adaptation of the TIDES scheme for Greek and used the
Callisto annotation tool to mark the extent of 613
relevant time expressions (including 243 dates and 225 durations)
in the corpus. Each TIMEX annotation also included attributes like
VAL (the normalized, ISO 8601 compatible format of the expression),
MOD (a representation of temporal modification by lexical items like
περίπου and στην αρχή), etc. The TimeEL module is discussed in the
following sections of the paper. Raw input is enriched with annotations
generated automatically by an NLP pipeline that integrates a tokenizer,
a sentence splitter, a POS tagger, a lemmatizer and a surface syntactic
analyzer that recognizes non-recursive chunks. The core resource
of TimeEL is a grammar that was developed and tested using the JAPE
framework [Cunningham, 2008] for writing cascades of pattern-based
rules over lexical items and annotations. The grammar contains macros
concerning, among others, parts of days, duration adjectives and
adverbial pre- and post-modifiers of temporal expressions. The main
rules are grouped in three stages: a) markup of (combinations of)
trigger lemmas, b) expansion of markables to the extent of chunks
recognized by the syntactic analyzer, and c) post-processing of certain
expressions (involving, for example, splitting TIMEXes representing
time ranges into two distinct expressions). In the evaluation section
of the paper we report encouraging results (recall of 88.5%, precision
of 82.7%) on TIMEX extent recognition using strict exact match and
testing on the whole corpus. The corresponding figures for a subset
of the corpus containing only financial documents (where vague
TIMEXes are less frequent) are 96.1% recall and 91.3% precision. In the error analysis
section, we show that most false positives partially overlap with
missed true positives. We conclude with a discussion of ongoing work
on normalization of recognized expressions, and on plans for the
manual annotation and automatic identification of temporal relations
between events and TIMEXes. |
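The abstract reports precision and recall for TIMEX extent recognition under strict exact match. As a minimal sketch of how such scores might be computed (not the authors' evaluation code), the Python below treats each annotation as a hypothetical `(start, end)` character-offset pair and counts a prediction as correct only when both offsets equal those of a gold annotation. The function name `strict_match_scores` and the toy offsets are assumptions made for illustration.

```python
def strict_match_scores(gold, predicted):
    """Return (precision, recall) under strict exact-match scoring.

    Both arguments are iterables of (start, end) offset pairs. A predicted
    span is a true positive only if an identical span exists in the gold set.
    """
    gold_set = set(gold)
    pred_set = set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the TimeEL corpus.
gold = [(0, 10), (25, 31), (40, 52)]
predicted = [(0, 10), (25, 33), (40, 52), (60, 65)]

precision, recall = strict_match_scores(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy data the predicted span `(25, 33)` overlaps the gold span `(25, 31)` without matching it exactly, so it counts both as a false positive and as a missed true positive. This is the kind of partial overlap the error analysis describes.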