A suite of NLP tools for Greek
Article in conference proceedings
Προκόπης Προκοπίδης; Βύρων Γεωργαντόπουλος; Χάρης Παπαγεωργίου
The 10th International Conference of Greek Linguistics
The vast amount of electronically available textual data is a wealth of information for both researchers and application developers. At the same time, today's very large datasets call for robust and efficient processing tools. While a variety of such processors exist for well-resourced languages like English, it is often difficult to find comparable tools for texts in less widely spoken languages. In this paper we provide an overview of the natural language technologies available from the Institute for Language and Speech Processing. This NLP suite is unique for the Greek language and comprises a series of processing units based on both machine learning algorithms and rule-based approaches. We report on updated versions of the tools and, taking into account the latest developments in the field, on new processors that we have implemented, together with the resources we created for their training and evaluation. Our infrastructure can be used by researchers interested in studying linguistic properties of the Greek language. It can also be employed in application scenarios that require fast processing of large document collections.

The paper is organized as follows. Section 2 discusses the detection of paragraph, sentence and token boundaries in input text. The modules presented in Section 3 assign part-of-speech (POS) tags and lemmas to tokens. Section 4 presents a dependency treebank for training data-driven parsers. A term spotting algorithm is discussed in Section 5. Sections 6 and 7 focus on modules for sentence compression and text summarization. In Section 8, we discuss the integration and use of the tools via standards-compliant web services.