
The ILSP/ARC submission to the WMT 2016 Bilingual Document Alignment Shared Task

Research area:  
Other topics in Informatics
    
Type:  
Article in proceedings

 

Year: 2016
Authors: Vassilis Papavassiliou; Prokopis Prokopidis; Stelios Piperidis
Book title: Proceedings of the First Conference on Machine Translation
Pages: 733-739
Address: Berlin, Germany
Date: August
Abstract:
This paper describes ILSP-ARC-pv42, the Institute for Language and Speech Processing/Athena Research and Innovation Center submission for the WMT 2016 Bilingual Document Alignment shared task. We describe several document and collection-aware features that our system explored in the context of the task. On the test dataset, our submission achieved a recall of 84.93%, even though it does not make use of any language-specific resources like bilingual lexica or MT output. Instead, our system is based on shallow features (including links to documents in the same webdomain, URLs, digits, image filenames and HTML structure) that can be easily extracted from web documents. We also present examples to show that when de-duplication issues in the test dataset are properly addressed, our system reaches a significantly higher recall of 92.5%.
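
To illustrate the kind of shallow, language-independent features the abstract mentions, the following is a minimal sketch and not the ILSP-ARC-pv42 implementation. It compares two web documents by the overlap of their digit sequences and image filenames. The names `ShallowFeatureParser`, `shallow_features`, `jaccard` and `similarity`, the use of Jaccard overlap, and the equal weighting of the two features are all assumptions made for illustration; the paper's actual features and scoring may differ.

```python
import re
from html.parser import HTMLParser


class ShallowFeatureParser(HTMLParser):
    """Collects digit sequences from text and image filenames from <img> tags."""

    def __init__(self):
        super().__init__()
        self.digits = set()
        self.images = set()

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Keep only the filename, dropping the path and any query string.
                self.images.add(src.split("?")[0].rsplit("/", 1)[-1])

    def handle_data(self, data):
        # Digit runs (years, prices, phone numbers) tend to survive translation.
        self.digits.update(re.findall(r"\d+", data))


def shallow_features(html):
    """Return (digit sequences, image filenames) found in an HTML string."""
    parser = ShallowFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.digits, parser.images


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(html_a, html_b):
    """Hypothetical score: equal-weight average of the two feature overlaps."""
    digits_a, images_a = shallow_features(html_a)
    digits_b, images_b = shallow_features(html_b)
    return 0.5 * jaccard(digits_a, digits_b) + 0.5 * jaccard(images_a, images_b)
```

Under these assumptions, an English page and its French counterpart that share the same numbers and the same images would score close to 1, while unrelated pages would score close to 0, without any bilingual lexicon or MT output.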