Document Type
Conference Paper
Rights
Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence
Disciplines
Computer Sciences
Abstract
Crosslingual document classification aims to classify documents written in different languages that share a common genre, topic or author. Knowledge-based methods and methods based on machine translation deliver state-of-the-art classification accuracy; however, because they rely on external resources, poorly resourced languages present a challenge for these types of methods. In this paper, we propose a novel set of language-independent features that capture language use in a document at a deep level, using only information intrinsic to the document. These features are based on vocabulary richness measurements and are text-length independent and self-contained, meaning that no external resources such as lexicons or machine translation software are needed. Preliminary evaluation shows promising results for the task of crosslingual authorship attribution, outperforming similar methods.
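As an illustration of what a self-contained vocabulary richness measurement can look like, the sketch below computes Yule's K, a classical statistic that is roughly insensitive to text length. The abstract does not name the measures the paper uses, so the choice of Yule's K, the simple regex tokenizer, and the function names `tokenize` and `yules_k` are assumptions made for this example only, not the paper's feature set.

```python
import re
from collections import Counter


def tokenize(text):
    # Unicode-aware word tokenizer: lowercases and splits on non-word
    # characters. It uses no lexicon or language-specific resource.
    return re.findall(r"\w+", text.lower())


def yules_k(text):
    # Yule's K = 10^4 * (sum_m m^2 * V(m) - N) / N^2, where N is the number
    # of tokens and V(m) is the number of word types occurring exactly m times.
    tokens = tokenize(text)
    n = len(tokens)
    if n == 0:
        raise ValueError("text contains no word tokens")
    type_freqs = Counter(tokens)             # word type -> occurrence count
    spectrum = Counter(type_freqs.values())  # m -> V(m)
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


if __name__ == "__main__":
    # The same function applies unchanged to texts in different languages.
    print(yules_k("the cat saw the dog and the dog saw the cat"))
    print(yules_k("el gato vio al perro y el perro vio al gato"))
```

Lower values indicate a richer vocabulary (less repetition). Because the statistic depends only on the frequency spectrum of the document itself, values computed for documents in different languages can be compared without translation or external lexicons.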
DOI
https://doi.org/10.21427/r8v7-fv65
Recommended Citation
Llorens, M. (2016). Deep Level Lexical Features for Cross-lingual Authorship Attribution. MultiLingMine 2016: Modeling, Learning and Mining for Cross/Multilinguality. Proceedings of the First Workshop on Modeling, Learning and Mining for Cross/Multilinguality (MultiLingMine 2016), co-located with the 38th European Conference on Information Retrieval (ECIR 2016), Padova, Italy, 20 March. doi:10.21427/r8v7-fv65
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 4.0 International License.
Publication Details
MultiLingMine 2016: Modeling, Learning and Mining for Cross/Multilinguality. Proceedings of the First Workshop on Modeling, Learning and Mining for Cross/Multilinguality (MultiLingMine 2016), co-located with the 38th European Conference on Information Retrieval (ECIR 2016), Padova, Italy, 20 March