Conference papers

hr500k – A Reference Training Corpus of Croatian.

Nikola Ljubešić, Jožef Stefan Institute
Željko Agić, University of Copenhagen
Filip Klubicka, Technological University Dublin
Vuk Batanović, University of Belgrade
Tomaž Erjavec, Jožef Stefan Institute

Document Type

Conference Paper

Rights

Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence

Disciplines

Computer Sciences, Information Science, Specific languages, Linguistics

Publication Details

The article was published at the Language Technologies and Digital Humanities Conference in Ljubljana, Slovenia, 20-21 September 2018.

http://www.sdjt.si/wp/dogodki/konference/jtdh-2018-english/

The paper can be found at this link:

http://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Ljubesic-et-al_hr500k-A-Reference-Training-Corpus-of-Croatian.pdf

Abstract

In this paper we present hr500k, a Croatian reference training corpus of 500 thousand tokens, segmented at document, sentence and word level, and annotated for morphosyntax, lemmas, dependency syntax, named entities, and semantic roles. We present each annotation layer via basic label statistics and describe the final encoding of the resource in CoNLL and TEI formats. We also give a description of the rather turbulent history of the resource and give insights into the topic and genre distribution in the corpus. Finally, we discuss further enrichments of the corpus with additional layers, which are already underway.

DOI

https://doi.org/10.21427/0pjb-f168

Recommended Citation

Ljubešić, N., Agić, Z., Klubicka, F., Batanović, V. & Erjavec, T. (2018). hr500k – A reference training corpus of Croatian. Language Technologies and Digital Humanities Conference, Ljubljana, Slovenia, 20-21 September. doi:10.21427/0pjb-f168

Included in

Digital Humanities Commons, Slavic Languages and Societies Commons
