Document Type
Dataset
Funders
The creation of these resources was supported by the ADAPT Centre for Digital Content Technology (https://www.adaptcentre.ie), which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund.
Description
This archive contains a collection of language corpora: plain text files containing samples of text collected from English Wikipedia.
Methodology
We did not collect the text ourselves; rather, we downloaded an existing Wikipedia corpus compiled for the Polyglot project. We partitioned this large corpus into smaller chunks so that we could explore the impact of training corpus size on word embedding performance. These smaller corpus chunks are what we make available here, so that they can be more easily related to our research.
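As an illustration of the kind of partitioning described above, the following is a minimal Python sketch that splits a single large one-sentence-per-line corpus file into progressively larger chunks. The file names and chunk sizes are hypothetical and are not taken from the released dataset.

    # Minimal sketch: partition a large corpus file (one sentence per line)
    # into progressively larger chunks. File names and sizes are hypothetical.
    def write_chunks(corpus_path, chunk_sizes):
        """Write the first N lines of the corpus for each N in chunk_sizes."""
        with open(corpus_path, encoding="utf-8") as f:
            lines = f.readlines()
        for size in chunk_sizes:
            out_path = f"wiki_chunk_{size}.txt"
            with open(out_path, "w", encoding="utf-8") as out:
                out.writelines(lines[:size])

    if __name__ == "__main__":
        # Hypothetical example: chunks of 10k, 100k and 1M lines.
        write_chunks("full_wiki_corpus.txt", [10_000, 100_000, 1_000_000])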
Disciplines
Computer Sciences, Information Science, Linguistics
Publisher
Technological University Dublin
Language
eng
File Format
.txt
Viewing Instructions
The corpora are compressed into a gzip archive. To view them, they first need to be extracted, which can be done using most standard archive managers (e.g. 7-Zip, WinRAR). Once extracted, the provided .txt files can be viewed in any plain text editor, such as Notepad.
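If a programmatic approach is preferred, the following minimal Python sketch shows one way to extract the archive and inspect a corpus file. It assumes the archive is a gzip-compressed tar file and uses hypothetical file names; substitute the actual names from the download.

    import tarfile

    # Extract the archive (hypothetical file name) into the current directory.
    with tarfile.open("english_wiki_chunks.tar.gz", "r:gz") as archive:
        archive.extractall()

    # Print the first few lines of one of the extracted .txt corpus files
    # (hypothetical chunk name) to verify the extraction worked.
    with open("wiki_chunk_10000.txt", encoding="utf-8") as corpus:
        for _ in range(5):
            print(corpus.readline().rstrip())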
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Recommended Citation
Klubicka, F., Maldonado, A., Mahalunkar, A., & Kelleher, J. D. English Wikipedia Corpus Chunks. Dataset. Technological University Dublin.
Readme file containing a detailed description of the resource