Document Type
Dataset
Funders
The creation of these resources was supported by the ADAPT Centre for Digital Content Technology (https://www.adaptcentre.ie), which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund.
Description
This archive contains a collection of language corpora: plain text files containing samples of text collected from English Wikipedia.
Methodology
We did not collect the text ourselves; rather, we downloaded an existing Wikipedia corpus compiled for the Polyglot project. We partitioned this large corpus into smaller chunks so that we could explore the impact of training corpus size on word embedding performance. These smaller corpus chunks are what we make available here, so that they can be more easily related to our research.
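As an illustration of the kind of partitioning described above, the following is a minimal Python sketch that splits a single large one-sentence-per-line corpus file into progressively larger chunks. The file names and chunk sizes are hypothetical and are not taken from the released dataset.

    # Minimal sketch: partition a large corpus file (one sentence per line)
    # into progressively larger chunks. File names and sizes are hypothetical.
    def write_chunks(corpus_path, chunk_sizes):
        """Write the first N lines of the corpus for each N in chunk_sizes."""
        with open(corpus_path, encoding="utf-8") as f:
            lines = f.readlines()
        for size in chunk_sizes:
            out_path = f"wiki_chunk_{size}.txt"
            with open(out_path, "w", encoding="utf-8") as out:
                out.writelines(lines[:size])

    if __name__ == "__main__":
        # Hypothetical example: chunks of 10k, 100k and 1M lines.
        write_chunks("full_wiki_corpus.txt", [10_000, 100_000, 1_000_000])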
Disciplines
Computer Sciences, Information Science, Linguistics
Publisher
Technological University Dublin
Language
eng
File Format
.txt
Viewing Instructions
The corpora are compressed into a gzip archive. To view them, they first need to be extracted, which can be done using most standard archive managers (e.g. 7-Zip, WinRAR). Once extracted, the provided .txt files can be viewed in any plain text editor, such as Notepad.
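If a programmatic approach is preferred, the following minimal Python sketch shows one way to extract the archive and inspect a corpus file. It assumes the archive is a gzip-compressed tar file and uses hypothetical file names; substitute the actual names from the download.

    import tarfile

    # Extract the archive (hypothetical file name) into the current directory.
    with tarfile.open("english_wiki_chunks.tar.gz", "r:gz") as archive:
        archive.extractall()

    # Print the first few lines of one of the extracted .txt corpus files
    # (hypothetical chunk name) to verify the extraction worked.
    with open("wiki_chunk_10000.txt", encoding="utf-8") as corpus:
        for _ in range(5):
            print(corpus.readline().rstrip())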
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Recommended Citation
Klubicka, F., Maldonado, A., Mahalunkar, A., & Kelleher, J. D. English Wikipedia Corpus Chunks. Dataset. Technological University Dublin.
Readme file containing a detailed description of the resource