Document Type
Dataset
Rights
Available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence
Disciplines
Computer Sciences, Information Science, Linguistics
Abstract
This archive contains a collection of computational models called word embeddings: vectors that represent words numerically. The embeddings were trained on natural-language sentences collected from the English Wikipedia. As such, they encode contextual (thematic) rather than taxonomic knowledge about words.
Recommended Citation
Klubicka, F., Maldonado, A., Mahalunkar, A., Kelleher, J. D. Contextual Word Embeddings - Trained on English Wikipedia Corpora. Dataset. Technological University Dublin, DOI: 10.21427/4z8m-ev84
DOI
https://doi.org/10.21427/4z8m-ev84
Methodology
We trained a separate embedding model for each of the 20 differently sized Wikipedia corpora used in our experiments, and so make 20 different embedding models available.
For training we used an off-the-shelf PyTorch implementation and changed no major parameters, essentially using it as is.
Each model has been trained for 30 epochs.
The corpus files differ in size, and this is reflected in the models' names. The size of each training corpus is expressed either as a number of tokens (i.e. words) or as a percentage of the original, full Wikipedia corpus.
Readme file containing a detailed description of the resource
Language
eng
File Format
.dat
Viewing Instructions
The models provided here are compressed into a gzip archive. To view them, first extract the archive, which can be done with most standard archive managers (e.g. 7-Zip or WinRAR). Once extracted, the models need to be loaded from a programming language (we recommend Python 3.6) using the appropriate Python packages.
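As an alternative to a graphical archive manager, the extraction step can be sketched in Python using only the standard library. This is a minimal sketch under assumptions: this record does not state whether the download is a gzip-compressed tarball or a single gzip-compressed file, so the hypothetical helper extract_archive handles both, and the file name in the usage comment is a placeholder. It only unpacks the files. How to load the extracted .dat models depends on the format described in the Readme file.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def extract_archive(archive_path, out_dir):
    """Unpack a .tar.gz archive, or decompress a plain .gz file, into out_dir.

    Hypothetical helper: the actual archive layout is an assumption.
    Returns the list of extracted file paths.
    """
    archive_path = Path(archive_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    if tarfile.is_tarfile(archive_path):
        # Case 1: a (gzip-compressed) tarball holding one or more files.
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(out_dir)
        return sorted(p for p in out_dir.rglob("*") if p.is_file())

    # Case 2: a plain gzip stream wrapping a single file.
    # Path.stem drops the trailing ".gz", e.g. "model.dat.gz" -> "model.dat".
    target = out_dir / archive_path.stem
    with gzip.open(archive_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return [target]


# Usage (placeholder file name, not the real archive name):
# files = extract_archive("embeddings.tar.gz", "models")
# print(files)
```

Only extract archives obtained from a trusted source such as the DOI above, since unpacking a tarball writes whatever paths it contains.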
Funder
The creation of these resources was supported by the ADAPT Centre for Digital Content Technology (https://www.adaptcentre.ie), which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund.
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.