Datasets

Taxonomic Word Embeddings - Trained on English WordNet Random Walk Pseudo-Corpora

Filip Klubicka, Technological University Dublin
Alfredo Maldonado, Trinity College Dublin
Abhijit Mahalunkar, Technological University Dublin
John D. Kelleher, Technological University Dublin

Document Type

Dataset

Rights

Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence

Disciplines

Computer Sciences, Information Science, Linguistics

Abstract

This archive contains a collection of computational models called word embeddings: vectors that represent words numerically. They were trained on pseudo-sentences generated artificially by random walks over the English WordNet taxonomy, and thus reflect taxonomic rather than contextual knowledge about words.

Recommended Citation

Klubicka, F., Maldonado, A., Mahalunkar, A., Kelleher, J. D. Taxonomic Word Embeddings - Trained on English WordNet Random Walk Pseudo-Corpora. Dataset. Technological University Dublin. DOI: 10.21427/dwpw-1d69

DOI

https://doi.org/10.21427/dwpw-1d69

Methodology

We trained a separate embedding model on each of the 72 random walk corpora we generated, and thus make 72 different embedding models available. For training we used an off-the-shelf implementation in PyTorch and changed no major parameters, essentially using it 'as is'. Each model was trained for 30 epochs.

The corpus files differ in the parameters used to generate them, and these differences are reflected in the models.
The parameters are:
- size: the number of sentences/lines in the training corpus
- direction: the direction in which the random walk over WordNet was allowed to move while generating sentences (up, down or both)
- minimal sentence length: the length of the shortest sentence in the corpus (in number of words)
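
To illustrate how the three parameters above might shape a pseudo-corpus, here is a minimal sketch of a random walk over a tiny hand-made taxonomy. It is not the authors' generation code: the toy taxonomy, the function names, the max_length cap, the seed and the retry limit are all assumptions made for illustration, and the real corpora were generated from the full English WordNet.

```python
import random

# Toy child -> parent (hypernym) links; a stand-in for WordNet, not real data.
TOY_HYPERNYMS = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}


def build_neighbours(hypernyms, direction):
    """Map each node to the nodes a walk may step to for a given direction."""
    if direction not in ("up", "down", "both"):
        raise ValueError("direction must be 'up', 'down' or 'both'")
    neighbours = {}
    for child, parent in hypernyms.items():
        neighbours.setdefault(child, [])
        neighbours.setdefault(parent, [])
        if direction in ("up", "both"):
            neighbours[child].append(parent)   # step towards the hypernym
        if direction in ("down", "both"):
            neighbours[parent].append(child)   # step towards a hyponym
    return neighbours


def random_walk_corpus(hypernyms, size, direction, min_length,
                       max_length=10, seed=0):
    """Generate up to `size` pseudo-sentences of at least `min_length` words.

    `max_length` and `seed` are hypothetical extras, not parameters
    described in the dataset record.
    """
    rng = random.Random(seed)
    neighbours = build_neighbours(hypernyms, direction)
    nodes = sorted(neighbours)
    corpus = []
    # Bounded retries, so an unreachable min_length cannot loop forever.
    for _ in range(size * 1000):
        if len(corpus) == size:
            break
        node = rng.choice(nodes)
        sentence = [node]
        # Walk until a dead end is reached or the length cap is hit.
        while len(sentence) < max_length and neighbours[node]:
            node = rng.choice(neighbours[node])
            sentence.append(node)
        if len(sentence) >= min_length:
            corpus.append(" ".join(sentence))
    return corpus


if __name__ == "__main__":
    for line in random_walk_corpus(TOY_HYPERNYMS, size=5,
                                   direction="up", min_length=2):
        print(line)
```

In this sketch, each output line plays the role of one sentence in a training corpus. Walks that end before reaching the minimal sentence length are discarded, and the direction setting decides whether a sentence climbs towards more general terms, descends towards more specific ones, or mixes both.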

Language

eng

File Format

.dat

Viewing Instructions

The models provided here are compressed into a gzip archive. To view them, they first need to be extracted, which can be done with most standard archive managers (e.g. 7-Zip or WinRAR). Once extracted, the models can be loaded in a programming language (we recommend Python 3.6) using the appropriate Python packages.
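
As an alternative to an archive manager, the extraction step can also be done from Python with the standard library. The sketch below assumes a single-file gzip archive. The function name and the file name model.dat.gz are hypothetical, and nothing here is specific to the internal format of the .dat files, which this record does not describe.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(archive_path, output_path=None):
    """Decompress a single-file gzip archive and return the output path.

    If no output path is given, the final suffix (e.g. '.gz') is dropped.
    """
    archive_path = Path(archive_path)
    if output_path is None:
        output_path = archive_path.with_suffix("")
    else:
        output_path = Path(output_path)
    if output_path == archive_path:
        raise ValueError("output path would overwrite the archive")
    # Stream the data in chunks so large model files need not fit in memory.
    with gzip.open(archive_path, "rb") as src, open(output_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return output_path


if __name__ == "__main__":
    # Hypothetical file name; substitute the archive actually downloaded.
    print(gunzip("model.dat.gz"))
```

If the download turns out to be a tar archive compressed with gzip, the standard library's tarfile module would be the appropriate tool instead.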

Funder

The creation of these resources was supported by the ADAPT Centre for Digital Content Technology (https://www.adaptcentre.ie), which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund.

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Included in

Computer Sciences Commons