# Computational Model: Taxonomic Word Embeddings - Trained on English WordNet Random Walk Pseudo-Corpora

This archive contains a collection of computational models called word embeddings: vectors that are numerical representations of words. They have been trained on pseudo-sentences generated artificially by a random walk over the English WordNet taxonomy, and thus reflect taxonomic knowledge about words (rather than contextual knowledge).

### Resource description and methodology

We trained a separate embedding model on each of the 72 random walk corpora we generated, and thus make available 72 different embedding models. For training we used an off-the-shelf [PyTorch](https://pytorch.org) implementation and changed no major parameters, essentially using it 'as is'. Each model was trained for 30 epochs.

Each of the 72 models is saved in a separate folder, which contains two files:

* `word2idx.dat` - a mapping of every word in the model's vocabulary to its index
* `idx2vec-e30.dat` - a map containing the actual embeddings (numeric vectors) for every word, keyed by the word's index (we only provide the result of the final training epoch, epoch 30)

Both files are needed to successfully use the embeddings (see the loading sketch at the end of this README).

Each model's folder is named after the corpus the model was trained on. As the corpus files differ in the parameters used to generate them, these parameters are also reflected in the model's folder name:

* size: the number of sentences/lines in the corpus
* direction: the direction the random walk over WordNet was allowed to take while generating sentences (up, down or both)
* minimum sentence length: the minimum length of a sentence, in words

For example, the folder `wn-rw-corpus.100k.up.2ws` contains a model trained on a corpus of *100 000* sentences, where the random walk only went *up* the taxonomy and the minimum sentence length is *2*. This naming convention applies to all provided models.

The models provided here are compressed into a gzip archive. To use them, first extract the archive with any standard archive manager (e.g. 7-Zip or WinRAR). Once extracted, the models can be loaded in a programming language (we recommend Python 3.6) with the appropriate packages. On our [GitHub page](https://github.com/GreenParachute/wordnet-randomwalk-python) you can find code that uses the provided word embedding models to measure word similarity or word relatedness (a minimal similarity sketch is also included at the end of this README).

### Contact and citation

If you have any questions, feel free to:

* read the papers below, which describe the nature and use of these resources in more detail
* contact us with any questions or concerns, and we'll be happy to discuss our work

E-mail: filip.klubicka@adaptcentre.ie

If you use any of the data or code in your research, please cite the following papers:

```
@inproceedings{klubicka2019synthetic,
    title="Synthetic, yet natural: Properties of WordNet random walk corpora and the impact of rare words on embedding performance",
    author="Filip Klubi\v{c}ka and Alfredo Maldonado and Abhijit Mahalunkar and John D. Kelleher",
    booktitle="Proceedings of GWC: 10th Global Wordnet Conference",
    year="2019",
    link=" "
}
```

You can download the paper here.
```
@article{maldonado2019size,
    author="Maldonado, Alfredo and Klubi{\v{c}}ka, Filip and Kelleher, John D.",
    title="Size Matters: The Impact of Training Size in Taxonomically-Enriched Word Embeddings",
    journal="Open Computer Science",
    publisher="De Gruyter",
    year="2019",
    link=" "
}
```

You can download the paper [here](https://www.degruyter.com/downloadpdf/j/comp.2019.9.issue-1/comp-2019-0009/comp-2019-0009.pdf).
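
### Example usage (sketches)

The snippets below are minimal, unofficial sketches rather than reference implementations. This first one loads a model's two files, assuming both `.dat` files are standard Python pickles (the exact serialisation is not documented here; if loading fails, see the code on our GitHub page for the canonical loader). The folder name and the word `dog` are only illustrative.

```python
import pickle

import numpy as np

MODEL_DIR = "wn-rw-corpus.100k.up.2ws"  # any extracted model folder

# word2idx.dat: assumed to be a pickled dict mapping word -> integer index.
with open(f"{MODEL_DIR}/word2idx.dat", "rb") as f:
    word2idx = pickle.load(f)

# idx2vec-e30.dat: assumed to be a pickled array-like of embedding vectors,
# where row i is the vector of the word with index i (final epoch, 30).
with open(f"{MODEL_DIR}/idx2vec-e30.dat", "rb") as f:
    idx2vec = np.asarray(pickle.load(f))

vector = idx2vec[word2idx["dog"]]  # embedding for 'dog', if in vocabulary
print(vector.shape)
```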
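
Because the corpus-generation parameters are encoded in each folder name, they can also be recovered programmatically. The helper below is hypothetical; it simply splits a name such as `wn-rw-corpus.100k.up.2ws` on dots, following the naming convention described above.

```python
def parse_model_dirname(dirname: str) -> dict:
    """Split a model folder name into its corpus-generation parameters."""
    _, size, direction, min_len = dirname.split(".")
    return {
        "size": size,                # number of sentences, e.g. '100k'
        "direction": direction,      # 'up', 'down' or 'both'
        "min_sentence_length": int(min_len.rstrip("ws")),  # e.g. 2
    }

print(parse_model_dirname("wn-rw-corpus.100k.up.2ws"))
# {'size': '100k', 'direction': 'up', 'min_sentence_length': 2}
```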
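
Finally, a common use of these embeddings is measuring word similarity; one standard way to do this is cosine similarity between two word vectors. Continuing from the loading sketch above (`word2idx`, `idx2vec`), and assuming both words occur in the model's vocabulary:

```python
import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# word2idx and idx2vec as loaded in the first sketch; the word pair is
# illustrative only.
similarity = cosine_similarity(idx2vec[word2idx["dog"]],
                               idx2vec[word2idx["cat"]])
print(f"dog/cat cosine similarity: {similarity:.3f}")
```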