Datasets

English WordNet Random Walk Pseudo-Corpora

Filip Klubicka, Technological University Dublin
Alfredo Maldonado, Trinity College Dublin
Abhijit Mahalunkar, Technological University Dublin
John D. Kelleher, Technological University Dublin

Document Type

Dataset

Rights

Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence

Grant Number

13/RC/2106

Disciplines

Computer Sciences, Information Science, Linguistics

Abstract

This archive contains a collection of pseudo-corpora. These are text files that contain pseudo-sentences generated artificially from a random walk over the English WordNet taxonomy.

Recommended Citation

Klubicka, F., Maldonado, A., Mahalunkar, A., Kelleher, J. D. (2019) English WordNet Random Walk Pseudo-Corpora. Dataset. Technological University Dublin. doi:10.21427/he55-6481

DOI

https://doi.org/10.21427/he55-6481

Methodology

The random walk algorithm produces a pseudo-sentence by randomly picking a node (SynSet) in WordNet, randomly choosing a word from that SynSet, then randomly moving to a connected node and repeating the process. At every step there is a 15% chance that the walk stops; it also stops when there are no more connected nodes to move to. Once the walk stops, the words chosen along the way form one pseudo-sentence, and the same process repeats for each new sentence.
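The walk described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the corpora: the graph `TOY_GRAPH` is a hand-made stand-in for WordNet, the function name `random_walk_sentence` is hypothetical, the stop check is assumed to happen after each word is chosen, and nodes are allowed to be revisited.

```python
import random

# Toy stand-in for WordNet (an assumption for illustration, not real data):
# each node maps to the words it contains and the nodes it is connected to.
TOY_GRAPH = {
    "measure.n": {"words": ["measure", "bar"], "links": ["musical_notation.n"]},
    "musical_notation.n": {"words": ["musical notation"],
                           "links": ["measure.n", "tonality.n"]},
    "tonality.n": {"words": ["tonality", "key"],
                   "links": ["musical_notation.n", "minor_mode.n"]},
    "minor_mode.n": {"words": ["minor mode", "minor key"], "links": ["tonality.n"]},
}

STOP_PROBABILITY = 0.15  # the 15% stopping chance given in the description


def random_walk_sentence(graph, rng, stop_probability=STOP_PROBABILITY):
    """Return one pseudo-sentence as a space-delimited string."""
    node = rng.choice(sorted(graph))  # random starting node
    words = []
    while True:
        # Pick one word from the current node.
        words.append(rng.choice(graph[node]["words"]))
        links = graph[node]["links"]
        # Stop on a dead end, or with the given probability (assumed order).
        if not links or rng.random() < stop_probability:
            break
        node = rng.choice(links)  # move to a random connected node
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    for _ in range(3):
        print(random_walk_sentence(TOY_GRAPH, rng))
```

Because the stopping chance is applied independently at every step, the sentence lengths in such a sketch follow a roughly geometric distribution, cut short wherever the walk reaches a node with no connections.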

Each line in the generated file represents one pseudo-sentence, where words are delimited by spaces.
Example sentences:
measure musical notation tonality minor mode
Dutch-processed cocoa powder chocolate milk

The corpus files differ in size and in some of the parameters that were used to generate them.
The parameters are:
- size : the number of sentences/lines in the corpus
- direction : the direction in which the random walk over WordNet was allowed to move while generating sentences (possibilities are up/down/both)
- minimal sentence length : the length of the shortest sentence (in number of words)
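To make the three parameters concrete, the following sketch shows one way they could shape a generated corpus. Everything in it is an assumption made for illustration: the `HYPERNYM` table is a toy taxonomy rather than WordNet, the names `neighbours` and `generate_corpus` are hypothetical, and the minimal sentence length is enforced here by discarding shorter walks, which may differ from how the published corpora were produced.

```python
import random

# Hypothetical toy taxonomy: child -> parent edges. Not real WordNet data.
HYPERNYM = {
    "chocolate milk": "milk",
    "milk": "beverage",
    "cocoa": "beverage",
    "beverage": "food",
}


def neighbours(node, direction):
    """Nodes reachable from `node` when walking up, down, or both ways."""
    up = [HYPERNYM[node]] if node in HYPERNYM else []
    down = sorted(child for child, parent in HYPERNYM.items() if parent == node)
    if direction == "up":
        return up
    if direction == "down":
        return down
    if direction == "both":
        return up + down
    raise ValueError("direction must be 'up', 'down' or 'both'")


def generate_corpus(size, direction, min_length, seed=0, stop_probability=0.15):
    """Collect `size` sentences, each with at least `min_length` words.

    Shorter walks are discarded (an assumption). The loop never finishes if
    `min_length` exceeds the longest walk possible in the chosen direction.
    """
    rng = random.Random(seed)
    nodes = sorted(set(HYPERNYM) | set(HYPERNYM.values()))
    corpus = []
    while len(corpus) < size:
        node = rng.choice(nodes)
        sentence = [node]
        while rng.random() >= stop_probability:
            options = neighbours(node, direction)
            if not options:
                break  # dead end in this direction
            node = rng.choice(options)
            sentence.append(node)
        if len(sentence) >= min_length:
            corpus.append(sentence)
    return corpus


if __name__ == "__main__":
    for words in generate_corpus(size=3, direction="both", min_length=2):
        print(" ".join(words))
```

In this sketch a sentence is kept as a list of words until it is printed, since some entries (such as "chocolate milk") themselves contain a space.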

Language

eng

File Format

.txt

Viewing Instructions

The corpora are compressed into a gzip archive. To view them, they first need to be extracted, which can be done using most standard archive managers (e.g. 7-Zip, WinRAR, etc.). Once extracted, the provided .txt files can be viewed with a simple text editor, such as Notepad or similar.
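For programmatic access, a gzip-compressed text file can also be read directly without extracting it first. The sketch below uses Python's standard `gzip` module and assumes a single gzip-compressed .txt file encoded as UTF-8; the function name `read_pseudo_sentences` and the file name in the usage comment are hypothetical. If the download turns out to be a tar archive compressed with gzip, the standard `tarfile` module would be the appropriate tool instead.

```python
import gzip


def read_pseudo_sentences(path, limit=None):
    """Read a gzip-compressed corpus file, one pseudo-sentence per line.

    Each line is split on single spaces, matching the space-delimited
    format of the corpus files. UTF-8 encoding is an assumption.
    """
    sentences = []
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            sentences.append(line.split(" "))
            if limit is not None and len(sentences) >= limit:
                break
    return sentences


# Usage (hypothetical file name):
# first_ten = read_pseudo_sentences("corpus.txt.gz", limit=10)
```

Reading line by line keeps memory use low when only the first few sentences of a large corpus are needed.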

Funder

The creation of these resources was supported by the ADAPT Centre for Digital Content Technology (https://www.adaptcentre.ie), which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund.

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
