Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence
Computer Sciences, Information Science, Linguistics
This archive contains a collection of pseudo-corpora. These are text files that contain pseudo-sentences generated artificially from a random walk over the English WordNet taxonomy.
The random walk algorithm produces a pseudo-sentence from WordNet by randomly picking a node (SynSet) in WordNet, randomly choosing a word in that SynSet, and then randomly picking a connected node and repeating the process. At every step there is a 15% chance that the walk stops; it also stops when there are no more connected nodes to move to. Once the walk stops, the words collected along the way form one pseudo-sentence, and the same process starts again for the next sentence.
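The walk described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the corpora: the toy taxonomy and the function names are hypothetical. The sketch also assumes that the walk never revisits a node and that the 15% stop check happens after each word is emitted. The description does not specify either detail.

```python
import random

# Hypothetical toy taxonomy standing in for WordNet:
# node id -> (words in the synset, ids of connected nodes).
TOY_TAXONOMY = {
    "food": (["food", "nutrient"], ["beverage", "chocolate"]),
    "beverage": (["beverage", "drink"], ["food", "milk"]),
    "chocolate": (["chocolate"], ["food", "cocoa"]),
    "milk": (["milk"], ["beverage"]),
    "cocoa": (["cocoa"], ["chocolate"]),
}


def random_walk_sentence(taxonomy, rng, stop_prob=0.15):
    """Return one pseudo-sentence (a list of words) from a single random walk."""
    node = rng.choice(sorted(taxonomy))  # random starting node
    visited = set()
    words = []
    while True:
        visited.add(node)
        synset_words, neighbours = taxonomy[node]
        words.append(rng.choice(synset_words))  # random word from the synset
        # Assumption: the stop check is made after each emitted word.
        if rng.random() < stop_prob:
            break
        # Assumption: nodes already visited are not taken again.
        candidates = [n for n in neighbours if n not in visited]
        if not candidates:
            break  # no connected nodes left to move to
        node = rng.choice(candidates)
    return words


def generate_corpus(taxonomy, size, seed=0, stop_prob=0.15):
    """Return `size` pseudo-sentences, each as one space-delimited line."""
    rng = random.Random(seed)
    return [" ".join(random_walk_sentence(taxonomy, rng, stop_prob))
            for _ in range(size)]


if __name__ == "__main__":
    for line in generate_corpus(TOY_TAXONOMY, size=5):
        print(line)
```

Because the generator is seeded, the same seed reproduces the same lines. Restricting the walk to one direction would amount to listing only hypernym or only hyponym links as the connected nodes.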
Each line in a generated file represents one pseudo-sentence, with words delimited by spaces. For example:
measure musical notation tonality minor mode
Dutch-processed cocoa powder chocolate milk
The corpus files differ in size and in some of the parameters that were used to generate them.
The parameters are:
- size : number of sentences/lines in the corpus
- direction : the direction in which the random walk was allowed to move through the WordNet taxonomy (up, down, or both)
- minimal sentence length : the minimum number of words in a sentence
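Two of these parameters, size and minimal sentence length, can be checked directly against the contents of a corpus file. The sketch below is one way to do that. The function name `corpus_stats` is hypothetical, and the code assumes only the one-sentence-per-line, space-delimited format described above.

```python
def corpus_stats(lines):
    """Summarise an iterable of corpus lines (one pseudo-sentence per line)."""
    # Count space-delimited words per line, ignoring blank lines.
    lengths = [len(line.split()) for line in lines if line.strip()]
    return {
        "size": len(lengths),                  # number of sentences/lines
        "shortest": min(lengths, default=0),   # observed minimal sentence length
        "longest": max(lengths, default=0),
    }


if __name__ == "__main__":
    sample = [
        "measure musical notation tonality minor mode",
        "Dutch-processed cocoa powder chocolate milk",
    ]
    print(corpus_stats(sample))  # {'size': 2, 'shortest': 5, 'longest': 6}
```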
Klubicka, F., Maldonado, A., Mahalunkar, A., Kelleher, J. D. (2019) English WordNet Random Walk Pseudo-Corpora. Dataset. Technological University Dublin. doi:10.21427/he55-6481
The creation of these resources was supported by the ADAPT Centre for Digital Content Technology (https://www.adaptcentre.ie), which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund.
The corpora are compressed into a gzip archive. To view them, they first need to be extracted, which can be done with most standard archive managers (e.g. 7-Zip or WinRAR). Once extracted, the provided .txt files can be viewed with a simple text editor, such as Notepad.
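The files can also be read programmatically without extracting them first. The following is a minimal sketch using Python's standard `gzip` module. It assumes that a corpus is a single gzip-compressed, UTF-8 text file rather than a multi-file archive. The function name and the file name in the usage comment are hypothetical.

```python
import gzip


def read_pseudo_sentences(path):
    """Yield each pseudo-sentence in a gzip-compressed corpus as a list of words."""
    # "rt" opens the compressed file in text mode and decompresses on the fly.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            words = line.split()  # words are delimited by spaces
            if words:             # skip blank lines
                yield words


# Usage (hypothetical file name):
# for words in read_pseudo_sentences("corpus.txt.gz"):
#     print(len(words), words)
```

Reading line by line keeps memory use low, which helps with the larger corpus files.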