


Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence




Computer Sciences, Information Science, Linguistics


This archive contains a collection of pseudo-corpora. These are text files that contain pseudo-sentences generated artificially from a random walk over the English WordNet taxonomy.



The random walk algorithm produces a pseudo-sentence from WordNet by randomly picking a node (SynSet) in WordNet, randomly choosing a word in that SynSet, and then randomly picking a connected node and repeating the process. At every step there is a 15% chance that the random walk stops; it also stops when there are no more connected nodes to move to. Once the walk stops, the words collected along the way form one pseudo-sentence, and the same process is repeated for each new sentence.
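The steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used to build the corpora: the toy graph `TOY_SYNSETS` and the function names are hypothetical stand-ins for WordNet, and the sketch assumes that a node may be revisited, which the description does not specify.

```python
import random

# Hypothetical toy graph standing in for WordNet (an assumption for
# illustration): each node has member words and a list of connected nodes.
TOY_SYNSETS = {
    "cocoa": {"words": ["cocoa", "chocolate"], "links": ["beverage"]},
    "beverage": {"words": ["beverage", "drink"], "links": ["cocoa", "milk"]},
    "milk": {"words": ["milk"], "links": []},  # dead end: the walk stops here
}


def random_walk_sentence(graph, rng, stop_probability=0.15):
    """Build one pseudo-sentence from a random walk over the graph."""
    node = rng.choice(sorted(graph))  # random starting node
    words = []
    while True:
        # Pick one word at random from the current node.
        words.append(rng.choice(graph[node]["words"]))
        neighbours = graph[node]["links"]
        # Stop on a dead end, or with the given probability at each step.
        if not neighbours or rng.random() < stop_probability:
            break
        node = rng.choice(neighbours)
    return " ".join(words)


def generate_corpus(graph, size, seed=0):
    """Generate `size` pseudo-sentences, one per line of the corpus."""
    rng = random.Random(seed)
    return [random_walk_sentence(graph, rng) for _ in range(size)]
```

Calling `generate_corpus(TOY_SYNSETS, 5)` returns five space-delimited pseudo-sentences. Passing a seeded `random.Random` keeps the output reproducible, which is a design choice of this sketch rather than something stated about the original generator.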

Each line in the generated file represents one pseudo-sentence, where words are delimited by spaces.
Example sentences:
measure musical notation tonality minor mode
Dutch-processed cocoa powder chocolate milk

The corpus files differ in size and in some of the parameters that were used to generate them.
The parameters are:
- size : number of sentences/lines in the corpus
- direction : the direction that the random walk over WordNet was allowed to go while generating sentences (possibilities are up/down/both)
- minimal sentence length : the minimum length of a sentence (in number of words)

Readme file containing a detailed description of the resource (3 kB)





Viewing Instructions

The corpora are compressed into a gzip archive. To view them, they first need to be extracted, which can be done using most standard archive managers (e.g. 7-Zip, WinRAR, etc.). Once extracted, the provided .txt files can be viewed with a simple text editor, such as Notepad or similar.
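As an alternative to extracting the files by hand, a corpus can also be read programmatically. The sketch below assumes that a corpus is available as a single gzip-compressed, UTF-8 encoded text file; the function name and any file name passed to it are hypothetical, and the actual archive layout may differ.

```python
import gzip


def read_pseudo_sentences(path):
    """Yield each pseudo-sentence as a list of space-delimited tokens."""
    # "rt" opens the gzip file in text mode, decompressing on the fly.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()
            if tokens:  # skip blank lines
                yield tokens
```

Because words are delimited by spaces, a multi-word entry such as "cocoa powder" is returned as separate tokens.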

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 License.


The creation of these resources was supported by the ADAPT Centre for Digital Content Technology, which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund.

