Document Type
Article
Rights
Available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence
Disciplines
Computer Sciences
Abstract
The current state of the art for First Story Detection (FSD) is nearest-neighbour-based models with traditional term vector representations; however, one challenge faced by FSD models is that the document representation is usually defined by the vocabulary and term frequencies of a background corpus. Consequently, the ideal background corpus should arguably be both large-scale, to ensure adequate term coverage, and similar to the target domain in terms of its language distribution. However, given that these two factors cannot always be mutually satisfied, in this paper we examine whether the distributional similarity of common terms is more important than the scale of common terms for FSD. As a basis for our analysis, we propose a set of metrics to quantitatively measure the scale of common terms and the distributional similarity between corpora. Using these metrics, we rank different background corpora relative to a target corpus. We also apply models based on the different background corpora to the FSD task. Our results show that term distributional similarity is more predictive of good FSD performance than the scale of common terms; thus, we demonstrate that a smaller, recent, domain-related corpus will be more suitable than a very large-scale general corpus for FSD.
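To make the abstract's two quantities concrete, the following is a minimal Python sketch, assuming Jensen-Shannon divergence over the shared vocabulary as one possible distributional similarity measure and a simple vocabulary-overlap ratio as the scale of common terms; both are illustrative assumptions, not necessarily the metrics the paper itself proposes.

```python
import math
from collections import Counter

def term_distribution(tokens, vocab):
    """Relative frequencies of the shared-vocabulary terms in a corpus."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) or 1
    return {t: counts[t] / total for t in vocab}

def jensen_shannon(p, q, vocab):
    """Symmetric, bounded divergence between two term distributions
    (0 means identical language distribution over the shared terms)."""
    def kl(a, m):
        return sum(a[t] * math.log2(a[t] / m[t]) for t in vocab if a[t] > 0)
    m = {t: 0.5 * (p[t] + q[t]) for t in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def common_term_coverage(background_vocab, target_vocab):
    """Scale of common terms: fraction of the target vocabulary
    that also appears in the background corpus."""
    return len(background_vocab & target_vocab) / len(target_vocab)

# Illustrative comparison of a background corpus against a target corpus:
background = "markets fell sharply today amid renewed trade fears".split()
target = "stocks fell today as markets reacted to trade news".split()
vocab = set(background) & set(target)  # the common terms
p = term_distribution(background, vocab)
q = term_distribution(target, vocab)
print(common_term_coverage(set(background), set(target)))  # scale of common terms
print(jensen_shannon(p, q, vocab))                         # distributional similarity
```

Under this sketch, a background corpus would be preferred when its divergence from the target is low, even if its coverage ratio is lower than that of a much larger general corpus.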
DOI
https://doi.org/10.26615/978-954-452-056-4_150
Recommended Citation
Wang, F., Ross, R., & Kelleher, J. (2019). Bigger versus Similar: Selecting a Background Corpus for First Story Detection Based on Distributional Similarity. Proceedings of Recent Advances in Natural Language Processing, Varna, Bulgaria, Sept.2-4, pp.1312-1320. doi:10.26615/978-954-452-056-4_150
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Publication Details
Proceedings of Recent Advances in Natural Language Processing