Document Type

Theses, Ph.D

Rights

This item is available under a Creative Commons License for non-commercial use only

Disciplines

1.2 COMPUTER AND INFORMATION SCIENCE

Publication Details

Thesis successfully submitted for the degree of Doctor of Philosophy.

Abstract

First Story Detection (FSD) is an important application of online novelty detection within Natural Language Processing (NLP). Given a stream of documents, or stories, about news events in chronological order, the goal of FSD is to identify the very first story for each event. While a variety of NLP techniques have been applied to the task, FSD remains challenging because it is still not clear which factor is most crucial in defining “story novelty”. Given these challenges, the thesis addressed in this dissertation is that the notion of novelty in FSD is multi-dimensional. To address this, the work presented has adopted a three-dimensional analysis of the relative qualities of FSD systems and gone on to propose a specific method that we argue significantly improves understanding and performance of FSD.

FSD is of course not a new problem type; therefore, our first dimension of analysis consists of a systematic study of detection models for first story detection and the distances that are used in the detection models for defining novelty. This analysis presents a tripartite categorisation of the detection models based on the end points of the distance calculation. The study also considers issues of document representation explicitly, and shows that even in a world driven by distributed representations, the nearest neighbour detection model with TF-IDF document representations still achieves state-of-the-art performance for FSD. We provide analysis of this important result and suggest potential causes and consequences.

Events are introduced and change at a slow rate relative to the frequency at which words come in and out of usage on a document-by-document basis. Therefore we argue that the second dimension of analysis should focus on the temporal aspects of FSD. Here we are concerned with not only the temporal nature of the detection process, e.g., the time/history window over the stories in the data stream, but also the processes behind the representational updates that underpin FSD. Through a systematic investigation of static representations, and also dynamic representations with both low and high update frequencies, we show that while a dynamic model unsurprisingly outperforms static models, the dynamic model in fact stops improving and remains steady once the update frequency exceeds a threshold.

Our third dimension of analysis moves across to the particulars of lexical content, and critically the effect of terms in the definition of story novelty. We provide a specific analysis of how terms are represented for FSD, including the distinction between static and dynamic document representations, and the effect of out-of-vocabulary terms and the specificity of a word in the calculation of the distance. Our investigation showed that term distributional similarity, rather than the scale of common terms across the background and target corpora, is the most important factor in selecting background corpora for document representations in FSD. More crucially, in this work the simple idea of new terms emerged as a vital factor in defining novelty for the first story.

DOI

https://doi.org/10.21427/spp0-zx14
