Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence
Machine learning has been successfully applied to a wide range of prediction problems, yet its application to data streams can be complicated by concept drift. Existing approaches to handling concept drift are overwhelmingly reliant on the assumption that it is possible to obtain the true label of an instance shortly after classification at a negligible cost. The aim of this thesis is to examine, and attempt to address, some of the problems related to handling concept drift when the cost of obtaining labels is high. This thesis presents Decision Value Sampling (DVS), a novel concept drift handling approach which periodically chooses a small number of the most useful instances to label. The newly labelled instances are then used to re-train the classifier, an SVM with a linear kernel, to handle any change in concept that might occur. In this way, only the instances that are required to keep the classifier up-to-date are labelled. The evaluation of the system indicates that a classifier can be kept up-to-date with changes in concept while only requiring 15% of the data stream to be labelled. In a data stream with a high throughput this represents a significant reduction in the number of labels required.
The second novel concept drift handling approach proposed in this thesis is Confidence Distribution Batch Detection (CDBD). CDBD uses a heuristic based on the distribution of an SVM’s confidence in its predictions to decide when to rebuild the clas- sifier. The evaluation shows that CDBD can be used to reliably detect when a change in concept has taken place and that concept drift can be handled if the classifier is rebuilt when CDBD sig- nals a change in concept. The evaluation also shows that CDBD obtains a considerable labels saving as it only requires labelled data when a change in concept has been detected. The two concept drift handling approaches deal with concept drift in a different manner, DVS continuously adapts the clas- sifier, whereas CDBD only adapts the classifier when a sizeable change in concept is suspected. They reflect a divide also found in the literature, between continuous rebuild approaches (like DVS) and triggered rebuild approaches (like CDBD). The final major contribution in this thesis is a comparison between continuous and triggered rebuild approaches, as this is an underexplored area. An empirical comparison between representative techniques from both types of approaches shows that triggered rebuild works slightly better on large datasets where the changes in concepts occur infrequently, but in general a continuous rebuild approach works the best.
Lindstrom, P. (2013). Handling Concept Drift in the Context of Expensive Labels. Doctoral Thesis. Technological University Dublin. doi:10.21427/D7HP48