This item is available under a Creative Commons License for non-commercial use only
Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe the ECUE system, which classifies emails using a feature-based form of textual CBR. Then, we describe an alternative way to compute the distances between cases in a feature-free fashion, using a distance measure based on text compression. This distance measure has the advantages of having no set-up costs and being resilient to concept drift. We report an empirical comparison, which shows the feature-free approach to be more accurate than the feature-based system. These results are fairly robust across compression algorithms: the accuracy when using a Lempel-Ziv compressor (GZip) is approximately the same as when using a statistical compressor (PPM). We note, however, that the feature-free systems take much longer to classify emails than the feature-based system.
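To make the idea of a compression-based distance concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes the normalized compression distance (NCD), one well-known measure of this kind; the abstract does not state which formula the authors use. It also assumes `zlib` (DEFLATE, the Lempel-Ziv scheme underlying GZip) as the compressor. The names `ncd`, `classify` and `CASE_BASE`, and the example emails, are hypothetical.

```python
import zlib


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance (an assumed choice of measure).

    Values near 0 mean the two texts share a lot of content; values
    near 1 mean they share very little.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: str, case_base: list[tuple[str, str]]) -> str:
    """Return the label of the case nearest to query (1-nearest-neighbour)."""
    _, nearest_label = min(case_base, key=lambda case: ncd(query, case[0]))
    return nearest_label


# Hypothetical case base of (email text, label) pairs, for illustration only.
CASE_BASE = [
    ("Buy cheap pills now! Limited offer, click here to claim your "
     "free prize today.", "spam"),
    ("You have won a lottery! Send your bank details to claim the "
     "cash reward now.", "spam"),
    ("Hi Derek, the project meeting is moved to Tuesday at ten in "
     "the seminar room.", "ham"),
    ("Please find attached the draft of our conference paper for "
     "your comments.", "ham"),
]

if __name__ == "__main__":
    email = ("Limited offer, click here to claim your free prize today. "
             "Buy cheap pills now!")
    print(classify(email, CASE_BASE))
```

No features are extracted or selected before classification, which illustrates the absence of set-up costs. Each classification, however, compresses the query together with every case in the case base, which is consistent with the longer classification times noted above.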
Delany, Sarah Jane and Bridge, Derek: Feature based and feature free textual CBR: a comparison in spam filtering. Proceedings of the 17th Irish Conference on Artificial Intelligence and Cognitive Science (AICS'06), pp. 244-253. Edited by D. Bell, P. Milligan and P. Sage.