Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence
Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe the ECUE system, which classifies emails using a feature-based form of textual CBR. We then describe an alternative, feature-free way to compute the distances between cases, using a distance measure based on text compression. This distance measure has the advantages of having no set-up costs and being resilient to concept drift. We report an empirical comparison, which shows the feature-free approach to be more accurate than the feature-based system. These results are fairly robust across compression algorithms: accuracy with a Lempel-Ziv compressor (GZip) is approximately the same as with a statistical compressor (PPM). We note, however, that the feature-free systems take much longer to classify emails than the feature-based system.
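To make the feature-free idea concrete, here is a minimal Python sketch of a compression-based distance used inside a k-nearest-neighbour classifier. The abstract does not give the exact distance formula, so the normalized compression distance (NCD) used here is an assumption, as are the function names (`compressed_size`, `ncd`, `classify`) and the tiny case base. Python's `zlib` module stands in for GZip (both use the Lempel-Ziv-based DEFLATE algorithm); PPM is not in the standard library and is not shown.

```python
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib (DEFLATE) compressed form of data."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    Assumption: NCD is one common compression-based measure; the
    measure used in the paper may differ.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: str, case_base: list[tuple[str, str]], k: int = 3) -> str:
    """Label a query by majority vote over its k nearest cases.

    case_base is a list of (text, label) pairs; no features are
    extracted, so there is no set-up step beyond storing the cases.
    """
    neighbours = sorted(case_base, key=lambda case: ncd(query, case[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical case base, for illustration only.
cases = [
    ("Buy cheap pills now! Limited offer, click here to claim your free prize today.", "spam"),
    ("You have won a free prize! Click here now to claim your cheap holiday offer.", "spam"),
    ("Hi Sarah, the project meeting is moved to Tuesday at ten; agenda attached.", "ham"),
    ("Please review the attached minutes from the department meeting before Friday.", "ham"),
]
print(classify("Click here now to claim your free prize and buy cheap pills today!", cases))
```

Each classification compresses the query concatenated with every stored case, which is consistent with the abstract's remark that the feature-free systems take much longer to classify emails than the feature-based one.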
Delany, S. J. & Bridge, D. (2006). Feature based and feature free textual CBR: a comparison in spam filtering. Proceedings of the 17th Irish Conference on Artificial Intelligence and Cognitive Science (AICS'06), pp. 244-253. Edited by D. Bell, P. Milligan and P. Sage. doi:10.21427/86mn-ks85