Document Type

Conference Paper


Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence

Publication Details

In Proceedings of the 17th. Irish Conference on Artificial Intelligence and Cognitive Science (AICS\06) pp.244-253. Edited by D. Bell, P. Milligan and P. Sage.


Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe the ECUE system, which classifies emails using a feature-based form of textual CBR. Then, we describe an alternative way to compute the distances between cases in a feature-free fashion, using a distance measure based on text compression. This distance measure has the advantages of having no set-up costs and being resilient to concept drift. We report an empirical comparison, which shows the feature-free approach to be more accurate than the feature-based system. These results are fairly robust over different compression algorithms in that we find that the accuracy when using a Lempel-Ziv compressor (GZip) is approximately the same as when using a statistical compressor (PPM). We note, however, that the feature-free systems take much longer to classify emails than the feature-based system.