The University of Sussex

Virtual seens and the frequently used dataset

Chris Thornton

The paper considers the situation in which a learner's testing set contains close approximations of cases that appear in the training set. Such cases can be considered 'virtual seens', since the learner has approximately seen them already. Generalisation measures that do not take account of the frequency of virtual seens may be misleading. The paper shows that the 1-NN algorithm can be used to derive a normalising baseline for generalisation statistics. The normalisation process is demonstrated through application to Holte's [1993] study, in which the generalisation performance of the 1R algorithm was compared with that of C4.5 on 16 commonly used datasets.
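
The idea of a 1-NN baseline can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the procedure used in the paper: the abstract does not give the normalisation formula, so the subtraction in `baseline_adjusted`, the `radius` threshold used to count virtual seens, the Euclidean distance, the function names and the toy data are all hypothetical.

```python
import math

def nearest_neighbour(train, x):
    """Return (distance, label) for the training case closest to x."""
    pairs = ((math.dist(features, x), label) for features, label in train)
    return min(pairs, key=lambda pair: pair[0])

def one_nn_accuracy(train, test):
    """Fraction of test cases given the correct label by 1-NN."""
    hits = sum(nearest_neighbour(train, x)[1] == y for x, y in test)
    return hits / len(test)

def virtual_seen_rate(train, test, radius):
    """Fraction of test cases within `radius` of some training case.

    The threshold is an assumption made for this sketch; the abstract
    does not say how close a case must be to count as a virtual seen.
    """
    close = sum(nearest_neighbour(train, x)[0] <= radius for x, _ in test)
    return close / len(test)

def baseline_adjusted(learner_accuracy, train, test):
    """Hypothetical normalisation: accuracy minus the 1-NN baseline."""
    return learner_accuracy - one_nn_accuracy(train, test)

# Toy data (invented for illustration): two training cases, three test cases.
train = [((0.0, 0.0), "a"), ((1.0, 1.0), "b")]
test = [((0.1, 0.0), "a"), ((0.9, 1.0), "b"), ((0.4, 0.4), "b")]

print(one_nn_accuracy(train, test))
print(virtual_seen_rate(train, test, radius=0.2))
print(baseline_adjusted(1.0, train, test))
```

In this toy example the first two test cases lie very close to training cases, so 1-NN classifies them correctly with no real generalisation. A learner's raw accuracy on such a test set therefore overstates how well it generalises, which is the concern the abstract raises.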
