----------------------------------------------------------------------- HELP ID3 ID3 is a rule induction algorithm. It uses a set of example cases together with their classes to find a "rule" or decision tree which can classify all the examples into their respective classes. It depends upon the accuracy and representativity of the examples to produce a valid rule. There must be a pattern in the data for ID3 to be able to find one! For a description of the algorithm see: 1) Quinlan, J.R., "Discovering Rules by Induction from Large Collections of Examples", in Michie, D. (Ed) "Expert Systems in the Micro Electronic Age", Edinburgh University Press, 1979. 2) Thompson, B., and Thompson W., "Finding Rules in Data", in "Byte", Vol. 11, No. 12, Novermber 1986, pp149-158. Quinlan(1), is the usual reference but the article in Byte gives a highly readable description of the algorithm, very much in the Byte mould. This file describes the use of the ID3 procedures and provides some tutorial examples. --- PROCEDURES ----------------------------------------------------------- There are four user procedures, ID3, CLASSIFY, BATCH_CLASSIFY and ID3_CHISQD. id3(examples,attributes,classes) -> rule; classify(rule,attributes,prompts,class_strings); batch_classify(rule,attributes) -> class; id3_chisqd(examples,attributes,classes,significance) -> rule; --- EXAMPLES -------------------------------------------------------------- Suppose we have a problem concerning a simple two input logical AND gate. The output is true if and only if both of the inputs are true. This "rule" can be induced (inferred from examples) by ID3. To do this we must tell it what the attributes are (the inputs), and what values they can hold and they must be presented in list format as follows, Attributes are :- [ [LOGICAL Input_1 true false] [LOGICAL Input_2 true false] ] The different classes have to be supplied too, Classes are :- [CLASSES Output true false] And the example set. This is a list of lists of the form: [ ... ] In this problem the Example set is:- [ [true true true] [true false false] [false true false] [false false false] ] If the procedure ID3 is called with these parameters, it will return the decision rule shown below. ID3(Examples, Attributes, Classes) -> Rule; Rule==> [ LOGICAL Input_1 [true [LOGICAL Input_2 [true [CLASS Output true]] [false [CLASS Output false]]]] [false [CLASS Output false]]] This can then be used to classify further instances. In this case we have a very simple problem and the example set has covered all of the possibilities. We could have given the same information to ID3 in a more concise manner. Observing that if either input is false, then the output is false, we can use the special reserved symbol "*" to mean the "don't care" value. Thus we can use the following example set, and still obtain the same decision rule as for the larger example set: [ [true true true] ;;; Example set using the don't care value, "*". [false * false] [* false false] ] The value sets in this problem all happen to be {true false}. This is just a feature of the problem. They can all be completely different and the size of the sets is not restricted (but see ID3_CHISQD!). ID3 is best suited to problems involving symbols, but attributes can be integers as well, (positive and negative). Problems involving both numbers and symbols are permissible. The following example illustrates most of these points. It recommends a holiday destination (or none at all). It does make the assumption that you will go for the best holiday you can afford (the view the travel agent might take!). [ [LOGICAL abroad definitely definitely_not home_or_abroad] [INTEGER money] ] -> Attributes; [CLASSES Holiday save_up spain usa lake_district] -> Classes; [ [definitely 500 spain] [definitely 1000 usa] [definitely 300 save_up] [home_or_abroad 300 lake_district] [definitely_not 200 lake_district] [* 0 save_up] ] -> Examples; id3(Examples,Attributes,Classes) -> Rule; This generates the following rule, note that numerical attributes are always split into two subsets and the average of two numbers is used as the value on which to split, eg (1000+500)/2 = 750 is the value used to decide whether to recommend spain or the usa. In applications where numbers should not be considered as having a range, then don't use them, use a symbolic representation (eg "pounds1000") instead. Rule==> [INTEGER money [[< 400] [INTEGER money [[< 100] [CLASS Holiday save_up]] [[>= 100] [LOGICAL abroad [definitely [CLASS Holiday save_up]] [definitely_not [CLASS Holiday lake_district]] [home_or_abroad [CLASS Holiday lake_district]]]]]] [[>= 400] [INTEGER money [[< 750] [CLASS Holiday spain]] [[>= 750] [CLASS Holiday usa]]]]] Note, when the rule splits on the attribute "abroad" it partitions the examples into three subsets, one for each possible value. If ever it finds that there are no examples, (amongst those it has at that point), with a particular of the attribute, then it will create a null node for that value. Using the procedure CLASSIFY, we may consult the rule to find out what we should do for our holiday. The interaction might be as follows, (the asterisks denote what the user has entered). ------------------------------------------------------------------------------ : classify(Rule,Attributes,nil,nil); What is the value of money, integer value required ? 250 *** What is the value of abroad? Enter one of, definitely, definitely_not, home_or_abroad ? home_or_abroad ************** In this case Holiday has value lake_district. ------------------------------------------------------------------------------ In this case the decision rule advises us to go to the lakes. Note that the language is not domain specific so it seems stilted. It would be easy to associate question forms with each attribute and the final classification statement. This is the purpose of the "prompts" and "class_strings" parameters. These are lists and can be nil as in the above example, in which case the default forms are used. The prompts list indicates the question which should be used when asking about each attribute. The format of the "prompts" list is: [[ 'a prompt string'] .... ] -> prompts; Note that the keys are words and the prompts are strings enclosed in single quotes. If you save these to a file you should do so with the value of POP_PR_QUOTES set to true (default is false). The format of the class string list is similar. It is used when a test case has been classified to indicate what advice should be given the user. [[ 'a piece of advice'] ... ] -> class_strings; If there is no user-supplied string for a particular key, then the default will be used. Now let us repeat the previous interaction with the following addition: [[ money 'How much money do you have to spend on the holiday ?']] -> prompts; [[ lake_district 'Why not visit the lake district this year.']] -> class_strings; ------------------------------------------------------------------------------ : classify(Rule,Attributes,prompts,class_strings); How much money do you have to spend on the holiday ? Integer value required ? 250 *** What is the value of abroad? Enter one of, definitely, definitely_not, home_or_abroad ? home_or_abroad ************** Why not visit the lake district this year. ------------------------------------------------------------------------------ Since no entry was made for "abroad" in the prompts list it still uses the default. --- BATCH_CLASSIFY --------------------------------------------------------- This procedure is non-interactive and returns the class if it was able to classify the case and the value (as opposed to the word false) otherwise. batch_classify(rule, attributes) -> class; It is thus useful for testing a rule derived from an example set against a further test set. This could be used then to try generating different rules using different subsets of data to find a trade off between rule complexity and performance. This would be of particular use in problems which use ID3_CHISQD rather than the usual ID3 algorithm. --- ID3_CHISQD --------------------------------------------------------------- id3_chisqd(examples,attributes,classes,significance) -> rule; ID3 does not work so well in noisy domains, eg. where some of the example cases may be wrongly classified. The decision tree can become excessively "bushy". A solution proposed by Quinlan and others is to use the chi-squared significance test instead of entropy as a partioning and termination criterion. This is not so much a general purpose procedure as the ID3 procedure using entropy. It is subject to a number of restrictions which limit its applicability. These are enumerated below. 1) The chi-squared test uses a "degrees of freedom" value which is calculated for an attribute as (no_of_classes -1) * (no_of_values -1). This value has an upper limit of 40 as that is how much of the table is stored for the different significance levels. Furthermore, the number of d.o.f. must be the same for all attributes. This is necessary since it is not possible to compare the chi-squared value directly for different d.o.f. For instance, integer attributes are always split into two subsets, then dof = (no_of_classes -1) * (2 - 1) = no_of_classes - 1. Other binary attributes could be used together with integers since these would have the same d.o.f. Problems involving only n-ary attributes are also possible. 2) For a chi-squared test the minimum recommended total frequency is 50. This means at least 50 examples. Hundreds would be better. 3) The chi-squared test must be given the significance level, 90%, 95% and 99% are those built into ID3_CHISQD. 4) The "expected" frequency for each attribute-class combination must be at least 5. These are all imposed by the nature of the chi-squared test. The algorithm may consequently be unable to completely classify all the examples. In this situation it will produce a special leaf node which indicates the probabilities of each class on the basis of the composition of the example subset at that node. eg [[PROB_CLASS class_name value1 50] [PROB_CLASS class_name value2 30] [PROB_CLASS class_name value3 20] ] The CLASSIFY and BATCH_CLASSIFY routines can cope with this form. BATCH_CLASSIFY regards such a node as a failure however. This may not always be the desired response. --- SYNTAX OF THE INPUTS --------------------------------------------------- The following is a BNF description of the syntax of the input lists to the classify procedure. NB * means 0 or more of + means 1 or more of examples ::= [ + ] attributes ::= | classes ::= [ CLASSES + ] ::= [ + ] ::= [ LOGICAL + ] ::= [ INTEGER ] ,, :== 'any legal pop11 word' The "don't care" value is allowed as an attribute value in examples only. ---- END OF HELP FILE ---- END OF HELP FILE ---- END OF HELP FILE ---- --- $poplocal/local/help/id3_chisqd --- Copyright University of Sussex 1991. All rights reserved. ----------