Information gain is the criterion the ID3 decision tree algorithm uses to decide which attribute to split on at each step of building the tree. It measures the reduction in entropy (uncertainty) about the classification of the training examples once the set has been partitioned on a specific attribute; the attribute with the highest information gain is chosen.
The computation is based on Shannon entropy:
- First, the entropy of a set of instances $S$, with respect to a binary classification, is calculated as $Entropy(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$, where:
- $p_{+}$ is the proportion of positive examples in $S$.
- $p_{-}$ is the proportion of negative examples in $S$.
- Next, for a given attribute $A$, the set $S$ is partitioned into subsets $S_v$, one for each possible value $v$ of $A$. The remaining entropy after the split is the weighted average of the entropies of these subsets: $\sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$, where:
- $|S_v|$ is the number of instances in the subset with value $v$.
- $|S|$ is the total number of instances.
- Finally, the information gain for attribute $A$ is the entropy of the original set minus the remaining entropy after splitting on $A$: $Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$.
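To make the entropy step concrete, here is a minimal Python sketch of the binary entropy calculation. The function name `entropy` and the choice to represent labels as booleans (True for a positive example) are illustrative assumptions, not part of ID3 itself.

```python
import math

def entropy(labels):
    # Shannon entropy of a collection of binary class labels (booleans).
    # p_plus / p_minus are the proportions of positive / negative examples;
    # a proportion of 0 contributes 0 to the sum by the usual convention.
    n = len(labels)
    if n == 0:
        return 0.0
    p_plus = sum(1 for y in labels if y) / n
    p_minus = 1.0 - p_plus
    return sum(-p * math.log2(p) for p in (p_plus, p_minus) if p > 0)
```

For example, a set with 9 positive and 5 negative examples gives `entropy([True] * 9 + [False] * 5)` ≈ 0.940 bits.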
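Building on that helper, the following sketch computes the information gain of a candidate attribute. The data layout (a list of dicts mapping attribute names to values, plus a parallel list of boolean labels) is again only an assumption made for illustration.

```python
def information_gain(examples, labels, attribute):
    # Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v).
    total = len(labels)
    base = entropy(labels)

    # Partition the labels by the value each example takes for `attribute`.
    partitions = {}
    for example, label in zip(examples, labels):
        partitions.setdefault(example[attribute], []).append(label)

    # Weighted average of the subset entropies (the remaining entropy).
    remainder = sum(len(subset) / total * entropy(subset)
                    for subset in partitions.values())
    return base - remainder
```

ID3 would evaluate `information_gain` for every candidate attribute and split the node on the one with the largest value.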