Decision trees use multiple algorithms to decide whether to split a node into two or more sub-nodes. Creating sub-nodes increases the homogeneity of the resulting sub-nodes. Decision trees come in several well-known variants:
1. CHAID (Chi-square Automatic Interaction Detector)
2. CART (Classification and Regression Tree)
3. ID3 (Iterative Dichotomiser 3)
In this post we will discuss CHAID. It is an algorithm that finds the statistical significance of the differences between sub-nodes and their parent node, measured by the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.
Let’s look at the basic terminology used with Decision trees:
- Root Node: represents the entire population or sample; it is further divided into two or more homogeneous sets.
- Splitting: the process of dividing a node into two or more sub-nodes.
- Decision Node: a sub-node that splits into further sub-nodes.
- Leaf/Terminal Node: a node that does not split.
- Pruning: removing the sub-nodes of a decision node; you can think of it as the opposite of splitting.
- Branch/Sub-Tree: a subsection of the entire tree.
- Parent and Child Node: a node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are the children of the parent node.
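The terminology above can be made concrete with a minimal sketch of a tree structure in Python. The class and method names here are illustrative only, not taken from any particular library:

```python
class Node:
    """A minimal decision-tree node illustrating the terms above."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # parent node (None for the root node)
        self.children = []        # child sub-nodes

    def split(self, *names):
        """Splitting: divide this node into two or more sub-nodes."""
        self.children = [Node(n, parent=self) for n in names]
        return self.children

    def prune(self):
        """Pruning: remove the sub-nodes of a decision node."""
        self.children = []

    @property
    def is_leaf(self):
        """Leaf/terminal node: a node that does not split."""
        return not self.children


root = Node("root")                      # root node: the entire sample
left, right = root.split("low", "high")  # root becomes a decision node
left.split("low-a", "low-b")             # 'left' is now a decision node too
print(left.is_leaf)   # False
print(right.is_leaf)  # True
left.prune()          # the opposite of splitting
print(left.is_leaf)   # True
```

Here `left` is both a child of the root and, once split, a parent (decision node) of its own sub-nodes; pruning it turns it back into a leaf.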
These terms are commonly used for decision trees. As with every algorithm, CHAID has advantages and disadvantages, and the important points to know are covered below.
CHAID (Chi-square Automatic Interaction Detector) analysis is an algorithm used for discovering relationships between a categorical response variable and other categorical predictor variables. It is useful when looking for patterns in datasets with lots of categorical variables and is a convenient way of summarising the data as the relationships can be easily visualised.
As the name indicates, CHAID uses Pearson's chi-squared test of independence, which tests for an association between two categorical variables. A statistically significant result indicates that the two variables are not independent, i.e., there is a relationship between them.
- It works with a categorical target variable such as "Success" or "Failure".
- It can split a node into two or more sub-nodes.
- The higher the Chi-square value, the higher the statistical significance of the difference between a sub-node and its parent node.
- The Chi-square of each node is calculated using the formula Chi-square = Σ (Actual − Expected)² / Expected, summed over the target classes.
- It generates a tree called CHAID (Chi-square Automatic Interaction Detector).
Steps to calculate the Chi-square for a split:
- Calculate the Chi-square of each individual node by computing the deviation between actual and expected frequencies for both Success and Failure.
- Calculate the Chi-square of the split as the sum of the Chi-square values of all nodes in the split.
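These two steps can be sketched directly in Python. The counts below are made-up example data: a hypothetical parent node with 30 Successes and 20 Failures, split into two child nodes:

```python
# Hypothetical split: parent node with 30 Success / 20 Failure,
# divided into two child nodes.
parent = {"Success": 30, "Failure": 20}
nodes = [
    {"Success": 22, "Failure": 8},   # child node 1 (30 records)
    {"Success": 8,  "Failure": 12},  # child node 2 (20 records)
]

parent_total = sum(parent.values())

def node_chi_square(node):
    """Step 1: Chi-square of one node, summing the deviation
    (Actual - Expected)^2 / Expected over Success and Failure."""
    node_total = sum(node.values())
    chi2 = 0.0
    for cls, actual in node.items():
        # Expected frequency: the node keeps the parent's class ratio.
        expected = node_total * parent[cls] / parent_total
        chi2 += (actual - expected) ** 2 / expected
    return chi2

# Step 2: Chi-square of the split = sum over all child nodes.
split_chi_square = sum(node_chi_square(n) for n in nodes)
print(round(split_chi_square, 3))  # → 5.556
```

A large value like this relative to the chi-squared distribution indicates the sub-nodes differ significantly from the parent, so the split is informative.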
CHAID differs from CART by allowing multiple splits on a variable. For classification problems, it relies on the chi-squared test to determine the best split at each step; for regression problems, it uses the F-test.
The F-test (instead of the chi-square test) is used to compare the difference between two population means. If the F-test is significant, a new partition (child node) is created, meaning the partition is statistically different from the parent node. If, on the other hand, the F-test between target means is not significant, the categories are merged into a single node.

Key elements of the CHAID process are as follows:
1. Preparing the predictor variables: continuous variables are "binned" into a set of categories, where each category is a subrange of the variable's full range. This binning lets CHAID accept both categorical and continuous inputs, although internally it works only with categorical variables.
2. Merging categories: the categories of each variable are analyzed to determine which can safely be merged to reduce the number of categories.
3. Selecting the best split: the algorithm searches for the split point with the smallest adjusted p-value (a probability value related to statistical significance).
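A rough sketch of steps 1 and 3, using SciPy's chi-squared contingency test on a binned continuous predictor. The data and bin edges are invented for illustration, and this simplification considers only binary cut points and uses raw (not Bonferroni-adjusted) p-values, unlike full CHAID:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

# Invented data: a continuous predictor and a binary target that
# depends on it (higher values of x -> more likely "Success").
x = rng.uniform(0, 100, size=400)
y = (rng.uniform(0, 100, size=400) < x).astype(int)

# Step 1: bin the continuous variable into categories (subranges).
bins = [0, 25, 50, 75, 100]
categories = np.digitize(x, bins[1:-1])  # category index 0..3

# Step 3 (simplified): for each candidate cut point between bins,
# cross-tabulate the two resulting groups against the target and
# keep the cut with the smallest p-value.
best = None
for cut in (1, 2, 3):
    groups = (categories < cut).astype(int)
    table = np.array([[np.sum((groups == g) & (y == t)) for t in (0, 1)]
                      for g in (0, 1)])
    chi2, p, _, _ = chi2_contingency(table)
    if best is None or p < best[1]:
        best = (cut, p, chi2)

print("best cut before bin:", best[0])
```

Real CHAID would also perform step 2 (merging categories whose target distributions do not differ significantly) and adjust the p-values for the number of comparisons.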
Advantages of CHAID
- It is fast!
- CHAID builds “wider” decision trees, because it is not constrained (like CART) to make binary splits, making it very popular in market research.
- CHAID may yield many terminal nodes connected to a single branch, which can be conveniently summarized in a simple two-way contingency table, with multiple categories for each variable.
Disadvantages of CHAID
1. Since multiple splits fragment the variable’s range into smaller subranges, the algorithm requires larger quantities of data to get dependable results.
2. The CHAID tree may be unrealistically short and uninteresting, because the multiple splits are hard to relate to real business conditions.
3. Continuous (real-valued) variables are forced into categorical bins before analysis, which may not be helpful, particularly if the order of the values should be preserved. The binned categories are inherently unordered, so it is possible for CHAID to group "low" and "high" against "middle," which may not be desired.
CHAID does not impute missing values; it handles them as a single class, which may be merged with another class if suitable. It also produces decision trees that tend to be wider rather than deeper (a consequence of multiway splits), which may be unrealistically short and hard to relate to real business conditions. Additionally, it has no pruning function.
Although not the most powerful (in terms of detecting the smallest possible differences) or the fastest decision-tree algorithm out there, CHAID is easy to manage, flexible, and can be very useful.