Missing values in CHAID: risks of their use
On Tuesday, the 17th of April, a new open seminar of the Research and Study Group (RSG) was held. At the meeting, Svetlana Zhuchkova presented the results of research on the use of missing values in the CHAID method. The report had previously been well received at the 19th April International Academic Conference on Economic and Social Development.
The seminar was purely methodological and devoted to CHAID, one of the decision tree algorithms most widely used in sociology. CHAID, like most other decision tree methods, allows missing values of a variable to be included in the analysis, where they are treated as a separate category of that variable. In the study, conducted by Svetlana Zhuchkova and Alexey Rotmistrov, a total of 780 experiments were run on a sample of 3000 observations to determine how the inclusion of missing values affects the construction of a decision tree. Replacing valid values of variables with missing ones, the researchers recorded how the tree changed in comparison with a fully studied original tree, which was built from a nominal dependent variable and three predictors (nominal, scale, and nominal, in order from the root of the tree). Missing values were assigned at random and only to the nominal predictors. Half of the experiments added missing values to the variable at the root of the tree, and the other half added them to the predictor not at the root. The experiments also differed in the prediction accuracy of the original tree (75% and 100%) and in the percentage of missing data (10%, 25%, and 50%).
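The manipulation described above can be illustrated with a minimal sketch. It is not the researchers' code: the data are synthetic, and the names `inject_missing`, `chi_square`, and the label `MISSING` are hypothetical. It replaces a random share of a nominal predictor's values with a separate "missing" category and computes the Pearson chi-square statistic that CHAID relies on when the dependent variable is nominal.

```python
import random
from collections import Counter

MISSING = "missing"  # hypothetical label for the extra category


def inject_missing(values, share, rng):
    """Return a copy of values with a random `share` of entries set to MISSING."""
    n_missing = round(len(values) * share)
    out = list(values)
    for i in rng.sample(range(len(values)), n_missing):
        out[i] = MISSING
    return out


def chi_square(predictor, target):
    """Pearson chi-square statistic and degrees of freedom for two nominal variables."""
    n = len(predictor)
    rows, cols = Counter(predictor), Counter(target)
    cells = Counter(zip(predictor, target))
    stat = 0.0
    for r, r_n in rows.items():
        for c, c_n in cols.items():
            expected = r_n * c_n / n
            stat += (cells[(r, c)] - expected) ** 2 / expected
    return stat, (len(rows) - 1) * (len(cols) - 1)


rng = random.Random(0)
n = 3000  # sample size mentioned in the report; the data themselves are synthetic
predictor = [rng.choice(["a", "b", "c"]) for _ in range(n)]
linked = {"a": "x", "b": "y", "c": "z"}
# The target follows the predictor three times out of four, otherwise it is random.
target = [linked[p] if rng.random() < 0.75 else rng.choice(["x", "y", "z"])
          for p in predictor]

baseline = chi_square(predictor, target)
results = {}
for share in (0.10, 0.25, 0.50):  # percentages of missing data from the report
    damaged = inject_missing(predictor, share, rng)
    stat, dof = chi_square(damaged, target)
    results[share] = (Counter(damaged)[MISSING], stat, dof)
    print(f"share={share:.2f} missing={results[share][0]} chi2={stat:.1f} dof={dof}")
```

In this toy setting the missing values form a fourth category of the predictor, so the contingency table gains a row, and the association with the target weakens as the share of missing values grows. A real CHAID run would additionally merge categories and apply adjusted significance tests, which this sketch leaves out.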
The results of the experiments showed that, overall, CHAID correctly allocates missing values to nodes; however, in most cases the inclusion of missing values in the analysis is accompanied by structural changes in the tree: for example, new nodes appear and existing nodes disappear. Svetlana and Alexey compiled a so-called tree damage index, from which criteria for the acceptability of missing values in the analysis were derived. They found that, to build a good tree, first, the proportion of missing values should be low; second, the missing values should be located away from the root of the tree; third, the researcher should be pursuing prediction rather than the search for interactions; and finally, the missingness should be nonrandom in nature.
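The report does not give the definition of the tree damage index, so the following sketch does not reproduce it. It only illustrates, under an assumed nested-dictionary tree format and with hypothetical names (`node_paths`, `p1`, `p2`), one simple way to count nodes that appear and disappear between two trees.

```python
def node_paths(tree, prefix=()):
    """Collect the path from the root to every node as a set of tuples."""
    paths = {prefix}
    for label, child in tree.get("children", {}).items():
        step = (tree["split"], label)  # (splitting variable, category)
        paths |= node_paths(child, prefix + (step,))
    return paths


# Hypothetical original tree: the root splits on p1, branch "a" splits again on p2.
original = {
    "split": "p1",
    "children": {
        "a": {"split": "p2", "children": {"low": {}, "high": {}}},
        "b": {},
    },
}

# Hypothetical tree after adding missing values to p1: a "missing" branch
# appears and the second-level split under "a" is lost.
with_missing = {
    "split": "p1",
    "children": {"a": {}, "b": {}, "missing": {}},
}

before, after = node_paths(original), node_paths(with_missing)
appeared = after - before
disappeared = before - after
print(f"appeared: {len(appeared)}, disappeared: {len(disappeared)}")
```

Comparing sets of root-to-node paths treats a node as unchanged only if the whole sequence of splits leading to it is the same, which is one possible convention among several.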
At the end of the talk, to demonstrate the significance of the study more clearly to the RSG participants, Svetlana repeated the experiment on the model of Tamara Mkhitaryan, who had presented her report on the 20th of March. The results confirmed the patterns found earlier: the tree changed significantly when missing values were added to it.
After talking about the limitations and prospects of the study, the participants discussed the CHAID method, its shortcomings, and its advantages.