The study compares three approaches to handling missing data in categorical variables: complete case analysis (CCA), multiple imputation (MI) based on random forest, and the missing-indicator method (MIM). Focusing on OLS regression, we describe how the choice of approach depends on the missingness mechanism, the proportion of missing data, and the model specification. The results of a simulation experiment show that each approach may lead to either almost unbiased or dramatically biased estimates. The choice of approach should be based primarily on the missingness mechanism: one should choose CCA under MCAR, MI under MAR, and, again, CCA under MNAR. Although MIM also produces almost unbiased estimates under MCAR and MNAR, it yields inefficient regression coefficients, that is, coefficients with inflated standard errors and, consequently, incorrect p-values.
If missingness is encountered in a categorical regressor, which approach is preferable: complete case analysis or the missing-indicator method? The former includes in the analysis (linear regression in our research) only the cases with no missing values on the analyzed variables. This approach is the default in many statistical applications, and although its applicability is often considered rather restricted, recent studies provide evidence that it is widely applicable, even when data are missing not at random. The missing-indicator method, in which missing data are replaced with a single valid value and a new missing-indicator variable is created, claims to be an alternative that keeps the full sample available for analysis and, hypothetically, does not degrade the parameter estimates. Using simulated data and a statistical experiment that controls the missingness mechanism, the missingness proportion, and the specification of the regression model, we compare how biased and inefficient the parameter estimates produced by each approach are. According to the results, neither approach leads to seriously biased estimates, but the missing-indicator method produces inefficient estimates.
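The two approaches defined above can be illustrated with a minimal sketch in Python. The toy records, the column name `region`, the fill value, and both helper functions are hypothetical and serve only to show the data preparation step; they are not the authors' simulation code, and no regression is fitted here.

```python
# Toy records: a categorical regressor "region" with missing values (None).
# All names and values are illustrative assumptions.
rows = [
    {"y": 3.0, "region": "north"},
    {"y": 5.0, "region": None},
    {"y": 4.0, "region": "south"},
    {"y": 6.0, "region": None},
]


def complete_cases(rows, column):
    """Complete case analysis: keep only rows where `column` is observed."""
    return [r for r in rows if r[column] is not None]


def missing_indicator(rows, column, fill_value):
    """Missing-indicator method: replace gaps with one valid value
    and add a 0/1 variable that flags the replaced cases."""
    out = []
    for r in rows:
        is_missing = r[column] is None
        new = dict(r)
        new[column] = fill_value if is_missing else r[column]
        new[column + "_missing"] = int(is_missing)
        out.append(new)
    return out


cca = complete_cases(rows, "region")
mim = missing_indicator(rows, "region", fill_value="north")
print(len(cca), len(mim))
```

The sketch makes the trade-off visible: complete case analysis shrinks the sample, whereas the missing-indicator method keeps every case at the cost of an extra regressor.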
The paper is devoted to procedures for the automatic extraction of data from web pages, i.e., web scraping. We consider different types of web data, such as digital traces and other numeric and text data, as well as their advantages (the speed of data collection and, as a consequence, continuous coverage, efficiency, etc.) and limitations (limited representativeness, difficulties in organizing the storage of large amounts of data, deviation from the traditional procedure for setting up a study, etc.) in comparison with traditional methods of data collection. Various tools for web data extraction (API, requests, and selenium) are described to illustrate the principles of handling static and dynamic web pages. The paper also gives an overview of the minimum competencies needed for web scraping: in particular, programming in Python and navigating the code of web pages. A detailed illustration is given based on a fragment of the data collection process from a recent relevant Russian study.
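As a rough illustration of handling a static page, the sketch below parses a small inline HTML string with Python's standard `html.parser` module. The page content, the `title` class, and the `TitleParser` class are hypothetical. In practice the HTML would first be downloaded (for example, with `requests.get(url).text`, assuming the requests library is installed), and dynamic pages would need a browser automation tool such as selenium.

```python
from html.parser import HTMLParser

# Hypothetical static page; a real page would be downloaded first.
PAGE = """
<html><body>
  <h2 class="title">First post</h2>
  <h2 class="title">Second post</h2>
  <h2 class="other">Sidebar</h2>
</body></html>
"""


class TitleParser(HTMLParser):
    """Collects the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and dict(attrs).get("class") == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())


parser = TitleParser()
parser.feed(PAGE)
titles = parser.titles
print(titles)
```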
The study presents an attempt at a comprehensive exploratory analysis of Russian rap based on a corpus of Russian-language song lyrics of this genre. The corpus contains more than 11,000 texts by more than 500 artists, varying in date of creation and popularity, collected by automatically extracting data from web pages (web scraping). Based on the idea that media, and music in particular, can act as an agent of socialization, the research aims to identify the narratives that are represented in Russian rap and can have a socializing effect on the multi-million audience of the genre, especially on young people. Topic modeling with the BigARTM additive regularization model yields 17 main topics of Russian rap. The analysis of the topic modeling results shows that the narratives of searching for a life path, sad love, and death are the most prevalent, while those dedicated to homeland and success are the least prevalent. To reveal the topics that reach the largest number of listeners, the prevalence of the topics in the texts of three key artists of the Russian hip-hop scene (Basta, Timati, and Oxxxymiron) is analyzed. From a substantive point of view, the research sheds unexpected light on Russian rap, shows the features that distinguish it from American rap, and can serve as a source of hypotheses for future research on Russian rap. From a methodological point of view, the study provides an extensive illustration of the possibilities of applying topic modeling in social science research.
The paper addresses the shortage of well-developed concepts for analyzing multivariate associations among categorical variables. Such associations are rather common in sociological research, as argued in a body of methodological work that substantiates the need for their analysis. Nevertheless, sociological experience with such analysis is rather limited, as is its theoretical generalization. In this study, we try to fill this gap by comparing three methods: CHAID, log-linear analysis, and multiple correspondence analysis. The methods were compared at both the theoretical and the empirical level. The empirical objective was to create a portrait of the electorates of various Russian political parties using the data of the European social research conducted in 2016. By expressing the results of each method as combinations of categories and formulating numerical criteria for the comparison, the study identifies the most effective method for two types of analytical tasks: description and forecasting. According to the results, multiple correspondence analysis was the most effective in descriptive tasks, and log-linear analysis was the most effective in forecasting. The latter conclusion contradicts the prevailing opinion about the efficiency of CHAID when a target variable is present in the data and therefore has high practical significance for the further development of high-precision predictive models in sociological research.
The paper addresses an approach to working with missing data "as is", i.e., treating missing data as one more category of the variable under study. This approach differs radically from the alternatives, which either delete the observations that contain missings or replace missings with valid data. The only method known to us that implements the "as is" approach is CHAID. CHAID belongs to the decision tree class of methods; in itself, this method is of considerable interest to researchers dealing with categorical variables and nonlinear associations.
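The "as is" idea can be sketched in Python as follows. This is not a CHAID implementation: it only recodes missings into an extra category and computes the Pearson chi-square statistic, the criterion on which CHAID's splitting is based, for the resulting cross-tabulation. The label `(missing)`, the toy data, and both function names are assumptions made for illustration.

```python
from collections import Counter

MISSING = "(missing)"  # hypothetical label for the extra category


def recode_missing(values):
    """Turn missing values (None) into one more category of the variable."""
    return [MISSING if v is None else v for v in values]


def chi_square(xs, ys):
    """Pearson chi-square statistic for the cross-tabulation of xs by ys."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row = Counter(xs)
    col = Counter(ys)
    stat = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            stat += (joint[(a, b)] - expected) ** 2 / expected
    return stat


# Toy data: the predictor has three missing values.
predictor = ["a", "a", None, None, "b", "b", None, "a"]
target = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]

recoded = recode_missing(predictor)
stat = chi_square(recoded, target)
print(recoded.count(MISSING), stat)
```

Because the missing category enters the table on the same footing as the valid ones, it can influence which predictor looks most strongly associated with the target.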
In the literature, we did not find an answer to the question of what advantages and limitations the "as is" approach implemented in CHAID has compared with the alternative approaches mentioned. Despite this, tree models with missing data are often found in empirical studies. To start a discussion of this issue, we conducted several series of statistical experiments on generated data with three predictors measured at the categorical and interval levels. It was established empirically that, on the whole, the method distributes missings correctly across the nodes of the tree, but in most cases the inclusion of missings in the analysis is accompanied by changes in the tree structure, and therefore there is a risk of drawing erroneous conclusions. The paper also provides recommendations on the factors to consider when deciding whether to include missings in an analysis "as is".
The paper considers different approaches to factor analysis (FA) of ordinal data. In some studies it is necessary to find a latent variable behind observed indicators measured on an ordinal scale. Classical factor analysis cannot be applied to such indicators because it is built on the Pearson correlation coefficient, which is applicable only to interval variables. The researcher therefore faces a choice: to treat the ordinal variables as interval ones, to dichotomize them, or to use special techniques for ordinal indicators, such as replacing the correlation matrix or using categorical principal components analysis (CatPCA). The study is based on a theoretical comparison of the assumptions that underpin the algorithms of each approach and on a statistical experiment, and it answers the question of which of the above-mentioned factorization approaches is optimal for identifying latent variables measured by ordinal indicators on a 3-point, 5-point, or 10-point scale.
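One way to read "replacing the correlation matrix" is sketched below in Python. It builds a matrix of Spearman rank correlations, which could be passed to a factor extraction step in place of the Pearson matrix. The choice of Spearman correlation is an assumption made here for simplicity, since the abstract does not say which coefficient is used; the toy items and function names are likewise hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to the end of the current run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_matrix(items):
    """Matrix of pairwise Spearman correlations between ordinal items."""
    return [[spearman(a, b) for b in items] for a in items]


# Three toy indicators on a 5-point scale; the third reverses the first.
items = [
    [1, 2, 3, 4, 5, 5],
    [1, 1, 2, 4, 4, 5],
    [5, 4, 3, 2, 1, 1],
]
matrix = correlation_matrix(items)
for row in matrix:
    print([round(v, 3) for v in row])
```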