The paper describes procedures for automatic data extraction from web pages (web scraping) and their advantages and limitations, and gives an overview of the minimum competencies required for web scraping: in particular, programming in Python and navigating a web page's source code. A detailed illustration is also given, based on a fragment of the data collection process from a recent relevant Russian study.
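As a rough illustration of the two competencies named above (Python programming and reading a page's markup), the following minimal sketch pulls links out of an HTML fragment using only the standard library. The page fragment, the `article` class name, and the names `ArticleLinkParser` and `extract_article_links` are assumptions made for this example; they are not taken from the paper.

```python
from html.parser import HTMLParser

# Hypothetical page fragment. A real scraper would download the page first,
# for example with urllib.request.urlopen(url).read().decode("utf-8").
SAMPLE_HTML = """
<html><body>
  <div class="article"><a href="/papers/1">Web scraping for sociologists</a></div>
  <div class="article"><a href="/papers/2">Decision trees and missing data</a></div>
  <div class="footer"><a href="/about">About</a></div>
</body></html>
"""


class ArticleLinkParser(HTMLParser):
    """Collect (href, text) pairs for links inside <div class="article">."""

    def __init__(self):
        super().__init__()
        self.in_article = False    # are we inside a <div class="article">?
        self.current_href = None   # href of the <a> currently being read
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            self.in_article = attrs.get("class") == "article"
        elif tag == "a" and self.in_article:
            self.current_href = attrs.get("href")

    def handle_data(self, data):
        # Keep only non-empty text that sits inside a tracked link.
        if self.current_href is not None and data.strip():
            self.links.append((self.current_href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None
        elif tag == "div":
            self.in_article = False


def extract_article_links(html):
    """Return the article links found in an HTML string."""
    parser = ArticleLinkParser()
    parser.feed(html)
    return parser.links


for href, title in extract_article_links(SAMPLE_HTML):
    print(href, title)
```

The sketch assumes flat, non-nested `div` blocks; real pages are usually messier, which is one reason third-party parsing libraries are often preferred in practice.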
The paper addresses the shortage of elaborated concepts for the analysis of multivariate associations among categorical variables. Such associations are rather common in sociological research, as a corpus of methodological works argues; these works substantiate the need to analyze multivariate associations among categorical variables. Nevertheless, sociological experience with such analysis is rather limited, as is its theoretical generalization. In this study, we try to fill this gap by comparing three methods: CHAID, log-linear analysis, and multiple correspondence analysis. The methods were compared at both the theoretical and the empirical level. The empirical objective was to create a portrait of the electorates of various Russian political parties using data from the European social research conducted in 2016. By expressing the results of each method as combinations of categories and by formulating numerical criteria for the comparison, the study identified the most effective method for two types of analytical task: description and forecasting. According to the results, multiple correspondence analysis was the most effective in descriptive tasks, and log-linear analysis was the most effective in forecasting. The latter conclusion contradicts the currently predominant opinion regarding CHAID's efficiency in cases where a target variable is present in the data; it therefore has high practical significance for the further development of high-precision predictive models in sociological research.
The paper addresses an approach to working with missing data "as is", i.e., one in which missing data become one more category of the variable under study. This approach differs radically from the alternatives, which either delete the observations that contain missing values or replace the missing values with valid data. The only method known to us that makes it possible to work with missing values "as is" is CHAID. CHAID belongs to the decision tree class of methods; the method is in itself interesting and relevant for researchers dealing with categorical variables and nonlinear associations.
In the literature, we did not find an answer to the question of what advantages and limitations the "as is" approach implemented in CHAID has compared with the alternative approaches mentioned above. Despite this, tree models with missing data are often found in empirical studies. To open a discussion of this issue, we conducted several series of statistical experiments on generated data organized into three predictors of categorical and interval measurement type. It was empirically established that, on the whole, the method distributes missing values correctly across the tree's nodes, but in most cases including missing values in an analysis is accompanied by changes in the tree's structure, and therefore there is a risk of drawing erroneous conclusions. The paper also provides recommendations on which factors should be considered when deciding whether to include missing values in an analysis "as is".
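The three ways of handling missing values contrasted in this abstract can be sketched at the data preparation stage as follows. This is only an illustration and not the CHAID algorithm itself; the sample answers, the `"missing"` label, and the function names are hypothetical, and mode imputation stands in for "replacing with valid data" purely for brevity.

```python
from collections import Counter

# Hypothetical survey answers; None marks a missing value.
answers = ["yes", "no", None, "yes", None, "no", "yes", None]

MISSING = "missing"  # label chosen for illustration only


def as_is(values):
    """Keep missing values as one more category of the variable."""
    return [MISSING if v is None else v for v in values]


def listwise_delete(values):
    """Drop the observations that contain a missing value."""
    return [v for v in values if v is not None]


def impute_mode(values):
    """Replace missing values with the most frequent valid category."""
    mode = Counter(listwise_delete(values)).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


# Category frequencies under each approach.
print("as is:   ", dict(Counter(as_is(answers))))
print("deleted: ", dict(Counter(listwise_delete(answers))))
print("imputed: ", dict(Counter(impute_mode(answers))))
```

Only the first variant preserves both the sample size and the information that a value was absent, which is why a tree built on it can split on the missing category.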
The paper considers different approaches to factor analysis (FA) for ordinal data. In some studies it is necessary to find a latent variable behind observed indicators measured on an ordinal scale. Classical factor analysis cannot be applied to such indicators because it is built on the Pearson correlation coefficient, which is applicable only to interval variables. The researcher therefore faces a choice: to treat the ordinal variables as interval ones, to dichotomize them, or to use special techniques for ordinal indicators, such as replacing the correlation matrix or using categorical principal components analysis (CatPCA). The study is based on a theoretical comparison of the assumptions that underpin the algorithms of each approach and on a statistical experiment, and it answers the question of which of the above-mentioned factorization approaches is optimal for identifying latent variables measured by ordinal indicators on a 3-point, 5-point, or 10-point scale.
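The choice described above can be made concrete by computing the correlation that would feed a factor analysis under three of the options. The sketch below is an assumption-laden illustration: the two 5-point items are invented, the dichotomization threshold is arbitrary, and Spearman's rank correlation is used as one simple stand-in for "replacing the correlation matrix" (the abstract does not say which replacement the paper uses). CatPCA is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def dichotomize(values, threshold):
    """Recode an ordinal item to 0/1 at an (arbitrary) threshold."""
    return [1 if v >= threshold else 0 for v in values]


# Hypothetical answers of eight respondents to two 5-point items.
q1 = [1, 2, 2, 3, 4, 5, 5, 3]
q2 = [2, 1, 3, 4, 4, 3, 5, 2]

print("treated as interval:", round(pearson(q1, q2), 3))
print("dichotomized (>= 4):", round(pearson(dichotomize(q1, 4), dichotomize(q2, 4)), 3))
print("rank-based:         ", round(spearman(q1, q2), 3))
```

For two binary variables the Pearson coefficient coincides with the phi coefficient, so the second line shows what a factor analysis of dichotomized items would be built on.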