Introduction to parsing: experience of working with Python

Despite the approaching May holidays, the RSG seminar held on April 30 still drew an audience. Anastasia Rodygina, an invited third-year student of the Sociology Department of the Faculty of Social Science, spoke about her experience of parsing websites in Python.

Anastasia has long been interested in data analysis and in the possibilities of using Python in social science. Her experience should help RSG members understand web scraping well enough to use it in their own research. Anastasia introduced three ways to collect information from the Internet: Scrapy, Beautiful Soup and APIs. Scrapy is a framework for building a "web spider" that sends GET requests and extracts data from the HTML it receives. Anastasia explained the principles of working with Scrapy using her own experience of parsing the site of the magazine "Knife": she had to collect links and titles from the first page of the How to section. The next tool, Beautiful Soup, is a library for parsing HTML / XML files; its advantage is the ability to turn even malformed page markup into an HTML tree. Anastasia demonstrated how it works by collecting data from the site of the magazine Demoscope Weekly. Finally, an API (application programming interface) is an interface created by site developers to make information more accessible to users. For example, the API of "VKontakte" allows you to retrieve private messages, comments in groups, friends lists, etc.
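The task described above, collecting links and titles from a page, can be sketched in a few lines. This is not the code shown at the seminar: to stay runnable without installing anything, it uses the standard-library `html.parser` as a stand-in for Scrapy or Beautiful Soup, and the markup in `SAMPLE_HTML`, the paths in it and the names `LinkCollector` and `collect_links` are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a section page; not any real site's HTML.
SAMPLE_HTML = """
<html><body>
  <div class="articles">
    <a href="/how-to/first">First article</a>
    <a href="/how-to/second">Second <b>article</b></a>
    <a>No link here</a>
  </div>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, title) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, title) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Keep text only while we are inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the fragments and collapse runs of whitespace.
            title = " ".join("".join(self._text).split())
            self.links.append((self._href, title))
            self._href = None


def collect_links(html):
    """Return a list of (href, title) pairs found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


links = collect_links(SAMPLE_HTML)
for href, title in links:
    print(href, "|", title)
```

In a real project the HTML string would come from a GET request rather than a constant, and with Beautiful Soup the collection step would typically be written with `soup.find_all("a")` instead of a hand-written parser class.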

Not all participants managed to get to grips with a new programming language right away. This prompted the head of the RSG, Alexey Rotmistrov, to consider scheduling an additional closed practice session devoted to improving parsing skills.

All materials from the session are attached below, including the presentation and code files, as well as a guide to installing Anaconda for Python.


Seminar materials Python (in Russian)