Web Scraping Basics

We will call the program used for scraping a parser, because in addition to fetching data, it parses that data and returns it in the specific format we need.

You will use a parser when you need to extract data from a limited number of websites that have more or less structured navigation. In contrast, when you don't have a list of specific websites and instead collect data by following links between web pages, you will use a web crawler. Crawlers have specific properties and requirements that will be discussed in the upcoming sections.

All parsers follow roughly the same structure. The whole process is divided into several stages:

  1. Session initialization
  2. Getting the results
  3. Iterating over pages of results (pagination)
  4. Iterating over the results list
  5. Getting and processing each result

An individual parser may lack one or more of the steps listed, or, conversely, have additional steps. For example, iterating over pages may be unnecessary if there are few results and all of them are returned on one page. Or you may need additional queries, or information gathered at the results-list stage, to obtain the full result data.

The sample code of the universal parser could be the following:

class Parser:
    def parse(self):
        session = self.init_session()
        results = self.get_results(session)
        for page in self.iter_pages(results):
            for item in self.iter_items(page):
                data = self.parse_item(item)
                yield data

We have omitted the other methods, which are not relevant at this point.

In the process of parsing, we will need a lot of information obtained at different steps. It is convenient to store it in an object associated with the current parsing process, so we will implement the parser as a class and keep the data of the current session in its instances. In other programming languages you can structure this as you like, but keep in mind that the parser will carry a lot of state.
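To make this concrete, here is a minimal, self-contained sketch of such a parser class. In-memory "pages" stand in for real HTTP responses; the class name SearchParser and the fake "name|rank" data are hypothetical, and a real implementation would issue HTTP requests instead:

```python
# A minimal sketch of the universal parser. In-memory "pages" stand in for
# real HTTP responses; the class name and the fake data are hypothetical.

class SearchParser:
    # Fake paginated results: two pages of raw "name|rank" items.
    PAGES = [
        ["alpha|1", "beta|2"],
        ["gamma|3"],
    ]

    def init_session(self):
        # A real parser would create e.g. a requests.Session() here and
        # perform a login; we just initialize per-run state on the instance.
        self.requests_made = 0
        return {}

    def get_results(self, session):
        self.requests_made += 1  # stands in for the initial search request
        return self.PAGES

    def iter_pages(self, results):
        for page in results:
            yield page

    def iter_items(self, page):
        return iter(page)

    def parse_item(self, item):
        # Turn a raw "name|rank" string into a structured record.
        name, _, rank = item.partition("|")
        return {"name": name, "rank": int(rank)}

    def parse(self):
        session = self.init_session()
        results = self.get_results(session)
        for page in self.iter_pages(results):
            for item in self.iter_items(page):
                yield self.parse_item(item)

items = list(SearchParser().parse())
print(items)
```

Note that the per-run state (the request counter, and in a real parser the HTTP session, cookies, and tokens) lives on the instance, which is exactly why a class is a natural fit here.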

Python remark: it is convenient to design the parser code as a generator. In this case you can interrupt the process at any convenient moment, obtaining exactly the number of results you need while keeping the number of requests to the data source to a minimum.
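A short sketch of why the generator design saves requests: consuming only the first N results stops pagination early, so later pages are never fetched. The class name and the in-memory page data here are hypothetical.

```python
# Consuming a generator lazily: only the pages actually needed are "fetched".
from itertools import islice

class CountingParser:
    def __init__(self):
        self.pages_fetched = 0

    def iter_pages(self):
        for page in [["a", "b"], ["c", "d"], ["e", "f"]]:
            self.pages_fetched += 1  # stands in for one HTTP request
            yield page

    def parse(self):
        for page in self.iter_pages():
            yield from page

parser = CountingParser()
first_three = list(islice(parser.parse(), 3))
print(first_three)           # ['a', 'b', 'c']
print(parser.pages_fetched)  # 2 -- the third page was never requested
```

Because parse() is a generator, islice() pulls results one at a time; as soon as three items have been yielded, iteration stops and the third page is never loaded.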

Next, we'll look at each step in more detail.
