Scrapy pull data from table rows

Question

I'm trying to pull data from this page using Scrapy: https://www.interpol.int/notice/search/woa/1192802

The spider will crawl multiple pages, but I have left out the pagination code here to keep things simple. The problem is that the number of table rows I want to scrape varies from page to page.

So I need a way of scraping all the table data from the page no matter how many table rows it has.

First, I extracted all the table rows on the page. Then, I created a blank dictionary. Next, I tried to loop through each row and put its cell data into the dictionary.

But it does not work: the output file comes back blank.

Any idea what's wrong?

# -*- coding: utf-8 -*-
import scrapy


class Test1Spider(scrapy.Spider):
    name = 'test1'
    allowed_domains = ['interpol.int']
    start_urls = ['https://www.interpol.int/notice/search/woa/1192802']

    def parse(self, response):
        table_rows = response.xpath('//*[contains(@class,"col_gauche2_result_datasheet")]//tr').extract()
        data = {}
        for table_row in table_rows:
            data.update({response.xpath('//td[contains(@class, "col1")]/text()').extract(): response.css('//td[contains(@class, "col2")]/text()').extract()})
        yield data

Answer

What is this?

response.css('//td[contains(@class, "col2")]/text()').extract()

You are calling the css() method but passing it an XPath expression.

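There are two more problems: calling .extract() on the rows turns them into plain strings, and the queries inside the loop run against response (the whole page) instead of the current row. Anyway, here is working code; I have tested it.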

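# Keep the rows as selectors (no .extract()) so each row can be queried on its own.
table_rows = response.xpath('//*[contains(@class,"col_gauche2_result_datasheet")]//tr')
data = {}
for table_row in table_rows:
    # Relative XPaths: look only at the cells inside the current row.
    label = table_row.xpath('td[@class="col1"]/text()').extract_first().strip()
    value = table_row.xpath('td[@class="col2 strong"]/text()').extract_first().strip()
    data[label] = value
yield data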
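
One caveat with the loop above: extract_first() returns None when a row has no matching cell (a header or spacer row, for example), and calling .strip() on None raises AttributeError. Below is a minimal sketch of a guard, assuming the label and value have already been pulled out of each row. The helper name rows_to_dict and the sample values are hypothetical and do not come from the page.

```python
def rows_to_dict(pairs):
    """Build a dict from (label, value) pairs, tolerating missing cells.

    Either element may be None, which is what extract_first() returns
    when the XPath matches nothing in that row.
    """
    data = {}
    for label, value in pairs:
        if label is None:
            # No label cell in this row (e.g. a header row): skip it.
            continue
        # A missing value becomes an empty string instead of crashing.
        data[label.strip()] = (value or '').strip()
    return data


# Made-up sample values standing in for the extracted cell text.
sample = [
    ('\r\n\tLabel A:\t', ' first value '),
    (None, None),
    ('Label B:', None),
]
print(rows_to_dict(sample))
```

In the spider, each pair would come from the two table_row.xpath(...).extract_first() calls, without the trailing .strip().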

EDIT:

To remove characters such as \t, \n and \r, use a regular expression:

import re
your_string = re.sub(r'[\t\n\r]', '', your_string)
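
If deleting those characters glues words together, a variation is to collapse each run of whitespace into a single space instead. This is a sketch of a different technique from the substitution above, and the helper name clean_text is hypothetical.

```python
import re

# Matches any run of whitespace, including \t, \n and \r.
_WHITESPACE = re.compile(r'\s+')


def clean_text(value):
    """Collapse whitespace runs to one space and trim both ends."""
    return _WHITESPACE.sub(' ', value).strip()


print(clean_text('\r\n\t  Some\tlabel:\t\n'))
```

It can be applied to each extracted label and value before they are stored in the dictionary.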

cc by-sa 3.0