Need Proxy?

BotProxy: Rotating Proxies Made for professionals. Really fast connection. Built-in IP rotation. Fresh IPs every day.

Find out more


How to remove '\n' from scrapy output in python

Question

I am trying to output to CSV but I realized that when scraping tripadvisor I am getting many carriage returns thus the array goes over 30 while there are only 10 reviews so I get many fields missing. Is there a way to remove the carriage returns.

spider.

from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapingtest.items import ScrapingTestingItem
from collections import OrderedDict
import json
from scrapy.selector.lxmlsel import HtmlXPathSelector
import csv
import html2text
import unicodedata


class scrapingtestspider(Spider):
    name = "scrapytesting"
    allowed_domains = ["tripadvisor.in"]
    base_uri = ["tripadvisor.in"]
    start_urls = [
        "http://www.tripadvisor.in/Hotel_Review-g297679-d736080-Reviews-Ooty_Elk_Hill_A_Sterling_Holidays_Resort-Ooty_Tamil_Nadu.html"]



    def parse(self, response):
        item = ScrapingTestingItem()
        sel = HtmlXPathSelector(response)
        converter = html2text.HTML2Text()
        sites = sel.xpath('//a[contains(text(), "Next")]/@href').extract()
##        dummy_test = [ "" for k in range(10)]

        item['reviews'] = sel.xpath('//div[@class="col2of2"]//p[@class="partial_entry"]/text()').extract()
        item['subjects'] = sel.xpath('//span[@class="noQuotes"]/text()').extract()
        item['stars'] = sel.xpath('//*[@class="rating reviewItemInline"]//img/@alt').extract()
        item['names'] = sel.xpath('//*[@class="username mo"]/span/text()').extract()
        item['location'] = sel.xpath('//*[@class="location"]/text()').extract()
        item['date'] = sel.xpath('//*[@class="ratingDate relativeDate"]/@title').extract()
        item['date'] += sel.xpath('//div[@class="col2of2"]//span[@class="ratingDate"]/text()').extract()


        startingrange = len(sel.xpath('//*[@class="ratingDate relativeDate"]/@title').extract())

        for j in range(startingrange,len(item['date'])):
            item['date'][j] = item['date'][j][9:].strip()

        for i in range(len(item['stars'])):
            item['stars'][i] = item['stars'][i][:1].strip()

        for o in range(len(item['reviews'])):
            print unicodedata.normalize('NFKD', unicode(item['reviews'][o])).encode('ascii', 'ignore')

        for y in range(len(item['subjects'])):
            item['subjects'][y] = unicodedata.normalize('NFKD', unicode(item['subjects'][y])).encode('ascii', 'ignore')

        yield item

#        print item['reviews']

        if(sites and len(sites) > 0):
            for site in sites:
                yield Request(url="http://tripadvisor.in" + site, callback=self.parse)

Is there possible a regex that I could use to go through the for loop and replace it. I tried replace but that did not do a thing. And also why does scrapy do that.

Answer

What I usually do to trim and clean up the output is using Input and/or Output Processors with Item Loaders - it makes things more modular and clean:

class ScrapingTestingLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = TakeFirst()

Then, if you would use this Item Loader for loading your items, you'll get the extracted values stripped and as strings (instead of lists). For instance, if the extracted field is ["my value \n"] - you'll get my value as an output.

cc by-sa 3.0