Nesting item data in Scrapy

Question

I'm fairly new to Python and Scrapy, and I'm having trouble wrapping my head around how to create nested JSON with the help of Scrapy.

Selecting the elements I want from the HTML has not been a problem, with the help of XPath Helper and some Googling. However, I am not quite sure how I'm supposed to get the JSON structure that I want.

The JSON structure I desire would look like:

{"menu": {
    "Monday": {
        "alt1": "Item 1",
        "alt2": "Item 2",
        "alt3": "Item 3"
    },
    "Tuesday": {
        "alt1": "Item 1",
        "alt2": "Item 2",
        "alt3": "Item 3"
    }
}}

The HTML looks like:

<ul>
    <li class="title"><h2>Monday</h2></li>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ul>
<ul>
    <li class="title"><h2>Tuesday</h2></li>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ul>

I did find https://stackoverflow.com/a/25096896/6856987, but I was not able to adapt it to my needs. I would greatly appreciate a nudge in the right direction on how I would accomplish this.

Edit: With the nudge provided by Padraic, I managed to get one step closer to what I want to accomplish. I've come up with the following, which is a slight improvement over my previous situation, but the JSON is still not quite where I want it.

Scrapy spider:

import scrapy
from dmoz.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    start_urls = ['http://urlto.com']

    def parse(self, response):
        uls = response.xpath('//ul[position() >= 1 and position() < 6]')
        item = DmozItem()
        item['menu'] = {}
        for ul in uls:
            item['menu']['dayOfWeek'] = ul.xpath("li/h2/text()").extract()
            item['menu']['menuItem'] = ul.xpath("li/text()").extract()
            yield item
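
For reference, the DmozItem in dmoz/items.py only needs a single menu field for this, roughly:

import scrapy

class DmozItem(scrapy.Item):
    # single field that holds the nested menu dict
    menu = scrapy.Field()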

Resulting JSON:

[  
    {  
        "menu":{  
            "dayOfWeek":[  
                "Monday"
            ],
            "menuItem":[  
                "Item 1",
                "Item 2",
                "Item 3"
            ]
        }
    },
    {  
        "menu":{  
            "dayOfWeek":[  
                "Tuesday"
            ],
            "menuItem":[  
                "Item 1",
                "Item 2",
                "Item 3"
            ]
        }
    }
]

It sure feels like I'm doing a thousand and one things wrong with this; hopefully someone more clever than me can point me in the right direction.

Answer

You just need to find all the uls and then extract their lis to group them; an example using lxml is below:

from lxml import html

h = """<ul>
    <li class="title"><h2>Monday</h2></li>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ul>
<ul>
    <li class="title"><h2>Tuesday</h2></li>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ul>"""

tree = html.fromstring(h)

uls = tree.xpath("//ul")

data = {}
# iterate over all uls
for ul in uls:
    # extract the ul's li's
    lis = ul.xpath("li")
    # use the h2 text as the key and all the text from the remaining as values
    # with enumerate to add the alt logic
    data[lis[0].xpath("h2")[0].text] = {
        "alt{}".format(i): node.text for i, node in enumerate(lis[1:], 1)
    }

print(data)

Which would give you:

{'Monday': {'alt1': 'Item 1', 'alt2': 'Item 2', 'alt3': 'Item 3'},
 'Tuesday': {'alt1': 'Item 1', 'alt2': 'Item 2', 'alt3': 'Item 3'}}
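
If you then want the exact JSON from your question, it is just a matter of wrapping that dict under a top-level "menu" key and serializing it, e.g.:

import json

# wrap the per-day dict under a single "menu" key, as in the desired output
print(json.dumps({"menu": data}, indent=4, sort_keys=True))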

If you wanted to put it into a single comprehension:

data = {
    lis[0].xpath("h2")[0].text: {
        "alt{}".format(i): node.text for i, node in enumerate(lis[1:], 1)
    }
    for lis in (ul.xpath("li") for ul in tree.xpath("//ul"))
}

Working with the edited code in your question and following the same required output:

def parse(self, response):
    uls = response.xpath('//ul[position() >= 1 and position() < 6]')
    item = DmozItem()
    # just create an empty dict
    item['menu'] = {}
    for ul in uls:
        # for each ul, add a key/value pair {day: {alt_i: li text}}, skipping the first li
        item['menu'][ul.xpath("li/h2/text()").extract_first()] = {
            "alt{}".format(i): text
            for i, text in enumerate(ul.xpath("li[position() > 1]/text()").extract(), 1)
        }
    # yield a single item outside the loop
    yield item
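
If you want to sanity-check that XPath logic outside of a full crawl, you can run it against your sample HTML with a plain Selector (the variable names here are just for the demo):

from scrapy.selector import Selector

sample = """<ul>
    <li class="title"><h2>Monday</h2></li>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ul>
<ul>
    <li class="title"><h2>Tuesday</h2></li>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ul>"""

sel = Selector(text=sample)
menu = {}
for ul in sel.xpath("//ul"):
    # the h2 text is the day name, the remaining li texts become alt1, alt2, ...
    day = ul.xpath("li/h2/text()").extract_first()
    items = ul.xpath("li[position() > 1]/text()").extract()
    menu[day] = {"alt{}".format(i): text for i, text in enumerate(items, 1)}

print(menu)

which prints the same nested dict as the lxml version above.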

That will give you data in one dict like:

In [15]: d = {"menu":{'Monday': {'alt1': 'Item 1', 'alt2': 'Item 2', 'alt3': 'Item 3'},
                  'Tuesday': {'alt1': 'Item 1', 'alt2': 'Item 2', 'alt3': 'Item 3'}}}

In [16]: d["menu"]["Tuesday"]
Out[16]: {'alt1': 'Item 1', 'alt2': 'Item 2', 'alt3': 'Item 3'}

In [17]: d["menu"]["Monday"]
Out[17]: {'alt1': 'Item 1', 'alt2': 'Item 2', 'alt3': 'Item 3'}

In [18]: d["menu"]["Monday"]["alt1"]
Out[18]: 'Item 1'

That matches the expected output from your original question more than your new one, but I see no advantage to what you are doing in the new logic of adding "dayOfWeek" etc.

cc by-sa 3.0