I am working on my project XYZ.
I am stuck on extracting text from the page source:
<a href="/gifts" class="title" data-tracking-id="mdd-heading">gifts</a>
I want to extract the href as content.
I tried this:
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from XYZ.items import XYZ

    class MySpider(BaseSpider):
        name = "main"
        allowed_domains = ["XYZ"]
        start_urls = ["XYZ"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            titles = hxs.select("//a[@data-tracking-id='mdd-heading']")
            items = []
            for titles in titles:
                item = XYZ()
                item["title"] = titles.select("text()").extract()
                item["link"] = titles.select("@href").extract()
                items.append(item)
                print "www.xyz.com"+str(item["link"])
            return items
The output is:
www.xyz.com[u'/gifts']
I expected the output to be:
www.xyz.com/gifts
What am I doing wrong?
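The mismatch seems to be reproducible without Scrapy. The sketch below assumes that `extract()` returns a list of matched strings rather than a single string, which is what the bracketed output suggests; the variable names `link`, `as_printed`, and `expected` are made up for illustration and are not part of the spider.

```python
# Stand-in for the value stored in item["link"]: assumed to be a list of
# matched strings, based on the bracketed output shown above.
link = ["/gifts"]

# str() on a list produces the list's repr, so the brackets and quotes
# end up in the concatenated string.
as_printed = "www.xyz.com" + str(link)

# Taking the first element (when the list is not empty) yields a plain string.
first = link[0] if link else ""
expected = "www.xyz.com" + first

print(as_printed)
print(expected)
```

If that assumption holds, indexing the extracted list (for example `item["link"][0]`) before concatenating should give the plain URL, though an empty match would need to be handled separately.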