0

我真的需要帮助来完成这项任务,因为它与我的研究有关,而且我是 python 和 scrapy 的新手。

*任务是选择所有输入字段(类型=文本或密码或文件)并将其(id)存储在后端数据库中,除了此输入所属的页面链接*

我选择输入字段的代码

def parse_item(self, response):
    self.log('%s' % response.url)

    hxs = HtmlXPathSelector(response)
    item=IsaItem()
    item['response_fld']=response.url

    item['text_input']=hxs.select("//input[(@id or @name) and (@type = 'text' )]/@id ").extract()
    item['pass_input']=hxs.select("//input[(@id or @name) and (@type = 'password')]/@id").extract()
    item['file_input']=hxs.select("//input[(@id or @name) and (@type = 'file')]/@id").extract()

    return item

数据库管道代码:

class SQLiteStorePipeline(object):


def __init__(self):
    self.conn = sqlite3.connect('./project.db')
    self.cur = self.conn.cursor()


def process_item(self, item, spider):
    self.cur.execute("insert into inputs ( input_name) values(?)" , (item['text_input'][0] ), )
    self.cur.execute("insert into inputs ( input_name) values(?)" , (item['pass_input'][0]  ,))
    self.cur.execute("insert into inputs ( input_name) values(?)" ,(item['file_input'][0] ,  ))

    self.cur.execute("insert into links (link) values(?)", (item['response_fld'][0], ))

    self.conn.commit()
    return item

但我仍然收到这样的错误

self.cur.execute("insert into inputs ( input_name) values(?)" , (item['text_input'][0] ), )
exceptions.IndexError: list index out of range

或数据库仅存储第一个字母!

Database links table 
 ╔════════════════╗
 ║      links     ║ 
 ╠════════════════╣
 ║  id  │input    ║ 
 ╟──────┼─────────╢
 ║    1 │     t   ║ 
 ╟──────┼─────────╢
 ║    2 │     t   ║ 
 ╚══════╧═════════╝
Note it should "tbPassword" or "tbUsername"

输出来自 json 文件

{"pass_input": ["tbPassword"], "file_input": [], "response_fld":     "http://testaspnet.vulnweb.com/Signup.aspx", "text_input": ["tbUsername"]}
{"pass_input": [], "file_input": [], "response_fld": "http://testaspnet.vulnweb.com/default.aspx", "text_input": []}
{"pass_input": ["tbPassword"], "file_input": [], "response_fld": "http://testaspnet.vulnweb.com/login.aspx", "text_input": ["tbUsername"]}
{"pass_input": [], "file_input": [], "response_fld": "http://testaspnet.vulnweb.com/Comments.aspx?id=0", "text_input": []}
4

2 回答 2

0

我对这项技术一无所知,但这是我的猜测:

尝试 insert into inputs ( input_name) values(?)" , (item['text_input'] ) 代替 insert into inputs ( input_name) values(?)" , (item['text_input'][0] ).

至于“列表索引超出范围”错误,您的项目似乎是空的,您应该检查一下。

于 2012-07-09T20:18:01.867 回答
0

你得到IndexError是因为你试图得到列表中的第一个项目,有时它是空的。

我会这样做。

蜘蛛:

def parse_item(self, response):
    self.log('%s' % response.url)

    hxs = HtmlXPathSelector(response)
    item = IsaItem()
    item['response_fld'] = response.url

    res = hxs.select("//input[(@id or @name) and (@type = 'text' )]/@id ").extract()
    item['text_input'] = res[0] if res else None # None is default value in case no field found

    res = hxs.select("//input[(@id or @name) and (@type = 'password')]/@id").extract()
    item['pass_input'] = res[0] if res else None # None is default value in case no field found

    res = hxs.select("//input[(@id or @name) and (@type = 'file')]/@id").extract()
    item['file_input'] = res[0] if res else None # None is default value in case no field found

    return item

管道:

class SQLiteStorePipeline(object):

    def __init__(self):
        self.conn = sqlite3.connect('./project.db')
        self.cur = self.conn.cursor()


    def process_item(self, item, spider):
        self.cur.execute("insert into inputs ( input_name) values(?)", (item['text_input'],))
        self.cur.execute("insert into inputs ( input_name) values(?)", (item['pass_input'],))
        self.cur.execute("insert into inputs ( input_name) values(?)", (item['file_input'],))

        self.cur.execute("insert into links (link) values(?)", (item['response_fld'],))

        self.conn.commit()
        return item
于 2012-07-10T04:08:05.260 回答