django - lxml 不适用于 django，scraperwiki

Question

我正在开发一个 django 应用程序，该应用程序通过伊利诺伊州的 General Assembly 网站来抓取一些 pdf。虽然部署在我的桌面上，但它工作正常，直到 urllib2 超时。当我尝试在我的 Bluehost 服务器上部署时，代码的 lxml 部分会引发错误。任何帮助，将不胜感激。

import scraperwiki
from bs4 import BeautifulSoup
import urllib2
import lxml.etree
import re
from django.core.management.base import BaseCommand
from legi.models import Votes

class Command(BaseCommand):
    def handle(self, *args, **options):
        chmbrs =['http://www.ilga.gov/house/', 'http://www.ilga.gov/senate/']
        for chmbr in chmbrs:
            site = chmbr    
            url = urllib2.urlopen(site)
            content = url.read()
            soup = BeautifulSoup(content)
            links = []
            linkStats = []
            x=0
            y=0
            table = soup.find('table', cellpadding=3)
            for a in soup.findAll('a',href=True):
                if re.findall('Bills', a['href']):
                    l = (site + a['href']+'&Primary=True')
                    links.append(str(l))
                    x+=1
                    print x
            for link in links:
                url = urllib2.urlopen(link)
                content = url.read()
                soup = BeautifulSoup(content)
                table = soup.find('table', cellpadding=3)
                for a in table.findAll('a',href=True):
                    if re.findall('BillStatus', a['href']):
                        linkStats.append(str('http://ilga.gov'+a['href']))
            for linkStat in linkStats:
                url = urllib2.urlopen(linkStat)
                content = url.read()
                soup = BeautifulSoup(content)
                for a in soup.findAll('a',href=True):
                    if re.findall('votehistory', a['href']):
                        vl = 'http://ilga.gov/legislation/'+a['href']
                        url = urllib2.urlopen(vl)
                        content = url.read()
                        soup = BeautifulSoup(content)
                        for b in soup.findAll('a',href=True):
                            if re.findall('votehistory', b['href']):
                                llink = 'http://ilga.gov'+b['href']
                                try:
                                    u = urllib2.urlopen(llink)
                                    x = scraperwiki.pdftoxml(u.read())
                                    root = lxml.etree.fromstring(x)
                                    pages = list(root)
                                    chamber = str()
                                    for page in pages:
                                        print "working_1"
                                        for el in page:
                                            print "working_2"
                                            if el.tag == 'text':
                                                if int(el.attrib['top']) == 168:
                                                    chamber = el.text
                                                if re.findall("Senate Vote", chamber):
                                                    if int(el.attrib['top']) >= 203 and int(el.attrib['top']) < 231:
                                                        title = el.text
                                                        if (re.findall('House', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "HB"+title[0]
                                                        elif (re.findall('Senate', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "SB"+title[0]
                                                    if int(el.attrib['top']) >350 and int(el.attrib['top']) <650:
                                                        r = el.text
                                                        names = re.findall(r'[A-z-\u00F1]{3,}',r)
                                                        vs = re.findall(r'[A-Z]{1,2}\s',r)
                                                        for name in names:
                                                            legi = name
                                                            for vote in vs:
                                                                v = vote
                                                            if Votes.objects.filter(legislation=title).exists() == False:
                                                                c = Votes(legislation=title, legislator=legi, vote=v)
                                                                c.save()    
                                                                print 'saved'
                                                            else:
                                                                print 'not saved'                                                       
                                                elif int(el.attrib['top']) == 189:
                                                    chamber = el.text
                                                if re.findall("HOUSE ROLL CALL", chamber):
                                                    if int(el.attrib['top']) > 200 and int(el.attrib['top']) <215:
                                                        title = el.text
                                                        if (re.findall('HOUSE', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "HB"+title[0]
                                                        elif (re.findall('SENATE', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "SB"+title[0]
                                                    if int(el.attrib['top']) >385 and int(el.attrib['top']) <1000:
                                                        r = el.text
                                                        names = re.findall(r'[A-z-\u00F1]{3,}',r)
                                                        votes = re.findall(r'[A-Z]{1,2}\s',r)
                                                        for name in names:
                                                            legi = name
                                                            for vote in votes:
                                                                v = vote
                                                            if Votes.objects.filter(legislation=title).exists() == False:
                                                                c = Votes(legislation=title, legislator=legi, vote=v)
                                                                c.save()
                                                                print 'saved'
                                                            else:
                                                                print 'not saved'

                                except:
                                    pass

编辑 1 这是错误跟踪

    Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "/home7/maythirt/GAB/legi/management/commands/vote.py", line 51, in handle
    root = lxml.etree.fromstring(x)
  File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121)
  File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102470)
  File "parser.pxi", line 1674, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:101299)
  File "parser.pxi", line 1074, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:96481)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
  File "parser.pxi", line 633, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91939)
lxml.etree.XMLSyntaxError: None

score 0 · Accepted Answer

他们这样做的方法是添加一个 try: except: 子句，当您收到错误时，您只需保存 xml 文件以及指向硬盘驱动器的链接。这样您就可以单独检查 xml 文件。

可能是因为某种原因，scraperwiki.pdftoxml 制作了一个非法的 xml 文件。我在使用另一个 pdftoxml 工具时遇到过这种情况。

并且请将您的代码重构为更多功能，它将变得更容易阅读和维护:)。

另一种方法当然是先下载所有的 pdf，然后再解析它们。这样您就可以避免因某种原因失败时多次访问该网站。

score 0 · Accepted Answer

正如乔纳森所说，这可能是scraperwiki.pdftoxml()导致问题的输出。您可以显示或记录的值x来确认它。

具体来说，pdftoxml()运行一个外部程序pdftohtml并使用临时文件来存储 PDF 和 XML。

我还要检查的是：

您的服务器上是否pdftohtml正确设置？
如果是这样，如果您使用代码失败的 PDF 在服务器上的 shell 中直接运行它，那么转换为 XML 是否有效？它正在执行的命令是pdftohtml -xml -nodrm -zoom 1.5 -enc UTF-8 -noframes "input.pdf" "output.xml"

如果直接运行命令时出现问题，那就是你的问题所在。通过pdftohtml在代码中运行的方式scraperwiki，没有简单的方法可以判断命令是否失败。

django - lxml 不适用于 django，scraperwiki

2 回答 2

Related

Reference