python - Extracting a CAPTCHA image

Question

I am building a training set for a neural network from digits extracted from CAPTCHA images. I am using Python 2.7.3, the lxml library, and XPath selectors.

To get the right CAPTCHA image, I need to extract the img src, which is loaded into the page dynamically and is different every time. My Python code is:

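import urllib
from lxml import etree, html

# Base URL that the relative captcha path gets appended to
adres_prefix = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/"
# Compiled XPath: the src attribute of the captcha <img>, returned as a string
adres_sufix = etree.XPath('string(//img[@class="captcha"]/@src)')

# Download the search page (Python 2 urllib API)
sock = urllib.urlopen("https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/Search.aspx")
htmlSource = sock.read()
sock.close()

# Parse the HTML and evaluate the XPath against the document root
root = etree.HTML(htmlSource)
result = etree.tostring(root, pretty_print=True, method="html")
result2 = adres_sufix(root)
www = adres_prefix + result2
print www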

So every time, the www I get looks like this:

https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/captcha.ashx?id=1b7d2b6d-70a6-4ce3-bedd-fe89038fb7f3&empty=1

Something is wrong, because when I paste this link into my browser I get nothing.

Source page with the CAPTCHA

I don't know what is going wrong. Why does the XPath selector return '&empty=1'? Any ideas?

score 0 · Accepted Answer

The original HTML source really does contain "empty=1", so your code is correct. To get the image, just cut off the "&empty=1" part:

https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/captcha.ashx?id=1b7d2b6d-70a6-4ce3-bedd-fe89038fb7f3
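
If you would rather not slice the string by hand, one possible way to drop the parameter is to parse the query string and rebuild the URL without it. This is only a sketch using the standard library: the helper name drop_query_param is made up for illustration, and the import fallback is there so the same code should run on both Python 2.7 and Python 3.

```python
try:
    # Python 3
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
except ImportError:
    # Python 2.7
    from urlparse import urlsplit, urlunsplit, parse_qsl
    from urllib import urlencode


def drop_query_param(url, name):
    """Return url with every occurrence of the query parameter `name` removed.

    Hypothetical helper, not part of lxml or urllib.
    """
    parts = urlsplit(url)
    # Keep all (key, value) pairs except the one we want to drop
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))


www = ("https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/"
       "captcha.ashx?id=1b7d2b6d-70a6-4ce3-bedd-fe89038fb7f3&empty=1")
print(drop_query_param(www, "empty"))
```

In the question's script you would call drop_query_param(www, "empty") after building www. Note that urlencode re-encodes the remaining values, which is harmless for an id made of hex digits and hyphens but may change the escaping of more unusual values.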
