python - Getting information with Scrapy (Python)

Question

When I try to capture the following information:

<td>But<200g/M2</td>


name = fila.select('.//td[2]/text()').extract()

I capture only the following:

"But"

There is clearly a conflict with the characters "< /".

score 0 · Accepted Answer

Here is an approach using BeautifulSoup, in case you have better luck with a different library:

from bs4 import BeautifulSoup

soup = BeautifulSoup("""<html><head><title>StackOverflow-Question</title></head><body>
 <table>
  <tr>
   <td>Ifs</td>
   <td>Ands</td>
   <td>But<200g/M2</td>
  </tr>
 </table>
</body></html>""", "html.parser")

# find_all() returns a Python list, so index 2 is the third <td>
print(soup.find_all('td')[2].get_text())

The output of this is:

But<200g/M2
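For comparison, here is a minimal sketch of the same idea using only the standard library's `html.parser`, which treats a `<` that is not followed by a letter as plain text. The `CellCollector` class and its attribute names are made up for this illustration; they are not part of Scrapy or BeautifulSoup.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <td> it sees."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current cell
        if self._in_td:
            self.cells[-1] += data


collector = CellCollector()
collector.feed("<table><tr><td>Ifs</td><td>Ands</td><td>But<200g/M2</td></tr></table>")
collector.close()

print(collector.cells[2])
```

This only shows that a lenient HTML parser can keep the stray `<` as text; whether it fits into a Scrapy pipeline depends on how the rows are extracted there.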

If you want to use XPath, you can also use the ElementTree XML API. Here I use BeautifulSoup to take the HTML and convert it to valid XML so that I can run XPath queries against it:

from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET

html = """<html><head><title>StackOverflow-Question</title></head><body>
 <table>
  <tr>
   <td>Ifs / Ands / Or</td>
   <td>But<200g/M2</td>
  </tr>
 </table>
</body></html>"""

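soup = BeautifulSoup(html, "html.parser")

# prettify() re-serializes the tree with "<" escaped as &lt;, giving well-formed XML
root = ET.fromstring(soup.prettify())

# prettify() pads text nodes with newlines and indentation, so strip them
print(root.findall('.//td[2]')[0].text.strip())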

The output is the same (note that the HTML is slightly different; this is because XPath positions start at 1, whereas Python list indexes start at 0).
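The indexing difference can be seen side by side in a small sketch. It assumes the row is already well-formed XML (the `<` is written as `&lt;`), and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Well-formed XML: the "<" inside the last cell is already escaped as &lt;
row = ET.fromstring("<tr><td>Ifs</td><td>Ands</td><td>But&lt;200g/M2</td></tr>")

# Python list indexing counts from 0, so the third cell is index 2
cells = row.findall("td")
third_by_list = cells[2].text

# An XPath position predicate counts from 1, so the third cell is td[3]
third_by_xpath = row.find("td[3]").text

print(third_by_list)
print(third_by_xpath)
```

Both lookups select the same cell, which is why `td[2]` in an XPath expression and `[2]` on a Python list point at different elements.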

score 0 · Accepted Answer

Escape the special characters with '\', so:

But\<200g\/M2

Note that creating files with these characters is not easy.
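If you control the markup being generated, a different kind of escaping is available: HTML and XML themselves use character entities such as `&lt;` rather than backslashes. The sketch below swaps in that technique and is not what this answer proposes; it assumes the cell text is escaped before the markup is written, and the variable names are illustrative.

```python
from html import escape, unescape
import xml.etree.ElementTree as ET

raw_cell = "But<200g/M2"

# escape() replaces "<", ">" and "&" (and quotes) with character entities
escaped_cell = escape(raw_cell)
markup = "<td>" + escaped_cell + "</td>"
print(markup)

# A strict XML parser now accepts the cell and returns the original text
parsed_text = ET.fromstring(markup).text
print(parsed_text)

# unescape() reverses the entity encoding
round_trip = unescape(escaped_cell)
```

The "/" character needs no escaping in element text; only the "<" is ambiguous to a parser here.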
