python - Creating a CSV file with Python and Beautiful Soup

Question

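Is it possible to use Python to save part of a file's name as a field in a CSV file? I have a series of HTML files named, say, "000000001 8375739.html" and "000000021 5748922937574.html". I'd like to drop the first 10 characters (the first number is always 9 digits, followed by a space) and save the rest of the file name (minus the .html) in a field called ID in a CSV file, with the other fields in that row populated from the contents of the HTML file. In fact, what I'm trying to do is use Beautiful Soup to extract the text from each HTML file, save the first line in a field named "title", save the rest of the text in a field named "body", and save the second part of the file name in a field named "ID". The HTML-to-text part works well, but I can't seem to get my head around the rest.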

Here is the code that strips the HTML and writes to a (single) text file. I assume I need to use glob again, or do I need to use igloo?

import os
import glob
import codecs
import csv
from bs4 import BeautifulSoup
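dics = [{}]

path = "c:\\users\\zac\\downloads\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    # Read each HTML file as UTF-8 and parse it with an explicit parser
    with codecs.open(infile, "r", "utf-8") as html_file:
        soup = BeautifulSoup(html_file.read(), "html.parser")
    # Append the extracted text to a single output file
    with codecs.open("extracted.txt", "a", "utf-8") as myfile:
        myfile.write(soup.get_text())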

Here is a sample of the HTML. The files are not identical, but they basically follow the same format:

<table>
      <tbody>
        <tr>
          <td></td>
          <td></td>

          <td>
            <p><a>Some Sample Text</a> </p>
            <p><a>A slightly larger body of text. Thus far, we see that the current python script is placing this directly under the previous text.</a> </p>
            <h3><a>And a final bit of text, this has so far been placed below the previous text, making three lines of text (or more, depending on how long the middle block is).</a></h3>

          <td ></td>

          <td></td>
        </tr>

        <tr>
          <td></td>
          <td></td>
        </tr>
        <tr>
          <td></td>
          <td></td>
        </tr>
      </tbody>
    </table>

score 0 · Accepted Answer

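fname = "000000001 8375739.html"
# Split on the single space to discard the 9-digit prefix
trash, name = fname.split(" ")
# Split on the dot to discard the "html" extension
data, trash = name.split(".")

print(data)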

--output:--
8375739
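
The two splits above assume exactly one space and exactly one dot in the file name. Since the question says the prefix is always 9 digits plus a space, a variant that slices off the first 10 characters and uses os.path.splitext for the extension might be a little more forgiving. This is only a sketch; the helper name file_id is made up for illustration.

```python
import os


def file_id(path):
    """Return the file name without its 10-character prefix and extension.

    file_id is a hypothetical helper name, not something from the question.
    """
    base = os.path.basename(path)        # e.g. "000000001 8375739.html"
    stem, _ext = os.path.splitext(base)  # e.g. "000000001 8375739"
    return stem[10:]                     # drop the 9 digits and the space


print(file_id("000000021 5748922937574.html"))
```

Inside the glob loop from the question, this could be called as file_id(infile) to get the value for the ID field.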

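Why do you think that not posting a sample of your html files is the right way to ask a question? In your view, are all html files the same, so that BS can just read a file and separate out the data you need?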
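
For the title/body/ID part of the question, here is a rough sketch using only the standard csv module. It assumes that "first line" means the first non-empty line of the text returned by soup.get_text(), which may not hold for every file. The function names split_title_body and write_rows, the sample string, and the output file name mentioned in the comments are all placeholders, not anything from the question.

```python
import csv
import io


def split_title_body(text):
    """Treat the first non-empty line as the title and the rest as the body."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return "", ""
    return lines[0], "\n".join(lines[1:])


def write_rows(rows, out_file):
    """Write dicts with ID/title/body keys to an open text-mode file as CSV."""
    writer = csv.DictWriter(out_file, fieldnames=["ID", "title", "body"])
    writer.writeheader()
    writer.writerows(rows)


# Stand-in for soup.get_text() on one file (assumed shape, not real output).
sample_text = "\n\nSome Sample Text \nA slightly larger body of text.\nAnd a final bit of text.\n"
title, body = split_title_body(sample_text)

# One dict per HTML file; the ID would come from the file name.
rows = [{"ID": "8375739", "title": title, "body": body}]

# An in-memory buffer keeps the sketch self-contained. For a real file,
# something like open("extracted.csv", "w", newline="", encoding="utf-8")
# could be passed instead (the file name is hypothetical).
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

In the question's loop, one dict per file would be appended to a list and write_rows called once at the end, so the header is written only once. The csv module quotes the body field automatically when it contains line breaks or commas.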
