python - If statement based on a value existing in a jsonlines file

Question

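My code pulls 400+ PDFs from a website with Beautiful Soup. PyPDF2 converts the PDFs to text, which is then saved to a jsonlines file named "output.jsonl".

When I save new PDFs in future updates, I want PyPDF2 to convert only the new PDFs to text and append that new text to the jsonlines file. This is where I am struggling.

The jsonlines file looks like this: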

{"id": "1234", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1235", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}...

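The PDFs are named "1234", "1235", and so on, and are saved in file_path_PDFs. I am trying to check whether each "id" already exists as a value in the jsonlines file. If it does, PyPDF2 does not need to convert that PDF to text. If it does not, the PDF is processed as usual.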

file_path_PDFs = 'C:/Users/.../PDFs/'
json_list = []

for filename in os.listdir(file_path_PDFs):   
    if os.path.exists('C:/Users/.../PDFs/output.jsonl'):
        with jsonlines.open('C:/Users/.../PDFs/output.jsonl') as reader:
            mytext = jsonlines.Reader.iter(reader)
            for obj in mytext:
                if filename[:-4] in mytext: #filename[:-4] removes .pdf from string
                    continue
                else:
                    ~convert to text~

with jsonlines.open('C:/Users/.../PDFs/output.jsonl', 'a') as writer:
    writer.write_all(json_list)

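As written, I believe this code never finds a matching value and converts all the text every time it runs. That is a lengthy process, since each document spans 200 or 300 pages.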

score 0 · Accepted Answer

Update:

Optimized to store only the id field in the DataFrame.
- Kept a DataFrame (rather than a list) to help with future expansion and flexibility.

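Answer:

Having worked through (what I believe is) your scenario, we have the following setup and requirements:

You have a jsonlines file named output.jsonl.
The output.jsonl file contains (n) dictionaries, one for each PDF parsed by PyPDF2.
We must iterate over a directory containing 400+ PDF files and determine whether each PDF's filename is in output.jsonl.

If that is correct, let's change tactics and take the following approach:

Create a list of PDF filenames (called pdfs).
Read the id field from the jsonlines file (output.jsonl) into a pandas.DataFrame (df).
Loop over the pdfs list and test whether each filename (id) is in the DataFrame (df).
If it is not, add the filename to a list (called notin).
Use the notin list to parse these new files into whatever you like.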

My (extended) output.jsonl file looks like this:

{"id": "1234", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1235", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1236", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1237", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1238", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}

Here is the commented code that carries out the steps above:

import os
import jsonlines
import pandas as pd

# Set the path to output.jsonl
path = os.path.expanduser('~/Desktop/output.jsonl')
# Build a list of PDFs (You'll use `os.listdir()`)
pdfs = ['1234.pdf', '1235.pdf', '1236.pdf', '1237.pdf', 
        '1238.pdf', '5000.pdf', '5001.pdf']
# Read output.jsonl, collecting only the 'id' value of each record.
with jsonlines.open(path) as reader:
    ids = [line.get('id') for line in reader.iter()]
# Store the ids in a DataFrame. DataFrame.append() was removed in pandas 2.0,
# so the frame is built in one step instead of row by row.
df = pd.DataFrame({'id': ids})
# Display the DataFrame's contents.
print('Contents of the jsonlines file:\n')
print(df)

# Loop over the PDF filenames and test if each filename is in the DataFrame.
notin = [i for i in pdfs if os.path.splitext(i)[0] not in df['id'].values]
# Display the results.
print('\nThese PDFs are not in your jsonlines file:')
print(notin)

Output. Note that the files 5000.pdf and 5001.pdf were not found:

Contents of the jsonlines file:

     id
0  1234
1  1235
2  1236
3  1237
4  1238

These PDFs are not in your jsonlines file:
['5000.pdf', '5001.pdf']
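
The last step, parsing the files in notin and appending them to output.jsonl, is not shown above. Here is a minimal sketch of one way it might look, under some assumptions. It uses only the standard library json module rather than the jsonlines package, and a set rather than a DataFrame, to keep the example self-contained. The names read_known_ids, append_new_pdfs and convert are hypothetical. convert stands in for whatever PyPDF2-based function you already use to turn one PDF into text. The records written here carry only id and text, so add title and url as your data requires.

```python
import json
import os


def read_known_ids(jsonl_path):
    """Return the set of 'id' values already stored in a JSON Lines file."""
    if not os.path.exists(jsonl_path):
        return set()
    with open(jsonl_path, encoding='utf-8') as fh:
        # One JSON object per line; blank lines are skipped.
        return {json.loads(line).get('id') for line in fh if line.strip()}


def append_new_pdfs(pdf_dir, jsonl_path, convert):
    """Convert only PDFs whose id is not yet in jsonl_path and append them.

    `convert` is a hypothetical callable: it takes a PDF path, returns text.
    Returns the list of ids that were added.
    """
    known = read_known_ids(jsonl_path)
    added = []
    # Mode 'a' appends, so existing records are left untouched.
    with open(jsonl_path, 'a', encoding='utf-8') as fh:
        for filename in sorted(os.listdir(pdf_dir)):
            stem, ext = os.path.splitext(filename)
            if ext.lower() != '.pdf' or stem in known:
                continue  # not a PDF, or already converted
            text = convert(os.path.join(pdf_dir, filename))
            fh.write(json.dumps({'id': stem, 'text': text}) + '\n')
            added.append(stem)
    return added
```

Because the ids are read once before the loop, each PDF costs a single set lookup rather than a re-read of the jsonlines file. Running the function a second time with no new PDFs should append nothing.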
