python - How do I complete a for loop with pdfplumber?

Question

Problem

I am following this tutorial, https://www.youtube.com/watch?v=eTz3VZmNPSE&list=PLxEus0qxF0wciRWRHIRck51EJRiQyiwZT&index=16, and my code returns the error shown further down.

Goal

I need to scrape a PDF that looks like this (I wanted to attach the PDF, but I don't know how):

170001WO01 
English (US) into Arabic (DZ) 
Trans./Edit/Proof. 22.117,00 Words 1,350 29.857,95 
TM - Fuzzy Match 2.941,00 Words 0,500 1.470,50
TM - Exact Match 353,00 Words 0,100 35,30

Approach

I am following the pdfplumber tutorial mentioned above.

import re
import pdfplumber 
import PyPDF2
import pandas as pd
from collections import namedtuple 
ap = open('test.pdf', 'rb')

I name the columns that I want in the final DataFrame.

Serv = namedtuple('Serv', 'case_number language num_trans num_fuzzy num_exact')

Problem

The tutorial example has 2 kinds of line to match, whereas I have 5.

case_li = re.compile(r'(\d{6}\w{2}\d{2})')
language_li = re.compile(r'(nglish \(US\) into )(.*)')
trans_li = re.compile(r'(Trans./Edit/Proof.              )(\d{2}\.\d{3})')
fuzzy_li = re.compile(r'(TM - Fuzzy Match                )(\d{1}\.\d{3})')
exact_li = re.compile(r'(M - Exact Match                )(\d{3})')

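Problem

When I introduce the third line into the code, I get an error that I don't understand. I modified the code as 2e0byo suggested, but I still get an error.

Here is the new code: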

line_items = []
with pdfplumber.open(ap) as pdf:
    page = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            
            line = case_li.search(line)
            if line:
                case_number = line
            
            line = language_li.search(line)
            if line:
                language = line.group(2)
            
            line = trans_li.search(line)
            if line:
                num_trans = line.group(2)
            
            line = fuzzy_li.search(line)
            if line:
                num_fuzzy = line.group(2)
            
            line = exact_li.search(line)
            if line:
                num_exact = line.group(2)
                
            line_items.append(Serv(case_number, language, num_trans, num_fuzzy, num_exact))

Here is the new error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13992/1572426536.py in <module>
     10                 case_number = line
     11 
---> 12             line = language_li.search(line)
     13             if line:
     14                 language = line.group(2)

TypeError: expected string or bytes-like object

# GOAL
The goal is to append the parsed lines to line_items and eventually build a DataFrame from them:

df = pd.DataFrame(line_items)

score 0 · Accepted Answer

You have reassigned line here:

for line in text.split("\n"):
    # line is a str (the line)
    line = language_li.search(line)
    # line is no longer a str, but the result of a re.search

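So line is no longer the line of text but the result of the search: a match object, or None when nothing matched. The next call, such as language_li.search(line) or trans_li.search(line), is therefore not searching the line you think it is. It receives something that is not a string, which is what raises TypeError: expected string or bytes-like object.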
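The failure can be reproduced without a PDF. The following is a minimal sketch that reuses two of the question's patterns on a sample line copied from the question; the try/except and the error variable are only there for illustration.

```python
import re

case_li = re.compile(r'(\d{6}\w{2}\d{2})')
language_li = re.compile(r'(nglish \(US\) into )(.*)')

line = "170001WO01"            # sample line copied from the question
line = case_li.search(line)    # line is now a match object, not a str

try:
    language_li.search(line)   # search() only accepts str or bytes
    error = None
except TypeError as exc:
    error = str(exc)

print(type(line).__name__)
print(error)
```

The same TypeError appears when the first search finds nothing, because line is then None, which is not a string either.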

To fix your code, adopt a consistent pattern:

for line in text.split("\n"):
    match = language_li.search(line)
    # line is still a str (the line)
    # match is the result of re.search
    if match:
        do_something(match.groups())
        ...

    # line is *still* a str, so it can be searched again
    match = trans_li.search(line)
    if match:
        ...

For completeness, with the dreaded walrus operator (Python 3.8+) you can now write:

if (match := language_li.search(line)) is not None:
    do_something(match.groups())
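The assignment expression needs its own parentheses, because := binds more loosely than is not. A small sketch of the difference, using the question's language_li pattern and a sample line; the names unparenthesized and target_language are made up for this illustration.

```python
import re

language_li = re.compile(r'(nglish \(US\) into )(.*)')
line = "English (US) into Arabic (DZ)"

# Without parentheses, `is not None` is evaluated first and its result is bound,
# so the name ends up holding a bool that has no .groups() method.
if unparenthesized := language_li.search(line) is not None:
    pass

# With parentheses, the match object itself is bound and then compared.
if (match := language_li.search(line)) is not None:
    target_language = match.group(2)

print(type(unparenthesized).__name__, target_language)
```

Writing simply `if match := language_li.search(line):` avoids the problem too, since a match object is truthy and None is not.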

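I once thought it was neater, but I now think it is ugly. I fully expect to be downvoted just for mentioning the walrus operator. (If you look at the edit history of this post you will see that I even forgot how to use it, and wrote it backwards at first.)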

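PS: You may want to read up on variable scope in Python, although I do not know of a language that would stop you from making this particular mistake (overwriting the loop variable inside the loop). Incidentally, making this kind of mistake by accident is why we generally avoid similarly named variables (like line and Line) and use clearly distinct names such as line and match.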
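Putting the pattern together, the question's loop might look roughly like the sketch below. It rests on several assumptions that are not in the original post: the text variable stands in for page.extract_text(), the patterns are rewritten with \s+ and [\d.]+,\d+ so that they match the sample text as shown in the question, and a record is treated as complete once the "TM - Exact Match" line has been seen.

```python
import re
from collections import namedtuple

Serv = namedtuple('Serv', 'case_number language num_trans num_fuzzy num_exact')

# Hypothetical patterns, adapted to the sample text; one capture group each.
patterns = {
    'case_number': re.compile(r'^(\d{6}\w{2}\d{2})\s*$'),
    'language': re.compile(r'English \(US\) into (.*)'),
    'num_trans': re.compile(r'Trans\./Edit/Proof\.\s+([\d.]+,\d+)'),
    'num_fuzzy': re.compile(r'TM - Fuzzy Match\s+([\d.]+,\d+)'),
    'num_exact': re.compile(r'TM - Exact Match\s+([\d.]+,\d+)'),
}

# Stand-in for page.extract_text(); the content is copied from the question.
text = (
    "170001WO01 \n"
    "English (US) into Arabic (DZ) \n"
    "Trans./Edit/Proof. 22.117,00 Words 1,350 29.857,95 \n"
    "TM - Fuzzy Match 2.941,00 Words 0,500 1.470,50\n"
    "TM - Exact Match 353,00 Words 0,100 35,30\n"
)

line_items = []
record = {}
for line in text.split('\n'):
    for field, pattern in patterns.items():
        # line stays a str; the search result always goes into match.
        match = pattern.search(line)
        if match:
            record[field] = match.group(1).strip()
    # Assumption: the exact-match line is the last line of a record.
    if 'num_exact' in record:
        # Fields that were never seen are filled with None.
        line_items.append(Serv(*(record.get(f) for f in Serv._fields)))
        record = {}

print(line_items)
```

Appending once per completed record, rather than once per line as in the question, keeps line_items at one row per case and avoids referring to fields before they have been assigned. The numbers are kept as strings in their original "22.117,00" form; converting them is a separate step.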
