python - How to improve Hindi text extraction?

Question

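I am trying to extract Hindi text from a PDF. I tried every method I could find for extracting text directly from the PDF, but none of them worked. There are explanations of why it does not work, but no answers. So I decided to convert the PDF to an image and then use pytesseract to extract the text. I have downloaded the Hindi trained data, but that also gives highly inaccurate text.

This is the actual Hindi text in the PDF (download link):

This is my code so far: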

import fitz

filepath = "D:\\BADI KA BANS-Ward No-002.pdf"

doc = fitz.open(filepath)
page = doc.loadPage(3)  # number of page
pix = page.getPixmap()
output = "outfile.png"
pix.writePNG(output)
from PIL import Image
import pytesseract

# Include tesseract executable in your path
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Create an image object of PIL library
image = Image.open('outfile.png')

# pass image into pytesseract module
# pytesseract is trained in many languages
image_to_text = pytesseract.image_to_string(image, lang='hin')

# Print the text
print(image_to_text)

Here is a sample of the output:

कार बिता देवी व ०... नाम बाइुनान िक०क नाक तो
पति का नाव: रवजी लात. “50९... पिला का सामशामाव.... “पति का नाम: बादुलल
कान सब: 43 लसमनंध्या: 93९. मकान ंब्या: 3९
आप: 29 _ लिंग सी. | आइ 57 लिंग पुरुष आप: 62 लिंग सी
एजगल्णब्णस्य (बन्द जगाख्मिणण्य
नमः बायगी बसों ०४... नि बयावर्णो ०५०... निफर सनक नी
चिता का नामजबूजल वर्ष.“ ००० | पिला का नामब्राइलाल वर्षो... 0 2... | पिता कामामशुल चब्द .... “20०
|सकानसंब्या: 43९ बसवकंब्या: 43९. कान संब्या: 44
जाए: 27 लिंग सो कई: 27 नि खी मा लिंग पुरुष

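There is an answer to the question "I want to scrape a Hindi (Indian language) PDF file with Python" that seems to show how to do this, but it provides no explanation.

Is there any way to do this other than training the language model myself?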

score 10 · Accepted Answer

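I'll provide some ideas on how to process your image, but I'll limit them to page 3 of the given document, i.e. the page shown in the question.

To convert the PDF page to an image, I used pdf2image.

For OCR, I use pytesseract, and instead of lang='hin' I use lang='Devanagari', cf. the Tesseract GitHub. In general, make sure to work through "Improving the quality of the output" from the Tesseract documentation, especially the page segmentation method.

Here is a (lengthy) description of the whole process: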

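Inverse binarize the image for contour finding: white text, shapes, etc. on a black background.
Find all contours and filter out the two very large ones, i.e. the two tables.
Extract the text outside of the two tables:
1. Mask out the tables in the binarized image.
2. Apply a morphological closing to connect the remaining lines of text.
3. Find the contours and bounding rectangles of these lines of text.
4. Run pytesseract to extract the text.
Extract the text inside the two tables:
1. Extract the cells from the current table, or better: their bounding rectangles.
2. For the first table:
  1. Run pytesseract to extract the text as-is.
3. For the second table:
  1. Floodfill the rectangles around the numbers to prevent wrong OCR output.
  2. Mask the left (Hindi) and right (English) parts.
  3. Run pytesseract with lang='Devanagari' on the left part and with lang='eng' on the right part to improve the OCR quality for both.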
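The cell-extraction steps depend on visiting bounding rectangles in reading order. Sorting by (y, x) alone can misorder cells in the same row when their top edges differ by a pixel or two. The following is a minimal sketch of a row-tolerant sort, not something taken from the answer; `sort_reading_order`, `row_tolerance`, and the sample coordinates are hypothetical.

```python
def sort_reading_order(rects, row_tolerance=10):
    """Sort (x, y, w, h) rectangles top-to-bottom, then left-to-right.

    A rectangle joins the current row if its top edge lies within
    `row_tolerance` pixels of the top edge of that row's first rectangle.
    """
    rows = []
    for rect in sorted(rects, key=lambda r: r[1]):
        if rows and abs(rect[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(rect)
        else:
            rows.append([rect])
    # Within each row, order the rectangles by their left edge
    return [r for row in rows for r in sorted(row, key=lambda r: r[0])]


# Made-up cell boxes: two rows whose top edges jitter by one pixel
cells = [(867, 666, 700, 300), (142, 665, 700, 300),
         (142, 980, 700, 300), (867, 979, 700, 300)]
print(sort_reading_order(cells))
```

A plain `sorted(cells, key=lambda r: (r[1], r[0]))` would put (867, 979, ...) before (142, 980, ...) in the second row. The tolerance value has to be chosen well below the row height.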

That's the whole code:

import cv2
import numpy as np
import pdf2image
import pytesseract

# Extract page 3 from PDF in proper quality
page_3 = np.array(pdf2image.convert_from_path('BADI KA BANS-Ward No-002.pdf',
                                              first_page=3, last_page=3,
                                              dpi=300, grayscale=True)[0])

# Inverse binarize for contour finding
thr = cv2.threshold(page_3, 128, 255, cv2.THRESH_BINARY_INV)[1]

# Find contours w.r.t. the OpenCV version
cnts = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

# STEP 1: Extract texts outside of the two tables

# Mask out the two tables
cnts_tables = [cnt for cnt in cnts if cv2.contourArea(cnt) > 10000]
no_tables = cv2.drawContours(thr.copy(), cnts_tables, -1, 0, cv2.FILLED)

# Find bounding rectangles of texts outside of the two tables
no_tables = cv2.morphologyEx(no_tables, cv2.MORPH_CLOSE, np.full((21, 51), 255))
cnts = cv2.findContours(no_tables, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
rects = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda r: r[1])

# Extract texts from each bounding rectangle
print('\nExtract texts outside of the two tables\n')
for (x, y, w, h) in rects:
    text = pytesseract.image_to_string(page_3[y:y+h, x:x+w],
                                       config='--psm 6', lang='Devanagari')
    text = text.replace('\n', '').replace('\f', '')
    print('x: {}, y: {}, text: {}'.format(x, y, text))

# STEP 2: Extract texts from inside of the two tables

rects = sorted([cv2.boundingRect(cnt) for cnt in cnts_tables],
               key=lambda r: r[1])

# Iterate each table
for i_r, (x, y, w, h) in enumerate(rects, start=1):

    # Find bounding rectangles of cells inside of the current table
    cnts = cv2.findContours(page_3[y+2:y+h-2, x+2:x+w-2],
                            cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    inner_rects = sorted([cv2.boundingRect(cnt) for cnt in cnts],
                         key=lambda r: (r[1], r[0]))

    # Extract texts from each cell of the current table
    print('\nExtract texts inside table {}\n'.format(i_r))
    for (xx, yy, ww, hh) in inner_rects:

        # Set current coordinates w.r.t. full image
        xx += x
        yy += y

        # Get current cell
        cell = page_3[yy+2:yy+hh-2, xx+2:xx+ww-2]

        # For table 1, simply extract texts as-is
        if i_r == 1:
            text = pytesseract.image_to_string(cell, config='--psm 6',
                                               lang='Devanagari')
            text = text.replace('\n', '').replace('\f', '')
            print('x: {}, y: {}, text: {}'.format(xx, yy, text))

        # For table 2, extract single elements
        if i_r == 2:

            # Floodfill rectangles around numbers
            ys, xs = np.min(np.argwhere(cell == 0), axis=0)
            temp = cv2.floodFill(cell.copy(), None, (xs, ys), 255)[1]
            mask = cv2.floodFill(thr[yy+2:yy+hh-2, xx+2:xx+ww-2].copy(),
                                 None, (xs, ys), 0)[1]

            # Extract left (Hindi) and right (English) parts
            mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE,
                                    np.full((2 * hh, 5), 255))
            cnts = cv2.findContours(mask,
                                    cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
            cnts = cnts[0] if len(cnts) == 2 else cnts[1]
            boxes = sorted([cv2.boundingRect(cnt) for cnt in cnts],
                           key=lambda b: b[0])

            # Extract texts from each part of the current cell
            for i_b, (x_b, y_b, w_b, h_b) in enumerate(boxes, start=1):

                # For the left (Hindi) part, extract Hindi texts
                if i_b == 1:

                    text = pytesseract.image_to_string(
                        temp[y_b:y_b+h_b, x_b:x_b+w_b],
                        config='--psm 6',
                        lang='Devanagari')
                    text = text.replace('\f', '')

                # For the right (English) part, extract English texts
                if i_b == 2:

                    text = pytesseract.image_to_string(
                        temp[y_b:y_b+h_b, x_b:x_b+w_b],
                        config='--psm 6',
                        lang='eng')
                    text = text.replace('\f', '')

                print('x: {}, y: {}, text:\n{}'.format(xx, yy, text))

And here are the first few lines of the output:

Extract texts outside of the two tables

x: 972, y: 93, text: राज्य निर्वाचन आयोग, राजस्थान
x: 971, y: 181, text: पंचायत चुनाव निर्वाचक नामावली, 2021
x: 166, y: 610, text: मिश्र का बाढ़ ,श्रीराम की नॉगल
x: 151, y: 3417, text: आयु 1 जनवरी 2021 के अनुसार
x: 778, y: 3419, text: पृष्ठ संख्या : 3 / 10

Extract texts inside table 1

x: 146, y: 240, text: जिलापरिषद का नाम : जयपुर
x: 1223, y: 240, text: जि° प° सदस्य निर्वाचन क्षेत्र : 21
x: 146, y: 327, text: पंचायत समिति का नाम : सांगानेर
x: 1223, y: 327, text: पं° स° सदस्य निर्वाचन क्षेत्र : 6
x: 146, y: 415, text: ग्रामपंचायत : बडी का बांस
x: 1223, y: 415, text: वार्ड क्रमांक : 2
x: 146, y: 502, text: विधानसभा क्षेत्र की संख्या एवं नाम:- 56-बगरु

Extract texts inside table 2

x: 142, y: 665, text:
1 RBP2469583
नाम: आरती चावला
पिता का नामःलाला राम चावला
मकान संख्याः १९
आयुः 21 लिंगः स्त्री

x: 142, y: 665, text:
Photo is
Available

x: 867, y: 665, text:
2 MRQ3101367
नामः सूरज देवी
पिता का नामःरामावतार
मकान संख्याः डी /18
आयुः 44 लिंगः स्त्री

x: 867, y: 665, text:
Photo is
Available

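I checked some of the text with a manual character-by-character comparison and think it looks quite good. However, since I neither understand Hindi nor read Devanagari script, I can't comment on the overall quality of the OCR. Please let me know!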
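A character-by-character check like this can be partly automated without reading the script by comparing Unicode code point names. The sketch below is an illustration only; `first_difference` is a hypothetical helper. It picks up, for instance, that the output above ends the name label with an ASCII colon in the first card (नाम:) but with a visarga sign in the second (नामः).

```python
import unicodedata


def first_difference(a, b):
    """Return (index, name_in_a, name_in_b) for the first differing code
    point, or None if both strings are equal after NFC normalization."""
    a = unicodedata.normalize('NFC', a)
    b = unicodedata.normalize('NFC', b)
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i, unicodedata.name(ca, '?'), unicodedata.name(cb, '?')
    if len(a) != len(b):
        # One string is a prefix of the other
        i = min(len(a), len(b))
        name_a = unicodedata.name(a[i], '?') if i < len(a) else '<end>'
        name_b = unicodedata.name(b[i], '?') if i < len(b) else '<end>'
        return i, name_a, name_b
    return None


label_card_1 = 'नाम:'   # label as printed for the first card
label_card_2 = 'नामः'   # label as printed for the second card
print(first_difference(label_card_1, label_card_2))
```

This only reports where two strings diverge; it still needs a trusted reference text to say which of the two is correct.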

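Annoyingly, the number 9 in the corresponding "card" is wrongly extracted as 2. I assume this is due to the font, which differs from the rest of the text, in combination with lang='Devanagari'. I couldn't find a solution for that, short of extracting the rectangle separately from the "card".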
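If the boxed number is cut out of the "card" separately, one option is to OCR that crop as a single line restricted to digits. This is an untested sketch under that assumption: `digit_config` and `ocr_boxed_number` are hypothetical helpers, locating the box is not shown, and there is no guarantee that it fixes the 9 versus 2 confusion on this document. `--psm 7` and `tessedit_char_whitelist` are standard Tesseract options.

```python
def digit_config(psm=7):
    """Build a Tesseract config string that restricts output to ASCII digits.

    PSM 7 tells Tesseract to treat the image as a single line of text.
    """
    return '--psm {} -c tessedit_char_whitelist=0123456789'.format(psm)


def ocr_boxed_number(number_crop):
    """OCR a crop that contains only the boxed number of one card.

    `number_crop` is assumed to be a NumPy array or PIL image that was
    cut out of the card beforehand.
    """
    import pytesseract  # imported here so digit_config() has no dependency
    text = pytesseract.image_to_string(number_crop, lang='eng',
                                       config=digit_config())
    return text.strip()


print(digit_config())
```

Using lang='eng' here follows the same reasoning as for the right-hand part of the cards: the number is printed in Latin digits, so the Devanagari model is not needed for it.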

----------------------------------------
System information
----------------------------------------
Platform:      Windows-10-10.0.19041-SP0
Python:        3.9.1
PyCharm:       2021.1.1
NumPy:         1.19.5
OpenCV:        4.5.2
pdf2image      1.14.0
pytesseract:   5.0.0-alpha.20201127
----------------------------------------

score 5 · Accepted Answer

If you want to get the text from these "cards", I managed to do it for page 3 with the tabula-py module in the following way:

import tabula

pdf_file = "BADI KA BANS-Ward No-002.pdf"
page = 3

x = 30      # left edge of the table
y = 160     # top edge of the table
w = 173     # width of a card
h = 73      # height of a card
photo = 61  # width of a photo

rows = 8    # number of rows of the table
cols = 3    # number of columns of the table

counter = 1

def get_area(row, col):
    ''' return area of the card in given position in the table '''
    top    = y + h * row
    left   = x + w * col
    bottom = top + h
    right  = left + w - photo
    return (top, left, bottom, right)

for row in range(rows):
    for col in range(cols):
        file_name = "card_" + str(counter).zfill(3) + ".txt"
        tabula.convert_into(pdf_file, file_name,
        pages=page,
        output_format = "csv",
        java_options = "-Dfile.encoding=UTF8",
        lattice = False,
        area = get_area(row, col))
        counter += 1

Input:

Output

24 txt files:

card_001.txt
card_002.txt
card_003.txt
card_004.txt
.
.
.
card_023.txt
card_024.txt

card_001.txt:

1 RBP2469583
नरम: आरतल चररलर
नपतर कर नरम:लरलर ररम चररल
मकरन सखजर: १९
आज:  21 ललग: सल

card_002.txt

2 MRQ3101367
नरम: सरज दरल
नपतर कर नरम:ररमररतरर
मकरन सखजर: रल /18
आज:  44 ललग: सल

card_024.txt

24 RBP0230979
नरम: सनमतकरर
पनत कर नरम: हररलसह
मकरन सखजर: 13
आज:  41 ललग: सल

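As far as I can see, all the "cards" have the same size. If every page had the same layout, the solution could be applied to all pages. Unfortunately, the pages differ, so the initial variables have to be changed for each page. I don't see a way to make those changes automatically. One thing that could be automated: the number of each card could be taken from the card itself instead of from a simple counter.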
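Taking the number from the card rather than from a counter could look roughly like this. It is a sketch only: `card_number` and `HEADER_RE` are hypothetical names, and the header pattern (a number, then three capital letters and seven digits) is an assumption based solely on the three sample files shown above.

```python
import re

# Assumed header layout, inferred from the samples only:
# "<number> <three capital letters><seven digits>"
HEADER_RE = re.compile(r'^\s*(\d+)\s+([A-Z]{3}\d{7})\s*$')


def card_number(card_text):
    """Return (number, card_id) parsed from the first line, or None."""
    lines = card_text.strip().splitlines()
    if not lines:
        return None
    match = HEADER_RE.match(lines[0])
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


sample = '24 RBP0230979\nनरम: सनमतकरर\n'
number, card_id = card_number(sample)
print('card_{:03d}.txt'.format(number), card_id)
```

A card whose first line does not match returns None, which is a useful signal that the area for that position needs adjusting.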

https://pypi.org/project/tabula-py/

https://aegis4048.github.io/parse-pdf-files-while-retaining-structure-with-tabula-py

score 1 · Accepted Answer

It seems the pdfplumber module can do the job:

import pdfplumber

pdf = pdfplumber.open('BADI KA BANS-Ward No-002.pdf')

pages = pdf.pages
text = ""

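for page in pages:
    # extract_text() can return None for a page without extractable text
    # in some pdfplumber versions, so fall back to an empty string
    text += page.extract_text() or ""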

pdf.close()

with open('output.txt', 'w', encoding="utf8") as f:
    f.write(text)

Output (fragment):

ररजज ननरररचन आजयग, ररजससरन 
 पपचरजत चचनरर ननरररचक नरमररलल, 2021   
नजलरपररषद कर नरम : जजपचर नज॰ प॰ सदसज ननरररचन ककत : 21
पपचरजत सनमनत कर नरम : सरपगरनकर पप॰ स॰ सदसज ननरररचन ककत : 6
गरमपपचरजत : बरल कर बरपस रररर कमरपक : 2
नरधरनसभर ककत कक सपखजर एरप नरम:-56-बगर
मचखज गरपर        : लकमलपचरर उरर नटरनलपचरर
तहसलल         : सरपगरनकर
नजलर            : जजपचर
पचनरलकण कर नरररण
पचनरलकण कर रषर  :  2021
पचनरलकण कर पकरर               :  गहन पचनरलकण
अहतर र ददनरपक  :  01-01-2021
अपनतम पकरशन कक ददनरपक     :  19-04-2021
...

Input (first page):

But I don't know any Hindi, so I can't tell whether the output is good enough.
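One rough check that needs no knowledge of Hindi is to count combining marks. The heading that the OCR answer above reads as राज्य निर्वाचन आयोग, राजस्थान comes out of pdfplumber as ररजज ननरररचन आजयग, ररजससरन, with no vowel signs or viramas at all, which suggests that the font encoding of the PDF is not being mapped correctly. The sketch below measures this; `devanagari_mark_ratio` is a hypothetical helper, and a low ratio is only a hint, not proof.

```python
import unicodedata


def devanagari_mark_ratio(text):
    """Share of Devanagari letters and marks that are combining marks.

    Counts code points in the Devanagari block (U+0900 to U+097F) whose
    Unicode category is Mn or Mc (vowel signs, virama, etc.) against
    those whose category is Lo (consonants and independent vowels).
    """
    marks = letters = 0
    for ch in text:
        if '\u0900' <= ch <= '\u097f':
            category = unicodedata.category(ch)
            if category in ('Mn', 'Mc'):
                marks += 1
            elif category == 'Lo':
                letters += 1
    total = marks + letters
    return marks / total if total else 0.0


ocr_heading = 'राज्य निर्वाचन आयोग, राजस्थान'        # from the OCR answer
extracted_heading = 'ररजज ननरररचन आजयग, ररजससरन'    # from pdfplumber
print(devanagari_mark_ratio(ocr_heading))
print(devanagari_mark_ratio(extracted_heading))
```

Ordinary Hindi text contains a substantial share of such marks, so a ratio at or near zero over a longer passage is a strong sign that the extracted text is garbled.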

https://github.com/jsvine/pdfplumber

To install the module (Windows 7, Python 3.8):

pip install pdfplumber

The module is said to handle tables as well. I haven't tried that, though.

score 0 · Accepted Answer

If you want to scrape 100% correct text from the PDF, you should use the correct font family and encoding when parsing from image to text.
