machine-learning - Identifying questions in exams (text recognition)

Question

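I have thousands of PDF exams, and I want to extract their questions into a standard format (JSON, YML, or XML).

They are multiple choice:

Question 1

Who was the first man to walk on the moon?

a) Yuri Gagarin

b) Ellen Ripley

c) Neil Armstrong

d) Shepard

Question 2

How many planets are there in the solar system?

a) 10

b) 12

c) 14

d) 15

(...)

In JSON: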

{
  "number": 1,
  "wording": "Who was the first man to walk on the moon?",
  "alternatives": {
    "a": "Yuri Gagarin",
    "b": "Ellen Ripley",
    "c": "Neil Armstrong",
    "d": "Shepard"
  }
}

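Note that because these exams were written by different teachers, they may vary slightly. This means that even after extracting them to plain text, I cannot match them with regular expressions. (I tried; the number of combinations of wording structure and alternative structure is huge.)

For example:

"Question X (...)".

"Question (X) (...)".

"Question X - (...)".

"X) (...)".

"X- (...)".

The alternatives can also vary:

a) (...)

a. (...)

a- (...)

1) (...)

I think I need some kind of machine learning tool to "teach" the program what a question is and let it find them.

Alternatively, since the questions (in print) are physically far apart from one another, I think I could convert these PDFs to images and use some kind of image recognition.

Is this feasible? Is there a tool (package, library, algorithm) for recognizing these questions?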

score 0 · Accepted Answer

There is no straightforward machine learning solution to your problem. If your PDFs number in the thousands and the formats in the tens, you are better off writing a string parser for each format. If you go down the machine learning path, it may take longer to reach a solution. Python should help.
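
As a rough illustration of what such a string parser could look like, here is a minimal Python sketch. It assumes the exams have already been extracted to plain text and that every line is a question header, an alternative, or continuation text. The names `QUESTION_RE`, `ALTERNATIVE_RE`, and `parse_exam` are hypothetical, and the patterns only cover the header and alternative variants listed in the question. Numbered alternatives such as "1)" are ambiguous with the "X)" header style and are not handled here; a format that uses them would need its own parser.

```python
import json
import re

# Hypothetical patterns, not taken from the original post.
# Headers: "Question X", "Question (X)", "Question X -", "X)", "X-".
QUESTION_RE = re.compile(
    r"^\s*(?:Question\s*\(?(\d+)\)?\s*[-.:)]?|(\d+)\s*[-.)])\s*(.*)$",
    re.IGNORECASE,
)
# Alternatives: "a)", "a.", "a-" (letters a to e only).
ALTERNATIVE_RE = re.compile(r"^\s*([a-e])\s*[-.)]\s*(.*)$", re.IGNORECASE)


def parse_exam(text):
    """Split plain exam text into a list of question dictionaries."""
    questions = []
    current = None   # question currently being filled in
    last_key = None  # letter of the most recent alternative, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = QUESTION_RE.match(line)
        alternative = ALTERNATIVE_RE.match(line)
        if header:
            number = int(header.group(1) or header.group(2))
            current = {
                "number": number,
                "wording": header.group(3).strip(),
                "alternatives": {},
            }
            questions.append(current)
            last_key = None
        elif current is None:
            continue  # text before the first question is ignored
        elif alternative:
            last_key = alternative.group(1).lower()
            current["alternatives"][last_key] = alternative.group(2).strip()
        elif last_key is not None:
            # Continuation of a multi-line alternative.
            current["alternatives"][last_key] += " " + line
        else:
            # Continuation of the question wording.
            current["wording"] = (current["wording"] + " " + line).strip()
    return questions


SAMPLE = """
Question 1
Who was the first man to walk on the moon?
a) Yuri Gagarin
b) Ellen Ripley
c) Neil Armstrong
d) Shepard

Question (2) How many planets are there in the solar system?
a. 10
b. 12
c- 14
d) 15
"""

if __name__ == "__main__":
    print(json.dumps(parse_exam(SAMPLE), indent=2))
```

In practice you would probably keep one small pattern set like this per exam format and choose between them by, for instance, checking which set recognizes the most questions in a given file.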
