I downloaded a book in PDF format and want to use it in my iOS project. The format I need is XML, like this:
<q>question here</q>
<a>answer here</a>
<q>question2</q>
<a>answer2</a>
The PDF is laid out like this:
the question is centered
the answer has several paragraphs that start with 4 white space.
This is another paragraph
This is the second question and so on
This is the answer to the second question
The third question and there may be a blank line above
This is the 4th question and no blank line above
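I tried converting the PDF to txt with Word/Pages and reading the text line by line, but I cannot tell the questions and answers apart. Another problem is that the conversion turns the PDF's automatic line wraps into hard line breaks.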
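A minimal sketch of one way the lines could be classified, not a tested solution. It assumes the text export keeps leading spaces, that centered question lines are indented by more than a threshold (the hypothetical `QUESTION_INDENT`), that answer paragraphs start with 4 spaces, and that wrapped lines have no indentation. `parse_qa` and `to_xml` are names made up for this example. One known limitation: two questions in a row with no answer between them would be merged into a single question.

```python
from xml.sax.saxutils import escape

# Assumed threshold: lines indented by more than this many spaces are
# treated as centered question lines (hypothetical value, tune per book).
QUESTION_INDENT = 8


def parse_qa(text):
    """Return a list of (question, [answer paragraph, ...]) tuples."""
    pairs = []
    question_parts = []  # lines of the question being collected
    paragraphs = []      # answer paragraphs of the current question
    in_answer = False

    def flush():
        # Store the current pair, if a question has been collected.
        if question_parts:
            pairs.append((" ".join(question_parts), list(paragraphs)))

    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue  # blank lines are not reliable separators here
        indent = len(line) - len(line.lstrip(" "))
        content = line.strip()
        if indent > QUESTION_INDENT:
            if in_answer:
                # A new question begins, so close the previous pair.
                flush()
                question_parts.clear()
                paragraphs.clear()
                in_answer = False
            question_parts.append(content)
        elif indent >= 4:
            # A 4-space indent starts a new answer paragraph.
            paragraphs.append(content)
            in_answer = True
        elif paragraphs:
            # Unindented line: a wrapped continuation of the paragraph.
            paragraphs[-1] += " " + content
        elif question_parts:
            # Unindented line before any answer: continue the question.
            question_parts[-1] += " " + content
    flush()
    return pairs


def to_xml(pairs):
    """Emit the flat <q>/<a> sequence, escaping &, < and >."""
    lines = []
    for question, paras in pairs:
        lines.append("<q>%s</q>" % escape(question))
        lines.append("<a>%s</a>" % escape("\n".join(paras)))
    return "\n".join(lines)
```

Joining the unindented lines back onto the previous paragraph is what undoes the hard line breaks introduced by the conversion. If the export drops leading whitespace, this approach does not apply as written.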
Note: the pipeline is
pdf -> use word/pages -> txt -> python program -> xml -> python program -> sqlite database
The key part is how to convert the PDF into a correct XML file.
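For completeness, the last step of the pipeline (xml -> sqlite database) could look roughly like the sketch below. It assumes the XML file is the flat `<q>`/`<a>` sequence shown at the top, which has no single root element, so the text is wrapped in a hypothetical `<book>` root before parsing. The table name `qa`, its columns, and the function names are assumptions made for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def load_pairs(xml_text):
    """Parse the flat <q>/<a> sequence into (question, answer) tuples."""
    # Wrap in a made-up <book> root so the text is well-formed XML.
    root = ET.fromstring("<book>%s</book>" % xml_text)
    pairs = []
    question = None
    for elem in root:
        if elem.tag == "q":
            question = elem.text or ""
        elif elem.tag == "a" and question is not None:
            pairs.append((question, elem.text or ""))
            question = None
    return pairs


def store_pairs(conn, pairs):
    """Insert the pairs into a table named qa (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS qa ("
        "id INTEGER PRIMARY KEY, "
        "question TEXT NOT NULL, "
        "answer TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO qa (question, answer) VALUES (?, ?)", pairs
    )
    conn.commit()
```

Usage would be along the lines of `store_pairs(sqlite3.connect("book.db"), load_pairs(xml_text))`, where `book.db` is a placeholder file name.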