python - 编辑 .txt 文件，然后使用 python 转换为有效的 xml

Question

我有很多文本文件需要转换为 .xml 以便能够更有效地使用（我应该做几个语言模型来分析英语方言）

文件是这样的：

<I> <IFL-IDN W2C-001 #1:1> <#> <h> <bold> Some Statement that I can edit </bold> <bold> followed by another </bold> </h> 
    <IFL-IDN W2C-001 #2:1> <p> <#> more and more text that is not very relevant . </p></I>

每个文件大约有 500 个单词，我要做的是识别标签，并关闭像 <#> 和句子末尾的未关闭的标签。

然后我想在每个单词之前和之后将整个 .txt 文件转换为有效的 xml 文件。我可以用 .split() 将其分开，但问题是这些标签中有空格。

我能想到的最好的代码是 splilines()，然后 .split() 在一个句子上，然后尝试识别

这是代码

Korpus = open("w2c-001.txt").read().splitlines()

for i in Korpus:
    Sentence = i.split()
    for j in  range(0,len(Sentence)-2):
        if((Sentence[j][0]=='<' and Sentence[j][len(Sentence[j])-1]!='>') or( Sentence[j][0]!='<' and Sentence[j][len(Sentence[j])-1]=='>')):
            Sentence[j] = Sentence[j] + " " + Sentence[j+1] +" " + Sentence[j+2]
            Sentence.remove(Sentence[j+1])
            Sentence.remove(Sentence[j+2])
            #print(Sentence[j])
        print(Sentence[j])

我最初的想法是，如果我可以写一些东西甚至将有效的 xml 保存在 .txt 文件中，那么将该文件转换为 .xml 应该不是一个大问题。我找不到可以做到这一点的python库，eltree库可以创建xml，但是我找不到任何可以识别和转换它的东西。

提前感谢您，任何帮助将不胜感激。

score 0 · Accepted Answer

首先，您不必加载文件和拆分行，您可以遍历这些行。xml 解析器可以分别应用于每一行。

Korpus = open("w2c-001.txt")
for line in Korpus:
    ...

如果你想自己解析，使用正则表达式查找标签

import re
re.findall(r'<[a-z]*>','<h> <bold> Some Statement that I can edit </bold> <bold> followed by another </bold> </h>')

XML 不是一种文件格式，它是一种语言，只需将纯文本写入 .xml 文件即可。

python - 编辑 .txt 文件，然后使用 python 转换为有效的 xml

1 回答 1

Related

Reference