
I am using bs4 to parse an XML file and write it back out to a new XML file.

Input file:

<tag1>
  <tag2 attr1="a1">&quot; example text &quot;</tag2>
  <tag3>
    <tag4 attr2="a2">&quot; example text &quot;</tag4>
    <tag5>
      <tag6 attr3="a3">&apos; example text &apos;</tag6>
    </tag5>
  </tag3>
</tag1>

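Script:

from bs4 import BeautifulSoup

with open("input.xml") as src:
    soup = BeautifulSoup(src, "xml")
# encode() returns bytes, so open the output file in binary mode
with open("output.xml", "wb") as f:
    f.write(soup.encode(formatter='minimal'))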

Output:

<tag1>
  <tag2 attr1="a1"> " example text "  </tag2>
  <tag3>
    <tag4 attr2="a2"> " example text " </tag4>
    <tag5>
      <tag6 attr3="a3"> ' example text ' </tag6>
    </tag5>
  </tag3>
</tag1>

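I want to preserve the &quot; and &apos; entities. I tried every formatter option for encode (minimal, xml, html, None), but none of them solved the problem.

Then I tried manually replacing " with &quot;.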

for tag in soup.find_all(text=re.compile("\"")):
    res = tag.string
    res1 = res.replace("\"","&quot;")
    tag.string.replaceWith(res1)

But this gives the following output:

<tag1>
  <tag2 attr1="a1"> &amp;quot; example text &amp;quot;  </tag2>
  <tag3>
    <tag4 attr2="a2"> &amp;quot; example text &amp;quot; </tag4>
    <tag5>
      <tag6 attr3="a3"> &apos; example text &apos; </tag6>
    </tag5>
  </tag3>
</tag1>

It replaces & with &amp;. I am confused here. Please help me solve this.

1 Answer

Custom Encode & Output Formatting

You can use a custom formatter function to add these specific entities to the entity substitution.

from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution

def custom_formatter(string):
    """add &quot; and &apos; to entity substitution"""
    return EntitySubstitution.substitute_html(string).replace('"','&quot;').replace("'",'&apos;')

input_file = '''<tag1>
  <tag2 attr1="a1">&quot; example text &quot;</tag2>
  <tag3>
    <tag4 attr2="a2">&quot; example text &quot;</tag4>
    <tag5>
      <tag6 attr3="a3">&apos; example text &apos;</tag6>
    </tag5>
  </tag3>
</tag1>
'''

soup = BeautifulSoup(input_file, "xml")

print(soup.encode(formatter=custom_formatter).decode("utf-8"))

<?xml version="1.0" encoding="utf-8"?>
<tag1>
<tag2 attr1="a1">&quot; example text &quot;</tag2>
<tag3>
<tag4 attr2="a2">&quot; example text &quot;</tag4>
<tag5>
<tag6 attr3="a3">&apos; example text &apos;</tag6>
</tag5>
</tag3>
</tag1>

The trick is to do the quote replacement after EntitySubstitution.substitute_html() has run, so the & in each new entity doesn't get substituted to &amp;.
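
The ordering point can be checked without BeautifulSoup. Below is a minimal sketch that uses only the standard library's xml.sax.saxutils.escape, which handles &, < and > first and applies any extra replacements afterwards. The names QUOTE_ENTITIES, quote_preserving_escape and wrong_order are hypothetical and exist only for this illustration.

```python
from xml.sax.saxutils import escape

# Extra replacements that escape() applies AFTER it has handled &, < and >.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def quote_preserving_escape(text):
    """Escape &, <, > first, then turn quotes into entities."""
    return escape(text, QUOTE_ENTITIES)


def wrong_order(text):
    """Replace quotes first; the later & escaping mangles the new entities."""
    replaced = text.replace('"', "&quot;").replace("'", "&apos;")
    return escape(replaced)


sample = '" example & text \''
print(quote_preserving_escape(sample))  # &quot; example &amp; text &apos;
print(wrong_order(sample))              # &amp;quot; example &amp; text &amp;apos;
```

The second function reproduces the &amp;quot; result from the question: the text was edited before the serializer escaped it. Because quote_preserving_escape takes a string and returns a string, it could presumably also be passed as the formatter argument in place of custom_formatter, although that combination is not tested here.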

Answered 2015-05-02T19:08:53.053