
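I want to find a string that starts with "section_" and add it as an attribute value to the tag on the same line. Example: the following is the input, taken from a ditamap file.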

<topicref href="xyz/debug_logging_in_xyz-section_i_y_mn.dita"/>
<topicref href="xyz/workflows_id-section_exf_zaz_lo.dita"/>
<topicref href="xyz/images_id-section_ekl_bbz_lo.dita"/>

Expected output:

<topicref href="xyz/debug_logging_in_xyz-section_i_y_mn.dita" keys="section_i_y_mn"/>
<topicref href="xyz/workflows_id-section_exf_zaz_lo.dita" keys="section_exf_zaz_lo"/>
<topicref href="xyz/images_id-section_ekl_bbz_lo.dita" keys="section_ekl_bbz_lo"/>

I understand that BeautifulSoup can be used to do this, but I am new to it and don't know the syntax. Can anyone help?

This is the code I tried:

import os
from bs4 import BeautifulSoup as bs
globpath = "C:/DATA" #add your directory path here 

def main(path):
    with open(path, encoding="utf-8") as f:
        s = f.read()
    s = bs(s, "xml")
    imgs = s.find_all("topicref")
    for i in imgs:
        if "section" in i["href"]:
            i["keys"] = i["href"].replace("*-","").replace(".dita*","")
    s = str(s)
    with open(path, "w", encoding="utf-8") as f:
        f.write(s)

for dirpath, directories, files in os.walk(globpath):
    for fname in files:
        if fname.endswith(".ditamap"):
            path = os.path.join(dirpath, fname)
            main(path)

However, it puts the entire path into the keys attribute. I only need the part that starts with section and ends before .dita.
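The whole path ends up in keys because str.replace matches literal text: the strings "*-" and ".dita*" are not treated as wildcards, they never occur in the href, and so nothing is removed. A minimal sketch of the difference, using only the standard library; `extract_key` is a hypothetical helper name introduced here for illustration, not part of the original script:

```python
import re

href = "xyz/debug_logging_in_xyz-section_i_y_mn.dita"

# str.replace looks for the literal text "*-" and ".dita*"; neither occurs
# in the href, so the string comes back unchanged.
unchanged = href.replace("*-", "").replace(".dita*", "")
print(unchanged == href)  # True


def extract_key(href):
    """Hypothetical helper: return the 'section_...' part before the first dot, or None."""
    match = re.search(r"section_[^.]*", href)
    return match.group(0) if match else None


print(extract_key(href))  # section_i_y_mn
```

The pattern assumes the section id itself contains no dot, which holds for the sample hrefs in the question.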

The regular expression worked. This is the final code:

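import os
import re
from bs4 import BeautifulSoup as bs

globpath = "C:/DATA"  # add your directory path here


def main(path):
    with open(path, encoding="utf-8") as f:
        s = f.read()
    s = bs(s, "xml")
    for i in s.find_all("topicref"):
        # Take everything from "section" up to the first dot in the href
        match = re.search(r"section[^.]*", i.get("href", ""))
        if match:
            i["keys"] = match.group(0)
    s = str(s)
    with open(path, "w", encoding="utf-8") as f:
        f.write(s)


# Same directory walk as in the first attempt
for dirpath, directories, files in os.walk(globpath):
    for fname in files:
        if fname.endswith(".ditamap"):
            main(os.path.join(dirpath, fname))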

1 Answer


I think this should be done with a regular expression (that is about as much as I can offer).

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup('your-string-input-of-tags-goes-here', 'html.parser')
# Filter on href: the keys attribute does not exist yet in the input
soup.find_all('topicref', {'href': re.compile(r'section_[^."]+')})

This returns a list of the matching tags.

Check whether this code works for you.
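If a regular expression alone is preferred, one possible sketch is a plain-text substitution that appends the attribute right after each matching href, without parsing the XML. The pattern and the `add_keys` helper are assumptions made for illustration; they expect every target href to end in `.dita"` and the section id to contain no dot or quote:

```python
import re

# Group 1 captures the whole href attribute, group 2 the section id inside it.
HREF_WITH_SECTION = re.compile(r'(href="[^"]*?(section_[^."]*)\.dita")')


def add_keys(text):
    """Hypothetical helper: append keys="section_..." after each matching href."""
    return HREF_WITH_SECTION.sub(r'\1 keys="\2"', text)


sample = '<topicref href="xyz/workflows_id-section_exf_zaz_lo.dita"/>'
print(add_keys(sample))
# <topicref href="xyz/workflows_id-section_exf_zaz_lo.dita" keys="section_exf_zaz_lo"/>
```

This keeps the rest of the file byte-for-byte as it was, but it is not XML-aware and not idempotent: running it twice on the same file would add a second keys attribute.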

Answered 2021-08-19T14:11:04.280