python - Matching two files against each other and writing the output to a file - Python

Question

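I'm new to Python; this is only my second time writing code in it. The main purpose of this script is to take a text file containing thousands of lines of file names (the sNotUsed file) and match it against about 50 XML files. Each XML file may contain up to several thousand lines and is formatted like most XML. I'm not sure what is wrong with the code so far. It is not complete, since I haven't added the part that writes the output back to the XML files, but the current last line should print at least once. It doesn't.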

Examples of the two file formats follow.

Text file:

fileNameWithoutExtension1
fileNameWithoutExtension2
fileNameWithoutExtension3
etc.

XML file:

<blocks> 

<more stuff="name"> 
     <Tag2> 
        <Tag3 name="Tag3">
                 <!--COMMENT-->
                 <fileType>../../dir/fileNameWithoutExtension1</fileType>
                 <fileType>../../dir/fileNameWithoutExtension4</fileType>
</blocks>

My code so far:

import os
import re

sNotUsed=list()
sFile = open("C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open snotused txt file
for lines in sFile:
    sNotUsed.append(lines)
#sNotUsed = sFile.readlines() # read all lines and assign to list
sFile.close() # close file

xmlFiles=list() # list of xmlFiles in directory
usedS=list() # list of S files that do not match against sFile txt

search = "\w/([\w\-]+)"

# getting the list of xmlFiles
filelist=os.listdir('C:\Users\xxx\Desktop\dir')
for files in filelist:
    if files.endswith('.xml'):
        xmlFile = open(files, "r+") # open first file with read + write access
        xmlComp = xmlFile.readlines() # read lines and assign to list
        for lines in xmlComp: # iterate by line in list of lines
            temp = re.findall(search, lines)
            #print temp
            if temp:
                if temp[0] in sNotUsed:
                    print "yes" # debugging. I know there is at least one match for sure, but this is not being printed.

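To clarify: sorry, I guess my question wasn't very clear. I want the script to go through each XML file line by line and check whether the FILENAME part of the line exactly matches a line in the sNotUsed.txt file. If it matches, I want to remove that line from the XML. If the line does not match any line in sNotUsed.txt, I want it to be part of the newly modified XML file (which overwrites the old one). Please let me know if this is still unclear.

Edit, working code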

import os
import re
import codecs

sFile = open("C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open sNotUsed txt file
sNotUsed=sFile.readlines() # read all lines and assign to list
sFile.close() # close file

search = re.compile(r"\w/([\w\-]+)")

sNotUsed=[x.strip().replace(',','') for x in sNotUsed]
directory=r'C:\Users\xxx\Desktop\dir'
filelist=os.listdir(directory) # getting the list of xmlFiles
# for each file in the list
for files in filelist:
    if files.endswith('.xml'): # make sure it is an XML file
        xmlFile = codecs.open(os.path.join(directory, files), "r", encoding="UTF-8") # open first file with read
        xmlComp = xmlFile.readlines() # read lines and assign to list
        print xmlComp
        xmlFile.close() # closing the file since the lines have already been read and assigned to a variable
        xmlEdit = codecs.open(os.path.join(directory, files), "w", encoding="UTF-8") # opening the same file again and overwriting all existing lines
        for lines in xmlComp: # iterate by line in list of lines
            #headerInd = re.search(search, lines) # used to get the headers, comments, and ending blocks
            temp = re.findall(search, lines) # finds all strings that match the regular expression compiled above and makes a list for each
            if temp: # if the list is not empty
                if temp[0] not in sNotUsed: # if the first (and only) value in each list is not in the sNotUsed list
                    xmlEdit.write(lines) # write it in the file
            else: # if the list is empty
                xmlEdit.write(lines) # write it (used to preserve the beginning and ending blocks of the XML, as well as comments)

score 5 · Accepted Answer

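There is a lot to say, but I will try to keep it brief.

PEP8: Style Guide for Python Code

For local variables, you should use lowercase names with underscores. See PEP8: Style Guide for Python Code.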

File objects and the `with` statement

Use the with statement to open files; see File Objects: http://docs.python.org/2/library/stdtypes.html#bltin-file-objects
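A minimal sketch of the with statement, written for Python 3 (the thread itself uses Python 2). The temporary directory and sample contents are hypothetical and exist only so the example runs on its own.

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch is self-contained.
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, "sNotUsed.txt")
with open(sample_path, "w") as sample_file:
    sample_file.write("fileNameWithoutExtension1\nfileNameWithoutExtension2\n")

# The file is closed automatically when the block ends, even if an
# exception is raised inside it, so no explicit close() call is needed.
with open(sample_path, "r") as not_used_file:
    lines = not_used_file.readlines()

print(not_used_file.closed)
print(lines)
```

Note that readlines() keeps the trailing newline on each line, which matters for the matching step later in this answer.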

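Escaping Windows file names

Backslashes in Windows file names can cause problems in Python programs. You must either escape them with double backslashes or use a raw string.

For example, if your Windows file name is "dir\notUsed.txt", you should escape it like this: "dir\\notUsed.txt", or use the raw string r"dir\notUsed.txt". If you don't, "\n" will be interpreted as a newline!

Note: if you need to support Unicode file names, you can use a Unicode raw string: ur"dir\notUsed.txt" (the ur prefix exists in Python 2 only).

See also question 19065115 on Stack Overflow.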
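To make the difference concrete, here is a small sketch comparing the three spellings of the example file name above. The variable names are illustrative only.

```python
plain = "dir\notUsed.txt"      # "\n" is turned into a single newline character
escaped = "dir\\notUsed.txt"   # doubled backslash: a literal backslash, then "n"
raw = r"dir\notUsed.txt"       # raw string: backslashes are kept as written

print(len(plain), len(escaped), len(raw))
print("\n" in plain)    # the unescaped version contains a newline
print(escaped == raw)   # the two safe spellings are the same string
```

The unescaped string is one character shorter because the backslash and the "n" collapse into one newline.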

Store the file names in a set: a collection without duplicates, optimized for membership tests

not_used_path = ur"dir\sNotUsed.txt"
with open(not_used_path) as not_used_file:
    not_used_set = set([line.strip() for line in not_used_file])
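The strip() call is probably what fixes the original symptom: lines read from a file keep their trailing newline, so an exact comparison against the name extracted from the XML never succeeds. A short sketch with hypothetical in-memory data in place of the real file:

```python
# Lines as they come back from iterating over a text file: newline included.
raw_lines = ["fileNameWithoutExtension1\n", "fileNameWithoutExtension2\n"]
candidate = "fileNameWithoutExtension1"  # what the regular expression extracts

# Membership test against the unstripped lines fails.
found_in_raw = candidate in raw_lines

# Stripping while building the set makes the exact match work.
not_used_set = {line.strip() for line in raw_lines}
found_in_set = candidate in not_used_set

print(found_in_raw, found_in_set)
```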

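Compile your regular expression

Compiling a regular expression is more efficient when it is used many times. Again, you should use a raw string to avoid backslash interpretation.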

pattern = re.compile(r"\w/([\w\-]+)")
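As a quick check of what this pattern extracts, the sketch below applies it to lines taken from the XML sample in the question. The variable names are illustrative.

```python
import re

pattern = re.compile(r"\w/([\w\-]+)")

file_line = "<fileType>../../dir/fileNameWithoutExtension1</fileType>"
comment_line = "<!--COMMENT-->"
closing_line = "</blocks>"

# The pattern needs a word character before the slash, so it captures the
# name after "dir/" but ignores "../" and the slash in closing tags.
file_matches = pattern.findall(file_line)
comment_matches = pattern.findall(comment_line)
closing_matches = pattern.findall(closing_line)

print(file_matches, comment_matches, closing_matches)
```

Lines with no match (headers, comments, closing tags) yield an empty list, which is why the working code in the question writes them through unchanged.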

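Warning: the os.listdir() function returns a list of file names, not a list of full paths. See this function in the Python documentation.

In your example, you list 'C:\Users\xxx\Desktop\dir' with os.listdir(). You then try to open each XML file in this directory with open(files, "r+"). This is wrong unless your current working directory happens to be that directory. The classic approach is to use the os.path.join() function like this: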

desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    desktop_path = os.path.join(desktop_dir, filename)

If you want to extract the extension of a file name, you can use the os.path.splitext() function.

desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    if os.path.splitext(filename)[1].lower() != '.xml':
        continue
    desktop_path = os.path.join(desktop_dir, filename)

You can use a list comprehension to simplify this:

desktop_dir = r'C:\Users\xxx\Desktop\dir'
xml_list = [os.path.join(desktop_dir, filename)
            for filename in os.listdir(desktop_dir)
            if os.path.splitext(filename)[1].lower() == '.xml']
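The following self-contained sketch shows the same comprehension against a temporary directory, so the behavior of os.listdir() can be observed without the real Desktop folder. The directory and the three file names are hypothetical.

```python
import os
import tempfile

# Hypothetical directory with two XML files and one unrelated file.
work_dir = tempfile.mkdtemp()
for name in ("a.xml", "B.XML", "notes.txt"):
    with open(os.path.join(work_dir, name), "w") as handle:
        handle.write("")

# os.listdir() returns bare names, not paths.
names = os.listdir(work_dir)

# Joining with the directory gives paths that open() can use from anywhere;
# lower() makes the extension test case-insensitive.
xml_list = sorted(os.path.join(work_dir, filename)
                  for filename in names
                  if os.path.splitext(filename)[1].lower() == ".xml")

print(names)
print(xml_list)
```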

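Parsing XML files

How do you parse an XML file? That is a good question! There are several possibilities:

- use regular expressions: efficient but dangerous;
- use a SAX parser: also efficient, but confusing and hard to maintain;
- use a DOM parser: less efficient but clearer. Consider the lxml package (see http://lxml.de/).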

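Your current approach is dangerous because the way you read the file ignores the XML encoding. And that is bad, very bad indeed! XML files are usually encoded in UTF-8, so you should first decode the UTF-8 byte stream. A simple way to do that is to open the file with codecs.open(), which decodes as it reads.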

for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()

With this solution, the full XML content is stored in the variable content as a Unicode string. You can then use a Unicode regular expression to parse the content.
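For comparison, the tree-based option mentioned earlier can be sketched with the standard library's xml.etree.ElementTree, used here instead of lxml only to keep the example dependency-free. This is an illustration under assumptions, not the method used in this answer: the XML below is a hypothetical well-formed version of the question's sample (closing tags added), and the sketch is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed variant of the sample from the question.
xml_text = """<blocks>
  <more stuff="name">
    <Tag2>
      <Tag3 name="Tag3">
        <!--COMMENT-->
        <fileType>../../dir/fileNameWithoutExtension1</fileType>
        <fileType>../../dir/fileNameWithoutExtension4</fileType>
      </Tag3>
    </Tag2>
  </more>
</blocks>"""

not_used_set = {"fileNameWithoutExtension1", "fileNameWithoutExtension2"}

root = ET.fromstring(xml_text)

# ElementTree elements have no parent pointer, so visit every element as a
# potential parent and remove matching fileType children from it.
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag == "fileType" and child.text:
            base_name = child.text.rsplit("/", 1)[-1]
            if base_name in not_used_set:
                parent.remove(child)

remaining = [element.text for element in root.iter("fileType")]
print(remaining)
```

One trade-off to be aware of: the default ElementTree parser discards comments, so writing this tree back would lose the comment lines that the line-based approach preserves.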

Finally, you can use a set intersection to find out whether a given XML file contains names that also appear in the text file.

for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()
    actual_set = set(pattern.findall(content))
    print(not_used_set & actual_set)
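The intersection only reports the common names. For the goal stated in the question (dropping the matching lines and overwriting the file), the pieces above could be combined roughly as follows. This is a sketch for Python 3, where the built-in open() accepts an encoding argument; the helper names filter_xml_lines and rewrite_xml_file and the demo file are hypothetical.

```python
import os
import re
import tempfile

pattern = re.compile(r"\w/([\w\-]+)")


def filter_xml_lines(lines, not_used_set):
    """Return the lines whose extracted file name is not in not_used_set."""
    kept = []
    for line in lines:
        names = pattern.findall(line)
        # Lines without a match (headers, comments, closing tags) are kept.
        if names and names[0] in not_used_set:
            continue
        kept.append(line)
    return kept


def rewrite_xml_file(xml_path, not_used_set):
    """Overwrite xml_path without the unused lines; return how many were removed."""
    with open(xml_path, "r", encoding="utf-8") as xml_file:
        lines = xml_file.readlines()
    kept = filter_xml_lines(lines, not_used_set)
    with open(xml_path, "w", encoding="utf-8") as xml_file:
        xml_file.writelines(kept)
    return len(lines) - len(kept)


# Hypothetical demo file so the sketch can be run as-is.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.xml")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("<blocks>\n"
                 "<fileType>../../dir/fileNameWithoutExtension1</fileType>\n"
                 "<fileType>../../dir/fileNameWithoutExtension4</fileType>\n"
                 "</blocks>\n")

removed_count = rewrite_xml_file(demo_path, {"fileNameWithoutExtension1"})
with open(demo_path, "r", encoding="utf-8") as handle:
    result = handle.read()
print(removed_count)
print(result)
```

Reading the whole file before reopening it for writing avoids truncating the file while it is still being read. Keeping a backup copy before overwriting would be a sensible precaution on real data.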
