python - Processing a (binary?) file in Python and the Linux shell

Question

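I recently wrote a Python script that processes a Microsoft Windows DHCP server dump file and generates an XML file of the current reservations in the XML Spreadsheet format.

The script opens the file with Python's open(), iterates over it line by line (for line in file), and looks for the keyword reservedip. When the keyword is found, the line is split into fields with shlex.split().

However, when I run this script against a default dump file from the Microsoft DHCP server, I get no results. Note also that I cannot search the file with the Linux grep command either.

I then tried opening the file in gedit and saving it as a Unix text file. After doing that, I got results and was able to grep the file. However, this approach defeats the whole point of writing a script to automate my work.

I have been googling but have not found what I am looking for. I also tried opening the file in binary mode, but that did not help either.

I hope someone can help me solve this problem.

As requested, here is what the script does (the loop part, at least) and a sample of the DHCP server output:

Script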

import shlex

# Set up an empty dictionary to store the extracted records
records = {}

# Open the DHCP dump file (the file name must be a quoted string)
f = open("dhcp.txt", "r")

# Iterate over the file line by line
for line in f:

  # Only use lines containing the word "reservedip"
  if "reservedip" in line:

    # Split the line into fields on spaces (keeping quoted substrings together)
    field = shlex.split(line)

    # Add an entry for each record, using the 32-bit IP address int as its key
    records[addr_to_int(field[7])] = [field[7], field[8], field[9], field[10]]

*Note: addr_to_int is a function I wrote that converts a dotted IPv4 address to an integer.*
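The asker's addr_to_int is not shown in the post. For readers who want to run the loop above, here is a minimal sketch of what such a function might look like; the validation rules are an assumption, not the original implementation.

```python
def addr_to_int(addr):
    """Convert a dotted-quad IPv4 string such as '172.16.104.207' to an int.

    Hypothetical stand-in for the asker's unpublished helper.
    """
    octets = addr.split(".")
    if len(octets) != 4:
        raise ValueError("expected four octets, got %r" % (addr,))
    value = 0
    for octet in octets:
        part = int(octet)  # raises ValueError on non-numeric input
        if not 0 <= part <= 255:
            raise ValueError("octet out of range in %r" % (addr,))
        # Shift the accumulated value left one byte and append this octet
        value = (value << 8) | part
    return value
```

Because the most significant octet comes first, sorting records by this integer orders the reservations by address.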

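DHCP dump

Unfortunately, company policy prevents me from including a real DHCP server dump. However, the lines I am trying to extract from the file look like this:

Dhcp Server \\servername.company.local Scope 172.16.104.0 Add reservedip 172.16.104.207 003386dd00gg "hostname.company.local" "Host Description" "BOTH"

Thanks in advance, Pascal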

score 1 · Accepted Answer

One way to rule out end-of-line character problems is to use re to convert the line endings to Unix style:

import re

dhcp_file = open( path_to_dhcp_file, 'r' )
for line in dhcp_file:
    # Change end-of-line characters to UNIX style
    line = re.sub( "\r\n", r"\n", line )

    # now do your things on line
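As a regex-free variation on the snippet above, the trailing terminator can simply be stripped with str.rstrip. This is only a sketch, and strip_eol is a hypothetical helper name. Note that a leftover "\r" on its own would not stop the test "reservedip" in line from matching, so line endings may not be the whole story here.

```python
def strip_eol(line):
    """Remove any trailing CR/LF characters, whichever convention was used."""
    return line.rstrip("\r\n")


# A Windows-style line still contains the keyword before any cleanup
windows_line = 'Scope 172.16.104.0 Add reservedip 172.16.104.207\r\n'
keyword_found_before_cleanup = "reservedip" in windows_line
clean_line = strip_eol(windows_line)
```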

score 1 · Accepted Answer

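Based on the two lines you gave as an example of the DHCP dump file's contents, I built the following test case (for clarity in this example, I prefixed each line with l1, l2, l3, ..., referring to the line number).

So here is the dump file, data.txt, that I created on Linux Fedora Core 17 (x86_64):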

l1: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add reservedip 172.16.104.207 
l2: 003386dd00gg "hostname.company.local" "Host Description" "BOTH"
l3: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add reservedip 172.16.104.207 
l4: 003386dd00gg "hostname.company.local" "Host Description" "BOTH"
l5: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add  172.16.104.207 
l6: 003386dd00gg "hostname.company.local" "Host Description" "BOTH"
l7: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add  172.16.104.207 
l8: 003386dd00gg "hostname.company.local" "Host Description" "BOTH"
l9: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add reservedip 172.16.104.207 
l10: 003386dd00gg "hostname.company.local" "Host Description" "BOTH"

You said earlier:

Note also that I cannot search the file with the Linux grep command either.

Here is what I get when I run grep on the sample file above:

$ cat data.txt | grep reservedip
l1: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add reservedip 172.16.104.207 
l3: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add reservedip 172.16.104.207 
l9: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add reservedip 172.16.104.207 
$

And here is the test I ran with a Python script to check whether it can find the keyword "reservedip" in the sample file:

lineNumber = 0
with open("./data.txt") as dhcpDumpFile:
    for line in dhcpDumpFile:
        lineNumber += 1
        if "reservedip" in line:
            print("Found 'reservedip' at the line: ", lineNumber)

The result I get is:

$ python -tt myscript.py
("Found 'reservedip' at the line: ", 1)
("Found 'reservedip' at the line: ", 3)
("Found 'reservedip' at the line: ", 9)
$

So, it works for me.

Regards,

Dariyoosh

score 1 · Accepted Answer

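The strings in the file may be in an encoding that is not ASCII-compatible. UTF-8 and Latin-1 should be compatible, since they use a single byte for each ASCII character. UTF-16 and UTF-32 are not: they always use more than one byte per character. UTF-16 is not uncommon in MS files, and sometimes files are even mixed.

The dump may use 2 bytes per character, even for ASCII characters. In that case the file would contain r~e~s~e~r~v~e~d~i~p, where ~ is some other byte (the order could also be ~r, and even several ~ bytes could still accompany a single r).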
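To make the point above concrete, here is a small sketch of what the keyword looks like once encoded as UTF-16. The choice of little-endian UTF-16 is an assumption for illustration; the actual encoding of the dump is unknown.

```python
keyword = "reservedip"

# In UTF-16-LE every ASCII character is followed by a zero byte
utf16_bytes = keyword.encode("utf-16-le")
ascii_bytes = keyword.encode("ascii")

# A byte-oriented search (what grep or a naive read would do) misses it,
# because the zero bytes sit between the letters
found_in_raw_bytes = ascii_bytes in utf16_bytes

# After decoding with the right codec the keyword is visible again
found_after_decoding = keyword in utf16_bytes.decode("utf-16-le")
```

This would explain why both grep and the script found nothing, while the copy re-saved from gedit (presumably re-encoded on save) worked.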

Just a wild guess, since you are not allowed to post the actual file and I don't know anything about MS DHCP server dumps.

What does

file file.txt

give you?

What about

file --mime-type --mime-encoding

That won't necessarily tell you the encoding if it is a "mixed" binary/strings file, but if it is plain UTF/ASCII text, it should tell you.
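If file reports a UTF-16 or UTF-32 encoding, the script could pick the codec itself instead of relying on a manual re-save. The following is a rough sketch that only looks at the byte order mark; detect_bom_encoding and find_reserved_lines are hypothetical names, and a file without a BOM falls back to an assumed default of UTF-8.

```python
import codecs
import io


def detect_bom_encoding(head):
    """Guess a codec name from the first bytes of a file, or return None."""
    # UTF-32 must be checked first: its LE BOM begins with the UTF-16 LE BOM
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return None


def find_reserved_lines(path, keyword="reservedip", fallback="utf-8"):
    """Return the lines of the dump that contain the keyword."""
    with open(path, "rb") as raw:
        head = raw.read(4)
    encoding = detect_bom_encoding(head) or fallback
    # Text mode decodes the file and translates \r\n to \n while reading
    with io.open(path, "r", encoding=encoding) as dump:
        return [line.rstrip("\r\n") for line in dump if keyword in line]
```

A "mixed" file, as mentioned above, would not be handled by this approach and would need to be inspected by hand.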
