python - How to remove non-printable characters from a string?

Question

I am reading a Word file using the following code:

import win32com.client as win32

word = win32.dynamic.Dispatch("Word.Application")
word.Visible = 0
doc = word.Documents.Open(SigLexiconFilePath)

From the file I get strings that contain many non-printable characters:

str = "\xa0keine\xa0freigäbü\xa0\x0b\r\x07"

I tried to remove the non-printable characters with the following code:

import string 

str = "\xa0keine\xa0freigäbü\xa0\x0b\r\x07"
filtered_string = "".join(filter(lambda x:x in string.printable, str))

This gives me the following output:

keinefreigb\x0b\r

Other code I have tried:

str = str.split('\r')[0]
str = str.strip()

This gives me the following output:

keine\xa0freigäbü

How can I remove all of these non-printable characters with as little code as possible to get the desired output below:

keine freigäbü

score 1 · Accepted Answer

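An elegant, Pythonic way to strip "non-printable" characters from a string in Python is to use the isprintable() string method together with a generator expression or a list comprehension, depending on the use case, i.e. the size of the string: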

''.join(c for c in str if c.isprintable())

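This returns "keinefreigäbü". Note that \xa0 is not printable by this definition, so the separator between the two words is removed as well.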

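str.isprintable() returns True if all characters in the string are printable or the string is empty, and False otherwise. Non-printable characters are those defined in the Unicode character database as "Other" or "Separator", except the ASCII space (0x20), which is considered printable. (Note that printable characters in this context are those which should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)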
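Because isprintable() also rejects \xa0, the one-liner above glues the two words together. A minimal sketch of one way to reach the output asked for in the question, assuming whitespace-like characters should become ordinary spaces before the remaining non-printable characters are dropped (clean_text is a hypothetical helper name, not part of the original answer):

```python
def clean_text(text: str) -> str:
    """Hypothetical helper: keep word boundaries, drop non-printable characters."""
    # str.isspace() is True for \xa0, \x0b and \r, so map those to a plain space.
    spaced = "".join(" " if c.isspace() else c for c in text)
    # Drop whatever is still non-printable (here only \x07).
    printable = "".join(c for c in spaced if c.isprintable())
    # split() + join collapses runs of spaces and trims both ends.
    return " ".join(printable.split())


raw = "\xa0keine\xa0freig\xe4b\xfc\xa0\x0b\r\x07"
print(clean_text(raw))
```

The final split()/join step also collapses repeated spaces inside the text, which may or may not be wanted for other inputs.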

score 1 · Accepted Answer

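Most of these characters appear to be whitespace characters (\x07 is the BEL control character). You can try using Python's unicodedata module to convert some of them consistently into proper whitespace characters: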

>>> import unicodedata
>>> unicodedata.normalize("NFKD","\xa0keine\xa0freigäbü\xa0\x0b\r\x07")
' keine freigäbü \x0b\r\x07'

Then, if the set of characters you are trying to remove is small, you can get what you want with a chain of replace and strip calls.

>>> ' keine freigäbü \x0b\r\x07'.replace("\x0b"," ").replace("\r"," ").\
        replace("\x07"," ").strip()
'keine freigäbü'
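A possible generalisation of the replace chain, offered as a sketch rather than as part of the original answer: instead of listing each character, drop everything in the Unicode category "Cc" (control). It assumes NFKC instead of NFKD, because NFKD splits "ä" into "a" plus a combining diaeresis, while NFKC still maps \xa0 to a space but keeps the umlauts composed. strip_control_chars is a hypothetical name.

```python
import unicodedata


def strip_control_chars(text: str) -> str:
    """Hypothetical helper: normalise compatibility spaces, drop control characters."""
    # NFKC turns \xa0 into an ordinary space and leaves ä/ü as single code points.
    normalized = unicodedata.normalize("NFKC", text)
    # Category "Cc" covers \x0b, \r and \x07 (and also \n and \t).
    kept = "".join(c for c in normalized if unicodedata.category(c) != "Cc")
    return kept.strip()


raw = "\xa0keine\xa0freig\xe4b\xfc\xa0\x0b\r\x07"
print(strip_control_chars(raw))
```

Since newlines and tabs are control characters too, this variant is only suitable when line breaks inside the text do not need to be preserved.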

Hope this helps.

score 0 · Accepted Answer

Try this.

import re

def convert_tiny_str(x: str) -> str:
    r""" Taking into consideration this:

    > https://www.ascii-code.com/

    Citing: "The first 32 characters in the ASCII-table are unprintable control
    codes and are used to control peripherals such as printers."
    That is, from hex code 00 to hex code 1F, [00, 1F].

    In extended ASCII, the printable characters are listed
    from \x20 to \xFF in hexadecimal, [20, FF].

    Given that, one possible solution with regular expressions is:

    1- Replace every character except the printable ones with ''.

    2- At that point the character \xa0 is still part of the result,
    so replace it with ' '.
    """

    # Use the parameter x (not the global _str) so the function works on any input.
    _out = re.sub(r'[^\x20-\xff]', r'', x)
    # >> '\xa0keine\xa0freigäbü\xa0'

    return re.sub(r'\xa0', r' ', _out)


_str = "\xa0keine\xa0freigäbü\xa0\x0b\r\x07"
x = convert_tiny_str(_str)

print(x)
# >>' keine freigäbü '

Done.
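One caveat with the range \x20-\xff is that it still lets through DEL (\x7f) and the C1 control characters (\x80-\x9f). A tightened sketch, assuming the input is limited to Latin-1 and that surrounding spaces should be trimmed to match the output asked for in the question (strip_latin1_controls and _NON_PRINTABLE are hypothetical names):

```python
import re

# Keep printable ASCII (\x20-\x7e) and Latin-1 from \xa0 upwards;
# everything else, including \x7f-\x9f and code points above \xff, is removed.
_NON_PRINTABLE = re.compile(r"[^\x20-\x7e\xa0-\xff]")


def strip_latin1_controls(text: str) -> str:
    """Hypothetical helper: regex-based cleanup for Latin-1 text."""
    cleaned = _NON_PRINTABLE.sub("", text)
    # Turn non-breaking spaces into ordinary spaces, then trim the ends.
    return cleaned.replace("\xa0", " ").strip()


raw = "\xa0keine\xa0freig\xe4b\xfc\xa0\x0b\r\x07"
print(strip_latin1_controls(raw))
```

Like the original pattern, this drops any character outside Latin-1, so it is not a fit for text that may contain other scripts.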
