python-3.x - Can't remove \n and \t from a Python 3 string?

Question

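I have been trying to format a web page fetched from CL so that I can send it to my email, but this is what I get every time I try to remove the \n and \t: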

b'\n\n\n\t\n\t\n\t\n\t\n\t\n\t\n\n\n\n\t\n\n\n\t
\n\t\t\t
\n\t
\n\t\t
\n\t\t\t
\n 0 favorites\n
\n\n\t\t
\n\t\t
∨
\n\t\t
∧
\n\t\t
\n \n
\n
\n\t \tCL wenatchee all personals casual encounters\n
\n
\n\t\t
\n\t
\n
\n\n\t\t
\n\t\t\t
\n\t\n\t\t\n\t\n\n\n\nReply to: 59nv6-4031116628@pers.craigslist.org\n
\n\n\n\t
\n\t\n\t\tflag [?] :\n\t\t\n\t\t\tmiscategorized\n\t\t\n\t\t\tprohibited\n\t\t\n\t\t\tspam\n\t\t\n\t\t\tbest of\n\t\n
\n\n\t\t

Posted: 2013-08-28, 8:23AM PDT
\n
\n\n
\n \n Well... - w4m - 22 (Wenatchee)\n

I have tried strip, replace, and even a regex, but nothing works. The text always shows up in my email completely unaffected.

Here is the code:

try:
    if url.find('http://') == -1:
        url = 'http://wenatchee.craigslist.org' + url
    html = urlopen(url).read()
    html = str(html)
    html = re.sub('\s+',' ', html)
    print(html)
    part2 = MIMEText(html, 'html')
    msg.attach(part2)
    s = smtplib.SMTP('localhost')
    s.sendmail(me, you, msg.as_string())
    s.quit()
except Exception as e:
    print(e)

score 6 · Accepted Answer

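Your problem is that, despite all evidence to the contrary, you still have the contents of a bytes object rather than the str you were hoping for. Calling str() on bytes without specifying an encoding does not decode anything: it returns the printable representation, b'...', in which every newline is the two characters backslash and n. There is no real whitespace left for your regex, replace, or strip calls to match, so your attempts came to nothing.

What you need to do is decode the bytes first.

Personally, my favorite way to clean up whitespace is to use str.split and str.join. Here is a working example. It removes every run of whitespace of any kind and replaces it with a single space.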
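To make the difference concrete, here is a small offline sketch. The sample bytes are made up for illustration and only stand in for what urlopen(url).read() returns; UTF-8 is assumed as the encoding.

```python
# Hypothetical sample standing in for the bytes returned by urlopen(...).read()
raw = b"\n\tReply to:\n\t\tsomeone\n"

as_repr = str(raw)             # repr of the bytes object: "b'\\n\\tReply to:...'"
as_text = raw.decode("utf-8")  # real text containing real newlines and tabs

print(as_repr)
print("\n" in as_repr)    # is there a real newline in the repr?
print("\\n" in as_repr)   # is there a backslash followed by 'n'?
print("\n" in as_text)    # is there a real newline after decoding?
```

The repr contains no actual newline characters, only a literal backslash followed by n, which is why a whitespace pattern such as \s+ finds nothing to replace.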

from urllib.request import urlopen

try:
    html = urlopen('http://wenatchee.craigslist.org').read()
    html = html.decode("utf-8")  # Decode the bytes into a useful string
    # Now split the string over all whitespace, then join it together again.
    html = ' '.join(html.split())
    print(html)
except Exception as e:
    print(e)