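I'm working with an external API that sends me the text of HTML emails. The text arrives without the surrounding HTML structure (e.g. <html> ... </html>). I need to clean this text and post it to Slack. I've tried both BeautifulSoup and Bleach, and neither works, presumably because the input is only a partial HTML fragment.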

An example of the input text looks like this:

&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Bacon ipsum dolor amet cupim meatball ham hock pancetta ball tip ribeye cow brisket bresaola short ribs drumstick short loin. Turkey pastrami boudin andouille fatback tenderloin pork beef jowl rump hamburger buffalo capicola prosciutto. Meatball jerky pig filet mignon cow. Tenderloin flank tongue venison. Spare ribs fatback jerky pig boudin biltong filet mignon pancetta capicola.&lt;/div&gt;
&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Jerky salami brisket, landjaeger beef ribs meatball swine alcatra. Pork chop doner kielbasa jowl biltong tri-tip. Sausage sirloin prosciutto ribeye meatball capicola andouille picanha rump bacon turkey kevin pancetta landjaeger jowl. Spare ribs burgdoggen landjaeger buffalo capicola cow corned beef flank frankfurter boudin salami t-bone doner. Kevin filet mignon ribeye, pork belly andouille chuck pig drumstick. Short ribs tri-tip ball tip rump flank.&lt;/div&gt;
&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Pig biltong doner fatback. Tail hamburger kielbasa pastrami buffalo boudin cupim, pig jerky prosciutto venison pork chop chuck sirloin kevin. Bresaola bacon drumstick ball tip salami ribeye capicola beef ribs. Meatball tenderloin drumstick bresaola rump short ribs. Salami venison chuck burgdoggen.&lt;/div&gt;
&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Strip steak ham prosciutto, biltong meatball kielbasa boudin shankle ground round bacon. Alcatra short loin chuck shankle hamburger shank, buffalo sausage turkey prosciutto tongue kielbasa venison. Shank cow turducken beef ribs meatloaf pork belly. Pastrami leberkas ball tip pancetta short loin sirloin turducken rump hamburger cupim strip steak ground round brisket filet mignon pork. Beef shankle kevin tail picanha bacon beef ribs cow ground round pig ham rump. Bresaola spare ribs tenderloin pastrami, ham jowl short loin hamburger shankle tail venison pig meatloaf.&lt;/div&gt;

For the input above, I want the following output:

Bacon ipsum dolor amet cupim meatball ham hock pancetta ball tip ribeye cow brisket bresaola short ribs drumstick short loin. Turkey pastrami boudin andouille fatback tenderloin pork beef jowl rump hamburger buffalo capicola prosciutto. Meatball jerky pig filet mignon cow. Tenderloin flank tongue venison. Spare ribs fatback jerky pig boudin biltong filet mignon pancetta capicola.
Jerky salami brisket, landjaeger beef ribs meatball swine alcatra. Pork chop doner kielbasa jowl biltong tri-tip. Sausage sirloin prosciutto ribeye meatball capicola andouille picanha rump bacon turkey kevin pancetta landjaeger jowl. Spare ribs burgdoggen landjaeger buffalo capicola cow corned beef flank frankfurter boudin salami t-bone doner. Kevin filet mignon ribeye, pork belly andouille chuck pig drumstick. Short ribs tri-tip ball tip rump flank.
Pig biltong doner fatback. Tail hamburger kielbasa pastrami buffalo boudin cupim, pig jerky prosciutto venison pork chop chuck sirloin kevin. Bresaola bacon drumstick ball tip salami ribeye capicola beef ribs. Meatball tenderloin drumstick bresaola rump short ribs. Salami venison chuck burgdoggen.
Strip steak ham prosciutto, biltong meatball kielbasa boudin shankle ground round bacon. Alcatra short loin chuck shankle hamburger shank, buffalo sausage turkey prosciutto tongue kielbasa venison. Shank cow turducken beef ribs meatloaf pork belly. Pastrami leberkas ball tip pancetta short loin sirloin turducken rump hamburger cupim strip steak ground round brisket filet mignon pork. Beef shankle kevin tail picanha bacon beef ribs cow ground round pig ham rump. Bresaola spare ribs tenderloin pastrami, ham jowl short loin hamburger shankle tail venison pig meatloaf.

I used the following simple Bleach routine:

import bleach

def textify(html):
    # Sanitize the markup with Bleach's default settings
    text = bleach.clean(html)
    return text

With BeautifulSoup, I also used a few regular expressions to clean up the output:

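import re
from bs4 import BeautifulSoup

def textify(html):
    # Turn explicit line breaks into newlines before parsing
    html = re.sub('<br>', '\n', html)
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    # Manually undo a few HTML entities in the extracted text
    text = re.sub(r'\&lt;', '<', text)
    text = re.sub(r'\&gt\;', '>', text)
    text = re.sub(r'\&\#39\;', "'", text)
    return text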
1 Answer

Before passing the string to Bleach or BeautifulSoup, you first need to unescape it, using the standard library's html module:

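from html import unescape

import bleach
from bs4 import BeautifulSoup

html = "&lt;div style=&#39;bo...div&gt;"
unescaped_html = unescape(html)  # &lt; &gt; &#39; &quot; become real characters

# Option 1: Bleach. strip=True removes disallowed tags instead of re-escaping them.
text = bleach.clean(unescaped_html, strip=True)
# Option 2: BeautifulSoup. Naming the parser avoids the "no parser specified" warning.
soup = BeautifulSoup(unescaped_html, "html.parser")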
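
For readers who want to see the whole pipeline end to end, here is a minimal sketch of a `textify` function built on the unescape-first idea. It is not the code from the answer: it swaps Bleach and BeautifulSoup for the standard library's `html.parser`, so it runs without extra dependencies. The `_TextExtractor` class and the choice of which tags produce a line break (`div`, `p`, `br`) are assumptions made for illustration.

```python
from html import unescape
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text and adds newlines at block boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: <br> should become a line break in the output
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # Assumption: each closing </div> or </p> ends a line of output
        if tag in ("div", "p"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def textify(escaped_html):
    """Unescape entity-encoded markup, then drop the tags and keep the text."""
    markup = unescape(escaped_html)  # &lt;div ...&gt; becomes <div ...>
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Trim each line and drop the empty ones left between blocks
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


sample = (
    "&lt;div style=&#39;font-family:Georgia,&quot;Bitstream Charter&quot;,serif&#39;&gt;"
    "Bacon ipsum dolor amet.&lt;/div&gt;\n"
    "&lt;div style=&#39;font-size:16px&#39;&gt;Jerky salami brisket.&lt;/div&gt;"
)
print(textify(sample))
```

The order of the two steps is the point: if the tags are stripped first, the parser sees no tags at all, only entity text such as `&lt;div`, which is likely why the attempts in the question left the markup in place.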
answered 2019-09-20T17:24:58.117