python - Python: Better way to search and collect text strings from html. Strip off markdowns, tags, etc

Question

There are many modules like lxml, Beautiful soup, nltk and pyenchant to correctly filter out proper english words. But then what is the cleanest shortest way like html2text offers, also if markdowns could be stripped off as well (While I write, there are scores of possible similar questions on the right) There could be a universal regex which could take away all the html tags?

def word_parse(f):
    raw = nltk.clean_html(f) #f = url.content here, from "requests" module
    regex = r'[a-zA-Z]+' # | ^[a-zA-Z]+\b'
    match = re.compile(regex)
    ls = []
    for line in raw.split():
        for mat in line.split():
            try:
                v = match.match(mat).group()
                map(ls.append, v.split())
            except AttributeError, e:
                pass

Is there some good code snippet somebody could suggest? Can someone suggest a much cleaner and optimized code here?

score 2 · Accepted Answer

我强烈建议使用现有的库，而不是尝试为此编写自己的正则表达式。例如，其他人在 Beautiful Soup 中投入了大量工作，您不妨从中受益。

对于这种特定情况，Beautiful Soup 提供了get_text方法：

text = BeautifulSoup(f).get_text()

python - Python: Better way to search and collect text strings from html. Strip off markdowns, tags, etc

1 回答 1

Related

Reference