There are many modules like lxml, Beautiful Soup, nltk and pyenchant that can correctly filter out proper English words. But what is the cleanest, shortest way to do this, the way html2text offers? And could markdown be stripped off as well? (While I write this, there are scores of similar questions in the sidebar.) Could there be a universal regex that takes away all the HTML tags?
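The bluntest thing I can imagine is a substitution like the one below. I know a regex cannot parse arbitrary HTML, so this is only a sketch of the idea, not a real parser (strip_tags is just my own name for it):

    import re

    def strip_tags(html):
        # Naive: deletes anything that looks like a tag. Breaks on
        # unescaped '<' in text, <script> bodies, comments, etc.
        return re.sub(r'<[^>]+>', ' ', html)

Anyway, here is my current attempt: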
import re
import nltk

def word_parse(f):
    # f = url.content here, from the "requests" module
    raw = nltk.clean_html(f)  # strip HTML tags, leaving plain text
    match = re.compile(r'[a-zA-Z]+')
    ls = []
    for token in raw.split():   # split() already handles all whitespace
        m = match.match(token)  # None when the token has no leading letters
        if m:
            ls.append(m.group())
    return ls
Is there a good code snippet somebody could suggest, something cleaner and more optimized than this?
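For what it's worth, the shortest thing I have come up with myself pairs BeautifulSoup's get_text() with re.findall, assuming a hard dependency on BeautifulSoup is acceptable (words_from_html is my own name, not from any of the libraries above):

    import re
    from bs4 import BeautifulSoup

    def words_from_html(html):
        # get_text() flattens the whole document to plain text;
        # findall() then keeps only runs of ASCII letters.
        text = BeautifulSoup(html, 'html.parser').get_text()
        return re.findall(r'[a-zA-Z]+', text)

I am not sure, though, whether this is considered idiomatic, or how it compares to html2text.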