I am using Ubuntu 12.04 and Python 2.7.
My code for getting the contents of a given URL:
    import urllib

    def get_page(url):
        '''Gets the contents of a page from a given URL'''
        try:
            f = urllib.urlopen(url)
            page = f.read()
            f.close()
            return page
        except Exception:
            # Any failure to fetch or read the page yields an empty string
            return ""
To filter the contents of the page returned by get_page(url):
    import re

    def filterContents(content):
        '''Filters the content from a page'''
        filteredContent = ''
        regex = re.compile('(?<!script)[>](?![\s\#\'-<]).+?[<]')
        for words in regex.findall(content):
            word_list = split_string(words, """ ,"!-.()<>[]{};:?!-=/_`&""")
            # split_string returns a string, so this appends it character by character
            for word in word_list:
                filteredContent = filteredContent + word
        return filteredContent

    def split_string(source, splitlist):
        '''Replaces every character of source found in splitlist with a space'''
        return ''.join([w if w not in splitlist else ' ' for w in source])
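For reference, here is a small sketch of what split_string does to a single fragment. The sample text and the shortened delimiter set are made up for illustration and are not taken from a real page:

```python
def split_string(source, splitlist):
    # Replace every character of source that appears in splitlist with a space
    return ''.join([w if w not in splitlist else ' ' for w in source])

# Hypothetical fragment and delimiter set, for illustration only
cleaned = split_string("Hello, world!", " ,!")

# Splitting on whitespace afterwards gives the individual words
words = cleaned.split()
print(words)
```

So the delimiters become spaces rather than being removed, and the words can be recovered afterwards by splitting on whitespace.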
How can I index filteredContent in Xapian so that a query returns the URLs of the pages in which the query terms occur?