I am using lxml and etree to parse an HTML file. The code looks like this:
import urllib
from lxml import etree

def get_all_languages():
    allLanguages = "http://wold.livingsources.org/vocabulary"
    # read() returns the raw bytes of the page (Python 2 urllib)
    f = urllib.urlopen(allLanguages).read()
    #inFile = "imLog.xml"
    #html = f.read()
    #f.close()
    #encoding = chardet.detect(f)['encoding']
    #f.decode(encoding, 'replace').encode('utf-8')
    html = etree.HTML(f)
    #result = etree.tostring(html, pretty_print=True, method="html")
    #print result #is of type string
    #print type(result)
    return html
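For reference, `read()` hands the parser raw bytes, so the parser has to work out the encoding on its own. The following is a minimal Python 3 sketch of decoding those bytes explicitly before any parsing. It assumes the page is UTF-8, which the post does not confirm, and `decode_page` is a hypothetical helper, not part of lxml or urllib.

```python
def decode_page(raw, encoding="utf-8"):
    """Turn downloaded bytes into text, replacing undecodable bytes."""
    # Assumption: the page is UTF-8 unless the caller says otherwise.
    return raw.decode(encoding, errors="replace")


# Stand-in for urlopen(...).read(): UTF-8 bytes containing non-ASCII text.
raw = "<html><body><p>Español, Tiếng Việt</p></body></html>".encode("utf-8")
text = decode_page(raw)

# raw is a byte string, text is a text (unicode) string.
print(type(raw).__name__, type(text).__name__)
```

The decoded text can then be passed to `etree.HTML`. Alternatively, the raw bytes can be parsed with `etree.HTMLParser(encoding="utf-8")` as the `parser` argument, again assuming UTF-8 is the right encoding for this site.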
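Next, I extract some information from the page and store it in an array. When a string is appended to the array, its encoding or format changes; I think it becomes some kind of unicode object. I assumed this would not be a problem, because I take the strings back out of the array and print them to the output file.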
def print_file():
    #outputfile
    output = 'WOLDDictionaries.txt'
    dataLanguage = get_data_all_languages()
    dataDictionaries = get_site_single_lang()
    outputFile = open(output, "w")
    outputFile.flush()
    for index, array in enumerate(dataLanguage):
        indexLang = index
        for item in array:
            string = item
            #indexLang = index
            outputFile.write(string + "\t")
        outputFile.write("\n")
        #outputFile.flush()
        for index, array in enumerate(dataDictionaries):
            #stringArray = str(array)
            indexDic = index
            #outputFile.write(index + stringArray + "\t")
            if(indexLang == indexDic):
                #outputFile.write(string + "\t")
                for data in array:
                    #stringData = str(data)
                    #outputFile.write(stringData + "\t")
                    for word in data:
                        stringWord = word
                        outputFile.write(stringWord + "\t")
                    outputFile.write("\n")
                    #outputFile.flush()
    outputFile.close()
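Well, that assumption was wrong. When the data is written to the file, the encoding is still wrong. What can I do to get the correct characters?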
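One common approach to this kind of problem is to keep everything as text inside the program and state the encoding exactly once, when the output file is opened. The following is a minimal Python 3 sketch of that idea, not a confirmed fix for this script. `write_rows` and the sample rows are hypothetical, and UTF-8 is an assumed target encoding.

```python
import os
import tempfile


def write_rows(path, rows):
    """Write each row as one tab-separated line, encoded as UTF-8."""
    # The encoding is stated once, at open(); write() then accepts text.
    with open(path, "w", encoding="utf-8") as out:
        for row in rows:
            out.write("\t".join(row) + "\n")


# Hypothetical sample data standing in for the scraped language rows.
rows = [["Swahili", "Kiswahili"], ["Vietnamese", "Tiếng Việt"]]
path = os.path.join(tempfile.mkdtemp(), "WOLDDictionaries.txt")
write_rows(path, rows)

# Read the file back with the same encoding to check the round trip.
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

Unlike the loop in `print_file`, this joins the items with tabs rather than appending a tab after every item, so lines have no trailing tab. In Python 2, `io.open` accepts the same `encoding` argument, but it expects unicode objects in `write()`.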