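I have Python code that indexes a text file containing Arabic words. I tested the code on English text and it works fine, but it raises an error when I run it on Arabic text. Note: the text file is saved with Unicode encoding, not ANSI.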

Here is my code:

from whoosh import fields, index
import os.path
import csv
import codecs
from whoosh.qparser import QueryParser

# This list associates a name with each position in a row
columns = ["juza","chapter","verse","voc"]

schema = fields.Schema(juza=fields.NUMERIC,
                       chapter=fields.NUMERIC,
                       verse=fields.NUMERIC,
                       voc=fields.TEXT)

# Create the Whoosh index
indexname = "indexdir"
if not os.path.exists(indexname):
  os.mkdir(indexname)
ix = index.create_in(indexname, schema)

# Open a writer for the index
with ix.writer() as writer:
  with open("h.txt", 'r') as txtfile:
    lines=txtfile.readlines()

    # Read each row in the file
    for i in lines:

      # Create a dictionary to hold the document values for this row
      doc = {}
      thisline=i.split()
      u=0

      # Read the values for the row enumerated like
      # (0, "juza"), (1, "chapter"), etc.
      for w in thisline: 
        # Get the field name from the "columns" list
          fieldname = columns[u]
          u+=1
          #if isinstance(w, basestring):
          #     w = unicode(w)
          doc[fieldname] = w
      # Pass the dictionary to the add_document method
      writer.add_document(**doc)
with ix.searcher() as searcher:
    query = QueryParser("voc", ix.schema).parse(u"بسم")
    results = searcher.search(query)
    print(len(results))
    print(results[1]) 

The error is:

Traceback (most recent call last):
  File "C:\Python27\yarab.py", line 38, in <module>
    fieldname = columns[u]
IndexError: list index out of range

Here is a sample of the file:

1   1   1   كتاب
1   1   2   قرأ
1   1   3   لعب
1   1   4   كتاب
2 Answers

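I don't see anything obviously wrong, but I would make sure you design for errors. Catch any case where split() returns more elements than expected and handle it right away (for example, print the offending line and terminate). It looks like you may be dealing with malformed data.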
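
A minimal sketch of that check, assuming each line is meant to hold exactly the four whitespace-separated fields from the question's columns list. The helper name parse_line and the wording of the error message are hypothetical, not part of the original script:

```python
# -*- coding: utf-8 -*-
# Field names, in the order used by the question's "columns" list.
COLUMNS = ["juza", "chapter", "verse", "voc"]


def parse_line(line, lineno):
    """Split one line into a dict keyed by COLUMNS.

    Raises ValueError (hypothetical message format) when the number of
    fields differs from len(COLUMNS), instead of failing later with an
    IndexError on columns[u].
    """
    parts = line.split()
    if len(parts) != len(COLUMNS):
        raise ValueError(
            "line %d: expected %d fields, got %d: %r"
            % (lineno, len(COLUMNS), len(parts), line)
        )
    return dict(zip(COLUMNS, parts))
```

Calling parse_line(i, n) inside the loop over lines, and printing the exception before stopping, shows exactly which line in h.txt splits into an unexpected number of fields.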

answered 2013-02-21T16:35:02.777


Your script is missing the source encoding declaration. The first line should be:

# encoding: utf-8

Also open the file with an explicit Unicode encoding:

import codecs 
with codecs.open("s.txt",encoding='utf-8') as txtfile:
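
As a slightly fuller sketch of the same idea, the reader below decodes the file explicitly and splits each line into fields. The function name read_rows and the default encoding are assumptions; "utf-8-sig" is chosen so that a leading byte order mark, if present, does not end up glued to the first field. If the file was saved with Notepad's "Unicode" option it may actually be UTF-16, in which case passing encoding="utf-16" would be the thing to try:

```python
# -*- coding: utf-8 -*-
import io


def read_rows(path, encoding="utf-8-sig"):
    """Return the file's non-blank lines as lists of decoded text fields.

    io.open works on both Python 2.7 and Python 3 and decodes while
    reading, so every field is text rather than raw bytes.
    """
    rows = []
    with io.open(path, "r", encoding=encoding) as txtfile:
        for line in txtfile:
            fields = line.split()
            if fields:  # skip blank lines
                rows.append(fields)
    return rows
```

Because the fields come back as decoded text, they can be checked for the expected count and converted (for example int() for the numeric columns) before being passed to writer.add_document.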
answered 2015-08-24T12:58:54.523