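I have a file with three columns (separated by \t; the first column is the word, the second is the lemma, and the third is the tag). Some lines consist only of a period or a comma.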
<doc n=1 id="CMP/94/10">
<head p="80%">
Customs customs tag1
union union tag2
in in tag3
danger danger tag4
of of tag5
the the tag6
</head>
<head p="80%">
New new tag7
restrictions restriction tag8
in in tag3
the the tag6
.
Hi hi tag8
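For reference, here is a minimal sketch of how a file in this layout might be read into (word, lemma, tag) tuples. It is written for Python 3. `parse_vertical` is a hypothetical helper name, and treating a punctuation-only line as its own lemma with an empty tag is my assumption, not something the format dictates.

```python
def parse_vertical(lines):
    """Yield (word, lemma, tag) tuples from tab-separated corpus lines.

    Hypothetical helper. Markup lines starting with "<" and blank lines are
    skipped. A line with fewer than three columns (e.g. "." or ",") is
    returned with the token as its own lemma and an empty tag (assumption).
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<"):
            continue
        columns = line.split("\t")
        if len(columns) >= 3:
            yield columns[0], columns[1], columns[2]
        else:
            yield columns[0], columns[0], ""


# Usage with a few lines shaped like the sample above.
rows = list(parse_vertical(['<head p="80%">\n', "New\tnew\ttag7\n", ".\n"]))
print(rows)
```

The same function also accepts an open file object, since iterating over a file yields its lines.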
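Suppose the user searches for the lemma "in". I want the frequency of "in" and the frequencies of the lemmas that occur directly before and after "in". So I want the frequencies of "union", "danger", "restriction" and "the" in the whole corpus. The result should be: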
union 1
danger 1
restriction 1
the 2
How can I do this? I tried using lemma_counter = {}, but it doesn't work.
I'm not experienced with Python, so please correct me if I've made any mistakes.
c = open("corpus.vert")
corpus = []
for line in c:
    if not line.startswith("<"):
        corpus.append(line)
lemma = raw_input("Lemma you are looking for: ")
counter = 0
lemmas_before_after = []
for i in range(len(corpus)):
    parsed_line = corpus[i].split("\t")
    if len(parsed_line) > 1:
        if parsed_line[1] == lemma:
            counter += 1 #this counts lemma frequency
            new_list = []
            for j in range(i-1, i+2):
                if j < len(corpus) and j >= 0:
                    parsed_line_with_context = corpus[j].split("\t")
                    found_lemma = parsed_line_with_context[0].replace("\n","")
                    if len(parsed_line_with_context) > 1:
                        if lemma != parsed_line_with_context[1].replace("\n",""):
                            lemmas_before_after.append(found_lemma)
                    else:
                        lemmas_before_after.append(found_lemma)
print "list of lemmas ", lemmas_before_after
lemma_counter = {}
for i in range(len(corpus)):
    for lemma in lemmas_before_after:
        if parsed_line[1] == lemma:
            if lemma in lemma_counter:
                lemma_counter[lemma] += 1
            else:
                lemma_counter[lemma] = 1
print lemma_counter
fA = counter
print "lemma frequency: ", fA
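Two things in the attempt above look like the likely cause. The context loop stores column 0 (the word form) instead of column 1 (the lemma), and the final loop never re-parses corpus[i], so parsed_line still holds whatever the previous loop parsed last. Below is a minimal sketch of one way to get the desired counts, written for Python 3 (the attempt above uses Python 2 syntax) and using collections.Counter instead of a hand-maintained dictionary. The names `lemma_of` and `neighbour_frequencies` are hypothetical, and treating a punctuation-only line as its own lemma is an assumption.

```python
from collections import Counter


def lemma_of(line):
    """Return the lemma (second column) of a tab-separated corpus line.

    Assumption: a single-column line such as "." or "," is its own lemma.
    """
    columns = line.rstrip("\n").split("\t")
    return columns[1] if len(columns) > 1 else columns[0]


def neighbour_frequencies(lines, target):
    """Return (frequency of target, {neighbouring lemma: corpus frequency})."""
    # Drop markup lines such as <doc ...> and blank lines; keep only lemmas.
    lemmas = [lemma_of(l) for l in lines if l.strip() and not l.startswith("<")]

    # Frequency of every lemma in the whole corpus.
    totals = Counter(lemmas)

    # Collect the distinct lemmas directly before or after each hit.
    neighbours = set()
    for i, lemma in enumerate(lemmas):
        if lemma != target:
            continue
        if i > 0:
            neighbours.add(lemmas[i - 1])
        if i + 1 < len(lemmas):
            neighbours.add(lemmas[i + 1])
    neighbours.discard(target)  # the target itself is reported separately

    return totals[target], {n: totals[n] for n in neighbours}


# The sample from the question, with real tab separators.
sample = [
    '<doc n=1 id="CMP/94/10">\n', '<head p="80%">\n',
    "Customs\tcustoms\ttag1\n", "union\tunion\ttag2\n", "in\tin\ttag3\n",
    "danger\tdanger\ttag4\n", "of\tof\ttag5\n", "the\tthe\ttag6\n",
    "</head>\n", '<head p="80%">\n',
    "New\tnew\ttag7\n", "restrictions\trestriction\ttag8\n", "in\tin\ttag3\n",
    "the\tthe\ttag6\n", ".\n", "Hi\thi\ttag8\n",
]

frequency, context = neighbour_frequencies(sample, "in")
print("lemma frequency:", frequency)
for lemma, count in sorted(context.items()):
    print(lemma, count)
```

To run this on the real corpus, the list could be replaced by the open file, for example `with open("corpus.vert") as f: neighbour_frequencies(f, "in")`. As in the attempt above, markup lines are removed before neighbours are determined, so a word at the start of one <head> block counts as adjacent to the last word of the previous block.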