python - Creating a frequency table that captures popular substrings within strings of a given length - Python

Question

I'm trying to run a frequency analysis on a Swahili corpus that I'm compiling. At the moment, this is what I have:

import os
import re
from collections import Counter

# Directory holding the corpus files (backslashes escaped explicitly)
path = 'C:\\Python27\\corpus\\'
cnt = Counter()
listing = os.listdir(path)
for infile in listing:
    print("Currently parsing: " + path + infile)
    corpus = open(path + infile, "r")
    for lines in corpus:
        # split() with no argument splits on any whitespace and drops the newline
        for words in lines.split():
            # count only purely alphabetic tokens of at least two letters
            if len(words) >= 2 and re.match("^[A-Za-z]*$", words):
                cnt[words] += 1
    print("Completed parsing: " + path + infile)
    corpus.close()
for (counter, content) in enumerate(cnt.most_common(1000)):
    print(str(counter+1) + " " + str(content))

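So this program iterates over all the files in the given path, reads in the text of each file, and displays the 1000 most frequently used words. Here's the problem: Swahili is an agglutinative language, which means that infixes, suffixes, and prefixes are added to words to convey things like tense, causation, subjunctive mood, prepositions, and so on.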

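So a verb root like '-fanya', meaning 'to do', could appear as nitakufanya, 'I am going to do you'. As a result, the frequency list is biased towards connecting words that don't take these infixes, such as 'for', 'in', and 'out'.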

Is there a simple way to look at words like 'nitakufanya' or 'tunafanya' and include the word 'fanya' in the count total?

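Some potential things to look at:

The verb root will be at the end of the word
The subject marker at the beginning of the word can be one of: 'ni' (I), 'u' (you), 'a' (he/she), 'wa' (they), 'tu' (we), 'm' (you all)
The subject marker is followed by a tense marker, which is one of: 'na' (present), 'li' (past), 'ta' (future), 'ji' (reflexive), 'nge' (conditional)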

Thanks

score 0 · Accepted Answer

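Do the frequency analysis first without worrying about the prefixes. Then fix the prefixes in the frequency list. To do that, it is easier to sort the list by word, so that words with the same prefix end up next to each other. That makes pruning by hand quite easy.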
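
A minimal sketch of that sorting step, written for Python 3 and assuming the counts are already in a `Counter` like `cnt` from the question. The sample counts are made up for illustration, and `sorted_by_word` is a hypothetical helper name.

```python
from collections import Counter

def sorted_by_word(cnt):
    """Return (word, count) pairs in alphabetical order of the word."""
    # Words are unique keys, so sorting the pairs orders them by word.
    return sorted(cnt.items())

# Made-up counts, only to show the shape of the output.
cnt = Counter({"tunafanya": 3, "nitafanya": 2, "ninafanya": 5, "kwa": 9})

for word, count in sorted_by_word(cnt):
    print(word, count)
```

Alphabetical order groups words that share a prefix. Since the question says the verb root sits at the end of the word, sorting on the reversed word (`key=lambda item: item[0][::-1]`) might be worth trying as well, since it puts words with the same ending next to each other.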

score 0 · Accepted Answer

You could do:

root_words = [re.sub(
    '^(ni|u|a|wa|tu|m)(na|li|ta|ji|nge)',
    '', word) for word in words]

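This strips the prefix from each word, but if a root word itself begins with one of these sequences, there is not much you can do about it.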
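
As a sketch of how this substitution could feed the frequency count (Python 3; `strip_markers`, `count_roots`, and the sample word list are hypothetical names and data, not part of the answer): each word is counted under whatever remains after one subject marker plus one tense marker is removed. The minimum remaining length is an added assumption, meant to leave alone short words that only happen to start with those letters.

```python
import re
from collections import Counter

# One subject marker followed by one tense marker, as listed in the question.
PREFIX = re.compile(r'^(ni|u|a|wa|tu|m)(na|li|ta|ji|nge)')

def strip_markers(word, min_root=3):
    """Remove a leading subject+tense marker pair if enough of the word remains.

    min_root is an assumed threshold, not part of the original answer.
    """
    root = PREFIX.sub('', word)
    return root if len(root) >= min_root else word

def count_roots(words):
    """Count words by their stripped form."""
    return Counter(strip_markers(w) for w in words)

# Illustrative input only.
sample = ["tunafanya", "ninafanya", "nitakufanya", "kwa", "ulimi"]
roots = count_roots(sample)
print(roots.most_common())
```

Here `tunafanya` and `ninafanya` both count toward `fanya`, while `nitakufanya` comes out as `kufanya` because the pattern does not cover the `ku` between the tense marker and the root. The length guard keeps `ulimi` whole instead of reducing it to `mi`, but it cannot catch longer words that start with the same letters by coincidence.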
