python - How do I accurately pick out the right string in my list of utterances?

Question

I'm writing a Python script using NumPy, Pandas, and NLTK to pull utterances from the Providence corpus in the CHILDES database.

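For reference, the idea of my script is to populate a dataframe for each child in the corpus with their name, the utterances containing the linguistic feature I'm looking for (types of negation), their age when they said it, and their MLU at the time.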

Great.

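Now, once the dataframe has been populated with this information, the user will be able to go in and tag each utterance as a specific category. The console will print the utterance they're tagging with one line of context on either side (if they only see that the child said "no", it's hard to tell what was meant without seeing what Mom said just before or what someone said afterwards).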

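So the tricky part for me is getting the context. I've already set up other methods in the program to make all of this happen, but I'd like you to look at part of one of the methods used to initially populate the dataframe, because the following line, "if line == line_context:", gives me about 91 false positives!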

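I know why. I'm making a temporary copy of each file, line by line, so that for each utterance that ends up having a negation, the index of that utterance in the child's dataframe is used as a key in a HashMap (or dict in Python). The key maps to a list of three strings (well, lists of strings, since that's how CHILDESCorpusReader gives me sentences): the utterance, the utterance before it, and the utterance after it.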

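So I have that problematic line "if line == line_context" to check, while iterating through the list of lists of strings, whether the current item lines up with "line" (the child's utterance currently being iterated over), so that I can get the indices to match later.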

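The problem is that many of these "sentences" are the same sequence of characters (['no'] by itself shows up a lot!), so my program thinks they're the same, sees that there is a negation, and saves it to the dataframe. But it saves it every time it finds an instance of ['no'] in my copy of the file's utterances that is identical to one of the lines in the child-only list for that file, so I get about 91 extra instances of the same thing!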
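To show the duplicate problem in isolation, here is a minimal sketch with made-up data (not taken from the corpus). It assumes only that each utterance is a list of tokens, as described above. Comparing by value cannot tell two identical utterances apart, while carrying the position along with enumerate can:

```python
# Toy data standing in for one transcript; each utterance is a list of tokens.
utterance_list = [["no"], ["want", "juice"], ["no"], ["no", "more"]]

# list.index compares by value and returns the FIRST equal item,
# so both bare ['no'] utterances resolve to position 0.
by_value = [utterance_list.index(u) for u in utterance_list if u == ["no"]]

# enumerate carries the real position along with each item.
by_position = [i for i, u in enumerate(utterance_list) if u == ["no"]]

print(by_value)     # [0, 0]
print(by_position)  # [0, 2]
```

The same effect is what makes utterance_list.index(line_context) point at the first matching utterance rather than the one actually being visited.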

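Ugh! Anyway, is there any way I can get something like "if line == line_context" to pick out a single instance of ['no'] in the file, so that I know I'm at the same point in the file on both sides? I'm using the NLTK CHILDESCorpusReader, which doesn't seem to have resources for this kind of thing (otherwise I wouldn't have to use this ridiculously roundabout way to get context!).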
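One possible way around the missing speaker information is to read the transcript XML directly, so that every utterance keeps both its speaker and its position. The sketch below is not part of CHILDESCorpusReader. It assumes the files use the TalkBank namespace, with each utterance stored in a u element carrying a who attribute and its words in w elements. The function name read_utterances and the sample transcript are hypothetical, and taking only the leading text of each w element is a simplification that real transcripts may need to refine:

```python
import io
import xml.etree.ElementTree as ET

# Namespace used by TalkBank XML transcripts (assumed to match these files).
NS = "{http://www.talkbank.org/ns/talkbank}"


def read_utterances(source):
    """Return [(speaker, tokens), ...] in transcript order.

    `source` is a file path or a file-like object. Only the leading text
    of each <w> element is used, which is a simplification.
    """
    root = ET.parse(source).getroot()
    utterances = []
    for u in root.iter(NS + "u"):
        tokens = [w.text.strip() for w in u.iter(NS + "w")
                  if w.text and w.text.strip()]
        utterances.append((u.get("who"), tokens))
    return utterances


# Tiny made-up transcript, for demonstration only.
SAMPLE = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="MOT" uID="u0"><w>want</w><w>juice</w></u>
  <u who="CHI" uID="u1"><w>no</w></u>
  <u who="MOT" uID="u2"><w>no</w></u>
</CHAT>"""

utterances = read_utterances(io.StringIO(SAMPLE))

# Positions of the child's utterances within the full transcript.
child_positions = [i for i, (who, _) in enumerate(utterances) if who == "CHI"]
print(utterances)
print(child_positions)
```

With positions in hand, the lines before and after a child's utterance can be looked up by index, without any string comparison.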

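Maybe there's a way that, as I iterate through the utterance_list I make for each file, once an utterance matches the child's utterance I'm also iterating over, I can change and/or delete that item in utterance_list, so as to prevent it from giving me a false positive another 91 or so times?!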
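The "use each item only once" idea can be sketched without deleting anything, by keeping a cursor that only moves forward. This assumes the child's utterances appear in the full list in the same order, which should hold if both come from the same file. The function name align_in_order and the data are hypothetical. One caveat: if another speaker says an identical utterance shortly before the child does, this greedy walk attaches to the earlier one, so it removes the repeated matches but does not guarantee the right speaker:

```python
def align_in_order(all_utterances, child_utterances):
    """Map each child utterance to one position in all_utterances.

    Walks both lists once; every position is used at most once.
    """
    positions = []
    cursor = 0  # next unconsumed position in all_utterances
    for child in child_utterances:
        while cursor < len(all_utterances) and all_utterances[cursor] != child:
            cursor += 1
        if cursor == len(all_utterances):
            raise ValueError(f"no remaining match for {child!r}")
        positions.append(cursor)
        cursor += 1  # consume the match so it cannot be reused
    return positions


# Made-up data: ['no'] occurs three times in the full list.
all_utts = [["want", "juice"], ["no"], ["okay"], ["no"], ["no", "more"], ["no"]]
child_utts = [["no"], ["no", "more"], ["no"]]

print(align_in_order(all_utts, child_utts))  # [1, 4, 5]
```

Each child utterance ends up with exactly one position, so the number of rows matches the number of child utterances instead of multiplying with every duplicate.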

Thanks.

Here's the code (I added some extra comments that will hopefully help you understand exactly what each line is supposed to do):

for file in value_corpus.fileids(): #iterates through the .xml files in the corpus_map
    for line_total in value_corpus.sents(fileids=file, speaker='ALL'): #creates a copy of the utterances by all speakers
        utterance_list.append(line_total) #adds each line from the file to the list
    for line_context in utterance_list: #iterates through the newly created list
        for line in value_corpus.sents(fileids=file, speaker='CHI'): #checks through the original file's list of children's utterances
            #tries to make sure that the utterance in my utterance_list and the child's utterance are the same exact sentence
            #BUGGY (many lines are the same --> false positives)
            if line == line_context:
                for syntax_type in syntax_types: #iterates through the negation syntactic types
                    if syntax_type in line: #if the line contains a negation
                        value_df.iat[i,5] = syntax_type #populates the "Syntactic Type" column
                        value_df.iat[i,3] = line #populates the "Utterance" column
                        MLU = str(value_corpus.MLU(fileids=file, speaker='CHI'))
                        MLU = "".join(MLU)
                        value_df.iat[i,2] = MLU #populates the "MLU" column
                        value_df.iat[i,1] = value_corpus.age(fileids=file, speaker='CHI',month=True) #populates the "Ages" column
                        utterance_index = utterance_list.index(line_context) #returns the FIRST equal item, not necessarily this one
                        if utterance_index > 0: #index -1 would wrap to the last item instead of raising IndexError
                            before_line = utterance_list[utterance_index - 1]
                        else: #if no line before, doesn't look for context
                            before_line = utterance_list[utterance_index]
                        try:
                            after_line = utterance_list[utterance_index + 1]
                        except IndexError: #if no line after, doesn't look for context
                            after_line = utterance_list[utterance_index]
                        value_dict[i] = [before_line, line, after_line] #saved for every match, not only when there is no line after
                        i = i + 1 #iterates to next row in "Utterance" column of df
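When fetching the neighbouring lines, note that a negative list index in Python wraps around to the end of the list rather than raising IndexError, so the first utterance in a file needs an explicit check. A small sketch with made-up data follows. The helper name context_window is hypothetical, and using None for a missing neighbour is just one possible convention:

```python
def context_window(utterances, index):
    """Return [before, current, after]; None marks a missing neighbour."""
    before = utterances[index - 1] if index > 0 else None
    after = utterances[index + 1] if index + 1 < len(utterances) else None
    return [before, utterances[index], after]


utterance_list = [["no"], ["want", "juice"], ["all", "gone"]]

# A negative index does not raise IndexError; it wraps to the end of the list.
print(utterance_list[0 - 1])              # ['all', 'gone']
print(context_window(utterance_list, 0))  # [None, ['no'], ['want', 'juice']]
print(context_window(utterance_list, 2))  # [['want', 'juice'], ['all', 'gone'], None]
```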
