3

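I am writing a Python script that I can't seem to get right. It takes two inputs:

  1. A data file
  2. A stop file

The data file consists of 4 tab-separated columns, which are sorted. The stop file consists of a list of words, also sorted.

The goal of the script is:

  • If the string in column 1 of the data file matches a string in the stop file, delete the entire line.

Here is an example of the data file: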

abandonment-n   after+n-the+n-a-j   stop-n  1
abandonment-n   against+n-the+ns    leave-n 1
cake-n  against+n-the+vg    rest-v  1
abandonment-n   as+n-a+vd   require-v   1
abandonment-n   as+n-a-j+vg-up  use-v   1

Here is an example of the stop file:

apple-n
banana-n
cake-n
pigeon-n

Here is my code so far:

with open("input1", "rb") as oIndexFile:
        for line in oIndexFile: 
            lemma = line.split()
            #print lemma

with open ("input2", "rb") as oSenseFile:
    with open("output", "wb") as oOutFile:
        for line in oSenseFile:
            concept, slot, filler, freq = line.split()
            nounsInterest = [concept, slot, filler, freq]
            #print concept
            if concept != lemma:
                outstring = '\t'.join(nounsInterest)
                oOutFile.write(outstring + '\n')
            else: 
                pass

The desired output is as follows:

abandonment-n   after+n-the+n-a-j-stop-n    1
abandonment-n   against+n-the+ns-leave-n    1
abandonment-n   as+n-a+vd-require-v 1
abandonment-n   as+n-a-j+vg-up-use-v    1

Any insights?

As of now, the output I get is the following, which is basically just a printout of the input:

abandonment-n   after+n-the+n-a-j   stop-n  1
abandonment-n   against+n-the+ns    leave-n 1
cake-n  against+n-the+vg    rest-v  1
abandonment-n   as+n-a+vd   require-v   1
abandonment-n   as+n-a-j+vg-up  use-v   1

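*** Some things I have tried that still do not work:

Instead of if concept != lemma: I first tried if concept not in lemma:

This produces the same output as shown above.

I also suspected that the script was not reading the first input file at all, but even after merging it into the code like this: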

with open ("input2", "rb") as oSenseFile:
    with open("tinput1", "rb") as oIndexFile:
        for line in oIndexFile: 
            lemma = line.split()
            with open("out", "wb") as oOutFile:
                for line in oSenseFile:
                    concept, slot, filler, freq = line.split()
                    nounsInterest = [concept, slot, filler, freq]
                    if concept not in lemma:
                        outstring = '\t'.join(nounsInterest)
                        oOutFile.write(outstring + '\n')
                    else: 
                        pass

This produces a blank output file.

I also tried a different approach, as follows:

filename = "input1.txt" 
filename2 = "input2.txt"
filename3 = "output1"

def fixup(filename): 
    fin1 = open(filename) 
    fin2 = open(filename2, "r")
    fout = open(filename3, "w") 
    for word in filename: 
        words = word.split()
    for line in filename2:
        concept, slot, filler, freq = line.split()
        nounsInterest = [concept, slot, filler, freq]
        if True in [concept in line for word in toRemove]:
            pass
        else:
            outstring = '\t'.join(nounsInterest)
            fout.write(outstring + '\n')
    fin1.close() 
    fin2.close() 
    fout.close()

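This was adapted from here, but without success. In this case, no output is produced at all.

Can someone point out where I am going wrong with this task? Although the sample files are small, I have to run this on a large file. Thanks for any help.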

4

3 Answers

4

I think you are trying to do something like this

with open('input1', 'rb') as indexfile:
    lemma = {x.strip() for x in indexfile}

with open('input2', 'rb') as sensefile, open('output', 'wb') as outfile:
    for line in sensefile:
        nouns_interest = concept, slot, filler, freq = line.split()
        if concept not in lemma:
            outfile.write('\t'.join(nouns_interest) + '\n')

Your desired output seems to put a hyphen between slot and filler, so you may want to use

            outfile.write('{}\t{}-{}\t{}\n'.format(*nouns_interest))
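
For readers on Python 3, where files opened with 'rb' yield bytes rather than str, the same idea can be sketched on plain text lines. This is only an illustration under assumptions: the helper name filter_rows is hypothetical, and skipping rows that do not have exactly four fields is a choice made here, not something the question specifies.

```python
def filter_rows(data_lines, stop_lines):
    """Yield reformatted rows whose first column is not a stop word."""
    # Build the stop set once; ignore blank lines in the stop file.
    stop_words = {line.strip() for line in stop_lines if line.strip()}
    for line in data_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # assumption: skip blank or malformed rows
        concept, slot, filler, freq = fields
        if concept not in stop_words:
            # Join slot and filler with a hyphen, as in the desired output.
            yield '{}\t{}-{}\t{}\n'.format(concept, slot, filler, freq)


# Small in-memory sample taken from the question's example data.
data = [
    "abandonment-n\tafter+n-the+n-a-j\tstop-n\t1\n",
    "cake-n\tagainst+n-the+vg\trest-v\t1\n",
]
stops = ["apple-n\n", "cake-n\n"]
result = list(filter_rows(data, stops))
print(result)
```

Because the helper only iterates over lines, it should also accept file objects opened in text mode, and its output can be passed to writelines() on the output file, so a large data file is never loaded into memory at once.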
answered 2013-11-13T10:25:55.927
1

I haven't checked your logic, but you are overwriting lemma on every line there. Maybe append each line to a list instead?

lemma = []
for line in oIndexFile:
    lemma.append(line.strip())  #strips everything except the text

Or, as @gnibbler suggested, you can use a set for efficiency:

lemma = set()
for line in oIndexFile:
    lemma.add(line.strip())
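
To make the overwriting problem concrete, here is a small sketch that uses the stop words from the question as an in-memory list instead of a file (an assumption made so the snippet runs on its own). It shows what lemma holds after the question's loop, and why both comparisons in the question let the cake-n row through.

```python
stop_lines = ["apple-n\n", "banana-n\n", "cake-n\n", "pigeon-n\n"]

# Pattern from the question: lemma is reassigned on every iteration,
# so only the split of the last line survives the loop.
for line in stop_lines:
    lemma = line.split()
overwritten = lemma

# A str never equals a list, so `concept != lemma` is always True.
always_true = ("cake-n" != overwritten)
# `not in` only checks against the single remaining word.
not_in_last = ("cake-n" not in overwritten)

# Collecting every word instead keeps the whole stop list available.
collected = set()
for line in stop_lines:
    collected.add(line.strip())
found = "cake-n" in collected

print(overwritten, always_true, not_in_last, found)
```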

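Edit: It looks like you don't want to split the line, just strip the newline. And yes, your logic was almost correct.

This is what the second part should look like: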

with open ("data_php.txt", "rb") as oSenseFile:
    with open("out_FILTER_LINES", "wb") as oOutFile:
        for line in oSenseFile:
            concept, slot, filler, freq = line.split()
            nounsInterest = [concept, slot, filler, freq]
            #print concept
            if concept not in lemma: #check if the concept exists in lemma
                outstring = '\t'.join(nounsInterest)
                oOutFile.write(outstring + '\n')
            else: 
                pass
answered 2013-11-13T10:16:18.270
1

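If you are sure that lines in the data file do not start with whitespace, then we do not need to split the lines. This is a small tweak to @gnibbler's answer.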

with open('input1', 'rb') as indexfile:
    lemma = {x.strip() for x in indexfile}

with open('input2', 'rb') as sensefile, open('output', 'wb') as outfile:
    for line in sensefile:
        if not any([line.startswith(x) for x in lemma]):
            outfile.write(line)
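
One caveat with the prefix test, sketched below with a made-up row: startswith also matches when a stop word is merely a prefix of a longer first column, and it scans every stop word for each line. Assuming the columns are tab-separated as the question states, cutting off the first column with partition keeps the no-split idea while preserving exact matching and a single set lookup. The row cake-nut is hypothetical and does not appear in the question's data.

```python
lemma = {"cake-n", "apple-n"}

# Hypothetical row whose first column merely begins with a stop word.
line = "cake-nut\tagainst+n-the+vg\trest-v\t1\n"

# Prefix test from the answer: matches because 'cake-n' starts 'cake-nut'.
prefix_hit = any(line.startswith(x) for x in lemma)

# Exact test: take everything before the first tab and look it up once.
first_column = line.partition('\t')[0]
exact_hit = first_column in lemma

print(prefix_hit, first_column, exact_hit)
```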
answered 2013-11-13T10:44:04.320