1

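I have 3 text files (A, B and C), each with several hundred email addresses. I want to merge list A and list B into a single file, ignoring differences in case and whitespace. Then I want to remove from the new list every email that appears in list C, again ignoring differences in case and whitespace.

My programming language of choice is normally C++, but it seems poorly suited to this task. Is there a scripting language that could do this (and similar tasks) in relatively few lines?

Or is there already software (free or commercial) that would let me do it? For example, is it possible to do this in Excel?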

4

7 Answers

3

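The quickest way may not require any coding at all. You can import files A and B into a single Excel worksheet and then (if necessary) filter the resulting address list to remove any duplicates.

The next step is to import file C into a second worksheet. In a third worksheet, run a VLOOKUP against every address in your first list and drop those that appear in your "List C".

The VLOOKUP would look something like this: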

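=IF(ISNA(VLOOKUP(email_address_cell, Sheet2!email_duplicates_list, 1, FALSE)), email_address_cell, "")

The ISNA check catches the "value not available" (#N/A) error that VLOOKUP returns when an address is not found in list C, so those addresses are kept; addresses that are found in list C just display a blank value. VLOOKUP's exact match is not case-sensitive, so differences in case are ignored. From there, you only need to remove the blanks and you have your final list.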

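Having said all that, you could still write a VBA macro to do the same thing and perhaps clean up the list a little as well, depending on what you need. Hope that helps!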

answered 2008-11-09T23:42:01.100
3

Since Excel was mentioned, you can also use Jet and VBScript for this kind of thing.

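' Open the folder of text files through Jet; each file is treated as a table
Set cn = CreateObject("ADODB.Connection")
strCon = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\Docs\;" _
& "Extended Properties=""text;HDR=No;FMT=Delimited"";"

cn.Open strCon

' Copy the addresses from list A that are not in list C into New.txt
' (with HDR=No the single column is named F1; UCase makes the comparison ignore case)
strSQL = "SELECT F1 Into New.txt From EmailsA.txt " _
    & "WHERE UCase(F1) Not IN (SELECT UCase(F1) From EmailsC.txt)"
cn.Execute strSQL

' Append the addresses from list B that are not in list C
' (addresses present in both A and B are not de-duplicated here)
strSQL = "INSERT INTO New.txt ( F1 ) SELECT F1 FROM EmailsB.txt " _
    & "WHERE UCase(F1) Not IN (SELECT UCase(F1) From EmailsC.txt)"
cn.Execute strSQL
cn.Close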
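
For readers without Jet, the same NOT IN idea could be tried with Python's built-in sqlite3 module. This is only a sketch under assumptions, not the answer's method: the function name merge_minus_sql, the table names, the in-memory database, and the added TRIM and GROUP BY (to ignore surrounding spaces and collapse duplicates) are all illustrative.

```python
import sqlite3


def merge_minus_sql(list_a, list_b, list_c):
    """Return addresses from A and B that are not in C, ignoring case and surrounding spaces."""
    cn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
    cn.execute("CREATE TABLE keep (addr TEXT)")       # hypothetical table for A + B
    cn.execute("CREATE TABLE drop_list (addr TEXT)")  # hypothetical table for C
    cn.executemany("INSERT INTO keep VALUES (?)", [(a,) for a in list_a + list_b])
    cn.executemany("INSERT INTO drop_list VALUES (?)", [(c,) for c in list_c])
    # UPPER(TRIM(...)) is the comparison key; SQLite's TRIM removes spaces only.
    # GROUP BY collapses duplicates, MIN picks one spelling per group.
    rows = cn.execute(
        "SELECT MIN(TRIM(addr)) FROM keep "
        "WHERE UPPER(TRIM(addr)) NOT IN (SELECT UPPER(TRIM(addr)) FROM drop_list) "
        "GROUP BY UPPER(TRIM(addr)) "
        "ORDER BY UPPER(TRIM(addr))"
    ).fetchall()
    cn.close()
    return [row[0] for row in rows]
```

Calling merge_minus_sql with the three lists of lines read from the files returns the merged list, sorted by the normalized address.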
answered 2008-11-10T00:09:05.693
2

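For the kind of text processing you describe, Perl or Python would be ideal.

You can use an associative array (in this case, an array indexed by strings) to store the email addresses in the list.

Use the lowercased, whitespace-stripped email address as the key and the real email address as the value.

Then it's a matter of reading in and storing the first file, reading in and storing the second file (which overwrites any email address with the same key), then reading in the third file and using its keys to delete entries from the list.

What is left is the list you want (A + B - C).

Pseudocode here: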

set list to empty
foreach line in file one:
    key = unwhitespace(tolowercase(line))
    list{key} = line
foreach line in file two:
    key = unwhitespace(tolowercase(line))
    list{key} = line
foreach line in file three:
    key = unwhitespace(tolowercase(line))
    if exists(list{key})
        delete list{key}
foreach key in list:
    print list{key}
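
The pseudocode above could be sketched in Python roughly as follows. The names normalize and merge_minus are placeholders that do not come from the answer, and "unwhitespace" is assumed to mean removing all whitespace rather than only trimming the ends.

```python
def normalize(address):
    """Build the lookup key: lowercase, with all whitespace removed (assumed meaning of 'unwhitespace')."""
    return "".join(address.split()).lower()


def merge_minus(keep_lists, remove_list):
    """Return the addresses from keep_lists whose key does not appear in remove_list."""
    merged = {}
    for lines in keep_lists:
        for line in lines:
            key = normalize(line)
            if key:  # skip blank lines
                # A later address with the same key overwrites the earlier one
                merged[key] = line.strip()
    for line in remove_list:
        # pop with a default does nothing when the key is absent
        merged.pop(normalize(line), None)
    return list(merged.values())
```

With the three files read into lists of lines a, b and c, merge_minus([a, b], c) gives A + B - C while keeping each address in its original case.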
answered 2008-11-09T23:46:27.427
1

Sadly this answer probably won't help you, but if in fact you were using Unix (Linux for example) you could do something like:

cat filea >> fileb # append file a to file b

sort fileb | uniq > newFile # newFile now contains a merger of file a and file b, with sorted and unique email addresses

The above could all be done on one line, without modifying fileb, as follows: cat filea fileb | sort | uniq > newFile

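Now you're left with simply removing the emails that also appear in fileC. Some variation of "diff" should be helpful there such as perhaps: diff newFile fileC > finalFile

diff will give you a list of differences between the two files, so the lines marked "<" in "finalFile" are the emails that are in "newFile" (the merger of A & B) but are NOT in fileC. The output also carries change markers and any lines found only in fileC, so it needs some filtering. If both files are sorted, comm -23 newFile fileC prints only the lines unique to newFile. Options to the various tools allow you to ignore whitespace and case (diff -i and -w, sort -f, uniq -i, for example). I'd have to play with it a bit to get it exactly right but the above is the general idea.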

I used to have an extra box running Linux for the sole purpose of doing stuff like the above which is a hassle under Windoze but a breeze under Unix type operating systems. When my hardware died I never got around to building another Linux box.

I believe the MKS toolkit for Windoze probably has all of the above utilities.

answered 2008-11-11T06:06:42.273
1

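I think the answers above cover the technical how-to question; the only other thing to consider is how many times you will have to perform the task. If it is a one-off and you are more comfortable with Excel, start there. If you know you will be doing this at least twice, writing a script or an executable is your best bet.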

answered 2008-11-10T00:23:57.287
1

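In Python, something like this:

Note that this writes the lowercased emails to the final output. If that is not acceptable, a dictionary-based solution would be necessary.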

def read_file(filename):
    """Yield each non-blank line with surrounding whitespace removed."""
    # open() replaces the Python 2-only file() builtin
    with open(filename, "r") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

def write_file(filename, lines):
    """Write one address per line."""
    with open(filename, "w") as f:
        for line in lines:
            f.write(line + "\n")

# Lowercasing makes the set comparisons ignore case
set_a = {line.lower() for line in read_file("file_a.txt")}
set_b = {line.lower() for line in read_file("file_b.txt")}
set_c = {line.lower() for line in read_file("file_c.txt")}

# Calculate (a + b) - c
write_file("result.txt", set_a.union(set_b).difference(set_c))
answered 2008-11-09T23:55:47.843
-1

Excel can do it, as above. The best-suited programming language is Perl.

answered 2008-11-09T23:51:38.290