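I have 3 text files (A, B, and C), each with several hundred email addresses. I want to merge list A and list B into a single file, ignoring differences in case and whitespace. Then I want to remove from the new list every email that appears in list C, again ignoring differences in case and whitespace.
My programming language of choice is normally C++, but it seems poorly suited to this task. Is there a scripting language that could do this (and similar tasks) in relatively few lines?
Or is there already software (free or commercial) that would let me do it? Is it possible to do this in Excel, for example?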
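The quickest way probably doesn't require any coding. You could import files A and B into Excel on one worksheet, then (if necessary) filter the resulting list of addresses to remove any duplicates.
The next step is to import file C into a second worksheet. In a third worksheet, you'd run a VLOOKUP against each address in your first list so that any address that also appears in your "list C" is dropped.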
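The VLOOKUP would look something like this:
=IF(ISNA(VLOOKUP(email_address_cell, Sheet2!email_duplicates_list, 1, FALSE)), email_address_cell, "")
The ISNA check catches the "value not available" error that VLOOKUP returns when an address is not found in list C. In that case the cell shows the address; if the address is found in list C, the cell just shows a blank value. From there, you just need to delete your blanks, and your final list is out.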
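Having said all that, you could still write a VBA macro to do the same thing, and perhaps clean up the list a bit along the way, depending on what you need. Hope that helps!
Since Excel was mentioned, you can also do this kind of thing with Jet and VBScript.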
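' Open the folder c:\Docs\ as a Jet text database; each file in it acts as a table.
' HDR=No means the files have no header row, so the single column is named F1.
Set cn = CreateObject("ADODB.Connection")
strCon = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\Docs\;" _
& "Extended Properties=""text;HDR=No;FMT=Delimited"";"
cn.Open strCon
' Create New.txt from the addresses in A that are not in C (UCase makes the comparison case-insensitive).
strSQL = "SELECT F1 Into New.txt From EmailsA.txt " _
& "WHERE UCase(F1) Not IN (SELECT UCase(F1) From EmailsC.txt)"
cn.Execute strSQL
' Append the addresses from B that are not in C.
strSQL = "INSERT INTO New.txt ( F1 ) SELECT F1 FROM EmailsB.txt " _
& "WHERE UCase(F1) Not IN (SELECT UCase(F1) From EmailsC.txt)"
cn.Execute strSQL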
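For anyone without Jet available, the same "NOT IN" idea can be tried out with Python's built-in sqlite3 module. This is only a rough sketch, not a port of the script above: it uses SQLite rather than Jet, the table names (merged, excluded), the function name and the sample addresses are all made up for illustration, and the GROUP BY that collapses duplicates between A and B is an addition of mine. Note that SQLite's UPPER only folds ASCII letters and TRIM only strips spaces by default.

```python
import sqlite3


def subtract_with_sql(list_a, list_b, list_c):
    """Return (A + B) - C using a NOT IN subquery, ignoring case and outer spaces."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE merged (addr TEXT)")
    conn.execute("CREATE TABLE excluded (addr TEXT)")
    conn.executemany("INSERT INTO merged VALUES (?)",
                     [(x,) for x in list(list_a) + list(list_b)])
    conn.executemany("INSERT INTO excluded VALUES (?)",
                     [(x,) for x in list_c])
    rows = conn.execute(
        "SELECT MIN(TRIM(addr)) FROM merged "
        "WHERE UPPER(TRIM(addr)) NOT IN "
        "      (SELECT UPPER(TRIM(addr)) FROM excluded) "
        "GROUP BY UPPER(TRIM(addr)) "   # one row per address, whatever its case
        "ORDER BY UPPER(TRIM(addr))"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


# Hypothetical sample data standing in for EmailsA.txt, EmailsB.txt, EmailsC.txt.
emails_a = ["Alice@Example.com", "bob@example.com "]
emails_b = [" ALICE@example.com", "carol@example.com"]
emails_c = ["BOB@EXAMPLE.COM"]
sql_result = subtract_with_sql(emails_a, emails_b, emails_c)
print(sql_result)
```

To run it against real files you would read each file into a list of lines first; the query itself stays the same.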
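For the kind of text processing you describe, Perl or Python would be ideal.
You can use an associative array (in this case, an array indexed by strings) to store the email addresses in the list.
Use the lowercased, whitespace-stripped email address as the key and the real email address as the value.
Then it's a matter of reading in and storing the first file, reading in and storing the second file (which overwrites any email address with the same key), then reading in the third file and using its keys to delete entries from the list.
What's left is the list you want (A + B - C).
Pseudocode here: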
set list to empty
foreach line in file one:
    key = unwhitespace(tolowercase(line))
    list{key} = line
foreach line in file two:
    key = unwhitespace(tolowercase(line))
    list{key} = line
foreach line in file three:
    key = unwhitespace(tolowercase(line))
    if exists(list{key})
        delete list{key}
foreach key in list:
    print list{key}
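The pseudocode above could be sketched in Python roughly as follows. Treat it as one possible reading rather than a definitive version: the function names normalize and merge_and_subtract and the sample addresses are made up for illustration, and I'm assuming "unwhitespace" means removing all whitespace, including any inside the address.

```python
def normalize(address):
    """Build the comparison key: lowercase with all whitespace removed."""
    return "".join(address.split()).lower()


def merge_and_subtract(list_a, list_b, list_c):
    """Return (A + B) - C, comparing by key but keeping the original spelling."""
    addresses = {}
    for line in list(list_a) + list(list_b):
        key = normalize(line)
        if key:  # skip blank lines
            # A later duplicate overwrites the earlier entry with the same key.
            addresses[key] = line.strip()
    for line in list_c:
        # pop() with a default does nothing if the key is not present.
        addresses.pop(normalize(line), None)
    return list(addresses.values())


# Hypothetical sample data standing in for the three files.
file_one = ["Alice@Example.com", "bob@example.com "]
file_two = [" ALICE@example.com", "carol@example.com"]
file_three = ["BOB@EXAMPLE.COM"]
remaining = merge_and_subtract(file_one, file_two, file_three)
print(remaining)
```

With real files, each list can be replaced by an open file object, since iterating over a file yields its lines.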
Sadly this answer probably won't help you, but if in fact you were using Unix (Linux for example) you could do something like:
cat filea >> fileb # append file a to file b
sort fileb | uniq > newFile # newFile now contains a merger of file a and file b, with sorted and unique email addresses
The above could all be done on one line, without modifying fileb, as follows: cat filea fileb | sort | uniq > newFile
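Now you're left with simply removing the emails that also appear in fileC. Some variation of "diff" should be helpful there, such as perhaps: diff newFile fileC > finalFile
diff will give you a list of differences between the two files, so the lines marked with "<" in "finalFile" are the emails that are in "newFile" (the merger of A & B) but are NOT in fileC. The output also contains change markers and any lines found only in fileC (marked with ">"), which still need filtering out, and fileC should be sorted the same way as newFile for the comparison to line up. Options to the various tools allow you to ignore whitespace and case (for example, sort -f and uniq -i for case). I'd have to play with it a bit to get it exactly right but the above is the general idea.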
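If you'd rather see the sorted-file idea spelled out, here is a rough Python sketch of it. It is not what diff itself does; it just walks two sorted lists side by side and keeps the entries found only in the first one. The function names unique_sorted and only_in_first and the sample addresses are made up for illustration, and unlike sort and uniq it lowercases the addresses rather than preserving their case.

```python
def unique_sorted(lines):
    """Roughly the job of sort + uniq: normalize, de-duplicate, then sort."""
    return sorted({line.strip().lower() for line in lines if line.strip()})


def only_in_first(first, second):
    """Walk two sorted, duplicate-free lists once; keep items only in `first`."""
    result = []
    i = j = 0
    while i < len(first):
        if j >= len(second) or first[i] < second[j]:
            result.append(first[i])  # nothing in `second` can match this item
            i += 1
        elif first[i] == second[j]:
            i += 1                   # present in both lists: drop it
            j += 1
        else:
            j += 1                   # `second` is behind: catch up
    return result


# Hypothetical sample data standing in for filea, fileb and fileC.
filea = ["Alice@Example.com", "bob@example.com "]
fileb = [" ALICE@example.com", "carol@example.com"]
filec = ["BOB@EXAMPLE.COM"]
new_file = unique_sorted(filea + fileb)
final_file = only_in_first(new_file, unique_sorted(filec))
print(final_file)
```

The single pass only works because both lists are sorted the same way, which is the same reason the command-line version needs fileC sorted.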
I used to have an extra box running Linux for the sole purpose of doing stuff like the above which is a hassle under Windoze but a breeze under Unix type operating systems. When my hardware died I never got around to building another Linux box.
I believe the MKS toolkit for Windoze probably has all of the above utilities.
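I think the answers above cover the technical how-to; the only other thing to consider is how many times you'll have to perform the task. If it's a one-off and you're more comfortable with Excel, start there. If you know you'll be doing this at least twice or more, then writing a script or an executable is your best bet.
In Python, something like this:
Note that this writes the lowercased emails to the final output. If that's not OK, then a dictionary-based solution would be necessary.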
def read_file(filename):
    # Yield each non-empty line with surrounding whitespace removed.
    with open(filename, "r") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

def write_file(filename, lines):
    # Write one address per line.
    with open(filename, "w") as f:
        for line in lines:
            f.write(line + "\n")

set_a = {line.lower() for line in read_file("file_a.txt")}
set_b = {line.lower() for line in read_file("file_b.txt")}
set_c = {line.lower() for line in read_file("file_c.txt")}
# Calculate (a + b) - c
write_file("result.txt", (set_a | set_b) - set_c)
Excel can do it, as above. The best-suited programming language is Perl.