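What is the best way to extract user@host.com combinations from a large set of files?
I assume sed/awk can do this, but I'm not very familiar with regular expressions.
We have a file, Staff_data.txt, that contains more than just email addresses, and we'd like to strip out the rest of the data and collect only the email addresses (e.g. h@south.com).
I imagine the easiest way is sed/awk in a terminal, but given how complicated regular expressions can get, I'd appreciate some guidance.
Thanks.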
Here's a somewhat awkward but apparently effective script I wrote a few years ago to do this job:
# Get rid of any Message-Id line like this:
# Message-ID: <AANLkTinSDG_dySv_oy_7jWBD=QWiHUMpSEFtE-cxP6Y1@mail.gmail.com>
#
# Change any character that can't be in an email address to a space.
#
# Print just the character strings that look like email addresses.
#
# Drop anything with multiple "@"s and change any domain names (i.e.
# the part after the "@") to all lower case as those are not case-sensitive.
#
# If we have a local mail box part (i.e. the part before the "@") that's
# a mix of upper/lower and another that's all lower, keep them both. Ditto
# for multiple versions of mixed case since we don't know which is correct.
#
# Sort uniquely.
cat "$@" |
awk '!/^Message-ID:/' |
awk '{gsub(/[^-_.@[:alnum:]]+/," ")}1' |
awk '{for (i=1;i<=NF;i++) if ($i ~ /.+@.+[.][[:alpha:]]+$/) print $i}' |
awk '
    BEGIN { FS=OFS="@" }
    NF != 2 { printf "Badly formatted %s skipped.\n",$0 | "cat>&2"; next }
    { $2=tolower($2); print }
' |
tr '[A-Z]' '[a-z]' |
sort -u
It's not pretty, but it seems to be robust.
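If the chain of separate awk processes bothers you, the same stages can probably be folded into one awk program. This is only a sketch of that idea, not the script above: the function name extract_addresses is made up for illustration, and it deliberately leaves out the final tr stage, so the part before the "@" keeps its original case while the domain is lowercased.

```shell
# Hypothetical single-pass variant of the pipeline (illustrative name).
# Reads the files given as arguments, or stdin if there are none.
extract_addresses() {
    awk '
        /^Message-ID:/ { next }                 # skip Message-ID header lines
        {
            # Characters that cannot be part of an address become spaces,
            # which also re-splits the line into fields.
            gsub(/[^-_.@[:alnum:]]+/, " ")
            for (i = 1; i <= NF; i++) {
                # Keep only fields that look like something@something.tld
                if ($i !~ /.+@.+[.][[:alpha:]]+$/) continue
                n = split($i, part, "@")
                if (n != 2) {                   # more than one "@": report and skip
                    printf "Badly formatted %s skipped.\n", $i | "cat>&2"
                    continue
                }
                # Domain names are case-insensitive, so normalise only that part.
                print part[1] "@" tolower(part[2])
            }
        }
    ' "$@" | sort -u
}
```

You would call it the same way as the script, e.g. extract_addresses Staff_data.txt.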
You want grep here, not sed or awk. For example, to show all email addresses from the domain south.com:
grep -o '[^ ]*@south\.com ' file
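To collect every address rather than just one domain, the same grep -o approach can be widened. A minimal sketch, assuming a grep that supports -o and -h (GNU and BSD grep both do, but neither option is in POSIX). The function name extract_emails is hypothetical, and the pattern is a loose approximation of an address, not a full RFC 5322 matcher:

```shell
# Hypothetical helper: print each address-like token once, one per line.
#   -o  print only the matching part of each line
#   -h  suppress file name prefixes when several files are given
#   -E  use extended regular expressions
extract_emails() {
    grep -ohE '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' "$@" | sort -u
}
```

For the file in the question that would be extract_emails Staff_data.txt. Since the pattern does not depend on a space after the address, addresses at the end of a line or followed by punctuation are picked up as well.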