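I am trying to take a data frame profiles, which contains a column email of addresses, and add a new column domain consisting of the registerable domain part of each email address.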
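I create the vector of unique registerable_domains separately, in a process that is too complex to run against each row of the data frame. The result is a vector that necessarily has fewer entries than the profiles data frame has rows. I then check whether each entry in the registerable_domains vector appears at the end of each email address in the profiles data frame, and set the domain column entry at the matching positions.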
The code below includes reproducible data that you can copy, paste and run in R, with a comment on each line explaining what it does.
The for() loop does exactly what I want: it creates the appropriate entries in the domain column of the profiles data frame.
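The problem is that in this example the profiles data frame has 12 rows and the registerable_domains vector has 8 entries. In the actual data set, the profiles data frame has approximately 500,000 rows and the registerable_domains vector has approximately 110,000 entries. So while the for() loop works on small data sets, I need a different approach for very large ones (by my estimate, this approach would take about 75 years to finish on the full data set!).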
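I would greatly appreciate help converting this for() loop into something that runs in a practical amount of time on large data sets. I have looked through many other threads, but could not find an answer that addresses this particular situation (although many similar but different ones are addressed). Thank you!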
# Data frame consisting of a column of 12 emails, and a column of 12 NA entries:
email <- c("john@doe.com",
           "mary@smith.co.uk",
           "peter@microsoft.com",
           "jane@admins.microsoft.com",
           "luke@star.wars.com",
           "leia@star.wars.com",
           "yoda@masters.star.wars.com",
           "grandma@bletchly.ww2.wars.com",
           "searchfor@janedoe.com",
           "fan@mail.starwars.com",
           "city@toronto.ca",
           "area@toronto.canada.ca")
domain <- rep(NA, length(email))
profiles <- data.frame(email, domain)
profiles; # See what the initial data frame looks like
#     email                          domain
# 1   john@doe.com                   NA
# 2   mary@smith.co.uk               NA
# 3   peter@microsoft.com            NA
# 4   jane@admins.microsoft.com      NA
# 5   luke@star.wars.com             NA
# 6   leia@star.wars.com             NA
# 7   yoda@masters.star.wars.com     NA
# 8   grandma@bletchly.ww2.wars.com  NA
# 9   searchfor@janedoe.com          NA
# 10  fan@mail.starwars.com          NA
# 11  city@toronto.ca                NA
# 12  area@toronto.canada.ca         NA
# Vector consisting of email addresses stripped to registerable domain component only, created through a separate process that is too complex to run on each row entry:
registerable_domains <- c("doe.com",
                          "smith.co.uk",
                          "microsoft.com",
                          "wars.com",
                          "janedoe.com",
                          "starwars.com",
                          "toronto.ca",
                          "canada.ca")
# Credit to Nick Kennedy for his help with this original solution (http://stackoverflow.com/users/4998761/nick-kennedy)
for (domains in registerable_domains) {  # Iterate through each of the registerable domains
  # Add regex characters to ensure that only the end part matches, to deal with nested domain names
  domains_pattern <- paste("[.@]", domains, "$", sep = "")
  # Grep for the current domain pattern in all of the emails and build a logical vector of entry locations
  found <- grepl(domains_pattern, profiles$email, ignore.case = TRUE, perl = TRUE)
  # Modify the profiles data frame at TRUE entry locations not yet set
  profiles[which(found & is.na(profiles$domain)), "domain"] <- domains
}
profiles; # Expected and desired outcome:
#     email                          domain
# 1   john@doe.com                   doe.com
# 2   mary@smith.co.uk               smith.co.uk
# 3   peter@microsoft.com            microsoft.com
# 4   jane@admins.microsoft.com      microsoft.com
# 5   luke@star.wars.com             wars.com
# 6   leia@star.wars.com             wars.com
# 7   yoda@masters.star.wars.com     wars.com
# 8   grandma@bletchly.ww2.wars.com  wars.com
# 9   searchfor@janedoe.com          janedoe.com
# 10  fan@mail.starwars.com          starwars.com
# 11  city@toronto.ca                toronto.ca
# 12  area@toronto.canada.ca         canada.ca
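
For reference, one possible direction is sketched below. It has not been benchmarked on the full data set. Instead of running one regex per registerable domain over every email, it strips one leading label at a time from each email's host part and looks up the remaining suffix with match(), so the number of passes depends on how deeply hosts are nested rather than on the length of registerable_domains. The function name assign_domains and the sample_* variables are hypothetical names introduced for illustration. The sketch assumes the host is everything after the last "@", that registerable_domains contains no NA entries, and that when several entries match, the one appearing earliest in the vector should win (mirroring the loop's behaviour of not overwriting entries that are already set).

```r
# Hypothetical helper: a sketch, not a tested drop-in replacement for the loop.
assign_domains <- function(email, registerable_domains) {
  # Host part of each address (everything after the last "@"), lower-cased
  # to mimic ignore.case = TRUE in the original grepl() call.
  host <- tolower(sub("^.*@", "", email))
  lookup <- tolower(registerable_domains)

  # best[i] holds the smallest index into registerable_domains found so far.
  best <- rep(NA_integer_, length(host))
  candidate <- host

  repeat {
    # Exact lookup of the current suffix for all rows at once.
    idx <- match(candidate, lookup)
    better <- !is.na(idx) & (is.na(best) | idx < best)
    best[better] <- idx[better]

    # Stop once no candidate has a label left to strip.
    has_dot <- grepl(".", candidate, fixed = TRUE)
    if (!any(has_dot)) break

    # Drop the leading label: "a.b.c" becomes "b.c"; exhausted hosts become NA.
    candidate <- ifelse(has_dot, sub("^[^.]*\\.", "", candidate), NA_character_)
  }

  # Rows with no match stay NA.
  registerable_domains[best]
}

# Small illustrative inputs (hypothetical, modelled on the data above).
sample_emails  <- c("john@doe.com", "jane@admins.microsoft.com",
                    "area@toronto.canada.ca", "nobody@example.org")
sample_domains <- c("doe.com", "microsoft.com", "toronto.ca", "canada.ca")
assign_domains(sample_emails, sample_domains)
# "doe.com" "microsoft.com" "canada.ca" NA
```

With the data frame from the question, this would be applied as profiles$domain <- assign_domains(profiles$email, registerable_domains). Because match() compares whole strings, the dots in the domain names need no regex escaping.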