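Starting your specification/question with a story ("I have a ...") is a good idea. But such a story, whether true or made up because you can't reveal the truth, should give
- a vivid picture of the situation/problem/task
- the reasons why all the work has to be done
- definitions for un(common)ly used terms
So I'd start with: I work in a prison and have to scan the inmates' emails for
- names (like "Al Capone") mentioned anywhere in the text; the director wants to read those mails in full
- order lines (like "weapon: AK 4711 quantity: 14"); the ordnance officer wants that information to calculate the ammunition and rack space needed
- paragraphs containing 'family' keywords like "wife", "child", ...; the priest wants to prepare her sermons efficiently
Taken by itself, each of the terms "keyword" (~ running text) and "attribute" (~ structured text) may be 'clear', but if both are applied to "the X I have to search for", things get mushy. Instead of general ("chunk") and technical ("string") terms, you should use 'real world' (line) and specific (paragraph) words. Samples of your input: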
From: Robin Hood
To: Scarface
Hi Scarface,

tell Al Capone to send a car to the prison gate on sunday.

For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8

Tell my wife in Folsom to send some money to my son in
Alcatraz.

Regards
Robin
and your expected output:
--- Robin.txt ----
keywords:
Al Capone: Yes
Billy the Kid: No
Scarface: Yes
order lines:
knife:
knife: Bowie quantity: 8
machine gun:
stinger rocket:
weapon:
weapon: AK 4711 quantity: 14
social relations paragraphs:
Tell my wife in Folsom to send some money to my son in
Alcatraz.
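Pseudo code should begin at the top level. If you start with
for each file in folder
  load search list
  process current file('s content) using search list
it's obvious that
load search list
for each file in folder
  process current file using search list
would be much better, because the search list is then loaded once instead of once per file.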
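In Perl, that corrected top-level plan might look roughly like the sketch below. The names `load_search_list`, `process_file`, and `process_folder`, the hard-coded list contents, and the `.txt` file filter are assumptions made for illustration only; the temporary folder exists just to make the sketch runnable on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical loader: a real one would probably read the list from a file.
sub load_search_list {
    return [ 'Al Capone', 'Billy the Kid', 'Scarface' ];
}

# Hypothetical per-file step: return the names that occur in the file.
sub process_file {
    my ($path, $search_list) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $text = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    return [ grep { $text =~ /\Q$_\E/ } @{$search_list} ];
}

# Top level: load the list once, then loop over the files in the folder.
sub process_folder {
    my ($folder) = @_;
    my $search_list = load_search_list();
    opendir my $dh, $folder or die "Cannot read $folder: $!";
    my @names = sort grep { /\.txt\z/ } readdir $dh;
    closedir $dh;
    my %found;
    $found{$_} = process_file("$folder/$_", $search_list) for @names;
    return \%found;
}

# Demo on a throwaway folder holding a single mail.
my $folder = tempdir(CLEANUP => 1);
open my $out, '>', "$folder/Robin.txt" or die "Cannot write: $!";
print {$out} "Hi Scarface,\ntell Al Capone to send a car.\n";
close $out;

my $found = process_folder($folder);
for my $name (sort keys %{$found}) {
    print "$name: ", join(', ', @{ $found->{$name} }), "\n";
}
```

For the demo mail this should print `Robin.txt: Al Capone, Scarface`.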
Based on the story, the examples, and the top-level plan, I would try to come up with proof-of-concept code for a simplified version of the "process current file('s content) using search list" task:
given file/text to search in and list of keywords/attributes
  print file name
  print "keywords:"
  for each boolean item
    print boolean item text
    if found anywhere in whole text
      print "Yes"
    else
      print "No"
  print "order lines:"
  for each line item
    print line item text
    if found anywhere in whole text
      print whole line
  print "social relations paragraphs:"
  for each paragraph
    for each social relation item
      if found
        print paragraph
        no need to check for other items
First implementation attempt:
use Modern::Perl;
#use English qw(-no_match_vars);
use English;    # plain import, because $MATCH is used below

exit step_00();

sub step_00 {
  # given file/text to search in
  # (paragraphs are separated by blank lines; the split below relies on that)
  my $whole_text = <<"EOT";
From: Robin Hood
To: Scarface
Hi Scarface,

tell Al Capone to send a car to the prison gate on sunday.

For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8

Tell my wife in Folsom to send some money to my son in
Alcatraz.

Regards
Robin
EOT
  # print file name
  say "--- Robin.txt ---";
  # print "keywords:"
  say "keywords:";
  # for each boolean item
  for my $bi ("Al Capone", "Billy the Kid", "Scarface") {
    # print boolean item text
    printf " %s: ", $bi;
    # if found anywhere in whole text
    if ($whole_text =~ /$bi/) {
      # print "Yes"
      say "Yes";
    # else
    } else {
      # print "No"
      say "No";
    }
  }
  # print "order lines:"
  say "order lines:";
  # for each line item
  for my $li ("knife", "machine gun", "stinger rocket", "weapon") {
    # print line item text (not implemented in this first attempt)
    # if found anywhere in whole text (only the first matching line is used)
    if ($whole_text =~ /^$li.*$/m) {
      # print whole line
      say " ", $MATCH;
    }
  }
  # print "social relations paragraphs:"
  say "social relations paragraphs:";
  # for each paragraph
  for my $para (split /\n\n/, $whole_text) {
    # for each social relation item
    for my $sr ("wife", "son", "husband") {
      # if found (plain substring match; the alternative uses word boundaries)
      if ($para =~ /$sr/) {
      ## if ($para =~ /\b$sr\b/) {
        # print paragraph
        say $para;
        # no need to check for other items
        last;
      }
    }
  }
  return 0;
}
output:
perl 16953439.pl
--- Robin.txt ---
keywords:
Al Capone: Yes
Billy the Kid: No
Scarface: Yes
order lines:
knife: Bowie quantity: 8
weapon: AK 4711 quantity: 14
social relations paragraphs:
tell Al Capone to send a car to the prison gate on sunday.
Tell my wife in Folsom to send some money to my son in
Alcatraz.
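Note that the first paragraph in that output contains none of the social relation items as a word of its own: it is reported because "son" matches inside "prison". The commented-out line in the code hints at one possible remedy, word boundaries. A minimal sketch of the difference follows; the helper name `mentions` and the added `\Q...\E` quoting are assumptions of this illustration, not part of the attempt above.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $item occurs in $text, else 0. With $whole_word
# set, the item must stand between word boundaries (\b), so it no longer
# matches inside a longer word.
sub mentions {
    my ($text, $item, $whole_word) = @_;
    my $re = $whole_word ? qr/\b\Q$item\E\b/ : qr/\Q$item\E/;
    return $text =~ $re ? 1 : 0;
}

my $prison = "tell Al Capone to send a car to the prison gate on sunday.";
my $family = "Tell my wife in Folsom to send some money to my son in\nAlcatraz.";

printf "substring,  prison paragraph: %d\n", mentions($prison, 'son', 0);
printf "whole word, prison paragraph: %d\n", mentions($prison, 'son', 1);
printf "whole word, family paragraph: %d\n", mentions($family, 'son', 1);
```

The three lines should report 1, 0, and 1. Whether whole-word matching is what the priest really wants (it would also skip "sons") is again a question for the specification.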
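Such (premature) code helps you to
- clarify your specs (should not-found keywords go into the output? is your search list really flat, or should it be structured/grouped?)
- check your assumptions about how to do things (should the order line search be done on an array of the lines of the whole text?)
- identify topics for further research/RTFM (e.g. regexes (prison!))
- plan your next steps (folder loop, reading the input file)
(In addition, people in the know will point out all my bad practices, so you can avoid them from the start.)
Good luck!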
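To make two of those questions concrete, a grouped search list and an order line search that walks an array of lines could be sketched as follows. The hash layout, the group names, and the helper `find_order_lines` are hypothetical; the printed format follows the expected output from the specification, where items without a hit still appear.

```perl
use strict;
use warnings;

# Hypothetical grouped search list: one entry per kind of search.
my %search_list = (
    names        => [ 'Al Capone', 'Billy the Kid', 'Scarface' ],
    order_items  => [ 'knife', 'machine gun', 'stinger rocket', 'weapon' ],
    family_words => [ 'wife', 'son', 'husband' ],
);

# Collect, for every order item, all lines that start with "<item>:".
sub find_order_lines {
    my ($lines, $items) = @_;
    my %hits = map { $_ => [] } @{$items};
    for my $line (@{$lines}) {
        for my $item (@{$items}) {
            push @{ $hits{$item} }, $line if $line =~ /^\Q$item\E:/;
        }
    }
    return \%hits;
}

my @lines = (
    'For the riot we need:',
    'weapon: AK 4711 quantity: 14',
    'knife: Bowie quantity: 8',
);
my $hits = find_order_lines(\@lines, $search_list{order_items});

print "order lines:\n";
for my $item (@{ $search_list{order_items} }) {
    print "  $item:\n";
    print "    $_\n" for @{ $hits->{$item} };
}
```

Unlike a single regex match against the whole text, this variant keeps every matching line per item, which may or may not be what the ordnance officer needs.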