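Perl newbie here, looking for some help.

I have a directory of files and a "keywords" file that lists the attributes to search for and the type of each attribute.

For example:

keywords.txt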

Attribute1 boolean
Attribute2 boolean
Attribute3 search_and_extract
Attribute4 chunk

For each file in the directory, I have to:

  • look up keywords.txt
  • search according to the attribute type

Something like the following.

IF attribute_type = boolean THEN
 search for attribute;
 set found = Y if attribute found;
ELSIF attribute_type = search_and_extract THEN
 extract string where attribute is Found
ELSIF attribute_type = chunk THEN
 extract the complete chunk of paragraph where attribute is found.

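This is what I have so far, and I'm sure there is a more efficient way to do it.

I'm hoping someone can point me in the right direction for the above. Thanks and regards, Sima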

# Reads attributes from config file
# First set boolean attributes. If keyword is found in text,
# variable flag is set to Y else N
# End Code: For each text file in directory loop.
# Run the below for each document.

use strict;
use warnings;

# open Doc
open(my $doc_fh, '<', 'Final_CLP.txt')
    or die "Cannot open Final_CLP.txt: $!";
while (my $doc_line = <$doc_fh>) {
    chomp $doc_line;
    # open the file (note: reopened for every line of the document)
    open(my $cfg_fh, '<', 'attribute_config.txt')
        or die "Cannot open attribute_config.txt: $!";
    while (my $cfg_line = <$cfg_fh>) {
        chomp $cfg_line;
        # each config line is "<attribute>\t<attribute_type>"
        my ($attribute, $attribute_type) = split /\t/, $cfg_line;

        my $is_boolean = ($attribute_type eq "boolean") ? "Y" : "N";

        # For each boolean attribute, check if the keyword exists
        # in the file and return Y or N
        if ($is_boolean eq "Y") {
            print "Yes\n";
            # search for keyword in doc and assign values
        }

        print "Attribute: $attribute\n";
        print "Attribute_Type: $attribute_type\n";
        print "is_boolean: $is_boolean\n";
        print "-----------\n";
    }
    close($cfg_fh);
}
close($doc_fh);
exit;

1 Answer

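Starting your specification/question with a story ("I have a ...") is a good idea. But such a story, whether true or made up because you can't disclose the truth, should give

  • a vivid picture of the situation/problem/task
  • the reasons why all the work has to be done
  • definitions of uncommon (or uncommonly used) terms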

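So I would start with: I work at a prison and have to scan the inmates' emails for

  • names (like "Al Capone") mentioned anywhere in the text; the director wants to read those mails in full
  • order lines (like "weapon: AK 4711 quantity: 14"); the weapons officer needs that information to calculate the amount of ammunition and rack space required
  • paragraphs containing "family" keywords such as "wife", "child", and so on; the priest wants to prepare her sermons efficiently

Taken by itself, each of the terms "keyword" (~ running text) and "attribute" (~ structured text) may be "clear", but if both are applied to "some X I have to search for", things get mushy. Instead of general ("chunk") and technical ("string") terms, you should use "real-world" (line) and specific (paragraph) words. A sample of your input: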

From: Robin Hood
To: Scarface

Hi Scarface,

tell Al Capone to send a car to the prison gate on sunday.

For the riot we need:

weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8

Tell my wife in Folsom to send some money to my son in
Alcatraz.

Regards
Robin

and your expected output:

--- Robin.txt ----
keywords:
  Al Capone: Yes
  Billy the Kid: No
  Scarface: Yes
order lines:
  knife:
    knife: Bowie quantity: 8
  machine gun:
  stinger rocket:
  weapon:
    weapon: AK 4711 quantity: 14
social relations paragraphs:
  Tell my wife in Folsom to send some money to my son in
  Alcatraz.

Pseudocode should start at the top level. If you start with

for each file in folder
    load search list
    process current file('s content) using search list

it becomes obvious that

load search list
for each file in folder
    process current file using search list

would be much better, because the search list is then loaded once instead of once per file.
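
The reordered plan can be sketched in Perl roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes each config line has the form "<item><TAB><type>", and the sub name load_search_list, the sample items, and the file names are made up for illustration. An in-memory file handle stands in for the real config file so the snippet runs on its own.

```perl
use strict;
use warnings;

# Parse the search list once. Each line is assumed to look like
# "<item>\t<type>"; the result groups the items by type.
sub load_search_list {
    my ($fh) = @_;
    my %items_by_type;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;                  # skip blank lines
        my ($item, $type) = split /\t/, $line, 2;
        next unless defined $type;                 # skip malformed lines
        push @{ $items_by_type{$type} }, $item;
    }
    return \%items_by_type;
}

# In-memory stand-in for the config file (illustrative content).
my $config = "Al Capone\tboolean\nScarface\tboolean\n"
           . "weapon\tsearch_and_extract\nwife\tchunk\n";
open my $cfg_fh, '<', \$config or die "Cannot open in-memory config: $!";
my $search_list = load_search_list($cfg_fh);
close $cfg_fh;

# The list is loaded once; every file reuses the same structure.
for my $file_name ('Robin.txt', 'Marian.txt') {    # hypothetical file names
    printf "%s: %d boolean item(s) to check\n",
        $file_name, scalar @{ $search_list->{boolean} };
}
```

Grouping the items by type up front also means the per-file code never has to re-parse or re-test the type of each entry.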

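Based on this story, the samples, and the top-level plan, I would try to come up with proof-of-concept code for a simplified version of the "process current file('s content) using search list" task: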

given file/text to search in and list of keywords/attributes

print file name
print "keywords:"
for each boolean item
  print boolean item text
  if found anywhere in whole text
     print "Yes"
  else
     print "No"
print "order line:"
for each line item
  print line item text
  if found anywhere in whole text
     print whole line
print "social relations paragraphs:"
for each paragraph
    for each social relation item
        if found
           print paragraph
           no need to check for other items

First attempt at an implementation:

use Modern::Perl;

#use English qw(-no_match_vars);
use English;

exit step_00();

sub step_00 {
  # given file/text to search in
  my $whole_text = <<"EOT";
From: Robin Hood
To: Scarface

Hi Scarface,

tell Al Capone to send a car to the prison gate on sunday.

For the riot we need:

weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8

Tell my wife in Folsom to send some money to my son in
Alcatraz.

Regards
Robin
EOT

  #  print file name
  say "--- Robin.txt ---";
  # print "keywords:"
  say "keywords:";
  # for each boolean item
  for my $bi ("Al Capone", "Billy the Kid", "Scarface") {
  #   print boolean item text
      printf " %s: ", $bi;
  #   if found anywhere in whole text
      if ($whole_text =~ /$bi/) {
  #      print "Yes"
         say "Yes";
  #   else
      } else {
  #      print "No"
         say "No";
      }
  }
  # print "order line:"
  say "order lines:";
  # for each line item
  for my $li ("knife", "machine gun", "stinger rocket", "weapon") {
  #   print line item text
  #   if found anywhere in whole text
      if ($whole_text =~ /^$li.*$/m) {
  #      print whole line
         say " ", $MATCH;
      }
  }
  # print "social relations paragaphs:"
  say "social relations paragaphs:";
  # for each paragraph
  for my $para (split /\n\n/, $whole_text) {
  #     for each social relation item
        for my $sr ("wife", "son", "husband") {
  #         if found
            if ($para =~ /$sr/) {
        ##  if ($para =~ /\b$sr\b/) {
  #            print paragraph
               say $para;
  #            no need to check for other items
               last;
            }
        }
  }
  return 0;
}

Output:

perl 16953439.pl
--- Robin.txt ---
keywords:
 Al Capone: Yes
 Billy the Kid: No
 Scarface: Yes
order lines:
 knife: Bowie quantity: 8
 weapon: AK 4711 quantity: 14
social relations paragaphs:
tell Al Capone to send a car to the prison gate on sunday.
Tell my wife in Folsom to send some money to my son in
Alcatraz.
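
The first paragraph shows up in the output above because the plain pattern /son/ also matches inside "prison". A hedged sketch of a whole-word test follows; the helper name contains_word is an assumption made for this illustration, while \b (word boundary) and \Q...\E (quote metacharacters) are standard Perl regex features.

```perl
use strict;
use warnings;

# Return 1 if $word occurs in $text as a whole word, else 0.
# \Q...\E keeps characters in $word from being read as regex syntax;
# \b requires a word boundary on both sides of the match.
sub contains_word {
    my ($text, $word) = @_;
    return $text =~ /\b\Q$word\E\b/ ? 1 : 0;
}

my $para = "tell Al Capone to send a car to the prison gate on sunday.";

# The plain substring pattern matches the "son" inside "prison" ...
print "plain /son/ matches\n" if $para =~ /son/;

# ... while the whole-word test does not.
print "whole word 'son' found: ", contains_word($para, "son"), "\n";
print "whole word 'son' found: ",
    contains_word("send some money to my son in Alcatraz.", "son"), "\n";
```

Note that \b only behaves as expected when the search item begins and ends with a word character, which holds for the items used here.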

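Such (premature) code helps you to

  • clarify your specification (should keywords that are not found go into the output? is your search list really flat, or should it be structured/grouped?)
  • check your assumptions about how to do things (should the order-line search be done on an array of the lines of the whole text?)
  • identify topics for further research/RTFM (e.g. regular expressions (prison!))
  • plan your next steps (folder loop, reading the input files)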
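
As one possible next step, the per-type handling could be collected into a single sub that works on an array of lines and a list of paragraphs. This is only a sketch under assumptions: the name process_text, the shape of the result hash, and the use of the question's type names (boolean, search_and_extract, chunk) as hash keys are choices made for this illustration, not something the question prescribes.

```perl
use strict;
use warnings;

# Apply a search list (hash ref: type => array ref of items) to one text.
sub process_text {
    my ($text, $search_list) = @_;
    $text =~ s/\s+\z//;                        # drop trailing whitespace
    my @lines      = split /\n/, $text;
    my @paragraphs = split /\n{2,}/, $text;    # blank lines separate paragraphs
    my %result = (boolean => {}, search_and_extract => {}, chunk => []);

    # boolean: is the item mentioned anywhere in the text?
    for my $item (@{ $search_list->{boolean} || [] }) {
        $result{boolean}{$item} = $text =~ /\b\Q$item\E\b/ ? 'Yes' : 'No';
    }
    # search_and_extract: collect every line that starts with the item
    for my $item (@{ $search_list->{search_and_extract} || [] }) {
        $result{search_and_extract}{$item} = [ grep { /^\Q$item\E\b/ } @lines ];
    }
    # chunk: keep each paragraph that mentions at least one item
    my @chunk_items = @{ $search_list->{chunk} || [] };
    for my $para (@paragraphs) {
        push @{ $result{chunk} }, $para
            if grep { $para =~ /\b\Q$_\E\b/ } @chunk_items;
    }
    return \%result;
}

my $text = <<'EOT';
tell Al Capone to send a car to the prison gate on sunday.

weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8

Tell my wife in Folsom to send some money to my son in
Alcatraz.
EOT

my $result = process_text($text, {
    boolean            => ['Al Capone', 'Billy the Kid'],
    search_and_extract => ['knife', 'weapon'],
    chunk              => ['wife', 'son'],
});

print "$_: $result->{boolean}{$_}\n" for sort keys %{ $result->{boolean} };
print "$_\n" for map { @{ $result->{search_and_extract}{$_} } }
                 sort keys %{ $result->{search_and_extract} };
print "$_\n" for @{ $result->{chunk} };
```

For the folder loop, each file's content could be read as one string (for example via glob and a local $/) and passed to the same sub together with the search list that was loaded once.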

(In addition, people in the know will point out all my bad practices, so you can avoid them right from the start.)

Good luck!

answered 2013-06-06T08:29:58.130