I have a large array of words. I want to count how many times two particular words occur within a given distance of each other.

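For example, if "time" and "late" are no more than three words apart, I want to increment a counter. The words "time" and "late" can each appear hundreds of times in the array. How can I find the number of times they occur close to each other?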

3 Answers

You didn't ask a specific question, so I assume you're still working out an algorithm.

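  1. Iterate over the indexes.
    1. If the first word is found at that index,
      1. Note that index.
    2. If the second word is found at that index,
      1. Note that index.
  2. Subtract one index from the other.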

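Notes:

  • You may want to add checks to make sure each word was actually found.
  • You didn't specify what should happen when one of the words appears more than once.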
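
The steps above could be sketched in Perl roughly as follows. This is only an illustration under stated assumptions: the helper name `first_offset` and the sample word list are made up, and it looks only at the first occurrence of each word, since the behavior for repeated words was not specified.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the signed offset from the first occurrence
# of $first to the first occurrence of $second, or undef if either is missing.
sub first_offset {
    my ($words, $first, $second) = @_;
    my ($i1, $i2);
    for my $i (0 .. $#$words) {
        # Note the index of each word the first time it is seen.
        $i1 = $i if !defined $i1 && $words->[$i] eq $first;
        $i2 = $i if !defined $i2 && $words->[$i] eq $second;
        last if defined $i1 && defined $i2;
    }
    # Check that both words were actually found.
    return unless defined $i1 && defined $i2;
    return $i2 - $i1;
}

# Assumed sample data, not from the question.
my @words = qw( the time is getting late so hurry );
my $offset = first_offset(\@words, 'time', 'late');
print defined $offset ? "Offset: $offset\n" : "One of the words is missing\n";
# prints "Offset: 3"
```

A negative result means the second word comes before the first; take `abs()` of it if only the distance matters.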

For the question asked in the comments:

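  1. Iterate over the indexes.
    1. If the first word is found at that index,
      1. Note that index.
    2. If the second word is found at that index,
      1. If the difference between the current index and the noted index is ≤ 3,
        1. Increment the counter.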

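Notes:

  • This assumes you only care about the distance between the second word and the most recent preceding instance of the first word.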
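
A minimal Perl sketch of the counting steps above, under the same assumption (only the distance from each second word back to the most recent first word is checked). The helper name `count_close` and the sample sentence are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: counts occurrences of $second that lie at most
# $max_dist positions after the most recent occurrence of $first.
sub count_close {
    my ($words, $first, $second, $max_dist) = @_;
    my $last_first;    # index of the most recent $first; undef until one is seen
    my $count = 0;
    for my $i (0 .. $#$words) {
        if ($words->[$i] eq $first) {
            $last_first = $i;                       # note the index
        }
        elsif ($words->[$i] eq $second) {
            $count++ if defined $last_first
                     && $i - $last_first <= $max_dist;
        }
    }
    return $count;
}

# Assumed sample data, not from the question.
my @words = qw( time is late and time flies but nobody was ever late );
print count_close(\@words, 'time', 'late', 3), "\n";    # prints 1
```

Only the "late" at index 2 is within three positions of a preceding "time"; the one at index 10 is six positions after the "time" at index 4. To also count cases where "late" comes first, one option is to call the helper again with the two words swapped and add the results.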
answered 2013-03-02T09:47:47.700

Using an index hash would be a very efficient solution:

my @words = qw( word1 word2 word3 word4 word5 word6 );

# That can be expensive, but you do it only once
my %index;
@index{@words} = (0..$#words);

# That will be real quick
my $distance = $index{"word6"} - $index{"word2"};
print "Distance: $distance \n";

The output of the script above will be:

Distance: 4

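Note: Creating the index hash can be expensive. However, if you plan to do many distance checks, it may be worth it, because every lookup is fast (constant time, not even log(n)).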
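
One caveat: a plain `word => index` hash keeps only the last position of a word that occurs more than once, while the question says both words can appear hundreds of times. A possible variation, sketched here as an assumption rather than as part of the answer above, stores every position of each word in an array reference and then counts the position pairs that are close enough. The helper name `count_pairs_within` and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Assumed sample data, not from the question.
my @words = qw( time flies late time is never late );

# Build the index once: word => reference to a list of all its positions.
my %positions;
push @{ $positions{ $words[$_] } }, $_ for 0 .. $#words;

# Hypothetical helper: counts (first, second) position pairs whose
# absolute distance is at most $max_dist.
sub count_pairs_within {
    my ($pos, $first, $second, $max_dist) = @_;
    my $count = 0;
    for my $i (@{ $pos->{$first} || [] }) {        # empty list if word is absent
        for my $j (@{ $pos->{$second} || [] }) {
            $count++ if abs($i - $j) <= $max_dist;
        }
    }
    return $count;
}

print count_pairs_within(\%positions, 'time', 'late', 3), "\n";    # prints 3
```

Here "time" is at positions 0 and 3 and "late" at 2 and 6, so the pairs (0,2), (3,2) and (3,6) qualify. Note that this counts every close pair in either order, which differs from counting only the nearest preceding occurrence; the nested loops are also quadratic in the number of occurrences of the two words.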

answered 2013-03-02T09:52:38.690

Do you need to support duplicate words?

#! /usr/bin/perl
use strict;
use warnings;
use constant DEBUG => 0;

my @words;
if( $ARGV[0] && -f $ARGV[0] ) {
    open my $fh, "<", $ARGV[0] or die "Could not read $ARGV[0], because: $!\n";
    my $hughTestFile = do { local $/; <$fh> };
    # split ' ' splits on runs of whitespace, so repeated spaces/newlines yield no empty "words"
    @words = split ' ', $hughTestFile;  # $#words == 10M words with my test.log
    # Test words (below) were manually placed at equal distances (~every 900K words) in test.log
    # With above, TESTS ran in avg of 15 seconds.  Likely test.log was in buffers/cache.
} else {
    @words = qw( word1 word2 word3 word4 word5 word6 word7 word8 word4 word9 word0 );
}

sub IndexOf {
    my $searchFor = shift;
    return undef if( !$searchFor );
    my $Nth = shift || 1;

    my $length = $#words;
    my $cntr = 0;
    for my $word (@words) {
        if( $word eq $searchFor ) {
            $Nth--;
            return $cntr if( $Nth == 0 );
        }
        $cntr++;
    }
    return undef;
}

sub Distance {
# args:  <1st word>, <2nd word>, [occurrence_of_1st_word], [occurrence_of_2nd_word]
# for occurrence counts:  0, 1 & undef - all have the same effect (1st occurrence)
    my( $w1, $w2 ) = ($_[0], $_[1]);
    my( $n1, $n2 ) = ($_[2] || undef, $_[3] || undef );
    die "Missing words\n" if( !$w1 );
    $w2 = $w1 if( !$w2 );

    my( $i1, $i2 ) = ( IndexOf($w1, $n1), IndexOf($w2, $n2) );
    if( defined($i1) && defined($i2) ) {
        my $offset = $i1-$i2;
        print "  Distance (offset) = $offset\n";
        return undef;
    } elsif( !defined($i1) && !defined($i2) ) {
        print "  Neither word was ";
    } elsif( !defined($i1) ) {
        print "  First word was not ";
    } else {
        print "  Second word was not ";
    }
    print "found in list\n";

    return undef;
}

# TESTS
print "Your array has ".scalar(@words)." words\n";  # $#words is the last index, not the count
print "When 1st word is AFTER 2nd word:\n";
Distance( "word7", "word3" );
print "When 1st word is BEFORE 2nd word:\n";
Distance( "word2", "word5" );
print "When 1st word == 2nd word:\n";
Distance( "word4", "word4" );
print "When 1st word doesn't exist:\n";
Distance( "word00", "word6" );
print "When 2nd word doesn't exist:\n";
Distance( "word1", "word99" );
print "When neither the 1st nor the 2nd word exists:\n";
Distance( "word00", "word99" );
print "When the 1st word is AFTER the 2nd OCCURRENCE of 2nd word:\n";
Distance( "word9", "word4", 0, 2 );
print "When the 1st word is BEFORE the 2nd OCCURRENCE of the 2nd word:\n";
Distance( "word7", "word4", 1, 2 );
print "When the 2nd OCCURRENCE of the 2nd word doesn't exist:\n";
Distance( "word7", "word99", 0, 2 );
print "When the 2nd OCCURRENCE of the 1st word is AFTER the 2nd word:\n";
Distance( "word4", "word2", 2, 0 );
print "When the 2nd OCCURRENCE of the 1st word is BEFORE the 2nd word:\n";
Distance( "word4", "word0", 2, 0 );
print "When the 2nd OCCURRENCE of the 1st word exists, but 2nd doesn't:\n";
Distance( "word4", "word99", 2, 0 );
print "When neither of the 2nd OCCURRENCES of the words exist:\n";
Distance( "word00", "word99", 2, 2 );
print "Distance between 2nd and 1st OCCURRENCES of the same word:\n";
Distance( "word4", "", 2, 1 );
answered 2013-03-02T18:50:41.740