code-golf - Code Golf: Quickly Build a List of Keywords from Text, Including Number of Instances

Question

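I've already worked out a solution to this for myself in PHP, but I'm curious how it could be done differently, or even better. The two languages I'm primarily interested in are PHP and Javascript, but I'd also like to see how quickly this can be done in any other major language today (mostly C#, Java, etc).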

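Return only words with an occurrence count greater than X
Return only words with a length greater than Y
Ignore common terms like "and, is, the, etc"
Feel free to strip punctuation prior to processing (i.e. "John's" becomes "John")
Return results in a collection/array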
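As a point of reference, the five requirements above can be sketched in a few lines of Python. This is only a minimal sketch: the function name `keywords`, the `STOP_WORDS` set and the sample sentence are illustrative assumptions, not part of the question.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the question only names "and, is, the, etc".
STOP_WORDS = {"and", "is", "the", "a", "of", "to"}

def keywords(text, min_count, min_length, stop_words=STOP_WORDS):
    """Return {word: count} for words seen more than min_count times
    and longer than min_length characters."""
    # Drop a trailing possessive so "John's" becomes "John", then keep letter runs.
    cleaned = re.sub(r"'s\b", "", text.lower())
    words = re.findall(r"[a-z]+", cleaned)
    counts = Counter(w for w in words if w not in stop_words)
    return {w: n for w, n in counts.items() if n > min_count and len(w) > min_length}

sample = "John's dog and the dog of John. The dog is John's."
result = keywords(sample, 1, 2)
print(result)
```

Both thresholds are exclusive here, following the "greater than" wording of the requirements.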

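Extra Credit

Keep quoted statements together (i.e. "They were apparently 'too good to be true'")
where 'too good to be true' would be the actual statement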
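One possible way to keep a quoted statement together is to let the tokenizer match a whole quoted span before it tries single words. The sketch below assumes double quotes delimit the statement (single quotes are ambiguous with apostrophes); `tokenize_keeping_quotes` is a hypothetical name.

```python
import re

def tokenize_keeping_quotes(text):
    """Split text into tokens, treating each double-quoted span as one token."""
    # First alternative captures the inside of a quoted span,
    # second alternative matches a single bare word.
    pattern = re.compile(r'"([^"]+)"|(\w+)')
    tokens = []
    for quoted, word in pattern.findall(text):
        tokens.append(quoted if quoted else word)
    return tokens

tokens = tokenize_keeping_quotes('They were apparently "too good to be true" again')
print(tokens)
```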

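Extra-Extra Credit

Can your script determine which words should be kept together, based on how often they are found together? This should be done without knowing the words beforehand. Example:
*"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has led to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."*
Clearly the word here is "fruit fly", which is easy for us to find. Can your search'n'scrape script determine this too?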
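A simple heuristic for this is to count adjacent word pairs and keep the pairs that recur. The Python sketch below is one assumed approach, not a definitive one: `frequent_pairs` and the `STOP_WORDS` set are hypothetical, and pairs containing a stop word are discarded so that "the fruit" does not compete with "fruit fly".

```python
import re
from collections import Counter

# Hypothetical stop-word list, used only to discard pairs such as "the fruit".
STOP_WORDS = {"the", "a", "is", "in", "to", "on", "and", "has", "be", "it"}

def frequent_pairs(text, min_count):
    """Count adjacent word pairs and return those seen at least min_count times."""
    pairs = Counter()
    # Pair words only within a sentence so a pair never spans a full stop.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        for first, second in zip(words, words[1:]):
            if first not in STOP_WORDS and second not in STOP_WORDS:
                pairs[(first, second)] += 1
    return {" ".join(p): n for p, n in pairs.items() if n >= min_count}

text = ("The fruit fly is a great thing when it comes to medical research. "
        "Much study has been done on the fruit fly in the past, and has led to "
        "many breakthroughs. In the future, the fruit fly will continue to be "
        "studied, but our methods may change.")
phrases = frequent_pairs(text, 3)
print(phrases)
```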

Source text: http://sampsonresume.com/labs/c.txt

Answer Format

It would be great to see the results of your code, the output, and how long the operation took.

score 11 · Accepted Answer

GNU scripting

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr

Results:

  7 be
  6 to
[...]
  1 2.
  1 -

Occurrence count greater than X:

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'

Return only words with a length greater than Y (put Y+1 dots in the second grep):

sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c

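Ignore common terms like "and, is, the, etc" (assuming the common terms are in the file 'ignored', one per line; -x and -F make grep match whole words literally instead of substrings):

sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vxFf ignored | sort | uniq -c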

Feel free to strip punctuation prior to processing (i.e. "John's" becomes "John"):

sed -e 's/[,.:"'\'']//g;s/ /\n/g' | grep -v '^ *$' | sort | uniq -c

Return results in a collection/array: it is already like an array for the shell: the first column is the count and the second is the word.
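For comparison, the `sort | uniq -c | sort -nr` pipeline corresponds roughly to the Python sketch below. It is only an illustration of the same count-then-rank idea; `count_words` is a hypothetical name, and words are split on any whitespace rather than on single spaces.

```python
from collections import Counter

def count_words(text):
    """Split on whitespace and return (count, word) pairs, most frequent first."""
    counts = Counter(text.split())
    return [(n, w) for w, n in counts.most_common()]

rows = count_words("to be or not to be")
for n, w in rows:
    # Same "count word" layout as the uniq -c output above.
    print(f"{n:7d} {w}")
```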

score 6 · Accepted Answer

Perl in only 43 characters.

perl -MYAML -anE'$_{$_}++for@F;say Dump\%_'

Here is an example of its use:

echo a a a b b c  d e aa | perl -MYAML -anE'$_{$_}++for@F;say Dump \%_'

---
a: 3
aa: 1
b: 2
c: 1
d: 1
e: 1

If you need to list only the lowercase versions, it takes two more characters.

perl -MYAML -anE'$_{lc$_}++for@F;say Dump\%_'

To make it work on the specified text takes 58 characters.

curl http://sampsonresume.com/labs/c.txt |
perl -MYAML -F'\W+' -anE'$_{lc$_}++for@F;END{say Dump\%_}'

real    0m0.679s
user    0m0.304s
sys     0m0.084s

Here is the last example, expanded a bit.

#! perl
use 5.010;
use YAML;

while( my $line = <> ){
  for my $elem ( split '\W+', $line ){
    $_{ lc $elem }++
  }
  END{
    say Dump \%_;
  }
}

score 4 · Accepted Answer

F#: 304 chars

let f =
    let bad = Set.of_seq ["and";"is";"the";"of";"are";"by";"it"]
    fun length occurrence msg ->
        System.Text.RegularExpressions.Regex.Split(msg, @"[^\w-']+")
        |> Seq.countBy (fun a -> a)
        |> Seq.choose (fun (a, b) -> if a.Length > length && b > occurrence && (not <| bad.Contains a) then Some a else None)

score 3 · Accepted Answer

C# 3.0 (with LINQ)

Here's my solution. It makes use of some pretty nice features of LINQ/extension methods to keep the code short.

public static Dictionary<string, int> GetKeywords(string text, int minCount, int minLength)
{
    var commonWords = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
        "for", "by", "an", "be", "may", "has", "can", "its"};
    var words = Regex.Replace(text.ToLower(), @"[,.?\/;:\(\)]", string.Empty).Split(' ');
    var occurrences = words.Distinct().Except(commonWords).Select(w =>
        new { Word = w, Count = words.Count(s => s == w) });
    return occurrences.Where(wo => wo.Count >= minCount && wo.Word.Length >= minLength)
        .ToDictionary(wo => wo.Word, wo => wo.Count);
}

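This is, however, far from the most efficient method: it is O(n^2) in the number of words rather than O(n), which I believe is optimal in this case. I'll see if I can create a slightly longer method that is more efficient.

Here are the results of the function run on the sample text (min occurrences: 3, min length: 2).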
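The difference between the two complexities can be illustrated with a short Python sketch (not the C# the answer uses). Both function names are hypothetical: the first counts in a single pass with a hash map, the second rescans the whole list for every distinct word, as the `Distinct()` plus `Count()` combination above does.

```python
def count_linear(words):
    """Count each word in one pass using a dict: O(n) expected time."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

def count_quadratic(words):
    """Rescan the whole list for every distinct word: O(n^2) in the worst case."""
    return {w: words.count(w) for w in set(words)}

words = "the cat and the hat and the bat".split()
linear = count_linear(words)
quadratic = count_quadratic(words)
print(linear == quadratic)
```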

  3 x such
  4 x code
  4 x which
  4 x declarations
  5 x function
  4 x statements
  3 x new
  3 x types
  3 x keywords
  7 x statement
  3 x language
  3 x expression
  3 x execution
  3 x programming
  4 x operators
  3 x variables

And my test program:

static void Main(string[] args)
{
    string sampleText;
    using (var client = new WebClient())
        sampleText = client.DownloadString("http://sampsonresume.com/labs/c.txt");
    var keywords = GetKeywords(sampleText, 3, 2);
    foreach (var entry in keywords)
        Console.WriteLine("{0} x {1}", entry.Value.ToString().PadLeft(3), entry.Key);
    Console.ReadKey(true);
}

score 3 · Accepted Answer

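#! perl
use strict;
use warnings;

my %words;    # word => number of occurrences (must be declared under strict)

while (<>) {
  for my $word (split) {
    $words{$word}++;
  }
}
for my $word (keys %words) {
  print "$word occurred $words{$word} times.\n";
}

That's it in its simple form. If you want sorting, filtering, etc.: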

while (<>) {
  for my $word (split) {
    $words{$word}++;
  }
}
for my $word (keys %words) {
  if ((length($word) >= $MINLEN) && ($words{$word} >= $MIN_OCCURRENCE)) {
    print "$word occurred $words{$word} times.\n";
  }
}

You can also sort the output pretty easily:

...
for my $word (keys %words) {
  if ((length($word) >= $MINLEN) && ($words{$word} >= $MIN_OCCURRENCE)) {
    push @output, "$word occurred $words{$word} times.\n";
  }
}
my $re = qr/occurred (\d+) /;
print sort {
  # capture the counts into lexicals; assigning to $a/$b would clobber the list elements
  my ($x) = $a =~ $re;
  my ($y) = $b =~ $re;
  $x <=> $y
} @output;

A true Perl hacker would easily get each of these onto a line or two, but I went for readability.

Edit: this is how I would rewrite the last example

...
for my $word (
  sort { $words{$b} <=> $words{$a} } keys %words   # most frequent first, so `last` below is safe
){
  next unless length($word) >= $MINLEN;
  last unless $words{$word} >= $MIN_OCCURRENCE;

  print "$word occurred $words{$word} times.\n";
}

Or, if I needed it to run faster, I might even write it like this:

for my $word_data (
  sort {
    $a->[1] <=> $b->[1] # numerical sort on count
  } grep {
    # remove values that are out of bounds
    length($_->[0]) >= $MINLEN &&      # word length
    $_->[1] >= $MIN_OCCURRENCE # count
  } map {
    # [ word, count ]
    [ $_, $words{$_} ]
  } keys %words
){
  my( $word, $count ) = @$word_data;
  print "$word occurred $count times.";
}

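It uses map for efficiency, grep to remove the extra elements, and of course sort to do the sorting. (It does so in that order.)

This is a slight variant of the Schwartzian transform.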
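The same map, grep, sort chain can be written out step by step in Python, which may make the order of operations easier to follow. This is a sketch under assumed names (`sort_by_count` and the sample counts are hypothetical), not a translation the answer itself provides.

```python
def sort_by_count(word_counts, min_len, min_count):
    """Decorate, filter, sort and undecorate, mirroring map / grep / sort."""
    # map: pair each word with its count up front, so the sort never looks it up again
    decorated = [(count, word) for word, count in word_counts.items()]
    # grep: drop words that are too short or too rare
    kept = [(c, w) for c, w in decorated if len(w) >= min_len and c >= min_count]
    # sort: numerical sort on the count
    kept.sort(key=lambda pair: pair[0])
    # undecorate back to (word, count)
    return [(w, c) for c, w in kept]

ranked = sort_by_count({"statement": 7, "code": 4, "if": 9, "new": 1}, 3, 2)
print(ranked)
```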

score 3 · Accepted Answer

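Ruby

When "minified", this implementation is 165 characters long. It uses Array#inject to supply a starting value (a Hash object with a default of 0) and then loops over the elements, which are rolled into the hash; the result is then selected by minimum frequency.

Note that I did not count the size of the words to skip, since that is an external constant. When the constant is counted too, the solution is 244 characters long.

Apostrophes and dashes are not stripped but included; their use modifies the word, so they cannot simply be stripped without removing all information beyond the symbol.

Implementation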

CommonWords = %w(the a an but and is not or as of to in for by be may has can its it's)
def get_keywords(text, minFreq=0, minLen=2)
  text.scan(/(?:\b)[a-z'-]{#{minLen},}(?=\b)/i).
    inject(Hash.new(0)) do |result,w|
      w.downcase!
      result[w] += 1 unless CommonWords.include?(w)
      result
    end.select { |k,n| n >= minFreq }
end

Test Rig

require 'net/http'

keywords = get_keywords(Net::HTTP.get('www.sampsonresume.com','/labs/c.txt'), 3)
keywords.sort.each { |name,count| puts "#{name} x #{count} times" }

Test Results

code x 4 times
declarations x 4 times
each x 3 times
execution x 3 times
expression x 4 times
function x 5 times
keywords x 3 times
language x 3 times
languages x 3 times
new x 3 times
operators x 4 times
programming x 3 times
statement x 7 times
statements x 4 times
such x 3 times
types x 3 times
variables x 3 times
which x 4 times

score 2 · Accepted Answer

C# code:

IEnumerable<KeyValuePair<String, Int32>> ProcessText(String text, int X, int Y)
{
    // common words, that will be ignored
    var exclude = new string[] { "and", "is", "the", "as", "of", "to", "or", "in", "for", "by", "an", "be", "may", "has", "can", "its" }.ToDictionary(word => word);
    // regular expression to find quoted text
    var regex = new Regex("\"[^\"]+\"", RegexOptions.Compiled);

    return
        // remove quoted text (it will be processed later)
        regex.Replace(text, "")
        // remove case dependency
        .ToLower()
        // split text by all these chars
        .Split(".,'\\/[]{}()`~@#$%^&*-=+?!;:<>| \n\r".ToCharArray())
        // add quoted text
        .Concat(regex.Matches(text).Cast<Match>().Select(match => match.Value))
        // group words by the word and count them
        .GroupBy(word => word, (word, words) => new KeyValuePair<String, Int32>(word, words.Count()))
        // apply filter(min word count and word length) and remove common words 
        .Where(pair => pair.Value >= X && pair.Key.Length >= Y && !exclude.ContainsKey(pair.Key));
}

Output of the ProcessText(text, 3, 2) call:

3 x languages
3 x such
4 x code
4 x which
3 x based
3 x each
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
3 x variables
7 x statement
4 x expression
3 x execution
3 x programming
3 x operators

score 2 · Accepted Answer

Another Python solution, at 247 chars. The actual code is a single, highly dense line of Python, 134 chars long, that computes the whole thing in one expression.

x=3;y=2;W="and is the as of to or in for by an be may has can its".split()
from itertools import groupby as gb
d=dict((w,l)for w,l in((w,len(list(g)))for w,g in
    gb(sorted(open("c.txt").read().lower().split())))
    if l>x and len(w)>y and w not in W)

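A much longer version, with plenty of comments for your reading pleasure:

# Thresholds (both exclusive): x is the minimum count, y the minimum word length.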
x = 3
y = 2

# Common words string split into a list by spaces.
Words = "and is the as of to or in for by an be may has can its".split()

# A function that groups equal adjacent strings in a list into
# (string, grouper) pairs. The grouper is an iterator over the occurrences (see below).
from itertools import groupby

# Reads the entire file, converts it to lower case and splits on whitespace 
# to create a list of words
sortedWords = sorted(open("c.txt").read().lower().split())

# Using the groupby function, groups equal words together.
# Since grouper is an iterator over the occurrences, we use len(list(grouper))
# to get the word count: first convert the iterator to a list, then
# take the length of the list.
wordCounts = ((word, len(list(grouper))) for word, grouper in groupby(sortedWords))

# Filters the words by number of occurrences, length and common words using
# yet another generator expression.
filteredWordCounts = ((word, count) for word, count in wordCounts if word not in Words and count > x and len(word) > y)

# Creates a dictionary from the list of tuples.
result = dict(filteredWordCounts)

print(result)

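The main trick here is using the itertools.groupby function to count the occurrences in a sorted list. I don't know if it really saves characters, but it does allow all the processing to happen in a single expression.

Results: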
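The trick can be isolated in a few lines. The sketch below uses a made-up sentence instead of c.txt and counts each group with `sum(1 for _ in group)`, which avoids building a temporary list; note that `groupby` only merges adjacent equal items, which is why the input has to be sorted first.

```python
from itertools import groupby

words = "the cat and the hat and the bat".split()

# sorted() puts equal words next to each other; groupby then yields one
# (word, iterator-of-occurrences) pair per run of equal words.
counts = {word: sum(1 for _ in group) for word, group in groupby(sorted(words))}
print(counts)
```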

{'function': 4, 'operators': 4, 'declarations': 4, 'which': 4, 'statement': 5}

score 1 · Accepted Answer

In C#:

Use LINQ, specifically groupby, then filter by group count, and return a flattened (selectmany) list.
Use LINQ, filter by length.
Use LINQ, filter with 'badwords'.Contains.

score 1 · Accepted Answer

REBOL

Verbose, perhaps, so definitely not a winner, but it gets the job done.

min-length: 0
min-count: 0

common-words: [ "a" "an" "as" "and" "are" "by" "for" "from" "in" "is" "it" "its" "the" "of" "or" "to" "until" ]

add-word: func [
    word [string!]
    /local
        count
        letter
        non-letter
        temp
        rules
        match
][    
    ; Strip out punctuation
    temp: copy {}
    letter: charset [ #"a" - #"z" #"A" - #"Z" #" " ]
    non-letter: complement letter
    rules: [
        some [
            copy match letter (append temp match)
            |
            non-letter
        ]
    ]
    parse/all word rules
    word: temp

    ; If we end up with nothing, bail
    if 0 == length? word [
        exit
    ]

    ; Check length
    if min-length > length? word [
        exit
    ]

    ; Ignore common words
    ignore: 
    if find common-words word [
        exit
    ]

    ; OK, its good. Add it.
    either found? count: select words word [
        words/(word): count + 1
    ][
        repend words [word 1]
    ]
]

rules: [
    some [
        {"}
        copy word to {"} (add-word word)
        {"}
        |
        copy word to { } (add-word word)
        { }
    ]
    end
]

words: copy []
parse/all read %c.txt rules

result: copy []
foreach word words [
    if string? word [
        count: words/:word
        if count >= min-count [
            append result word
        ]
    ]
]

sort result
foreach word result [ print word ]

Output is:

act
actions
all
allows
also
any
appear
arbitrary
arguments
assign
assigned
based
be
because
been
before
below
between
braces
branches
break
builtin
but
C
C like any other language has its blemishes Some of the operators have the wrong precedence some parts of the syntax could be better
call
called
calls
can
care
case
char
code
columnbased
comma
Comments
common
compiler
conditional
consisting
contain
contains
continue
control
controlflow
criticized
Cs
curly brackets
declarations
define
definitions
degree
delimiters
designated
directly
dowhile
each
effect
effects
either
enclosed
enclosing
end
entry
enum
evaluated
evaluation
evaluations
even
example
executed
execution
exert
expression
expressionExpressions
expressions
familiarity
file
followed
following
format
FORTRAN
freeform
function
functions
goto
has
high
However
identified
ifelse
imperative
include
including
initialization
innermost
int
integer
interleaved
Introduction
iterative
Kernighan
keywords
label
language
languages
languagesAlthough
leave
limit
lineEach
loop
looping
many
may
mimicked
modify
more
most
name
needed
new
next
nonstructured
normal
object
obtain
occur
often
omitted
on
operands
operator
operators
optimization
order
other
perhaps
permits
points
programmers
programming
provides
rather
reinitialization
reliable
requires
reserve
reserved
restrictions
results
return
Ritchie
say
scope
Sections
see
selects
semicolon
separate
sequence
sequence point
sequential
several
side
single
skip
sometimes
source
specify
statement
statements
storage
struct
Structured
structuresAs
such
supported
switch
syntax
testing
textlinebased
than
There
This
turn
type
types
union
Unlike
unspecified
use
used
uses
using
usually
value
values
variable
variables
variety
which
while
whitespace
widespread
will
within
writing

score 1 · Accepted Answer

Python (258 chars, including 66 chars for the first line and 30 chars for punctuation removal):

W="and is the as of to or in for by an be may has can its".split()
x=3;y=2;d={}
for l in open('c.txt') :
    for w in l.lower().translate(None,',.;\'"!()[]{}').split() :
        if w not in W: d[w]=d.get(w,0)+1
for w,n in d.items() :
    if n>y and len(w)>x : print n,w

Output:

4 code
3 keywords
3 languages
3 execution
3 each
3 language
4 expression
4 statements
3 variables
7 statement
5 function
4 operators
4 declarations
3 programming
4 which
3 such
3 types

score 0 · Accepted Answer

Here is my variant, in PHP:



$str = implode(file('c.txt'));
$tok = strtok($str, " .,;()\r\n\t");

$splitters = '\s.,\(\);?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );

foreach($array as $key) {
    $res[$key] = $res[$key]+1;
}

$splitters = '\s.,\(\)\{\};?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );

foreach($array as $key) {
    $res[$key] = $res[$key]+1;
}

unset($res['the']);
unset($res['and']);
unset($res['to']);
unset($res['of']);
unset($res['by']);
unset($res['a']);
unset($res['as']);
unset($res['is']);
unset($res['in']);
unset($res['']);

arsort($res);
//var_dump($res); // concordance
foreach ($res AS $word => $rarity)
    echo $word . ' <b>x</b> ' . $rarity . '<br/>';

foreach ($array as $word) { // words longer than n (=5)
//    if(strlen($word) > 5)echo $word.'<br/>';
}

And the output:

statement x 7
be x 7
C x 5
may x 5
for x 5
or x 5
The x 5
as x 5
expression x 4
statements x 4
code x 4
function x 4
which x 4
an x 4
declarations x 3
new x 3
execution x 3
types x 3
such x 3
variables x 3
can x 3
languages x 3
operators x 3
end x 2
programming x 2
evaluated x 2
functions x 2
definitions x 2
keywords x 2
followed x 2
contain x 2
several x 2
side x 2
most x 2
has x 2
its x 2
called x 2
specify x 2
reinitialization x 2
use x 2
either x 2
each x 2
all x 2
built-in x 2
source x 2
are x 2
storage x 2
than x 2
effects x 1
including x 1
arguments x 1
order x 1
even x 1
unspecified x 1
evaluations x 1
operands x 1
interleaved x 1
However x 1
value x 1
branches x 1
goto x 1
directly x 1
designated x 1
label x 1
non-structured x 1
also x 1
enclosing x 1
innermost x 1
loop x 1
skip x 1
There x 1
within x 1
switch x 1
Expressions x 1
integer x 1
variety x 1
see x 1
below x 1
will x 1
on x 1
selects x 1
case x 1
executed x 1
based x 1
calls x 1
from x 1
because x 1
many x 1
widespread x 1
familiarity x 1
C's x 1
mimicked x 1
Although x 1
reliable x 1
obtain x 1
results x 1
needed x 1
other x 1
syntax x 1
often x 1
Introduction x 1
say x 1
Programming x 1
Language x 1
C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better. x 1
Ritchie x 1
Kernighan x 1
been x 1
criticized x 1
For x 1
example x 1
care x 1
more x 1
leave x 1
return x 1
call x 1
&& x 1
|| x 1
entry x 1
include x 1
next x 1
before x 1
sequence point x 1
sequence x 1
points x 1
comma x 1
operator x 1
but x 1
compiler x 1
requires x 1
programmers x 1
exert x 1
optimization x 1
object x 1
This x 1
permits x 1
high x 1
degree x 1
occur x 1
Structured x 1
using x 1
struct x 1
union x 1
enum x 1
define x 1
Declarations x 1
file x 1
contains x 1
Function x 1
turn x 1
assign x 1
perhaps x 1
Keywords x 1
char x 1
int x 1
Sections x 1
name x 1
variable x 1
reserve x 1
usually x 1
writing x 1
type x 1
Each x 1
line x 1
format x 1
rather x 1
column-based x 1
text-line-based x 1
whitespace x 1
arbitrary x 1
FORTRAN x 1
77 x 1
free-form x 1
allows x 1
restrictions x 1
Comments x 1
C99 x 1
following x 1
// x 1
until x 1
*/ x 1
/* x 1
appear x 1
between x 1
delimiters x 1
enclosed x 1
braces x 1
supported x 1
if x 1
-else x 1
conditional x 1
Unlike x 1
reserved x 1
sequential x 1
provides x 1
control-flow x 1
identified x 1
do-while x 1
while x 1
any x 1
omitted x 1
break x 1
continue x 1
expressions x 1
testing x 1
iterative x 1
looping x 1
separate x 1
initialization x 1
normal x 1
modify x 1
control x 1
structures x 1
As x 1
imperative x 1
single x 1
act x 1
sometimes x 1
curly brackets x 1
limit x 1
scope x 1
language x 1
uses x 1
evaluation x 1
assigned x 1
values x 1
To x 1
effect x 1
semicolon x 1
actions x 1
common x 1
consisting x 1
used x 1

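The var_dump statement simply displays the concordance. This variant preserves double-quoted expressions.

For the supplied file, this code finishes in 0.047 seconds. Larger files, though, will consume a lot of memory (because of the file function).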
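One way around that memory cost is to count while reading line by line, so only the running totals stay in memory. The sketch below shows the idea in Python rather than PHP; `count_stream` is a hypothetical name, and an in-memory `io.StringIO` stands in for the open file.

```python
import io
from collections import Counter

def count_stream(lines):
    """Update the counts one line at a time instead of loading the whole text."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Any iterable of lines works, including an open file object.
stream_counts = count_stream(io.StringIO("a b a\nb a\n"))
print(stream_counts)
```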

score 0 · Accepted Answer

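This is not going to win any golfing awards, but it does keep quoted phrases together and takes stop words into account (and leverages the CPAN modules Lingua::StopWords and Text::ParseWords).

In addition, I use to_S from Lingua::EN::Inflect::Number to count only the singular forms of words.

You might also want to look at Lingua::CollinsParser.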

#!/usr/bin/perl

use strict; use warnings;

use Lingua::EN::Inflect::Number qw( to_S );
use Lingua::StopWords qw( getStopWords );
use Text::ParseWords;

my $stop = getStopWords('en');

my %words;

while ( my $line = <> ) {
    chomp $line;
    next unless $line =~ /\S/;
    next unless my @words = parse_line(' ', 1, $line);

    ++ $words{to_S $_} for
        grep { length and not $stop->{$_} }
        map { s!^[[:punct:]]+!!; s![[:punct:]]+\z!!; lc }
        @words;
}

print "=== only words appearing 4 or more times ===\n";
print "$_ : $words{$_}\n" for sort {
    $words{$b} <=> $words{$a}
} grep { $words{$_} > 3 } keys %words;

print "=== only words that are 12 characters or longer ===\n";
print "$_ : $words{$_}\n" for sort {
    $words{$b} <=> $words{$a}
} grep { 11 < length } keys %words;

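Output:

=== only words appearing 4 or more times ===
statement : 11
function : 7
expression : 6
may : 5
code : 4
variable : 4
operator : 4
declaration : 4
c : 4
type : 4
=== only words that are 12 characters or longer ===
reinitialization : 2
control-flow : 1
sequence point : 1
optimization : 1
curly brackets : 1
text-line-based : 1
non-structured : 1
column-based : 1
initialization : 1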
