delphi - Delphi phrase count / keyword density

Question

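Does anyone know how to count the number of unique phrases in a document, or have some code that does it? (Single words, two-word phrases, three-word phrases.)

Thanks

An example of what I am looking for: I have a text document and need to see which word phrases are the most popular. Example text: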

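I took the car to the car wash.

I: 1
took: 1
the: 2
car: 2
to: 1
wash: 1
I took: 1
took the: 1
the car: 2
car to: 1
to the: 1
car wash: 1
I took the: 1
took the car: 1
the car to: 1
car to the: 1
to the car: 1
the car wash: 1
I took the car to: 1
took the car to the: 1
car to the car: 1
car to the car wash: 1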

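I need each phrase, together with the number of times it occurs.

Any help would be appreciated. The closest thing I have found is a PHP script from http://tools.seobook.com/general/keyword-density/source.php

I used to have some code for this, but I can't find it.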

score 2 · Accepted Answer

Here is some initial code that can solve your problem.

function CountWordSequences(const s: string; Counts: TStrings = nil): TStrings;
var
  words, seqs: TStrings;
  nw, i, j: Integer;
  t: string;
begin
  // The caller owns the returned list, whether it was passed in or created here.
  if Counts = nil then Counts := TStringList.Create;
  seqs := TStringList.Create;
  try
    words := TStringList.Create;
    try
      words.DelimitedText := s;             // build a list of all words (default settings split on spaces and commas)
      for nw := 1 to words.Count do         // build a list of all word sequences
        for i := 0 to words.Count - nw do
        begin
          t := '';
          for j := 0 to nw - 1 do
          begin
            t := t + words[i + j];
            if j <> nw - 1 then t := t + ' ';
          end;
          seqs.Add(t);
        end;
    finally
      words.Free;
    end;
    for i := 0 to seqs.Count - 1 do         // count repeated sequences
    begin
      j := Counts.IndexOf(seqs[i]);
      if j = -1 then
        Counts.AddObject(seqs[i], TObject(1))   // the count is stored in the Objects slot
      else
        // on 64-bit targets, cast through NativeInt instead of Integer
        Counts.Objects[j] := TObject(Succ(Integer(Counts.Objects[j])));
    end;
  finally
    seqs.Free;
  end;
  Result := Counts;
end;

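You will need to elaborate on this code for real production use, for example by recognizing more word delimiters (not just spaces) and by implementing some kind of case insensitivity.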
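As a minimal sketch of that kind of elaboration, the console program below splits text on anything that is not a letter, digit or apostrophe, and lower-cases each word. `SplitWords` and `WordChars` are hypothetical names, not part of the answer's code, and the sketch assumes Delphi 2009 or later (for `CharInSet`) and ASCII-only word characters.

```pascal
program TokenizeDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  // Assumption: only ASCII letters, digits and the apostrophe belong to a word.
  WordChars = ['A'..'Z', 'a'..'z', '0'..'9', ''''];

// Hypothetical helper: fills Words with the lower-cased words of s,
// treating every character outside WordChars as a separator.
procedure SplitWords(const s: string; Words: TStrings);
var
  i, start: Integer;
begin
  Words.Clear;
  start := 0;                          // 0 means "not inside a word"
  for i := 1 to Length(s) + 1 do       // one step past the end flushes the last word
    if (i <= Length(s)) and CharInSet(s[i], WordChars) then
    begin
      if start = 0 then
        start := i;                    // a new word begins here
    end
    else if start <> 0 then
    begin
      Words.Add(LowerCase(Copy(s, start, i - start)));
      start := 0;
    end;
end;

procedure Demo;
var
  Words: TStringList;
  i: Integer;
begin
  Words := TStringList.Create;
  try
    SplitWords('I took the car to the Car-Wash.', Words);
    for i := 0 to Words.Count - 1 do
      WriteLn(Words[i]);               // one lower-cased word per line
  finally
    Words.Free;
  end;
end;

begin
  Demo;
end.
```

With this kind of splitting the trailing period no longer sticks to "wash", and "Car" and "car" end up as the same token. The resulting list could stand in for the `words.DelimitedText := s` step in CountWordSequences.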

To test CountWordSequences, put a Button, an Edit (entry field) and a Memo on a Form, and add the following code.

procedure TForm1.Button1Click(Sender: TObject);
var
  i: Integer;
  l: TStrings;
begin
  l := CountWordSequences(Edit1.Text, TStringList.Create);
  try
    for i := 0 to l.Count - 1 do
      Memo1.Lines.Add('"' + l[i] + '": ' + IntToStr(Integer(l.Objects[i])));
  finally
    l.Free;   // the caller owns the returned list
  end;
end;

I first tried it with: I took the car to the car wash.

This gives:

"I": 1
"took": 1
"the": 2
"car": 2
"to": 1
"wash.": 1
"I took": 1
"took the": 1
"the car": 2
"car to": 1
"to the": 1
"car wash.": 1
"I took the": 1
"took the car": 1
"the car to": 1
"car to the": 1
"to the car": 1
"the car wash.": 1
"I took the car": 1
"took the car to": 1
"the car to the": 1
"car to the car": 1
"to the car wash.": 1
"I took the car to": 1
"took the car to the": 1
"the car to the car": 1
"car to the car wash.": 1
"I took the car to the": 1
"took the car to the car": 1
"the car to the car wash.": 1
"I took the car to the car": 1
"took the car to the car wash.": 1
"I took the car to the car wash.": 1

score 0 · Answer

From the Delphi Basics website.

var
  position : Integer;

begin
  // Look for the word 'Cat' in a sentence
  // Note : that this search is case sensitive, so that
  //        the first 'cat' is not matched
  position := AnsiPos('Cat', 'The cat sat on the Cat mat');
  if position = 0
  then ShowMessage('''Cat'' not found in the sentence')
  else ShowMessage('''Cat'' was found at character '+IntToStr(position));
end;

Maybe this helps.
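AnsiPos only reports the first match, so counting a phrase needs a loop. The sketch below uses `PosEx` from the StrUtils unit to keep searching after each hit; `CountOccurrences` is a hypothetical helper name, and the comparison is made case-insensitive by lower-casing both strings first, which is an assumption about what is wanted here.

```pascal
program CountOccurrencesDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, StrUtils;

// Hypothetical helper: counts non-overlapping, case-insensitive occurrences
// of Phrase in Text. It matches raw substrings, so 'cat' is also found
// inside 'category'.
function CountOccurrences(const Phrase, Text: string): Integer;
var
  p: Integer;
  lowPhrase, lowText: string;
begin
  Result := 0;
  if Phrase = '' then
    Exit;                                   // an empty phrase never matches
  lowPhrase := AnsiLowerCase(Phrase);
  lowText := AnsiLowerCase(Text);
  p := PosEx(lowPhrase, lowText, 1);
  while p > 0 do
  begin
    Inc(Result);
    // continue the search just past the match that was found
    p := PosEx(lowPhrase, lowText, p + Length(lowPhrase));
  end;
end;

begin
  // Both 'cat' and 'Cat' match, so this prints 2.
  WriteLn(CountOccurrences('Cat', 'The cat sat on the Cat mat'));
end.
```

This only counts a phrase that is already known; it does not discover which phrases exist, so it would have to be combined with a phrase list built some other way.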

score 0 · Answer

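The number of possible combinations grows very quickly. Assuming a language has about 30,000 words in mainstream use, the number of three-word combinations is on the order of 30000^3 (about 2.7 × 10^13).

Anyway, a zeroth-order implementation would be to build a (hashed) word list and, if needed, filter out very common words (the, of, etc.) to reduce the number of phrases. Other things you might want to do are reducing plurals to singular, removing trailing commas, normalizing case, and so on.

Then walk the text word by word (tokenizer style), skipping the common words, and simply keep a sorted list of the phrases encountered together with their counts. And hope you don't run out of memory, since Delphi had no 64-bit version at the time of writing (64-bit Windows support arrived later, with Delphi XE2) :)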
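A minimal sketch of that walk, assuming Delphi 2009 or later so that `TDictionary` from Generics.Collections can serve as the hashed list. The stop list, the sample sentence and the `MaxWords` limit are illustrative assumptions, not taken from the answer.

```pascal
program StopWordDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, Generics.Collections;

const
  MaxWords = 3;   // longest phrase to count, as asked for in the question

procedure Demo;
var
  StopWords, Words, Kept: TStringList;
  Counts: TDictionary<string, Integer>;
  Pair: TPair<string, Integer>;
  phrase: string;
  i, n, c: Integer;
begin
  StopWords := TStringList.Create;
  Words := TStringList.Create;
  Kept := TStringList.Create;
  Counts := TDictionary<string, Integer>.Create;
  try
    StopWords.CommaText := 'the,of,to,a,and';   // illustrative only, not a real stop list
    StopWords.Sorted := True;                   // lets IndexOf use a binary search
    Words.DelimitedText := 'i took the car to the car wash';

    // Walk the tokens once and drop the very common words.
    for i := 0 to Words.Count - 1 do
      if StopWords.IndexOf(Words[i]) < 0 then
        Kept.Add(Words[i]);

    // Count every phrase of 1..MaxWords words that starts at position i.
    for i := 0 to Kept.Count - 1 do
    begin
      phrase := '';
      for n := 0 to MaxWords - 1 do
      begin
        if i + n >= Kept.Count then
          Break;
        if n > 0 then
          phrase := phrase + ' ';
        phrase := phrase + Kept[i + n];
        if not Counts.TryGetValue(phrase, c) then
          c := 0;
        Counts.AddOrSetValue(phrase, c + 1);
      end;
    end;

    // A TDictionary has no defined order; sort the entries if order matters.
    for Pair in Counts do
      WriteLn('"', Pair.Key, '": ', Pair.Value);
  finally
    Counts.Free;
    Kept.Free;
    Words.Free;
    StopWords.Free;
  end;
end;

begin
  Demo;
end.
```

One consequence of dropping stop words before building phrases is that phrases bridge the gaps: here "took" and "car" become neighbours, so "took car" is counted even though the text says "took the car". Whether that is acceptable depends on what the counts are used for.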

Doesn't Knuth have a whole book on combinatorics?

score 0 · Answer

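Here is how I would approach the problem. Assume that each pass over the data file creates a new data file for the next step. The control character mentioned below can be any character that does not occur naturally in the data. When you write a control character, never write two in a row.

1. Run through your document and count each word individually.
2. Run through your document again and replace any word that is used only once with the control character, adding the pairs that occur to a new list (the words A B C become item AB and item BC). The control character acts as a hard delimiter. Any word that stands alone between control characters should also be converted to a control character, since it cannot be turned into a pair.
3. Run through your document again and replace any pair that is used only once with the control character, adding any triples that occur to a new list. Convert pairs that stand alone between control characters into a control character.

Repeat, adding one more word per level with each new list, until you get an empty list or reach the longest phrase you want to support.

This approach means that your most-used phrases will never contain less-used smaller phrases.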
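The passes above can be sketched in memory rather than with intermediate data files. This is only one reading of the approach: an empty string plays the role of the control character, `Phrase[i]` holds the current-level phrase that starts at word i, and a phrase grows by one word only if it and its right-hand neighbour both occurred more than once. All names, the sample sentence and `MaxWords` are hypothetical, and Delphi 2009 or later is assumed for `NativeInt`.

```pascal
program PrunedPhraseDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  MaxWords = 3;   // longest phrase level to process

procedure Demo;
var
  Words, Counts: TStringList;
  Phrase: array of string;   // '' stands in for the control character
  i, n, idx: Integer;
begin
  Words := TStringList.Create;
  Counts := TStringList.Create;
  try
    Counts.Sorted := True;   // required for Find (binary search)
    Words.DelimitedText := 'the car wash cleaned the car and the car wash closed';
    SetLength(Phrase, Words.Count);
    for i := 0 to Words.Count - 1 do
      Phrase[i] := Words[i];

    for n := 1 to MaxWords do
    begin
      // Count every phrase that survived to this level.
      Counts.Clear;
      for i := 0 to High(Phrase) do
        if Phrase[i] <> '' then
        begin
          if Counts.Find(Phrase[i], idx) then
            Counts.Objects[idx] := TObject(NativeInt(Counts.Objects[idx]) + 1)
          else
            Counts.AddObject(Phrase[i], TObject(1));
        end;

      // Report the repeated phrases of this level.
      for i := 0 to Counts.Count - 1 do
        if NativeInt(Counts.Objects[i]) > 1 then
          WriteLn(n, ' word(s): "', Counts[i], '" x', NativeInt(Counts.Objects[i]));

      // Replace phrases that were used only once with the control value.
      for i := 0 to High(Phrase) do
        if (Phrase[i] <> '') and Counts.Find(Phrase[i], idx)
          and (NativeInt(Counts.Objects[idx]) = 1) then
          Phrase[i] := '';

      // Build the next level in place, left to right: the phrase at i grows
      // by one word only if the phrases at i and i + 1 both survived.
      for i := 0 to High(Phrase) do
        if (i < High(Phrase)) and (Phrase[i] <> '') and (Phrase[i + 1] <> '') then
          Phrase[i] := Phrase[i] + ' ' + Words[i + n]
        else
          Phrase[i] := '';
    end;
  finally
    Counts.Free;
    Words.Free;
  end;
end;

begin
  Demo;
end.
```

For the sample sentence, "cleaned", "and" and "closed" occur once and act as hard delimiters, so the only three-word phrase that survives is "the car wash". Because singletons are pruned at every level, far fewer candidate phrases are ever built than when every sequence is enumerated.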
