<s> an evolutionary immune network for data clustering </s>
<s> an evolutionary immune network for data clustering </s>
<s> inet an extensible framework for simulating immune network </s>
<s> immunity based systems a survey </s>
<s> a recommender system based on the immune network </s>

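I'm working in MATLAB. These sentences come from a text file; I want to read them line by line, extract each word, and count how often each word occurs. How can I use the "regexp" function to extract the words?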


2 Answers


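I take it the strings '<s>' and '</s>' actually appear in your text file. If so, splitting on spaces is of course not enough; you have to return every occurrence of '<s>', '</s>', or a run of consecutive word characters. Using \w+ rather than \w* keeps the pattern from ever matching an empty string:

regexp(F, '<s>|</s>|\w+', 'match');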

Full code:

% Read file contents
fid = fopen('test.txt','r');
F = fread(fid, '*char').';
fclose(fid);

% Extract the <s> and </s> markers and all words
% (\w+ instead of \w* so that no zero-length match is possible)
C = regexp(F, '<s>|</s>|\w+', 'match');

% Find word frequencies
words  = unique(C);
counts = cellfun(@(x)sum(strcmp(x,C)), words);

% Group them together for display
freq = [num2cell(counts.') words.']
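
The cellfun/strcmp count above compares every unique word against the whole token list, which can get slow for large files. As a hedged alternative sketch, the third output of unique can be passed to accumarray to get the same counts in one pass. The token list C below is a small made-up stand-in for the output of regexp, not data from the actual file:

```matlab
% C stands in for the token list returned by regexp (sample values, assumed)
C = {'<s>', 'an', 'immune', 'network', '</s>', ...
     '<s>', 'the', 'immune', 'network', '</s>'};

% idx(k) is the position of C{k} within the sorted unique list "words"
[words, ~, idx] = unique(C);

% Add 1 to the bin of each token: counts(j) = occurrences of words{j}
counts = accumarray(idx(:), 1);

% Group counts and words together for display
freq = [num2cell(counts(:)) words(:)]
```

The (:) indexing forces column vectors, so the result does not depend on whether unique returns rows or columns.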
answered 2013-09-25T21:47:05.403

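The reason </s><s> is treated as a single word is that you read the whole file at once and split only on spaces, rather than on both newlines and spaces.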

Instead, read the file line by line with fgets, split each line separately, and increment the token counts as you go.
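
A minimal sketch of that line-by-line approach, under a few assumptions: it uses fgetl (the variant of fgets that drops the trailing newline), keeps the counts in a containers.Map, and discards the <s> and </s> markers, which may or may not be what you want. The temporary file only stands in for your real input and holds two of the sample lines from the question:

```matlab
% Write two sample lines from the question to a temporary file
% (stand-in for the real input file; the name is generated, not real)
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', ...
    '<s> an evolutionary immune network for data clustering </s>', ...
    '<s> a recommender system based on the immune network </s>');
fclose(fid);

% Map from word to number of occurrences
wordCounts = containers.Map('KeyType', 'char', 'ValueType', 'double');

fid = fopen(fname, 'r');
tline = fgetl(fid);                 % returns -1 (not a char) at end of file
while ischar(tline)
    % Split this line on whitespace only
    tokens = regexp(tline, '\S+', 'match');
    % Assumption: the sentence markers should not be counted as words
    tokens = tokens(~ismember(tokens, {'<s>', '</s>'}));
    for k = 1:numel(tokens)
        w = tokens{k};
        if isKey(wordCounts, w)
            wordCounts(w) = wordCounts(w) + 1;
        else
            wordCounts(w) = 1;
        end
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(fname);                      % clean up the temporary file
```

Afterwards, keys(wordCounts) and values(wordCounts) return the words and their frequencies in matching order.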

answered 2013-09-25T20:00:09.060