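1. Add a method to split the text
Since both UniqueCount and SplitWords will work on the list of words extracted from the original text, it makes sense to create a function for that.
This method takes a string containing the text you want to process and returns a string array with the words it contains.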
private string[] GetWords(string text)
{
return text.Split(new char[] {' '}, StringSplitOptions.RemoveEmptyEntries);
}
2. Write the functions that process the array
Counting unique words:
private int UniqueCount(string[] words)
{
var foundWords = new List<string>();
foreach (var word in words)
{
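// A new name is needed here: redeclaring the loop variable "word" does not compile.
string lowered = word.ToLower();
if (!foundWords.Contains(lowered))
{
foundWords.Add(lowered);
}
}
// List<T> exposes Count, not Length.
return foundWords.Count;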
}
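As a side note, the list-based loop above scans the whole list for every word. A minimal sketch of an alternative, assuming a HashSet<string> is acceptable here: the set ignores duplicates on Add, and the comparer handles case. The names WordStats and UniqueCountWithSet are hypothetical, and StringComparer.OrdinalIgnoreCase is my choice, which can differ from ToLower for some culture-specific characters.

```csharp
using System;
using System.Collections.Generic;

static class WordStats
{
    // Hypothetical alternative to UniqueCount (not part of the original answer).
    // Add() silently skips words already in the set; the comparer makes the
    // comparison case-insensitive, so no explicit ToLower call is needed.
    public static int UniqueCountWithSet(string[] words)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (var word in words)
        {
            seen.Add(word);
        }
        return seen.Count;
    }
}
```

For example, WordStats.UniqueCountWithSet(new[] { "Este", "este", "texto" }) counts "Este" and "este" as the same word.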
Counting the total number of words:
private int Count(string[] words)
{
return words.Length;
}
For the lexical density:
private double CalculateLexicalDensity(string[] words)
{
return ((double)UniqueCount(words) / (double)Count(words));
}
Note: none of these update the labels; I want to separate that concern into another method.
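One edge case worth keeping in mind: for an empty text both counts are 0, and dividing 0.0 by 0.0 as doubles yields NaN rather than throwing. A small sketch of a guard follows; the names DensityGuard and SafeLexicalDensity are hypothetical, and returning 0 for empty input is an assumption about what the UI should show.

```csharp
static class DensityGuard
{
    // Hypothetical helper (not part of the original answer):
    // returns 0 when there are no words, instead of letting 0.0 / 0.0 produce NaN.
    public static double SafeLexicalDensity(int uniqueWordCount, int wordCount)
    {
        if (wordCount == 0)
        {
            return 0.0;
        }
        return (double)uniqueWordCount / wordCount;
    }
}
```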
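3. Create a method that updates the labels
This method calls the other methods and updates the labels.
Note: I firmly believe fbStatus should be a parameter.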
private void UpdateLabels(string fbStatus)
{
var words = GetWords(fbStatus);
label_totalwordcount.Text = Count(words).ToString();
label_totaluniquewords.Text = UniqueCount(words).ToString();
label_lexicaldensity.Text = (CalculateLexicalDensity(words) * 100).ToString() + "%";
}
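4. Get rid of the redundant calculations
For this we have a few options:
4.A. Mix the concerns again:
In this case I would merge the CalculateLexicalDensity method into UpdateLabels, so that UniqueCount and Count are not each executed twice.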
private void UpdateLabels(string fbStatus)
{
var words = GetWords(fbStatus);
int wordCount = Count(words);
int uniqueWordCount = UniqueCount(words);
double lexicalDensity = ((double)uniqueWordCount / (double)wordCount);
label_totalwordcount.Text = wordCount.ToString();
label_totaluniquewords.Text = uniqueWordCount.ToString();
label_lexicaldensity.Text = (lexicalDensity * 100).ToString() + "%";
}
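4.B. Use a tuple as the return type:
In this case I would merge Count, UniqueCount and CalculateLexicalDensity into a single method, which (again) avoids executing UniqueCount and Count twice. Since this method needs to return three values, it returns a tuple (it could also be a custom type).
private void UpdateLabels(string fbStatus)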
{
var words = GetWords(fbStatus);
var info = Process(words);
label_totalwordcount.Text = info.Item1.ToString();
label_totaluniquewords.Text = info.Item2.ToString();
label_lexicaldensity.Text = (info.Item3 * 100).ToString() + "%";
}
private Tuple<int, int, double> Process(string[] words)
{
int wordCount = Count(words);
int uniqueWordCount = UniqueCount(words);
double lexicalDensity = ((double)uniqueWordCount / (double)wordCount);
return new Tuple<int, int, double>(wordCount, uniqueWordCount, lexicalDensity);
}
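If the project can target C# 7.0 or later, a value tuple with named elements makes the call site easier to read than Item1, Item2 and Item3. The following is only a sketch under that assumption; the class name TupleSketch is hypothetical, and the unique-word helper is copied in so the block stands on its own.

```csharp
using System.Collections.Generic;

static class TupleSketch
{
    // Assumes C# 7.0+ (value tuples). The element names replace Item1/Item2/Item3.
    public static (int WordCount, int UniqueWordCount, double LexicalDensity) Process(string[] words)
    {
        int wordCount = words.Length;
        int uniqueWordCount = UniqueCount(words);
        double lexicalDensity = (double)uniqueWordCount / wordCount;
        return (wordCount, uniqueWordCount, lexicalDensity);
    }

    // Same list-based, case-insensitive approach as the UniqueCount method in step 2.
    private static int UniqueCount(string[] words)
    {
        var foundWords = new List<string>();
        foreach (var word in words)
        {
            string lowered = word.ToLower();
            if (!foundWords.Contains(lowered))
            {
                foundWords.Add(lowered);
            }
        }
        return foundWords.Count;
    }
}
```

The caller can then write info.WordCount, info.UniqueWordCount and info.LexicalDensity, or deconstruct the result into three local variables.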
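Since the tuple option keeps the concerns separate, I prefer it over 4.A. However, if you can't (or don't want to) use a tuple, you can use a custom type instead, and for that case I prefer a struct.
4.C. Use a struct as the return type: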
struct LexicalInfo
{
public int WordCount;
public int UniqueWordCount;
public double LexicalDensity;
}
With this struct, the code becomes:
private void UpdateLabels(string fbStatus)
{
var words = GetWords(fbStatus);
var info = Process(words);
label_totalwordcount.Text = info.WordCount.ToString();
label_totaluniquewords.Text = info.UniqueWordCount.ToString();
label_lexicaldensity.Text = (info.LexicalDensity * 100).ToString() + "%";
}
private LexicalInfo Process(string[] words)
{
int wordCount = Count(words);
int uniqueWordCount = UniqueCount(words);
double lexicalDensity = ((double)uniqueWordCount / (double)wordCount);
return new LexicalInfo()
{
WordCount = wordCount,
UniqueWordCount = uniqueWordCount,
LexicalDensity = lexicalDensity
};
}
Furthermore, if we are going to use a struct...
4.D. Use the struct to do the calculations:
Note: in this case it could also be a class.
struct LexicalInfo
{
private int wordCount;
private int uniqueWordCount;
public LexicalInfo(string text)
{
var words = GetWords(text);
wordCount = Count(words);
uniqueWordCount = UniqueCount(words);
}
// The helpers are static: before C# 11, a struct constructor cannot call
// instance methods until every field has been assigned.
private static string[] GetWords(string text)
{
return text.Split(new char[] {' '}, StringSplitOptions.RemoveEmptyEntries);
}
private static int UniqueCount(string[] words)
{
var foundWords = new List<string>();
foreach (var word in words)
{
string lowered = word.ToLower();
if (!foundWords.Contains(lowered))
{
foundWords.Add(lowered);
}
}
return foundWords.Count;
}
private static int Count(string[] words)
{
return words.Length;
}
public int WordCount
{
get
{
return wordCount;
}
}
public int UniqueWordCount
{
get
{
return uniqueWordCount;
}
}
public double LexicalDensity
{
get
{
return ((double)uniqueWordCount / (double)wordCount);
}
}
}
//----
private void UpdateLabels(string fbStatus)
{
var info = new LexicalInfo(fbStatus);
label_totalwordcount.Text = info.WordCount.ToString();
label_totaluniquewords.Text = info.UniqueWordCount.ToString();
label_lexicaldensity.Text = (info.LexicalDensity * 100).ToString() + "%";
}
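5. Optimize
I will take the final code (the version that uses the struct to do the calculations) and work on it.
We have two methods with only one line each (GetWords and Count); I will get rid of them and replace the calls with the method bodies: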
struct LexicalInfo
{
private int wordCount;
private int uniqueWordCount;
public LexicalInfo(string text)
{
var words = text.Split(new char[] {' '}, StringSplitOptions.RemoveEmptyEntries);
wordCount = words.Length;
uniqueWordCount = UniqueCount(words);
}
// Static so the constructor can call it before all fields are assigned.
private static int UniqueCount(string[] words)
{
var foundWords = new List<string>();
foreach (var word in words)
{
string lowered = word.ToLower();
if (!foundWords.Contains(lowered))
{
foundWords.Add(lowered);
}
}
return foundWords.Count;
}
public int WordCount
{
get
{
return wordCount;
}
}
public int UniqueWordCount
{
get
{
return uniqueWordCount;
}
}
public double LexicalDensity
{
get
{
return ((double)uniqueWordCount / (double)wordCount);
}
}
}
//----
private void UpdateLabels(string fbStatus)
{
var info = new LexicalInfo(fbStatus);
label_totalwordcount.Text = info.WordCount.ToString();
label_totaluniquewords.Text = info.UniqueWordCount.ToString();
label_lexicaldensity.Text = (info.LexicalDensity * 100).ToString() + "%";
}
6. Linq?
If we can use Linq (which needs using System.Linq;), UniqueCount can be replaced with a single line:
struct LexicalInfo
{
private int wordCount;
private int uniqueWordCount;
public LexicalInfo(string text)
{
var words = text.Split(new char[] {' '}, StringSplitOptions.RemoveEmptyEntries);
wordCount = words.Length;
uniqueWordCount = words.Distinct().Count();
}
public int WordCount
{
get
{
return wordCount;
}
}
public int UniqueWordCount
{
get
{
return uniqueWordCount;
}
}
public double LexicalDensity
{
get
{
return ((double)uniqueWordCount / (double)wordCount);
}
}
}
//----
private void UpdateLabels(string fbStatus)
{
var info = new LexicalInfo(fbStatus);
label_totalwordcount.Text = info.WordCount.ToString();
label_totaluniquewords.Text = info.UniqueWordCount.ToString();
label_lexicaldensity.Text = (info.LexicalDensity * 100).ToString() + "%";
}
7. Test and fix
I tested with the following text:
ESTE ES UN TEXTO QUE HE ESCRITO EN ESPAÑOL. ESTE TEXTO FUE ESCRITO PARA DEMOSTRACIÓN. ESTE TEXTO REPITE ALGUNAS DE SUS PALABRAS Y ALGUNAS OTRAS NO.
The output was:
WordCount = 28
UniqueWordCount = 21
LexicalDensity = 75%
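However, inspecting the code shows that punctuation is being counted as part of the word (that is, the code treats ESPAÑOL and ESPAÑOL. as two different words because of the trailing period).
A quick fix is to use a regular expression (from System.Text.RegularExpressions), replacing the constructor of LexicalInfo with: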
public LexicalInfo(string text)
{
var words = from match in (new Regex(@"\w+")).Matches(text).Cast<Match>() select match.Value;
wordCount = words.Count();
uniqueWordCount = words.Distinct().Count();
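// Debug output: join the distinct words, since printing the array directly only shows its type name.
Console.WriteLine(string.Join(", ", words.Distinct().ToArray()));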
}
After the change, the output is:
WordCount = 28
UniqueWordCount = 20
LexicalDensity = 71.4285714285714%
You may want to format LexicalDensity, for example by changing the following line:
label_lexicaldensity.Text = (info.LexicalDensity * 100).ToString() + "%";
to this:
label_lexicaldensity.Text = string.Format("{0:P2}", info.LexicalDensity);
which produces:
WordCount = 28
UniqueWordCount = 20
LexicalDensity = 71.43 %
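Note: the result of string.Format depends on the culture it runs under. If you don't want the output to vary with the current culture, you can specify one, for example InvariantCulture (the format provider is passed as the first argument):
label_lexicaldensity.Text = string.Format(CultureInfo.InvariantCulture, "{0:P2}", info.LexicalDensity);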
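To make the culture dependence concrete, here is a small sketch that wraps the formatting in a helper taking the culture as a parameter. The names PercentFormatting and FormatDensity are hypothetical. The exact output (decimal separator, spacing around the percent sign) depends on the culture data of the runtime, so no output is claimed here.

```csharp
using System;
using System.Globalization;

static class PercentFormatting
{
    // Hypothetical helper: formats a ratio such as 0.714 as a percentage with two decimals.
    // The IFormatProvider comes first; the format string and its arguments follow.
    public static string FormatDensity(double density, CultureInfo culture)
    {
        return string.Format(culture, "{0:P2}", density);
    }
}
```

Calling PercentFormatting.FormatDensity(info.LexicalDensity, CultureInfo.InvariantCulture) gives the same result on every machine, while passing CultureInfo.CurrentCulture follows the user's regional settings. An equivalent form is density.ToString("P2", culture).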
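With another test text I found that I had lost the ability to detect capitalization. The text is:
Este es otro texto escrito en español, el objetivo de este texto es probar las mayúsculas al repetir texto.
In this case the code treats Este and este as two different words. This is another easy fix with Linq; change this line: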
uniqueWordCount = words.Distinct().Count();
to this:
uniqueWordCount = (from word in words select word.ToLower()).Distinct().Count();
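Putting the regular expression fix and the lowercasing fix together, the struct could end up looking roughly like the sketch below. This is one possible assembly of the pieces above, not the definitive version: the readonly fields and the ToList() call (which materializes the words so the query is not re-run for each count) are my additions.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

struct LexicalInfo
{
    private readonly int wordCount;
    private readonly int uniqueWordCount;

    public LexicalInfo(string text)
    {
        // \w+ matches runs of letters, digits and underscores, so punctuation is dropped.
        // Lowercasing here makes "Este" and "este" count as the same word.
        var words = Regex.Matches(text, @"\w+")
            .Cast<Match>()
            .Select(match => match.Value.ToLower())
            .ToList();
        wordCount = words.Count;
        uniqueWordCount = words.Distinct().Count();
    }

    public int WordCount { get { return wordCount; } }

    public int UniqueWordCount { get { return uniqueWordCount; } }

    // Note: for an empty text this is 0.0 / 0, which evaluates to NaN.
    public double LexicalDensity
    {
        get { return (double)uniqueWordCount / wordCount; }
    }
}
```

For the input "Este texto repite texto." this yields 4 words, 3 unique words and a lexical density of 0.75.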