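I recently discovered n-grams and the neat possibility of comparing the frequency of phrases in a body of text against them. Now I am trying to build a simple VB.NET application that takes a body of text and returns a list of the most frequently used phrases (where n >= 2).

I found a C# example of how to generate n-grams from a body of text, so I started by converting the code to VB. The problem is that this code creates one gram per character rather than one per word. The delimiters I want to use for words are: vbCrLf (new line), vbTab (tab), and the following characters: !@#$%^&*()_+-={}|\:\"'?¿ /.,<>'¡º×÷';«»[]

Does anyone know how I can rewrite the following function for this purpose?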

Friend Shared Function GenerateNGrams(ByVal text As String, ByVal gramLength As Integer) As String()
    If text Is Nothing OrElse text.Length = 0 Then
        Return Nothing
    End If

    Dim grams As New ArrayList()
    Dim length As Integer = text.Length
    If length < gramLength Then
        Dim gram As String
        For i As Integer = 1 To length
            gram = text.Substring(0, (i) - (0))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next

        gram = text.Substring(length - 1, (length) - (length - 1))
        If grams.IndexOf(gram) = -1 Then
            grams.Add(gram)

        End If
    Else
        For i As Integer = 1 To gramLength - 1
            Dim gram As String = text.Substring(0, (i) - (0))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)

            End If
        Next

        For i As Integer = 0 To (length - gramLength)
            Dim gram As String = text.Substring(i, (i + gramLength) - (i))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next

        For i As Integer = (length - gramLength) + 1 To length - 1
            Dim gram As String = text.Substring(i, (length) - (i))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next
    End If
    Return Tokeniser.ArrayListToArray(grams)
End Function

2 Answers

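An n-gram of words is simply a list of length n that stores those words. A list of n-grams is then just a list of word lists. If you want to store frequencies, you need a dictionary that is indexed by these n-grams. For the special case of 2-grams, you could imagine something like this: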

Dim frequencies As New Dictionary(Of String(), Integer)(New ArrayComparer(Of String)())
Const separators as String = "!@#$%^&*()_+-={}|\:""'?¿/.,<>’¡º×÷‘;«»[] " & _
                             ControlChars.CrLf & ControlChars.Tab
Dim words = text.Split(separators.ToCharArray(), StringSplitOptions.RemoveEmptyEntries)

For i As Integer = 0 To words.Length - 2
    Dim ngram = New String() { words(i), words(i + 1) }
    Dim oldValue As Integer = 0
    frequencies.TryGetValue(ngram, oldValue)
    frequencies(ngram) = oldValue + 1
Next

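frequencies should now contain every pair of two consecutive words found in the text, together with how often each pair occurs (as a consecutive pair).

This code requires the ArrayComparer class: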

Public Class ArrayComparer(Of T)
    Implements IEqualityComparer(Of T())

    Private ReadOnly comparer As IEqualityComparer(Of T)

    Public Sub New()
        Me.New(EqualityComparer(Of T).Default)
    End Sub

    Public Sub New(ByVal comparer As IEqualityComparer(Of T))
        Me.comparer = comparer
    End Sub

    Public Overloads Function Equals(ByVal a As T(), ByVal b As T()) As Boolean _
            Implements IEqualityComparer(Of T()).Equals
        System.Diagnostics.Debug.Assert(a.Length = b.Length)
        For i As Integer = 0 to a.Length - 1
            If Not comparer.Equals(a(i), b(i)) Then Return False
        Next

        Return True
    End Function

    Public Overloads Function GetHashCode(ByVal arr As T()) As Integer _
            Implements IEqualityComparer(Of T()).GetHashCode
        Dim hashCode As Integer = 17
        For Each obj As T In arr
            hashCode = ((hashCode << 5) - 1) Xor comparer.GetHashCode(obj)
        Next

        Return hashCode
    End Function
End Class

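Unfortunately, this code does not compile on Mono because the VB compiler has trouble finding the generic EqualityComparer class. I therefore could not test whether the GetHashCode implementation works as expected, but it should be fine.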
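
If the custom comparer is unwanted, one alternative is to key the dictionary on the joined words instead of on a string array, which also generalises the loop from pairs to any n. The following is a minimal, untested sketch under that assumption; the names NGramSketch and CountNGrams are made up for illustration. It relies on the space character being one of the separators, so a word can never contain a space and the joined key stays unambiguous.

```vb
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module NGramSketch
    ' Same separator set as in the snippet above (it already includes the space).
    Private ReadOnly Separators As Char() = _
        ("!@#$%^&*()_+-={}|\:""'?¿/.,<>’¡º×÷‘;«»[] " & _
         ControlChars.CrLf & ControlChars.Tab).ToCharArray()

    ' Hypothetical helper: counts every run of n consecutive words in text.
    Function CountNGrams(ByVal text As String, ByVal n As Integer) As Dictionary(Of String, Integer)
        Dim frequencies As New Dictionary(Of String, Integer)()
        If String.IsNullOrEmpty(text) OrElse n < 1 Then Return frequencies

        Dim words As String() = text.Split(Separators, StringSplitOptions.RemoveEmptyEntries)

        ' The loop body never runs when the text has fewer than n words.
        For i As Integer = 0 To words.Length - n
            ' Join words i .. i+n-1 into a single key such as "I am".
            Dim key As String = String.Join(" ", words, i, n)
            Dim oldValue As Integer = 0
            frequencies.TryGetValue(key, oldValue)
            frequencies(key) = oldValue + 1
        Next

        Return frequencies
    End Function

    Sub Main()
        Dim counts As Dictionary(Of String, Integer) = _
            CountNGrams("Hello I am a test Also I am a test", 2)
        For Each pair As KeyValuePair(Of String, Integer) In counts
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value)
        Next
    End Sub
End Module
```

For the sample sentence in Main, the pairs "I am", "am a" and "a test" each occur twice, so they should each end up with a count of 2. The trade-off against the array-keyed dictionary is that the individual words have to be recovered by splitting the key again if they are needed later.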

answered 2010-03-10T16:36:54.163

Thank you very much, Konrad, for the proposed solution!

I tried your code and got the following result:

Text = "Hello I am a test Also I am a test"
(I also included whitespace as a separator)

frequencies now has 9 items:
---------------------
Keys: "Hello", "I"
Value: 1
---------------------
Keys: "I", "am"
Value: 1
---------------------
Keys: "am", "a"
Value: 1
---------------------
Keys: "a", "test"
Value: 1
---------------------
Keys: "test", "Also"
Value: 1
---------------------
Keys: "Also", "I"
Value: 1
---------------------
Keys: "I", "am"
Value: 1
---------------------
Keys: "am", "a"
Value: 1
---------------------
Keys: "a", "test"
Value: 1
---------------------

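My first question: shouldn't the last three key pairs get a value of 2, since they are found twice in the text?

Second: the reason I turned to the n-gram approach is that I do not want to limit the word count (n) to a specific length. Is there a way to make the approach dynamic, so that it first tries to find the longest phrase match and then works its way down to a word count of 2?

My target result for the example query above is: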

---------------------
Match: "I am a test"
Frequency: 2
---------------------
Match: "I am a"
Frequency: 2
---------------------
Match: "am a test"
Frequency: 2
---------------------
Match: "I am"
Frequency: 2
---------------------
Match: "am a"
Frequency: 2
---------------------
Match: "a test"
Frequency: 2
---------------------

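Hatem Mostafa wrote a similar approach for this in C++ on codeproject.com: N-gram and Fast Pattern Extraction Algorithm

Unfortunately, I am no C++ expert and have no idea how to convert that code, since it involves a lot of memory handling that .NET does not have. The only problem with that example is that you have to specify the minimum word pattern length, whereas I want it to be dynamic, from 2 up to the maximum found.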
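
To make the dynamic requirement concrete, here is a rough, untested sketch of one brute-force way it might be done. It is not the CodeProject algorithm, and the names RepeatedPhraseSketch and FindRepeatedPhrases are made up for illustration. It counts phrases of 2 words, then 3, and so on, and stops at the first length where nothing repeats, because a repeated phrase of n + 1 words always contains a repeated phrase of n words. Phrases are keyed on the words joined by a space, which assumes the space is one of the separators.

```vb
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module RepeatedPhraseSketch
    ' Separator set from the answer above, including the space.
    Private ReadOnly Separators As Char() = _
        ("!@#$%^&*()_+-={}|\:""'?¿/.,<>’¡º×÷‘;«»[] " & _
         ControlChars.CrLf & ControlChars.Tab).ToCharArray()

    ' Hypothetical helper: returns every phrase of two or more words that
    ' occurs at least twice, with the longest phrases first.
    Function FindRepeatedPhrases(ByVal text As String) As List(Of KeyValuePair(Of String, Integer))
        Dim result As New List(Of KeyValuePair(Of String, Integer))()
        If String.IsNullOrEmpty(text) Then Return result

        Dim words As String() = text.Split(Separators, StringSplitOptions.RemoveEmptyEntries)

        For n As Integer = 2 To words.Length
            ' Count all phrases of exactly n consecutive words.
            Dim counts As New Dictionary(Of String, Integer)()
            For i As Integer = 0 To words.Length - n
                Dim key As String = String.Join(" ", words, i, n)
                Dim oldValue As Integer = 0
                counts.TryGetValue(key, oldValue)
                counts(key) = oldValue + 1
            Next

            ' Keep only the phrases that occur more than once.
            Dim repeated As New List(Of KeyValuePair(Of String, Integer))()
            For Each pair As KeyValuePair(Of String, Integer) In counts
                If pair.Value >= 2 Then repeated.Add(pair)
            Next

            ' If no phrase of n words repeats, no longer phrase can repeat either.
            If repeated.Count = 0 Then Exit For

            ' Put the longer phrases in front of the shorter ones found earlier.
            result.InsertRange(0, repeated)
        Next

        Return result
    End Function

    Sub Main()
        For Each match As KeyValuePair(Of String, Integer) In _
                FindRepeatedPhrases("Hello I am a test Also I am a test")
            Console.WriteLine("Match: ""{0}""  Frequency: {1}", match.Key, match.Value)
        Next
    End Sub
End Module
```

For the sample sentence this should report the six phrases from the target list, each with a frequency of 2, although the order among phrases of the same length is not guaranteed. Overlapping occurrences are counted separately, and the work grows with the length of the longest repeated phrase times the number of words, so this is only meant for small inputs.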

answered 2010-03-10T23:09:35.613