
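This question is similar to "How to emulate MySQLs utf8_general_ci collation in PHP string comparisons", but I want a function for VB.NET rather than PHP.

Recently I have been generating a lot of keys that are supposed to be unique.

Some of these keys are equivalent under MySQL's UTF-8 Unicode collation.

For example, look at these two keys:

byers-street-bistro__38.15_-79.07 byers-street-bistro__38.15_-79.07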

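If I paste them into a web page and view the page source, this is what I see:

byers-street-bistro__38.15_-79.07

byers-street-bistro&lrm;__38.15_-79.07

Note: on Stack Overflow they still look different.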

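I know the two strings are not the same. I guess the difference does not show even on Stack Exchange. Say I have 1 million such records, and I want to test whether MySQL's UTF-8 collation will declare two strings equal. I want to know that before uploading. How do I do that?

VB.NET treats these as different keys. When we build the MySQL query and upload it, the database complains that they are the same key. A single complaint is enough to stall the upload of all 1 million records.

We don't even know what &lrm; is. Where can we look that up, anyway?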

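In any case, I want a function that, given two strings, tells me whether they will be treated as the same.

If possible, we also want a function that converts a string to its most "standard" form.

For example, &lrm; seems to encode nothing; the function would recognize all such empty characters and remove them.

Is there such a thing?

So far, this is what I have. I need something more comprehensive.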

    Imports System.Collections.Generic
    Imports System.Globalization
    Imports System.Runtime.CompilerServices
    Imports System.Text
    Imports Microsoft.VisualBasic

    ' Note: <Extension()> methods must be declared inside a Module.

    ' Lazily built map from curly quotes to straight ASCII quotes.
    Private Function StraightenQuotesReplacement() As Dictionary(Of String, String)
        Static replacement As Dictionary(Of String, String)
        If replacement Is Nothing Then
            replacement = New Dictionary(Of String, String)
            replacement.Add(ChrW(&H201C), """")
            replacement.Add(ChrW(&H201D), """")
            replacement.Add(ChrW(&H2018), "'")
            replacement.Add(ChrW(&H2019), "'")
        End If
        Return replacement
    End Function

    ' Replaces curly single and double quotes with straight ones.
    <Extension()>
    Public Function straightenQuotes(ByVal somestring As String) As String
        For Each key In StraightenQuotesReplacement.Keys
            somestring = somestring.Replace(key, StraightenQuotesReplacement.Item(key))
        Next
        Return somestring
    End Function

    ' Transliterates German umlauts and sharp s to ASCII digraphs.
    <Extension()>
    Public Function germanCharacter(ByVal s As String) As String
        Dim t = s
        t = t.Replace("ä", "ae")
        t = t.Replace("ö", "oe")
        t = t.Replace("ü", "ue")
        t = t.Replace("Ä", "Ae")
        t = t.Replace("Ö", "Oe")
        t = t.Replace("Ü", "Ue")
        t = t.Replace("ß", "ss")
        Return t
    End Function

    ' Replaces small katakana ke with full-size ke.
    <Extension()>
    Public Function japaneseCharacter(ByVal s As String) As String
        Dim t = s
        t = t.Replace("ヶ", "ケ")
        Return t
    End Function

    ' Replaces final sigma with sigma, and plain iota with accented iota.
    <Extension()>
    Public Function greekCharacter(ByVal s As String) As String
        Dim t = s
        t = t.Replace("ς", "σ")
        t = t.Replace("ι", "ί")
        Return t
    End Function

    ' Expands the oe ligature.
    <Extension()>
    Public Function franceCharacter(ByVal s As String) As String
        Dim t = s
        t = t.Replace("œ", "oe")
        Return t
    End Function

    ' Decomposes the string (form D) and drops every combining mark.
    <Extension()>
    Public Function RemoveDiacritics(ByVal s As String) As String
        Dim normalizedString As String = s.Normalize(NormalizationForm.FormD)
        Dim stringBuilder As New StringBuilder
        For Each c As Char In normalizedString
            If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
                stringBuilder.Append(c)
            End If
        Next
        Return stringBuilder.ToString()
    End Function

    ' Removes U+200E LEFT-TO-RIGHT MARK (8206 = &H200E).
    <Extension()>
    Public Function badcharacters(ByVal s As String) As String
        Dim t = s
        t = t.Replace(ChrW(8206), "")
        Return t
    End Function

    ' removeDoubleSpaces, SpacetoDash and EncodeUrlLimited are defined elsewhere (not shown).
    ' japaneseCharacter, franceCharacter and badcharacters are not part of this chain.
    <Extension()>
    Public Function sanitizeUTF8_Unicode(ByVal str As String) As String
        Return str.ToLower.removeDoubleSpaces.SpacetoDash.EncodeUrlLimited.straightenQuotes.RemoveDiacritics.greekCharacter.germanCharacter
    End Function

2 Answers


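The keys may use different Unicode code points for characters that look alike. For example, hyphen-minus (U+002D), en dash (U+2013) and em dash (U+2014) are three different characters that all look similar.

Use the AscW() function to inspect each character.

EDIT: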
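As a minimal sketch of that check in VB.NET, the module below prints the code point and Unicode category of each character, so an invisible character stands out. The names `CodePointDump` and `DescribeChars` are made up for illustration. The loop walks UTF-16 code units, so a character outside the Basic Multilingual Plane would show up as two surrogate entries.

```vbnet
Imports System
Imports System.Globalization
Imports System.Text
Imports Microsoft.VisualBasic

Module CodePointDump
    ' Hypothetical helper: returns one "U+XXXX Category" line per UTF-16 code unit in s.
    Function DescribeChars(s As String) As String
        Dim sb As New StringBuilder()
        For Each c As Char In s
            sb.Append("U+").Append(AscW(c).ToString("X4"))
            sb.Append(" "c).Append(CharUnicodeInfo.GetUnicodeCategory(c).ToString())
            sb.AppendLine()
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' "o", then U+200E LEFT-TO-RIGHT MARK, then "_"
        Console.Write(DescribeChars("o" & ChrW(&H200E) & "_"))
        ' Prints:
        ' U+006F LowercaseLetter
        ' U+200E Format
        ' U+005F ConnectorPunctuation
    End Sub
End Module
```

Any entry in the `Format` category is a character that takes up no space on screen and deserves a closer look.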

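As discussed in the comments below, use the System.Text.NormalizationForm enumeration (together with String.Normalize) to determine which Unicode code points are treated as equivalent characters.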
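A small sketch of what normalization covers, and what it does not, might look like the following. `NormalizationDemo` and `CanonicallyEqual` are illustrative names, and this is not MySQL's collation logic. Normalization makes composed and decomposed forms of the same letter compare equal, but it leaves U+200E in place, so the keys from the question would still need that character stripped separately.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module NormalizationDemo
    ' Hypothetical helper: True when a and b are identical after NFC normalization.
    Function CanonicallyEqual(a As String, b As String) As Boolean
        Return String.Equals(a.Normalize(NormalizationForm.FormC),
                             b.Normalize(NormalizationForm.FormC),
                             StringComparison.Ordinal)
    End Function

    Sub Main()
        Dim composed As String = "caf" & ChrW(&HE9)      ' e-acute as a single code point
        Dim decomposed As String = "cafe" & ChrW(&H301)  ' "e" followed by a combining acute accent
        Console.WriteLine(CanonicallyEqual(composed, decomposed))  ' True

        Dim plain As String = "bistro_38"
        Dim marked As String = "bistro" & ChrW(&H200E) & "_38"  ' contains LEFT-TO-RIGHT MARK
        Console.WriteLine(CanonicallyEqual(plain, marked))      ' False: NFC keeps U+200E
    End Sub
End Module
```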

answered 2012-05-23T04:43:28.150

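I use the VBA code below to investigate strange strings.

I copied the "byers-street" line to cell D18 of an Excel worksheet and typed call DsplInHex(Range("D18")) into the Immediate window. The result was: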

62 79 65 72 73 2D 73 74 72 65 65 74 2D 62 69 73 74 72 6F 5F 33 38 2E 31 35 2D 37 39 2E 30 37 20 62 79 65 72 73 2D 73 74 72 65 65 74 2D 62 69 73 74 72 6F 200E 5F 33 38 2E 31 35 2D 37 39 2E 30 37 

Adding a line break and some spaces gives:

62 79 65 72 73 2D 73 74 72 65 65 74 2D 62 69 73 74 72 6F      5F 33 38 2E 31 35 2D 37 39 2E 30 37 20 
62 79 65 72 73 2D 73 74 72 65 65 74 2D 62 69 73 74 72 6F 200E 5F 33 38 2E 31 35 2D 37 39 2E 30 37 

According to my Unicode book 200E is a Left-To-Right Mark. I would be interested to know how you managed to add that character to your key.

VB.NET is correct; these keys are different. Either MySQL deletes such characters or your transfer process deleted them. Either way, you need to check your source data for funny characters.

    Option Explicit

    ' Prints the UTF-16 code of every character in Stg, in hex, to the Immediate window.
    Public Sub DsplInHex(Stg As String)

      Dim Pos As Long

      For Pos = 1 To Len(Stg)
        Debug.Print Hex(AscW(Mid(Stg, Pos, 1))) & " ";
      Next
      Debug.Print

    End Sub
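For checking a large batch of keys in VB.NET before upload, one rough approach is to group the keys by a cleaned-up form and report any group with more than one member. The sketch below assumes that invisible format characters (Unicode category Cf, which includes U+200E) and letter case are the only sources of collision; MySQL's collations also fold accents and other differences, so this catches only part of what the database might reject. `KeyCollisionCheck`, `StripFormatChars` and `FindCollisions` are hypothetical names.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Globalization
Imports System.Text
Imports Microsoft.VisualBasic

Module KeyCollisionCheck
    ' Drops invisible format characters (Unicode category Cf, e.g. U+200E).
    Function StripFormatChars(s As String) As String
        Dim sb As New StringBuilder(s.Length)
        For Each c As Char In s
            If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.Format Then
                sb.Append(c)
            End If
        Next
        Return sb.ToString()
    End Function

    ' Groups keys by their stripped, lower-cased form; returns only groups with 2+ members.
    Function FindCollisions(keys As IEnumerable(Of String)) As List(Of List(Of String))
        Dim groups As New Dictionary(Of String, List(Of String))(StringComparer.Ordinal)
        For Each rawKey As String In keys
            Dim canonical As String = StripFormatChars(rawKey).ToLowerInvariant()
            Dim bucket As List(Of String) = Nothing
            If Not groups.TryGetValue(canonical, bucket) Then
                bucket = New List(Of String)()
                groups.Add(canonical, bucket)
            End If
            bucket.Add(rawKey)
        Next

        Dim result As New List(Of List(Of String))()
        For Each members As List(Of String) In groups.Values
            If members.Count > 1 Then result.Add(members)
        Next
        Return result
    End Function

    Sub Main()
        Dim keys As String() = {
            "byers-street-bistro__38.15_-79.07",
            "byers-street-bistro" & ChrW(&H200E) & "__38.15_-79.07",
            "another-key__1.00_2.00"
        }
        For Each collision As List(Of String) In FindCollisions(keys)
            Console.WriteLine("Collision among " & collision.Count.ToString() & " keys")
        Next
        ' Prints: Collision among 2 keys
    End Sub
End Module
```

Because each key is visited once and looked up in a dictionary, the check stays roughly linear in the number of keys, which matters at the 1 million record scale mentioned in the question.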
answered 2012-05-23T21:49:11.830