java - Shared GAE datastore, Go <-> Java, regexp.FindStringIndex index shift (byte index vs. UTF-8 char index)

Question

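Short version: the first line below prints 3, which makes sense because in Go a string is basically a byte slice, and it takes three bytes to represent this character. How can I get len and the regexp functions to work in terms of characters rather than bytes?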

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    fmt.Println(len("ウ"))                    // prints 3 (bytes)
    fmt.Println(utf8.RuneCountInString("ウ")) // prints 1 (runes)
}

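Background:

I'm saving text to the GAE datastore using JDO (Java).

I then process the text with Go; specifically, I use regexp.FindStringIndex and save the indexes to the datastore.

Back in Java land, I send the unmodified text and the indexes to a GWT client via JSON.

Somewhere along the way the indexes are "shifting", so by the time they reach the client they are off.

The problem seems to be related to character encoding. I assume Java and Go interpret the text (and therefore the indexes) differently: UTF-8 char vs. byte? I see references to runes in the regexp package.

I guess I could either have regexp.FindStringIndex in Go return character indexes, or have the GWT client understand UTF-8 byte indexes.

Any suggestions? I should be using UTF-8 in case I need to internationalize the app in the future, right?

Thanks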

Edit:

Also, when I find the indexes with Java on the server, everything works fine.

On the client (GWT), I'm using text.substring(start,end)

Test:

package main

import "regexp"
import "fmt"

func main() {
    fmt.Print(regexp.MustCompile(`a`).FindStringIndex("ウィキa")[1])
}

The code outputs 10, not 4.

The plan is to have FindStringIndex return 4. Any ideas?

Update 2: position conversion

package main

import (
    "fmt"
    "regexp"
    "unicode/utf8"
)

func main() {
    s := "ab日aba本語ba"
    byteIndex := regexp.MustCompile(`a`).FindAllStringIndex(s, -1)
    fmt.Println(byteIndex) // [[0 1] [5 6] [7 8] [15 16]]

    offset := 0
    posMap := make([]int, len(s)) // maps a rune's starting byte position to its byte-minus-char offset
    for pos, char := range s {
        fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.\n", char, pos, offset, pos-offset)
        posMap[pos] = offset
        offset += utf8.RuneLen(char) - 1
    }
    fmt.Println("posMap =", posMap)
    // Subtracting the start's offset from both ends is valid only when the
    // matched text is pure ASCII, as it is here.
    for pos, value := range byteIndex {
        fmt.Printf("pos:%d value:%d subtract %d\n", pos, value, posMap[value[0]])
        value[1] -= posMap[value[0]]
        value[0] -= posMap[value[0]]
    }
    fmt.Println(byteIndex) // [[0 1] [3 4] [5 6] [9 10]]
}

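* Update 2 *

    // Alternative to the loop above: derive the offset from the gap between
    // consecutive rune start positions instead of calling utf8.RuneLen.
    lastPos := -1
    for pos, char := range s {
        offset += pos - lastPos - 1
        fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.\n", char, pos, offset, pos-offset)
        posMap[pos] = offset
        lastPos = pos
    }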

score 4 · Accepted Answer

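As you may have gathered, Go and Java treat strings differently. In Java, a string is a sequence of UTF-16 code units (char values), which coincide with characters for everything in the Basic Multilingual Plane; in Go, a string is a sequence of bytes. Go's text manipulation functions understand UTF-8 code points where necessary, but because strings are represented as bytes, the indexes they return and accept are byte indexes, not character indexes.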
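Since the indexes end up in text.substring(start,end) on the Java/GWT side, the unit that matters there is the UTF-16 code unit. The following is a minimal sketch of converting a Go byte offset into that unit; the helper name utf16Index is hypothetical and the sample string is made up for illustration. For text that stays within the Basic Multilingual Plane the result is the same as a plain character count.

```go
package main

import "fmt"

// utf16Index (hypothetical helper) converts a byte offset into s to an
// offset counted in UTF-16 code units, the unit Java's String.substring uses.
// byteIdx is assumed to fall on a rune boundary.
func utf16Index(s string, byteIdx int) int {
	units := 0
	for _, r := range s[:byteIdx] {
		if r >= 0x10000 {
			units += 2 // outside the BMP: encoded as a surrogate pair
		} else {
			units++
		}
	}
	return units
}

func main() {
	s := "ウ😀a"
	fmt.Println(len(s))           // 8 bytes: 3 + 4 + 1
	fmt.Println(utf16Index(s, 7)) // 3: ウ is 1 unit, 😀 is 2
}
```

Counting code units rather than runes only makes a difference once the text contains characters such as emoji, but it keeps the Go side and the Java side in agreement in that case too.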

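As you observed in the comments, you can use a RuneReader with FindReaderIndex to try to get indexes in characters rather than bytes. strings.Reader provides an implementation of RuneReader, so you can use strings.NewReader to wrap a string in a RuneReader. Check the result before relying on it, though: as far as I can tell, the regexp package advances its position by the size that ReadRune reports, and strings.Reader reports sizes in bytes.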

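Another option is to take the substring whose length in characters you want and pass it to utf8.RuneCountInString, which returns the number of characters (runes) in a UTF-8 string. Using a RuneReader is probably more efficient, though.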
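The substring approach can be sketched as follows, using the test string from the question. The helper name runeIndex is hypothetical; it simply counts the runes that precede a byte offset.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// runeIndex (hypothetical helper) converts a byte offset into s to a rune
// (character) offset by counting the runes in the prefix s[:byteIdx].
// byteIdx is assumed to fall on a rune boundary.
func runeIndex(s string, byteIdx int) int {
	return utf8.RuneCountInString(s[:byteIdx])
}

func main() {
	s := "ウィキa"
	loc := regexp.MustCompile(`a`).FindStringIndex(s) // byte offsets
	start, end := runeIndex(s, loc[0]), runeIndex(s, loc[1])
	fmt.Println(loc, start, end) // [9 10] 3 4
}
```

Each call rescans the prefix, so for many matches in a long text a single pass that builds a position map, like the one in the question's update, is likely the cheaper choice.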
