python-3.x - Go strings.Contains() 比 Python3 慢 2 倍？

Question

我正在将文本模式扫描仪从 Python3 转换为 Go1.10，但令我惊讶的是它实际上慢了 2 倍。分析后，罪魁祸首在strings.Contains(). 请参阅下面的简单基准。我错过了什么吗？您能否推荐一种更快的 Go 模式搜索算法，在这种情况下性能会更好？我不关心启动时间，相同的模式将用于扫描数百万个文件。

Py3 基准测试：

import time
import re

RUNS = 10000

if __name__ == '__main__':
    with open('data.php') as fh:
        testString = fh.read()

    def do():
        return "576ad4f370014dfb1d0f17b0e6855f22" in testString

    start = time.time()
    for i in range(RUNS):
        _ = do()
    duration = time.time() - start
    print("Python: %.2fs" % duration)

Go1.10 基准测试：

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "strings"
    "time"
)

const (
    runs = 10000
)

func main() {
    fname := "data.php"
    testdata := readFile(fname)
    needle := "576ad4f370014dfb1d0f17b0e6855f22"
    start := time.Now()

    for i := 0; i < runs; i++ {
        _ = strings.Contains(testdata, needle)

    }
    duration := time.Now().Sub(start)
    fmt.Printf("Go: %.2fs\n", duration.Seconds())
}

func readFile(fname string) string {
    data, err := ioutil.ReadFile(fname)
    if err != nil {
        log.Fatal(err)
    }
    return string(data)
}

data.php是一个 528KB 的文件，可以在这里找到。

输出：

Go:     1.98s
Python: 0.84s

score 2 · Accepted Answer

为什么 Python 3 (24.79s) 比 Go (5.47s) 慢 4.5 倍？你得到什么结果？

蟒蛇：

$ cat contains.py
import time
import re

RUNS = 10000

if __name__ == '__main__':
    # The Complete Works of William Shakespeare by William Shakespeare
    # http://www.gutenberg.org/files/100/100-0.txt
    file = '/home/peter/shakespeare.100-0.txt' # 'data.php'
    with open(file) as fh:
        testString = fh.read()

    def do():
        return "Means to immure herself and not be seen." in testString

    start = time.time()
    for i in range(RUNS):
        _ = do()
    duration = time.time() - start
    print("Python: %.2fs" % duration)
    print(do())
$ python3 --version
Python 3.6.5
$ python3 contains.py
Python: 24.79s
True
$

去：

$ cat contains.go
package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "strings"
    "time"
)

const (
    runs = 10000
)

func main() {
    // The Complete Works of William Shakespeare by William Shakespeare
    // http://www.gutenberg.org/files/100/100-0.txt
    fname := `/home/peter/shakespeare.100-0.txt` // "data.php"
    testdata := readFile(fname)
    needle := "Means to immure herself and not be seen."
    start := time.Now()

    for i := 0; i < runs; i++ {
        _ = strings.Contains(testdata, needle)

    }
    duration := time.Now().Sub(start)
    fmt.Printf("Go: %.2fs\n", duration.Seconds())

    fmt.Println(strings.Contains(testdata, needle))
    fmt.Println(strings.Index(testdata, needle))

}

func readFile(fname string) string {
    data, err := ioutil.ReadFile(fname)
    if err != nil {
        log.Fatal(err)
    }
    return string(data)
}
$ go version
go version devel +5332b5e75a Tue Jul 31 15:44:37 2018 +0000 linux/amd64
$ go run contains.go
Go: 5.47s
true
5837178
$

score 0 · Accepted Answer

我对在Wikipedia上找到的各种字符串搜索实现进行了更多基准测试，例如：

https://github.com/cloudflare/ahocorasick
https://github.com/cubicdaiya/bms
https://github.com/kkdai/kmp
https://github.com/paddie/gokmp
https://github.com/hillu/go-yara（Yara似乎在后台实现了 Aho 和 Corasick）。

基准测试结果（代码在这里）：

BenchmarkStringsContains-4         10000        208055 ns/op
BenchmarkBMSSearch-4                1000       1856732 ns/op
BenchmarkPaddieKMP-4                2000       1069495 ns/op
BenchmarkKkdaiKMP-4                 1000       1440147 ns/op
BenchmarkAhocorasick-4              2000        935885 ns/op
BenchmarkYara-4                     1000       1237887 ns/op

strings.Contains然后，我针对本机 ( and regexp) 和基于 C 的 Yara 实现，针对 500KB 不匹配文件测试了大约 1100 个签名（100 个正则表达式，1000 个文字）的实际用例：

BenchmarkScanNative-4              2     824328504 ns/op
BenchmarkScanYara-4              300       5338861 ns/op

尽管 Go 中的 C 调用被认为是昂贵的，但在这些“繁重”的操作中，利润是显着的。侧面观察：Yara 匹配 1100 个签名而不是 1 个签名只需要 5 倍的 CPU 时间。

python-3.x - Go strings.Contains() 比 Python3 慢 2 倍？

2 回答 2

Related

Reference