optimization - Lua 挑战：你能提高 mandelbrot 实现的性能吗？

Question

状态：到目前为止，最佳答案的程序执行时间是原始程序的 33%！但可能还有其他方法可以优化它。

Lua 目前是最快的脚本语言，但是 Lua 在一些针对 C/C++ 的基准测试中得分非常低。

其中之一是 mandelbrot 测试（生成 Mandelbrot 设置便携式位图文件 N=16,000），它的得分是可怕的 1:109（多核）或 1:28（单核）

由于速度的 Delta 非常大，因此这是一个很好的优化候选者。另外我敢肯定，那些知道 Mike Pall 是谁的人可能会认为不可能进一步优化这个，但这是明显错误的。任何做过优化的人都知道，总是有可能做得更好。此外，我确实设法通过一些调整获得了一些额外的性能，所以我知道它是可能的 :)

-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- contributed by Mike Pall

local width = tonumber(arg and arg[1]) or 100
local height, wscale = width, 2/width
local m, limit2 = 50, 4.0
local write, char = io.write, string.char

write("P4\n", width, " ", height, "\n")

for y=0,height-1 do
  local Ci = 2*y / height - 1
  for xb=0,width-1,8 do
    local bits = 0
    local xbb = xb+7
    for x=xb,xbb < width and xbb or width-1 do
      bits = bits + bits
      local Zr, Zi, Zrq, Ziq = 0.0, 0.0, 0.0, 0.0
      local Cr = x * wscale - 1.5
      for i=1,m do
        local Zri = Zr*Zi
        Zr = Zrq - Ziq + Cr
        Zi = Zri + Zri + Ci
        Zrq = Zr*Zr
        Ziq = Zi*Zi
        if Zrq + Ziq > limit2 then
          bits = bits + 1
          break
        end
      end
    end
    if xbb >= width then
      for x=width,xbb do bits = bits + bits + 1 end
    end
    write(char(255-bits))
  end
end

那么如何优化它（当然，与任何优化一样，您必须测量您的实现以确保它更快）。并且你不能为此改变 Lua 的 C 核心，或者使用 LuaJit，它是关于寻找优化 Lua 弱点之一的方法。

编辑：为此设置赏金以使挑战更有趣。

score 7 · Accepted Answer

通过 2，（在我的机器上）比我以前的要好 30%。主要节省来自展开内部循环以摊销开销。

还包括（已注释掉）是当您卡在中央心形指向时提前退出（并将像素设置为黑色）以节省时间的中止尝试。它有效，但无论我如何调整它都会变慢。

我得跑了，但我会留下一个离别建议。通过运行长度编码结果可能会进行一些优化（因此，您可以保存一个列表（白点数、黑点数、白点数等），而不是保存一堆比特旋转字符。）。这个会：

减少存储/GC 开销
允许对输出生成进行一些优化（当数字为 >> 8 时）
允许一些轨道探测。

不知道它是否可以编码得足够紧密，可以飞行，但如果我有更多时间，我接下来会尝试这样做。

-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- contributed by Mike Pall
-- with optimizations by Markus J. Q. (MarkusQ) Roberts

local width = tonumber(arg and arg[1]) or 100
local height, wscale = width, 2/width
local m, limit2 = 50, 4.0
local write, char = io.write, string.char

local h2 = math.floor(height/2)
local hm = height - h2*2
local top_half = {}

for y=0,h2+hm do
    local Ci = 2*y / height - 1
    local line = {""}
    for xb=0,width-1,8 do
        local bits = 0
        local xbb = xb+7
        for x=xb,xbb < width and xbb or width-1 do
            bits = bits + bits
            local Zr, Zi, Zrq, Ziq = 0.0, 0.0, 0.0, 0.0
            local Cr = x * wscale - 1.5
            local Zri = Zr*Zi
            for i=1,m/5 do
                Zr = Zrq - Ziq + Cr
                Zi = Zri + Zri + Ci
                Zri = Zr*Zi

                Zr = Zr*Zr - Zi*Zi + Cr
                Zi = 2*Zri +         Ci
                Zri = Zr*Zi

                Zr = Zr*Zr - Zi*Zi + Cr
                Zi = 2*Zri +         Ci
                Zri = Zr*Zi

                Zr = Zr*Zr - Zi*Zi + Cr
                Zi = 2*Zri +         Ci
                Zri = Zr*Zi

                Zr = Zr*Zr - Zi*Zi + Cr
                Zi = 2*Zri +         Ci
                Zri = Zr*Zi

                Zrq = Zr*Zr
                Ziq = Zi*Zi
                Zri = Zr*Zi
                if Zrq + Ziq > limit2 then
                    bits = bits + 1
                    break
                    end
                -- if i == 1 then
                --    local ar,ai       = 1-4*Zr,-4*Zi
                --    local a_r         = math.sqrt(ar*ar+ai*ai)
                --    local k           = math.sqrt(2)/2
                --    local br,bi2      = math.sqrt(a_r+ar)*k,(a_r-ar)/2
                --    if (br+1)*(br+1) + bi2 < 1 then
                --        break
                --        end
                --    end
                end
            end
        for x=width,xbb do 
            bits = bits + bits + 1 
            end
        table.insert(line,char(255-bits))
        end
    line = table.concat(line) 
    table.insert(top_half,line)
    end

write("P4\n", width, " ", height, "\n")
for y=1,h2+hm do
    write(top_half[y])
    end
for y=h2,1,-1 do
    write(top_half[y])
   end

score 3 · Accepted Answer

所以这里是〜40％的开始：

-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- contributed by Mike Pall

local width = tonumber(arg and arg[1]) or 100
local height, wscale = width, 2/width
local m, limit2 = 50, 4.0
local write, char = io.write, string.char

function addChar (line, c)
    table.insert(line, c)
    for i=table.getn(line)-1, 1, -1 do
        if string.len(line[i]) > string.len(line[i+1]) then
            break
            end
        line[i] = line[i] .. table.remove(line)
        end
    end

local h2 = math.floor(height/2)
local hm = height - h2*2
local top_half = {}
for y=0,h2+hm do
    local Ci = 2*y / height - 1
    local line = {""}
    for xb=0,width-1,8 do
        local bits = 0
        local xbb = xb+7
        for x=xb,xbb < width and xbb or width-1 do
            bits = bits + bits
            local Zr, Zi, Zrq, Ziq = 0.0, 0.0, 0.0, 0.0
            local Cr = x * wscale - 1.5
            for i=1,m do
                local Zri = Zr*Zi
                Zr = Zrq - Ziq + Cr
                Zi = Zri + Zri + Ci
                Zrq = Zr*Zr
                Ziq = Zi*Zi
                if Zrq + Ziq > limit2 then
                    bits = bits + 1
                    break
                    end
                end
            end
        for x=width,xbb do 
            bits = bits + bits + 1 
            end
        addChar(line,char(255-bits))
        end
    line = table.concat(line) 
    table.insert(top_half,line)
    end

write("P4\n", width, " ", height, "\n")
for y=1,h2+hm do
    write(top_half[y])
    end
for y=h2,1,-1 do
    write(top_half[y])
    end

——马库斯

score 2 · Accepted Answer

现在至少有一个比我的解决方案更快的答案，我将发布我的答案

-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- contributed by Mike Pall

local width = tonumber(arg and arg[1]) or 100
local height, wscale = width, 2/width
local m, limit2 = 50, 4.0
local write, char = io.write, string.char
local insert = table.insert
local results={}
write("P4\n", width, " ", height, "\n")

for y=0,height-1 do
  local Ci = 2*y / height - 1
  for xb=0,width-1,8 do
    local bits = 0
    local xbb = xb+7
    for x=xb,xbb < width and xbb or width-1 do
      bits = bits + bits
      local Zr, Zi, Zrq, Ziq = 0.0, 0.0, 0.0, 0.0
      local Cr = x * wscale - 1.5
      for i=1,m do
        local Zri = Zr*Zi
        Zr = Zrq - Ziq + Cr
        Zi = Zri + Zri + Ci
        Zrq = Zr*Zr
        Ziq = Zi*Zi
        if Zrq + Ziq > limit2 then
          bits = bits + 1
          break
        end
      end
    end
    if xbb >= width then
      for x=width,xbb do bits = bits + bits + 1 end
    end
    insert(results,(char(255-bits)))
  end
end
write(table.concat(results))

诀窍是在将值发送到输出之前将它们存储到表中。像这样简单的事情将运行时间减少到 58%。

score 2 · Accepted Answer

我不知道 Lua 能很好地生成工作代码，但你应该能够通过使用一些数学技巧来大大提高 Mandelbrot 的性能。有一个关于使用对称性来加速它的建议，使用这种优化可以完成另一个重大改进：

使用使用 Mandelbrot 部分的矩形坐标的递归函数。然后它计算矩形边界线和中间分开的两条线的值。在此之后，有 4 个子矩形。如果其中一个具有所有相同的边框像素颜色，则可以简单地用该颜色填充它，如果没有，则递归调用该部分的函数。

我搜索了这个算法的另一种解释，并在这里找到了一个- 以及一个很好的可视化。旧的 DOS 程序 FRACTINT 将此优化称为“Tesseral 模式”。

score 1 · Accepted Answer

在基准测试游戏中，通过生成更多进程来使用可用内核，经过的时间从 674 秒减少到 211 秒。

score 1 · Accepted Answer

为什么要使用局部变量 Zri？可以通过重新排序接下来的两个语句来避免使用它：

Zi = 2*Zr*Zi + Ci Zr = Zrq - Ziq + Cr

也可以使用简单的周期性检查，但加速取决于 m。“m”越大，从周期性检查中获得的加速越好。

score 0 · Accepted Answer

Robert Gould >其中之一是 mandelbrot 测试（生成 Mandelbrot 集便携式位图文件 N=16,000），它的得分是可怕的 1:109

当您引用基准游戏中的数字时，请说明这些数字的来源，以便读者了解一些背景信息。

在这种情况下，您似乎已经在四核机器上测量了数字，其中最快的程序已被重写以利用多核。而不是查看按 CPU 时间排序的经过时间，您会看到比率下降到1:28。

或者查看中位数和四分位数，以便更好地了解 C++ 测量集与 Lua 测量集的比较。

或者有一整套测量程序被迫只使用一个内核 - Lua 与 C++ 相比- 如果你看看那些 Lua pi-digits程序，你会发现它们使用 C 语言 GNU GMP 库。

score 0 · Accepted Answer

我做的下一步是缓存一遍又一遍计算的东西，并将bit+bit替换为bit*2，这些简单的优化非常强大......

local width = tonumber(arg and arg[1]) or 100
local height, wscale = width, 2/width
local m, limit2 = 50, 4.0
local write, char = io.write, string.char
local results={}
write("P4\n", width, " ", height, "\n")
local height_minus_one = height - 1
local width_minus_one = width -1

for y=0,height_minus_one do
  local Ci = 2*y / height_minus_one
  for xb=0,width_minus_one,8 do
    local bits = 0
    local xbb = xb+7
    for x=xb,xbb < width and xbb or width_minus_one do
      bits = bits *2
      local Zr, Zi, Zrq, Ziq = 0.0, 0.0, 0.0, 0.0
      local Cr = x * wscale - 1.5
      for i=1,m do
        local Zri = Zr*Zi
        Zr = Zrq - Ziq + Cr
        Zi = Zri + Zri + Ci
        Zrq = Zr*Zr
        Ziq = Zi*Zi
        if Zrq + Ziq > limit2 then
          bits = bits + 1
          break
        end
      end
    end
    if xbb >= width then
      for x=width,xbb do bits = bits *2 + 1 end
    end
    table.insert(results,(char(255-bits)))
  end
end
write(table.concat(results))

这种优化使程序运行时间是原来的 34%，但 Markus Q 的优化仍然击败了我的；）

score 0 · Accepted Answer

这是另一种尝试，但结果比本地访问变量要慢（我想象使用干净的环境可以更快地找到变量，但事实并非如此，本地的虚拟寄存器稍微快一些）这个将运行时间降低到 41%。

local env={}
env.width = tonumber(arg and arg[1]) or 100
env.height = env.width
env.wscale = 2/env.width
env.m = 50
env.limit2 = 4.0
env.write = io.write
env.char = string.char
env.results={}
env.height_minus_one = env.height - 1
env.width_minus_one = env.width -1
env.insert = table.insert

setfenv(function()
    write("P4\n", env.width, " ", env.height, "\n")
    for y=0,height_minus_one do
      local Ci = 2*y / height_minus_one
      for xb=0,width_minus_one,8 do
        local bits = 0
        local xbb = xb+7
        for x=xb,xbb < width and xbb or width_minus_one do
          bits = bits *2
          local Zr, Zi, Zrq, Ziq = 0.0, 0.0, 0.0, 0.0
          local Cr = x * wscale - 1.5
          for i=1,m do
            local Zri = Zr*Zi
            Zr = Zrq - Ziq + Cr
            Zi = Zri + Zri + Ci
            Zrq = Zr*Zr
            Ziq = Zi*Zi
            if Zrq + Ziq > limit2 then
              bits = bits + 1
              break
            end
          end
        end
        if xbb >= width then
          for x=width,xbb do bits = bits *2 + 1 end
        end
        insert(results,(char(255-bits)))
      end
    end
end,env)()

io.write(table.concat(env.results))

optimization - Lua 挑战：你能提高 mandelbrot 实现的性能吗？

9 回答 9

Related

Reference