0

我已经开始尝试创建以下内容:

public static IEnumerable<List<T>> OptimizedBatches<T>(this IEnumerable<T> items)

那么这个扩展方法的客户端会像这样使用它:

foreach (var list in extracter.EnumerateAll().OptimizedBatches()) 
{
   // at some unknown batch size, process time starts to 
   // increase at an exponential rate
}

这是一个例子:

batch length         time
    1                 100ms
    2                 102ms
    4                 110ms
    8                 111ms
    16                118ms
    32                119ms
    64                134ms
    128               500ms <-- doubled length but time it took more than doubled
    256               1100ms <-- oh no!!

从上面可以看出,最好的批长度是 64,因为 64/134 是最好的长度/时间比。

所以问题是使用什么算法来根据迭代器步骤之间的连续时间自动选择最佳批次长度?

这是我到目前为止所拥有的 - 它还没有完成......

class LengthOptimizer
{
    private Stopwatch sw;
    private int length = 1;
    private List<RateRecord> rateRecords = new List<RateRecord>();

    public int Length
    {
        get
        {
            if (sw == null)
            {
                length = 1;
                sw = new Stopwatch();
            }
            else
            {
                sw.Stop();
                rateRecords.Add(new RateRecord { Length = length, ElapsedMilliseconds = sw.ElapsedMilliseconds });
                length = rateRecords.OrderByDescending(c => c.Rate).First().Length;
            }
            sw.Start();
            return length;
        }
    }
}

struct RateRecord
{
    public int Length { get; set; }
    public long ElapsedMilliseconds { get; set; }
    public float Rate { get { return ((float)Length) / ElapsedMilliseconds; } }
}
4

3 回答 3

1

我在这里看到的主要问题是创建“优化尺度”,即为什么您认为 32 -> 119ms 可以接受而 256 -> 1100ms 不可以;或者为什么某些配置比其他配置更好。

完成此操作后,算法将很简单:只需返回每个输入条件的排名值,并根据“哪个获得更高的值”做出决定。

创建此量表的第一步是找出更能描述您正在寻找的理想行为的变量。一个简单的第一种方法:长度/时间。也就是说,根据您的输入:

batch length           time             ratio1
    1                 100ms              0.01
    2                 102ms              0.019  
    4                 110ms              0.036  
    8                 111ms              0.072
    16                118ms              0.136
    32                119ms              0.269  
    64                134ms              0.478
    128               500ms              0.256
    256               1100ms             0.233

ratio1 越大越好。从逻辑上讲,长度为 32 的 0.269 与长度为 128 的 0.256 不同,因此必须考虑更多信息。

您可以创建一个更复杂的排名比,更好地对两个相关变量进行加权(例如,尝试不同的指数)。但我认为解决这个问题的最佳方法是创建一个“区域”系统并从中计算通用排名。例子:

Zone 1 -> length from 1 to 8; ideal ratio for this zone = 0.1
Zone 2 -> length from 9 to 32; ideal ratio for this zone = 0.3
Zone 3 -> length from 33 to 64; ideal ratio for this zone = 0.45
Zone 4 -> length from 65 to 256; ideal ratio for this zone = 0.35

与每个配置相关的排名将是给定 ratio1 相对于给定区域的理想值的结果。

2      102ms        0.019 -> (zone 1) 0.019/0.1 = 0.19 (or 1.9 in a 0-10 scale)
16     118ms        0.136 -> (zone 2) 0.136/0.3 = 0.45 (or 4.5 in a 0-10 scale)  
etc.

可以比较这些值,因此您会自动知道第二种情况比第一种情况好得多。

这只是一个简单的例子,但我想这可以很好地了解这里的真正问题是什么:建立一个准确的排名,以便完美地确定哪种配置更好。

于 2013-07-01T12:13:33.480 回答
1

我会采用 varocarbas 建议的排名方法。

这是帮助您入门的初始实现:

public sealed class DataFlowOptimizer<T>
{
    private readonly IEnumerable<T> _collection;
    private RateRecord bestRate = RateRecord.Default;
    private uint batchLength = 1;

    private struct RateRecord
    {
        public static RateRecord Default = new RateRecord { Length = 1, ElapsedTicks = 0 };
        private float _rate;

        public int Length { get; set; }
        public long ElapsedTicks { get; set; }
        public float Rate
        {
            get
            {
                if(_rate == default(float) && ElapsedTicks > 0)
                {
                    _rate = ((float)Length) / ElapsedTicks;
                }

                return _rate;
            }
        }
    }

    public DataFlowOptimizer(IEnumerable<T> collection)
    {
        _collection = collection;
    }

    public int BatchLength { get { return (int)batchLength; } }
    public float Rate { get { return bestRate.Rate; } }

    public IEnumerable<IList<T>> GetBatch()
    {
        var stopwatch = new Stopwatch();

        var batch = new List<T>();
        var benchmarks = new List<RateRecord>(5);
        IEnumerator<T> enumerator = null;

        try
        {
            enumerator = _collection.GetEnumerator();

            uint count = 0;
            stopwatch.Start();

            while(enumerator.MoveNext())
            {   
                if(count == batchLength)
                {
                    benchmarks.Add(new RateRecord { Length = BatchLength, ElapsedTicks = stopwatch.ElapsedTicks });

                    var currentBatch = batch.ToList();
                    batch.Clear();

                    if(benchmarks.Count == 10)
                    {
                        var currentRate = benchmarks.Average(x => x.Rate);
                        if(currentRate > bestRate.Rate)
                        {
                            bestRate = new RateRecord { Length = BatchLength, ElapsedTicks = (long)benchmarks.Average(x => x.ElapsedTicks) };
                            batchLength = NextPowerOf2(batchLength);
                        }
                        // Set margin of error at 10%
                        else if((bestRate.Rate * .9) > currentRate)
                        {
                            // Shift the current length and make sure it's >= 1
                            var currentPowOf2 = ((batchLength >> 1) | 1);
                            batchLength = PreviousPowerOf2(currentPowOf2);
                        }

                        benchmarks.Clear();
                    }
                    count = 0;
                    stopwatch.Restart();

                    yield return currentBatch;
                }

                batch.Add(enumerator.Current);
                count++;
            }
        }
        finally
        {
            if(enumerator != null)
                enumerator.Dispose();
        }

        stopwatch.Stop();
    }

    uint PreviousPowerOf2(uint x)
    {
        x |= (x >> 1);
        x |= (x >> 2);
        x |= (x >> 4);
        x |= (x >> 8);
        x |= (x >> 16);

        return x - (x >> 1);
    }

    uint NextPowerOf2(uint x)
    {
        x |= (x >> 1);
        x |= (x >> 2);
        x |= (x >> 4);
        x |= (x >> 8);
        x |= (x >> 16);

        return (x+1);
    }
}

LinqPad 中的示例程序:

public IEnumerable<int> GetData()
{
    return Enumerable.Range(0, 100000000);
}

void Main()
{
    var optimizer = new DataFlowOptimizer<int>(GetData());

    foreach(var batch in optimizer.GetBatch())
    {
        string.Format("Length: {0} Rate {1}", optimizer.BatchLength, optimizer.Rate).Dump();
    }
}
于 2013-07-01T13:22:38.483 回答
0
  1. 描述一个将批量大小和运行时间映射到分数的目标函数 fst(s)f(s,t(s))
  2. 尝试许多s值并评估f(s,t(s))每个值
  3. 选择s最大化的价值f
于 2013-07-01T15:15:47.023 回答