20

I've split a complex array processing task into a number of threads to take advantage of multi-core processing and am seeing great benefits. Currently, at the start of the task I create the threads, and then wait for them to terminate as they complete their work. I typically create about four times as many threads as there are cores, since each thread is liable to take a different amount of time, and having extra threads ensures all cores are kept occupied most of the time. I was wondering whether there would be much of a performance advantage to creating the threads as the program fires up, keeping them idle until required, and using them as I start processing. Put more simply, how long does it take to start and end a new thread, above and beyond the processing within the thread? I'm currently starting the threads using:

CWinThread *pMyThread = AfxBeginThread(CMyThreadFunc,&MyData,THREAD_PRIORITY_NORMAL);

Typically I will be using 32 threads across 8 cores on a 64-bit architecture. The process in question currently takes < 1 second, and is fired up each time the display is refreshed. If starting and ending a thread takes < 1 ms, the return doesn't justify the effort. I'm having some difficulty profiling this.

A related question here helps but is a bit vague for what I'm after. Any feedback appreciated.

4

3 Answers

18

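I wrote this quite a while ago, when I had the same basic question (along with another that will be obvious). I've updated it to show not only how long it takes to create threads, but also how long it takes for the threads to start executing: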

#include <windows.h>
#include <iostream>
#include <time.h>
#include <vector>

const int num_threads = 32;

const int switches_per_thread = 100000;

DWORD __stdcall ThreadProc(void *start) {
    QueryPerformanceCounter((LARGE_INTEGER *) start);
    for (int i=0;i<switches_per_thread; i++)
        Sleep(0);
    return 0;
}

int main(void) {
    HANDLE threads[num_threads];
    DWORD junk;

    std::vector<LARGE_INTEGER> start_times(num_threads);

    LARGE_INTEGER l;
    QueryPerformanceCounter(&l);

    clock_t create_start = clock();
    for (int i=0;i<num_threads; i++)
        threads[i] = CreateThread(NULL, 
                            0, 
                            ThreadProc, 
                            (void *)&start_times[i], 
                            0, 
                            &junk);
    clock_t create_end = clock();

    clock_t wait_start = clock();
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    clock_t wait_end = clock();

    double create_millis = 1000.0 * (create_end - create_start) / CLOCKS_PER_SEC / num_threads;
    std::cout << "Milliseconds to create thread: " << create_millis << "\n";
    double wait_clocks = (wait_end - wait_start);
    double switches = switches_per_thread*num_threads;
    double us_per_switch = wait_clocks/CLOCKS_PER_SEC*1000000/switches;
    std::cout << "Microseconds per thread switch: " << us_per_switch << "\n";

    LARGE_INTEGER f;
    QueryPerformanceFrequency(&f);

    for (auto s : start_times) 
        std::cout << 1000.0 * (s.QuadPart - l.QuadPart) / f.QuadPart <<" ms\n";

    return 0;
}

Sample results:

Milliseconds to create thread: 0.015625
Microseconds per thread switch: 0.0479687

The first few thread start times look like this:

0.0632517 ms
0.117348 ms
0.143703 ms
0.18282 ms
0.209174 ms
0.232478 ms
0.263826 ms
0.315149 ms
0.324026 ms
0.331516 ms
0.3956 ms
0.408639 ms
0.4214 ms

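Note that although these happen to be monotonically increasing, that is not guaranteed (though there is definitely a trend in that general direction).

When I first wrote this, the units I used made more sense: on a 33 MHz 486, these results weren't tiny fractions like this. :-) I suppose someday when I'm feeling ambitious, I should rewrite this to use std::async to create the threads and std::chrono to do the timing, but...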
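
A portable variant along those lines might look like the sketch below. It is not the author's code. It uses std::thread rather than std::async (a swapped-in choice, since std::async only guarantees a new thread when std::launch::async is requested), it creates the threads one after another instead of all at once, and the names ThreadTiming, time_one_thread, average_thread_timing and print_thread_overhead are made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>
#include <thread>

using Clock = std::chrono::steady_clock;
using Millis = std::chrono::duration<double, std::milli>;

struct ThreadTiming {
    double start_ms;  // from just before construction until the thread body runs
    double total_ms;  // from just before construction until join() returns
};

// Hypothetical helper: times a single thread whose body does no real work.
ThreadTiming time_one_thread() {
    Clock::time_point started;
    const Clock::time_point created = Clock::now();
    std::thread worker([&started] { started = Clock::now(); });
    worker.join();  // thread completion synchronizes with join(), so 'started' is safe to read
    const Clock::time_point joined = Clock::now();
    return { Millis(started - created).count(), Millis(joined - created).count() };
}

// Averages the timings over 'count' threads created one after another.
ThreadTiming average_thread_timing(int count) {
    assert(count > 0);
    ThreadTiming sum{0.0, 0.0};
    for (int i = 0; i < count; ++i) {
        const ThreadTiming t = time_one_thread();
        sum.start_ms += t.start_ms;
        sum.total_ms += t.total_ms;
    }
    return { sum.start_ms / count, sum.total_ms / count };
}

// Prints the averages; intended to be called from main(), e.g. print_thread_overhead(32).
void print_thread_overhead(int count) {
    const ThreadTiming avg = average_thread_timing(count);
    std::cout << "Average ms until thread body runs: " << avg.start_ms << "\n"
              << "Average ms for create + join:      " << avg.total_ms << "\n";
}
```

Because each thread is joined before the next one is created, this measures the cost of an isolated create/start/join cycle; it does not reproduce the contention you get when 32 threads are launched together, so the numbers are not directly comparable with the results above.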

Answered 2013-08-16T14:17:16.573
4

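Some advice:

  1. If you have a lot of work items to process (or not that many, but you have to repeat the whole process from time to time), make sure you use some kind of thread pool. That way you don't have to recreate the threads all the time, and your original question no longer matters: the threads are created only once. I use the QueueUserWorkItem API directly (since my application doesn't use MFC), and even that is not too painful. In MFC you may have higher-level facilities for taking advantage of thread pooling. ( http://support.microsoft.com/kb/197728 )
  2. Try to pick the optimal amount of work for one work item. Of course this depends on what your software does: is it supposed to be real time, or is it crunching numbers in the background? If it is not real time, too little work per work item can hurt performance, because the overhead of distributing work across threads becomes a larger share of the total.
  3. Since hardware configurations can differ a lot, if your end users can have a variety of machines you can run a calibration routine asynchronously during software start-up, so that you can estimate how long certain operations take. The calibration result can then be used as input for choosing a better work-item size for the real calculations later.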
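
To make points 1 and 2 concrete, here is a minimal sketch of a pool whose workers are created once and then fed work items of a chosen size. It is an illustration in standard C++ only, not QueueUserWorkItem and not an MFC facility; ThreadPool, submit, wait_idle and process_in_chunks are hypothetical names, and doubling each element stands in for the real array processing.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal fixed-size pool: workers are created once and reused for every batch.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t worker_count) {
        for (std::size_t i = 0; i < worker_count; ++i)
            workers_.emplace_back([this] { run(); });
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_workers_.notify_all();
        for (std::thread& w : workers_) w.join();
    }

    ThreadPool(const ThreadPool&) = delete;
    ThreadPool& operator=(const ThreadPool&) = delete;

    // Queues one work item. Tasks must not throw in this sketch.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
            ++pending_;
        }
        wake_workers_.notify_one();
    }

    // Blocks until every queued item has finished running.
    void wait_idle() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_workers_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // only reached once stopping_ is set
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (--pending_ == 0) idle_.notify_all();
            }
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_workers_;
    std::condition_variable idle_;
    std::queue<std::function<void()>> tasks_;
    std::size_t pending_ = 0;  // items queued or currently running
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// Splits 'data' into work items of 'chunk' elements and waits for all of them.
void process_in_chunks(ThreadPool& pool, std::vector<int>& data, std::size_t chunk) {
    assert(chunk > 0);
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, data.size());
        pool.submit([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i) data[i] *= 2;  // placeholder work
        });
    }
    pool.wait_idle();
}
```

With a pool like this you would construct it once at start-up (for example with one worker per core) and call process_in_chunks on every display refresh. Making the chunks smaller than one share per core gives the load balancing that the 4x oversubscription in the question was aiming for, at the cost of more queue traffic per refresh, which is the trade-off point 2 describes.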
Answered 2013-08-16T19:16:00.240
1

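I was curious about the modern Windows scheduler, so I wrote another test app. I made my best attempt at measuring thread stop time by optionally spinning up an observer thread.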

// Tested on Windows 10 v1903 with E5-1660 v3 @ 3.00GHz, 8 Core(s), 16 Logical Processor(s)
// Times are (min, average, max) in milliseconds.

threads: 100, iterations: 1, testStop: true
Start(0.1083, 5.3665, 13.7103) - Stop(0.0341, 1.5122, 11.0660)

threads: 32, iterations: 3, testStop: true
Start(0.1349, 1.6423, 3.5561) - Stop(0.0396, 0.2877, 3.5195)
Start(0.1093, 1.4992, 3.3982) - Stop(0.0351, 0.2734, 2.0384)
Start(0.1159, 1.5345, 3.5754) - Stop(0.0378, 0.4938, 3.2216)

threads: 4, iterations: 3, testStop: true
Start(0.2066, 0.3553, 0.4598) - Stop(0.0410, 0.1534, 0.4630)
Start(0.2769, 0.3740, 0.4994) - Stop(0.0414, 0.1028, 0.2581)
Start(0.2342, 0.3602, 0.5650) - Stop(0.0497, 0.2199, 0.3620)

threads: 4, iterations: 3, testStop: false
Start(0.1698, 0.2492, 0.3713)
Start(0.1473, 0.2477, 0.4103)
Start(0.1756, 0.2909, 0.4295)

threads: 1, iterations: 10, testStop: false
Start(0.1910, 0.1910, 0.1910)
Start(0.1685, 0.1685, 0.1685)
Start(0.1564, 0.1564, 0.1564)
Start(0.1504, 0.1504, 0.1504)
Start(0.1389, 0.1389, 0.1389)
Start(0.1234, 0.1234, 0.1234)
Start(0.1550, 0.1550, 0.1550)
Start(0.2800, 0.2800, 0.2800)
Start(0.1587, 0.1587, 0.1587)
Start(0.1877, 0.1877, 0.1877)

Source:

#include <windows.h>
#include <iostream>
#include <vector>
#include <chrono>
#include <iomanip>

using namespace std::chrono;

struct Test
{
    HANDLE Thread = { 0 };
    time_point<steady_clock> Creation;
    time_point<steady_clock> Started;
    time_point<steady_clock> Stopped;
};

DWORD __stdcall ThreadProc(void* lpParamater) {
    auto test = (Test*)lpParamater;
    test->Started = steady_clock::now();
    return 0;
}

DWORD __stdcall TestThreadsEnded(void* lpParamater) {
    auto& tests = *(std::vector<Test>*)lpParamater;

    std::size_t finished = 0;
    while (finished < tests.size())
    {
        for (auto& test : tests)
        {
            if (test.Thread != NULL && WaitForSingleObject(test.Thread, 0) == WAIT_OBJECT_0)
            {
                test.Stopped = steady_clock::now();
                test.Thread = NULL;
                finished++;
            }
        }
    }

    return 0;
}

duration<double, std::milli> diff(time_point<steady_clock> start, time_point<steady_clock> stop)
{
    return stop - start;
}

struct Stats
{
    double min;
    double average;
    double max;
};

Stats stats(const std::vector<double>& durations)
{
    Stats stats = { 1000, 0, 0 };

    for (auto& duration : durations)
    {
        stats.min = duration < stats.min ? duration : stats.min;
        stats.max = duration > stats.max ? duration : stats.max;
        stats.average += duration;
    }

    stats.average /= durations.size();

    return stats;
}

void TestScheduler(const int threadCount, const int iterations, const bool testStop)
{
    std::cout << "\nthreads: " << threadCount << ", iterations: " << iterations << ", testStop: " << (testStop ? "true" : "false") << "\n";

    for (auto i = 0; i < iterations; i++)
    {
        std::vector<Test> tests(threadCount);
        HANDLE testThreadsEnded = NULL;

        if (testStop)
        {
            testThreadsEnded = CreateThread(NULL, 0, TestThreadsEnded, (void*)& tests, 0, NULL);
        }

        for (auto& test : tests)
        {
            test.Creation = steady_clock::now();
            test.Thread = CreateThread(NULL, 0, ThreadProc, (void*)& test, 0, NULL);
        }

        if (testStop)
        {
            WaitForSingleObject(testThreadsEnded, INFINITE);
        }
        else
        {
            std::vector<HANDLE> threads;
            for (auto& test : tests) threads.push_back(test.Thread);
            WaitForMultipleObjects((DWORD)threads.size(), threads.data(), TRUE, INFINITE);
        }

        std::vector<double> startDurations;
        std::vector<double> stopDurations;
        for (auto& test : tests)
        {
            startDurations.push_back(diff(test.Creation, test.Started).count());
            stopDurations.push_back(diff(test.Started, test.Stopped).count());
        }

        auto startStats = stats(startDurations);
        auto stopStats = stats(stopDurations);

        std::cout << std::fixed << std::setprecision(4);
        std::cout << "Start(" << startStats.min << ", " << startStats.average << ", " << startStats.max << ")";
        if (testStop)
        {
            std::cout << " - ";
            std::cout << "Stop(" << stopStats.min << ", " << stopStats.average << ", " << stopStats.max << ")";
        }
        std::cout << "\n";
    }
}

int main(void)
{
    TestScheduler(100, 1, true);
    TestScheduler(32, 3, true);
    TestScheduler(4, 3, true);
    TestScheduler(4, 3, false);
    TestScheduler(1, 10, false);
    return 0;
}
Answered 2019-07-11T23:23:54.673