pthreads - Deadlock when using pthread_cond_t

Question

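I'm having a hard time figuring out why my synchronization code deadlocks when using the pthread library. Using WinAPI primitives instead of pthread works without any problems. Using C++11 threading also works fine (unless compiled with Visual Studio 2012 Service Pack 3, where it just crashes; Microsoft accepted it as a bug). Using pthread, however, proves to be a problem, at least when running on a Linux machine; I haven't had a chance to try a different OS.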

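I've written a simple program that illustrates the problem. The code only demonstrates the deadlock; I'm well aware that the design is poor and could be written much better.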

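#include <errno.h>      /* ETIMEDOUT */
#include <pthread.h>
#include <stdio.h>      /* printf */
#include <sys/time.h>   /* gettimeofday */
#include <unistd.h>     /* usleep */

typedef struct _pthread_event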
{
     pthread_mutex_t Mutex;
     pthread_cond_t Condition;
     unsigned char  State;
} pthread_event;

void pthread_event_create( pthread_event * ev , unsigned char init_state )
{ 
    pthread_mutex_init( &ev->Mutex , 0 );
    pthread_cond_init( &ev->Condition , 0 );
    ev->State = init_state;
}

void pthread_event_destroy( pthread_event * ev )
{
    pthread_cond_destroy( &ev->Condition );
    pthread_mutex_destroy( &ev->Mutex );
}

void pthread_event_set( pthread_event * ev , unsigned char state )
{
    pthread_mutex_lock( &ev->Mutex );
    ev->State = state;
    pthread_mutex_unlock( &ev->Mutex );
    pthread_cond_broadcast( &ev->Condition );
}

unsigned char pthread_event_get( pthread_event * ev )
{
    unsigned char result;
    pthread_mutex_lock( &ev->Mutex );
    result = ev->State;
    pthread_mutex_unlock( &ev->Mutex );
    return result;
}

unsigned char pthread_event_wait( pthread_event * ev , unsigned char state , unsigned int timeout_ms )
{
    struct timeval time_now;
    struct timespec timeout_time;
    unsigned char result;

    gettimeofday( &time_now , NULL );
    timeout_time.tv_sec = time_now.tv_sec           + ( timeout_ms / 1000 );
    timeout_time.tv_nsec = time_now.tv_usec * 1000  + ( ( timeout_ms % 1000 ) * 1000000 );

    pthread_mutex_lock( &ev->Mutex );
    while ( ev->State != state ) 
          if ( ETIMEDOUT == pthread_cond_timedwait( &ev->Condition , &ev->Mutex , &timeout_time ) ) break;

    result = ev->State;
    pthread_mutex_unlock( &ev->Mutex );
    return result;
}

static pthread_t        thread_1;
static pthread_t        thread_2;
static pthread_event    data_ready;
static pthread_event    data_needed;

void * thread_fx1( void * c )
{
    for ( ; ; )
    {
        pthread_event_wait( &data_needed , 1 , 90 );
        pthread_event_set( &data_needed , 0 );
        usleep( 100000 );
        pthread_event_set( &data_ready , 1 );
        printf( "t1: tick\n" );
    }
}

void * thread_fx2( void * c )
{
    for ( ; ; )
    {
        pthread_event_wait( &data_ready , 1 , 50 );
        pthread_event_set( &data_ready , 0 );
        pthread_event_set( &data_needed , 1 );
        usleep( 100000 );
        printf( "t2: tick\n" );
    }
}


int main( int argc , char * argv[] )
{
    pthread_event_create( &data_ready , 0 );
    pthread_event_create( &data_needed , 0 );

    pthread_create( &thread_1 , NULL , thread_fx1 , 0 );
    pthread_create( &thread_2 , NULL , thread_fx2 , 0 );

    pthread_join( thread_1 , NULL );
    pthread_join( thread_2 , NULL );

    pthread_event_destroy( &data_ready );
    pthread_event_destroy( &data_needed );

    return 0;
}

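Basically, the two threads signal each other to start doing something, and each does its own work even if it has not been signaled after a short timeout.

Any idea what is going wrong there?

Thanks.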

score 1 · Accepted Answer

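The problem is the timeout_time argument passed to pthread_cond_timedwait(). The way you compute it soon produces an invalid value whose nanosecond part is greater than or equal to one billion. In that case pthread_cond_timedwait() may return EINVAL, probably before it actually waits on the condition. Since the loop in pthread_event_wait() only breaks on ETIMEDOUT, it then likely keeps spinning with the mutex held, so the other thread can never change State.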
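The overflow can be avoided by carrying excess nanoseconds into the seconds field. The following is a minimal sketch, assuming a hypothetical helper named deadline_from_now (not part of the original code) that takes the struct timeval filled in by gettimeofday():

```c
#include <assert.h>
#include <sys/time.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical helper, not in the original code: returns now + timeout_ms
   as a timespec whose tv_nsec is normalized into [0, NSEC_PER_SEC). */
static struct timespec deadline_from_now(struct timeval now, unsigned int timeout_ms)
{
    struct timespec deadline;

    deadline.tv_sec  = now.tv_sec + (time_t)(timeout_ms / 1000);
    deadline.tv_nsec = (long)now.tv_usec * 1000L
                     + (long)(timeout_ms % 1000) * 1000000L;

    /* Both summands are below one second, so a single carry is enough. */
    if (deadline.tv_nsec >= NSEC_PER_SEC) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= NSEC_PER_SEC;
    }

    assert(deadline.tv_nsec >= 0 && deadline.tv_nsec < NSEC_PER_SEC);
    return deadline;
}
```

In pthread_event_wait() the two assignments to timeout_time could then be replaced by a single call such as timeout_time = deadline_from_now( time_now , timeout_ms ).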

The problem is found quickly with valgrind --tool=helgrind ./test_prog (it soon reports that it has detected 10000000 errors and gives up counting):

bash$ gcc -Werror  -Wall -g test.c -o test -lpthread && valgrind --tool=helgrind ./test
==3035== Helgrind, a thread error detector
==3035== Copyright (C) 2007-2012, and GNU GPL'd, by OpenWorks LLP et al.
==3035== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==3035== Command: ./test
==3035== 
t1: tick
t2: tick
t2: tick
t1: tick
t2: tick
t1: tick
t1: tick
t2: tick
t1: tick
t2: tick
t1: tick
==3035== ---Thread-Announcement------------------------------------------
==3035== 
==3035== Thread #2 was created
==3035==    at 0x41843C8: clone (clone.S:110)
==3035== 
==3035== ----------------------------------------------------------------
==3035== 
==3035== Thread #2's call to pthread_cond_timedwait failed
==3035==    with error code 22 (EINVAL: Invalid argument)
==3035==    at 0x402DB03: pthread_cond_timedwait_WRK (hg_intercepts.c:784)
==3035==    by 0x8048910: pthread_event_wait (test.c:65)
==3035==    by 0x8048965: thread_fx1 (test.c:80)
==3035==    by 0x402E437: mythread_wrapper (hg_intercepts.c:219)
==3035==    by 0x407DD77: start_thread (pthread_create.c:311)
==3035==    by 0x41843DD: clone (clone.S:131)
==3035== 
t2: tick
==3035== 
==3035== More than 10000000 total errors detected.  I'm not reporting any more.
==3035== Final error counts will be inaccurate.  Go fix your program!
==3035== Rerun with --error-limit=no to disable this cutoff.  Note
==3035== that errors may occur in your program without prior warning from
==3035== Valgrind, because errors are no longer being displayed.
==3035== 
^C==3035== 
==3035== For counts of detected and suppressed errors, rerun with: -v
==3035== Use --history-level=approx or =none to gain increased speed, at
==3035== the cost of reduced accuracy of conflicting-access information
==3035== ERROR SUMMARY: 10000000 errors from 1 contexts (suppressed: 412 from 109)
Killed

Two other minor comments:

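For improved correctness, pthread_event_set() could perform the condition variable broadcast before unlocking the mutex (the effect of the wrong ordering is essentially to break the determinism of scheduling; helgrind complains about this too);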
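You can safely drop the mutex in pthread_event_get() when returning the value of ev->State, since this should be an atomic operation (the C standard does not guarantee this for a plain unsigned char, so keeping the lock or using an atomic type is the strictly portable choice).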
