
Sorry for the title, which might sound like an "already answered" topic, but I believe my case is unique.

Also, this is my first post, so I apologize if I am not on the proper channel: I am not sure whether my problem is on the server administration side or on the Laravel configuration side.

I am trying to get some fresh ideas on how to resolve an issue with Horizon / Predis / HAProxy, which I thought was fixed but is showing up again.

Some details on the environment

  • 2x Apache servers: PHP version 7.2.29-1+ubuntu18.04.1+deb.sury.org+1
    • thread safety is disabled and we use FPM
  • 2x Redis servers using a simple master-slave setup (no high availability, no Sentinel): Redis version 4.0.9
  • load balancing with HAProxy version 1.9

Libraries

  • laravel/framework: 6.14.0
  • laravel/horizon: 3.7.2
  • redis/predis: 1.1.1

Horizon configuration

The Horizon daemon is managed through Supervisor.

This is the Redis client configuration in config/database.php:

'redis' => [

    'client' => 'predis',

    'options' => [
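        // all keys get prefixed with the uppercased app name, e.g. "MYAPP:"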
        'prefix' => strtoupper(env('APP_NAME') . ':')
    ],

    'default' => [
        'host' => env('REDIS_HOST'),
        'password' => env('REDIS_PASSWORD'),
        'port' => env('REDIS_PORT'),
        'database' => env('REDIS_DB'),
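        // Predis option: -1 disables the read/write timeout on the socket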
        'read_write_timeout' => -1
    ],
    ...

This is the Redis connection configuration in config/queue.php:

    'redis' => [
        'driver' => 'redis',
        'connection' => 'default',
        'queue' => env('REDIS_QUEUE', 'default'),
        'retry_after' => 110
    ],

    'redis-long-run' => [
        'driver' => 'redis',
        'connection' => 'default',
        'queue' => env('REDIS_QUEUE', 'long-running-queue'),
        'retry_after' => 3620
    ],

As you can see, there are two connections defined for the same physical Redis server. The application uses queues for two different types of jobs:

  • Fast / short processes like broadcasting, notifications or some Artisan command calls. These use the first connection configuration with its low timeout setting.
  • Long-running processes which essentially query large amounts of data on a Snowflake DB (a cloud-based SQL-like DB) and / or update / insert documents on a Solr server. These processes use the second connection configuration as they can take quite some time to complete (usually around 20 minutes for the one combining reads from Snowflake and writes to Solr); see the dispatch sketch below.
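
To illustrate how the two configurations are used, this is roughly how a job is pushed on either connection (a simplified sketch, not our actual code; the job class names are made up for the example):

    // Hypothetical job classes, for illustration only.
    // Short job: goes through the "redis" connection and the "default" queue.
    SendNotificationJob::dispatch($user)
        ->onConnection('redis')
        ->onQueue('default');

    // Long job: goes through the "redis-long-run" connection and the "long-running-queue" queue.
    SyncSolrDocumentsJob::dispatch($exportId)
        ->onConnection('redis-long-run')
        ->onQueue('long-running-queue');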

The application is a business webapp meant for private use by my company and the load is rather small (around 200 jobs queued per day), but the long-running processes are critical to the business: job failures or double runs are not acceptable.

This is the config/horizon.php file:

'environments' => [
    'production' => [
        'supervisor-default' => [
            'connection' => 'redis',
            'queue' => ['live-rules', 'solr-cmd', 'default'],
            'balance' => 'simple',
            'processes' => 3,
            // must be lower than 'retry_after' in config/queue.php > 'connections.redis'
            'timeout' => 90,
            'tries' => 3,
        ],
        'supervisor-long-run' => [
            'connection' => 'redis-long-run',
            'queue' => ['long-running-queue', 'solr-sync'],
            'balance' => 'simple',
            'processes' => 5,
            // must be lower than 'retry_after' in config/queue.php > 'connections.redis-long-run'
            'timeout' => 3600,
            'tries' => 10,
        ],
    ],

    'staging' => [
    ...

Initial problem (solved)

When we went live at the beginning of the year we immediately hit a problem with the jobs running on the long-running-queue connection:

Error while reading line from the server. [tcp://redis_host:6379] errors started popping up left and right.

These translated into jobs being stuck in the pending state until they finally ended up marked as failed, although the tasks had in reality succeeded.

At the time, the application's long-running processes were limited to the Snowflake SELECT queries.

After going through the numerous posts about it in Laravel Horizon's GitHub issues as well as SO topics, and testing the suggestions without luck, we finally figured out that the culprit was our load balancer closing the connection after 90 seconds.

Redis has a default tcp-keepalive config parameter of 300 seconds, so we tweaked HAProxy's configuration to close at 310 seconds and, poof!, everything worked fine for a while.

This is HAProxy's configuration for the application nowadays:

listen PROD-redis
    bind                    0.0.0.0:6379
    mode                    tcp
    option                  tcplog
    option                  tcp-check
    balance                 leastconn
    timeout connect         10s
    timeout client          310s
    timeout server          310s
    server 1        192.168.12.34:6379      check inter 5s rise 2 fall 3
    server 2        192.168.43.21:6379      check inter 5s rise 2 fall 3 backup

New problem (the initial one reborn?)

Coming back a few months later, the application has evolved and we now have a job which reads and yields from Snowflake in batches to build a Solr update query. The Solr client is solarium/solarium and we use the addBuffered plugin.
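
For context, the long-running job body does roughly the following (a simplified sketch, not our actual code; $client is assumed to be an already configured Solarium\Client and fetchRowsFromSnowflake() is a hypothetical generator standing in for the Snowflake reads):

    // Sketch of the long-running job: stream rows from Snowflake into Solr.
    $buffer = $client->getPlugin('bufferedadd');
    $buffer->setBufferSize(200); // documents are flushed to Solr in chunks of 200

    // fetchRowsFromSnowflake() is a hypothetical generator yielding one row per document.
    foreach ($this->fetchRowsFromSnowflake() as $row) {
        $buffer->createDocument($row);
    }

    $buffer->flush();  // send any remaining buffered documents
    $buffer->commit(); // final Solr commit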

This worked flawlessly on our pre-production environment which doesn't have load balancing.

So next we moved to the production environment, and the Redis connection issues arose again unexpectedly, except this time we have the HAProxy set up properly.

Monitoring the keys in Redis, we can see that these jobs do get reserved but end up in the delayed state after some time, waiting to be tried again once the job's timeout is reached.
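
For reference, this is roughly how we watch those keys (a sketch; it assumes the default Laravel queue key layout, with the prefix being applied automatically by the Redis client):

    use Illuminate\Support\Facades\Redis;

    // Inspect the reserved and delayed sorted sets for the long-running queue.
    $queue = 'long-running-queue';

    $reserved = Redis::connection()->zrange("queues:{$queue}:reserved", 0, -1);
    $delayed  = Redis::connection()->zrange("queues:{$queue}:delayed", 0, -1);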

This is a real problem, as we end up going through the job's max tries count until it eventually gets marked as failed, running it x times because it never gets the complete flag, putting unnecessary stress on the environment and consuming resources when in fact the job DID succeed on the first try.

This is what we get from HAProxy's logs:

Jun 26 11:35:43 apache_host haproxy[215280]: 127.0.0.1:42660 [26/Jun/2020:11:29:02.454] PROD-redis PROD-redis/redis_host 1/0/401323 61 cD 27/16/15/15/0 0/0

Jun 26 11:37:18 apache_host haproxy[215280]: 127.0.0.1:54352 [26/Jun/2020:11:28:23.409] PROD-redis PROD-redis/redis_host 1/0/535191 3875 cD 24/15/14/14/0 0/0

The cD part is the interesting information; as per HAProxy's documentation:

c : the client-side timeout expired while waiting for the client to send or receive data.

D : the session was in the DATA phase.

There are more logs like this, and there is no obvious pattern in the delay between the moment the connection is established and the moment it closes, as you can see from the dates.

Before getting there we have:

  • switched to a Redis version 5.0.3 server: same issue.
  • removed HAProxy from the equation and established a direct connection between the client and Redis: it works flawlessly.

I'm a bit at a loss as to how to figure out and fix the issue for good. Going back to the HAProxy log concerning the client-side timeout, I wonder what could possibly be wrong with the client configuration and what I should try next.

Maybe someone here will come up with a suggestion? Thank you for reading.


1 Answer


According to the Laravel documentation, it is better to use the PhpRedis client instead of Predis.

Predis has been abandoned by the package's original author and may be removed from Laravel in a future release.

In short, PhpRedis is a PHP extension written in C, while Predis is a library written in PHP. The huge performance difference is described here

BTW, we have a similar stack: Laravel + Horizon -> HAProxy -> Redis server. We have 3 Redis servers (1 master, 2 slaves), plus Sentinel to keep track of the actual master. And we had similar problems with Redis until we migrated from Predis to PhpRedis. When researching the problems, the best answer was to use PhpRedis.

PS: We just changed REDIS_CLIENT in .env from predis to phpredis and everything kept working.
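
In practice the switch looks something like this (a sketch; it assumes config/database.php reads the client from the environment, which is the default Laravel setup, and that the ext-redis PHP extension is installed):

    // .env
    // REDIS_CLIENT=phpredis

    // config/database.php
    'redis' => [
        // read the client from the environment instead of hard-coding 'predis'
        'client' => env('REDIS_CLIENT', 'phpredis'),
        // ...
    ],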

answered 2020-07-10T09:33:58.443