Summary:

I am interested in knowing the best practice for high-throughput applications in which bulk messages try to update the same row and hit Oracle deadlock errors. I know you cannot avoid those errors entirely, but how do you recover from them gracefully without getting bogged down by the same deadlocks happening over and over again?

Details:

We are building a high-throughput JMS messaging application. The production environment will be two WebLogic 11g nodes (each running 6 MDB listener instances). We get Oracle deadlock errors (ORA-00060) when around 1000 messages all try to update the same row in the Oracle database. Synchronizing across nodes is not possible with the standard Java threading API, and unless there is no other solution we don't want to use third-party products such as Terracotta.

We were hoping that Oracle's "SELECT FOR UPDATE WAIT n" statement would help, because it essentially makes the threads competing for the same row wait a few seconds until the first thread (the one that got the lock on the row first) is done with it.

The first issue with "SELECT FOR UPDATE WAIT n" is that it does not accept wait times in milliseconds. This hurts our application's throughput, because even a 1-second WAIT (the smallest wait time) delays the messages.

The second thing we are experimenting with is the WebLogic queue redelivery delay parameter (30 seconds in our case). Whenever a thread bounces back because of the deadlock error, its message waits 30 seconds before being retried.

In our experience, 1000 competing messages often take forever to process because the deadlock keeps happening over and over.

I understand that with the current architecture we will get deadlock errors regardless (with 1000 competing messages), but the application should be resilient enough to recover from them once the bounced messages are retried.

Any idea what we are missing here? Has anybody dealt with similar issues before?

I am looking for design ideas that make this work resiliently, so that the application recovers from the deadlock situation and eventually processes all messages in a reasonable amount of time without much additional hardware.

COMPUTATION DETAILS: Each of these 1000 messages creates 4 objects, one for each of 4 different position types, and each object has a quantity associated with it. These quantities have to be merged into 4 different slots (depending on the position type). The deadlock happens when those 4 individual slots are updated by each individual thread. We already sort those individual updates into a specific order before applying them to the database rows, to avoid any possible race conditions.


2 Answers


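A deadlock implies that each thread is trying to update multiple rows in a single transaction, and that those updates are being done in a different order across threads. The simplest possible answer, therefore, is to modify the code so that the updates within a single transaction are applied in some defined order (i.e. in primary key order). That ensures you never get a deadlock, although you will still get blocking locks while one thread waits for another to commit its transaction.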
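A minimal sketch of the ordering idea, assuming Java 8+ syntax. The `PositionUpdate` type, its `rowId` field, and the `UpdateOrdering` helper are hypothetical names made up for illustration; the question states that its updates are already ordered, so this only shows what "a defined order" could look like in code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical value type: one quantity change aimed at one row.
final class PositionUpdate {
    final long rowId;     // assumed primary key of the target row
    final long quantity;  // quantity to merge into that row

    PositionUpdate(long rowId, long quantity) {
        this.rowId = rowId;
        this.quantity = quantity;
    }
}

final class UpdateOrdering {
    // Returns a sorted copy, leaving the input list untouched.
    // If every transaction applies its updates in this ascending
    // key order, row locks are always requested in the same order.
    static List<PositionUpdate> inLockOrder(List<PositionUpdate> updates) {
        List<PositionUpdate> sorted = new ArrayList<>(updates);
        sorted.sort(Comparator.comparingLong((PositionUpdate u) -> u.rowId));
        return sorted;
    }
}
```

The ordering only helps if it covers every row a transaction locks, including rows locked by triggers or by other statements in the same transaction.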

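Taking a step back, though, it seems unlikely that you really want many threads updating the same row of a table when you cannot predict the order of the updates. That seems very likely to produce lots of lost updates and some rather unpredictable behavior. What exactly is your application doing that makes this sensible? Are you updating an aggregate table after inserting rows into a detail table (i.e. updating a post's view count in addition to recording information about a particular view)? If so, do those operations really need to be synchronous? Or could you update the view count periodically in another thread by aggregating the views from the last N seconds?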

Answered 2013-09-10T23:49:01.543

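As for the MDB:

  1. Let it consume the messages and update instance variables that hold the deltas of the quantities processed so far (an MDB can carry state in its instance variables across multiple messages).

  2. A @Schedule method in the same MDB persists the quantities once per second, using a single SQL statement in a single database transaction, for example:

update x set q1 = q1 + delta1, q2 = q2 + delta2, ...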
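The two steps above can be sketched in plain Java, without the EJB annotations. `DeltaAccumulator` and its methods are hypothetical names for illustration; the table `x` and columns `q1`, `q2`, ... are taken from the statement above. In an MDB, `add` would be called from `onMessage` and `drain` from the `@Schedule` method.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical helper: collects per-slot deltas in memory so that the
// database is touched once per flush instead of once per message.
final class DeltaAccumulator {
    private final AtomicLongArray pending;

    DeltaAccumulator(int slots) {
        pending = new AtomicLongArray(slots);
    }

    // Called for every message: adds to the in-memory delta, no DB access.
    void add(int slot, long quantity) {
        pending.addAndGet(slot, quantity);
    }

    // Called by the timer: takes and resets each slot's delta.
    // Each slot is swapped atomically, but the slots are not
    // snapshotted together as one unit.
    long[] drain() {
        long[] deltas = new long[pending.length()];
        for (int i = 0; i < deltas.length; i++) {
            deltas[i] = pending.getAndSet(i, 0L);
        }
        return deltas;
    }

    // Builds the parameterised form of the statement shown above,
    // e.g. "update x set q1 = q1 + ?, q2 = q2 + ?" for two slots.
    static String updateSql(int slots) {
        StringBuilder sql = new StringBuilder("update x set ");
        for (int i = 1; i <= slots; i++) {
            if (i > 1) {
                sql.append(", ");
            }
            sql.append("q").append(i).append(" = q").append(i).append(" + ?");
        }
        return sql.toString();
    }
}
```

If the database update fails after `drain`, the drained values would have to be added back with `add`, otherwise they are lost; that error handling is left out of the sketch.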

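I ran some tests:

  • Creating 1000 messages took 6 s (JBoss 7 with HornetQ)
  • In that time, 840 messages had already been persisted.
  • It took another 2 s to persist the rest (the scheduled method runs once per second)
  • This took seven SQL update commands in seven database transactions
  • The load was caused entirely by creating the messages; there was no real load on the database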

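Notes

  • You need an additional @PreDestroy method that persists the pending deltas, to make sure nothing gets lost
  • This approach is not appropriate if transactional correctness must be guaranteed. In that case I would suggest a plain queue receiver (i.e. no MDB) with a transacted session: use receive(timeout) to collect 100 to 10000 messages (or until the timeout), perform one DB transaction, and then commit the queue session immediately afterwards. This is better, but it is still not XA-transactional. If you need that, both commits have to be coordinated by a single XA transaction.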
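The "collect up to N messages or until the timeout" loop from the last note can be sketched as follows. To keep the example self-contained, a `java.util.concurrent.BlockingQueue` stands in for the JMS consumer, with `poll(timeout)` playing the role of `receive(timeout)`; `BatchCollector` and its parameters are hypothetical names, and the DB transaction and session commit are not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class BatchCollector {
    // Collects up to maxBatch items. Stops early when no item arrives
    // within timeoutMillis, which mirrors receive(timeout) returning null.
    static <T> List<T> collect(BlockingQueue<T> source, int maxBatch, long timeoutMillis) {
        List<T> batch = new ArrayList<>();
        try {
            while (batch.size() < maxBatch) {
                T next = source.poll(timeoutMillis, TimeUnit.MILLISECONDS);
                if (next == null) {
                    break; // timed out: process whatever has been collected
                }
                batch.add(next);
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag and return the partial batch.
            Thread.currentThread().interrupt();
        }
        return batch;
    }
}
```

The caller would then merge the whole batch into one set of deltas, apply them in one DB transaction, and commit the queue session right after, as described in the note.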
Answered 2013-09-11T20:36:55.087