
I have a use case where I need to continuously listen to a Kafka topic and, based on a column value, write to 2000 column families (each with 15 columns of time-series data) from a Spark Streaming application. I have a local Cassandra installation; creating those column families took about 1.5 hours on a CentOS VM with 3 cores and 12 GB of RAM. In my Spark Streaming app I do some preprocessing before storing the stream events in Cassandra, and the time the application takes to do this is where I'm running into trouble.
I tried to save 300 events, keyed across multiple column families (roughly 200-250), and my application takes about 10 minutes to save them. That seems odd, because grouping the events by key and printing them to the screen takes under a minute; the time is spent only when saving to Cassandra. By contrast, I had no problem saving 3 million records to Cassandra: that took under 3 minutes (but it targeted a single column family).

My requirement is to be as close to real time as possible, and this seems a long way off. Production will see roughly 400 events every 3 seconds.

Is any tuning needed in Cassandra's YAML file, or any change in the cassandra-connector itself?
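For reference, write throughput from the Spark side is usually governed less by cassandra.yaml than by a few spark-cassandra-connector settings. A sketch of passing them at submit time; the values shown are illustrative starting points, not tuned recommendations, and the jar name is hypothetical:

```shell
# Illustrative spark-cassandra-connector write-path knobs (values are examples only)
spark-submit \
  --conf spark.cassandra.output.concurrent.writes=8 \
  --conf spark.cassandra.output.batch.size.rows=64 \
  --conf spark.cassandra.output.throughput_mb_per_sec=16 \
  my-streaming-app.jar
```

These control how many batches are written in parallel per task, how rows are grouped into batches, and an overall per-core throughput cap. That said, connector tuning will not help much if the slowdown is coming from the schema itself, as the log below suggests.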

INFO  05:25:14 system_traces.events                      0,0
WARN  05:25:14 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:14 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:15 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:15 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:15 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:15 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
INFO  05:25:16 ParNew GC in 340ms.  CMS Old Gen: 1308020680 -> 1454559048; Par Eden Space: 251658240 -> 0; 
WARN  05:25:16 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:16 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:17 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:17 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:17 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:17 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
INFO  05:25:17 ParNew GC in 370ms.  CMS Old Gen: 1498825040 -> 1669094840; Par Eden Space: 251658240 -> 0; 
WARN  05:25:18 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:18 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:18 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:18 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:19 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:19 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
INFO  05:25:19 ParNew GC in 382ms.  CMS Old Gen: 1714792864 -> 1875460032; Par Eden Space: 251658240 -> 0; 

2 Answers


I suspect you're hitting edge cases in Cassandra related to the large number of CFs/columns defined in the schema. Typically when you see tombstone warnings, it's because you've messed up the data model. However, these are in system tables, so evidently you've done something to those tables that the authors didn't expect (lots and lots of tables, and probably dropping/recreating them a lot).

Those warnings were added because scanning past tombstones looking for live columns causes memory pressure, which causes GC, which causes pauses, which causes slowness.

Can you squish the data into significantly fewer column families? You may also want to try clearing out the tombstones: drop gc_grace_seconds for that table to zero, run a major compaction (on system, if it's allowed), then raise it back to the default.
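To make the "fewer column families" idea concrete, here is a hypothetical consolidation: instead of one table per series, fold the series identifier into the partition key of a single table. All names here are made up for illustration, and the tombstone-cleanup commands should be applied with care (864000 is Cassandra's default gc_grace_seconds, i.e. 10 days):

```sql
-- Hypothetical single table replacing ~2000 per-series column families.
-- series_id takes the place of the old table name; c1..c15 are the value columns.
CREATE TABLE metrics.timeseries (
    series_id  text,
    event_time timestamp,
    c1 double, c2 double, c3 double,   -- ... through c15
    PRIMARY KEY (series_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Tombstone-cleanup sketch for an affected user table:
ALTER TABLE metrics.timeseries WITH gc_grace_seconds = 0;
-- then, from the shell:  nodetool compact <keyspace> <table>
-- and restore the default afterwards:
ALTER TABLE metrics.timeseries WITH gc_grace_seconds = 864000;
```

With one table, each incoming event becomes a row keyed by its series, so the schema stays fixed no matter how many series exist, and the system tables stop churning.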

Answered 2015-07-11T03:55:08.083

You can refer to this blog for Spark-Cassandra connector tuning; it will give you an idea of the performance numbers you can expect. You could also try SnappyData, another open-source product: a Spark-based database that would give you very high performance for your use case.

Answered 2017-05-12T17:56:24.587