mysql - MySQL sharding approaches?

Question

What is the best approach for sharding MySQL tables? The approaches I can think of are:

1. Application-level sharding?
2. Sharding at the MySQL proxy layer?
3. Central lookup server for sharding?

Do you know of any interesting projects or tools in this area?

score 127 · Accepted Answer

The best approach for sharding MySQL tables is to not do it unless it is totally unavoidable.

When you are writing an application, you usually want to do so in a way that maximizes velocity, developer speed. You optimize for latency (time until the answer is ready) or throughput (number of answers per time unit) only when necessary.

You partition and then assign partitions to different hosts (= shard) only when the sum of all these partitions no longer fits onto a single database server instance - the reason for that being either writes or reads.

The write case is either a) the frequency of writes is permanently overloading this server's disks or b) there are so many writes going on that replication permanently lags in this replication hierarchy.

The read case for sharding is when the size of the data is so large that the working set of it no longer fits into memory and data reads start hitting the disk instead of being served from memory most of the time.

You shard only when you have to.

The moment you shard, you are paying for that in multiple ways:

Much of your SQL is no longer declarative.

Normally, in SQL you are telling the database what data you want and leave it to the optimizer to turn that specification into a data access program. That is a good thing, because it is flexible, and because writing these data access programs is boring work that harms velocity.

With a sharded environment you are probably joining a table on node A against data on node B, or you have a table larger than a node, on nodes A and B, and are joining data from it against data that is on nodes B and C. You start writing application-side hash-based join resolutions manually in order to resolve that (or you are reinventing MySQL Cluster), meaning you end up with a lot of SQL that is no longer declarative, but expresses SQL functionality in a procedural way (e.g. you are using SELECT statements in loops).
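As a rough illustration of what such a hand-written join tends to look like, here is a minimal sketch. Two in-memory SQLite databases stand in for MySQL nodes A and B, and the `users` and `orders` tables, their columns and the function name are hypothetical, chosen only for this example.

```python
import sqlite3

# Two in-memory SQLite databases stand in for MySQL nodes A and B (assumption).
node_a = sqlite3.connect(":memory:")
node_b = sqlite3.connect(":memory:")

# Hypothetical schema: users live on node A, their orders on node B.
node_a.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
node_a.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob"), (3, "cy")])
node_b.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total INTEGER)")
node_b.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 50), (11, 1, 20), (12, 3, 70)])


def join_users_orders(min_total):
    """Procedural stand-in for:
    SELECT u.name, o.total FROM users u JOIN orders o ON o.user_id = u.id
    WHERE o.total >= ? ORDER BY o.id
    when the two tables sit on different nodes."""
    # Step 1: apply the filter on the node that owns the orders.
    orders = node_b.execute(
        "SELECT user_id, total FROM orders WHERE total >= ? ORDER BY id", (min_total,)
    ).fetchall()
    if not orders:
        return []
    # Step 2: fetch the matching users in one batched round trip, not one per key.
    user_ids = sorted({user_id for user_id, _ in orders})
    placeholders = ",".join("?" * len(user_ids))
    names = dict(
        node_a.execute(f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids)
    )
    # Step 3: probe the in-memory hash table, i.e. do the join in the application.
    return [(names[user_id], total) for user_id, total in orders if user_id in names]
```

What a single declarative statement would express on one server becomes three procedural steps here, and the join order is fixed by the programmer rather than chosen by an optimizer.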

You are incurring a lot of network latency.

Normally, an SQL query can be resolved locally and the optimizer knows about the costs associated with local disk accesses and resolves the query in a way that minimizes the costs for that.

In a sharded environment, queries are resolved by either running key-value accesses across a network to multiple nodes (hopefully with batched key accesses and not individual key lookups per round trip) or by pushing parts of the WHERE clause onward to the nodes where they can be applied (that is called 'condition pushdown'), or both.

But even in the best of cases this involves many more network round trips than a local situation, and it is more complicated. Especially since the MySQL optimizer knows nothing about network latency at all (OK, MySQL Cluster is slowly getting better at that, but for vanilla MySQL outside of Cluster that is still true).

You are losing a lot of expressive power of SQL.

Ok, that is probably less important, but foreign key constraints and other SQL mechanisms for data integrity are incapable of spanning multiple shards.

MySQL has no working API for asynchronous queries.

When data of the same type resides on multiple nodes (e.g. user data on nodes A, B and C), horizontal queries often need to be resolved against all of these nodes ("Find all user accounts that have not logged in for 90 days or more"). Data access time grows linearly with the number of nodes, unless multiple nodes can be asked in parallel and the results aggregated as they come in ("Map-Reduce").

The precondition for that is an asynchronous communication API, which does not exist for MySQL in a good working shape. The alternative is a lot of forking and connections in the child processes, which is visiting the world of suck on a season pass.
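To make the scatter-gather pattern concrete, here is a minimal sketch of the "inactive for 90 days" query fanned out over three nodes. It uses a thread pool with one connection per node as a workaround, not an asynchronous MySQL API; the in-memory SQLite databases, the `users` table and all function names are assumptions made for illustration.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def make_node(rows):
    """Create an in-memory SQLite stand-in for one MySQL shard (assumption)."""
    # check_same_thread=False lets a pool worker thread use this connection.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, days_since_login INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    return conn


# Hypothetical user data spread over nodes A, B and C.
nodes = {
    "A": make_node([(1, 5), (2, 120)]),
    "B": make_node([(3, 90), (4, 10)]),
    "C": make_node([(5, 400)]),
}


def inactive_on_node(conn, min_days):
    """Map step: run the same query on a single node."""
    rows = conn.execute(
        "SELECT id FROM users WHERE days_since_login >= ?", (min_days,)
    ).fetchall()
    return [user_id for (user_id,) in rows]


def inactive_everywhere(min_days):
    """Ask every node in parallel, then merge the partial results (reduce step)."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = pool.map(lambda conn: inactive_on_node(conn, min_days), nodes.values())
        return sorted(user_id for part in partials for user_id in part)
```

Each connection is used by exactly one worker per call, so the total wait is roughly that of the slowest node rather than the sum over all nodes.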

Once you start sharding, data structure and network topology become visible as performance points to your application. In order to perform reasonably well, your application needs to be aware of these things, and that means that really only application level sharding makes sense.

The question is more if you want to auto-shard (determining which row goes into which node by hashing primary keys for example) or if you want to split functionally in a manual way ("The tables related to the xyz user story go to this master, while abc and def related tables go to that master").
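For the auto-sharding variant, a minimal sketch of routing by hashed primary key could look like the following. The shard names and the `shard_for` helper are hypothetical, and CRC32 is just one possible stable hash.

```python
import zlib

# Hypothetical shard names; in practice these would be connection settings.
SHARDS = ["db0", "db1", "db2", "db3"]


def shard_for(primary_key):
    """Pick a shard from a stable hash of the primary key."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    digest = zlib.crc32(str(primary_key).encode("utf-8"))
    return SHARDS[digest % len(SHARDS)]
```

With plain modulo hashing like this, changing the number of shards remaps most keys, which is one reason the choice of scheme deserves thought up front.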

Functional sharding has the advantage that, if done right, it is invisible to most developers most of the time, because all tables related to their user story will be available locally. That allows them to still benefit from declarative SQL as long as possible, and will also incur less network latency because the number of cross-network transfers is kept minimal.

Functional sharding has the disadvantage that it does not allow for any single table to be larger than one instance, and it requires manual attention of a designer.

Functional sharding has the advantage that it is relatively easily done to an existing codebase with a number of changes that is not overly large. http://Booking.com has done it multiple times in the past years and it worked well for them.
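A functional split can be sketched as a static table-to-master map that the application consults before opening a connection. The host names, table names and the `master_for` helper below are all made up for illustration.

```python
# Hypothetical table groups and master hosts, one group per user story.
FUNCTIONAL_SHARDS = {
    "xyz-master.example": {"xyz_accounts", "xyz_sessions"},
    "abc-master.example": {"abc_items", "def_prices"},
}

# Reverse index: table name -> master that holds it.
TABLE_TO_MASTER = {
    table: master for master, tables in FUNCTIONAL_SHARDS.items() for table in tables
}


def master_for(tables):
    """Return the one master holding all tables, or raise if the query would cross shards."""
    masters = {TABLE_TO_MASTER[table] for table in tables}
    if len(masters) != 1:
        raise ValueError(f"tables {sorted(tables)} are not co-located on one master")
    return masters.pop()
```

As long as a query only touches tables of one user story, it resolves to a single master and can stay an ordinary declarative SQL statement.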

Having said all that, looking at your question, I do believe that you are asking the wrong questions, or I am completely misunderstanding your problem statement.

score 12 · Answer

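1. Application-level sharding: dbShards is the only product I know of that does "application-aware sharding". There are a few good articles on its website. By definition, application-aware sharding is going to be more efficient. If an application knows exactly where a transaction should go without having to look it up or be redirected by a proxy, that in itself will be faster. And speed is usually one of the primary concerns, if not the only one, when someone is looking into sharding.
2. Some people "shard" with a proxy, but in my eyes that defeats the purpose of sharding. You are just using another server to tell your transactions where to find the data or where to store it. With application-aware sharding, your application knows where to go on its own. Much more efficient.
3. This is really the same as #2.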

score 7 · Answer

Do you know of any interesting projects or tools in this area?

Several new projects in this space:

citusdata.com
spockproxy.sourceforge.net
~~github.com/twitter/gizzard/~~

score 5 · Answer

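Shard-Query is an OLAP-based sharding solution for MySQL. It allows you to define a combination of sharded and unsharded tables. The unsharded tables (like lookup tables) can be freely joined to sharded tables, and sharded tables may be joined to each other as long as they are joined on the shard key (no cross-shard joins or self-joins that cross shard boundaries). Being an OLAP solution, Shard-Query usually has minimum response times of 100 ms or less, even for simple queries, so it will not work for OLTP. Shard-Query is designed for analyzing big data sets in parallel.

OLTP sharding solutions exist for MySQL as well. Closed-source solutions include ScaleDB and DBShards. Open-source OLTP solutions include JetPants, Cubrid, and Flock/Gizzard (Twitter infrastructure).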

score 5 · Answer

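Application level, of course.

The best approach I have found is in this book:

High Performance MySQL http://www.amazon.com/High-Performance-MySQL-Jeremy-Zawodny/dp/0596003064

Short description: you split your data into many parts and store about 50 parts on each server. This helps you avoid the second biggest problem of sharding - rebalancing. Just move some of the parts to the new server and everything will be fine :)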
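The idea of many small parts per server can be sketched as follows. This is an illustration under assumptions rather than the book's code: the partition count, server names and helper functions are all hypothetical.

```python
import zlib

NUM_PARTITIONS = 100  # assumed to stay fixed for the lifetime of the system

# Hypothetical starting layout: two servers with 50 logical partitions each.
partition_to_server = {
    p: ("server1" if p < 50 else "server2") for p in range(NUM_PARTITIONS)
}


def partition_for(key):
    """Map a key to a logical partition; this never changes when servers are added."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_PARTITIONS


def server_for(key):
    """Resolve the key's partition to whichever server currently holds it."""
    return partition_to_server[partition_for(key)]


def move_partitions(partitions, new_server):
    """Rebalance by reassigning whole partitions (the data copy itself is not shown)."""
    for p in partitions:
        partition_to_server[p] = new_server
```

Adding a third server then means calling something like `move_partitions(range(40, 50), "server3")`: only the keys in the moved partitions change location, and no key is ever rehashed.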

I strongly recommend that you buy it and read the "mysql scaling" part.

score 4 · Answer

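As of 2018, there seems to be a MySQL-native solution to this. There are actually at least two - InnoDB Cluster and NDB Cluster (the latter has a commercial and a community version).

Since most people who use the MySQL community edition are more familiar with the InnoDB engine, this is what should be explored first. It supports replication and partitioning/sharding out of the box and relies on MySQL Router for different routing/load-balancing options.

The syntax for creating your tables would need to change, for example: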

    CREATE TABLE t1 (col1 INT, col2 CHAR(5), col3 DATETIME) PARTITION BY HASH ( YEAR(col3) );

(this is only one of four partitioning types)
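For `PARTITION BY HASH`, MySQL places a row in partition `MOD(expr, num)`, where `num` is the number of partitions. The sketch below mirrors that rule for `YEAR(col3)`; the `hash_partition` helper and its `num_partitions` parameter are added here for illustration, since the statement above has no `PARTITIONS` clause and would therefore default to a single partition.

```python
from datetime import datetime


def hash_partition(col3, num_partitions):
    """Partition index for PARTITION BY HASH(YEAR(col3)): MOD(YEAR(col3), num_partitions)."""
    return col3.year % num_partitions
```

With four partitions, for instance, a row dated in 2005 would land in partition 1, because 2005 mod 4 is 1.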

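One very important limitation:

InnoDB foreign keys and MySQL partitioning are not compatible. Partitioned InnoDB tables cannot have foreign key references, nor can they have columns referenced by foreign keys. InnoDB tables which have or which are referenced by foreign keys cannot be partitioned.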
