mysql - Optimal MySQL Config (Partition) & Indexes / Hypertable / RAID Config (Huge Database)

Question

tl;dr:

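1. Database partitioning by primary key.
2. Index size problem.
3. Database size grows by about 1-3 GB per day.
4. RAID setup.
5. Do you have experience with Hypertable?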

Long version:

I just built/bought a home server:

Xeon E3-1245 3.4 GHz HT
32 GB RAM
6x 1.5 TB WD Caviar Black 7200 rpm

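I will use the onboard RAID of the server board, an Intel S1200BTL (no money left for a RAID controller). http://ark.intel.com/products/53557/Intel-Server-Board-S1200BTL

The board has 4x SATA 3 Gb/s ports and 2x SATA 6 Gb/s ports.

I'm not sure yet whether I can put all 6 HDDs into a RAID 10.

If that's not possible, I'm thinking of 4x HDDs in RAID 10 (MySQL DB) and 2x HDDs in RAID 0 (OS / MySQL indexes).

(If the RAID 0 breaks, that's no problem for me; I only need to protect the DB.)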
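To make the two layouts easier to compare, here is a rough capacity sketch. It assumes plain RAID 0 (striping, no redundancy) and RAID 10 (striped mirrors, half the raw capacity) and ignores filesystem overhead; the helper name `raid_usable_tb` is made up for this illustration.

```python
def raid_usable_tb(level: str, disks: int, disk_tb: float) -> float:
    """Usable capacity in TB for a simple RAID 0 or RAID 10 array."""
    if level == "raid0":
        # Striping only: every disk contributes its full capacity.
        return disks * disk_tb
    if level == "raid10":
        # Striped mirrors: half of the raw capacity is used for copies.
        if disks < 4 or disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        return disks * disk_tb / 2
    raise ValueError(f"unsupported level: {level}")


DISK_TB = 1.5  # size of one drive, from the hardware list above

all_six = raid_usable_tb("raid10", 6, DISK_TB)   # option A: one 6-disk RAID 10
split_db = raid_usable_tb("raid10", 4, DISK_TB)  # option B: 4-disk RAID 10 for the DB
split_os = raid_usable_tb("raid0", 2, DISK_TB)   # option B: 2-disk RAID 0 for OS/indexes

print(f"6-disk RAID 10: {all_six} TB")
print(f"4-disk RAID 10: {split_db} TB, 2-disk RAID 0: {split_os} TB")
```

Under these assumptions the single 6-disk RAID 10 offers 4.5 TB of protected space, while the split layout offers 3 TB protected plus 3 TB unprotected.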

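About the DB:

It's a web crawler DB that stores domains, URLs, links and similar data. So I thought I would partition the DB by each table's primary key, like (1-1000000), (1000001-2000000) and so on.

When I run search/insert/select queries against the DB, I need to scan the whole table, because some things may be in row 1 and others in row 1000000000000.

If I partition by primary key (auto_increment) like this, will it use all my CPU cores, so that each partition is scanned in parallel? Or should I stick with one huge DB without partitions?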
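The range scheme described above can be sketched as follows. This only models the id-to-partition arithmetic, not how MySQL executes queries; the function names and `PARTITION_SIZE` are assumptions made for the illustration.

```python
from typing import List, Optional, Tuple

PARTITION_SIZE = 1_000_000  # rows per partition, matching the ranges above


def partition_bounds(index: int, size: int = PARTITION_SIZE) -> Tuple[int, int]:
    """Inclusive id range covered by the 0-based partition number `index`."""
    return index * size + 1, (index + 1) * size


def partition_for_id(row_id: int, size: int = PARTITION_SIZE) -> int:
    """0-based partition number that holds a given auto_increment id."""
    if row_id < 1:
        raise ValueError("auto_increment ids start at 1")
    return (row_id - 1) // size


def partitions_to_scan(max_id: int, row_id: Optional[int] = None,
                       size: int = PARTITION_SIZE) -> List[int]:
    """Partitions a lookup has to consider.

    If the id is known, a single partition is enough. If the lookup filters
    on some other column, no id range is known and every partition remains
    a candidate.
    """
    if row_id is not None:
        return [partition_for_id(row_id, size)]
    return list(range(partition_for_id(max_id, size) + 1))


rows_in_extract = 25_034_072  # current row count of the extract table
print(partition_bounds(1))                             # second range
print(len(partitions_to_scan(rows_in_extract)))        # lookup without an id
print(partitions_to_scan(rows_in_extract, 1_000_001))  # lookup by id
```

The point of the sketch is that ranges on an auto_increment key only narrow a lookup when the query itself constrains that key; a search on `extern_link` or `url` gives no id range to narrow by.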

The DB will be very large. Right now, on my home system, it looks like this:

Table extract: 25,034,072 rows
Data    2,058.7 MiB
Index   2,682.8 MiB
Total   4,741.5 MiB

Table structure:
Field        Type                  Null  Key  Default  Extra
extract_id   bigint(20) unsigned   NO    PRI  NULL     auto_increment
url_id       bigint(20)            NO    MUL  NULL
extern_link  varchar(2083)         NO    MUL  NULL
anchor_text  varchar(500)          NO         NULL
http_status  smallint(2) unsigned  NO         0

Indexes:
Keyname     Type   Unique  Packed  Column             Cardinality
PRIMARY     BTREE  Yes     No      extract_id         25034072
link        BTREE  Yes     No      url_id
                                   extern_link (400)  25034072
externlink  BTREE  No      No      extern_link (400)  1788148


Table urls: 21,889,542 rows
Data    2,402.3 MiB
Index   3,456.2 MiB
Total   5,858.4 MiB

Table structure:
Field         Type                  Null  Key  Default  Extra
url_id        bigint(20)            NO    PRI  NULL     auto_increment
domain_id     bigint(20)            NO    MUL  NULL
url           varchar(2083)         NO         NULL
added         date                  NO         NULL
last_crawl    date                  NO         NULL
extracted     tinyint(2) unsigned   NO    MUL  0
extern_links  smallint(5) unsigned  NO         0
crawl_status  tinyint(11) unsigned  NO         0
status        smallint(2) unsigned  NO         0

Indexes:
Keyname           Type   Unique  Packed  Column     Cardinality
PRIMARY           BTREE  Yes     No      url_id     21889542
domain_id         BTREE  Yes     No      domain_id  0
                                         url (330)  21889542
extracted_status  BTREE  No      No      extracted  2
                                         status     31

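I see that I could fix the externlink and link indexes; I only added externlink because I need to query that field and couldn't use the link index for it. Do you see anything I can tune on the indexes? My new system will have 32 GB of RAM, but if the DB keeps growing at this rate, I will be using 90% of the RAM within a FEW weeks/months.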
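As a back-of-the-envelope check of that time frame, the sketch below projects when the two tables listed above would reach 90% of 32 GB. It rests on several assumptions not stated in the question: that GB and GiB can be treated as the same unit, that all of the 1-3 GB/day growth lands in these two tables, and that data plus index should fit in RAM. The function name is invented for the example.

```python
MIB_PER_GIB = 1024

# Current "Total" sizes from the table statistics above, in MiB.
current_gib = (4741.5 + 5858.4) / MIB_PER_GIB  # extract + urls

RAM_GIB = 32
budget_gib = 0.9 * RAM_GIB  # the "90% of the RAM" mark


def days_until_budget(current: float, growth_per_day: float, budget: float) -> float:
    """Days until `current` reaches `budget` at a constant daily growth."""
    remaining = budget - current
    if remaining <= 0:
        return 0.0
    return remaining / growth_per_day


slow_days = days_until_budget(current_gib, 1.0, budget_gib)  # 1 GB per day
fast_days = days_until_budget(current_gib, 3.0, budget_gib)  # 3 GB per day

print(f"at 1 GB/day: {slow_days:.1f} days, at 3 GB/day: {fast_days:.1f} days")
```

With these inputs the mark is reached after roughly 6 to 18 days, so "a few weeks" is, if anything, optimistic when all growth goes into these tables.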

Would packed indexes help? (How much does performance drop?)

The other important tables are smaller than 500 MB.

Only the URL Source table is huge: 48.6 GiB 
Structure: 

    url_id      BIGINT
    pagesource  MEDIUMBLOB  (data is packed with gzip, high compression)

    Index is only on url_id (unique).
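For illustration, packing a page source before storing it in the `pagesource` column might look roughly like this. It is a minimal sketch using Python's standard `gzip` module at the highest compression level; the crawler's actual code is not shown in the question, and the function names and sample page are invented.

```python
import gzip


def pack_page(html: str) -> bytes:
    """Compress a page source with gzip at the highest level (9)."""
    return gzip.compress(html.encode("utf-8"), compresslevel=9)


def unpack_page(blob: bytes) -> str:
    """Restore the original page source from the stored blob."""
    return gzip.decompress(blob).decode("utf-8")


# Hypothetical sample page with repetitive markup, as HTML often has.
sample = "<html>" + "<a href='http://example.com/'>link</a>" * 200 + "</html>"
packed = pack_page(sample)

print(len(sample.encode("utf-8")), "bytes raw ->", len(packed), "bytes packed")
```

Level 9 trades extra CPU time on insert for smaller blobs, which fits a table that is written once and read only during extraction.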

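Once I have extracted everything I need, the data can be deleted from this table.

Do you have any experience with Hypertable? http://hypertable.org/ <= Google's Bigtable. If I move to Hypertable, would that help my performance (extracting data / search / insert / select) and the DB size? I read through the page, but I'm still somewhat clueless, because you can't compare MySQL and Hypertable directly. I will try it soon; I have to read the documentation first.

I need a solution that fits my setup, because I have no money for any other hardware setup.

Thanks for the help.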

score 0 · Accepted Answer

Regarding #4 (RAID setup): RAID 5 is not recommended for production servers. A good article about it -> http://www.dbasquare.com/2012/04/02/should-raid-5-be-used-in-a-mysql-server/

score 0 · Accepted Answer

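Hypertable is an excellent choice for a crawl database. It is an open-source, high-performance, scalable database modeled after Google's Bigtable. Google developed Bigtable specifically for their crawl database. I recommend reading the Bigtable paper, since it uses a crawl database as its running example.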
