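I have had absolutely no luck setting up RAID with mdadm. At this point I suspect my hardware. Shortly after the initial setup, both during and after a successful sync, drives get marked as failed and removed from the array. I have tried both the raw-drive method and the partition method. With the partition method I tried full-capacity partitions as well as smaller ones (drive capacity minus 100 MB). From the many posts I have read, I concluded that the recommended way to set up an mdadm RAID is to use partitions slightly smaller than the actual drive capacity rather than raw, unpartitioned drives. This makes management easier, for example when replacing a failed drive.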
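
For the smaller-partition variant, the end of the partition can be worked out ahead of time. This is only a minimal sketch, assuming the capacity is rounded down to a whole MiB and the margin is counted in MiB; the function name and the 100 MiB default are my own choices, not anything mdadm requires.

```python
MIB = 1024 * 1024  # bytes per MiB


def partition_end_mib(capacity_bytes: int, margin_mib: int = 100) -> int:
    """Return the MiB offset at which the partition should end.

    Hypothetical helper: rounds the drive capacity down to a whole MiB,
    then leaves `margin_mib` unused at the end of the disk.
    """
    whole_mib = capacity_bytes // MIB
    end = whole_mib - margin_mib
    if end <= 1:
        # A partition starting at 1 MiB would have no room left.
        raise ValueError("drive too small for the requested margin")
    return end


if __name__ == "__main__":
    # Capacity in bytes as reported by smartctl for one of the 4 TB drives.
    print(partition_end_mib(4_000_787_030_016))
```

The result would be the end offset for a partition that starts at 1 MiB in a partitioning tool such as parted. Using the same value on every drive keeps the members identical in size.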

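My hardware starts with a Dell PowerEdge R410 server. It has an eSATA adapter (nothing high-end) connected to a 5-bay Sans Digital TowerRaid TR5M-(B) holding four 4 TB WD Red NAS drives. I want to keep the data storage separate from the physical server. I have not tried moving the disks into the Dell server because I do not want the operating system on the RAID array. I suppose I could try booting from an external drive, but that is too unorthodox and I really do not want to go in that direction.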

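I have come across one or two posts that mention a "timing" issue, and I wonder whether that is really the root of my problem. But those posts talk about failures during the sync process. In my case, I have seen the array sync successfully to 100% before watching it collapse. I can post a string of mdadm --examine and --detail output.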


services-admin@mydomain:~$ sudo mdadm --detail /dev/md0

       Version : 1.2
 Creation Time : Mon Feb 25 14:42:27 2019
    Raid Level : raid6
    Array Size : 7813566464 (7451.60 GiB 8001.09 GB)
 Used Dev Size : 3906783232 (3725.80 GiB 4000.55 GB)
  Raid Devices : 4
 Total Devices : 4
   Persistence : Superblock is persistent

 Intent Bitmap : Internal

   Update Time : Mon Feb 25 16:01:57 2019
         State : clean, FAILED 
Active Devices : 0
Failed Devices : 4
 Spare Devices : 0

        Layout : left-symmetric
    Chunk Size : 512K

Consistency Policy : bitmap

Number   Major   Minor   RaidDevice State
   -       0        0        0      removed
   -       0        0        1      removed
   -       0        0        2      removed
   -       0        0        3      removed

   0       8        1        -      faulty   /dev/sda1
   1       8       17        -      faulty   /dev/sdb1
   2       8       33        -      faulty   /dev/sdc1
   3       8       49        -      faulty   /dev/sdd1
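
As a sanity check on the sizes above: RAID 6 spends two devices' worth of space on parity, so the usable size should be (devices - 2) times the per-device size, and the array can lose at most two members. A small sketch of that arithmetic follows. The function names are mine, and sizes are in KiB, the unit mdadm prints.

```python
def raid6_array_size_kib(raid_devices: int, used_dev_size_kib: int) -> int:
    """Usable RAID 6 capacity: two devices' worth of space goes to parity."""
    if raid_devices < 4:
        raise ValueError("RAID 6 needs at least 4 devices")
    return (raid_devices - 2) * used_dev_size_kib


def raid6_survives(failed_devices: int) -> bool:
    """RAID 6 keeps running with up to two failed members."""
    return failed_devices <= 2


if __name__ == "__main__":
    # Values taken from the mdadm --detail output above.
    print(raid6_array_size_kib(4, 3906783232))  # compare with "Array Size"
    print(raid6_survives(4))                    # four faulty members
```

With all four members marked faulty, the array is well past what RAID 6 can tolerate, which matches the FAILED state reported.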

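I believe I have found the cause of the problem. Looking at the individual drives, smartctl shows interface CRC errors. A sample from one of the drives is below (see lines 100, 117, and 134). Every drive shows similar errors. I doubt that the interfaces on all four drives are faulty, especially after such a short time. So it looks like a bad eSATA cable, server PCI card, TowerRaid interface, or some combination of these. I will start with the cable and go from there.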

sudo smartctl --all /dev/sdb | cat -n $1
 1      smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-45-generic] (local build)
 2        Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
 5        Device Model:     WDC WD4002FFWX-68TZ4N0
 6        Serial Number:    K4JHGWXB
 7        LU WWN Device Id: 5 000cca 25de33882
 8        Firmware Version: 83.H0A83
 9        User Capacity:    4,000,787,030,016 bytes [4.00 TB]
10        Sector Sizes:     512 bytes logical, 4096 bytes physical
11        Rotation Rate:    7200 rpm
12        Form Factor:      3.5 inches
13        Device is:        Not in smartctl database [for details use: -P showall]
14        ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
15        SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
16        Local Time is:    Tue Feb 26 12:41:00 2019 MST
17        SMART support is: Available - device has SMART capability.
18        SMART support is: Enabled
21        SMART Status not supported: Incomplete response, ATA output registers missing
22        SMART overall-health self-assessment test result: PASSED
23        Warning: This result is based on an Attribute check.
25        General SMART Values:
26        Offline data collection status:  (0x80)   Offline data collection activity
27                          was never started.
28                          Auto Offline Data Collection: Enabled.
29        Self-test execution status:      (   0)   The previous self-test routine completed
30                          without error or no self-test has ever
31                          been run.
32        Total time to complete Offline
33        data collection:      (  113) seconds.
34        Offline data collection
35        capabilities:              (0x5b) SMART execute Offline immediate.
36                          Auto Offline data collection on/off support.
37                          Suspend Offline collection upon new
38                          command.
39                          Offline surface scan supported.
40                          Self-test supported.
41                          No Conveyance Self-test supported.
42                          Selective Self-test supported.
43        SMART capabilities:            (0x0003)   Saves SMART data before entering
44                          power-saving mode.
45                          Supports SMART auto save timer.
46        Error logging capability:        (0x01)   Error logging supported.
47                          General Purpose Logging supported.
48        Short self-test routine
49        recommended polling time:      (   2) minutes.
50        Extended self-test routine
51        recommended polling time:      ( 571) minutes.
52        SCT capabilities:            (0x003d) SCT Status supported.
53        SCT Error Recovery Control supported.
54        SCT Feature Control supported.
55        SCT Data Table supported.
57        SMART Attributes Data Structure revision number: 16
58        Vendor Specific SMART Attributes with Thresholds:
60          1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
61          2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       104
62          3 Spin_Up_Time            0x0007   142   142   024    Pre-fail  Always       -       369 (Average 381)
63          4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       23
64          5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
65          7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
66          8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
67          9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       820
68         10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
69         12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
70        192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       55
71        193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       55
72        194 Temperature_Celsius     0x0002   171   171   000    Old_age   Always       -       35 (Min/Max 19/42)
73        196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
74        197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
75        198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
76        199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       3
78        SMART Error Log Version: 1
79        ATA Error Count: 3
80          CR = Command Register [HEX]
81          FR = Features Register [HEX]
82          SC = Sector Count Register [HEX]
83          SN = Sector Number Register [HEX]
84          CL = Cylinder Low Register [HEX]
85          CH = Cylinder High Register [HEX]
86          DH = Device/Head Register [HEX]
87          DC = Device Command Register [HEX]
88          ER = Error register [HEX]
89          ST = Status register [HEX]
90        Powered_Up_Time is measured from power on, and printed as
91        DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
92        SS=sec, and sss=millisec. It "wraps" after 49.710 days.
94        Error 3 occurred at disk power-on lifetime: 715 hours (29 days + 19 hours)
95          When the command that caused the error occurred, the device was active or idle.
97          After command completion occurred, registers were:
98          ER ST SC SN CL CH DH
99          -- -- -- -- -- -- --
100         84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0
102         Commands leading to the command that caused the error were:
103         CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
104         -- -- -- -- -- -- -- --  ----------------  --------------------
105         61 40 d8 c0 33 f8 40 08  10d+11:01:08.855  WRITE FPDMA QUEUED
106         61 40 f0 80 2e f8 40 08  10d+11:01:08.847  WRITE FPDMA QUEUED
107         61 40 e8 40 29 f8 40 08  10d+11:01:08.844  WRITE FPDMA QUEUED
108         61 40 e0 00 24 f8 40 08  10d+11:01:08.841  WRITE FPDMA QUEUED
109         61 a8 d8 18 20 f8 40 08  10d+11:01:08.840  WRITE FPDMA QUEUED
111       Error 2 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
112         When the command that caused the error occurred, the device was active or idle.
114         After command completion occurred, registers were:
115         ER ST SC SN CL CH DH
116         -- -- -- -- -- -- --
117         84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0
119         Commands leading to the command that caused the error were:
120         CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
121         -- -- -- -- -- -- -- --  ----------------  --------------------
122         60 00 c8 00 02 00 40 08      00:00:16.009  READ FPDMA QUEUED
123         47 00 01 12 00 00 a0 08      00:00:15.990  READ LOG DMA EXT
124         47 00 01 00 00 00 a0 08      00:00:15.989  READ LOG DMA EXT
125         ef 10 02 00 00 00 a0 08      00:00:15.987  SET FEATURES [Enable SATA feature]
126         27 00 00 00 00 00 e0 08      00:00:15.987  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
128       Error 1 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
129         When the command that caused the error occurred, the device was active or idle.
131         After command completion occurred, registers were:
132         ER ST SC SN CL CH DH
133         -- -- -- -- -- -- --
134         84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0
136         Commands leading to the command that caused the error were:
137         CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
138         -- -- -- -- -- -- -- --  ----------------  --------------------
139         60 00 b8 00 02 00 40 08      00:00:15.373  READ FPDMA QUEUED
140         60 80 b0 80 00 00 40 08      00:00:15.370  READ FPDMA QUEUED
141         60 38 a8 40 00 00 40 08      00:00:15.370  READ FPDMA QUEUED
142         60 08 a0 10 00 00 40 08      00:00:15.370  READ FPDMA QUEUED
143         60 18 98 20 00 00 40 08      00:00:15.370  READ FPDMA QUEUED
145       SMART Self-test log structure revision number 1
146       No self-tests have been logged.  [To run self-tests, use: smartctl -t]
148       SMART Selective self-test log data structure revision number 1
150           1        0        0  Not_testing
151           2        0        0  Not_testing
152           3        0        0  Not_testing
153           4        0        0  Not_testing
154           5        0        0  Not_testing
155       Selective self-test flags (0x0):
156         After scanning selected spans, do NOT read-scan remainder of disk.
157       If Selective self-test is pending on power-up, resume after 0 minute delay.
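
To compare the CRC counter across all four drives without reading every report by hand, the raw value of attribute 199 can be pulled out of saved smartctl output. This is a rough sketch, assuming the standard ten-column attribute table; it also tolerates the extra line-number column that cat -n adds. The function name is hypothetical.

```python
from typing import Optional

ATTR_NAME = "UDMA_CRC_Error_Count"


def udma_crc_errors(smartctl_text: str) -> Optional[int]:
    """Return the raw UDMA_CRC_Error_Count from smartctl text, or None.

    Assumes the usual column order:
    ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    so the raw value sits 8 fields after the attribute name.
    """
    for line in smartctl_text.splitlines():
        fields = line.split()
        if ATTR_NAME not in fields:
            continue
        raw_idx = fields.index(ATTR_NAME) + 8
        if raw_idx < len(fields) and fields[raw_idx].isdigit():
            return int(fields[raw_idx])
    return None


if __name__ == "__main__":
    # Attribute line copied from the /dev/sdb report above.
    sample = ("76        199 UDMA_CRC_Error_Count    0x000a   200   200   000"
              "    Old_age   Always       -       3")
    print(udma_crc_errors(sample))
```

Running this over the saved output for each of /dev/sda through /dev/sdd, before and after swapping the cable, would show whether the counter keeps climbing. A count that still increases after the swap would point at another part of the link, such as the adapter card or the enclosure.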
