[SOLVED] Help chasing an I/O error, bad cable, bad controller, bad drive?
I'm running an x86 PC with Slackware 14.2 (kernel 4.19.139, huge/custom), recently updated to 15 (5.15.149, huge/custom). The HBA is an LSI SAS2308 (rev 05), FW 16.00.00.00-IT, BIOS MPT2BIOS-7.31.00.00, plus an IBM SAS expander 46M0997, previously FW ver 605, now 634A. I run fifteen 3-4 TB SATA drives, all HGST Ultrastar or Seagate IronWolf. The OS is on a separate drive connected to the built-in motherboard SATA controller.
My drives are configured in RAID 6, except the OS drive, which is kept separate from the RAID.
Problem:
I've been tracking a rare, intermittent fault where the RAID would lose 1-3 drives in short order after working well for days. Only the Seagate drives were affected, and no errors were recorded in SMART. I could reproduce the fault by stopping the array and doing a simple 'dd if=/dev/(drive) of=/dev/null' for a cycle or two.

At that time the IBM expander was on the old 605(?) FW and all of the Seagate drives were connected to it; 4 drives were connected directly to the LSI card and the rest to the IBM expander, which in turn connected to the LSI card. For diagnosis, I moved all the Seagate drives directly onto the LSI card, ran the dd test, and got no errors. Researching, I found that the old IBM FW was known for drive incompatibilities, so I updated it to current. (Thanks, Art Of Server dude!) I then hooked the Seagate drives back to the IBM card, ran a few cycles of my dd test, and again saw no errors.

I also noticed I was only using one uplink port on the IBM card, so I reconfigured my cabling to hook both uplinks to the LSI and put all the RAID disks on the IBM card. Note that all of the drives always remained on the same 'SFF-8087 to 4x SATA/SAS' splitter cable. As an aside, I did some speed testing with one vs. two uplink cables and did not notice a difference. Are there two ports to allow redundant HBAs? Hmm..
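The dd read test described above can be wrapped in a small sh function (a sketch; the device path in the example is hypothetical -- point it at the suspect drive, as root):

```shell
#!/bin/sh
# read_test DEVICE [PASSES]: stream the whole device through dd.
# A read failure makes dd exit non-zero, and the kernel logs the
# corresponding blk_update_request / Buffer I/O error lines to dmesg.
read_test() {
    dev=$1
    passes=${2:-2}
    i=1
    while [ "$i" -le "$passes" ]; do
        echo "pass $i on $dev" >&2
        if ! dd if="$dev" of=/dev/null bs=1M 2>/dev/null; then
            echo "read error on pass $i of $dev"
            return 1
        fi
        i=$((i + 1))
    done
    echo "$passes clean passes on $dev"
}

# Example (hypothetical device name):
# read_test /dev/sdi 2
```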
At this point I also updated to Slackware 15 and updated the kernel; this box had sat disused for a while, and 15 came out in the interim.
This is the error I was getting with the incompatible FW on the IBM expander (the blk_update_request / 'Buffer I/O error' messages discussed below):
Note that I was receiving similar errors from multiple Seagate drives, and none of them had any errors logged in SMART.
I proceeded to dd my drives and then repair the array (via the md 'repair' sync action) through several cycles over days. I eventually began to see similar recurrent errors, now coming from only one of the Seagate drives. Again, though, no errors were logged in SMART. A 'smartctl -t long' self-test on the drive completed without error. On several repeats of the dd test on the drive, I noted that the sectors failing the read were not consistent.
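For reference, the SMART checks above look like this (/dev/sdi stands in for the suspect drive; run as root):

```shell
# Start an extended (long) self-test; it runs inside the drive firmware
# and does not need the array online:
smartctl -t long /dev/sdi

# Hours later (smartctl prints an estimated completion time when the
# test starts), read the self-test log and the drive's own error log:
smartctl -l selftest /dev/sdi
smartctl -l error /dev/sdi
```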
Here is the (abridged) smartctl output of the drive in question:
Code:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.149] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate IronWolf
Device Model: ST4000VN008-2DR166
Serial Number: ZGY8WY6K
LU WWN Device Id: 5 000c50 0c8bcc814
Firmware Version: SC60
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5980 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Mar 2 11:03:23 2024 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   064   044    Pre-fail  Always       -       78372864
  3 Spin_Up_Time            0x0003   096   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       41
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail  Always       -       1197074970
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5685 (148 154 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       30
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   073   068   040    Old_age   Always       -       27 (Min/Max 24/27)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       22
193 Load_Cycle_Count        0x0032   084   084   000    Old_age   Always       -       33686
194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always       -       27 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       4609 (67 78 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       14865683719
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       45013795835
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       5675        -
# 2  Extended offline    Completed without error       00%       3703        -
# 3  Extended offline    Completed without error       00%        626        -
# 4  Extended offline    Completed without error       00%        105        -
Troubleshooting:
I strongly suspect the drive itself is fine, as SMART doesn't record any of these read faults. I suspect an ongoing incompatibility with the IBM expander, or a bad cable. Of note, a single cable connected all of the Seagate drives through all of this testing, and is thus common to every fault. I have now replaced that cable and am testing with the Seagate drives plugged into the expander card. If I get errors, I will try again with the Seagate drives on the LSI card directly.
Questions:
What does this error actually mean? What are 'flags' in the error, and why does it have sectors which don't appear to match the LBA address given?
What device is actually generating the error? The drive, the expander, the HBA, or the PC/kernel?
Could this error be caused by a bad cable, bad expander/HBA, Firmware incompatibility, bad disk?
Does the fact that the errors don't correlate with any recorded fault in the SMART data make a bad disk unlikely?
When going through RAID there will be multiple translations of the block offset: from the RAID to the underlying block device, to its underlying device, and so on. The number shown in the error depends on which layer is reporting; in this case it seems to be the sd disk sector.
SMART will only record errors detected by the disk itself, typically a failure to read data with good ECC.
If the disk didn't detect a read error, then it could well be a bad cable or card.
What does this error actually mean? What are 'flags' in the error, and why does it have sectors which don't appear to match the LBA address given?
I found that the 'blk_update_request' error is generated by code in .../linux/block/blk-core.c. The flags don't look pertinent to the error; they are just the stored flags from the read request. 'I/O error' seems to be a catch-all error code. Not very useful. The Linux block layer always counts sectors in 512-byte units internally, while the 'logical block' in the buffer-layer message is in block-size units (4096 bytes here), which is why the two numbers don't appear to match.
The other error, 'Buffer I/O error on dev sdi, logical block 1874946, async page read', is I think generated in .../linux/fs/buffer.c. The code isn't very helpful in identifying where it was called from.
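The mismatch between the two numbers is just a unit difference, and easy to check. A sketch using the logical-block number from the buffer error above (the resulting sector figure is my arithmetic, not from the original log):

```shell
# blk_update_request reports 'sector' in the block layer's fixed
# 512-byte units; 'Buffer I/O error ... logical block N' is in the
# device's block size, 4096 bytes here. Converting between the two:
logical_block=1874946
block_size=4096
sector=$((logical_block * block_size / 512))
echo "logical block $logical_block = sector $sector"

# On a real device, blockdev reports the sizes, e.g.:
#   blockdev --getss --getpbsz /dev/sdi   # logical / physical sector size
```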
It looks like 'blk_update_request' is called to pass the error up, I'm guessing from the SCSI code. I checked my 'scsi_logging_level', and naturally all the logging was turned off. Consequently, I found a few things about enabling logging on the SCSI device: the scsi_logging_level script sets the SCSI logging level stored in /proc/sys/dev/scsi/logging_level. Separately, for the mpt3sas module, logging is controlled through /sys/module/mpt3sas/parameters/logging_level.
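Concretely, the two knobs live here (the per-facility bitfield layout is my understanding of it; the scsi_logging_level script from sg3_utils encodes it for you, so prefer that over hand-computed values):

```shell
# SCSI midlayer logging: one packed integer, a 3-bit level (0-7) per
# facility (error, timeout, scan, queueing, ...). 0 = everything off.
cat /proc/sys/dev/scsi/logging_level

# The sg3_utils helper script reads and sets it symbolically:
#   scsi_logging_level -g     # get current levels
# (see its --help for the per-facility set options)

# mpt3sas driver logging is separate, a hex debug mask:
cat /sys/module/mpt3sas/parameters/logging_level
```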
As a complete aside, the finger movements to type 'rm' and 'vi' are extremely similar! Oops...
/dev/hd? means the device is using the ATA block system.
/dev/sd? means the device is using the SCSI block system.
(I believe SATA typically uses the SCSI block system).
If you're running into these intermittent errors, logging for the SCSI block system is likely turned off by default. I tried turning it on as above, using the scsi_logging_level tool, but didn't get any output. I found instructions on an IBM Linux website, which mentioned that logging could help diagnose 'issues with LUN discovery and SCSI error handling (recovery), such as those caused by dirty fibre optics'. Turning on logging at the device-driver level (mpt3sas in my case) did produce output. This output is large, so turn it on, reproduce the error, then turn it back off.
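The on/reproduce/off cycle, sketched with mpt3sas (0x3f8 is a debug mask I have seen suggested for mpt3sas, not something from this thread -- check the driver source for the exact bits, and substitute your own device for /dev/sdi):

```shell
# Turn on verbose mpt3sas logging:
echo 0x3f8 > /sys/module/mpt3sas/parameters/logging_level

# Reproduce the fault while it is on:
dd if=/dev/sdi of=/dev/null bs=1M

# Grab the (large) output, then turn logging back off:
dmesg | tail -n 200 > /tmp/mpt3sas-fault.log
echo 0 > /sys/module/mpt3sas/parameters/logging_level
```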
If you use an expander (or external drive enclosure), the device is hidden behind the HBA. I found some tools you can use to interrogate it: check out smp_utils, https://sg.danny.cz/sg/smp_utils.html. You'll need the kernel module sg, the generic SCSI driver (CONFIG_CHR_DEV_SG), which is a module in most of the Slackware kernels.
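For example, with sg/bsg available the expander shows up under /sys/class/bsg as expander-H:N, and smp_utils can query it over SMP (the device name below is an example -- substitute your own):

```shell
# Find the expander's bsg node:
ls /sys/class/bsg/ | grep expander

# Report vendor / product / firmware revision of the expander:
smp_rep_manufacturer /dev/bsg/expander-6:0

# Walk each expander phy and show what is attached to it:
smp_discover /dev/bsg/expander-6:0
```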
I found this video vital in explaining how to update the firmware on the IBM 46M0997. It uses sg3_utils, which Slackware provides in its sg3_utils package.
The smartctl command let me check the drive's internal health data: 'smartctl -a'. If the drive had an actual read error, it should be logged in this block here:
Code:
SMART Error Log Version: 1
No Errors Logged
I found other threads (which I cannot find again now) in which a user had an actual failing disk throwing errors similar to mine; in his case, the drive did log SMART errors here. Check out this excellent article.
Swapping the cable and drive over to the LSI controller also appeared to fix the problem. I suspect the LSI card does more retries or something, hiding the bad cable better. This made diagnosis more difficult!
/dev/hd? means the device is using the ATA block system.
As of the 2.6.19 kernel, all storage devices are part of the SCSI subsystem and use a /dev/sdX device ID. Unless you were running a really old operating system on a really old computer with an IDE/PATA controller, you should not have a /dev/hdX device.