[SOLVED] Help chasing an I/O error, bad cable, bad controller, bad drive?
I'm running an x86 PC with Slackware 14.2 (kernel 4.19.139, huge/custom), recently updated to 15 (5.15.149, huge/custom). The HBA is an LSI SAS2308 (rev 05), FW 16.00.00.00-IT, BIOS MPT2BIOS-7.31.00.00, plus an IBM SAS expander 46M0997, previously FW ver 605, now 634A. I run fifteen 3-4 TB SATA drives, all HGST Ultrastar or Seagate IronWolf. The OS is on a separate drive connected to the built-in motherboard SATA controller.
My drives are configured in RAID 6, except the OS drive, which is kept separate from the RAID.
Problem:
I've been tracking a rare, intermittent fault where the RAID would lose 1-3 drives in short order after working well for days. Only the Seagate drives were affected, and no errors were recorded in SMART. I could reproduce the fault by stopping the array and doing a simple 'dd if=/dev/(drive) of=/dev/null' for a cycle or two.

At that time the IBM expander was on the old 605(?) FW and all of the Seagate drives were connected to it; 4 drives were connected directly to the LSI card and the rest to the IBM expander, which in turn connected to the LSI card. For diagnosis, I moved all the Seagate drives directly onto the LSI card, ran the dd test, and got no errors. Researching, I found that the old IBM FW was known for drive incompatibilities, so I updated it to current. (Thanks, Art Of Server dude!) I then hooked the Seagate drives back to the IBM card, ran a few cycles of my dd test, and again saw no errors.

I also noticed I was only using one uplink port on the IBM card, so I reconfigured my cabling to hook both uplinks to the LSI and put all the RAID disks on the IBM card. Note that all of the drives always remained on the same 'SFF-8087 to 4x SATA/SAS' splitter cable. As an aside, I did some speed testing with one vs. two uplink cables and did not notice a difference. Are there two ports to allow redundant HBAs? Hmm..
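The dd read test described above can be wrapped in a small sh function (a sketch; the device path in the example is hypothetical -- point it at the suspect drive, as root):

```shell
#!/bin/sh
# read_test DEVICE [PASSES]: stream the whole device through dd.
# A read failure makes dd exit non-zero, and the kernel logs the
# corresponding blk_update_request / Buffer I/O error lines to dmesg.
read_test() {
    dev=$1
    passes=${2:-2}
    i=1
    while [ "$i" -le "$passes" ]; do
        echo "pass $i on $dev" >&2
        if ! dd if="$dev" of=/dev/null bs=1M 2>/dev/null; then
            echo "read error on pass $i of $dev"
            return 1
        fi
        i=$((i + 1))
    done
    echo "$passes clean passes on $dev"
}

# Example (hypothetical device name):
# read_test /dev/sdi 2
```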
At this point I also updated to Slackware 15 and updated the kernel; this box had sat disused for a while, and 15 came out in the interim.
This is the error I was getting with the incompatible FW on the IBM expander (the blk_update_request / 'Buffer I/O error' messages discussed below):
Note that I was receiving similar errors from multiple Seagate drives, and none of them had any errors logged in SMART.
I proceeded to dd my drives and then repair the array (via the md 'repair' sync action) through several cycles over days. I eventually began to see similar recurrent errors, now coming from only one of the Seagate drives. Again, though, no errors were logged in SMART. A 'smartctl -t long' self-test on the drive completed without error. On several repeats of the dd test on the drive, I noted that the sectors failing the read were not consistent.
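For reference, the SMART checks above look like this (/dev/sdi stands in for the suspect drive; run as root):

```shell
# Start an extended (long) self-test; it runs inside the drive firmware
# and does not need the array online:
smartctl -t long /dev/sdi

# Hours later (smartctl prints an estimated completion time when the
# test starts), read the self-test log and the drive's own error log:
smartctl -l selftest /dev/sdi
smartctl -l error /dev/sdi
```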
Here is the (abridged) smartctl output of the drive in question:
Code:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.149] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate IronWolf
Device Model: ST4000VN008-2DR166
Serial Number: ZGY8WY6K
LU WWN Device Id: 5 000c50 0c8bcc814
Firmware Version: SC60
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5980 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Mar 2 11:03:23 2024 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   064   044    Pre-fail  Always       -       78372864
  3 Spin_Up_Time            0x0003   096   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       41
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail  Always       -       1197074970
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5685 (148 154 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       30
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   073   068   040    Old_age   Always       -       27 (Min/Max 24/27)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       22
193 Load_Cycle_Count        0x0032   084   084   000    Old_age   Always       -       33686
194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always       -       27 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       4609 (67 78 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       14865683719
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       45013795835
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       5675        -
# 2  Extended offline    Completed without error       00%       3703        -
# 3  Extended offline    Completed without error       00%        626        -
# 4  Extended offline    Completed without error       00%        105        -
Troubleshooting:
I strongly suspect the drive itself is fine, as SMART doesn't record any of these read faults. I suspect an ongoing incompatibility with the IBM expander, or a bad cable. Of note, a single cable connected all of the Seagate drives through all of this testing, and is thus common to every fault. I have now replaced that cable and am testing with the Seagate drives plugged into the expander card. If I get errors, I will try again with the Seagate drives on the LSI card directly.
Questions:
What does this error actually mean? What are 'flags' in the error, and why does it have sectors which don't appear to match the LBA address given?
What device is actually generating the error? The drive, the expander, the HBA, or the PC/kernel?
Could this error be caused by a bad cable, bad expander/HBA, Firmware incompatibility, bad disk?
Does the fact that the errors don't correlate with any recorded fault in the SMART data make a bad disk unlikely?
When going through RAID there will be multiple translations of the block offset: from the RAID to the underlying block device, to its underlying device, and so on. The number shown in the error depends on which layer is reporting; in this case it seems to be the sd disk sector.
SMART will only record errors detected by the disk itself, typically a failure to read data with good ECC.
If the disk didn't detect a read error, then it could well be a bad cable or card.
What does this error actually mean? What are 'flags' in the error, and why does it have sectors which don't appear to match the LBA address given?
I found that the 'blk_update_request' error is generated by code in .../linux/block/blk-core.c. The flags don't look pertinent to the error; they are just the stored flags from the read request. 'I/O error' seems to be a catch-all error code. Not very useful. The Linux block layer always counts sectors in 512-byte units internally, while the 'logical block' in the buffer-layer message is in block-size units (4096 bytes here), which is why the two numbers don't appear to match.
The other error, 'Buffer I/O error on dev sdi, logical block 1874946, async page read', is I think generated in .../linux/fs/buffer.c. The code isn't very helpful in identifying where it was called from.
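The mismatch between the two numbers is just a unit difference, and easy to check. A sketch using the logical-block number from the buffer error above (the resulting sector figure is my arithmetic, not from the original log):

```shell
# blk_update_request reports 'sector' in the block layer's fixed
# 512-byte units; 'Buffer I/O error ... logical block N' is in the
# device's block size, 4096 bytes here. Converting between the two:
logical_block=1874946
block_size=4096
sector=$((logical_block * block_size / 512))
echo "logical block $logical_block = sector $sector"

# On a real device, blockdev reports the sizes, e.g.:
#   blockdev --getss --getpbsz /dev/sdi   # logical / physical sector size
```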
It looks like 'blk_update_request' is called to pass the error up, I'm guessing from the SCSI code. I checked my 'scsi_logging_level', and naturally all the logging was turned off. Consequently, I found a few things about enabling logging on the SCSI device: the scsi_logging_level script sets the SCSI logging level stored in /proc/sys/dev/scsi/logging_level. Separately, for the mpt3sas module, logging is controlled through /sys/module/mpt3sas/parameters/logging_level.
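Concretely, the two knobs live here (the per-facility bitfield layout is my understanding of it; the scsi_logging_level script from sg3_utils encodes it for you, so prefer that over hand-computed values):

```shell
# SCSI midlayer logging: one packed integer, a 3-bit level (0-7) per
# facility (error, timeout, scan, queueing, ...). 0 = everything off.
cat /proc/sys/dev/scsi/logging_level

# The sg3_utils helper script reads and sets it symbolically:
#   scsi_logging_level -g     # get current levels
# (see its --help for the per-facility set options)

# mpt3sas driver logging is separate, a hex debug mask:
cat /sys/module/mpt3sas/parameters/logging_level
```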
As a complete aside, the finger movements to type 'rm' and 'vi' are extremely similar! Oops...
/dev/hd? means the device is using the ATA block system.
/dev/sd? means the device is using the SCSI block system.
(I believe SATA typically uses the SCSI block system).
If you're running into these intermittent errors, logging for the SCSI block system is likely turned off by default. I tried turning it on as above, using the scsi_logging_level tool, but didn't get any output. I found instructions on an IBM Linux website, which mentioned that logging could help diagnose 'issues with LUN discovery and SCSI error handling (recovery), such as those caused by dirty fibre optics'. Turning on logging at the device-driver level (mpt3sas in my case) did produce output. This output is large, so turn it on, reproduce the error, then turn it back off.
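The on/reproduce/off cycle, sketched with mpt3sas (0x3f8 is a debug mask I have seen suggested for mpt3sas, not something from this thread -- check the driver source for the exact bits, and substitute your own device for /dev/sdi):

```shell
# Turn on verbose mpt3sas logging:
echo 0x3f8 > /sys/module/mpt3sas/parameters/logging_level

# Reproduce the fault while it is on:
dd if=/dev/sdi of=/dev/null bs=1M

# Grab the (large) output, then turn logging back off:
dmesg | tail -n 200 > /tmp/mpt3sas-fault.log
echo 0 > /sys/module/mpt3sas/parameters/logging_level
```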
If you use an expander (or external drive enclosure), the device is hidden behind the HBA. I found some tools you can use to interrogate it: check out smp_utils, https://sg.danny.cz/sg/smp_utils.html. You'll need the kernel module sg, the generic SCSI driver (CONFIG_CHR_DEV_SG), which is a module in most of the Slackware kernels.
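For example, with sg/bsg available the expander shows up under /sys/class/bsg as expander-H:N, and smp_utils can query it over SMP (the device name below is an example -- substitute your own):

```shell
# Find the expander's bsg node:
ls /sys/class/bsg/ | grep expander

# Report vendor / product / firmware revision of the expander:
smp_rep_manufacturer /dev/bsg/expander-6:0

# Walk each expander phy and show what is attached to it:
smp_discover /dev/bsg/expander-6:0
```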
I found this video vital in explaining how to update the firmware on the IBM 46M0997. It uses sg3_utils, which Slackware provides in its sg3_utils package.
The smartctl command let me check the drive's internal health data: 'smartctl -a'. If the drive had an actual read error, it should be logged in this block here:
Code:
SMART Error Log Version: 1
No Errors Logged
I found other threads (which I cannot find again now) in which a user had an actual failing disk throwing errors similar to mine; in his case, the drive did log SMART errors here. Check out this excellent article.
Swapping the cable and drive over to the LSI controller also appeared to fix the problem. I suspect the LSI card does more retries or something, hiding the bad cable better. This made diagnosis more difficult!
/dev/hd? means the device is using the ATA block system.
As of the 2.6.19 kernel, all storage devices are part of the SCSI subsystem and use a /dev/sdX device ID. Unless you were running a really old operating system on a really old computer with an IDE/PATA controller, you should not have a /dev/hdX device.