System disk problem (Solved/Closed)

Question

Hello,
dmesg and smartctl both return error messages, about inodes or blocks in one case and about LBAs in the other. Granted, the disk is old, but there don't seem to be that many errors, and I don't understand why SMART allowed the system to write to that location (especially since my gut feeling is that a MySQL database lives there, given that it only happens when I read from or write to the database ;-) ...).
Is the disk dead? Is there anything I can still do? If I move my OS over (it's Gentoo: you never reinstall, you install once and then maintain), will that be OK?
There is a financial side to this, tied to the hardware: it's a 2 TB disk, and storage is still expensive per TB ...
Why can the OS still write to the damaged areas? How do I get SMART and the OS to agree?
I wasn't sure whether to post in "Hard drives" or Linux ... I hope I posted in the right place. I also have a Windows install on this machine. Everything is backed up apart from a few recent changes ...
The logs:
dmesg:
[  305.120027] EXT4-fs (sda7): error count since last fsck: 38
[  305.120042] EXT4-fs (sda7): initial error at time 1481569390: __ext4_get_inode_loc:4072: inode 27003922: block 108003553
[  305.120048] EXT4-fs (sda7): last error at time 1481748778: ext4_find_entry:1450: inode 31326828
[  665.722076] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[  665.722086] ata1.00: failed command: SMART
[  665.722093] ata1.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 29 pio 512 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  665.722096] ata1.00: status: { DRDY }
[  665.722105] ata1: hard resetting link
[  665.722108] ata1: nv: skipping hardreset on occupied port
[  667.247072] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[  667.254628] ata1.00: configured for UDMA/133
[  667.254661] ata1: EH complete
[  674.714070] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[  674.714078] ata1.00: failed command: SMART
[  674.714085] ata1.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 17
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  674.714088] ata1.00: status: { DRDY }
[  674.714096] ata1: hard resetting link
[  674.714098] ata1: nv: skipping hardreset on occupied port
[  676.086064] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[  676.095606] ata1.00: configured for UDMA/133
[  676.095647] ata1: EH complete
On top of that, judging by the dates, this isn't recent:
user@host ~ $ date -d@1481748778
mer. déc. 14 21:52:58 CET 2016
user@host ~ $ date -d@1481569390
lun. déc. 12 20:03:10 CET 2016
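For readers without GNU date at hand, the same conversion can be sketched in Python. This is a minimal sketch, assuming a fixed UTC+1 offset for CET (valid in December, when no daylight saving applies); the helper name epoch_to_cet is made up for illustration. It also compares the last EXT4 error with the "Local Time" line printed by smartctl further down, which puts the last error roughly 18 hours before that report was taken.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC+1 offset for CET (no daylight saving in December).
CET = timezone(timedelta(hours=1), "CET")


def epoch_to_cet(ts: int) -> datetime:
    """Convert a Unix timestamp (seconds) to an aware datetime in CET."""
    return datetime.fromtimestamp(ts, tz=CET)


# Timestamps taken from the EXT4-fs lines in the dmesg output.
first_error = epoch_to_cet(1481569390)
last_error = epoch_to_cet(1481748778)

# "Local Time is" value taken from the smartctl report.
smartctl_time = datetime(2016, 12, 15, 16, 1, 9, tzinfo=CET)

print("first error:", first_error.isoformat())
print("last error: ", last_error.isoformat())
print("time between last error and smartctl run:", smartctl_time - last_error)
```

The two converted dates match the output of date -d@... shown above.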
sudo smartctl -a /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.21-gentoo] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-1CH164
Serial Number:    W1E3YMWV
LU WWN Device Id: 5 000c50 06056c896
Firmware Version: CC24
User Capacity:    2 000 398 934 016 bytes [2,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Thu Dec 15 16:01:09 2016 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 220) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   099   091   006    Pre-fail  Always       -       62023979
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       579
  5 Reallocated_Sector_Ct   0x0033   091   091   010    Pre-fail  Always       -       12088
  7 Seek_Error_Rate         0x000f   068   060   030    Pre-fail  Always       -       90354076595
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       9885
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       165
183 Runtime_Bad_Block       0x0032   096   096   000    Old_age   Always       -       4
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       255
188 Command_Timeout         0x0032   098   088   000    Old_age   Always       -       94 95 95
189 High_Fly_Writes         0x003a   092   092   000    Old_age   Always       -       8
190 Airflow_Temperature_Cel 0x0022   064   053   045    Old_age   Always       -       36 (Min/Max 35/36)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       520
193 Load_Cycle_Count        0x0032   099   099   000    Old_age   Always       -       3401
194 Temperature_Celsius     0x0022   036   047   000    Old_age   Always       -       36 (0 22 0 0 0)
197 Current_Pending_Sector  0x0012   073   069   000    Old_age   Always       -       4464
198 Offline_Uncorrectable   0x0010   073   069   000    Old_age   Offline      -       4464
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       9520h+02m+07.156s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       17818987920
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       7161138812

SMART Error Log Version: 1
ATA Error Count: 256 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 256 occurred at disk power-on lifetime: 9872 hours (411 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00      23:39:36.914  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00      23:39:36.914  WRITE FPDMA QUEUED
  ea 08 88 ff ff ff af 00      23:39:36.878  FLUSH CACHE EXT
  61 00 08 ff ff ff 4f 00      23:39:36.878  WRITE FPDMA QUEUED
  61 00 78 ff ff ff 4f 00      23:39:36.877  WRITE FPDMA QUEUED

Error 255 occurred at disk power-on lifetime: 9872 hours (411 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00      23:39:33.622  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00      23:39:33.621  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      23:39:33.618  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      23:39:33.616  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      23:39:33.615  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

Error 254 occurred at disk power-on lifetime: 9872 hours (411 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00      23:39:30.363  READ FPDMA QUEUED
  ea 08 a8 ff ff ff af 00      23:39:27.980  FLUSH CACHE EXT
  61 00 08 ff ff ff 4f 00      23:39:27.980  WRITE FPDMA QUEUED
  ea 50 98 ff ff ff af 00      23:39:27.956  FLUSH CACHE EXT
  61 00 50 ff ff ff 4f 00      23:39:27.955  WRITE FPDMA QUEUED

Error 253 occurred at disk power-on lifetime: 9872 hours (411 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00      23:38:51.455  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00      23:38:51.454  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      23:38:51.451  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      23:38:51.448  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      23:38:51.448  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

Error 252 occurred at disk power-on lifetime: 9872 hours (411 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00      23:38:48.177  READ FPDMA QUEUED
  ea 08 68 ff ff ff af 00      23:38:41.881  FLUSH CACHE EXT
  61 00 08 ff ff ff 4f 00      23:38:41.881  WRITE FPDMA QUEUED
  ea 68 58 ff ff ff af 00      23:38:41.866  FLUSH CACHE EXT
  61 00 68 ff ff ff 4f 00      23:38:41.866  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      9869         2870182152
# 2  Extended captive    Completed: read failure       90%      9614         94431288
# 3  Short offline       Completed: read failure       90%      9614         94431288
# 4  Short offline       Completed: read failure       90%      9614         94431288

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
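Since the overall result says PASSED while the attribute table looks worrying, a small Python sketch may help in reading the report. It parses a few attribute rows copied from the smartctl output above and flags those whose raw counter is non-zero. The set of attribute IDs to watch (5, 187, 197, 198, 199) is an assumption based on commonly cited indicators, not a rule from smartmontools, and the function names are made up for illustration. It also reports whether each normalized VALUE is still above its THRESH, which is one plausible reading of why the drive still reports PASSED.

```python
# Attribute rows copied from the smartctl report in this post.
REPORT = """\
  5 Reallocated_Sector_Ct   0x0033   091   091   010    Pre-fail  Always       -       12088
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       255
197 Current_Pending_Sector  0x0012   073   069   000    Old_age   Always       -       4464
198 Offline_Uncorrectable   0x0010   073   069   000    Old_age   Offline      -       4464
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

# Assumption: IDs often treated as warning signs when their raw count is non-zero.
WATCHED = {5, 187, 197, 198, 199}


def parse_rows(text):
    """Map attribute ID -> (name, normalized value, threshold, raw field)."""
    rows = {}
    for line in text.splitlines():
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            rows[int(fields[0])] = (fields[1], int(fields[3]), int(fields[5]), fields[9])
    return rows


def suspicious(rows):
    """Return (id, name, raw count, value_above_threshold) for non-zero watched counters."""
    flagged = []
    for attr_id in sorted(WATCHED.intersection(rows)):
        name, value, thresh, raw = rows[attr_id]
        raw_count = int(raw.split()[0])
        if raw_count > 0:
            flagged.append((attr_id, name, raw_count, value > thresh))
    return flagged


flagged = suspicious(parse_rows(REPORT))
for attr_id, name, raw_count, above in flagged:
    print(f"{attr_id:3d} {name:<24} raw={raw_count:<6} value above threshold: {above}")
```

On this report, attributes 5, 187, 197 and 198 are flagged, yet every normalized value is still above its threshold; attribute 199 stays at zero.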
SMART is neither something you deal with every day nor much fun, since you only poke your nose into it when things smell bad. If anyone can help, I'd be grateful. :-)
Configuration: a few small Gentoo Linux systems, a bit of Android, one Windows 10 and one Windows 7 ...
 
- Make me a sandwich. 
- What? Make it yourself.
- Sudo make me a sandwich.
- Okay

brupala · Answer

Hi,
if your disk starts acting up, you replace it, that's all;
you don't make your life miserable with a disk that is about to die, unless you're a masochist.
A new disk doesn't cost that much these days.

ryko1820 · Answer

Yo, I'm not about to buy another one at that price, that's why. Luckily this disk is almost empty. Everything should easily fit on an old 1 TB drive I had left over after a laptop died; it has never had a problem, and I can't risk losing the work I've done on it. OK, now it's getting urgent:
WARNING: CPU: 0 PID: 956 at drivers/ata/libata-eh.c:3993 ata_eh_finish+0x146/0x150()
[ 5476.099621] Modules linked in: fuse powernow_k8 snd_hda_codec_idt acpi_cpufreq
[ 5476.099630] CPU: 0 PID: 956 Comm: scsi_eh_0 Not tainted 4.4.21-gentoo #5
[ 5476.099632] Hardware name: Dell Inc. OptiPlex 740 Enhanced/0YP806, BIOS 2.1.6 05/04/2008
[ 5476.099635] 0000000000000000 ffff880235013c48 ffffffff8157fa84 0000000000000000
[ 5476.099639] ffffffff8233977f ffff880235013c80 ffffffff8107a7a1 ffff8802357e1e58
[ 5476.099643] ffff8802357e1e58 ffff8802357e0000 ffff8802357e1d70 0000000000000206
[ 5476.099647] Call Trace:
[ 5476.099654] [] dump_stack+0x93/0xcf
[ 5476.099658] [] warn_slowpath_common+0xc1/0x120
[ 5476.099662] [] warn_slowpath_null+0x23/0x30
[ 5476.099665] [] ata_eh_finish+0x146/0x150
[ 5476.099668] [] ? ata_sff_dev_classify+0x170/0x170
[ 5476.099671] [] ata_do_eh+0x70/0xf0
[ 5476.099674] [] ? ata_bmdma_post_internal_cmd+0xa0/0xa0
[ 5476.099677] [] ? ata_sff_dev_classify+0x170/0x170
[ 5476.099680] [] ? nv_scr_read+0x50/0x50
[ 5476.099683] [] ? sata_sff_hardreset+0xa0/0xa0
[ 5476.099686] [] ? nv_scr_read+0x50/0x50
[ 5476.099689] [] ? ata_sff_dev_classify+0x170/0x170
[ 5476.099692] [] ata_sff_error_handler+0xfe/0x150
[ 5476.099695] [] ata_bmdma_error_handler+0x123/0x1e0
[ 5476.099698] [] nv_swncq_error_handler+0x2f/0x2d0
[ 5476.099703] [] ? try_to_del_timer_sync+0x66/0x80
[ 5476.099706] [] ata_scsi_port_error_handler+0x4e8/0x9e0
[ 5476.099709] [] ata_scsi_error+0xb6/0x100
[ 5476.099712] [] scsi_error_handler+0xd3/0x780
[ 5476.099717] [] ? __schedule+0x3cd/0xb00
[ 5476.099720] [] ? scsi_eh_get_sense+0x1f0/0x1f0
[ 5476.099723] [] ? scsi_eh_get_sense+0x1f0/0x1f0
[ 5476.099726] [] kthread+0xfc/0x130
[ 5476.099729] [] ? flush_kthread_worker+0x80/0x80
[ 5476.099733] [] ret_from_fork+0x3f/0x70
[ 5476.099736] [] ? flush_kthread_worker+0x80/0x80
[ 5476.099738] ---[ end trace d80659087831bb7d ]---
[ 5476.099743] ata1: EH complete
;-)

ryko1820 · Answer

This problem was probably not solely (or not at all) a disk problem, and it continued in the thread "WD Blue | hard resetting link (Résolu)".
Apparently the cause was an NVidia MCP51 SATA controller combined with an outdated BIOS, which requires special initialization and handling.
