This morning my home server sent me an error.
Aug 2 06:25:37 www smartd[1061]: Device: /dev/nvme2, number of Error Log entries increased from 0 to 18446744073709551615
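An unnaturally large number is usually the maximum value of some variable type, and this one is the maximum of unsigned long long (2^64 - 1). That said, with the heat we have been having, it would not be surprising if an SSD had actually failed, so for now I ran smartctl against the drive.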
# smartctl -a /dev/nvme2n1
smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-477.15.1.el8_8.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: XPG GAMMIX S70 BLADE
Serial Number: 2L252L16E9WL
Firmware Version: 3.2.F.P7
PCI Vendor ID: 0x1dbe
PCI Vendor Subsystem ID: 0x5236
IEEE OUI Identifier: 0x00abcd
Controller ID: 0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 494e4e 4f47524954
Local Time is: Wed Aug 2 08:22:44 2023 JST
Firmware Updates (0x0e): 7 Slots
Optional Admin Commands (0x0007): Security Format Frmw_DL
Optional NVM Commands (0x0014): DS_Mngmt Sav/Sel_Feat
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 110 Celsius
Critical Comp. Temp. Threshold: 120 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 3.50W - - 0 0 0 0 5 5
1 + 3.30W - - 1 1 1 1 50 100
2 + 2.80W - - 2 2 2 2 50 200
3 - 0.1000W - - 3 3 3 3 500 5000
4 - 0.0080W - - 4 4 4 4 2000 60000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 - 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 49 Celsius
Available Spare: 100%
Available Spare Threshold: 25%
Percentage Used: 5%
Data Units Read: 12,262,245 [6.27 TB]
Data Units Written: 34,459,157 [17.6 TB]
Host Read Commands: 58,055,554
Host Write Commands: 376,086,791
Controller Busy Time: 0
Power Cycles: 45
Power On Hours: 14,831
Unsafe Shutdowns: 8
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 49 Celsius
Temperature Sensor 2: 30 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
Nothing looks particularly wrong.
So what is producing this error? I took a quick look.
# yumdownloader --source smartmontools
# rpm -ivh smartmontools-*.src.rpm
# cd rpmbuild/SOURCES/
# tar zxvf smartmontools-7.1.tar.gz
# cd smartmontools-7.1
# grep -r 'number of Error Log entries increased' *
smartd.cpp: PrintOut(LOG_CRIT, "Device: %s, number of Error Log entries increased from %" PRIu64 " to %" PRIu64 "\n",
smartd.cpp: MailWarning(cfg, state, 4, "Device: %s, number of Error Log entries increased from %" PRIu64 " to %" PRIu64,
The message is printed from smartd.cpp.
3761   // Check if number of errors has increased
3762   if (cfg.errorlog || cfg.xerrorlog) {
3763     uint64_t oldcnt = state.nvme_err_log_entries;
3764     uint64_t newcnt = le128_to_uint64(smart_log.num_err_log_entries);
3765     if (newcnt > oldcnt) {
3766       PrintOut(LOG_CRIT, "Device: %s, number of Error Log entries increased from %" PRIu64 " to %" PRIu64 "\n",
3767                name, oldcnt, newcnt);
3768       MailWarning(cfg, state, 4, "Device: %s, number of Error Log entries increased from %" PRIu64 " to %" PRIu64,
3769                   name, oldcnt, newcnt);
3770       state.must_write = true;
3771     }
3772     state.nvme_err_log_entries = newcnt;
3773   }
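To see why a single bad reading is enough to trigger the message, here is a minimal sketch of the same compare-then-store logic. `ErrCountState` and `check_err_count()` are hypothetical names made up for this illustration and are not part of smartmontools; only the comparison and the final assignment follow the excerpt above.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for the one field of smartd's per-device state used here.
struct ErrCountState {
  uint64_t nvme_err_log_entries = 0;
};

// Mirrors the quoted check: report only when the count grows,
// then always remember the new value, whatever it is.
static bool check_err_count(ErrCountState & state, uint64_t newcnt)
{
  const uint64_t oldcnt = state.nvme_err_log_entries;
  const bool increased = (newcnt > oldcnt);
  if (increased)
    std::printf("number of Error Log entries increased from %" PRIu64 " to %" PRIu64 "\n",
                oldcnt, newcnt);
  state.nvme_err_log_entries = newcnt;
  return increased;
}
```

Starting from a stored count of 0, calling `check_err_count(state, UINT64_MAX)` prints the same "from 0 to 18446744073709551615" text as the log. If the next reading is 0 again, the comparison is false, nothing is printed, and the stored count goes back to 0. Under this logic one bogus reading would produce exactly one log line, which would be consistent with smartctl reporting 0 entries later.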
le128_to_uint64() on line 3764 simply returns the maximum uint64_t value if any of the upper 8 bytes (the high 64 bits) of the 128-bit value is nonzero.
// Convert 128 bit LE integer to uint64_t or its max value on overflow.
static uint64_t le128_to_uint64(const unsigned char (& val)[16])
{
  for (int i = 8; i < 16; i++) {
    if (val[i])
      return ~(uint64_t)0;
  }
  uint64_t lo = val[7];
  for (int i = 7-1; i >= 0; i--) {
    lo <<= 8; lo += val[i];
  }
  return lo;
}
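As a quick check of that reading, the sketch below copies le128_to_uint64() from the excerpt and feeds it a 16-byte counter that is all zero except for one byte. The helper `decode_with_one_byte_set()` is a hypothetical name added only for this illustration.

```cpp
#include <cstdint>

// Copied from the smartd.cpp excerpt above (smartmontools 7.1).
static uint64_t le128_to_uint64(const unsigned char (& val)[16])
{
  for (int i = 8; i < 16; i++) {
    if (val[i])
      return ~(uint64_t)0;
  }
  uint64_t lo = val[7];
  for (int i = 7-1; i >= 0; i--) {
    lo <<= 8; lo += val[i];
  }
  return lo;
}

// Hypothetical helper: decode a counter whose only nonzero byte is raw[index].
static uint64_t decode_with_one_byte_set(int index, unsigned char value)
{
  unsigned char raw[16] = {};
  raw[index] = value;
  return le128_to_uint64(raw);
}
```

With this, `decode_with_one_byte_set(0, 0x12)` gives 18 and `decode_with_one_byte_set(1, 0x01)` gives 256, the ordinary little-endian values. `decode_with_one_byte_set(8, 0x01)` gives 18446744073709551615, even though the 128-bit value in that case is just 2^64. Any nonzero byte at index 8 through 15 collapses to the number seen in the log.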
The argument to le128_to_uint64(), smart_log.num_err_log_entries, is a field of the structure that was read on line 3716:
3714   // Read SMART/Health log
3715   nvme_smart_log smart_log;
3716   if (!nvme_read_smart_log(nvmedev, smart_log)) {
3717     PrintOut(LOG_INFO, "Device: %s, failed to read NVMe SMART/Health Information\n", name);
3718     MailWarning(cfg, state, 6, "Device: %s, failed to read NVMe SMART/Health Information", name);
3719     state.must_write = true;
3720     return 0;
3721   }
The actual read is done by nvme_read_smart_log() in smartmontools-7.1/nvmecmds.cpp:
// Read NVMe SMART/Health Information log.
bool nvme_read_smart_log(nvme_device * device, nvme_smart_log & smart_log)
{
  if (!nvme_read_log_page(device, 0x02, &smart_log, sizeof(smart_log), true))
    return false;

  if (isbigendian()) {
    swapx(&smart_log.warning_temp_time);
    swapx(&smart_log.critical_comp_time);
    for (int i = 0; i < 8; i++)
      swapx(&smart_log.temp_sensor[i]);
  }

  return true;
}
Hmm, maybe a bit got flipped somewhere...
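How plausible is a flipped bit as an explanation? The rough sketch below flips each of the 128 bits of an all-zero counter in turn and counts how many of those flips decode to the saturated value. `decode_le128_saturating()` is my own compact restatement of the rule in le128_to_uint64(), and `count_saturating_single_bit_flips()` is a hypothetical experiment; neither is smartmontools code. It only shows what the decoding would do with such input, not that a bit flip actually happened.

```cpp
#include <cstdint>

// Same rule as le128_to_uint64() in the excerpt, restated compactly:
// any nonzero byte in val[8..15] saturates to the uint64_t maximum,
// otherwise val[0..7] is read as a little-endian 64-bit integer.
static uint64_t decode_le128_saturating(const unsigned char (& val)[16])
{
  uint64_t lo = 0;
  for (int i = 15; i >= 0; i--) {
    if (i >= 8) {
      if (val[i])
        return UINT64_MAX;
    } else {
      lo = (lo << 8) | val[i];
    }
  }
  return lo;
}

// Hypothetical experiment: flip one bit at a time in an all-zero counter
// and count the flips that decode to the saturated value.
static int count_saturating_single_bit_flips()
{
  int saturated = 0;
  for (int bit = 0; bit < 128; bit++) {
    unsigned char raw[16] = {};
    raw[bit / 8] = static_cast<unsigned char>(1u << (bit % 8));
    if (decode_le128_saturating(raw) == UINT64_MAX)
      saturated++;
  }
  return saturated;
}
```

The count comes out to 64. Half of all possible single-bit flips land in bytes 8 through 15 and would produce exactly 18446744073709551615, while the other half would show up as a power of two. So the logged number fits a corrupted upper half of the field, but it fits any other nonzero garbage there just as well.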