Intermittent boot hang at Job dev-disk-by\x2d...device/start running

Hello. I feel I’m at the end of my limited ability to troubleshoot this and would greatly appreciate any help.

*By intermittent I mean that when the boot hangs, I may have to soft reboot once, or more than half a dozen times, before it reaches the login screen.

Here’s some output that may shed light:

$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.19.9-200.fc43.x86_64 root=UUID=52a06f04-e0f0-4aee-9cb4-02c931b7cc42 ro rootflags=subvol=root rd.driver.blacklist=nouveau,nova_core modprobe.blacklist=nouveau,nova_core
$ cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Sun Nov 30 10:52:24 2025
#
# Accessible filesystems, by reference, are maintained under '/dev/disk/'.
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info.
#
# After editing this file, run 'systemctl daemon-reload' to update systemd
# units generated from this file.
#
UUID=52a06f04-e0f0-4aee-9cb4-02c931b7cc42 / btrfs subvol=root,compress=zstd:1 0 0
UUID=6c02f03b-b271-47e8-8622-5581c26bb0b9 /boot ext4 defaults 1 2
UUID=131D-3C8D /boot/efi vfat umask=0077,shortname=winnt 0 2
UUID=52a06f04-e0f0-4aee-9cb4-02c931b7cc42 /home btrfs subvol=home,compress=zstd:1 0 0
$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda           8:0    0 465.8G  0 disk
├─sda1        8:1    0   600M  0 part /boot/efi
├─sda2        8:2    0     2G  0 part /boot
└─sda3        8:3    0 463.2G  0 part
sdb           8:16   0 119.2G  0 disk
└─sdb1        8:17   0 119.2G  0 part
zram0       251:0    0     8G  0 disk [SWAP]
nvme0n1     259:0    0 931.5G  0 disk
└─nvme0n1p1 259:1    0 931.5G  0 part
nvme1n1     259:2    0 931.5G  0 disk
└─nvme1n1p1 259:3    0 931.5G  0 part /home
                                      /
$ sudo smartctl -x /dev/nvme1n1
smartctl 7.5 2025-04-30 r5714 [x86_64-linux-6.19.9-200.fc43.x86_64] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       CT1000P3SSD8
Serial Number:                      24404B60EDAF
Firmware Version:                   P9CR313
PCI Vendor/Subsystem ID:            0x1344
IEEE OUI Identifier:                0x00a075
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00a075 014b60edaf
Local Time is:                      Fri Mar 27 18:25:03 2026 AEDT
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00d7):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     83 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x08):        No_ID_Reuse

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.50W       -        -    0  0  0  0        0       0
 1 +     3.00W       -        -    1  1  1  1        0       0
 2 +     1.50W       -        -    2  2  2  2        0       0
 3 -   0.0300W       -        -    3  3  3  3     5000    2500
 4 -   0.0025W       -        -    4  4  4  4     8000   40000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    4,839,635 [2.47 TB]
Data Units Written:                 3,781,466 [1.93 TB]
Host Read Commands:                 18,903,059
Host Write Commands:                31,294,898
Controller Busy Time:               105
Power Cycles:                       694
Power On Hours:                     1,666
Unsafe Shutdowns:                   99
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               39 Celsius

Error Information (NVMe Log 0x01, 16 of 255 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06, NSID 0xffffffff)
Self-test status: No self-test in progress
No Self-tests Logged

Thanks!

Maybe the connectors on the NVMe drive are making intermittent contact?
Try unplugging and reseating the cables etc.

Devices sda3, nvme0n1p1 and nvme1n1p1 show the same transid 57621 in your pic. They belong to the same btrfs. There is another device sdb1 in your lsblk output, which is not shown in your pic. Is that device also part of the same btrfs?

Thanks for the suggestion. I’ve disconnected all drive cables, given them and the drive connections a good clean with a soft brush and some air and reseated everything. The first boot was a success but two subsequent ones displayed the problematic behaviour.

Yes, all four drives are in the same btrfs:

$ sudo btrfs filesystem show | awk '/ path /{print $NF}'
/dev/nvme1n1p1
/dev/sda3
/dev/sdb1
/dev/nvme0n1p1

Is sdb1 always missing when the boot hangs? Could you run smartctl on sdb?
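
One way to check this, as a rough sketch: save the list of member devices on a good boot, then compare it with the block devices that are actually visible when the boot goes wrong (for example from an emergency shell, if one appears, or from a live USB). The function name `missing_members` and the file names are hypothetical. The background, as far as I understand it, is that systemd's udev rules do not mark a multi-device btrfs as ready until every member device has appeared, so a single late or undetected drive may be enough to stall the device job.

```shell
# Hypothetical helper: print every expected member device that is not present.
#   $1 = file listing expected device paths, one per line
#   $2 = file listing device paths that are currently visible
missing_members() {
    # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
    grep -vxFf "$2" "$1" || true
}

# On a good boot, record the members (same pipeline as earlier in this thread):
#   sudo btrfs filesystem show | awk '/ path /{print $NF}' > expected.txt
# When investigating a bad boot, record what the kernel can see and compare:
#   lsblk -lnpo NAME > present.txt
#   missing_members expected.txt present.txt
```

If the comparison prints `/dev/sdb1` on the bad boots and nothing on the good ones, that would point at that drive rather than at the NVMe devices.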

As for whether sdb1 is always missing, I can’t say I’ve noticed. I’ll run some reboots tomorrow and let you know what I find.

$ sudo smartctl -x /dev/sdb1
smartctl 7.5 2025-04-30 r5714 [x86_64-linux-6.19.9-200.fc43.x86_64] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     JMicron/Maxiotek based SSDs
Device Model:     KINGSTON SV100S2128G
Serial Number:    128B50023692
Firmware Version: 120504
User Capacity:    128,035,676,160 bytes [128 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        In smartctl database 7.5/5852
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Tue Mar 31 20:32:53 2026 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 163) Reserved.
Total time to complete Offline 
data collection:                (   30) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   1) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   050    -    0
  2 Throughput_Performance  P-S---   100   100   050    -    0
  3 Unknown_JMF_Attribute   POS---   100   100   050    -    0
  5 Reallocated_Sector_Ct   PO--C-   100   100   050    -    0
  7 Unknown_JMF_Attribute   PO-R--   100   100   050    -    0
  8 Unknown_JMF_Attribute   P-S---   100   100   050    -    0
  9 Power_On_Hours          -O--C-   100   100   000    -    12628
 10 Unknown_JMF_Attribute   PO--C-   100   100   050    -    0
 12 Power_Cycle_Count       -O--C-   100   100   000    -    5822
168 SATA_Phy_Error_Count    -O--C-   100   100   000    -    0
175 Bad_Cluster_Table_Count PO----   100   100   010    -    5
192 Unexpect_Power_Loss_Ct  -O--C-   100   100   000    -    0
194 Temperature_Celsius     -O---K   041   100   020    -    41 (Min/Max 33/41)
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
240 Unknown_JMF_Attribute   PO--C-   100   100   050    -    0
170 Bad_Block_Count         PO----   100   100   010    -    5 162 12
173 Erase_Count             -O--C-   100   100   000    -    4089 6878 4558
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O     51  Comprehensive SMART error log
0x03       GPL     R/O     64  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log

SMART Extended Comprehensive Error Log Version: 1 (64 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported

SCT Commands not supported

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET

According to Google AI, that device family is known for failing to be detected during boot, and you are probably already running the latest firmware, which dates from 2013. I guess you could remove it from the btrfs so that your system boots reliably?

I use a Kingston SSD in an external case. It is often not detected at boot but can be mounted after booting.

Kingston has a Windows app: https://www.kingston.com/en/support/technical/ssdmanager and https://www.kingston.com/en/support/technical/ksm-firmware-update. The second link says to contact technical support if you don’t have Windows.

It looks like sdb1 was always missing on the failed boots. I ended up removing it from the btrfs and haven’t had any issues since. I doubt I’ll bother setting it up to mount after boot, as I likely won’t need the extra space, but it’s good to know it’s an option.
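
For anyone finding this later, removing a member from a mounted multi-device btrfs looks roughly like the following. This is a sketch under assumptions, not a record of the exact commands run here: the wrapper `remove_member` and its `RUN` switch are hypothetical, while `btrfs device remove` and `btrfs filesystem show` are the real subcommands. The removal relocates the device's data onto the remaining members first, so they need enough free space and it can take a while.

```shell
# Hypothetical dry-run wrapper: prints the commands unless RUN=1 is set.
#   $1 = member device to remove, $2 = mount point of the filesystem (default /)
remove_member() {
    dev=$1
    mnt=${2:-/}
    if [ "${RUN:-0}" = 1 ]; then
        # Relocate the data to the other members, then drop the device;
        # afterwards list the remaining members to confirm.
        sudo btrfs device remove "$dev" "$mnt" &&
            sudo btrfs filesystem show "$mnt"
    else
        echo "sudo btrfs device remove $dev $mnt"
        echo "sudo btrfs filesystem show $mnt"
    fi
}

remove_member /dev/sdb1 /   # dry run: only prints the two commands
```

Running it with `RUN=1 remove_member /dev/sdb1 /` would execute the commands for real, so it is worth reading the dry-run output first and having a backup.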

Thanks for the help :)