Current AWS cloud images for F41 (and likely others too) have two critical bugs that, for some reason, only trigger when I build derivative images with HashiCorp Packer. Nevertheless, they look like bugs in the package versions provided by the original Fedora image. Scripts for reproducing are below.
The two bugs
A. I can’t SSH to the machine (with EC2 connect). The reason is that the root directory has 770 permissions, but AuthorizedKeysCommand requires it not to be group-writable. This is easily fixed with `sudo chmod g-w /` in the provisioning script. For reference, `sudo systemctl status sshd` after an SSH attempt shows `sshd-session[1296]: error: Unsafe AuthorizedKeysCommand "/usr/bin/eic_run_authorized_keys": bad ownership or modes for directory /`. When I use the image directly, instead of in Packer, there is no issue with SSH (and nothing in journalctl about AuthorizedKeysCommand), even though / is group-writable. I have no idea what changes between these scenarios, since the problem occurs with Packer even if I have nothing in the provisioning script, and all the package versions are identical (in particular `rpm -qi openssh-server`, `ssh -V`, and all of `sha256sum /usr/bin/*`).
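A minimal sketch for spotting this before an SSH attempt, assuming GNU `stat` is available on the image. The function names are made up for illustration, and it only looks at the mode bits (group- or other-writable, i.e. `mode & 022`), not at ownership, which the sshd error message also mentions.

```shell
#!/bin/sh
# Succeeds if an octal mode (as printed by `stat -c '%a'`) has the group- or
# other-write bit set, i.e. mode & 022 is non-zero.
mode_is_unsafe() {
  [ $(( 0$1 & 022 )) -ne 0 ]
}

# Prints the mode of an absolute path and of each parent directory up to /,
# flagging every entry that is group- or other-writable.
check_path_chain() {
  p="$1"
  while :; do
    m="$(stat -c '%a' "$p")"
    if mode_is_unsafe "$m"; then
      echo "UNSAFE $m $p"
    else
      echo "ok $m $p"
    fi
    case "$p" in
      /|.) break ;;
    esac
    p="$(dirname "$p")"
  done
}
```

On the instance, `check_path_chain /usr/bin/eic_run_authorized_keys` lists the helper and every directory above it; with the image described here I would expect only `/` to be flagged.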
B. There’s a known dhcpcd bug triggered by glibc>=2.40-11 which causes dhcpcd to crash with a core dump. Since it is called by cloud-init, I believe that it fails to set up networking at an earlier stage (leading to `Network is unreachable` in logs, and then `405 Not Allowed` because it failed to fetch auth tokens), and in particular it fails to fetch and run AWS User Data shell scripts (often with critical setup like mounts or keys). The issue can be worked around with `sudo dnf downgrade glibc -y` (which downgrades to 2.40-3) or fixed by recompiling `dhcpcd` with their three-line patch. When I use the image directly, instead of via Packer, I see that dhcpcd still does crash during cloud-init in journalctl, but somehow that doesn’t stop user-data shell scripts from being fetched – again, I have no idea why that works.
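A rough way to tell whether an image falls in the affected range, taking the `2.40-11` threshold from the description above. This is only a sketch: the function name is hypothetical, it relies on GNU `sort -V` rather than RPM's own version comparison, and it says nothing about whether the installed dhcpcd already carries the patch.

```shell
#!/bin/sh
# Threshold taken from the report above: dhcpcd crashes with glibc >= 2.40-11.
AFFECTED_SINCE="2.40-11"

# Succeeds if the given glibc VERSION-RELEASE (e.g. "2.40-12.fc41") is at or
# above the threshold. The dist tag (".fc41") is stripped before comparing.
glibc_is_affected() {
  ver="${1%%.fc*}"
  lowest="$(printf '%s\n%s\n' "$AFFECTED_SINCE" "$ver" | sort -V | head -n 1)"
  [ "$lowest" = "$AFFECTED_SINCE" ]
}

# On the instance:
#   glibc_is_affected "$(rpm -q --qf '%{VERSION}-%{RELEASE}' glibc)" && echo "affected range"
```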
My questions
This took me ages to debug, and I now have many questions:
- Any idea why it fails only with Packer?
- Should / be group-writable, and when did that change? (the group and owner are both root, of course)
- Should I report a bug about that group-writeability issue somewhere?
- Is downgrading glibc from 2.40-12 to 2.40-3 safe?
- The dhcpcd bug fix is in the dhcpcd repo, but unfortunately no dhcpcd release includes it yet (dhcpcd 10.1.0 and 10.0.10 do not have the fix). What’s the best way to get it patched in Fedora? The bug combination has existed since Nov 11, and the patch was committed on Dec 7.
- What’s the best way to publish my own patched version of the dhcpcd package quickly? COPR?
Steps to reproduce
Launch script for using the image (directly):
#!/bin/bash
set -e -u -v -o pipefail
REGION=us-east-1
AMI_IMAGE_ID=ami-0f05a0784a500f32d # latest official F41 image in us-east-1; dhcpcd crashes, root is group-writable, but SSH and user-data works.
SECURITY_GROUP=sg-01234 # a security group with port 22 inbound access from my IP.
KEY_NAME="abc" # an SSH key registered in AWS
echo -e '#!/bin/bash\necho LaunchScript hello | systemd-cat\n' > user-data.sh
aws ec2 run-instances \
  --region "$REGION" \
  --instance-type t2.micro \
  --image-id "$AMI_IMAGE_ID" \
  --key-name "$KEY_NAME" \
  --associate-public-ip-address \
  --security-group-ids "$SECURITY_GROUP" \
  --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=fedora-bug-repro-original}]" \
  --metadata-options HttpTokens=required \
  --user-data file://user-data.sh \
  --output json > instance.json
INSTANCE_ID="$(jq -r '.Instances[0].InstanceId' instance.json)"
echo "INSTANCE_ID=$INSTANCE_ID"
aws ec2 wait instance-running --region "$REGION" --instance-ids "$INSTANCE_ID"
aws ec2 wait instance-status-ok --region "$REGION" --instance-ids "$INSTANCE_ID"
aws ec2-instance-connect ssh --region "$REGION" --os-user fedora --instance-id "$INSTANCE_ID"
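To compare the two scenarios, one can check from inside the instance whether the user-data script actually ran. A small sketch, assuming the marker string written by the launch script above; the function name is hypothetical.

```shell
#!/bin/sh
# Marker that the user-data script above pipes into systemd-cat.
MARKER="LaunchScript hello"

# Reads journal text on stdin; succeeds if the user-data marker was logged.
user_data_ran() {
  grep -q -F "$MARKER"
}

# On the instance:
#   journalctl -b --no-pager | user_data_ran && echo "user-data ran"
```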
Packer provisioning configuration
# Usage:
# - install packer
# - `packer init .` downloads the plugins.
# - `packer validate .` checks the syntax.
# - `packer build .` builds the image in ~10 minutes.
# For debugging, some options are:
# - `packer build -debug .` waits for Enter before EVERY step; connect with `ssh -i "$(pwd)/ec2_fedora.pem" fedora@THE_IP_IT_SHOWS`
# - `packer build -on-error=ask .` stops only on error so you can SSH in; when it asks, remember to clean up or retry.
#
# Details: https://developer.hashicorp.com/packer/tutorials/aws-get-started/aws-get-started-build-image
packer {
  required_plugins {
    amazon = {
      version = ">= 1.2.8"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

source "amazon-ebs" "fedora" {
  region              = "us-east-1"
  profile             = "main" # That's the profile I have in ~/.aws/credentials, feel free to use other auth methods.
  ami_name            = "tmp-repro-fail" # Must be unique, use a timestamp if you need to.
  spot_instance_types = ["t2.micro"]
  spot_price          = "auto"

  # You can specify the AMI ID here, or use the filter below.
  source_ami = "ami-0f05a0784a500f32d"
  # source_ami_filter {
  #   most_recent = true
  #   filters = {
  #     name                = "Fedora-Cloud-Base-AmazonEC2.x86_64-41-*"
  #     root-device-type    = "ebs"
  #     virtualization-type = "hvm"
  #   }
  #   # Owner (Fedora) as seen in https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1#AMICatalog
  #   # for an AMI specified officially on https://fedoraproject.org/pl/cloud/download
  #   owners = ["125523088429"] # Same in all AWS regions.
  # }

  ssh_username = "fedora"
  imds_support = "v2.0"

  launch_block_device_mappings {
    device_name           = "/dev/sda1"
    volume_size           = 5
    volume_type           = "gp3"
    delete_on_termination = true
  }
}

build {
  name = "tmp-repro"
  sources = [
    "source.amazon-ebs.fedora"
  ]

  provisioner "shell" {
    inline = [
      "sudo chmod g-w /",            # Fixes the EC2 Connect SSH failure.
      "sudo dnf downgrade glibc -y", # Fixes the dhcpcd crash.
      # "sudo systemctl status sshd",
      # "sudo systemctl restart sshd",
      # "sudo dnf update -y", # Does not help.
      "sudo /usr/sbin/dhcpcd --debug --nobackground -4" # If you want to trigger the crash during provisioning.
    ]
  }
}
Versions
- I mostly tested ami-0f05a0784a500f32d on us-east-1 (from Dec 16), the latest F41 available on AWS (from Fedora, i.e. Owner=125523088429).
  - `sshd -V`: OpenSSH_9.8p1, OpenSSL 3.2.2 4 Jun 2024
  - `rpm -qv openssh-server`: openssh-server-9.8p1-3.fc41.2.x86_64
  - `rpm -qv glibc`: glibc-2.40-12.fc41.x86_64
  - `rpm -qv dhcpcd`: dhcpcd-10.0.8-1.fc41.x86_64 (updating to 10.0.10 from rawhide doesn’t change the issue, as expected).
  - `cloud-init`: 24.2
- The version on the Fedora Cloud website is currently ami-046a6af96ef510bb6 (from Oct 24), which contains glibc 2.40-3, so it does not have issue B. It does have issue A.