Strange DNS timeout issue

I have a slightly strange DNS issue which I am fairly sure is to do with DNSSEC. Essentially, a small number of sites time out when resolving via localhost but succeed when resolving via Google or with dnssec-validation disabled. The vast majority of lookups are fine; only a tiny handful “fail”.

As an example the following is OK (just to show resolving locally does indeed work):

byteplayer:~=> dig @127.0.0.1 google.com

; <<>> DiG 9.18.28 <<>> @127.0.0.1 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21423
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 42d13e382d75031c010000006704e99d828b86e7c527deb2 (good)
;; QUESTION SECTION:
;google.com. IN A

;; ANSWER SECTION:
google.com. 18 IN A 142.250.178.14

;; Query time: 27 msec
;; SERVER: 127.0.0.1#53(127.0.0.1) (UDP)
;; WHEN: Tue Oct 08 09:13:17 BST 2024
;; MSG SIZE rcvd: 83

If I try to resolve www.t3.com though I get:

byteplayer:~=> dig @127.0.0.1 www.t3.com
;; communications error to 127.0.0.1#53: timed out
;; communications error to 127.0.0.1#53: timed out

; <<>> DiG 9.18.28 <<>> @127.0.0.1 www.t3.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 64176
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 8e039ae4ee562111010000006704ea867c61cb0033eba818 (good)
;; QUESTION SECTION:
;www.t3.com. IN A

;; Query time: 2023 msec
;; SERVER: 127.0.0.1#53(127.0.0.1) (UDP)
;; WHEN: Tue Oct 08 09:17:10 BST 2024
;; MSG SIZE rcvd: 67

But if I resolve that direct to Google’s DNS it works:

byteplayer:~=> dig @8.8.8.8 www.t3.com

; <<>> DiG 9.18.28 <<>> @8.8.8.8 www.t3.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 182
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;www.t3.com. IN A

;; ANSWER SECTION:
www.t3.com. 256 IN CNAME trhmb96.ng.impervadns.net.
trhmb96.ng.impervadns.net. 30 IN A 45.223.102.77

;; Query time: 20 msec
;; SERVER: 8.8.8.8#53(8.8.8.8) (UDP)
;; WHEN: Tue Oct 08 09:18:03 BST 2024
;; MSG SIZE rcvd: 94

I think it has something to do with DNSSEC, because if I turn off dnssec-validation then all is good:

byteplayer:~=> dig @127.0.0.1 www.t3.com

; <<>> DiG 9.18.28 <<>> @127.0.0.1 www.t3.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58827
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 3e9b58f47f0e41cc010000006704eb355daf3401dd7adfb7 (good)
;; QUESTION SECTION:
;www.t3.com. IN A

;; ANSWER SECTION:
www.t3.com. 177 IN CNAME trhmb96.ng.impervadns.net.
trhmb96.ng.impervadns.net. 23 IN A 45.223.102.77

;; Query time: 25 msec
;; SERVER: 127.0.0.1#53(127.0.0.1) (UDP)
;; WHEN: Tue Oct 08 09:20:05 BST 2024
;; MSG SIZE rcvd: 122
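
A way to run the same comparison without editing named.conf might be to set the CD (checking disabled) bit on the query with dig's +cdflag option, which asks the resolver to skip validation for that one lookup. The sketch below is only an illustration: dig_summary and compare_cd are made-up helper names, and it assumes the header layout shown in the dig output above.

```shell
# Print "STATUS flags" from dig output read on stdin,
# e.g. "SERVFAIL qr rd ra".
dig_summary() {
    awk '
        /->>HEADER<<-/ {
            s = $0
            sub(/.*status: /, "", s)   # drop everything up to the status value
            sub(/,.*/, "", s)          # drop ", id: ..." after it
            status = s
        }
        /^;; flags:/ {
            f = $0
            sub(/^;; flags: */, "", f) # drop the ";; flags:" label
            sub(/;.*/, "", f)          # drop the section counts
            flags = f
        }
        END { print status, flags }
    '
}

# Run the same lookup with and without the CD bit (hypothetical helper).
# Usage: compare_cd 127.0.0.1 www.t3.com
compare_cd() {
    server=$1
    name=$2
    printf 'validating:  %s\n' "$(dig "@$server" "$name" | dig_summary)"
    printf 'CD bit set:  %s\n' "$(dig "@$server" "$name" +cdflag | dig_summary)"
}
```

If the plain query reports SERVFAIL while the +cdflag query reports NOERROR, that would point at the validation path (including the extra DNSKEY/DS fetches it triggers) rather than at the forwarders themselves.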

I have tried increasing logging levels which doesn’t really help:

08-Oct-2024 09:25:26.472 queries: info: client @0x7f537aa2e168 127.0.0.1#50144 (www.t3.com): query: www.t3.com IN A +E(0)K (127.0.0.1)
08-Oct-2024 09:25:31.474 queries: info: client @0x7f537aa30168 127.0.0.1#53253 (www.t3.com): query: www.t3.com IN A +E(0)K (127.0.0.1)
08-Oct-2024 09:25:36.477 queries: info: client @0x7f537aa32168 127.0.0.1#38076 (www.t3.com): query: www.t3.com IN A +E(0)K (127.0.0.1)
08-Oct-2024 09:25:38.536 resolver: info: shut down hung fetch while resolving 'trhmb96.ng.impervadns.net/A'
08-Oct-2024 09:25:38.536 query-errors: info: client @0x7f537aa2e168 127.0.0.1#50144 (www.t3.com): query failed (operation canceled) for www.t3.com/IN/A at …/…/…/lib/ns/query.c:7842
08-Oct-2024 09:25:38.536 query-errors: info: client @0x7f537aa30168 127.0.0.1#53253 (www.t3.com): query failed (operation canceled) for www.t3.com/IN/A at …/…/…/lib/ns/query.c:7842
08-Oct-2024 09:25:38.536 query-errors: info: client @0x7f537aa32168 127.0.0.1#38076 (www.t3.com): query failed (operation canceled) for www.t3.com/IN/A at …/…/…/lib/ns/query.c:7842
08-Oct-2024 09:25:38.549 resolver: info: shut down hung fetch while resolving 'impervadns.net/DNSKEY'
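
The log names impervadns.net/DNSKEY as one of the hung fetches. DNSKEY answers that carry signatures tend to be among the larger DNS responses, so it may be worth measuring how big that answer is when fetched over TCP and comparing it with the 1232-byte UDP size that the local server advertises. A rough sketch, assuming dig's usual ";; MSG SIZE rcvd:" footer; msg_size and fits_in_udp are hypothetical helper names:

```shell
# Print the "MSG SIZE rcvd" value from dig output read on stdin.
msg_size() {
    awk '/^;; MSG SIZE/ { print $NF }'
}

# Say whether a response of the given size fits in a UDP buffer of the
# given limit (both in bytes).
fits_in_udp() {
    size=$1
    limit=$2
    if [ "$size" -le "$limit" ]; then
        echo "fits ($size <= $limit)"
    else
        echo "too large ($size > $limit)"
    fi
}

# Example use (needs network access, so it is left commented out):
#   size=$(dig @8.8.8.8 impervadns.net DNSKEY +dnssec +tcp | msg_size)
#   fits_in_udp "$size" 1232
```

A "too large" result would mean that the answer can only arrive over UDP if the advertised buffer is raised, or over TCP after a truncated reply.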

I realise there are various CNAMEs and so on in this example, but the essence is: WHY does resolving direct to Google work, while resolving locally (which also goes to Google!) fails?

My named.conf file looks as follows:

options {
    listen-on port 53 { any; };
    version none;
    directory "/var/named";
    dump-file "/var/named/data/cache_dump.db";
    statistics-file "/var/named/data/named_stats.txt";
    memstatistics-file "/var/named/data/named_mem_stats.txt";
    secroots-file "/var/named/data/named.secroots";
    recursing-file "/var/named/data/named.recursing";
    recursion yes;
    allow-recursion { goodclients; };
    allow-query { goodclients; };
    allow-query-cache { goodclients; };

    dnssec-validation auto;

    managed-keys-directory "/var/named/dynamic";

    pid-file "/run/named/named.pid";
    session-keyfile "/run/named/session.key";

    /* Changes/CryptoPolicy - Fedora Project Wiki */
    include "/etc/crypto-policies/back-ends/bind.config";

    forwarders {
        8.8.8.8; # Google DNS
        8.8.4.4; # Google secondary DNS
    };
    forward only; # Ensure BIND only forwards queries

};
/* logging {
    channel default_debug {
        file "data/named.run";
        severity dynamic;
    };
};
*/

logging {
    channel default_debug {
        file "/var/named/data/named.log" versions 3 size 20m;
        severity dynamic;
        print-time yes;
        print-severity yes;
        print-category yes;
    };
    category default { default_debug; };
    category queries { default_debug; };
    category security { default_debug; };
};

zone "." IN {
    type hint;
    file "/var/named/named.ca";
};

include "/etc/named.rfc1912.zones";

I have spent hours on this but am not making any progress. The strange thing is that the number of lookups that fail is really small. As this is just a caching name server with forwarding, I am at a loss as to why these few time out when DNSSEC validation is enabled. Another example is grd.bk.

Any guidance or advice would be appreciated, or if someone could simply check whether this problem exists on their own set-up.

DNSSEC support has always been problematic:

This looks like a common problem not limited to systemd-resolved.

Here’s a possible workaround:

  • Use DNS providers that validate DNSSEC.
  • Disable local DNSSEC validation.
  • Use DoH/DoT for integrity.

Isn’t this always the way: you spend hours trying to sort it, post on a forum, then solve it yourself minutes later!

In my case, having read a really good guide (the DNSSEC Guide in the BIND 9 9.18.14 documentation), I noticed that my UDP size didn’t seem very large.

I added the following line to my named.conf file:

edns-udp-size 4096;

After a restart of named, all is good and resolutions now work for the failing ones.
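
To double-check that the new value is what named actually loaded, something like the following could be used. This is a rough sketch: option_value is a hypothetical helper, and it assumes that named-checkconf -p prints each simple option as "name value;" on its own line.

```shell
# Print the value of a simple "name value;" option from named.conf-style
# text read on stdin.
option_value() {
    awk -v opt="$1" '$1 == opt { v = $2; sub(/;$/, "", v); print v }'
}

# Example use (needs a BIND installation, so it is left commented out):
#   named-checkconf -p | option_value edns-udp-size
# After the change described above this would be expected to print 4096.
```

Note that edns-udp-size only controls the buffer size named advertises to upstream servers; whether large UDP replies then arrive intact still depends on the network path.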

I hope this helps someone else. I wish there had been an easier way to “spot” this, as it certainly wasn’t clear from Wireshark or even the debug logs that this was the issue.


After some further reading here: domain name system - EDNS buffer size impact - Server Fault, it appears that 1232 is the recommended size to avoid fragmentation of UDP packets. named is SUPPOSED to fall back to TCP, though, and the upstream DNS server SHOULD respond with the “TC” flag set in the header, telling my server to retry over TCP. However, no matter what I try, I do not see this in any response from Google’s DNS servers. I have tried various things but TC doesn’t appear anywhere, which would explain the timeout. So although I do have a workaround, it is sub-optimal. I will keep investigating.

I am just confused now. After more checking I can see that Google has their size set to 512 bytes anyway, so I was not seeing TC simply because my queries were not exceeding that limit. If I make a query with a large response (rather than just a simple A record, for example) then I DO see the TC flag. I am therefore utterly bemused as to why, for certain queries, my local server simply times out unless I increase edns-udp-size from 1232.
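
One thing that may be relevant here: as far as I understand EDNS (RFC 6891), the "udp:" value in a response's OPT section is the size the responder itself is willing to receive, while the size of a reply is limited by what the requester advertised. So the "udp: 512" in Google's answer may not be the limit applied to replies sent to the local server. A hedged way to probe for truncation at a given size is to send the same kind of query named would send (DNSSEC OK bit set, a specific buffer size) and tell dig not to retry over TCP. has_tc_flag and probe_truncation are hypothetical helper names:

```shell
# Succeed (exit 0) if the dig output on stdin has the TC flag set.
has_tc_flag() {
    awk '
        /^;; flags:/ {
            f = $0
            sub(/^;; flags: */, "", f)  # drop the ";; flags:" label
            sub(/;.*/, "", f)           # keep only the flag names
            n = split(f, parts, " ")
            for (i = 1; i <= n; i++)
                if (parts[i] == "tc") found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Query with the DO bit and a chosen EDNS buffer size. +ignore stops dig
# from silently retrying over TCP, so a truncated reply stays visible.
# Usage: probe_truncation 8.8.8.8 impervadns.net DNSKEY 1232
probe_truncation() {
    server=$1
    name=$2
    type=$3
    size=$4
    if dig "@$server" "$name" "$type" +dnssec +bufsize="$size" +ignore +tries=1 |
        has_tc_flag
    then
        echo "TC set at $size bytes: a TCP retry would be needed"
    else
        echo "no TC flag at $size bytes"
    fi
}
```

If TC does show up at 1232 but not at 4096, the next thing to test would be whether TCP to the forwarders works at all from this host (for example with dig +tcp), since a blocked or filtered TCP port 53 would make the fallback hang in much the way the log describes. That is an assumption to verify, not a diagnosis.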