Solaris 11: DNS libresolv and NSS settings

The resolver options (timeout:T attempts:A) in /etc/resolv.conf and the name service switch option for dns ([TRYAGAIN=R]) in /etc/nsswitch.conf are very important: bad values may effectively lock you out of a machine when ssh is the only way to access it and sshd runs with its default settings as well (i.e. a 'LoginGraceTime' of 2 min).

Why? Because if none of the configured nameservers answers the request (e.g. because an intermediate network/DSL link is down), each attempt of a single hostname lookup takes 2 * MAX(NS, T * 2^N), with NS == the number of configured nameservers, N == the number of the retry (i.e. 0..attempts) and T == the configured timeout. First the resolver tries to look up the unqualified hostname (i.e. the name without a trailing dot) within the given timeout (but giving each nameserver at least 1 sec to respond) and then the qualified one, obtained by appending the host's domain name (or vice versa if there is a dot within the hostname). So for the first try (i.e. N=0) we get 2 * MAX(NS, T). If no result has been received, the timeout gets doubled for each consecutive attempt and the lookup gets repeated with the new timeout. This continues until the configured number of attempts has been reached (i.e. when N==A). Summing 2 * T * 2^N over N = 0..A, it takes roughly T * (2^(A+2) - 2) sec until the call returns with an ETIMEDOUT error when an application asks to resolve a hostname.

However, now the Name Service Switch comes into play: If an application does not use the libresolv library directly but normal system/libc calls, the call gets "tunneled" via a so-called NS door to the name service cache daemon aka "nscd", which actually does the request, caches the result and passes it back to the application. However, if nscd receives an ETIMEDOUT|BUSY error, it repeats the lookup up to TRYAGAIN times. So very roughly the complete equation for the worst case sums up to: T * (2^(A+2) - 2) * (R+1) !!!

If a system uses the default values (i.e. timeout:5, attempts:2, TRYAGAIN=3), the call to resolve a hostname takes roughly 5 * (2^4 - 2) * 4 = 280 sec. Wrt. ssh: because sshd tries to resolve the connecting client's hostname, the authentication phase times out (4.7 min > 2 min) even before you actually get the chance to authenticate -> you are out of business [and NOTE, possibly due to an ssh bug, "LookupClientHostnames no" / "VerifyReverseMapping no" don't change anything] ;-)
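The worst-case estimate T * (2^(A+2) - 2) * (R+1) is easy to play with in a shell. The following is only a minimal sketch, assuming bash; the function name worst_case_lookup is made up for illustration, and the result is the rough estimate from the text, not a measurement.

```shell
#!/usr/bin/env bash
# Rough worst-case lookup time in seconds: T * (2^(A+2) - 2) * (R+1)
#   $1 = T (timeout), $2 = A (attempts), $3 = R (TRYAGAIN)
# worst_case_lookup is a hypothetical helper, not a system tool.
worst_case_lookup() {
    local t=$1 a=$2 r=$3
    # 1 << (a + 2) is 2^(A+2) in integer arithmetic
    echo $(( t * ((1 << (a + 2)) - 2) * (r + 1) ))
}

worst_case_lookup 5 2 3    # default values from the text -> 280
worst_case_lookup 3 1 0    # timeout:3 attempts:1 TRYAGAIN=0 -> 18
```

Comparing the result with sshd's LoginGraceTime (120 sec by default) shows quickly whether a set of values leaves any room to log in.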

In Solaris 11 the algorithm seems to have been changed for the worse, so for the sake of laziness here is a table with time measurements for 'getent hosts $host', which may help you determine which settings might be right for you (all with TRYAGAIN=0):

      TIME[s]  TIME[s]  TIME[s]
T  A     NS=1     NS=2     NS=3
0  0      2.0      4.0      6.0
0  1      4.0      8.0     12.0
0  2      6.0     12.0     18.0
0  3      8.0     16.0     24.0
1  0      2.0      4.0      6.0
1  1      4.0      8.0     12.0
1  2      8.0     14.0     20.0
1  3     16.0     26.0     32.0
2  0      4.0      6.0      8.0
2  1      8.0     12.0     16.0
2  2     16.0     24.0     28.0
2  3     32.0     48.0     52.0
3  0      6.0      8.0     10.0
3  1     12.0     16.0     20.0
3  2     24.0     34.0     40.0
3  3     48.0   1:10.0   1:20.0
4  0      8.0     12.0     12.0
4  1     16.0     24.0     24.0
4  2     32.0     48.0     48.0
4  3   1:04.0   1:36.0   1:40.0
5  0     10.0     14.0     14.0
5  1     20.0     28.0     28.0
5  2     40.0     58.0   1:00.0
5  3   1:20.0   1:58.0   2:04.0
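
To take comparable measurements on your own machine, something like the following might do. It is a minimal sketch, assuming bash (its SECONDS variable only has 1 sec resolution); the helper name elapsed is made up for illustration, and localhost is just a harmless fallback if $host is not set.

```shell
#!/usr/bin/env bash
# Print the wall-clock seconds a command takes to return.
# elapsed is a hypothetical helper; output of the command itself is discarded.
elapsed() {
    local start=$SECONDS
    "$@" >/dev/null 2>&1
    echo $(( SECONDS - start ))
}

# Time a single lookup, as done for the table above
elapsed getent hosts "${host:-localhost}"
```

For sub-second resolution, running the lookup under ptime (or the shell's time keyword) is probably the better choice.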

But remember wrt. NSS, NOT setting dns [TRYAGAIN=0] is a really bad idea!

Commands to change DNS resolver settings:

svccfg -s name-service/switch 'setprop config/host = "files dns [TRYAGAIN=0]"'
svccfg -s dns/client 'setprop config/options = "timeout:3 attempts:1"'
svccfg -s dns/client 'setprop config/nameserver = ("10.0.0.1" "10.0.0.2")'
svcadm refresh name-service/switch
svcadm refresh dns/client

cat /etc/resolv.conf
egrep '^(hosts|ipnodes):' /etc/nsswitch.conf
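
To double-check which values the resolver will actually pick up, the options line can be pulled out of the file. This is only a sketch, assuming bash; resolver_opts is a made-up helper name, and the fallbacks of 5 and 2 are the default values mentioned above, used when the file sets nothing.

```shell
#!/usr/bin/env bash
# Print the timeout/attempts values from a resolv.conf-style file.
#   $1 = file to read (defaults to /etc/resolv.conf)
# resolver_opts is a hypothetical helper, not a system tool.
resolver_opts() {
    local file=${1:-/etc/resolv.conf} timeout=5 attempts=2 key rest word
    # "|| [ -n "$key" ]" also handles a last line without a trailing newline
    while read -r key rest || [ -n "$key" ]; do
        [ "$key" = "options" ] || continue
        for word in $rest; do
            case $word in
                timeout:*)  timeout=${word#timeout:} ;;
                attempts:*) attempts=${word#attempts:} ;;
            esac
        done
    done < "$file"
    echo "timeout=$timeout attempts=$attempts"
}
```

Calling resolver_opts without an argument reads /etc/resolv.conf; with the settings from the commands above it should print timeout=3 attempts=1.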

Misc

To verify what's really going on, one may try to dtrace it using nscd.d -32 -p$pid, where $pid is the process ID of an nscd process.

Note

According to Oracle Bug 19930631 the odd behavior has been fixed in Solaris 11.2.7.4.0 aka 11.2 SRU 2.7 (April 2015), i.e. the TRYAGAIN value now overrides the attempts value from /etc/resolv.conf.

Copyright (C) 2011 Jens Elkner (jel+s11@cs.uni-magdeburg.de)