NSS vs. dig vs. telnet

Today, one of my colleagues had a customer whose database failed over. The failover workflow automatically updates the database’s DNS A record to point to the new host’s IP address. The A record has a TTL of 5 seconds, yet the customer’s application still couldn’t connect to the database. Running dig against the database hostname returned the new IP address, but telnet reported a different one.

I checked the DNS fleet, and every name server responded correctly with the new IP address, so the problem had to be on the customer’s OS. I believe the culprit was name resolution through the Name Service Switch (NSS) facility in Linux.

The difference between dig and telnet is that dig always queries the name servers directly, whereas telnet resolves names through libc, which prefers a local caching service if one is running on the OS. One example is the nscd daemon, which caches lookups against various “databases” made through standard libc interfaces. In our case, we care about the “hosts” database and about gethostbyname() and similar libc functions.

As a side note, /etc/nsswitch.conf is the Name Service Switch (NSS) configuration file. It specifies the “databases” and the order in which their sources are searched. For example:

$ grep ^hosts /etc/nsswitch.conf
hosts:      files dns

Here, lookups in the hosts database consult files (/etc/hosts) first, then DNS.
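
To make that lookup order easy to inspect in a script, here is a minimal sketch that prints the sources of the hosts database one per line. The function name hosts_sources is my own, and the parsing is deliberately simple: it assumes the usual "hosts: source source ..." layout and drops comments and bracketed action items.

```shell
# hosts_sources [FILE]: print the sources configured for the hosts
# database, one per line, in lookup order.
# FILE defaults to /etc/nsswitch.conf. (Hypothetical helper name.)
hosts_sources() {
    awk '
        { sub(/#.*/, "") }            # drop trailing comments
        $1 == "hosts:" {
            gsub(/\[[^]]*\]/, "")     # drop action items like [NOTFOUND=return]
            for (i = 2; i <= NF; i++) print $i
        }
    ' "${1:-/etc/nsswitch.conf}"
}
```

Calling hosts_sources with no argument reads the system file. For the configuration shown above it would print files and then dns.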

To test and confirm this, I will start the nscd service on my server and run strace on both dig and telnet:

$ service nscd start

$ strace -e sendmsg,recvmsg -f dig test2.pawwa.in.rs 2>&1 | grep 'sendmsg\|recvmsg'
[pid 20250] recvmsg(20, 0x7fb91f9afba0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 20250] sendmsg(20, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.0.2")}, msg_iov(1)=[{"\317{\1\0\0\1\0\0\0\0\0\0\5test2\5pawwa\2in\2rs\0\0"..., 35}], msg_controllen=0, msg_flags=0}, 0) = 35
[pid 20250] recvmsg(20, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.0.2")}, msg_iov(1)=[{"\317{\201\200\0\1\0\1\0\0\0\0\5test2\5pawwa\2in\2rs\0\0"..., 65535}], msg_controllen=32, {cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=0}, 0) = 51

$ strace -e connect,sendto,recvmsg -f telnet test2.pawwa.in.rs 80
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = 0
sendto(3, "\2\0\0\0\r\0\0\0\6\0\0\0hosts\0", 18, MSG_NOSIGNAL, NULL, 0) = 18
recvmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"hosts\0", 6}, {"\310O\3\0\0\0\0\0", 8}], msg_controllen=24, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {4}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 14

Conclusion: dig sends a UDP packet directly to my name server (10.0.0.2), but telnet queries nscd via its Unix socket file /var/run/nscd/socket.

Now, let’s stop nscd and perform the same test:

$ service nscd stop

$ strace -e connect,sendmsg,recvmsg -f dig test2.pawwa.in.rs 2>&1 | grep 'connect\|sendmsg\|recvmsg'
[pid 20320] recvmsg(20, 0x7f5a81d4dba0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 20320] sendmsg(20, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.0.2")}, msg_iov(1)=[{"\363z\1\0\0\1\0\0\0\0\0\0\5test2\5pawwa\2in\2rs\0\0"..., 35}], msg_controllen=0, msg_flags=0}, 0) = 35
[pid 20320] recvmsg(20, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.0.2")}, msg_iov(1)=[{"\363z\201\200\0\1\0\1\0\0\0\0\5test2\5pawwa\2in\2rs\0\0"..., 65535}], msg_controllen=32, {cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=0}, 0) = 51

$ strace -e connect,sendto,recvmsg -f telnet test2.pawwa.in.rs 80
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.0.2")}, 16) = 0
Trying 54.228.254.199...
connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("54.228.254.199")}, 16) = 0
Connected to test2.pawwa.in.rs.

Conclusion: in the absence of nscd, telnet queries the name server directly.
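
The two traces suggest a quick way to guess which path a libc lookup will take without running strace: check whether the nscd socket exists. This is only a rough sketch. The function name is hypothetical, the default path is the one seen in the strace output above (other distributions may use a different one), and a leftover socket file does not prove the daemon is actually answering.

```shell
# nscd_socket_present [SOCKET]: succeed if a Unix socket exists at SOCKET.
# SOCKET defaults to the path seen in the strace output above.
# (Hypothetical helper name; a stale socket would still make this succeed.)
nscd_socket_present() {
    [ -S "${1:-/var/run/nscd/socket}" ]
}
```

For example: nscd_socket_present && echo "libc will try nscd first" || echo "libc will go straight to the configured sources".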

If I want my system to consult DNS first for name resolution, I can edit /etc/nsswitch.conf to list dns as the first source, then restart nscd:

$ grep ^hosts /etc/nsswitch.conf
hosts:      dns files
$ service nscd restart
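
Since nscd keeps its own cache, it can also be worth checking how long it is configured to hold on to hosts entries. The sketch below prints the hosts-related cache directives from nscd.conf. The function name is my own, and I am assuming the stock /etc/nscd.conf location and the usual "directive service value" line format.

```shell
# nscd_hosts_settings [FILE]: print the nscd.conf cache directives that
# apply to the hosts service, as "directive value" pairs.
# FILE defaults to /etc/nscd.conf. (Hypothetical helper name.)
nscd_hosts_settings() {
    awk '
        { sub(/#.*/, "") }            # drop comments
        $2 == "hosts" && ($1 == "enable-cache" || $1 ~ /time-to-live$/) {
            print $1, $3
        }
    ' "${1:-/etc/nscd.conf}"
}
```

If the hosts cache turns out to be holding a stale address, nscd -i hosts should invalidate just that cache without a full restart, assuming the installed nscd supports the -i (--invalidate) option.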

Finally, to query the hosts database for a hostname through NSS, I can use getent:

$ getent hosts test2.pawwa.in.rs
54.228.254.199  test2.pawwa.in.rs
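
Putting the two views side by side gives a quick check for the symptom from the start of this post, where dig and the application disagreed. The following is a sketch under a few assumptions: the function names are my own, only IPv4 A records are compared, and it relies on dig +short and getent ahostsv4 being available on the machine.

```shell
# normalize_addrs: read whitespace-separated addresses on stdin and
# print them one per line, sorted, without duplicates.
normalize_addrs() {
    tr -s '[:space:]' '\n' | sed '/^$/d' | sort -u
}

# same_addrs A B: succeed when both lists hold the same set of addresses.
same_addrs() {
    [ "$(printf '%s\n' "$1" | normalize_addrs)" = "$(printf '%s\n' "$2" | normalize_addrs)" ]
}

# check_resolution NAME: compare the DNS answer (dig) with the NSS
# answer (getent, which goes through nscd when it is running).
# (Hypothetical helper names throughout.)
check_resolution() {
    dns_addrs=$(dig +short A "$1" | grep -E '^[0-9.]+$')
    nss_addrs=$(getent ahostsv4 "$1" | awk '{ print $1 }')
    if same_addrs "$dns_addrs" "$nss_addrs"; then
        echo "OK: $1 resolves consistently"
    else
        echo "MISMATCH for $1: dig says [$dns_addrs], NSS says [$nss_addrs]"
        return 1
    fi
}
```

Running check_resolution test2.pawwa.in.rs right after a failover would flag the case where the name servers already return the new address but the local resolution path still hands out the old one.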
