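The select function in Linux
I have recently been troubleshooting a case where a user's data sync was very slow. Running perf trace -S -p $pid showed that the process spent most of its time in select: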
# perf trace -S -p 9091
...
12665.204 (35.778 ms): select(n: 18, inp: 0x28c13a0, outp: 0x28c1160 ) = 1
12700.995 ( 0.023 ms): read(fd: 17<socket:[169296244]>, buf: 0x2af3600, count: 32768 ) = 2896
^C
Summary of events:
p4d (9091), 13844 events, 100.0%
syscall            calls     total       min       avg       max stddev
                            (msec)    (msec)    (msec)    (msec)    (%)
--------------- -------- --------- --------- --------- --------- ------
select              2032 12389.683     0.000     6.097   124.075  7.78%
write               2858    53.719     0.008     0.019     0.252  1.09%
read                2032    38.850     0.004     0.019     0.259  1.35%
Intuitively, this looked abnormal. Here is what the Linux manual page says about select:
int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);
select() allows a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become “ready” for some class of I/O operation (e.g., input possible).
A file descriptor is considered ready if it is possible to perform the corresponding I/O operation (e.g., read(2)) without blocking. […]
Three independent sets of file descriptors are watched. Those listed in readfds will be watched to see if characters become available for reading (more precisely, to see if a read will not block; in particular, a file descriptor is also ready on end-of-file), those in writefds will be watched to see if a write will not block, and those in exceptfds will be watched for exceptions.
On exit, the sets are modified in place to indicate which file descriptors actually changed status. Each of the three file descriptor sets may be specified as NULL if no file descriptors are to be watched for the corresponding class of events.
[…]
Return Value
On success, select() and pselect() return the number of file descriptors contained in the three returned descriptor sets (that is, the total number of bits that are set in readfds, writefds, exceptfds) which may be zero if the timeout expires before anything interesting happens. On error, -1 is returned, and errno is set appropriately; the sets and timeout become undefined, so do not rely on their contents after an error.
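The readiness rules quoted above can be tried out with a small sketch. It uses Python's select.select, which wraps the same call on Linux, and a local socket pair in place of a real connection. The helper name is_readable and the socket pair are assumptions made for illustration; they are not taken from the traced program.

```python
import select
import socket

def is_readable(sock, timeout):
    """Return True if select() reports `sock` readable within `timeout` seconds."""
    # Only the read set is populated; the write and exception sets stay empty.
    readable, _, _ = select.select([sock], [], [], timeout)
    return sock in readable

# A local socket pair stands in for a real network connection (an assumption).
reader, writer = socket.socketpair()

idle = is_readable(reader, 0)         # nothing sent yet; timeout 0 only polls
writer.sendall(b"hello")
with_data = is_readable(reader, 1.0)  # buffered data: a read would not block
data = reader.recv(32768)

writer.close()
at_eof = is_readable(reader, 1.0)     # end-of-file also counts as ready
eof = reader.recv(32768)              # an empty result means the peer closed
reader.close()

print(idle, with_data, data, at_eof, eof)
```

The first call corresponds to the "timeout expires before anything interesting happens" case in the Return Value text, and the last one to the note that a descriptor is also ready on end-of-file.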
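perf trace prints the fd_set arguments only as pointers, so it does not show which descriptors are being watched. Tracing with strace does show them:
# strace -p 9091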
[…]
write(19, "\271i\355\312\340\225\324\177\371\304$r\317\3511rc\304\353\360U\324\277e[\26\271\273\314\260\353"…, 4096) = 4096
select(18, [17], [], NULL, NULL) = 1 (in [17])
read(17, "\275\26hs\314\311\33\324\277\315B\357\26\250,q\362\5A)\257\231\21\225\273V\321\225\211\241\371\177"…, 32768) = 2896
write(19, "\355\325\255\222\222\22\0232\322\322\257\375\263\315\337S\v\252\342\323\23S3\376\33626~\310\3776\266"…, 4096) = 4096
select(18, [17], [], NULL, NULL) = 1 (in [17])
read(17, "\34306x\262'\17\342\277#\357\306\324\314*\300\207\340u\250X\331\220\214^\274\361\272\267\300\27"…, 32768) = 5792
write(19, "\331;r\371\221\\355\5\1\320\177\323S\353\310\215\7\247\346\n\0\2275\356\320\227\334\275O\264\237\334"…, 4096) = 4096
select(18, [17], [], NULL, NULL
^Cstrace: Process 9091 detached
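The trace shows that the application passes only readfds and writefds when it calls select; exceptfds and timeout are NULL, and writefds is an empty set. (I was not sure how an empty set differs from NULL. As far as I can tell from the man page, both mean that no descriptors are watched for that class, so nothing is checked for writability here.) A NULL timeout means select blocks until a descriptor becomes ready.
fd 17 is a socket through which the process downloads data from a remote server, and it is the main suspect. My current guess is that because the data arrives slowly, select has to wait a long time before the socket becomes ready, so most of the time is spent in select rather than in read.
Today I ran perf trace against two processes whose transfer speed was normal. The results: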
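A small experiment illustrates this guess: when the sender is slow, the waiting shows up in select and the reads themselves return almost immediately. This is only a sketch. The sender thread, CHUNKS and DELAY are made-up stand-ins for the remote server, not measurements from the real system.

```python
import select
import socket
import threading
import time

CHUNKS = 5     # hypothetical number of messages
DELAY = 0.02   # hypothetical gap between messages, in seconds

def slow_sender(sock):
    """Send a few small chunks with a pause before each, then close."""
    for _ in range(CHUNKS):
        time.sleep(DELAY)
        sock.sendall(b"x" * 1024)
    sock.close()

reader, writer = socket.socketpair()
threading.Thread(target=slow_sender, args=(writer,), daemon=True).start()

select_time = 0.0
read_time = 0.0
received = 0
while True:
    t0 = time.perf_counter()
    select.select([reader], [], [])   # no timeout: block until readable
    t1 = time.perf_counter()
    data = reader.recv(32768)
    t2 = time.perf_counter()
    select_time += t1 - t0
    read_time += t2 - t1
    if not data:                      # empty read: the sender has closed
        break
    received += len(data)
reader.close()

print(f"bytes={received} select={select_time:.3f}s read={read_time:.3f}s")
```

If the guess is right, time in select here measures how long the process waits for data, not work done by select itself.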
8523.185 ( 0.006 ms): read(fd: 24, buf: 0x2b30459, count: 4096 ) = 4096
^C
Summary of events:
p4d (10316), 218542 events, 100.0%
syscall            calls     total       min       avg       max stddev
                            (msec)    (msec)    (msec)    (msec)    (%)
--------------- -------- --------- --------- --------- --------- ------
select              2498  7035.923     0.000     2.817   112.853  2.74%
read              104269   613.876     0.002     0.006     1.424  0.69%
write               2493   174.014     0.022     0.070     0.607  0.99%
open                   5     0.109     0.017     0.022     0.028  8.48%
close                  5     0.015     0.002     0.003     0.004  8.22%
flock                  2     0.012     0.003     0.006     0.008 40.06%
11718.425 ( 0.004 ms): read(fd: 24, buf: 0x2b21699, count: 4096 ) = 4096
^C
Summary of events:
p4d (30890), 545667 events, 99.3%
syscall            calls     total       min       avg       max stddev
                            (msec)    (msec)    (msec)    (msec)    (%)
--------------- -------- --------- --------- --------- --------- ------
select              6116  9339.862     0.000     1.527   190.234  3.67%
read              261559  1052.123     0.001     0.004     1.099  0.48%
write               6099   279.031     0.019     0.046     1.633  1.14%
open                  16     0.447     0.011     0.028     0.054 11.16%
close                 15     0.038     0.001     0.003     0.006 12.28%
flock                  4     0.021     0.003     0.005     0.008 20.89%
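One way to put the three summaries side by side is to compute, for each process, the share of traced syscall time spent in select and the mean time per select call. The sketch below does that with the calls and total columns copied from the tables above. The labels "slow" and "normal" and the helper select_profile are my own. The share is relative to the time spent in traced syscalls only, not to wall-clock time.

```python
# (calls, total_msec) per syscall, copied from the three perf trace summaries.
summaries = {
    "9091 (slow)": {
        "select": (2032, 12389.683), "write": (2858, 53.719),
        "read": (2032, 38.850),
    },
    "10316 (normal)": {
        "select": (2498, 7035.923), "read": (104269, 613.876),
        "write": (2493, 174.014), "open": (5, 0.109),
        "close": (5, 0.015), "flock": (2, 0.012),
    },
    "30890 (normal)": {
        "select": (6116, 9339.862), "read": (261559, 1052.123),
        "write": (6099, 279.031), "open": (16, 0.447),
        "close": (15, 0.038), "flock": (4, 0.021),
    },
}

def select_profile(table):
    """Return (share of traced syscall time in select, mean msec per select)."""
    total_msec = sum(msec for _, msec in table.values())
    calls, msec = table["select"]
    return msec / total_msec, msec / calls

profiles = {name: select_profile(table) for name, table in summaries.items()}
for name, (share, mean) in profiles.items():
    print(f"{name}: select share {share:.1%}, mean {mean:.3f} ms per call")
```

By this measure select dominates in all three processes, at roughly 99% for the slow one and a little under 90% for the two normal ones. The mean time per select call is about 6.1 ms for the slow process against about 2.8 ms and 1.5 ms for the normal ones.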