UDP recvfrom

As was the case with sending, the socket API provides several mechanisms for receiving a UDP datagram. We begin with a study of recvfrom() which, like sendto(), does not require the user to pass an address structure and does not support scatter/gather operations via the iovec mechanism. The recvfrom() function takes the following arguments.

fd        File (socket) descriptor.
ubuf      Pointer to a buffer to hold the received message.
size      The size of the above buffer.
addr      Pointer to a structure of type struct sockaddr_in. If not NULL and the socket is not connected, the address of the sender of the received message is returned here.
addr_len  A pointer to the size of the structure pointed to by addr. If not NULL, the value it points to is changed to the actual size of the address.
flags     Can be used to modify the behaviour of the receive operation. Flags supported for UDP sockets include:
          MSG_PEEK: Used to receive data without dequeuing it. Thus, a subsequent call shall return the same data.
          MSG_ERRQUEUE: Used to receive queued errors from the socket error queue.

The sys_recvfrom socket call

The sys_recvfrom() function, defined in net/socket.c, is the kernel function to which control is eventually passed from sys_socketcall(). Its parameters are those passed by the application to recvfrom(). If an incoming packet is queued for the socket and successfully copied to the user buffer, sys_recvfrom() returns its length in bytes. A negative return value indicates that an error condition has been encountered and that no data has been returned; a return value of zero means that no data was copied (for UDP, this is what a zero-length datagram produces).

    asmlinkage long sys_recvfrom(int fd, void __user * ubuf, size_t size, unsigned flags,
                                 struct sockaddr __user *addr, int __user *addr_len)
    {
        struct socket *sock;
        struct iovec iov;
        struct msghdr msg;
        char address[MAX_SOCK_ADDR];
        int err,err2;
        struct file *sock_file;
        int fput_needed;

The operation of the fget_light() function was described in the discussion of UDP sendto.
It returns a pointer to the struct file corresponding to the fd passed in by the user. Then the sock_from_file() function is used to recover the pointer to the socket / inode pair and eventually the socket. If the fd does not index a valid socket, NULL is returned and the call fails.

        sock_file = fget_light(fd, &fput_needed);
        if (!sock_file)
            return -EBADF;

        sock = sock_from_file(sock_file, &err);
        if (!sock)
            goto out;

Message structures

The struct msghdr and the struct iovec are the vehicles by which the struct sockaddr_in and the address of the data buffer are managed in kernel space. These structures, introduced in the description of the sendto function, are used in both send and receive operations. As with sendto(), the recvfrom() API does not support scatter/gather. Thus it is the responsibility of sys_recvfrom() to construct the msghdr and iov.

After recovering the pointer to the struct socket, sys_recvfrom() fills in the struct msghdr and the struct iovec. Since the API supports no mechanism for the receipt of ancillary control data, such data is not collected.

        msg.msg_control=NULL;
        msg.msg_controllen=0;

Since the recvfrom() API also permits the caller to pass only a single contiguous input buffer, a simple I/O vector containing one data block is used.

        msg.msg_iovlen=1;
        msg.msg_iov=&iov;

The base pointer for the data block is set to point to the user space address of the data buffer.

        iov.iov_len=size;
        iov.iov_base=ubuf;

The name pointer contains the kernel space address of the local buffer in which the address of the sender of the message will be stored temporarily.

        msg.msg_name=address;
        msg.msg_namelen=MAX_SOCK_ADDR;

When a socket has the O_NONBLOCK flag set, the application will not block (wait) if there is currently no data to receive.

        if (sock->file->f_flags & O_NONBLOCK)
            flags |= MSG_DONTWAIT;

Passing control to sock_recvmsg

The bulk of the work is done by sock_recvmsg().
If the value returned in err is non-negative, a packet has been successfully received.

        err = sock_recvmsg(sock, &msg, size, flags);

On return to sys_recvfrom() a non-negative value for err indicates success. If the argument addr, which contains the user space address, is not NULL, and the message was successfully received, the sender's address is returned to user space by the move_addr_to_user() function. Note that your protocol will be responsible for assembling the sockaddr_in structure but will not be responsible for moving it to user space.

        if(err >= 0 && addr != NULL)
        {
            err2 = move_addr_to_user(address, msg.msg_namelen, addr, addr_len);
            if(err2<0)
                err=err2;
        }
    out:
        fput_light(sock_file, fput_needed);
        return err;
    }

The sock_recvmsg() function

The sock_recvmsg() function, defined in net/socket.c, is the point at which kernel support for all of the recv*() APIs converges.

    int sock_recvmsg(struct socket *sock, struct msghdr *msg,
                     size_t size, int flags)
    {
        struct kiocb iocb;
        struct sock_iocb siocb;
        int ret;

        init_sync_kiocb(&iocb, NULL);
        iocb.private = &siocb;
        ret = __sock_recvmsg(&iocb, sock, msg, size, flags);
        if (-EIOCBQUEUED == ret)
            ret = wait_on_sync_kiocb(&iocb);
        return ret;
    }

This code is analogous to that of __sock_sendmsg().

    static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock,
                                     struct msghdr *msg, size_t size, int flags)
    {
        int err;
        struct sock_iocb *si = kiocb_to_siocb(iocb);

        si->sock = sock;
        si->scm = NULL;
        si->msg = msg;
        si->size = size;
        si->flags = flags;

        err = security_socket_recvmsg(sock, msg, size, flags);
        if (err)
            return err;

For UDP and other sockets of family AF_INET, this indirect call used to be to inet_recvmsg(). Now this function has been renamed sock_common_recvmsg(). As with sendto(), the scm cookie is not used.
        return sock->ops->recvmsg(iocb, sock, msg, size, flags);
    }

The sock_common_recvmsg() function

The sock_common_recvmsg() function is defined in net/core/sock.c.

    int sock_common_recvmsg(struct kiocb *iocb, struct socket *sock,
                            struct msghdr *msg, size_t size, int flags)
    {
        struct sock *sk = sock->sk;
        int addr_len = 0;
        int err;

An indirect call is made to udp_recvmsg(). Note that the flags are repartitioned into the no-block flag and the remaining flags originally passed in by the user. The true transport protocol returns the number of bytes of application data if the receive was successful and a negative error code or 0 if it was not.

        err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
                                   flags & ~MSG_DONTWAIT, &addr_len);

If no error occurred, the address length that was filled in by udp_recvmsg() is copied back to the msg->msg_namelen field.

        if (err >= 0)
            msg->msg_namelen = addr_len;
        return err;
    }

The udp_recvmsg() function

The udp_recvmsg() function, defined in net/ipv4/udp.c, is the front end to the UDP transport receive mechanism.

    /*
     * This should be easy, if there is something there we
     * return it, otherwise we block.
     */

    static int udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
                           size_t len, int noblock, int flags, int *addr_len)
    {
        struct inet_sock *inet = inet_sk(sk);

The variable sin is declared to be a pointer of type struct sockaddr_in and is set to the value of msg->msg_name.

        struct sockaddr_in *sin = (struct sockaddr_in *)msg->msg_name;
        struct sk_buff *skb;
        int copied, err;

If MSG_ERRQUEUE is specified in flags, data is received from the error queue of the socket. You don't need to support this facility.

        if (flags & MSG_ERRQUEUE)
            return ip_recv_error(sk, msg, len);

A single UDP packet is retrieved from the receive queue associated with the struct sock by skb_recv_datagram().
This is a handy function for your protocol to use. You do not need to support the try_again strategy. The try_again facility is new in kernel 2.6. It is used to automatically read another packet if the packet at the head of the queue contains a checksum error. How could this be? Didn't we do checksums in udp_deliver?

    try_again:
        skb = skb_recv_datagram(sk, flags, noblock, &err);

Setting up for a copy to user space

On return to udp_recvmsg() a test is made to see if an sk_buff pointer was returned by skb_recv_datagram(). If not, a jump is made to an error exit point.

        if (!skb)
            goto out;

If a buffer pointer was returned, it is necessary to copy the data back to user space and release the buffer. Since skb->len includes the length of the UDP header at this point (but no longer the MAC or IP header), copied denotes the length of the user data.

        copied = skb->len - sizeof(struct udphdr);

If the data to be returned to the user exceeds the size of the buffer provided (which is presently held in len), the value of copied is adjusted downward to len and the MSG_TRUNC flag is set. You DO need to perform this operation.

        if (copied > len) {
            copied = len;
            msg->msg_flags |= MSG_TRUNC;
        }

Copying the data to user space

As with send, there are multiple copy mechanisms that depend upon the need to validate the UDP checksum. There are three cases here:

(1) No checksumming needed at all.
(2) Checksumming needed but the packet must be truncated.
(3) Checksumming needed but no truncation.

If checksumming is not required, skb_copy_datagram_iovec() copies the data to the user space buffer described by the iov. The sizeof(struct udphdr) is the offset at which copying should start. This function is exported to modules and is a good choice for you to use.
        if (skb->ip_summed==CHECKSUM_UNNECESSARY) {
            err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov,
                                          copied);

If it is necessary to truncate the message and a checksum is required, then it is first necessary to call __udp_checksum_complete() to verify the UDP checksum over the entire packet. In case of a checksum error, control passes to the checksum error handler, which discards the sk_buff. In case of success only the part of the packet that will fit into the buffer provided is copied.

        } else if (msg->msg_flags&MSG_TRUNC) {
            if (__udp_checksum_complete(skb))
                goto csum_copy_err;
            err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov,
                                          copied);

Copy and checksum entire packet

In the final case a checksum is required and the entire packet is to be copied. Here skb_copy_and_csum_datagram_iovec() verifies the checksum and copies data from the sk_buff to the I/O vector.

        } else {
            err = skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov);

            if (err == -EINVAL)
                goto csum_copy_err;
        }

The copy can fail if, for example, the user passes an invalid buffer pointer. You need to check for this condition as well.

        if (err)
            goto out_free;

The sock_recv_timestamp() function records the time at which the sk_buff was received. Your transport protocol should do this.

        sock_recv_timestamp(msg, sk, skb);

Returning the sender address to the application

The sender address is copied back to the struct sockaddr_in from elements of the sk_buff. Note that this operation depends on the h and nh pointers in the sk_buff being set up properly. Your protocol needs to perform this operation as well (if and only if sin is not NULL).

        /* Copy the address. */
        if (sin)
        {
            sin->sin_family = AF_INET;
            sin->sin_port = skb->h.uh->source;
            sin->sin_addr.s_addr = skb->nh.iph->saddr;
            memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
        }

Depending on the control message flags specified, the corresponding ancillary control data is collected by ip_cmsg_recv(). However, sys_recvfrom() does not get any such data.

        if (inet->cmsg_flags)
            ip_cmsg_recv(msg, skb);

On a normal reception the length of data actually copied back to user space is the return value. If the caller passed MSG_TRUNC in flags, the full length of the datagram's user data is returned instead, even if less was copied. Your protocol needs to perform this computation.

        err = copied;
        if (flags & MSG_TRUNC)
            err = skb->len - sizeof(struct udphdr);

It is the responsibility of the recvmsg function to free the sk_buff.

    out_free:
        skb_free_datagram(sk, skb);
    out:
        return err;

Handling checksum errors

The bad datagram is counted and discarded. A non-blocking caller gets -EAGAIN, while a blocking caller retries with the next datagram.

    csum_copy_err:
        UDP_INC_STATS_BH(UDP_MIB_INERRORS);

        skb_kill_datagram(sk, skb, flags);

        if (noblock)
            return -EAGAIN;
        goto try_again;
    }

The skb_recv_datagram() function

The skb_recv_datagram() function is defined in net/core/datagram.c. Its mission is to cause the calling application to sleep until a packet arrives and has been enqueued by udp_rcv.

    struct sk_buff *skb_recv_datagram(struct sock *sk, unsigned flags,
                                      int noblock, int *err)
    {
        struct sk_buff *skb;
        long timeo;

The function sock_error() simply returns the negative of the last value to have been stored in sk->sk_err and then zeroes the field.

        /*
         * Caller is allowed not to check sk->sk_err before skb_recv_datagram()
         */
        int error = sock_error(sk);

        if (error)
            goto no_packet;

The value returned by sock_rcvtimeo() determines the time to wait (in ticks) for data if the received packet queue is presently empty. sk->sk_rcvtimeo is set to the default value of MAX_SCHEDULE_TIMEOUT by sys_socket. This is the maximum value of a signed long, matching the type of timeo. The units are jiffies, but this is effectively a wait forever.
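To make the timeout selection concrete, here is a minimal user-space model of the choice just described. It is a sketch under assumptions, not kernel code: it assumes that sock_rcvtimeo() returns zero for a non-blocking call and the socket's stored timeout otherwise, and the names rcv_timeout and DEFAULT_RCVTIMEO are hypothetical stand-ins introduced only for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the kernel's MAX_SCHEDULE_TIMEOUT, assumed here
 * to be the largest signed long, i.e. "wait forever". */
#define DEFAULT_RCVTIMEO LONG_MAX

/* User-space model of the timeout choice: a non-blocking caller never waits,
 * while a blocking caller waits for the socket's receive timeout (in ticks). */
static long rcv_timeout(long sk_rcvtimeo, int noblock)
{
    assert(sk_rcvtimeo >= 0);   /* the model only handles non-negative timeouts */
    return noblock ? 0 : sk_rcvtimeo;
}
```

Under these assumptions, rcv_timeout(DEFAULT_RCVTIMEO, 1) yields 0, so an empty queue is reported to a non-blocking caller immediately, while rcv_timeout(DEFAULT_RCVTIMEO, 0) yields the effectively infinite default.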
        timeo = sock_rcvtimeo(sk, noblock);

The main receive loop

On return to skb_recv_datagram() the main receive loop is entered. Exit from the loop will occur when:

• a datagram has been successfully received
• a timeout occurs (either instantly or after a very long wait)
• an error occurs
• a signal is received

        do {
            /* Again only user level code calls this function, so nothing
             * interrupt level will suddenly eat the receive_queue.
             *
             * Look at current nfs client by the way...
             * However, this function was corrent in any case. 8)
             */

If MSG_PEEK is specified in flags, skb_peek() is called. It is passed a pointer to the receive queue header. If the receive queue is non-empty, skb_peek() returns a pointer to the first sk_buff without dequeuing it from the receive queue.

            if (flags & MSG_PEEK) {
                unsigned long cpu_flags;

                spin_lock_irqsave(&sk->sk_receive_queue.lock,
                                  cpu_flags);
                skb = skb_peek(&sk->sk_receive_queue);

Note that the user count of the buffer is incremented here. This is necessary because when the buffer is returned to the peeker, the count will be decremented, and, since the buffer still resides on the queue, it must not actually be freed.

                if (skb)
                    atomic_inc(&skb->users);
                spin_unlock_irqrestore(&sk->sk_receive_queue.lock,
                                       cpu_flags);

Dequeuing the buffer

If the MSG_PEEK flag is not specified, skb_dequeue() is called. If the queue is non-empty, skb_dequeue() will remove the first sk_buff from the list and return a pointer to it. Otherwise it will return NULL. Queue management is not via standard Linux list structures, and the received packet queue is rooted at sk->sk_receive_queue, which is an element of type sk_buff_head. The __skb_dequeue() function does the work of actually removing an sk_buff from the receive queue.

            } else
                skb = skb_dequeue(&sk->sk_receive_queue);

On return to skb_recv_datagram(), if a pointer to an sk_buff is received, it is returned to udp_recvmsg().
            if (skb)
                return skb;

            /* User doesn't want to wait */
            error = -EAGAIN;
            if (!timeo)
                goto no_packet;

Otherwise, wait for data arrival. This call will cause the calling process to sleep.

        } while (!wait_for_packet(sk, err, &timeo));

        return NULL;

    no_packet:
        *err = error;
        return NULL;
    }

Waiting for packet arrival

The wait_for_packet() function is defined in net/core/datagram.c. Because of the race condition previously described, it eschews the use of the kernel service routines designed to provide sleep/wakeup services and implements them internally.

    /*
     * Wait for a packet..
     */
    static int wait_for_packet(struct sock *sk, int *err, long *timeo_p)
    {
        int error;
        DEFINE_WAIT(wait);

The current process sets its state to TASK_INTERRUPTIBLE and adds itself to the queue of waiting processes. A significant amount of additional processing occurs before the process actually goes to sleep, though. sk->sk_sleep is of type wait_queue_head_t.

        prepare_to_wait_exclusive(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);

The sk_err field of the struct sock is checked for any errors.

        /* Socket errors? */
        error = sock_error(sk);
        if (error)
            goto out_err;

The receive queue is tested to see whether it is still empty. If it is not, a jump is taken to the end of the function.

        if (!skb_queue_empty(&sk->sk_receive_queue))
            goto out;

See if the shutdown flag of the sk (struct sock) has been set, indicating that some manner of receive close is in progress.

        /* Socket shut down? */
        if (sk->sk_shutdown & RCV_SHUTDOWN)
            goto out_noerr;

Preparing for the sleep

Since a SOCK_DGRAM type socket is connectionless, we always get past this if-statement. Recall that this function is called from skb_recv_datagram(), which can be used by protocols other than UDP.

        /* Sequenced packets can come disconnected.
         * If so we report the problem
         */
        error = -ENOTCONN;
        if (connection_based(sk) &&
            !(sk->sk_state == TCP_ESTABLISHED || sk->sk_state == TCP_LISTEN))
            goto out_err;

The connection_based() function is defined in net/core/datagram.c.

    static inline int connection_based(struct sock *sk)
    {
        return (sk->sk_type==SOCK_SEQPACKET || sk->sk_type==SOCK_STREAM);
    }

The signal_pending() function checks the signal-pending flag of the current task. If this flag is set, the value 1 is returned.

        /* handle signals */
        if (signal_pending(current))
            goto interrupted;

Finally schedule_timeout() is called to give up control to the scheduler. Control will not return here until a packet is received, a timeout occurs, or a signal is received.

        error = 0;
        *timeo_p = schedule_timeout(*timeo_p);

Waking up from the sleep

This is the standard wakeup action. Restore the task state and remove it from the wait queue.

    out:
        finish_wait(sk->sk_sleep, &wait);
        return error;
    interrupted:
        error = sock_intr_errno(*timeo_p);
    out_err:
        *err = error;
        goto out;
    out_noerr:
        *err = 0;
        error = 1;
        goto out;
    }

Transferring data from an sk_buff to an I/O vector

This procedure appears (and is) far more complex in the worst case than is typically the case in practice. The problem lies with the implementation of the sk_buff, which in the worst case can consist of fragment lists and unmapped page buffers.

The buffer header of type struct sk_buff contains two members, namely len and data_len, used to describe the length of the received packet. The skb->len field denotes the amount of data in the packet that remains to be processed. That is, it is initially set to the length of all headers and application data. As headers are removed while the packet is passed up the stack, the value of skb->len is decremented by the length of each network header removed.
The value of skb->data_len is the amount of data that is held in fragments and in chained sk_buffs.

The skb_copy_datagram_iovec() function

The function skb_copy_datagram_iovec(), defined in net/core/datagram.c, is used to copy a UDP datagram when checksumming is not required. It will copy from skb->data + offset. In the usual case the value of offset is the size of the UDP header and skb->data_len is 0.

    /**
     * skb_copy_datagram_iovec - Copy a datagram to an iovec.
     * @skb: buffer to copy
     * @offset: offset in the buffer to start copying from
     * @to: io vector to copy to
     * @len: amount of data to copy from buffer to iovec
     *
     * Note: the iovec is modified during the copy.
     */
    int skb_copy_datagram_iovec(const struct sk_buff *skb, int offset,
                                struct iovec *to, int len)
    {

It is the case that skb->len includes the data in the kmalloc'd area, but skb->data_len includes only that which is in the fragment chain or unmapped page buffers. Thus start is set here to the amount of data in the kmalloc'd part of the buffer, that is, skb->len - skb->data_len.

        int start = skb_headlen(skb);
        int i, copy = start - offset;

The "Copy header" comment in the code that follows is misleading and apparently reflects the philosophy that the kmalloc'd part of the sk_buff structure is for storage of network header elements. What is actually happening in the case of UDP is that memcpy_toiovec() is being passed a pointer to the start of the user data along with the length of the user data. In the standard case (no fragments) the value of len becomes 0 at the (len -= copy) == 0 test and the function returns.

        /* Copy header. */
        if (copy > 0) {
            if (copy > len)
                copy = len;
            if (memcpy_toiovec(to, skb->data + offset, copy))
                goto fault;
            if ((len -= copy) == 0)
                return 0;
            offset += copy;
        }

Insufficient data in the kmalloc'd part

skb_shinfo is defined in include/linux/skbuff.h. It simply returns a pointer to the skb_shared_info structure that is pointed to by skb->end.
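The arithmetic of the first copy step shown above can be tried out in user space. The following is a minimal sketch, not kernel code: the struct linear_step type and the linear_copy() function are hypothetical names introduced here, and no data is moved. Only the computation of how much comes out of the kmalloc'd area and how much is still owed is modelled.

```c
#include <assert.h>

/* Result of the first ("copy header") step of the copy routine. */
struct linear_step {
    int copied;     /* bytes taken from the kmalloc'd (linear) area */
    int remaining;  /* bytes still owed to the caller afterwards */
};

/* User-space model of the linear-area step: headlen is the amount of data in
 * the kmalloc'd area, offset is where copying starts, and len is the number
 * of bytes requested. */
static struct linear_step linear_copy(int headlen, int offset, int len)
{
    struct linear_step r = { 0, len };
    int copy = headlen - offset;    /* bytes available past the offset */

    assert(headlen >= 0 && offset >= 0 && len >= 0);
    if (copy > 0) {
        if (copy > len)
            copy = len;             /* never copy more than was requested */
        r.copied = copy;
        r.remaining = len - copy;
    }
    return r;
}
```

For an unfragmented datagram with an 8-byte UDP header and 100 bytes of user data, linear_copy(108, 8, 100) reports 100 bytes copied and nothing remaining, which is the common case in which the function returns early. With only 40 bytes of user data in the kmalloc'd area, linear_copy(48, 8, 100) leaves 60 bytes to be found in page buffers or chained sk_buffs.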
If there exist unmapped page buffers, skb_copy_datagram_iovec() will continue and copy data from the unmapped page buffers into the I/O vector.

        /* Copy paged appendix. Hmm... why does this look so complicated? */
        for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
            int end;

            BUG_TRAP(start <= offset + len);

In the first iteration of this loop start contains the offset, from the start of the packet data (including the UDP header), of the beginning of the paged appendix. Thus end is set to the offset of the first byte beyond the data in this element of the paged appendix, and copy is set to the amount of data in this element that lies at or beyond offset.

            end = start + skb_shinfo(skb)->frags[i].size;
            if ((copy = end - offset) > 0) {
                int err;
                u8 *vaddr;
                skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
                struct page *page = frag->page;

                if (copy > len)
                    copy = len;

Obtain a kernel virtual address for the page with kmap(). Copy data from the fragment into the I/O vector using memcpy_toiovec().

                vaddr = kmap(page);
                err = memcpy_toiovec(to, vaddr + frag->page_offset +
                                     offset - start, copy);
                kunmap(page);
                if (err)
                    goto fault;
                if (!(len -= copy))
                    return 0;
                offset += copy;
            }
            start = end;
        }

Processing the fragment chain

Finally, if there exist additional sk_buffs in the chain, they are processed via a recursive call to skb_copy_datagram_iovec().
        if (skb_shinfo(skb)->frag_list) {
            struct sk_buff *list = skb_shinfo(skb)->frag_list;

            for (; list; list = list->next) {
                int end;

                BUG_TRAP(start <= offset + len);

                end = start + list->len;
                if ((copy = end - offset) > 0) {
                    if (copy > len)
                        copy = len;
                    if (skb_copy_datagram_iovec(list,
                                                offset - start,
                                                to, copy))
                        goto fault;
                    if ((len -= copy) == 0)
                        return 0;
                    offset += copy;
                }
                start = end;
            }
        }
        if (!len)
            return 0;

    fault:
        return -EFAULT;
    }

The actual copy to user space

The memcpy_toiovec() function is defined in net/core/iovec.c. It copies kernel data into an I/O vector. Note that as data is copied to the iovec, the len field of the element which is the recipient is decremented and the base pointer is incremented. This strategy makes it possible, albeit slightly inefficient, for callers that are passing multiple fragments of a packet to be copied to always just pass the base address of the iovec. Elements that have been previously filled will just be bypassed in the while loop because the if (iov->iov_len) test will find that such elements have iov_len equal to 0. It is necessary that len not exceed the remaining capacity of the iov. It would be possible for your receive function to call this directly, but it is better to use the highest level interface that is provided by the kernel.

    int memcpy_toiovec(struct iovec *iov, unsigned char *kdata, int len)
    {
        while (len > 0) {
            if (iov->iov_len) {
                int copy = min_t(unsigned int, iov->iov_len, len);
                if (copy_to_user(iov->iov_base, kdata, copy))
                    return -EFAULT;
                kdata += copy;
                len -= copy;
                iov->iov_len -= copy;
                iov->iov_base += copy;
            }
            iov++;
        }

        return 0;
    }
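The way the iovec is consumed across repeated calls is easier to see in a runnable form. The following is a minimal user-space sketch, not the kernel routine: model_memcpy_toiovec() is a hypothetical name, memcpy() stands in for copy_to_user() so the -EFAULT path is not modelled, and struct iovec comes from the standard <sys/uio.h> header.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* User-space model of the copy loop.  As with the kernel version, the caller
 * must ensure that len does not exceed the remaining capacity of the iovec
 * array; otherwise the loop would walk past its end. */
static int model_memcpy_toiovec(struct iovec *iov, const unsigned char *kdata, int len)
{
    assert(iov != NULL && kdata != NULL);

    while (len > 0) {
        if (iov->iov_len) {
            size_t copy = iov->iov_len < (size_t)len ? iov->iov_len : (size_t)len;

            memcpy(iov->iov_base, kdata, copy);
            kdata += copy;
            len -= (int)copy;
            iov->iov_len -= copy;                           /* capacity consumed */
            iov->iov_base = (char *)iov->iov_base + copy;   /* next free byte */
        }
        iov++;  /* elements that are already full are simply skipped */
    }
    return 0;
}
```

Given a two-element iovec with capacities of 4 and 8 bytes, a first call with 6 bytes fills the first element and places 2 bytes in the second. A second call passing the same iovec base address skips the exhausted first element and appends to the second, which is the behaviour that lets a caller copy a packet piece by piece without tracking its position in the iovec.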