UDP Bind and Connect

UDP bind

The bind() system call is used to associate a local address with a socket. Addresses in the AF_INET family are specified using struct sockaddr_in. The sys_bind() function is defined in net/socket.c. The mechanisms by which it is invoked and by which its arguments are passed were discussed in the previous section on UDP socket creation. sys_bind() takes the following arguments:

fd        File (socket) descriptor.
umyaddr   Pointer to a struct sockaddr_in containing the address family, local IP address, and port number.
addrlen   Size of struct sockaddr_in.

The sys_bind() function

This wrapper function relies upon helpers to:

• recover the struct socket pointer from the fd
• copy the struct sockaddr_in from user space to kernel space
• invoke the protocol family dependent bind handler

    asmlinkage long sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen)
    {
        struct socket *sock;
        char address[MAX_SOCK_ADDR];    // bind address copied here
        int err, fput_needed;

The sockfd_lookup_light() function returns a pointer to the struct socket corresponding to the fd.

        if((sock = sockfd_lookup_light(fd, &err, &fput_needed))!=NULL)
        {

The move_addr_to_kernel() function copies the address from user space (umyaddr) to kernel space (address) using the copy_from_user() function.

            if((err=move_addr_to_kernel(umyaddr, addrlen,address))>=0) {
                err = security_socket_bind(sock, (struct sockaddr *)address, addrlen);
                if (!err)

If the address copy succeeds and the security hook raises no objection, the bind function registered for the given socket type is invoked. In our case this function will be inet_bind().

                    err = sock->ops->bind(sock,
                        (struct sockaddr *)address, addrlen);
            }
            fput_light(sock->file, fput_needed);
        }
        return err;
    }

Recovering the struct socket from the fd handle

The sockfd_lookup_light() function is defined in net/socket.c. Its mission is to map the fd to the address of the inode that contains the socket.
    static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
    {
        struct file *file;
        struct socket *sock;

        *err = -EBADF;
        file = fget_light(fd, fput_needed);

If there is an open struct file associated with fd, the fget_light() function will return a pointer to it.

        if (file) {
            sock = sock_from_file(file, err);
            if (sock)
                return sock;
            fput_light(file, *fput_needed);
        }
        return NULL;
    }

Recovering the struct file pointer from the fd handle

The fget_light() function defined in fs/file_table.c retrieves the file pointer from the files_struct structure pointed to by the current process's task_struct. On return from fcheck_files(), if fd indexed a socket or file that was no longer open, file will be NULL.

    /*
     * Lightweight file lookup - no refcnt increment if fd table isn't shared.
     * You can use this only if it is guranteed that the current task already
     * holds a refcnt to that file. That check has to be done at fget() only
     * and a flag is returned to be passed to the corresponding fput_light().
     * There must not be a cloning between an fget_light/fput_light pair.
     */
    struct file fastcall *fget_light(unsigned int fd, int *fput_needed)
    {
        struct file *file;
        struct files_struct *files = current->files;

        *fput_needed = 0;
        if (likely((atomic_read(&files->count) == 1))) {
            file = fcheck_files(files, fd);
        } else {
            rcu_read_lock();
            file = fcheck_files(files, fd);
            if (file) {
                if (atomic_inc_not_zero(&file->f_count))
                    *fput_needed = 1;
                else
                    /* Didn't get the reference, someone's freed */
                    file = NULL;
            }
            rcu_read_unlock();
        }

        return file;
    }

Validating the fd handle

fcheck_files(), defined in linux/file.h, checks the input fd to ensure that it is not out of range and returns the struct file pointer from the fd array. If there is no open file associated with the slot indexed by fd, the return value will be NULL.
    static inline struct file * fcheck_files(struct files_struct *files, unsigned int fd)
    {
        struct file * file = NULL;
        struct fdtable *fdt = files_fdtable(files);

        if (fd < fdt->max_fds)
            file = rcu_dereference(fdt->fd[fd]);
        return file;
    }

Recovering the socket pointer

The struct socket can be reached directly through the private_data pointer in the struct file if the f_op pointer points to socket_file_ops. If not, the dentry can be used to access the inode, from which the socket address can be recovered. This seems to be an "unusual" path and I'm not sure what triggers it.

    static struct socket *sock_from_file(struct file *file, int *err)
    {
        struct inode *inode;
        struct socket *sock;

        if (file->f_op == &socket_file_ops)
            return file->private_data;    /* set in sock_map_fd */

        inode = file->f_dentry->d_inode;
        if (!S_ISSOCK(inode->i_mode)) {
            *err = -ENOTSOCK;
            return NULL;
        }

        sock = SOCKET_I(inode);
        if (sock->file != file) {
            printk(KERN_ERR "socki_lookup: socket file changed!\n");
            sock->file = file;
        }
        return sock;
    }

Copying the sockaddr_in to kernel space

The variables uaddr and ulen are copies of the parameters passed by the user. Thus uaddr is a user space address. The value of kaddr is the address of the char address[] array that is allocated on the stack of sys_bind(). Note that the implementation correctly does not trust the ulen value and thus prevents a buffer overflow that could corrupt the stack.

Buffer overflows on the stack are exceedingly dangerous. Since the stack grows down, a buffer overflow will grow back toward the top of the stack. Such an overflow will overlay the return address field on the stack and thus can cause a return to arbitrary code that has been installed by the hacker!
The copy_from_user() and copy_to_user() functions do their own validity checking of the user space pointers for read/write access and return 0 if the data was successfully copied.

    int move_addr_to_kernel(void __user *uaddr, int ulen, void *kaddr)
    {
        if(ulen<0||ulen>MAX_SOCK_ADDR)
            return -EINVAL;
        if(ulen==0)
            return 0;
        if(copy_from_user(kaddr,uaddr,ulen))
            return -EFAULT;
        return audit_sockaddr(ulen, kaddr);
    }

The inet_bind() function

For a SOCK_DGRAM type socket, sock->ops and sock->sk->sk_prot were set by the function inet_create() (called by sys_socket()) to point to inet_dgram_ops and udp_prot respectively. Therefore, sock->ops->bind(...) translates to inet_bind(...), and a bind operation on a SOCK_DGRAM type socket results in a call to inet_bind().

    int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
    {
        struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
        struct sock *sk = sock->sk;
        struct inet_sock *inet = inet_sk(sk);
        unsigned short snum;
        int chk_addr_ret;
        int err;

        /* If the socket has its own bind function then use it. (RAW) */

The udp_prot structure, which is defined in net/ipv4/udp.c and is here pointed to by sk->sk_prot, exports no internal bind function. Thus the following call is not made and the "standard bind" is performed instead. You also don't need to provide a bind handler.

        if (sk->sk_prot->bind) {
            err = sk->sk_prot->bind(sk, uaddr, addr_len);
            goto out;
        }

Conduct a sanity check on the length field.

        err = -EINVAL;
        if (addr_len < sizeof(struct sockaddr_in))
            goto out;

It is next necessary to determine the type of the local IP address being bound. Broadcast, multicast, local, unicast, and NAT are distinguished. The function inet_addr_type() is used to determine the type. This function will be considered in detail in the discussion of output routing.
        chk_addr_ret = inet_addr_type(addr->sin_addr.s_addr);

Nonlocal binding

Nonlocal binding is problematic, as described in this comment located at the point of return from inet_addr_type() to inet_bind(). The problem is that you really don't want applications to bind to addresses that don't exist on this system, because such a binding will never permit successful delivery of a packet. On the other hand, some types of interfaces will refuse to come up if the associated link is down. In that case the normally local address will not exist in the local table and the route type will not be set to RTN_LOCAL. The hack here allows you to permit or proscribe nonlocal binding.

        /* Not specified by any standard per-se, however it breaks too
         * many applications when removed.  It is unfortunate since
         * allowing applications to make a non-local bind solves
         * several problems with systems using dynamic addressing.
         * (ie. your servers still start up even if your ISDN link
         *  is temporarily down)
         */
        err = -EADDRNOTAVAIL;
        if (!sysctl_ip_nonlocal_bind &&
            !inet->freebind &&
            addr->sin_addr.s_addr != INADDR_ANY &&
            chk_addr_ret != RTN_LOCAL &&
            chk_addr_ret != RTN_MULTICAST &&
            chk_addr_ret != RTN_BROADCAST)
            goto out;

Binding to privileged ports

Port numbers smaller than PROT_SOCK, defined in include/net/sock.h, may not be bound by unprivileged programs. You do need to be aware of where and when to use ntohs().

    #define PROT_SOCK 1024

If the requested port is privileged and the caller lacks CAP_NET_BIND_SERVICE, EACCES is returned.

        snum = ntohs(addr->sin_port);
        err = -EACCES;
        if (snum && snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE))
            goto out;

Assigning the local IP address

The comment below means that saddr is the address that is used as the source address in outgoing IP packets. The lock_sock() function obtains a lock on the struct sock. It is undone with release_sock().

        /* We keep a pair of addresses.
         * rcv_saddr is the one
         * used by hash lookups, and saddr is used for transmit.
         *
         * In the BSD API these are the same except where it
         * would be illegal to use them (multicast/broadcast) in
         * which case the sending device address is used.
         */
        lock_sock(sk);

        /* Check these errors (active socket, double bind). */
        err = -EINVAL;
        if (sk->sk_state != TCP_CLOSE || inet->num)
            goto out_release_sock;

Both of the local IP addresses contained in the struct sock are initially set to 0. The distinction between their normal usage is described in the comment above. The value of saddr is the source address in transmitted packets. For received packets, if rcv_saddr is not 0, it must match the destination address in the packet for the packet to be deliverable to this socket.

        inet->rcv_saddr = inet->saddr = addr->sin_addr.s_addr;

The value of inet->saddr normally holds the source IP address to be used on transmitted packets. However, if inet->saddr == 0 then the IP address associated with the outgoing device is used. If the local address to which the socket is being bound is a multicast or broadcast address, then the address associated with the device must be used as the source address for transmit operations.

        if (chk_addr_ret == RTN_MULTICAST || chk_addr_ret == RTN_BROADCAST)
            inet->saddr = 0;    /* Use device */

Port allocation

For UDP sockets, an indirect call to the udp_v4_get_port() function is made here. If snum is zero, an available port in the ephemeral port space will be assigned. If snum is nonzero, an attempt to allocate the specified port will be made. You must provide a get_port() function and it must reject attempts to bind to ports that are already bound.

        /* Make sure we are allowed to bind here.
         */
        if (sk->sk_prot->get_port(sk, snum)) {
            inet->saddr = inet->rcv_saddr = 0;
            err = -EADDRINUSE;
            goto out_release_sock;
        }

On return to inet_bind(), flags are set to indicate whether the struct sock is bound to a specific local IP address and a specific local port. These flags will prevent a disconnect from unbinding.

        if (inet->rcv_saddr)
            sk->sk_userlocks |= SOCK_BINDADDR_LOCK;
        if (snum)
            sk->sk_userlocks |= SOCK_BINDPORT_LOCK;

Destruction of existing connections

The port number is converted to network byte order and saved in inet->sport, and any existing connection is destroyed by zeroing daddr and dport. The value of inet->num is set in the get_port() function.

        inet->sport = htons(inet->num);
        inet->daddr = 0;
        inet->dport = 0;
        sk_dst_reset(sk);
        err = 0;

The release_sock macro defined in include/net/sock.h calls the __release_sock() function if the backlog queue of the socket is not empty. Based upon a relatively detailed search of the code, it appears that the backlog queue is used only by TCP. However, this function is the generic inet_bind().

    out_release_sock:
        release_sock(sk);
    out:
        return err;
    }

Allocation of an ephemeral port

All sockets must be bound before they can be used to transmit or receive any data! If the socket is not explicitly bound, it will be automatically bound by inet_autobind(), which will pass the get_port() function a 0 in snum, indicating that any port is acceptable.

Who would have imagined that the simple act of consuming one port from the port space could be so complicated? However, the objective here is a good one. It is to balance the length of the hash queues so as to minimize the time required to look up the address of the associated struct sock from the port number in the UDP header of an incoming packet.

You will need structures such as the following in your copport.c module. Note that they must be global and not local to the cop_sock.
I used a bit map to keep track of allocated and free ports without having to navigate the hash chains.

    #define UDP_HTABLE_SIZE 128

    struct hlist_head udp_hash[UDP_HTABLE_SIZE];
    DEFINE_RWLOCK(udp_hash_lock);

    int udp_port_rover;

    static int udp_v4_get_port(struct sock *sk, unsigned short snum)
    {
        struct hlist_node *node;
        struct sock *sk2;
        struct inet_sock *inet = inet_sk(sk);

        write_lock_bh(&udp_hash_lock);
        if (snum == 0) {
            int best_size_so_far, best, result, i;

            if (udp_port_rover > sysctl_local_port_range[1] ||
                udp_port_rover < sysctl_local_port_range[0])
                udp_port_rover = sysctl_local_port_range[0];

The variables local to this block are used as follows:

• result – current port number under test
• best – port number belonging to the shortest hash queue
• best_size_so_far – length of the shortest hash queue seen so far

            best_size_so_far = 32767;
            best = result = udp_port_rover;

Finding the shortest hash queue

The variable result holds what we will call the candidate port. One iteration through the following loop is made for each queue in the UDP hash table.

            for (i = 0; i < UDP_HTABLE_SIZE; i++, result++) {
                struct hlist_head *list;
                int size;

If result hashes to an empty list, the best possible case has been found and the candidate port is accepted without further ado. Note that result is being incremented as we proceed through this loop. Therefore, it is possible that it is now larger than the maximum value specified by sysctl_local_port_range[1]. If that turns out to be true, it is mapped back into the legitimate port space in a way that ensures it hashes to its original slot. This guarantees that the remapped port is still available because, if it were not, the hash queue would not have been empty in the first place.
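The claim in the preceding paragraph, that folding an out-of-range candidate back into the port range leaves its hash slot unchanged, can be checked in user space. The following is a minimal sketch, not kernel code: slot_of() and remap_port() are hypothetical names, HTABLE_SIZE mirrors UDP_HTABLE_SIZE, and the range bounds are passed as parameters rather than read from sysctl_local_port_range.

```c
/* Assumption: mirrors UDP_HTABLE_SIZE from the listing (a power of two). */
#define HTABLE_SIZE 128u

/* Hash slot of a port: its low seven bits. Hypothetical helper. */
unsigned slot_of(unsigned port)
{
    return port & (HTABLE_SIZE - 1);
}

/*
 * Fold a candidate that has run past 'high' back into the range,
 * using the same arithmetic as the kernel listing. Hypothetical helper;
 * callers are assumed to pass result >= low.
 */
unsigned remap_port(unsigned result, unsigned low, unsigned high)
{
    if (result > high)
        result = low + ((result - low) & (HTABLE_SIZE - 1));
    return result;
}
```

For example, with a range of 32768 to 61000, candidate 61005 folds back to 32845, and both hash to slot 77. The identity holds because adding low and then masking with HTABLE_SIZE - 1 cancels the earlier subtraction of low. The sketch assumes the range spans at least HTABLE_SIZE ports, so that the folded value stays inside it.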
                list = &udp_hash[result & (UDP_HTABLE_SIZE - 1)];
                if (hlist_empty(list)) {
                    if (result > sysctl_local_port_range[1])
                        result = sysctl_local_port_range[0] +
                            ((result - sysctl_local_port_range[0]) &
                             (UDP_HTABLE_SIZE - 1));
                    goto gotit;
                }

Current hash queue not empty

Arrival here means that the hash queue to which result hashes was not empty. The variable best_size_so_far contains the length of the shortest hash queue that has been found in the table. The variable size is used to count the length of the hash queue associated with the current candidate port. When size reaches best_size_so_far, the counting loop gives up. The value best is used to save the best candidate port seen so far. It seems as though it would have been simpler to store the current length of each hash queue rather than count it, so that is what I did.

                size = 0;
                sk_for_each(sk2, node, list)
                    if (++size >= best_size_so_far)
                        goto next;
                best_size_so_far = size;
                best = result;
            next:;
            }

No empty hash queues

At this point it is known that there are no empty hash queues and that the candidate port best is associated with the shortest hash queue. What is not known is whether the best port is available and is in the required range. There are at most 2^16 / 2^7 = 2^9 ports that share the same hash queue as best. This for loop looks at all of them. In fact, since sysctl_local_port_range does not span the whole port space, it may look at some of them more than once!

If there is a free port that maps to the shortest hash queue, this loop must find it since it looks at them all. It is almost the case that if there is any free port this search will succeed. Since we are searching the ports associated with the shortest hash queue, it might seem as though the only way the search could fail is if all hash queues were equally populated and the port space was exhausted.
However, it does seem conceivable that the shortest hash queue could be such that all of its free ports were outside sysctl_local_port_range. In practice this does not appear to be a problem.

            result = best;
            for(i = 0; i < (1 << 16) / UDP_HTABLE_SIZE; i++, result += UDP_HTABLE_SIZE) {
                if (result > sysctl_local_port_range[1])
                    result = sysctl_local_port_range[0]
                        + ((result - sysctl_local_port_range[0]) &
                           (UDP_HTABLE_SIZE - 1));
                if (!udp_lport_inuse(result))
                    break;
            }

If no free ports exist on the shortest hash chain, i will have reached its limit and port allocation fails.

            if (i >= (1 << 16) / UDP_HTABLE_SIZE)
                goto fail;

Save the port number selected and update the udp_port_rover. Why not add 1 to the rover?

    gotit:
            udp_port_rover = snum = result;

Handling requests for a specific port

This block of code handles requests for a specific port. The strategy used is to scan the hash queue associated with the requested port. If the requested port is found to be already in use, it may still be assigned, depending upon the outcome of the lengthy if statement below. Failure occurs if

• the port number requested is used in an existing struct sock and
• the two struct socks are different (i.e. this is not a rebind, though I thought those were illegal) and
• the existing struct sock is not an IPv6-only socket and
• the bound_dev_ifs are compatible (either one is 0 or both are the same) and
• the local addresses are compatible (either one of the rcv_saddr values is wildcarded to 0, or both are specific and identical) and
• either one of the struct socks has its reuse flag (sk_reuse) set to 0.

The basic idea seems to be that if there is sufficient differentiation to permit the demux algorithm to succeed, port overloading will be permitted.

You do not need to worry about bound_dev_ifs. You should simply fail if the port is bound, period.
        } else {
            sk_for_each(sk2, node,
                        &udp_hash[snum & (UDP_HTABLE_SIZE - 1)]) {
                struct inet_sock *inet2 = inet_sk(sk2);

                if (inet2->num == snum &&
                    sk2 != sk &&
                    !ipv6_only_sock(sk2) &&
                    (!sk2->sk_bound_dev_if ||
                     !sk->sk_bound_dev_if ||
                     sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
                    (!inet2->rcv_saddr ||
                     !inet->rcv_saddr ||
                     inet2->rcv_saddr == inet->rcv_saddr) &&
                    (!sk2->sk_reuse || !sk->sk_reuse))
                    goto fail;
            }
        }

Hashing the socket into the new chain

Here, the selected port number is stored in the inet->num field. Recall that this is the overloaded field that holds the protocol number for raw sockets. It remains unclear what the purpose of the test for unhashed is. If the struct sock is already in a hash queue here (possibly because this is a rebind), how do we know it is still in the correct hash queue??

You should use this basic approach to add the cop_sock to a hash queue. Make real sure that there are no execution paths that can lead to unbalanced write_lock_bh()/write_unlock_bh() calls.

        inet->num = snum;
        if (sk_unhashed(sk)) {
            struct hlist_head *h = &udp_hash[snum & (UDP_HTABLE_SIZE - 1)];

            sk_add_node(sk, h);
            sock_prot_inc_use(sk->sk_prot);
        }
        write_unlock_bh(&udp_hash_lock);
        return 0;

    fail:
        write_unlock_bh(&udp_hash_lock);
        return 1;
    }

Checking whether a port is in use

The inline function udp_lport_inuse() is defined in include/net/udp.h. Notice that if the port is not in use, it is necessary to run the whole hash chain to discover that fact. It seems like a bit map would be a much simpler, cleaner, and more efficient way to manage the allocation of port space.
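A minimal sketch of the bit map alternative suggested above, written as user-space C. It is an illustration under stated assumptions rather than the kernel's method: all names (port_bitmap, port_in_use(), port_claim(), port_release(), port_alloc_ephemeral()) are hypothetical, there is no locking (a kernel version would need protection comparable to udp_hash_lock), and the allocator takes the first free port instead of balancing hash queue lengths.

```c
#include <limits.h>
#include <stdint.h>

#define PORT_SPACE 65536u
#define WORD_BITS  (sizeof(unsigned long) * CHAR_BIT)

/* One bit per port; a set bit means the port is bound. Hypothetical. */
unsigned long port_bitmap[PORT_SPACE / WORD_BITS];

/* Return 1 if the port is bound, 0 otherwise: a single bit probe. */
int port_in_use(uint16_t port)
{
    return (int)((port_bitmap[port / WORD_BITS] >> (port % WORD_BITS)) & 1UL);
}

/* Claim a specific port; returns 0 on success, -1 if already bound. */
int port_claim(uint16_t port)
{
    if (port_in_use(port))
        return -1;
    port_bitmap[port / WORD_BITS] |= 1UL << (port % WORD_BITS);
    return 0;
}

/* Mark a port as free again. */
void port_release(uint16_t port)
{
    port_bitmap[port / WORD_BITS] &= ~(1UL << (port % WORD_BITS));
}

/* Claim the first free port in [low, high]; returns the port or -1. */
int port_alloc_ephemeral(unsigned low, unsigned high)
{
    for (unsigned p = low; p <= high && p < PORT_SPACE; p++)
        if (port_claim((uint16_t)p) == 0)
            return (int)p;
    return -1;
}
```

With such a map the in-use test is a single bit probe, in contrast to the chain walk performed by udp_lport_inuse(). The cost is a fixed 8 KB table and the loss of any per-address or per-device distinction, which fits the advice to simply fail if the port is bound.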
    static inline int udp_lport_inuse(u16 num)
    {
        struct sock *sk;
        struct hlist_node *node;

        sk_for_each(sk, node, &udp_hash[num & (UDP_HTABLE_SIZE - 1)])
            if (inet_sk(sk)->num == num)
                return 1;
        return 0;
    }

Socket reference counting

The sock_hold() function should be called when a new persistent pointer to a struct sock is established. The sock_put() function should be called when the reference is dropped. These calls are now made by the helper functions. Thus sock_hold() should be called when the struct sock is inserted into the hash queue and sock_put() called by the unhash function.

These comments are not as illuminating as they might be. The "penalty" for violating MUST or SHOULD is not clear.

Socket reference counting postulates.

● Each user of socket SHOULD hold a reference count.
● Each access point to socket (an hash table bucket, reference from a list, running timer, skb in flight MUST hold a reference count.
● When reference count hits 0, it means it will never increase back.
● When reference count hits 0, it means that no references from outside exist to this socket and current process on current CPU is last user and may/should destroy this socket.
● sk_free is called from any context: process, BH, IRQ. When it is called, socket has no references from outside -> sk_free may release descendant resources allocated by the socket, but to the time when it is called, socket is NOT referenced by any hash tables, lists etc.
● Packets, delivered from outside (from network or from another process) and enqueued on receive/error queues SHOULD NOT grab reference count, when they sit in queue. Otherwise, packets will leak to hole, when socket is looked up by one cpu and unhasing is made by another CPU.
● It is true for udp/raw, netlink (leak to receive and error queues), tcp (leak to backlog). Packet socket does all the processing inside BR_NETPROTO_LOCK, so that it has not this race condition.
  UNIX sockets use separate SMP lock, so that they are prone too.

The sock_hold() and sock_put() functions

    /* Ungrab socket and destroy it if it was the last reference. */
    static inline void sock_put(struct sock *sk)
    {
        if (atomic_dec_and_test(&sk->sk_refcnt))
            sk_free(sk);
    }

    /* Grab socket reference count. This operation is valid only
       when sk is ALREADY grabbed f.e. it is found in hash table
       or a list and the lookup is made under lock preventing hash
       table modifications.
     */

    static inline void sock_hold(struct sock *sk)
    {
        atomic_inc(&sk->sk_refcnt);
    }

    /* Ungrab socket in the context, which assumes that socket refcnt
       cannot hit zero, f.e. it is true in context of any socketcall.
     */
    static inline void __sock_put(struct sock *sk)
    {
        atomic_dec(&sk->sk_refcnt);
    }

UDP connect

All connect requests are vectored to sys_connect(), which is defined in net/socket.c. It takes the following arguments:

fd:         File (socket) descriptor.
uservaddr:  Pointer to a struct sockaddr_in containing the address family, (remote) peer address, and port.
addrlen:    Size of struct sockaddr_in.

If the socket is of type SOCK_DGRAM then uservaddr is the address to which datagrams are sent by default, and the only address from which datagrams are accepted for reception. Sockets of this type may change their association multiple times. They may break their present association by connecting to an address of family AF_UNSPEC.

    asmlinkage long sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen)
    {
        struct socket *sock;
        char address[MAX_SOCK_ADDR];
        int err, fput_needed;

As with bind, the sockfd_lookup_light() function returns a pointer to the struct socket corresponding to fd.
        sock = sockfd_lookup_light(fd, &err, &fput_needed);
        if (!sock)
            goto out;
        err = move_addr_to_kernel(uservaddr, addrlen, address);
        if (err < 0)
            goto out_put;

        err = security_socket_connect(sock, (struct sockaddr *)address, addrlen);
        if (err)
            goto out_put;

Recall that for a SOCK_DGRAM type socket, sock->ops and sock->sk->sk_prot were set in the inet_create() function to point to inet_dgram_ops and udp_prot respectively. Because of this binding, sock->ops->connect(...) translates to inet_dgram_connect(...). The value of sock->file->f_flags was set to O_RDWR (read/write) by sys_socket().

        err = sock->ops->connect(sock, (struct sockaddr *) address, addrlen,
                                 sock->file->f_flags);
    out_put:
        fput_light(sock->file, fput_needed);
    out:
        return err;
    }

The inet_dgram_connect function

The AF_INET layer provides both dgram and stream generic connection interfaces. UDP uses the inet_dgram_connect() function. COP will too.

    int inet_dgram_connect(struct socket *sock, struct sockaddr * uaddr,
                           int addr_len, int flags)
    {
        struct sock *sk = sock->sk;

Breaking an existing connection

If the address family is AF_UNSPEC, this is a request to disconnect the datagram oriented socket, and an indirect call to udp_disconnect() is made to break the current association of the socket. We shall visit this function later. Note that failure to provide a disconnect function will cause a kernel oops.

        if (uaddr->sa_family == AF_UNSPEC)
            return sk->sk_prot->disconnect(sk, flags);

It is not possible to connect an unbound socket. The function inet_autobind() automatically binds an unbound socket. The variable inet->num contains the local port number in host byte order. The test inet->num == 0 is used to determine whether the socket is still unbound.

        if (!inet_sk(sk)->num && inet_autobind(sk))
            return -EAGAIN;

On return from inet_autobind(), an indirect call to ip4_datagram_connect() is made.
COP must provide a cop_connect function, but it will make use of ip4_datagram_connect().

        return sk->sk_prot->connect(sk, (struct sockaddr *)uaddr, addr_len);
    }

Automatic binding

The inet_autobind() function, like inet_bind(), makes an indirect call to udp_v4_get_port() (which was described earlier in the description of UDP bind) to get an available source port for the socket. It passes a parameter of 0, indicating that any port is acceptable.

    static int inet_autobind(struct sock *sk)
    {
        struct inet_sock *inet;
        /* We may need to bind the socket. */
        lock_sock(sk);
        inet = inet_sk(sk);
        if (!inet->num) {
            if (sk->sk_prot->get_port(sk, 0)) {
                release_sock(sk);
                return -EAGAIN;
            }

The get_port() function returns the assigned port number in inet->num. Its representation in network byte order is kept in inet->sport.

            inet->sport = htons(inet->num);
        }

As was the case in bind, release_sock() is called to release the lock.

        release_sock(sk);
        return 0;
    }

The ip4_datagram_connect() function

As in the case of bind, UDP does not provide its own connect function. Before kernel 2.6, udp_connect(), which was defined in net/ipv4/udp.c, performed this function, but that function no longer exists. The code has been moved to ip4_datagram_connect() and is pointed to by the connect element of the udp_prot structure. It has two main missions:

● Fill in the destination data in the inet_sock.
● Determine if a route to the destination is thought to exist. It does not actually send any probe packets nor expect any response.

    int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
    {
        struct inet_sock *inet = inet_sk(sk);
        struct sockaddr_in *usin = (struct sockaddr_in *) uaddr;
        struct rtable *rt;
        u32 saddr;
        int oif;
        int err;

The usual sanity checks on the input parameters are performed first.
        if (addr_len < sizeof(*usin))
            return -EINVAL;

        if (usin->sin_family != AF_INET)
            return -EAFNOSUPPORT;

The sk_dst_reset() function resets the socket's route cache pointer, which is of type struct dst_entry *, to NULL.

        sk_dst_reset(sk);

Here inet->saddr is the existing source address and usin->sin_addr.s_addr is the specified destination address.

        oif = sk->sk_bound_dev_if;
        saddr = inet->saddr;
        if (MULTICAST(usin->sin_addr.s_addr)) {
            if (!oif)
                oif = inet->mc_index;
            if (!saddr)
                saddr = inet->mc_addr;
        }

Routing the connection

The connection can be established only if it is thought that a route exists to the destination address. Note that this does not mean that a route really exists to the destination address. For example, the default route will be accepted as a route to a nonassigned, but legal, IP address. The ip_route_connect() function is called to do the real work.

This is an example of what might be called an informal binding between the transport and network layers. Unlike the other cases in which services are abstracted and associated with a table of function pointers, here the UDP layer just "knows about" a handy entry point into the IP layer.

The value of RT_CONN_FLAGS(sk) is defined as (sk->protinfo.af_inet.tos | sk->localroute). The value of protinfo.af_inet.tos may be set only through the appropriate IP level setsockopt() call and thus for "regular" programs is normally 0. The value of localroute can be set only by a SOCKET level setsockopt() call.

        err = ip_route_connect(&rt, usin->sin_addr.s_addr, saddr,
                               RT_CONN_FLAGS(sk), oif,
                               sk->sk_protocol,
                               inet->sport, usin->sin_port, sk);
        if (err)
            return err;

On return it must be ensured that if the struct rtable carries the BROADCAST flag then so does the struct sock.
        if ((rt->rt_flags & RTCF_BROADCAST) && !sock_flag(sk, SOCK_BROADCAST)) {
            ip_rt_put(rt);
            return -EACCES;
        }

Updating the source and destination addresses

As noted previously in the discussion of bind(), inet->rcv_saddr is used to match the destination IP address in the demultiplexing of input packets and inet->saddr is used as the source IP address in outgoing packets. If either one remains zero at this point, it is inherited from the IP address associated with the outgoing device that is associated with the route table entry that was used. Note that this makes it impossible for a connected socket to have a wildcard rcv_saddr. This is a good thing because a connected socket is supposed to be able to receive only from the other endpoint.

        if (!inet->saddr)
            inet->saddr = rt->rt_src;    /* Update src addr */
        if (!inet->rcv_saddr)
            inet->rcv_saddr = rt->rt_src;

The destination IP address to which this socket is connected is unconditionally copied from the struct rtable. Under what conditions might this address not be the same as usin->sin_addr?? The destination port is simply copied from the input parameter data. The value of inet->id is yet another way (distinct from the AVL tree) by which IP header id values may be used.

        inet->daddr = rt->rt_dst;
        inet->dport = usin->sin_port;
        sk->sk_state = TCP_ESTABLISHED;
        inet->id = jiffies;

Store the address of the route cache element in the socket.

        sk_dst_set(sk, &rt->u.dst);
        return(0);
    }

The ip_route_connect() function

The ip_route_connect() function itself is defined as an inline function in include/net/route.h.
Here the parameter tos has the value RT_CONN_FLAGS(sk), which is defined as (sk->protinfo.af_inet.tos | sk->localroute).

    static inline int ip_route_connect(struct rtable **rp, u32 dst,
                                       u32 src, u32 tos, int oif, u8 protocol,
                                       u16 sport, u16 dport, struct sock *sk)
    {
        struct flowi fl = { .oif = oif,
                            .nl_u = { .ip4_u = { .daddr = dst,
                                                 .saddr = src,
                                                 .tos   = tos } },
                            .proto = protocol,
                            .uli_u = { .ports =
                                       { .sport = sport,
                                         .dport = dport } } };

If either the dst or the src IP address is still zero, __ip_route_output_key() is invoked to correct that situation. Then ip_route_output_flow() makes the final routing decision.

        int err;
        if (!dst || !src) {
            err = __ip_route_output_key(rp, &fl);
            if (err)
                return err;
            fl.fl4_dst = (*rp)->rt_dst;
            fl.fl4_src = (*rp)->rt_src;
            ip_rt_put(*rp);
            *rp = NULL;
        }
        return ip_route_output_flow(rp, &fl, sk, 0);
    }

The UDP disconnect function

Disconnect also undoes the binding, but only the parts of it that were not requested explicitly: an address or port supplied to bind() is left alone, while one chosen by an "autobind" is released.

    int udp_disconnect(struct sock *sk, int flags)
    {
        struct inet_sock *inet = inet_sk(sk);
        /*
         *    1003.1g - break association.
         */

        sk->sk_state = TCP_CLOSE;
        inet->daddr = 0;
        inet->dport = 0;
        sk->sk_bound_dev_if = 0;

These flags were set in the bind function, so explicit binds are not undone. They were not set in autobind, though.

        if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
            inet_reset_saddr(sk);

        if (!(sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
            sk->sk_prot->unhash(sk);
            inet->sport = 0;
        }
        sk_dst_reset(sk);
        return 0;
    }

Closing a socket

Your protocol should use an approach similar to UDP. We will want to print out some performance data before calling sk_common_release().
    static void udp_close(struct sock *sk, long timeout)
    {
        sk_common_release(sk);
    }

The sk_common_release() function

This helper function consolidates some of the operations that formerly had to be done by each protocol. The protocol's destroy function is responsible for freeing all sk_buffs that might remain on the write queue. The protocol must provide an unhash function because the protocol has its own hash table. Normally no packets will be left when line 1696 is reached, and the call to sock_put() will drop the last reference and release the socket. Therefore, the protocol close function must not touch the sock structure on return from sk_common_release().

    1664 void sk_common_release(struct sock *sk)
    1665 {
    1666     if (sk->sk_prot->destroy)
    1667         sk->sk_prot->destroy(sk);
    1668
    1669     /*
    1670      * Observation: when sock_common_release is called, processes have
    1671      * no access to socket. But net still has.
    1672      * Step one, detach it from networking:
    1673      *
    1674      * A. Remove from hash tables.
    1675      */
    1676
    1677     sk->sk_prot->unhash(sk);
    1678

The kernel comment at lines 1679-1689 says:

    In this point socket cannot receive new packets, but it is possible
    that some packets are in flight because some CPU runs receiver and
    did hash table lookup before we unhashed socket. They will achieve
    receive queue and will be purged by socket destructor.

    Also we still have packets pending on receive queue and probably,
    our own packets waiting in device queues. sock_destroy will drain
    receive queue, but transmitted packets will delay socket destruction
    until the last reference will be released.

    1690
    1691     sock_orphan(sk);
    1692
    1693     xfrm_sk_free_policy(sk);
    1694
    1695     sk_refcnt_debug_release(sk);
    1696     sock_put(sk);
    1697 }

The udp_destroy_sock() function

This function just frees all sk_buffs that happen to be on the write_queue. I recommend that you do this directly with skb_queue_purge().
    static int udp_destroy_sock(struct sock *sk)
    {
        lock_sock(sk);
        udp_flush_pending_frames(sk);
        release_sock(sk);
        return 0;
    }

The UDP unhash function

This function illustrates how to use the helper function to unhash a socket structure. The function sk_del_node_init() calls __sock_put() to drop the reference that was added during hashing.

    static void udp_v4_unhash(struct sock *sk)
    {
        write_lock_bh(&udp_hash_lock);
        if (sk_del_node_init(sk)) {
            inet_sk(sk)->num = 0;
            sock_prot_dec_use(sk->sk_prot);
        }
        write_unlock_bh(&udp_hash_lock);
    }

The sk_refcnt_debug_release() function

This function will print a nastygram to the system log if your socket's reference count is not exactly 1 when it is called from sk_common_release(), that is, if references other than the one about to be dropped by sock_put() are still outstanding.

    static inline void sk_refcnt_debug_release(struct sock *sk)
    {
        if (atomic_read(&sk->sk_refcnt) != 1)
            printk(KERN_DEBUG "Destruction of the %s socket %p delayed, refcnt=%d\n",
                   sk->sk_prot->name, sk, atomic_read(&sk->sk_refcnt));
    }
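
Observing connect and disconnect from user space

Returning to udp_disconnect(): the effect of the SOCK_BINDADDR_LOCK and SOCK_BINDPORT_LOCK tests can probably be observed from an ordinary program. The following is a minimal user-space sketch, not kernel code. It assumes a Linux host with a working loopback interface and a kernel that still follows the logic shown earlier. The helper names (local_port, loopback_addr, disconnect_udp, port_after_disconnect) and the peer port 9 are made up for this illustration. The sketch relies on the standard convention that calling connect() with an address whose family is AF_UNSPEC dissolves the association of a datagram socket.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return the local port of fd in host byte order, or -1 on error. */
static int local_port(int fd)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);

    if (getsockname(fd, (struct sockaddr *)&sa, &len) < 0)
        return -1;
    return ntohs(sa.sin_port);
}

/* Fill sa with the address 127.0.0.1:port. */
static void loopback_addr(struct sockaddr_in *sa, unsigned short port)
{
    memset(sa, 0, sizeof(*sa));
    sa->sin_family = AF_INET;
    sa->sin_port = htons(port);
    sa->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
}

/* Dissolve the association by connecting to an AF_UNSPEC address. */
static int disconnect_udp(int fd)
{
    struct sockaddr sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_family = AF_UNSPEC;
    return connect(fd, &sa, sizeof(sa));
}

/*
 * Hypothetical helper (not part of any library).
 * bind_port == 0: no bind() call, so connect() must autobind.
 * bind_port != 0: explicit bind() to 127.0.0.1:bind_port first.
 * Stores the local port seen while connected in *port_connected and
 * returns the local port seen after the disconnect, or -1 on error.
 */
static int port_after_disconnect(unsigned short bind_port, int *port_connected)
{
    struct sockaddr_in addr;
    int port = -1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (bind_port != 0) {
        loopback_addr(&addr, bind_port);
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            goto out;
    }
    /* The peer port is arbitrary: connect() on UDP sends no packet. */
    loopback_addr(&addr, 9);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    *port_connected = local_port(fd);
    if (disconnect_udp(fd) < 0)
        goto out;
    port = local_port(fd);
out:
    close(fd);
    return port;
}
```

Calling port_after_disconnect(0, &p) exercises the autobind path: p should hold a nonzero ephemeral port, and the return value should be 0 because the port lock was never set and inet->sport is cleared. Passing a free nonzero port exercises the explicit bind path, where the same port should be reported both while connected and after the disconnect.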