http://hi.chinaunix.net/?uid-21365544-action-viewspace-itemid-42476
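For more than half a year I have been studying the TCP/IP source code. Right now I am mainly on the IP part, and lately I have taken a close look at fragmentation and reassembly.
This part is basically done, and I plan to write a user-space program that simulates IP fragmentation and reassembly. If you have any ideas, or if you spot a mistake in my analysis of the function, please leave a comment so we can discuss it. O(∩_∩)O~
Below is my analysis of the main fragmentation function, ip_fragment(). It is supposedly one of the most troublesome functions in TCP/IP (that is what my teacher said at first, but he changed his mind once he got to the TCP part, heh~).
Source version: 2.6.27.5
Reference books: "Understanding Linux Network Internals"
"TCP/IP Illustrated"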
/*
 *	This IP datagram is too large to be sent in one piece.  Break it up into
 *	smaller pieces (each of size equal to IP header plus
 *	a block of the data of the original IP data part) that will yet fit in a
 *	single device frame, and queue such a frame for sending.
 */
/*
 * Started analyzing this function on 2009-04-15.
 * For a forwarded packet, if the data is smaller than the MTU,
 * ip_finish_output is called directly and ip_fragment is never invoked.
 * Two situations have to be considered when reading this function:
 * 1. Forwarded packets, i.e. data not generated by the local upper layers.
 *    No skbs hang off frag_list, so if the packet has to be fragmented
 *    again, only the slow_path can be taken.
 * 2. Locally generated data. There are three sub-cases:
 *    a. The upper-layer protocol helps with fragmentation and the fast_path
 *       checks succeed, so fragmenting only means adding a header to each
 *       fragment.
 *    b. The upper-layer protocol helps, but the fast_path checks fail and
 *       we jump to slow_path, where the data is split and given headers
 *       again.
 *    c. The upper layer does nothing to help, and fragmentation goes
 *       straight to slow_path.
 *                                        ----cdc. 09.5.21
 */
int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff*))
/* On both the fast_path and the slow_path, skb already carries an initialized IP header. */
{
	struct iphdr *iph;		/* pointer to the IP header */
	int raw = 0;
	int ptr;
	struct net_device *dev;
	struct sk_buff *skb2;		/* Target skb on the slow path: the data of each
					 * fragment is copied into the buffer skb2 points
					 * to, and skb2 is what finally gets sent. */
	unsigned int mtu, hlen, left, len, ll_rs, pad;
	int offset;			/* byte offset of the current fragment's data
					 * within the original datagram */
	__be16 not_last_frag;		/* nonzero if the packet being fragmented is
					 * itself not the last fragment (MF set) */
	struct rtable *rt = skb->rtable;	/* output route */
	int err = 0;

	dev = rt->u.dst.dev;		/* output device of the route */

	/*
	 *	Point into the IP datagram header.
	 */

	iph = ip_hdr(skb);		/* get a pointer to the IP header */

	/* If the input IP packet cannot be fragmented because the source
	 * has set the DF flag, ip_fragment sends an ICMP packet back to
	 * the source to notify it of the problem, and then drops the packet.
	 */
	/*
	 * unlikely() says the condition is rarely true. IP_DF = 0x4000.
	 * The local_df flag shown in the if condition is set mainly by the
	 * Virtual Server code when it does not want the condition just
	 * described to generate an ICMP message.
	 */
	if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) {
		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
		/* Fragmentation is not allowed: send an ICMP notification back to the sender. */
		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
			  htonl(ip_skb_dst_mtu(skb)));
		kfree_skb(skb);		/* fragmentation is not allowed: free the sk_buff */
		return -EMSGSIZE;
	}

	/*
	 *	Setup starting values.
	 */

	hlen = iph->ihl * 4;		/* L3 header length in bytes. ihl is a 4-bit field
					 * counting 4-byte words, so ihl is at most 15 and
					 * hlen at most 60. */
	mtu = dst_mtu(&rt->u.dst) - hlen;	/* Size of data space */
	/*
	 * IPCB(skb) is defined as ((struct inet_skb_parm *)((skb)->cb)): the cb
	 * area of the sk_buff is interpreted as a struct inet_skb_parm. Every
	 * skb of a datagram that has been through fragmentation gets
	 * IPSKB_FRAG_COMPLETE set in its cb flags.
	 */
	IPCB(skb)->flags |= IPSKB_FRAG_COMPLETE;

	/* When frag_list is given, use it. First, check its validity:
	 * some transformers could create wrong frag_list or break existing
	 * one, it is not prohibited. In this case fall back to copying.
	 *
	 * LATER: this step can be merged to real generation of fragments,
	 * we can switch to copy when see the first bad fragment.
	 */
	/* skb_shinfo(skb) returns a pointer to the skb_shared_info. The skb has
	 * no direct pointer to that structure, so the macro must be used. */
	/* Fast path */
	if (skb_shinfo(skb)->frag_list) {
		struct sk_buff *frag;
		/*
		 * Note what the next line is for. After ip_push_pending_frames,
		 * skb->len of the head skb covers the whole datagram. first_len
		 * is the length of the head skb alone: its linear buffer plus
		 * the page fragments referenced by frags[]. skb->len itself is
		 * not changed yet, because we may still fall back to slow_path.
		 */
		int first_len = skb_pagelen(skb);
		int truesizes = 0;	/* sum of truesize (data plus sk_buff overhead)
					 * of the skbs on frag_list */

		/*
		 * Fall back to slow_path if any of the following holds:
		 * - the data area of the head skb is larger than mtu;
		 * - the data length of the head skb is not a multiple of 8 bytes
		 *   (every fragment except the last must carry a multiple of
		 *   8 bytes);
		 * - the packet is already a fragment (MF set or nonzero offset);
		 * - the skb is cloned.
		 *
		 * For a datagram built by ip_push_pending_frames, iph->frag_off
		 * carries at most the DF bit (0x4000), and
		 * 0x4000 & (IP_MF|IP_OFFSET) = 0x4000 & 0x3FFF = 0, so a locally
		 * built datagram passes the third test. A forwarded packet may
		 * itself be a fragment; it takes the slow path, which recovers the
		 * real byte offset with (ntohs(iph->frag_off) & IP_OFFSET) << 3
		 * and uses it to compute frag_off for each new fragment.
		 *
		 * skb_cloned() returns 1 if the skb is cloned and 0 otherwise. The
		 * buffer of a cloned skb must not be modified, so the IP header
		 * cannot be rewritten in place. The slow path is still fine,
		 * because it only copies the data into new skbs and never modifies
		 * the original buffer.
		 */
		if (first_len - hlen > mtu ||
		    ((first_len - hlen) & 7) ||
		    (iph->frag_off & htons(IP_MF|IP_OFFSET)) ||
		    skb_cloned(skb))
			goto slow_path;

		/*
		 * Walk the skb list built on frag_list.
		 * The main job of ip_append_data is only to create the socket
		 * buffers (skbs) that carry the data. Using the MTU of the output
		 * device found by the route lookup, it splits application data
		 * that exceeds the MTU into several skbs, one piece per skb, and
		 * puts them on the socket's send queue, sk_write_queue. It does
		 * not add a network-layer header to any of them. Later,
		 * ip_push_pending_frames chains all skbs of the send queue onto
		 * frag_list in the struct skb_shared_info that follows the end
		 * pointer of the first skb, and adds a network-layer header to
		 * the first skb only. So the whole application data is already
		 * spread over the skbs; ip_append_data just prepares for the real
		 * IP fragmentation done here. The list is effectively one packet
		 * whose parts are stored in many skbs.
		 */
		for (frag = skb_shinfo(skb)->frag_list; frag; frag = frag->next) {
			/* Correct geometry. */
			/*
			 * Second test: this skb is not the last one and its
			 * length is not a multiple of 8 bytes.
			 * Third test: skb_headroom() returns the free space in
			 * front of the data, which already starts with the L4
			 * part. If the L3 header does not fit there, use the slow
			 * path. This probably rarely happens, because the kernel
			 * allocates the buffer for the worst case, so there is
			 * normally room to spare.
			 */
			if (frag->len > mtu ||
			    ((frag->len & 7) && frag->next) ||
			    skb_headroom(frag) < hlen)
				goto slow_path;

			/* Partially cloned skb? */
			/* If the skb is shared, its buffer must not be modified,
			 * i.e. no IP header can be added. */
			if (skb_shared(frag))
				goto slow_path;

			BUG_ON(frag->sk);	/* acts as an assertion */
			if (skb->sk) {
				/*
				 * sock_hold, Grab socket reference count. This operation is valid only
				 * when sk is ALREADY grabbed f.e. it is found in hash table
				 * or a list and the lookup is made under lock preventing hash table
				 * modifications.
				 */
				sock_hold(skb->sk);
				/*
				 * Make frag->sk point to the socket that owns the
				 * head skb, so that this socket also owns frag.
				 * struct sock *sk: This is a pointer to a sock data
				 * structure of the socket that owns this buffer.
				 * This pointer is needed when data is either
				 * locally generated or being received by a local
				 * process, because the data and socket-related
				 * information is used by L4 (TCP or UDP) and by the
				 * user application. When a buffer is merely being
				 * forwarded (that is, neither the source nor the
				 * destination is on the local machine), this
				 * pointer is NULL.
				 */
				frag->sk = skb->sk;
				/*
				 * Set the destructor function pointer.
				 * void (*destructor): When the buffer belongs to a
				 * socket, it is usually set to sock_rfree or
				 * sock_wfree (by the skb_set_owner_r and
				 * skb_set_owner_w initialization functions,
				 * respectively). The two sock_xxx routines are used
				 * to update the amount of memory held by the socket
				 * in its queues.
				 */
				frag->destructor = sock_wfree;
				truesizes += frag->truesize;	/* accumulate truesize of all skbs except the head */
			}
		}	/* end of the frag_list walk */

		/* Everything is OK. Generate! */

		err = 0;
		offset = 0;
		frag = skb_shinfo(skb)->frag_list;
		skb_shinfo(skb)->frag_list = NULL;	/* detach the chained skbs from frag_list; frag now points to them */
		/*
		 * Fix up data_len so that it equals the size of the page
		 * fragments of the head skb. Before this, data_len was the head's
		 * page fragments plus the len of all following skbs.
		 */
		skb->data_len = first_len - skb_headlen(skb);
		/*
		 * Leave only the head skb's own truesize. After
		 * ip_push_pending_frames, the head's truesize was the sum over
		 * all skbs, just like len and data_len described above.
		 */
		skb->truesize -= truesizes;
		skb->len = first_len;		/* len now covers only the head skb, not the chained skbs */
		iph->tot_len = htons(first_len);	/* everything put on the wire must be in network byte order */
		/*
		 * IP_MF is 0x2000 and the offset bits are 0, so the first
		 * fragment has offset 0 and "more fragments" set.
		 */
		iph->frag_off = htons(IP_MF);
		ip_send_check(iph);		/* compute the checksum of the first fragment's IP header */

		/* Add an IP header to each following fragment. */
		for (;;) {
			/* Prepare header of the next frame,
			 * before previous one went down. */
			if (frag) {
				frag->ip_summed = CHECKSUM_NONE;	/* ip_summed represents the checksum and
									 * associated status flag; CHECKSUM_NONE = 0 */
				skb_reset_transport_header(frag);	/* transport_header = skb->data - skb->head,
									 * i.e. the offset of the L4 part from head */
				/*
				 * __skb_push moves data back by hlen and does
				 * skb->len += hlen, which opens up room for the L3
				 * header in front of the L4 part.
				 */
				__skb_push(frag, hlen);
				skb_reset_network_header(frag);	/* like skb_reset_transport_header: records the
								 * offset of the IP header from head */
				/*
				 * For reference, memcpy copies count bytes and
				 * returns the destination pointer:
				 *
				 * void *memcpy(void *dest, const void *src, size_t count) {
				 *	char *tmp = dest;
				 *	const char *s = src;
				 *
				 *	while (count--)
				 *		*tmp++ = *s++;
				 *	return dest;
				 * }
				 *
				 * skb_network_header() returns
				 * skb->head + skb->network_header, i.e. a pointer to
				 * the L3 (IP) header.
				 */
				memcpy(skb_network_header(frag), iph, hlen);	/* copy the IP header into this skb */
				iph = ip_hdr(frag);		/* pointer to the new IP header */
				iph->tot_len = htons(frag->len);	/* host to network byte order */
				ip_copy_metadata(frag, skb);	/* copy skb management fields such as protocol
								 * and priority; no effect on fragmentation */
				if (offset == 0)
					ip_options_fragment(frag);	/* fix up the options once, on the first
									 * following fragment, so that later
									 * fragments reuse the fixed header */
				offset += skb->len - hlen;	/* advance the byte offset */
				iph->frag_off = htons(offset>>3);	/* the offset field is 13 bits and counts
									 * 8-byte units; offset is in bytes and
									 * a multiple of 8 */
				if (frag->next != NULL)
					iph->frag_off |= htons(IP_MF);
				/* Ready, complete checksum */
				ip_send_check(iph);
			}

			err = output(skb);	/* returns 0 on success, a nonzero error code on failure */

			if (!err)		/* this fragment was sent: count it */
				IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGCREATES);
			if (err || !frag)	/* stop on error or when no fragment is left */
				break;

			skb = frag;		/* the prepared fragment becomes the one to send */
			frag = skb->next;	/* frag points to the next skb */
			skb->next = NULL;	/* unlink skb from the rest of the list */
		}	/* end of header insertion */

		if (err == 0) {		/* fast path succeeded: update the statistics and return */
			IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGOKS);
			return 0;
		}

		while (frag) {		/* fast path failed */
			skb = frag->next;
			kfree_skb(frag);	/* free the fragments that were not sent */
			frag = skb;
		}
		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
		return err;		/* return the error */
	}	/* end of fast path */

slow_path:
	left = skb->len - hlen;		/* Space per frame: length of the data area */
	ptr = raw + hlen;		/* Where to start from: initially the start of the data
					 * area, then the start of each successive fragment */

	/* for bridged IP traffic encapsulated inside f.e. a vlan header,
	 * we need to make room for the encapsulating header
	 */
	/* nf_bridge_pad: This is called by the IP fragmenting code and it ensures there is
	 * enough room for the encapsulating header (if there is one)
	 */
	pad = nf_bridge_pad(skb);	/* room for an encapsulating header such as VLAN or PPPoE */
	ll_rs = LL_RESERVED_SPACE_EXTRA(rt->u.dst.dev, pad);	/* ll_rs: Link Layer Reserved Space */
	/* room for L4 data should exclude the pad from mtu */
	mtu -= pad;

	/*
	 *	Fragment the datagram.
	 */

	offset = (ntohs(iph->frag_off) & IP_OFFSET) << 3;	/* byte offset of this packet within the
								 * original datagram; IP_OFFSET is 0x1FFF */
	not_last_frag = iph->frag_off & htons(IP_MF);	/* is true when more data is supposed to follow
							 * the current fragment in the packet */

	/*
	 *	Keep copying data until we run out.
	 */

	while (left > 0) {
		len = left;	/* start from the remaining length of the data area */
		/* IF: it doesn't fit, use 'mtu' - the data space left */
		/* If the remaining data is larger than mtu, clamp len to mtu; otherwise len stays equal to left. */
		if (len > mtu)
			len = mtu;
		/* IF: we are not sending upto and including the packet end
		   then align the next start on an eight byte boundary */
		/* This is not the last fragment, so round len down to a multiple of 8 bytes. */
		if (len < left)	{
			len &= ~7;
		}
		/*
		 *	Allocate buffer.
		 *	The size of the buffer allocated to hold a fragment is the sum of:
		 *
		 *	The size of the IP payload -> len
		 *	The size of the IP header  -> hlen
		 *	The size of the L2 header  -> ll_rs
		 */
		/*
		 * The size must be right, especially for the last fragment: the
		 * buffer has to hold exactly the remaining data, otherwise the
		 * copy below fails.
		 */
		if ((skb2 = alloc_skb(len+hlen+ll_rs, GFP_ATOMIC)) == NULL) {
			NETDEBUG(KERN_INFO "IP: frag: no memory for new fragment!\n");
			err = -ENOMEM;
			goto fail;
		}

		/*
		 *	Set up data on packet
		 */

		ip_copy_metadata(skb2, skb);	/* copy skb fields such as protocol and priority to skb2 */
		/* Increase the headroom of an empty &sk_buff by reducing the tail
		 * room. This is only allowed for an empty buffer.
		 */
		skb_reserve(skb2, ll_rs);	/* reserve room for the L2 header in the empty buffer */
		skb_put(skb2, len + hlen);	/* then make room for the IP header and the IP payload */
		/*	skb2->nh.raw = skb2->data;
		 *	skb2->h.raw = skb2->data + hlen;
		 * The next two statements are equivalent to the two older ones
		 * above: they initialize the IP header and L4 header pointers. */
		skb_reset_network_header(skb2);	/* record the offset of the IP header from head */
		skb2->transport_header = skb2->network_header + hlen;	/* offset of the L4 part from head */

		/*
		 *	Charge (account) the memory for the fragment to any owner
		 *	it might possess
		 */
		/* Ties the sk_buff to the given sock: The newly allocated buffer is
		 * associated with the socket attempting the transmission, if any.
		 */
		if (skb->sk)
			skb_set_owner_w(skb2, skb->sk);

		/*
		 *	Copy the packet header into the new buffer.
		 */

		skb_copy_from_linear_data(skb, skb_network_header(skb2), hlen);	/* copy the IP header into the fragment */

		/*
		 *	Copy a block of the IP datagram.
		 */
		/*
		 * offset: the fragment offset, i.e. where this fragment sits
		 *         relative to the original (unfragmented) datagram.
		 * ptr:    the position of the region being copied, relative to
		 *         the start of the packet currently being fragmented.
		 */
		/* Copy the data area into the new skb. (Analysis of this function finished 2009-05-19.) */
		if (skb_copy_bits(skb, ptr, skb_transport_header(skb2), len))
			BUG();
		left -= len;	/* remaining length of the data area */

		/*
		 *	Fill in the new header fields.
		 */
		iph = ip_hdr(skb2);
		iph->frag_off = htons((offset >> 3));

		/* ANK: dirty, but effective trick. Upgrade options only if
		 * the segment to be fragmented was THE FIRST (otherwise,
		 * options are already fixed) and make it ONCE
		 * on the initial skb, so that all the following fragments
		 * will inherit fixed options.
		 */
		/*
		 * The first fragment (where offset is 0) is special from the IP
		 * options point of view because it is the only one that includes
		 * a full copy of the options from the original IP packet. Not all
		 * the options have to be replicated into all of the fragments;
		 * only the first fragment will include all of them. Note that the
		 * call modifies the original skb, after its header has already
		 * been copied into the first fragment.
		 */
		if (offset == 0)
			ip_options_fragment(skb);

		/*
		 *	Added AC : If we are fragmenting a fragment that's not the
		 *		   last fragment then keep MF on each bit
		 */
		if (left > 0 || not_last_frag)
			iph->frag_off |= htons(IP_MF);
		/*
		 * The following two statements update two offsets. It is easy to
		 * confuse the two. offset is maintained because the packet
		 * currently being fragmented may be a fragment of a larger packet;
		 * if so, offset represents the offset of the current fragment
		 * within the original packet (otherwise, it is simply 0). ptr is
		 * an offset within the packet we are fragmenting and changes as
		 * the loop progresses. The two variables have the same value in
		 * two cases: where the packet we are fragmenting is not a fragment
		 * itself, and where this fragment is the very first fragment.
		 */
		ptr += len;
		offset += len;

		/*
		 *	Put this fragment into the sending queue.
		 */
		iph->tot_len = htons(len + hlen);	/* total length of the fragment: data plus IP header */

		ip_send_check(iph);	/* compute the IP header checksum */

		err = output(skb2);	/* send the fragment */
		if (err)		/* error handling */
			goto fail;

		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGCREATES);	/* no error: count the fragment */
	}	/* end of slow path loop */
	kfree_skb(skb);		/* slow path succeeded: free the skb that was passed in */
	IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGOKS);	/* count the successful fragmentation */
	return err;

	/* error handling */
fail:
	kfree_skb(skb);
	IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
	return err;
}	/* end of ip_fragment, 2009.5.19 */	// note the date, heh~