
Linux Kernel Packet Receive and Transmit Flow

2019-08-20 18:40:10 Author: Internet

Original link: https://blog.csdn.net/kklvsports/article/details/74452953


Receive path:

The traditional receive path and the NAPI receive path differ, as shown in the figure.

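Traditional reception is interrupt-driven: once the driver has handled the interrupt, it calls netif_rx directly to hand the packet to the kernel, and the kernel appends the skb to the input_pkt_queue of the current CPU's softnet_data structure. To handle traditional and NAPI devices uniformly, the kernel provides a virtual device, called the backlog device, for all drivers that do not use NAPI. There is one backlog device per CPU, stored in softnet_data->backlog_dev (in later kernels this became softnet_data->backlog, a napi_struct). input_pkt_queue is the backlog queue of that device. It stores skbs in a doubly linked list, organized as shown below. The interrupt top half only enqueues the packet and adds the backlog instance to poll_list; the bottom-half softirq then walks poll_list (net_rx_action-->process_backlog) to process the packet further.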

input_pkt_queue structure

     sk_buff_head     sk_buff          sk_buff
     +---------+      +---------+      +---------+
 +-->|  next   |----->|  next   |----->|  next   |---+
 | +-|  prev   |<-----|  prev   |<-----|  prev   |<+ |
 | | | qlen=2  |      |         |      |         | | |
 | | +---------+      +---------+      +---------+ | |
 | +-----------------------------------------------+ |
 +---------------------------------------------------+
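
The queue layout above can be sketched in user-space C. This is only an illustration under stated assumptions, not kernel code: struct list_node, struct pkt, struct pkt_queue and the queue_* helpers are names invented for the example, and max_len is a hypothetical limit loosely modeled on the net.core.netdev_max_backlog check that netif_rx applies before enqueueing.

```c
#include <stddef.h>

/* Circular doubly linked list node, shared by the head and every packet. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* Illustrative stand-in for sk_buff. */
struct pkt {
    struct list_node node;  /* must stay first so a node pointer can be cast back */
    int id;                 /* hypothetical payload marker */
};

/* Illustrative stand-in for sk_buff_head. */
struct pkt_queue {
    struct list_node head;
    unsigned int qlen;      /* number of packets currently queued */
    unsigned int max_len;   /* assumed drop threshold */
};

static void queue_init(struct pkt_queue *q, unsigned int max_len)
{
    q->head.next = q->head.prev = &q->head;  /* empty list points at itself */
    q->qlen = 0;
    q->max_len = max_len;
}

/* Append at the tail, as the top half does. Returns -1 if the queue is full. */
static int queue_enqueue_tail(struct pkt_queue *q, struct pkt *p)
{
    if (q->qlen >= q->max_len)
        return -1;
    p->node.prev = q->head.prev;
    p->node.next = &q->head;
    q->head.prev->next = &p->node;
    q->head.prev = &p->node;
    q->qlen++;
    return 0;
}

/* Remove from the head, as the softirq side does. Returns NULL when empty. */
static struct pkt *queue_dequeue_head(struct pkt_queue *q)
{
    struct list_node *first = q->head.next;

    if (first == &q->head)
        return NULL;
    q->head.next = first->next;
    first->next->prev = &q->head;
    first->next = first->prev = NULL;
    q->qlen--;
    return (struct pkt *)first;
}
```

Enqueueing at the tail and dequeueing at the head keeps packets in arrival order, which is why the interrupt side and the softirq side can work on opposite ends of the same list.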

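In the traditional scheme every packet raises an interrupt. When packets arrive too fast, interrupts become so frequent that the CPU spends all its time servicing them and other tasks cannot be scheduled. NAPI (New API) was introduced to address this: it combines interrupts with polling to raise receive throughput.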
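
The interrupt-plus-polling idea can be illustrated with a toy model. It assumes the usual NAPI driver pattern (the first interrupt masks further RX interrupts and schedules polling; polling re-enables interrupts once the ring is drained). struct sim_nic and the sim_* functions are hypothetical names for this sketch, not driver APIs.

```c
/* Toy model of one NIC receive ring under NAPI-style interrupt mitigation. */
struct sim_nic {
    int ring;         /* packets waiting in the RX ring */
    int irq_enabled;  /* 1 while the device may raise RX interrupts */
    int scheduled;    /* 1 while the device is queued for polling */
    int irq_count;    /* RX interrupts taken so far */
};

static void sim_nic_init(struct sim_nic *nic)
{
    nic->ring = 0;
    nic->irq_enabled = 1;
    nic->scheduled = 0;
    nic->irq_count = 0;
}

/* A packet lands in the ring; only the first one raises an interrupt. */
static void sim_packet_arrives(struct sim_nic *nic)
{
    nic->ring++;
    if (nic->irq_enabled) {
        nic->irq_count++;
        nic->irq_enabled = 0;  /* mask RX interrupts while polling is pending */
        nic->scheduled = 1;    /* analogous to putting the napi on poll_list */
    }
}

/* Poll up to 'weight' packets; go back to interrupt mode once drained. */
static int sim_nic_poll(struct sim_nic *nic, int weight)
{
    int work = nic->ring < weight ? nic->ring : weight;

    nic->ring -= work;
    if (work < weight) {
        nic->scheduled = 0;
        nic->irq_enabled = 1;
    }
    return work;
}
```

In this model a burst of 100 packets costs a single interrupt followed by two polls of weight 64, instead of 100 interrupts.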

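NAPI reception requires driver support, as in the Intel e1000 family of NICs. In the receive interrupt, e1000_intr_msix_rx adds the NIC's napi instance to the poll_list of softnet_data and then sets the NET_RX_SOFTIRQ softirq flag; net_rx_action later runs when that flag is found set. When does the softirq run? On two occasions:

1. When the interrupt top half exits, do_IRQ-->irq_exit-->do_softirq-->call_softirq-->__do_softirq invokes the softirq handler net_rx_action, which walks the NICs on poll_list.

2. __do_softirq restarts softirq processing up to MAX_SOFTIRQ_RESTART = 10 times. If packets are still pending after that, wakeup_softirqd wakes the ksoftirqd kernel thread, which receives them via run_ksoftirqd-->__do_softirq-->net_rx_action.

net_rx_action executes as follows (kernel version 3.2.x).
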
static void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = &__get_cpu_var(softnet_data);
    unsigned long time_limit = jiffies + 2;
    /* Maximum number of skbs handled in one softirq run; the default
     * is 300, matching the sysctl net.core.netdev_budget = 300. */
    int budget = netdev_budget;
    void *have;

    local_irq_disable(); /* disable interrupts to access softnet_data */

    while (!list_empty(&sd->poll_list)) {
        struct napi_struct *n;
        int work, weight;

        /* If softirq window is exhausted then punt.
         * Allow this to run for 2 jiffies since which will allow
         * an average latency of 1.5/HZ.
         */
        /* Poll for no more than 2 jiffies and stop once the budget
         * (300 by default) has been used up. */
        if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
            goto softnet_break;

        local_irq_enable();

        /* Even though interrupts have been re-enabled, this
         * access is safe because interrupts can only add new
         * entries to the tail of this list, and only ->poll()
         * calls can remove this head entry from the list.
         */
        /* Take the head of poll_list, i.e. some NIC's napi instance. */
        n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);

        have = netpoll_poll_lock(n);

        /* Maximum number of packets this NIC handles per poll, typically 64. */
        weight = n->weight;

        /* This NAPI_STATE_SCHED test is for avoiding a race
         * with netpoll's poll_napi(). Only the entity which
         * obtains the lock and sees NAPI_STATE_SCHED set will
         * actually make the ->poll() call. Therefore we avoid
         * accidentally calling ->poll() when NAPI is not scheduled.
         */
        work = 0;
        if (test_bit(NAPI_STATE_SCHED, &n->state)) {
            /* Call the device-specific poll function to process packets.
             * If poll() handles fewer than weight packets, the driver
             * calls napi_complete(), which removes the instance from
             * poll_list. For non-NAPI devices the poll function is
             * process_backlog. */
            work = n->poll(n, weight);
            trace_napi_poll(n);
        }

        WARN_ON_ONCE(work > weight);

        budget -= work;

        local_irq_disable();

        /* Drivers must not modify the NAPI state if they
         * consume the entire weight. In such cases this code
         * still "owns" the NAPI instance and therefore can
         * move the instance around on the list at-will.
         */
        /* If the whole weight was consumed, the device probably needs
         * more polling, so move this napi to the tail of poll_list.
         * Packets still held for GRO are flushed into the protocol
         * stack instead of waiting any longer. */
        if (unlikely(work == weight)) {
            if (unlikely(napi_disable_pending(n))) {
                local_irq_enable();
                napi_complete(n);
                local_irq_disable();
            } else {
                if (n->gro_list) {
                    /* flush too old packets
                     * If HZ < 1000, flush all packets.
                     */
                    local_irq_enable();
                    napi_gro_flush(n, HZ >= 1000);
                    local_irq_disable();
                }
                list_move_tail(&n->poll_list, &sd->poll_list);
            }
        }

        netpoll_poll_unlock(have);
    }
out:
    net_rps_action_and_irq_enable(sd);

#ifdef CONFIG_NET_DMA
    /*
     * There may not be any more sk_buffs coming right now, so push
     * any pending DMA copies to hardware
     */
    dma_issue_pending_all();
#endif

    return;

softnet_break:
    sd->time_squeeze++;
    /* This round did not finish: raise the softirq flag again so that
     * the next softirq run calls net_rx_action to continue. */
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);

    goto out;
}
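
The interplay of budget, weight and the softirq restart limit can be tried out in a small user-space simulation. It is a simplified sketch under stated assumptions: the 2-jiffies time limit is left out, the list is a plain array, and removal from the list happens in the loop itself rather than in the driver's poll() via napi_complete(). All sim_* names are hypothetical.

```c
#include <stddef.h>

#define SIM_MAX_DEVS 8
#define SIM_MAX_SOFTIRQ_RESTART 10  /* mirrors MAX_SOFTIRQ_RESTART */

struct sim_napi {
    int pending;  /* packets waiting in this device's RX ring */
    int weight;   /* per-poll limit, like napi_struct.weight */
};

struct sim_poll_list {
    struct sim_napi *dev[SIM_MAX_DEVS];
    int count;
};

/* Stand-in for a driver's poll(): consume at most 'weight' packets. */
static int sim_poll(struct sim_napi *n, int weight)
{
    int work = n->pending < weight ? n->pending : weight;

    n->pending -= work;
    return work;
}

/* Remove the head entry; if 'requeue' is set, append it at the tail. */
static void sim_pop_head(struct sim_poll_list *l, int requeue)
{
    struct sim_napi *head = l->dev[0];
    int i;

    for (i = 1; i < l->count; i++)
        l->dev[i - 1] = l->dev[i];
    if (requeue)
        l->dev[l->count - 1] = head;
    else
        l->count--;
}

/* One softirq run. Sets *squeezed when the budget ends with work left. */
static int sim_net_rx_action(struct sim_poll_list *l, int budget, int *squeezed)
{
    int total = 0;

    *squeezed = 0;
    while (l->count > 0) {
        struct sim_napi *n = l->dev[0];
        int work;

        if (budget <= 0) {
            *squeezed = 1;  /* like the softnet_break path */
            break;
        }
        work = sim_poll(n, n->weight);
        budget -= work;
        total += work;
        /* Full weight used: rotate to the tail. Otherwise the device is done. */
        sim_pop_head(l, work == n->weight);
    }
    return total;
}

/* Restart loop: rerun while work is pending, at most the restart limit. */
static int sim_do_softirq(struct sim_poll_list *l, int budget, int *deferred)
{
    int max_restart = SIM_MAX_SOFTIRQ_RESTART;
    int total = 0;
    int squeezed;

    do {
        total += sim_net_rx_action(l, budget, &squeezed);
    } while (squeezed && --max_restart);

    *deferred = squeezed;  /* the kernel would wake ksoftirqd at this point */
    return total;
}
```

Because the budget is checked before each poll and a poll may use its full weight, a run can overshoot the budget by up to one weight. In this model, a device that always has packets pending is handled in runs of 5 polls (320 packets with budget 300 and weight 64) before the remaining work is deferred.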

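After the softirq, the packet enters the kernel protocol stack for processing. Along the way it also passes through netfilter, xfrm (IPsec) and related processing, which will be analyzed in detail later.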

An IP packet is processed as follows:

Hard interrupt (scheduling NAPI):

hardware interrupt-->do_IRQ-->handle_irq-->e1000_intr_msix_rx-->__napi_schedule(&adapter->napi)-->
____napi_schedule-->__raise_softirq_irqoff(NET_RX_SOFTIRQ)

Softirq (receive and local delivery):

do_IRQ-->irq_exit-->do_softirq-->call_softirq-->__do_softirq-->
net_rx_action-->e1000e_poll-->e1000_receive_skb-->napi_gro_receive-->
netif_receive_skb-->__netif_receive_skb-->__netif_receive_skb_core-->
deliver_skb-->ip_rcv-->NF_HOOK(NF_INET_PRE_ROUTING)-->
ip_rcv_finish-->dst_input-->ip_local_deliver-->
NF_HOOK(NF_INET_LOCAL_IN)-->ip_local_deliver_finish-->ipprot->handler()

Forwarding and transmit (when dst_input selects ip_forward instead of ip_local_deliver):

ip_forward-->NF_HOOK(NF_INET_FORWARD)-->ip_forward_finish-->
dst_output-->dst->output-->ip_output-->NF_HOOK_COND(NF_INET_POST_ROUTING)-->
ip_finish_output-->ip_finish_output2-->__ipv4_neigh_lookup_noref-->
dst_neigh_output-->neigh_hh_output-->dev_queue_xmit-->dev_hard_start_xmit-->ndo_start_xmit
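
Two patterns recur in the call chain above: a netfilter hook that either drops the packet or continues with a follow-up function, and a routing decision that picks the input function (local delivery or forwarding). The sketch below models both in user-space C. Everything prefixed sim_ or SIM_ is a hypothetical name for illustration; the real NF_HOOK macro and dst_input take different arguments, and the real routing lookup is far more involved than the address comparison assumed here.

```c
#include <stddef.h>

#define SIM_NF_DROP   0
#define SIM_NF_ACCEPT 1
#define SIM_TRACE_MAX 8

enum { SIM_PRE_ROUTING, SIM_LOCAL_IN, SIM_FORWARD, SIM_NUMHOOKS };

struct sim_pkt;
typedef int (*sim_fn)(struct sim_pkt *);
typedef int (*sim_hook)(const struct sim_pkt *);

struct sim_pkt {
    unsigned int daddr;       /* destination address of the packet */
    unsigned int local_addr;  /* address owned by this host (assumed) */
    sim_fn input;             /* chosen by "routing", like dst->input */
    const char *trace[SIM_TRACE_MAX];  /* functions visited, for inspection */
    int ntrace;
};

/* At most one hook per hook point in this sketch; NULL means no hook. */
static sim_hook sim_hooks[SIM_NUMHOOKS];

/* Example hook that drops every packet. */
static int sim_drop_all(const struct sim_pkt *p)
{
    (void)p;
    return SIM_NF_DROP;
}

static void sim_trace(struct sim_pkt *p, const char *name)
{
    if (p->ntrace < SIM_TRACE_MAX)
        p->trace[p->ntrace++] = name;
}

/* NF_HOOK-like helper: run the hook, then continue with okfn if accepted. */
static int sim_nf_hook(int hooknum, struct sim_pkt *p, sim_fn okfn)
{
    if (sim_hooks[hooknum] && sim_hooks[hooknum](p) == SIM_NF_DROP)
        return -1;
    return okfn(p);
}

static int sim_ip_local_deliver_finish(struct sim_pkt *p)
{
    sim_trace(p, "ip_local_deliver_finish");
    return 0;
}

static int sim_ip_local_deliver(struct sim_pkt *p)
{
    sim_trace(p, "ip_local_deliver");
    return sim_nf_hook(SIM_LOCAL_IN, p, sim_ip_local_deliver_finish);
}

static int sim_ip_forward_finish(struct sim_pkt *p)
{
    sim_trace(p, "ip_forward_finish");
    return 0;
}

static int sim_ip_forward(struct sim_pkt *p)
{
    sim_trace(p, "ip_forward");
    return sim_nf_hook(SIM_FORWARD, p, sim_ip_forward_finish);
}

/* "Routing": local destination goes up the stack, anything else is forwarded. */
static int sim_ip_rcv_finish(struct sim_pkt *p)
{
    sim_trace(p, "ip_rcv_finish");
    p->input = (p->daddr == p->local_addr) ? sim_ip_local_deliver
                                           : sim_ip_forward;
    return p->input(p);
}

static int sim_ip_rcv(struct sim_pkt *p)
{
    sim_trace(p, "ip_rcv");
    return sim_nf_hook(SIM_PRE_ROUTING, p, sim_ip_rcv_finish);
}
```

Calling sim_ip_rcv on a packet whose daddr equals local_addr records the local-delivery functions in trace; any other destination takes the forwarding branch, and installing sim_drop_all at SIM_FORWARD stops such a packet before sim_ip_forward_finish.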

I found a very good diagram of the protocol-stack receive and transmit flow online; thanks to the original author.

Reference:

http://blog.csdn.net/hui6075/article/details/51196056

Source: https://blog.csdn.net/wangyangzhizunwudi/article/details/99864501