[ovs-dev] [PATCHv18] netdev-afxdp: add new netdev type for AF_XDP.

Eelco Chaudron echaudro at redhat.com
Thu Aug 8 14:53:46 UTC 2019



On 8 Aug 2019, at 14:09, Ilya Maximets wrote:

> On 08.08.2019 14:42, Eelco Chaudron wrote:
>>
>>
>> On 19 Jul 2019, at 16:54, Ilya Maximets wrote:
>>
>>> On 18.07.2019 23:11, William Tu wrote:
>>>> The patch introduces experimental AF_XDP support for OVS netdev.
>>>> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux
>>>> socket type built upon the eBPF and XDP technology.  It aims to have
>>>> comparable performance to DPDK while cooperating better with the
>>>> existing kernel networking stack.  An AF_XDP socket receives and
>>>> sends packets from an eBPF/XDP program attached to the netdev,
>>>> bypassing a couple of the Linux kernel's subsystems.  As a result,
>>>> an AF_XDP socket shows much better performance than AF_PACKET.
>>>> For more details about AF_XDP, please see the Linux kernel's
>>>> Documentation/networking/af_xdp.rst.  Note that by default, this
>>>> feature is not compiled in.
>>>>
>>>> Signed-off-by: William Tu <u9012063 at gmail.com>
>>>
>>>
>>> Thanks, William, Eelco and Ben!
>>>
>>> I fixed a couple of things and applied it to master!
>>
>> Good to see this got merged into master while I was on PTO. However,
>> when I got back I decided to test it once more…
>>
>> When testing PVP I got a couple of packets through, and then it would
>> stall. I thought it might be my kernel, so I updated to yesterday's
>> latest, no luck…
>>
>> I did see a bunch of “eno1: send failed due to exhausted memory
>> pool.” messages in the log. Putting back patch v14 made my
>> problems go away…
>>
>> After some debugging, I noticed the problem was with the 
>> “continue” case in the afxdp_complete_tx() function.
>> Applying the following patch made it work again:
>>
>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
>> index b7cc0d988..9b335ddf0 100644
>> --- a/lib/netdev-afxdp.c
>> +++ b/lib/netdev-afxdp.c
>> @@ -823,16 +823,21 @@ afxdp_complete_tx(struct xsk_socket_info *xsk_info)
>>
>>          if (tx_to_free == BATCH_SIZE || j == tx_done - 1) {
>>              umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
>>              xsk_info->outstanding_tx -= tx_to_free;
>>              tx_to_free = 0;
>>          }
>>      }
>>
>> +    if (tx_to_free) {
>> +        umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
>> +        xsk_info->outstanding_tx -= tx_to_free;
>> +    }
>> +
>>      if (tx_done > 0) {
>>          xsk_ring_cons__release(&umem->cq, tx_done);
>>      } else {
>>          COVERAGE_INC(afxdp_cq_empty);
>>      }
>>  }
>
> Good catch! Will you submit a patch?
> BTW, to reduce the code duplication I'd suggest removing the 'continue'
> like this:
>
>     if (*addr != UINT64_MAX) {
>         Do work;
>     } else {
>         COVERAGE_INC(afxdp_cq_skip);
>     }

Done, patch has been sent out…
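
For reference, a minimal sketch of what the reworked loop could look like
with the leftover flush added and the 'continue' folded into an if/else.
Variable names are taken from the diff above; get_elem_from_addr() is a
hypothetical stand-in for the actual umem-address-to-element conversion in
netdev-afxdp.c:

    for (j = 0; j < tx_done; j++) {
        uint64_t *addr = (uint64_t *) xsk_ring_cons__comp_addr(&umem->cq,
                                                               idx_cq++);

        if (*addr != UINT64_MAX) {
            /* Queue the element for recycling and mark the slot so a
             * duplicate completion would be detected. */
            elems_push[tx_to_free++] = get_elem_from_addr(umem, *addr);
            *addr = UINT64_MAX;
        } else {
            /* Slot was already consumed earlier; should never happen. */
            COVERAGE_INC(afxdp_cq_skip);
        }

        if (tx_to_free == BATCH_SIZE) {
            umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
            xsk_info->outstanding_tx -= tx_to_free;
            tx_to_free = 0;
        }
    }

    /* Flush the remainder that did not fill a complete batch. */
    if (tx_to_free) {
        umem_elem_push_n(&umem->mpool, tx_to_free, elems_push);
        xsk_info->outstanding_tx -= tx_to_free;
    }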

>>
>>
>> Which made me wonder: why do we mark elements as being used? To my
>> knowledge (and looking at some of the code and examples), after an
>> xsk_ring_cons__release() a subsequent xsk_ring_cons__peek() should not
>> return any duplicate slots.
>>
>> I see a rather high number of afxdp_cq_skip events, which to my
>> knowledge should never happen.
>
> I tried to investigate this previously, but didn't find anything
> suspicious. So, to my knowledge, this should never happen either.
> However, I only looked at the code without actually running it, because
> I had no HW available for testing.
>
> While investigating and stress-testing virtual ports I found a few
> issues with missing locking inside the kernel, so I don't fully trust
> the kernel part of the XDP implementation. I suspect there are other
> bugs in kernel/libbpf that can only be reproduced with driver mode.
>
> This never happens for virtual ports in SKB mode, so I never saw this
> coverage counter being non-zero there.
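
For reference, the consume pattern I'd expect on the completion ring looks
roughly like this (a minimal sketch; recycle_frame() is a hypothetical
helper, and CONS_NUM_DESCS is the ring size used by OVS). Under this
contract a released slot should never show up in a later peek:

    uint32_t idx_cq = 0;
    unsigned int j, tx_done;

    /* Ask for up to CONS_NUM_DESCS completed descriptors. */
    tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq);

    for (j = 0; j < tx_done; j++) {
        /* Each completion entry holds the umem address of a sent frame. */
        uint64_t addr = *xsk_ring_cons__comp_addr(&umem->cq, idx_cq++);

        recycle_frame(umem, addr);
    }

    /* Hand the consumed slots back to the kernel; a later peek should
     * only ever return fresh completions. */
    xsk_ring_cons__release(&umem->cq, tx_done);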

Did some quick debugging, as something else has come up that needs my 
attention :)

But once I'm in a faulty state and send a single packet, causing 
afxdp_complete_tx() to be called, it tells me 2048 descriptors are 
ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess there 
might be some ring management bug. Maybe the consumer and producer are 
equal, meaning 0 buffers, but it returns the max? I did not look at the 
kernel code, so this is just a wild guess :)

(gdb) p tx_done
$3 = 2048

(gdb) p umem->cq
$4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = 2047,
      size = 2048, producer = 0x7f08486b8000, consumer = 0x7f08486b8040,
      ring = 0x7f08486b8080}
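
For context, the peek side in libbpf computes availability roughly like
this (paraphrased from libbpf's xsk.h, memory barriers omitted; exact code
varies by version), so a producer/consumer accounting bug in the kernel
would show up exactly as a bogus entry count here:

    static inline __u32
    xsk_cons_nb_avail(struct xsk_ring_cons *r, __u32 nb)
    {
        __u32 entries = r->cached_prod - r->cached_cons;

        if (entries == 0) {
            /* Refresh the cached producer index from shared memory. */
            r->cached_prod = *r->producer;
            entries = r->cached_prod - r->cached_cons;
        }

        return (entries > nb) ? nb : entries;
    }

    static inline __u32
    xsk_ring_cons__peek(struct xsk_ring_cons *cons, __u32 nb, __u32 *idx)
    {
        __u32 entries = xsk_cons_nb_avail(cons, nb);

        if (entries > 0) {
            *idx = cons->cached_cons;
            cons->cached_cons += entries;  /* Advance cached consumer. */
        }

        return entries;
    }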

>>
>> $ ovs-appctl coverage/show | grep xdp
>> afxdp_cq_empty           0.0/sec       339.600/sec       5.6606/sec   total: 20378
>> afxdp_tx_full            0.0/sec        29.967/sec       0.4994/sec   total: 1798
>> afxdp_cq_skip            0.0/sec  61884770.167/sec  1174238.3644/sec   total: 4227258112
>>
>>
>> You mentioned seeing this high number in your v15 change notes; did
>> you do any research into why?
>>
>> Cheers,
>>
>> Eelco

