[ovs-dev] [PATCH 2/2] dpif-netdev: Introduce netdev array cache

Eli Britstein elibr at nvidia.com
Thu Jul 15 14:10:27 UTC 2021


On 7/15/2021 4:35 PM, Ferriter, Cian wrote:
> External email: Use caution opening links or attachments
>
>
>> -----Original Message-----
>> From: Eli Britstein <elibr at nvidia.com>
>> Sent: Wednesday 14 July 2021 16:21
>> To: Ferriter, Cian <cian.ferriter at intel.com>; Ilya Maximets <i.maximets at ovn.org>; Gaëtan Rivet
>> <grive at u256.net>; dev at openvswitch.org; Van Haaren, Harry <harry.van.haaren at intel.com>
>> Cc: Majd Dibbiny <majd at nvidia.com>; Stokes, Ian <ian.stokes at intel.com>; Flavio Leitner
>> <fbl at sysclose.org>
>> Subject: Re: [ovs-dev] [PATCH 2/2] dpif-netdev: Introduce netdev array cache
>>
>>
>> On 7/14/2021 5:58 PM, Ferriter, Cian wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>>> -----Original Message-----
>>>> From: Ilya Maximets <i.maximets at ovn.org>
>>>> Sent: Friday 9 July 2021 21:53
>>>> To: Ferriter, Cian <cian.ferriter at intel.com>; Gaëtan Rivet <grive at u256.net>; Eli Britstein
>>>> <elibr at nvidia.com>; dev at openvswitch.org; Van Haaren, Harry <harry.van.haaren at intel.com>
>>>> Cc: Majd Dibbiny <majd at nvidia.com>; Ilya Maximets <i.maximets at ovn.org>; Stokes, Ian
>>>> <ian.stokes at intel.com>; Flavio Leitner <fbl at sysclose.org>
>>>> Subject: Re: [ovs-dev] [PATCH 2/2] dpif-netdev: Introduce netdev array cache
>>>>
>>>> On 7/8/21 6:43 PM, Ferriter, Cian wrote:
>>>>> Hi Gaetan, Eli and all,
>>>>>
>>>>> Thanks for the patch and the info on how it affects performance in your case. I just wanted to post
>>>>> the performance we are seeing.
>>>>> I've posted the numbers inline. Please note, I'll be away on leave till Tuesday.
>>>>> Thanks,
>>>>> Cian
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Gaëtan Rivet <grive at u256.net>
>>>>>> Sent: Wednesday 7 July 2021 17:36
>>>>>> To: Eli Britstein <elibr at nvidia.com>; <dev at openvswitch.org> <dev at openvswitch.org>; Van Haaren, Harry
>>>>>> <harry.van.haaren at intel.com>; Ferriter, Cian <cian.ferriter at intel.com>
>>>>>> Cc: Majd Dibbiny <majd at nvidia.com>; Ilya Maximets <i.maximets at ovn.org>
>>>>>> Subject: Re: [ovs-dev] [PATCH 2/2] dpif-netdev: Introduce netdev array cache
>>>>>>
>>>>>> On Wed, Jul 7, 2021, at 17:05, Eli Britstein wrote:
>>>>>>> Port numbers are usually small. Maintain an array of netdev handles indexed
>>>>>>> by port numbers. It accelerates looking them up for
>>>>>>> netdev_hw_miss_packet_recover().
>>>>>>>
>>>>>>> Reported-by: Cian Ferriter <cian.ferriter at intel.com>
>>>>>>> Signed-off-by: Eli Britstein <elibr at nvidia.com>
>>>>>>> Reviewed-by: Gaetan Rivet <gaetanr at nvidia.com>
>>>>>>> ---
>>>>> <snipped patch contents>
>>>>>
>>>>>>> _______________________________________________
>>>>>>> dev mailing list
>>>>>>> dev at openvswitch.org
>>>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>>>> Hello,
>>>>>>
>>>>>> I tested the performance impact of this patch with a partial offload setup.
>>>>>> As reported by pmd-stats-show, in average cycles per packet:
>>>>>>
>>>>>> Before vxlan-decap: 525 c/p
>>>>>> After vxlan-decap: 542 c/p
>>>>>> After this fix: 530 c/p
>>>>>>
>>>>>> Without those fixes, vxlan-decap has a 3.2% negative impact on cycles;
>>>>>> with the fixes, the impact is reduced to 0.95%.
>>>>>>
>>>>>> As I had to force partial offloads for our hardware, it would be better
>>>>>> with an outside confirmation on a proper setup.
>>>>>>
>>>>>> Kind regards,
>>>>>> --
>>>>>> Gaetan Rivet
>>>>> I'm showing the performance relative to what we measured on OVS master directly before the VXLAN
>>>>> HWOL changes went in. All of the below results are using the scalar DPIF and partial HWOL.
>>>>> Link to "Fixup patches":
>>>>> http://patchwork.ozlabs.org/project/openvswitch/list/?series=252356
>>>>> Master before VXLAN HWOL changes (f0e4a73)
>>>>> 1.000x
>>>>>
>>>>> Latest master after VXLAN HWOL changes (b780911)
>>>>> 0.918x (-8.2%)
>>>>>
>>>>> After fixup patches on OVS ML are applied (with ALLOW_EXPERIMENTAL_API=off)
>>>>> 0.973x (-2.7%)
>>>>>
>>>>> After fixup patches on OVS ML are applied and after ALLOW_EXPERIMENTAL_API is removed.
>>>>> 0.938x (-6.2%)
>>>>>
>>>>> I ran the last set of results by applying the below diff. I did this because I'm assuming the plan
>>>>> is to remove the ALLOW_EXPERIMENTAL_API '#ifdef's at some point?
>>>>
>>>> Yes, that is the plan.
>>>>
>>> Thanks for confirming this.
>>>
>>>> And thanks for testing, Gaetan and Cian!
>>>>
>>>> Could you also provide more details on your test environment,
>>>> so someone else can reproduce?
>>>>
>>> Good idea, I'll add the details inline below. These details apply to the performance measured
>>> previously by me, and the performance in this mail.
>>>> What is important to know:
>>>> - Test configuration: P2P, V2V, PVP, etc.
>>> P2P
>>> 1 PHY port
>>> 1 RXQ
>>>
>>>> - Test type: max. throughput, zero packet loss.
>>> Max throughput.
>>>
>>>> - OVS config: EMC, SMC, HWOL, AVX512 - on/off/type
>>> In all tests, all packets hit a single datapath flow with "offloaded:partial". So all packets are
>>> partially offloaded, skipping miniflow_extract() and EMC/SMC/DPCLS lookups.
>>> AVX512 is off.
>>>
>>>> - Installed OF rules.
>>> $ $OVS_DIR/utilities/ovs-ofctl dump-flows br0
>>>    cookie=0x0, duration=253.691s, table=0, n_packets=2993867136, n_bytes=179632028160, in_port=phy0 actions=IN_PORT
>>>> - Traffic pattern: Packet size, number of flows, packet type.
>>> 64B, 1 flow, ETH/IP packets.
>>>
>>>> These tests also didn't include the fix from Balazs, IIUC, because
>>>> they were performed a bit before that patch got accepted.
>>>>
>>> Correct, the above tests didn't include the optimization from Balazs.
>>>
>>>> And Flavio reported what seems to be noticeable performance
>>>> drop due to just accepted AVX512 DPIF implementation for the
>>>> non-HWOL non-AVX512 setup:
>>>>
>>>> https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385448.html
>>> We are testing partial HWOL setups, so I don't think Flavio's mail is relevant to this.
>>>
>>>> So, it seems that everything will need to be re-tested anyway
>>>> in order to understand the current situation.
>>>>
>>> Agreed, let's retest the performance. I've included the new numbers below:
>>>
>>> I'm showing the performance relative to what we measured on OVS master directly before the VXLAN
>>> HWOL changes went in. All of the below results are using the scalar DPIF and partial HWOL.
>>> Link to "Fixup patches" (v2):
>>> http://patchwork.ozlabs.org/project/openvswitch/list/?series=253488
>>> Master before VXLAN HWOL changes. (f0e4a73)
>>> 1.000x
>>>
>>> Master after VXLAN HWOL changes. (d53ea18)
>>> 0.964x (-3.6%)
>>>
>>> After rebased fixup patches on OVS ML are applied. (with ALLOW_EXPERIMENTAL_API=off)
>>> 0.993x (-0.7%)
>>>
>>> After rebased fixup patches on OVS ML are applied and after ALLOW_EXPERIMENTAL_API is removed.
>>> 0.961x (-3.9%)
>> According to this, we are better off without the array cache commit, at
>> least in this test (-3.6% vs -3.9%). Isn't it?
>>
>> In our internal tests we saw that it actually improves performance, though it
>> does not gain back all of the degradation.
>>
>> What do you think?
>>
> I think we are better off keeping the array cache commit. It's a good idea and replaces a HMAP_FOR_EACH_IN_BUCKET() with a direct port_no to netdev lookup, which is what we need for the netdev_hw_miss_packet_recover() call.
>
> Yes, I show a slight degradation with that, but it's small, and my test scenario only has one TX port. I think that one TX port hides the benefit of the array cache commit. When we have a more realistic number of TX ports, the direct lookup should be quicker than the HMAP_FOR_EACH.
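
Right. With the cache, the per-packet lookup goes from a hash-bucket walk 
to a single array index. Roughly (names follow dpif-netdev; the exact 
field name and the sizing/fallback handling in the patch differ):

    /* Before: walk the send_port_cache bucket for every packet. */
    struct tx_port *tx;

    HMAP_FOR_EACH_IN_BUCKET (tx, node, hash_port_no(port_no),
                             &pmd->send_port_cache) {
        if (tx->port->port_no == port_no) {
            netdev = tx->port->netdev;
            break;
        }
    }

    /* After: one load, indexed by the (usually small) port number. */
    netdev = pmd->netdev_cache[odp_to_u32(port_no)];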

I thought about making the array hold struct tx_port * entries, not 
only netdevs. In that case all calls to pmd_send_port_cache_lookup() 
are optimized, and the new pmd_netdev_cache_lookup() function along 
with the changes in dp_netdev_hw_flow() can be omitted.
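
Something like this, as an untested sketch (the tx_port_cache array and 
its sizing/invalidation on port changes are assumptions, not code from 
the patch):

    /* Cache struct tx_port pointers, not just netdevs, so every
     * pmd_send_port_cache_lookup() caller takes the fast path. */
    static inline struct tx_port *
    pmd_tx_port_cache_lookup(const struct dp_netdev_pmd_thread *pmd,
                             odp_port_t port_no)
    {
        uint32_t idx = odp_to_u32(port_no);

        if (OVS_LIKELY(idx < ARRAY_SIZE(pmd->tx_port_cache))) {
            return pmd->tx_port_cache[idx];
        }
        /* Large port numbers fall back to the existing hmap lookup. */
        return pmd_send_port_cache_lookup(pmd, port_no);
    }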

Our tests (which are also not very realistic, in the same sense) showed 
a very slight improvement or degradation, depending on whether it was a 
vxlan or non-vxlan setup (I don't have the precise results).

As the difference was very slight (either way), I chose to focus only on 
the usage for netdev_hw_miss_packet_recover().

Following your comment, a more realistic setup may indeed benefit from 
this even more.

What do you think?

>
> My remaining concern is with PATCH 1/2 of this series.
> In that patch, we are hiding the overhead of the packet restoration API under ALLOW_EXPERIMENTAL_API, but this is a temporary fix. In the future when ALLOW_EXPERIMENTAL_API is removed from the packet restoration API, that overhead will once again be present.

That's right.

I think that until it is removed, this approach is better for setups 
built without experimental support, and it also gives us the opportunity 
to optimize further in the meantime, maybe by taking advantage of future 
patches. For example this series:

https://patchwork.ozlabs.org/project/openvswitch/list/?series=253593

For example, if we move netdev_hw_miss_packet_recover() to be a 
dpif-offload function, we won't need to provide the netdev at all.
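
Purely as an illustration of the direction (no such API exists today; 
the name and parameters below are made up):

    /* Hypothetical dpif-offload level hook: the offload layer that
     * installed the partial-offload rule already knows which provider
     * handles the port, so the datapath would not need to resolve a
     * struct netdev at all. */
    int dpif_offload_packet_recover(odp_port_t port_no,
                                    struct dp_packet *packet);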

>
>>> So the performance is looking better now because of the optimization from Balazs. This together with
>>> #ifdefing out the code brings the performance almost to where it was before.
>>> I'm worried that the #ifdef is a temporary solution for the partial HWOL case, since eventually that
>>> will be removed. This leaves us with a -3.9% degradation.
>>> Thanks,
>>> Cian
>>>
>>>> Best regards, Ilya Maximets.
>>>>
>>>>> Diff:
>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>>> index accb23a1a..0e29c609f 100644
>>>>> --- a/lib/dpif-netdev.c
>>>>> +++ b/lib/dpif-netdev.c
>>>>> @@ -7132,7 +7132,6 @@ dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
>>>>>        struct netdev *netdev OVS_UNUSED;
>>>>>        uint32_t mark;
>>>>>
>>>>> -#ifdef ALLOW_EXPERIMENTAL_API /* Packet restoration API required. */
>>>>>        /* Restore the packet if HW processing was terminated before completion. */
>>>>>        netdev = pmd_netdev_cache_lookup(pmd, port_no);
>>>>>        if (OVS_LIKELY(netdev)) {
>>>>> @@ -7143,7 +7142,6 @@ dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
>>>>>                return -1;
>>>>>            }
>>>>>        }
>>>>> -#endif
>>>>>
>>>>>        /* If no mark, no flow to find. */
>>>>>        if (!dp_packet_has_flow_mark(packet, &mark)) {
>>>>>
