[ovs-dev] [PATCH 2/2] dpif-netdev: Introduce netdev array cache

Ferriter, Cian cian.ferriter at intel.com
Thu Jul 15 13:35:10 UTC 2021



> -----Original Message-----
> From: Eli Britstein <elibr at nvidia.com>
> Sent: Wednesday 14 July 2021 16:21
> To: Ferriter, Cian <cian.ferriter at intel.com>; Ilya Maximets <i.maximets at ovn.org>; Gaëtan Rivet
> <grive at u256.net>; dev at openvswitch.org; Van Haaren, Harry <harry.van.haaren at intel.com>
> Cc: Majd Dibbiny <majd at nvidia.com>; Stokes, Ian <ian.stokes at intel.com>; Flavio Leitner
> <fbl at sysclose.org>
> Subject: Re: [ovs-dev] [PATCH 2/2] dpif-netdev: Introduce netdev array cache
> 
> 
> On 7/14/2021 5:58 PM, Ferriter, Cian wrote:
> >
> >> -----Original Message-----
> >> From: Ilya Maximets <i.maximets at ovn.org>
> >> Sent: Friday 9 July 2021 21:53
> >> To: Ferriter, Cian <cian.ferriter at intel.com>; Gaëtan Rivet <grive at u256.net>; Eli Britstein
> >> <elibr at nvidia.com>; dev at openvswitch.org; Van Haaren, Harry <harry.van.haaren at intel.com>
> >> Cc: Majd Dibbiny <majd at nvidia.com>; Ilya Maximets <i.maximets at ovn.org>; Stokes, Ian
> >> <ian.stokes at intel.com>; Flavio Leitner <fbl at sysclose.org>
> >> Subject: Re: [ovs-dev] [PATCH 2/2] dpif-netdev: Introduce netdev array cache
> >>
> >> On 7/8/21 6:43 PM, Ferriter, Cian wrote:
> >>> Hi Gaetan, Eli and all,
> >>>
> >>> Thanks for the patch and the info on how it affects performance in your case. I just wanted to
> >>> post the performance we are seeing.
> >>> I've posted the numbers inline. Please note, I'll be away on leave till Tuesday.
> >>> Thanks,
> >>> Cian
> >>>
> >>>> -----Original Message-----
> >>>> From: Gaëtan Rivet <grive at u256.net>
> >>>> Sent: Wednesday 7 July 2021 17:36
> >>>> To: Eli Britstein <elibr at nvidia.com>; <dev at openvswitch.org> <dev at openvswitch.org>; Van Haaren,
> >>>> Harry
> >>>> <harry.van.haaren at intel.com>; Ferriter, Cian <cian.ferriter at intel.com>
> >>>> Cc: Majd Dibbiny <majd at nvidia.com>; Ilya Maximets <i.maximets at ovn.org>
> >>>> Subject: Re: [ovs-dev] [PATCH 2/2] dpif-netdev: Introduce netdev array cache
> >>>>
> >>>> On Wed, Jul 7, 2021, at 17:05, Eli Britstein wrote:
> >>>>> Port numbers are usually small. Maintain an array of netdev handles indexed
> >>>>> by port numbers. It accelerates looking up for them for
> >>>>> netdev_hw_miss_packet_recover().
> >>>>>
> >>>>> Reported-by: Cian Ferriter <cian.ferriter at intel.com>
> >>>>> Signed-off-by: Eli Britstein <elibr at nvidia.com>
> >>>>> Reviewed-by: Gaetan Rivet <gaetanr at nvidia.com>
> >>>>> ---
> >>> <snipped patch contents>
> >>>
> >>>>> _______________________________________________
> >>>>> dev mailing list
> >>>>> dev at openvswitch.org
> >>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >>>>>
> >>>> Hello,
> >>>>
> >>>> I tested the performance impact of this patch with a partial offload setup.
> >>>> As reported by pmd-stats-show, in average cycles per packet:
> >>>>
> >>>> Before vxlan-decap: 525 c/p
> >>>> After vxlan-decap: 542 c/p
> >>>> After this fix: 530 c/p
> >>>>
> >>>> Without those fixes, vxlan-decap has a 3.2% negative impact on cycles,
> >>>> with the fixes, the impact is reduced to 0.95%.
> >>>>
> >>>> As I had to force partial offloads for our hardware, it would be better
> >>>> with an outside confirmation on a proper setup.
> >>>>
> >>>> Kind regards,
> >>>> --
> >>>> Gaetan Rivet
> >>> I'm showing the performance relative to what we measured on OVS master directly before the VXLAN
> >>> HWOL changes went in. All of the below results are using the scalar DPIF and partial HWOL.
> >>> Link to "Fixup patches":
> https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpatchwork.ozlabs.org%2Fproject%2Fopen
> vswitch%2Flist%2F%3Fseries%3D252356&data=04%7C01%7Celibr%40nvidia.com%7C7ca0caf9434e429e4ffd08d946
> d7cf2f%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637618715041410254%7CUnknown%7CTWFpbGZsb3d8eyJWIjo
> iMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Y62OrCRyS00vJHHPQvAHyhG5C
> 4eO%2FSfWMCSPtszn3Is%3D&reserved=0
> >>>
> >>> Master before VXLAN HWOL changes (f0e4a73)
> >>> 1.000x
> >>>
> >>> Latest master after VXLAN HWOL changes (b780911)
> >>> 0.918x (-8.2%)
> >>>
> >>> After fixup patches on OVS ML are applied (with ALLOW_EXPERIMENTAL_API=off)
> >>> 0.973x (-2.7%)
> >>>
> >>> After fixup patches on OVS ML are applied and after ALLOW_EXPERIMENTAL_API is removed.
> >>> 0.938x (-6.2%)
> >>>
> >>> I ran the last set of results by applying the below diff. I did this because I'm assuming the plan
> >>> is to remove the ALLOW_EXPERIMENTAL_API '#ifdef's at some point?
> >>
> >> Yes, that is the plan.
> >>
> > Thanks for confirming this.
> >
> >> And thanks for testing, Gaetan and Cian!
> >>
> >> Could you also provide more details on your test environment,
> >> so someone else can reproduce?
> >>
> > Good idea, I'll add the details inline below. These details apply to the performance I measured
> > previously, and to the performance numbers in this mail.
> >
> >> What is important to know:
> >> - Test configuration: P2P, V2V, PVP, etc.
> >
> > P2P
> > 1 PHY port
> > 1 RXQ
> >
> >> - Test type: max. throughput, zero packet loss.
> > Max throughput.
> >
> >> - OVS config: EMC, SMC, HWOL, AVX512 - on/off/type
> > In all tests, all packets hit a single datapath flow with "offloaded:partial". So all packets are
> > partially offloaded, skipping miniflow_extract() and EMC/SMC/DPCLS lookups.
> >
> > AVX512 is off.
> >
> >> - Installed OF rules.
> > $ $OVS_DIR/utilities/ovs-ofctl dump-flows br0
> >   cookie=0x0, duration=253.691s, table=0, n_packets=2993867136, n_bytes=179632028160, in_port=phy0
> >   actions=IN_PORT
> >
> >> - Traffic pattern: Packet size, number of flows, packet type.
> > 64B, 1 flow, ETH/IP packets.
> >
> >> These tests also didn't include the fix from Balazs, IIUC, because
> >> they were performed a bit before that patch got accepted.
> >>
> > Correct, the above tests didn't include the optimization from Balazs.
> >
> >> And Flavio reported what seems to be noticeable performance
> >> drop due to just accepted AVX512 DPIF implementation for the
> >> non-HWOL non-AVX512 setup:
> >>
> >> https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385448.html
> >>
> > We are testing partial HWOL setups, so I don't think Flavio's mail is relevant to this.
> >
> >> So, it seems that everything will need to be re-tested anyway
> >> in order to understand what the current situation is.
> >>
> > Agreed, let's retest the performance. I've included the new numbers below:
> >
> > I'm showing the performance relative to what we measured on OVS master directly before the VXLAN
> > HWOL changes went in. All of the below results are using the scalar DPIF and partial HWOL.
> > Link to "Fixup patches" (v2):
> https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpatchwork.ozlabs.org%2Fproject%2Fopen
> vswitch%2Flist%2F%3Fseries%3D253488&data=04%7C01%7Celibr%40nvidia.com%7C7ca0caf9434e429e4ffd08d946
> d7cf2f%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637618715041410254%7CUnknown%7CTWFpbGZsb3d8eyJWIjo
> iMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=ejj3KdwvnZwYfiYM%2BEcpL5t
> nb3JIbtdhMmaMMhxJkSo%3D&reserved=0
> >
> > Master before VXLAN HWOL changes. (f0e4a73)
> > 1.000x
> >
> > Master after VXLAN HWOL changes. (d53ea18)
> > 0.964x (-3.6%)
> >
> > After rebased fixup patches on OVS ML are applied. (with ALLOW_EXPERIMENTAL_API=off)
> > 0.993x (-0.7%)
> >
> > After rebased fixup patches on OVS ML are applied and after ALLOW_EXPERIMENTAL_API is removed.
> > 0.961x (-3.9%)
> 
> According to this, we are better off without the array cache commit, at
> least in this test (-3.6% vs -3.9%). Isn't it?
> 
> In our internal tests we saw that it actually improves performance, though it doesn't gain
> back all of the degradation.
> 
> What do you think?
> 

I think we are better off keeping the array cache commit. It's a good idea: it replaces an HMAP_FOR_EACH_IN_BUCKET() walk with a direct port_no-to-netdev lookup, which is exactly what we need for the netdev_hw_miss_packet_recover() call.

Yes, my numbers show a slight degradation with it, but the difference is small, and my test scenario only has one TX port, which hides the benefit of the array cache commit. With a more realistic number of TX ports, the direct lookup should be quicker than the HMAP_FOR_EACH_IN_BUCKET() walk.
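For anyone following along who hasn't read the patch, here is a rough sketch of the idea being discussed. The names below (pmd_netdev_cache, netdev_cache_lookup, netdev_cache_set, NETDEV_CACHE_SIZE) are simplified stand-ins I made up for illustration, not the actual dpif-netdev code; the point is only that on the per-packet path the port number indexes a flat array, so there is no hashing and no bucket walk:

/* Rough sketch only -- simplified stand-in names and types, not the actual
 * patch code.  A flat array indexed by port number turns the per-packet
 * netdev lookup into a bounds check plus a load, instead of hashing the
 * port number and walking a hash bucket. */

#include <stdint.h>
#include <stddef.h>

#define NETDEV_CACHE_SIZE 1024      /* assumed upper bound on port numbers */

struct netdev;                      /* opaque here, as in the real code */

struct pmd_netdev_cache {
    struct netdev *entries[NETDEV_CACHE_SIZE];
};

/* Per-packet fast path: O(1), branch-light. */
static inline struct netdev *
netdev_cache_lookup(const struct pmd_netdev_cache *cache, uint32_t port_no)
{
    return port_no < NETDEV_CACHE_SIZE ? cache->entries[port_no] : NULL;
}

/* Slow path (port add/remove/reload): keep the array in sync. */
static inline void
netdev_cache_set(struct pmd_netdev_cache *cache, uint32_t port_no,
                 struct netdev *netdev)
{
    if (port_no < NETDEV_CACHE_SIZE) {
        cache->entries[port_no] = netdev;
    }
}

The real patch obviously also has to handle ports outside the cached range and keep the array coherent across PMD reloads, but the fast-path shape above is what matters for the numbers in this thread.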

My remaining concern is with PATCH 1/2 of this series.
In that patch we hide the overhead of the packet restoration API behind ALLOW_EXPERIMENTAL_API, but that is only a temporary fix: once the experimental guard is removed from the packet restoration API, the overhead will be back.

> >
> > So the performance is looking better now because of the optimization from Balazs. This together with
> > #ifdefing out the code brings the performance almost to where it was before.
> >
> > I'm worried that the #ifdef is a temporary solution for the partial HWOL case, since eventually that
> > will be removed. This leaves us with a -3.9% degradation.
> >
> > Thanks,
> > Cian
> >
> >> Best regards, Ilya Maximets.
> >>
> >>> Diff:
> >>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> >>> index accb23a1a..0e29c609f 100644
> >>> --- a/lib/dpif-netdev.c
> >>> +++ b/lib/dpif-netdev.c
> >>> @@ -7132,7 +7132,6 @@ dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
> >>>       struct netdev *netdev OVS_UNUSED;
> >>>       uint32_t mark;
> >>>
> >>> -#ifdef ALLOW_EXPERIMENTAL_API /* Packet restoration API required. */
> >>>       /* Restore the packet if HW processing was terminated before completion. */
> >>>       netdev = pmd_netdev_cache_lookup(pmd, port_no);
> >>>       if (OVS_LIKELY(netdev)) {
> >>> @@ -7143,7 +7142,6 @@ dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
> >>>               return -1;
> >>>           }
> >>>       }
> >>> -#endif
> >>>
> >>>       /* If no mark, no flow to find. */
> >>>       if (!dp_packet_has_flow_mark(packet, &mark)) {
> >>>

