[ovs-dev] [RFC Patch] dpif-netdev: Sorted subtable vectors per in_port in dpcls

Jan Scheurich jan.scheurich at ericsson.com
Fri Jul 1 10:03:16 UTC 2016


> >> Did you consider keeping a separate dpcls instance per input port instead?
> >> I.e., have an hmap of dpcls'es using the in_port as the hash key, and
> >> creating and destroying them on demand. This would avoid the fixed
> >> size array, simplify the code and each insert and remove operation
> >> would be a bit cheaper. A miss would typically go through a lower
> >> number of subtables, even though the upcall cost would likely hide
> >> this benefit in practice. We are always exact matching in_port (as
> >> well as dl_type and recirc_id, see xlate_wc_init()), so this would
> >> add no overhead in terms of the overall number of megaflows.
> >>
> >
> > I see your point. I guess this would be a somewhat bigger change, but I can
> > have a look at it.
> >
> 
> While the change is a bit bigger conceptually, in terms of code changes it
> might not be that bad.

I have created a second version of the patch that implements the "dpcls per in_port" idea. It turned out to be reasonably straightforward; I will send the new patch out during the day. Unfortunately it does not provide any significant performance increase over the first version, but it is indeed a bit simpler.
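
To make the idea concrete, a stripped-down standalone sketch of a per-in_port classifier lookup could look roughly like the following. All names are made up for illustration; this is not the actual dpif-netdev code, which uses the existing cmap/dpcls structures instead of a private hash table.

    /* Simplified standalone sketch, not the actual OVS code: classifiers
     * are kept in a small chained hash table keyed by in_port and created
     * on demand. */
    #include <stdint.h>
    #include <stdlib.h>

    #define PORT_CLS_BUCKETS 64

    struct port_cls {
        uint32_t in_port;          /* key: input port number */
        struct port_cls *next;     /* hash-bucket chain */
        /* ... per-port classifier state (subtable vector, ...) ... */
    };

    static struct port_cls *buckets[PORT_CLS_BUCKETS];

    /* Return the classifier for 'in_port', creating it on first use. */
    static struct port_cls *
    port_cls_lookup_or_create(uint32_t in_port)
    {
        unsigned int idx = in_port % PORT_CLS_BUCKETS;
        struct port_cls *cls;

        for (cls = buckets[idx]; cls; cls = cls->next) {
            if (cls->in_port == in_port) {
                return cls;                /* already present */
            }
        }
        cls = calloc(1, sizeof *cls);      /* create on demand */
        if (cls) {
            cls->in_port = in_port;
            cls->next = buckets[idx];
            buckets[idx] = cls;
        }
        return cls;
    }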
 
> > I am a bit worried about the memory footprint and the impact on CPU
> > cache performance because each PMD core would have its own set of
> > dpcls'es, even though with RSS and vhost multi-queue support each
> > in_port will typically be served by many PMD cores.
> 
> I guess only a proper performance test exercising this will tell if it makes a
> difference.

We have a test-bed where we can check the multi-core scalability of the patch to some extent. I will try to provide measurements soon.

> > In theory it should be enough to have a single dpcls instance per in_port
> > shared by all PMDs which contains all the read-only data. Each PMD would
> > cache relevant parts of that in its private L1 cache, so there should not be any
> > performance impact. Volatile data such as flow stats would have to be kept
> > per PMD thread as today. This should also avoid multiple up-calls installing
> > the same megaflow in different PMD dpcls'es.
> >
> > What do you think?
> 
> Write access to cmaps requires mutual exclusion lock among writers which
> might hurt flow set-up performance. Also, if multiple packets of a single flow
> are processed by multiple PMD threads, they would likely all execute the
> upcalls anyway, and then lock each other out while trying to install the flow in
> the dpcls. We actually started with a single dpcls for the entire userspace
> datapath and the current division to one per PMD thread was introduced as
> an optimization to avoid this locking penalty.

OK. While troubleshooting the new patch I realized that the upcalls and the resulting installation of new megaflow entries are actually executed by the PMD threads themselves. That is of course a good reason to have independent dpcls data structures per PMD thread. I was coming from a different vSwitch architecture, where a single control-plane thread was the only one modifying the classifier data structures and the PMD threads only did look-ups on that central data.
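
As a rough standalone illustration of that ownership model (made-up names, not the real dpif-netdev structures): because the upcall and the resulting flow installation run in the same PMD thread, each classifier has a single writer and insertion needs no lock against the other PMD threads.

    #include <stdint.h>
    #include <stdlib.h>

    struct megaflow {              /* placeholder flow entry */
        uint64_t match_key;
        struct megaflow *next;
    };

    struct pmd_thread {
        struct megaflow *flows;    /* private to this PMD thread */
    };

    /* Called from the owning PMD thread after its own upcall returned:
     * single writer per classifier, so a plain list insert suffices. */
    static void
    pmd_install_megaflow(struct pmd_thread *pmd, uint64_t match_key)
    {
        struct megaflow *mf = calloc(1, sizeof *mf);

        if (mf) {
            mf->match_key = match_key;
            mf->next = pmd->flows;
            pmd->flows = mf;
        }
    }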

> > One question, though. Wouldn't it be cleaner to create a new pvector data
> > type variant without the multithread safety? I have the impression that the
> > client code in dpif-netdev.c is now more exposed to implementation details
> > of pvector than it should be, no?
> 
> Essentially pvector_impl is now the variant without thread safety. I
> contemplated naming the pvector_impl to something else as it is now
> exposed, but did not come up with a good name. Suggestions?

In my new patch I keep a hit counter in struct subtable and only periodically use those counters as new priorities in the subtable pvector. With that approach I can live with the existing pvector API. I don't think there is a need to optimize this further.
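
For illustration, the periodic re-ranking could be sketched roughly as below (standalone code with made-up names; in the actual patch the sorted order is expressed through the priorities of the existing subtable pvector rather than a qsort over a plain array).

    #include <stdint.h>
    #include <stdlib.h>

    struct subtable {
        uint64_t hit_cnt;          /* incremented on every lookup hit */
        /* ... mask, rules, ... */
    };

    /* qsort comparator: highest hit count first. */
    static int
    compare_hits_desc(const void *a_, const void *b_)
    {
        const struct subtable *a = *(const struct subtable *const *) a_;
        const struct subtable *b = *(const struct subtable *const *) b_;

        return (a->hit_cnt < b->hit_cnt) - (a->hit_cnt > b->hit_cnt);
    }

    /* Run periodically: order the subtables by recent hit counts so the
     * busiest ones are probed first, then reset the counters so the
     * ranking keeps tracking current traffic. */
    static void
    resort_subtables(struct subtable **vec, size_t n)
    {
        size_t i;

        qsort(vec, n, sizeof vec[0], compare_hits_desc);
        for (i = 0; i < n; i++) {
            vec[i]->hit_cnt = 0;
        }
    }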

BR, Jan


