[ovs-dev] [PATCH v2 3/4] Add file descriptor persistence where possible

Anton Ivanov anton.ivanov at cambridgegreys.com
Wed Feb 19 17:20:04 UTC 2020



On 19/02/2020 14:20, Dumitru Ceara wrote:
> On 2/18/20 7:12 AM, Anton Ivanov wrote:
>>
>>
>> On 17/02/2020 14:48, Dumitru Ceara wrote:
>>> On 2/14/20 6:54 PM, anton.ivanov at cambridgegreys.com wrote:
>>>> From: Anton Ivanov <anton.ivanov at cambridgegreys.com>
>>>>
>>>> 1. Adds "persistent" behaviour where feasible (streams and signals).
>>>> These are waited upon in the same thread where they are created. This
>>>> allows them to be registered persistently with the OS (if possible)
>>>> and allows the OS to provide hints - is the FD ready, is it closed,
>>>> etc.
>>>>
>>>> 2. Removes unnecessary read attempts, which would only fail with
>>>> EAGAIN, on an fd which is not ready, provided that fd has been
>>>> registered as "private" to the thread which waits upon it.
>>>>
>>>> 3. No longer breaks other parts of OVS which create the fd in one
>>>> thread and wait upon it in others.
>>>>
>>>> 4. Adds support for EPOLL on Linux and can be expanded to cover similar
>>>> poll++ frameworks in other OSes.
>>>>
>>>> 5. Sets up the necessary infrastructure to make IO/SSL multi-threaded
>>>> using a "central (e)poll dispatcher + IO threads" pattern.
>>>>
>>>> Signed-off-by: Anton Ivanov <anton.ivanov at cambridgegreys.com>
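
For illustration, here is a minimal sketch of the persistent registration
idea described above, using plain Linux epoll; the connection structure
and all names are illustrative only and not this patch's actual API:

#include <sys/epoll.h>
#include <unistd.h>

struct conn {
    int fd;
    /* Per-connection state (buffers, SSL, etc.) would live here. */
};

int
run_event_loop(struct conn *conns, int n_conns)
{
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        return -1;
    }

    /* Register each fd exactly once.  The kernel retains the interest
     * list, so there is no per-iteration re-registration and no O(n)
     * copy of a pollfd array on every loop. */
    for (int i = 0; i < n_conns; i++) {
        struct epoll_event ev = {
            .events = EPOLLIN | EPOLLRDHUP,  /* readable, or peer closed */
            .data.ptr = &conns[i],
        };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, conns[i].fd, &ev) < 0) {
            close(epfd);
            return -1;
        }
    }

    for (;;) {
        struct epoll_event events[64];
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            struct conn *c = events[i].data.ptr;
            /* Only fds the kernel reports as ready are serviced, so
             * idle connections cost nothing per iteration. */
            (void) c;  /* service_conn(c) in a real loop. */
        }
    }
}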
>>>
>>> Hi Anton,
>>>
>>> A couple of issues inline. Except for that:
>>>
>>> 1. The "STP - flush the fdb and mdb when topology changed" OVS test is
>>> failing with your patches applied:
>>>
>>> make check TESTSUITEFLAGS='-k "flush the fdb"'
>>
>> I will have a look.
>>
>>>
>>> 2. Travis CI build fails:
>>>
>>> lib/fatal-signal.c:244:5: error: ignoring return value of ‘read’,
>>> declared with attribute warn_unused_result [-Werror=unused-result]
>>>
>>>        read(signal_fds[0], sigbuffer, sizeof(sigbuffer))
>>>
>>> 3. Travis CI OSX build fails:
>>>
>>> lib/poll-loop.c:46:1: error: unused function 'poll_create_node_add'
>>> [-Werror,-Wunused-function]
>>>
>>> COVERAGE_DEFINE(poll_create_node);
>>
>> I will fix all of these in the next version. The CI was out with the
>> daisies when I submitted the patch last week so I did not see the logs
>> until yesterday.
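
For the read() warning specifically, one common way to satisfy
-Werror=unused-result is to consume the return value explicitly.  A
hedged sketch of that shape (the buffer name mirrors the quoted error;
the handling is illustrative and not necessarily what the next version
will do):

#include <errno.h>
#include <unistd.h>

static void
drain_signal_pipe(int fd)
{
    char sigbuffer[512];
    ssize_t n;

    /* Drain whatever is queued on the (non-blocking) fd, retrying on
     * EINTR. */
    do {
        n = read(fd, sigbuffer, sizeof sigbuffer);
    } while (n > 0 || (n < 0 && errno == EINTR));

    /* n == 0 (EOF) or n < 0 with EAGAIN: nothing left to drain.  The
     * return value has been inspected, which is what the
     * warn_unused_result attribute asks for. */
}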
>>
>>>
>>> 4. While OVS might benefit from these changes, I'm wondering about
>>> OVN and ovsdb-server specifically. ovsdb-server is single-threaded,
>>> and usually on large-scale deployments we don't really see "poll" as
>>> the bottleneck, or even the fact that the code tries to read/write
>>> FDs when they are not available for read/write.
>>>
>>> For example, here are results of running a scale test scenario which
>>> repeats the following iteration 300 times:
>>> - bring up a node (ovn-fake-multinode container) and connect it to the
>>> OVN Southbound DB.
>>> - configure an OVN logical switch to be bound to the new node.
>>> - configure an OVN logical switch port on the new logical switch.
>>> - configure an OVS internal interface on the new node and bind it to the
>>> OVN logical switch port.
>>> - wait until the new internal interface can ping its default gateway
>>> through OVN (i.e., until ovn-controller on the node received all updates
>>> from the SB DB and installed all OVS flows), highlighted in the output.
>>>
>>> The tests use rally-ovs (ovn-scale-test) on a 9 server setup (1 machine
>>> running OVN ovsdb-servers and ovn-northd and 8 machines simulating
>>> chassis using ovn-fake-multinode), in particular this modified scenario:
>>> https://github.com/dceara/ovn-scale-test/blob/ovn-switch-per-node/samples/tasks/scenarios/ovn-network/osh_workload_incremental.json
>>>
>>>
>>> With OVS master and OVN master:
>>> http://pastebin.test.redhat.com/836568
>>>
>>> With OVS master + your patches and OVN master:
>>> http://pastebin.test.redhat.com/836571
>>>
>>> Here are some of the logs we get on the OVN Southbound DB ovsdb-server
>>> that show that ovsdb-server spends up to 2 seconds in a single loop
>>> iteration sending/receiving updates to/from ovn-controllers:
>>>
>>> 2020-02-17T10:43:41.175Z|01991|poll_loop|INFO|wakeup due to [OVS_POLLIN]
>>> on fd 140 (192.16.0.1:6642<->192.16.0.120:52018) at lib/stream-fd.c:79
>>> (84% CPU usage)
>>> 2020-02-17T10:43:43.338Z|01992|timeval|WARN|Unreasonably long 2163ms
>>> poll interval (2144ms user, 9ms system)
>>> 2020-02-17T10:43:43.339Z|01993|timeval|WARN|faults: 590 minor, 0 major
>>> 2020-02-17T10:43:43.339Z|01994|timeval|WARN|disk: 0 reads, 8 writes
>>> 2020-02-17T10:43:43.339Z|01995|timeval|WARN|context switches: 0
>>> voluntary, 4 involuntary
>>> 2020-02-17T10:43:43.339Z|01996|poll_loop|INFO|Dropped 63 log messages in
>>> last 2 seconds (most recently, 2 seconds ago) due to excessive rate
>>> 2020-02-17T10:43:43.339Z|01997|poll_loop|INFO|wakeup due to [OVS_POLLIN]
>>> on fd 76 (192.16.0.1:6642<->192.16.0.56:33538) at lib/stream-fd.c:79
>>> (84% CPU usage)
>>> 2020-02-17T10:43:45.495Z|01998|timeval|WARN|Unreasonably long 2156ms
>>> poll interval (2129ms user, 17ms system)
>>> 2020-02-17T10:43:45.495Z|01999|timeval|WARN|faults: 738 minor, 0 major
>>> 2020-02-17T10:43:45.495Z|02000|timeval|WARN|context switches: 0
>>> voluntary, 7 involuntary
>>> 2020-02-17T10:43:47.651Z|02001|timeval|WARN|Unreasonably long 2157ms
>>> poll interval (2136ms user, 10ms system)
>>>
>>> In this case, and I think in most OVN use cases, ovsdb-server is busy
>>> because it actually has to send updates to large numbers of
>>> ovn-controllers connected to it. Unless I'm missing something the epoll
>>> change seems to improve performance only in cases where the Southbound
>>> DB doesn't do much sending/receiving.
>>
>> 1. Correct - it improves the handling and the cost of idle connections.
>> At present each connection carries a fixed cost regardless of whether
>> it needs servicing or not.
>>
>> 2. It should also reduce the cost of handling many peers on send if
>> there is enough outstanding data on the sockets to cause EAGAIN on send.
>>
>> 3. It does not fix the fundamental problem that the logic in ovsdb is
>> single-threaded. It does, however, allow json+io+ssl to become
>> multi-threaded, which should leave just the logic in the main
>> ovsdb-server thread. You cannot do that effectively without having
>> information on the state of the socket and whether it needs servicing.
>>
>> 4. I have seen the rally tests - they are flat scale-up. I agree that
>> the help from fixing the IO loop will be minimal there, because there
>> is IO to be serviced on most fds at all times, so I would not expect
>> these patches to help a lot in that scenario.
>>
>> The difference from fixing the IO (leaving aside that it is a
>> prerequisite to getting SSL into worker threads) is not in scaling up,
>> but in running at scale. The current "ram the EAGAIN wall until it
>> gives up" design gives you a mandatory penalty per transaction when
>> deploying config to one node while running (because of attempted reads
>> which fail with EAGAIN). The correct test for that is the CPU cost and
>> latency to deploy a logical flow when running at steady state on e.g.
>> 300 nodes.
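
To make the "attempted failed reads on EAGAIN" penalty concrete, a small
illustration; the helper names are made up for the example:

#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

/* Speculative pattern: every idle non-blocking fd still costs one
 * read() syscall per loop iteration, which fails with EAGAIN. */
ssize_t
read_speculative(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return 0;               /* Nothing there; pure overhead. */
    }
    return n;
}

/* Readiness-aware pattern: the read is only attempted when the poller
 * has already reported the fd as ready. */
ssize_t
read_if_ready(int fd, bool ready, void *buf, size_t len)
{
    return ready ? read(fd, buf, len) : 0;
}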
> 
> Currently when a logical flow is added to the SB DB for a logical
> datapath (logical switch or logical router) all ovn-controllers that
> consider the logical datapath as "local" will receive the update from
> ovsdb SB.
> 
> ovn-controller local datapaths are:
> - all logical switch datapaths that have ports bound to the local chassis.
> - all logical datapaths corresponding to logical routers connected
> through OVN patch ports to local logical switches.
> 
> While it's true that in some deployments this will limit the number of
> ovn-controllers receiving the update, in OVN-k8s, for example, all node
> logical switches are connected to a single cluster logical router. Once
> a logical flow is added to any of the node logical switches all
> ovn-controllers on all nodes should receive the update. This means that
> there won't be too many idle connections during a SB DB logical flow update.

The example you are giving is a clear case of parallel IO, for which you
need a working IO event loop and retained state on where you currently
are with IO for each connection. You cannot effectively parallelise
IO+SSL while discarding all information about it.

Fixing the IO events is a prerequisite for doing other stuff as well -
SSL threading, processing threading, etc. Sure, you can try parallelising
IO without any information on what your IO state is and tie that into
some level of parallelism in the processing. That usually results in
something that resembles spaghetti and is a nightmare to continue
scaling further.
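
As a rough sketch of that split - one central thread owning epoll and
only routing readiness events, worker threads doing the (SSL, json) IO -
here is the general shape; the queue, the conn structure and all names
are hypothetical scaffolding, not the code in this series:

#include <pthread.h>
#include <sys/epoll.h>

struct conn {
    int fd;                     /* Plus SSL/json state in a real server. */
    struct conn *next;
};

/* Minimal blocking work queue shared by the dispatcher and workers. */
struct work_queue {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    struct conn *head;
};

static void
queue_push(struct work_queue *q, struct conn *c)
{
    pthread_mutex_lock(&q->mutex);
    c->next = q->head;
    q->head = c;
    pthread_cond_signal(&q->cond);
    pthread_mutex_unlock(&q->mutex);
}

static struct conn *
queue_pop(struct work_queue *q)
{
    pthread_mutex_lock(&q->mutex);
    while (!q->head) {
        pthread_cond_wait(&q->cond, &q->mutex);
    }
    struct conn *c = q->head;
    q->head = c->next;
    pthread_mutex_unlock(&q->mutex);
    return c;
}

/* Placeholder for the per-connection read/write (+SSL, +json) work
 * that would move off the main thread. */
static void
conn_do_io(struct conn *c)
{
    (void) c;
}

/* IO worker: does the potentially expensive per-connection work. */
void *
io_worker(void *arg)
{
    struct work_queue *q = arg;
    for (;;) {
        conn_do_io(queue_pop(q));
    }
    return NULL;
}

/* Dispatcher: the only thread that talks to epoll.  It performs no IO
 * itself; it only routes readiness events to the workers.  Assumes each
 * fd was registered with data.ptr pointing at its struct conn. */
void
dispatcher(int epfd, struct work_queue *q)
{
    for (;;) {
        struct epoll_event events[64];
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            queue_push(q, events[i].data.ptr);
        }
    }
}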

> 
> I agree that the right way to test is to see how long it takes for a
> logical flow to be installed, which is what the rally-ovs tests do: they
> measure how long until the SB DB logical flow is translated to openflow
> by ovn-controller when a new node is brought up. At iteration X+1 we are
> at a steady state (X nodes up and configured) and node X+1 is brought up
> and its corresponding logical flows are added.
> 
>>
>> The current design as written has that penalty as a given - it is
>> unavoidable, and it also grows linearly (or worse) with size due to
>> other factors in addition to EAGAIN - e.g. you also copy a large pollfd
>> array to the kernel back and forth on every iteration.
>>
>> I have not seen any tests trying to quantify this on an actual cluster.
>> All tests I have seen so far are scale-up and/or running a single
>> synthetic application on the whole cluster.
>>
>> Mine are mostly taking the code in question and running it outside of
>> OVS/OVN on a harness.
>>
>>
>>> How do you test
>>> performance/scalability improvements?
>>>
>>> Regards,
>>> Dumitru
>>>
> 
> 

-- 
Anton R. Ivanov

Cambridge Greys Limited, England and Wales company No 10273661
http://www.cambridgegreys.com/

