[ovs-dev] [PATCH v2 3/4] Add file descriptor persistence where possible

Anton Ivanov anton.ivanov at cambridgegreys.com
Mon Feb 24 10:47:33 UTC 2020


On 19/02/2020 14:20, Dumitru Ceara wrote:
> On 2/18/20 7:12 AM, Anton Ivanov wrote:
>>
>> On 17/02/2020 14:48, Dumitru Ceara wrote:
>>> On 2/14/20 6:54 PM, anton.ivanov at cambridgegreys.com wrote:
>>>> From: Anton Ivanov <anton.ivanov at cambridgegreys.com>
>>>>
>>>> 1. Adds "persistent" behaviour where feasible (streams and signals).
>>>> These are waited upon in the same thread where they are created. This
>>>> allows them to be registered persistently with the OS (if possible)
>>>> and lets the OS provide hints - is the FD ready, has it been closed,
>>>> etc. (sketch below).
>>>>
>>>> 2. Removes unnecessary read attempts (which would only fail with
>>>> EAGAIN) on an fd which is not ready, when that fd has been registered
>>>> as "private" to the thread which waits upon it.
>>>>
>>>> 3. No longer breaks other parts of OVS which create the fd in one
>>>> thread and wait upon it in others.
>>>>
>>>> 4. Adds support for EPOLL on Linux and can be expanded to cover similar
>>>> poll++ frameworks in other OSes.
>>>>
>>>> 5. Sets up the necessary infrastructure to make IO/SSL multi-threaded
>>>> using a "central (e)poll dispatcher + IO threads" pattern.
>>>>
>>>> Signed-off-by: Anton Ivanov <anton.ivanov at cambridgegreys.com>
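To make point 1 of the description concrete, the idea is roughly the
following (an illustrative sketch with made-up names, not the actual patch
code): the thread that creates the stream adds its fd to the epoll set it
owns once, instead of re-arming it with poll_fd_wait() on every pass
through the main loop.

    #include <sys/epoll.h>

    /* Illustrative only: register an fd persistently with the epoll set
     * owned by the thread that created it.  The kernel can then provide
     * the hints mentioned above (readable, peer closed) without the fd
     * being re-registered on every loop iteration. */
    static int
    register_persistent(int epfd, int fd, void *stream)
    {
        struct epoll_event ev = {
            .events = EPOLLIN | EPOLLRDHUP, /* readiness + peer-close hint */
            .data.ptr = stream,
        };

        return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }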
>>> Hi Anton,
>>>
>>> A couple of issues inline. Except for that:
>>>
>>> 1. The "STP - flush the fdb and mdb when topology changed" OVS test is
>>> failing with your patches applied:
>>>
>>> make check TESTSUITEFLAGS='-k "flush the fdb"'
>> I will have a look.


I cannot reproduce that. It succeeds every time - 20 out of 20 runs. Can you send me some logs please?


>>
>>> 2. Travis CI build fails:
>>>
>>> lib/fatal-signal.c:244:5: error: ignoring return value of ‘read’,
>>> declared with attribute warn_unused_result [-Werror=unused-result]
>>>
>>>        read(signal_fds[0], sigbuffer, sizeof(sigbuffer))
>>>
>>> 3. Travis CI OSX build fails:
>>>
>>> lib/poll-loop.c:46:1: error: unused function 'poll_create_node_add'
>>> [-Werror,-Wunused-function]
>>>
>>> COVERAGE_DEFINE(poll_create_node);
>> I will fix all of these in the next version. The CI was out with the
>> daisies when I submitted the patch last week so I did not see the logs
>> until yesterday.
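For the fatal-signal one, the warning is just about the unconsumed return
value of the read() that drains the signal fd; the fix will likely be
along these lines (a sketch with placeholder names, not the exact code
that will go into the tree):

    #include <errno.h>
    #include <unistd.h>

    /* Drain the signal fd.  Looping on the return value both consumes it
     * (so -Werror=unused-result is satisfied) and handles EINTR. */
    static void
    drain_signal_fd(int fd)
    {
        char buf[64];
        ssize_t n;

        do {
            n = read(fd, buf, sizeof buf);
        } while (n > 0 || (n < 0 && errno == EINTR));
    }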
>>
>>> 4. While OVS might benefit from these changes, I'm wondering about OVN
>>> and ovsdb-server specifically. ovsdb-server is single threaded, and on
>>> large-scale deployments we don't usually see "poll" as the bottleneck,
>>> or even the fact that the code tries to read/write FDs when they are
>>> not available for read/write.
>>>
>>> For example, here are results of running a scale test scenario which
>>> repeats the following iteration 300 times:
>>> - bring up a node (ovn-fake-multinode container) and connect it to the
>>> OVN Southbound DB.
>>> - configure an OVN logical switch to be bound to the new node.
>>> - configure an OVN logical switch port on the new logical switch.
>>> - configure an OVS internal interface on the new node and bind it to the
>>> OVN logical switch port.
>>> - wait until the new internal interface can ping its default gateway
>>> through OVN (i.e., until ovn-controller on the node received all updates
>>> from the SB DB and installed all OVS flows), highlighted in the output.
>>>
>>> The tests use rally-ovs (ovn-scale-test) on a 9 server setup (1 machine
>>> running OVN ovsdb-servers and ovn-northd and 8 machines simulating
>>> chassis using ovn-fake-multinode), in particular this modified scenario:
>>> https://github.com/dceara/ovn-scale-test/blob/ovn-switch-per-node/samples/tasks/scenarios/ovn-network/osh_workload_incremental.json
>>>
>>>
>>> With OVS master and OVN master:
>>> http://pastebin.test.redhat.com/836568
>>>
>>> With OVS master + your patches and OVN master:
>>> http://pastebin.test.redhat.com/836571
>>>
>>> Here are some of the logs we get on the OVN Southbound DB ovsdb-server
>>> that show that ovsdb-server spends up to 2 seconds in a single loop
>>> iteration sending/receiving updates to/from ovn-controllers:
>>>
>>> 2020-02-17T10:43:41.175Z|01991|poll_loop|INFO|wakeup due to [OVS_POLLIN]
>>> on fd 140 (192.16.0.1:6642<->192.16.0.120:52018) at lib/stream-fd.c:79
>>> (84% CPU usage)
>>> 2020-02-17T10:43:43.338Z|01992|timeval|WARN|Unreasonably long 2163ms
>>> poll interval (2144ms user, 9ms system)
>>> 2020-02-17T10:43:43.339Z|01993|timeval|WARN|faults: 590 minor, 0 major
>>> 2020-02-17T10:43:43.339Z|01994|timeval|WARN|disk: 0 reads, 8 writes
>>> 2020-02-17T10:43:43.339Z|01995|timeval|WARN|context switches: 0
>>> voluntary, 4 involuntary
>>> 2020-02-17T10:43:43.339Z|01996|poll_loop|INFO|Dropped 63 log messages in
>>> last 2 seconds (most recently, 2 seconds ago) due to excessive rate
>>> 2020-02-17T10:43:43.339Z|01997|poll_loop|INFO|wakeup due to [OVS_POLLIN]
>>> on fd 76 (192.16.0.1:6642<->192.16.0.56:33538) at lib/stream-fd.c:79
>>> (84% CPU usage)
>>> 2020-02-17T10:43:45.495Z|01998|timeval|WARN|Unreasonably long 2156ms
>>> poll interval (2129ms user, 17ms system)
>>> 2020-02-17T10:43:45.495Z|01999|timeval|WARN|faults: 738 minor, 0 major
>>> 2020-02-17T10:43:45.495Z|02000|timeval|WARN|context switches: 0
>>> voluntary, 7 involuntary
>>> 2020-02-17T10:43:47.651Z|02001|timeval|WARN|Unreasonably long 2157ms
>>> poll interval (2136ms user, 10ms system)
>>>
>>> In this case, and I think in most OVN use cases, ovsdb-server is busy
>>> because it actually has to send updates to large numbers of
>>> ovn-controllers connected to it. Unless I'm missing something the epoll
>>> change seems to improve performance only in cases where the Southbound
>>> DB doesn't do much sending/receiving.
>> 1. Correct - it improves the handling and the cost of idle connections.
>> At present each connection is a fixed cost regardless of whether it
>> needs servicing or not.
>>
>> 2. It should also reduce the cost of sending to many peers when there
>> is enough outstanding data on the sockets to cause EAGAIN on send.
>>
>> 3. It does not fix the fundamental problem that the logic in ovsdb is
>> single threaded. It does, however, allow json+io+ssl to become
>> multi-threaded, which should leave just the logic in the main
>> ovsdb-server thread (rough sketch below). You cannot do that
>> effectively without information on the state of the socket and whether
>> it needs servicing.
>>
>> 4. I have seen the rally tests - they are flat scale-up. I agree that
>> the help from fixing the IO loop will be minimal there, because there
>> is IO to be serviced on most fds at all times.
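The rough shape of the "central (e)poll dispatcher + IO threads" split
referred to in point 3 is below (an illustrative sketch with made-up
names, not the patch code): one thread owns the epoll set and only
classifies readiness, and worker threads do the read/SSL/json work, so
the main ovsdb-server thread is left with just the logic.

    #include <stdint.h>
    #include <sys/epoll.h>

    struct io_job {
        int fd;
        uint32_t events;
    };

    /* Hypothetical hand-off queue into the pool of IO worker threads. */
    void enqueue_for_worker(const struct io_job *job);

    /* Central dispatcher: wait on the epoll set this thread owns and hand
     * ready fds to the IO workers, which do the actual read/SSL/json
     * processing. */
    static void *
    dispatcher_main(void *epfd_)
    {
        int epfd = *(int *) epfd_;
        struct epoll_event evs[64];

        for (;;) {
            int n = epoll_wait(epfd, evs, 64, -1);
            for (int i = 0; i < n; i++) {
                struct io_job job = { evs[i].data.fd, evs[i].events };
                enqueue_for_worker(&job);
            }
        }

        return NULL;
    }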
>>
>> The difference from fixing the IO (leaving aside that it is a
>> prerequisite to getting SSL into worker threads) is not in scaling up,
>> but in running at scale. The current "ram the EAGAIN wall until it gives
>> up" design imposes a mandatory penalty per transaction when deploying
>> config to one node while running (because of attempted reads which fail
>> with EAGAIN). The correct test for that is the CPU cost and latency to
>> deploy a logical flow when running at steady state on, e.g., 300 nodes.
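To spell the penalty out in code (again, a sketch, not the actual stream
code): without readiness information every wakeup pays for a read() that
can come straight back with EAGAIN; with a persistent registration the
read is attempted only when the kernel has already reported the fd as
readable.

    #include <errno.h>
    #include <stdbool.h>
    #include <unistd.h>

    /* Sketch: attempt the read only when the (e)poll result says the fd
     * is readable, instead of paying for a syscall that returns EAGAIN. */
    static ssize_t
    read_if_ready(int fd, void *buf, size_t len, bool kernel_says_ready)
    {
        ssize_t n;

        if (!kernel_says_ready) {
            return 0;               /* skip the guaranteed-EAGAIN read() */
        }

        n = read(fd, buf, len);
        return n < 0 && errno == EAGAIN ? 0 : n;
    }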
> Currently when a logical flow is added to the SB DB for a logical
> datapath (logical switch or logical router) all ovn-controllers that
> consider the logical datapath as "local" will receive the update from
> ovsdb SB.
>
> ovn-controller local datapaths are:
> - all logical switch datapaths that have ports bound to the local chassis.
> - all logical datapaths corresponding to logical routers connected
> through OVN patch ports to local logical switches.
>
> While it's true that in some deployments this will limit the number of
> ovn-controllers receiving the update, in OVN-k8s, for example, all node
> logical switches are connected to a single cluster logical router. Once
> a logical flow is added to any of the node logical switches all
> ovn-controllers on all nodes should receive the update. This means that
> there won't be too many idle connections during a SB DB logical flow update.
>
> I agree that the right way to test is to see how long it takes for a
> logical flow to be installed, which is what the rally-ovs tests do: they
> measure how long until the SB DB logical flow is translated to openflow
> by ovn-controller when a new node is brought up. At iteration X+1 we are
> at a steady state (X nodes up and configured) and node X+1 is brought up
> and its corresponding logical flows are added.
>
>> The current design as written has that penalty as a given - it is
>> unavoidable, and it also grows linearly (or worse) with size due to
>> other factors in addition to EAGAIN - e.g. you also start copying a
>> large pollfd array back and forth to the kernel on every iteration.
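As a rough illustration of that per-iteration cost (not patch code): with
around 1000 remotes, poll() copies ~1000 * sizeof(struct pollfd), i.e. on
the order of 8 kB, into the kernel on every call, and the kernel scans
every entry whether it is ready or not, while epoll keeps the interest
set in the kernel and returns only the fds that actually need servicing.

    #include <poll.h>
    #include <sys/epoll.h>

    /* poll(): the whole array crosses the user/kernel boundary and is
     * scanned end to end on every call - O(n) per iteration. */
    static int
    wait_with_poll(struct pollfd *fds, nfds_t n, int timeout_ms)
    {
        return poll(fds, n, timeout_ms);
    }

    /* epoll_wait(): the interest set already lives in the kernel, so the
     * per-iteration cost is proportional to the number of ready fds. */
    static int
    wait_with_epoll(int epfd, struct epoll_event *evs, int max_events,
                    int timeout_ms)
    {
        return epoll_wait(epfd, evs, max_events, timeout_ms);
    }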
>>
>> I have not seen any tests trying to quantify this on an actual cluster.
>> All tests I have seen so far are scale-up and/or running a single
>> synthetic application on the whole cluster.
>>
>> Mine are mostly taking the code in question and running it outside of
>> OVS/OVN on a harness.
>>
>>
>>> How do you test
>>> performance/scalability improvements?
>>>
>>> Regards,
>>> Dumitru
>>>
>
-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/


