[ovs-dev] [PATCH v3] [python] Don't mix system poll/monkeypatched select

Tue Jun 15 20:06:53 UTC 2021

On 6/11/21 4:29 PM, Terry Wilson wrote:
> This is a partial revert of c1aa16d19, but keeps its test.
> 
> Applications that use eventlet cannot use poll() without blocking.
> To fix an issue where the check_connection_completion() code would
> not detect connection errors when using select(), we switched to
> using the system poll() for checking the connection while leaving
> the select-based poller the default. This mixes monkeypatched and
> unpatched code and caused eventlet-based code to block every time
> we connect.
> 
> According to the connect(2) man page, you can detect connect()
> errors when using select():
> 
>   After select(2) indicates writability, use getsockopt(2) to read
>   the SO_ERROR option at level SOL_SOCKET to determine whether
>   connect() completed successfully (SO_ERROR is zero) or
>   unsuccessfully (SO_ERROR is one of the usual error codes listed
>   here, explaining the reason for the failure)
> 
> so this patch does that instead.
> 
> Signed-off-by: Terry Wilson <twilson at redhat.com>
> ---
>  python/ovs/poller.py      | 35 ++---------------------------------
>  python/ovs/socket_util.py |  9 ++++++---
>  2 files changed, 8 insertions(+), 36 deletions(-)
> 
> diff --git a/python/ovs/poller.py b/python/ovs/poller.py
> index 3624ec865..7b3238d6d 100644
> --- a/python/ovs/poller.py
> +++ b/python/ovs/poller.py
> @@ -31,22 +31,14 @@ except ImportError:
>      SSL = None
>  
>  try:
> -    from eventlet import patcher as eventlet_patcher
> +    import eventlet.patcher
>  
>      def _using_eventlet_green_select():
> -        return eventlet_patcher.is_monkey_patched(select)
> +        return eventlet.patcher.is_monkey_patched(select)
>  except:
> -    eventlet_patcher = None
> -
>      def _using_eventlet_green_select():
>          return False
>  
> -try:
> -    from gevent import monkey as gevent_monkey
> -except:
> -    gevent_monkey = None
> -
> -
>  vlog = ovs.vlog.Vlog("poller")
>  
>  POLLIN = 0x001
> @@ -265,26 +257,3 @@ class Poller(object):
>      def __reset(self):
>          self.poll = SelectPoll()
>          self.timeout = -1
> -
> -
> -def get_system_poll():
> -    """Returns the original select.poll() object. If select.poll is
> -    monkey patched by eventlet or gevent library, it gets the original
> -    select.poll and returns an object of it using the
> -    eventlet.patcher.original/gevent.monkey.get_original functions.
> -
> -    As a last resort, if there is any exception it returns the
> -    SelectPoll() object.
> -    """
> -    try:
> -        if _using_eventlet_green_select():
> -            _system_poll = eventlet_patcher.original("select").poll
> -        elif gevent_monkey and gevent_monkey.is_object_patched(
> -                'select', 'poll'):
> -            _system_poll = gevent_monkey.get_original('select', 'poll')
> -        else:
> -            _system_poll = select.poll
> -    except:
> -        _system_poll = SelectPoll
> -
> -    return _system_poll()
> diff --git a/python/ovs/socket_util.py b/python/ovs/socket_util.py
> index 3faa64e9d..7dd000916 100644
> --- a/python/ovs/socket_util.py
> +++ b/python/ovs/socket_util.py
> @@ -159,8 +159,8 @@ def make_unix_socket(style, nonblock, bind_path, connect_path, short=False):
>  
>  
>  def check_connection_completion(sock):
> +    p = ovs.poller.SelectPoll()
>      if sys.platform == "win32":
> -        p = ovs.poller.SelectPoll()
>          event = winutils.get_new_event(None, False, True, None)
>          # Receive notification of readiness for writing, of completed
>          # connection or multipoint join operation, and of socket closure.
> @@ -170,12 +170,15 @@ def check_connection_completion(sock):
>                                   win32file.FD_CLOSE)
>          p.register(event, ovs.poller.POLLOUT)
>      else:
> -        p = ovs.poller.get_system_poll()
>          p.register(sock, ovs.poller.POLLOUT)
>      pfds = p.poll(0)
>      if len(pfds) == 1:
>          revents = pfds[0][1]
> -        if revents & ovs.poller.POLLERR or revents & ovs.poller.POLLHUP:
> +        if revents & ovs.poller.POLLOUT:
> +            # This is how select() indicates a connect() error

This wording doesn't sound correct to me as it might be an error and might
be not, IIUC.  POLLOUT here just means that something happened.  It might
be an error or the actual POLLOUT.
(Ugh... select()/poll() are so inconsistent across different platforms)

> +            return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)

Looking at the history, it seems that ESX doesn't support SO_ERROR (at least
some old versions), so we need to avoid use of it.
See commit d6cedfd9d29d ("socket-util: Avoid using SO_ERROR.")

We, likely, need to treat this in the same path as error. i.e. invoke send()
and catch the exception.  However, for the POLLOUT case we should not
log and return the error if send() succeeds.  But this doesn't sound like
a good idea to send NULL bytes to everyone on success...

So, some options here:

1. Keep as is, i.e. system poll()
   --> python will be blocked by poll()

2. Fully revert c1aa16d19, i.e. just use select()
   --> connections will be treated as established even if they are not
       until the actual read/write.

3. Use select() + getsockopt(SO_ERRROR), effectively revert d6cedfd9d29d
   --> give up on systems that doesn't support SO_ERRROR. (ESX?)

4. Use select() + getsockopt(SO_ERRROR), keep returning 0 (success) on
   ENOPROTOOPT
   --> systems that doesn't support might think that connections are
       established while they are not.

5. Try to detect if SO_ERRROR supported or not and use system poll() or
   select() + getsockopt(SO_ERRROR) based on that fact.
   --> better performance on systems that supports and same as now on
       systems that doesn't.

Option 3 is, probably, OK, but I don't know.
Option 4 might be the best choice, but I'm not quiet sure how to detect.
Is checking for ENOPROTOOPT enough (same actually applies to option 3
as I'm not sure if ESX actually reports ENOPROTOOPT)?

Also, we're using select() in C code for windows.  Do we have the same
issue there?  Alin, do you know if check_connection_completion() correctly
detects connection failure on windows, e.g. if 'Check stream open block'
tests are working?

Any thoughts?

Ben, what do you think?

Best regards, Ilya Maximets.