[ovs-discuss] OpenStack profiling with networking-ovn - port creation is slow

Tue Feb 13 19:32:01 UTC 2018

On Tue, Feb 13, 2018 at 12:39:56PM +0100, Daniel Alvarez Sanchez wrote:
> Hi folks,
> 
> As we're doing some performance tests in OpenStack using OVN,
> we noticed that as we keep creating ports, the time for creating a
> single port increases. Also, ovn-northd CPU consumption is quite
> high (see [0] which shows the CPU consumption when creating
> 1000 ports and deleting them. Last part where CPU is at 100% is
> when all the ports get deleted).
> 
> With 500 ports in the same Logical Switch, I did some profiling
> of OpenStack neutron-server adding 10 more ports to that Logical
> Switch. Currently, neutron-server spawns different API workers
> (separate processes) which open connections to OVN NB so every
> time an update message is sent from ovsdb-server it'll be processed
> by all of them.
> 
> In my profiling, I used GreenletProfiler in all those processes to produce
> a trace file and then merged all of them together to aggregate the
> results. In those tests I used OVS master branch compiled it with
> shared libraries to make use of the JSON C parser. Still, I've seen
> that most of the time's been spent in the following two modules:
> 
> - python/ovs/db/data.py:  33%
> - uuid.py:  21%
> 
> For the data.py module, this is the usage (self time):
> 
> Atom.__lt__       16.25%     8283 calls
> from_json:118      6.18%   406935 calls
> Atom.__hash__      3.48%  1623832 calls
> from_json:328      2.01%     5040 calls
> 
> While for the uuid module:
> 
> UUID.__cmp__       12.84%  3570975 calls
> UUID.__init__       4.06%   362541 calls
> UUID.__hash__       2.96%     1800 calls
> UUID.__str__        1.03%   355016 calls
> 
> Most of the calls to Atom.__lt__ come from
> BaseOvnIdl.__process_update2(idl.py)
> -> BaseOvnIdl.__row_update(idl.py) -> Datum.__cmp__(data.py) ->
> Atom.__cmp__(data.py).

I don't know Python well enough to whether these are "real" or correct
optimizations, but the following strike me as possible low-hanging fruit
micro-optimizations:

diff --git a/python/ovs/db/data.py b/python/ovs/db/data.py
index 9e57595f7513..dc816f64708f 100644
--- a/python/ovs/db/data.py
+++ b/python/ovs/db/data.py
@@ -76,12 +76,12 @@ class Atom(object):
     def __eq__(self, other):
         if not isinstance(other, Atom) or self.type != other.type:
             return NotImplemented
-        return True if self.value == other.value else False
+        return self.value == other.value
 
     def __lt__(self, other):
         if not isinstance(other, Atom) or self.type != other.type:
             return NotImplemented
-        return True if self.value < other.value else False
+        return self.value < other.value
 
     def __cmp__(self, other):
         if not isinstance(other, Atom) or self.type != other.type:

We could also reduce the number of calls to Datum.__cmp__ by recognizing
that if we've already found one change we don't have to do any more
comparisons, something like this:

diff --git a/python/ovs/db/idl.py b/python/ovs/db/idl.py
index 60548bcf50b6..2e550adfdf1c 100644
--- a/python/ovs/db/idl.py
+++ b/python/ovs/db/idl.py
@@ -447,6 +447,7 @@ class Idl(object):
             raise error.Error("<table-updates> is not an object",
                               table_updates)
 
+        changed = False
         for table_name, table_update in six.iteritems(table_updates):
             table = self.tables.get(table_name)
             if not table:
@@ -472,8 +473,8 @@ class Idl(object):
                                       % (table_name, uuid_string))
 
                 if version == OVSDB_UPDATE2:
-                    if self.__process_update2(table, uuid, row_update):
-                        self.change_seqno += 1
+                    changed = self.__process_update2(table, uuid, row_update,
+                                                     changed)
                     continue
 
                 parser = ovs.db.parser.Parser(row_update, "row-update")
@@ -485,12 +486,12 @@ class Idl(object):
                     raise error.Error('<row-update> missing "old" and '
                                       '"new" members', row_update)
 
-                if self.__process_update(table, uuid, old, new):
-                    self.change_seqno += 1
+                changed = self.__process_update(table, uuid, old, new, changed)
+        if changed:
+            self.change_seqno += 1
 
-    def __process_update2(self, table, uuid, row_update):
+    def __process_update2(self, table, uuid, row_update, changed):
         row = table.rows.get(uuid)
-        changed = False
         if "delete" in row_update:
             if row:
                 del table.rows[uuid]
@@ -511,9 +512,8 @@ class Idl(object):
             else:
                 row_update = row_update['initial']
             self.__add_default(table, row_update)
-            if self.__row_update(table, row, row_update):
-                changed = True
-                self.notify(ROW_CREATE, row)
+            changed = self.__row_update(table, row, row_update, changed)
+            self.notify(ROW_CREATE, row)
         elif "modify" in row_update:
             if not row:
                 raise error.Error('Modify non-existing row')
@@ -528,10 +528,9 @@ class Idl(object):
                               row_update)
         return changed
 
-    def __process_update(self, table, uuid, old, new):
+    def __process_update(self, table, uuid, old, new, changed):
         """Returns True if a column changed, False otherwise."""
         row = table.rows.get(uuid)
-        changed = False
         if not new:
             # Delete row.
             if row:
@@ -547,13 +546,13 @@ class Idl(object):
             if not row:
                 row = self.__create_row(table, uuid)
                 changed = True
+                self.__row_update(table, row, new, True)
+                self.notify(ROW_CREATE, row)
             else:
                 # XXX rate-limit
                 vlog.warn("cannot add existing row %s to table %s"
                           % (uuid, table.name))
-            if self.__row_update(table, row, new):
-                changed = True
-                self.notify(ROW_CREATE, row)
+                changed = self.__row_update(table, row, new, changed)
         else:
             op = ROW_UPDATE
             if not row:
@@ -563,9 +562,8 @@ class Idl(object):
                 # XXX rate-limit
                 vlog.warn("cannot modify missing row %s in table %s"
                           % (uuid, table.name))
-            if self.__row_update(table, row, new):
-                changed = True
-                self.notify(op, row, Row.from_json(self, table, uuid, old))
+            changed = self.__row_update(table, row, new, changed)
+            self.notify(op, row, Row.from_json(self, table, uuid, old))
         return changed
 
     def __column_name(self, column):
@@ -608,8 +606,7 @@ class Idl(object):
 
         return old_row_diff_json
 
-    def __row_update(self, table, row, row_json):
-        changed = False
+    def __row_update(self, table, row, row_json, changed):
         for column_name, datum_json in six.iteritems(row_json):
             column = table.columns.get(column_name)
             if not column:
@@ -626,14 +623,13 @@ class Idl(object):
                           % (column_name, table.name, e))
                 continue
 
-            if datum != row._data[column_name]:
+            # If the column changed, update it and record that we saw a change.
+            # We also update it if we've already recorded some other change,
+            # because that's cheaper than doing an unnecessary comparison.
+            if changed or datum != row._data[column_name]:
                 row._data[column_name] = datum
                 if column.alert:
                     changed = True
-            else:
-                # Didn't really change but the OVSDB monitor protocol always
-                # includes every value in a row.
-                pass
         return changed
 
     def __create_row(self, table, uuid):

> The aggregated number of calls to BaseOvnIdl.__process_update2 is
> 1400 (and we're updating only 10 ports!!) while the total connections
> opened to NB database are 10:
> 
> # netstat -np | grep 6641 | grep python | wc -l
> 10
> 
> * Bear in mind that those results above were aggregated across all
> processes.
> 
> Looks like the main culprit for this explosion could be the way we
> handle ACLs. As every time we create a port, it'll belong to a Neutron
> security group (OVN Address Set) and we'll add a new ACL for every
> Neutron security group rule. If we patch the code to skip the ACL part,
> the time for creating a port remains stable over the time.
> 
> From the comparison tests against ML2/OVS (reference implementation),
> OVN outperforms in most of the operations except for the port creation
> where we can see it can become a bottleneck.
> 
> Before optimizing/redesigning the ACL part, we could do some other
> changes to the way we handle notifications from OVSDB: eg., instead of
> having multiple processes receiving *all* notifications, we could have
> one single process subscribed to those notifications and send a more
> optimized (already parsed) multicast notification to all listening processes
> to keep their own in-memory copies of the DB up to date. All processes
> would connect to NB database in "write-only" mode to commit their
> transactions however.

I'm not sure I understand this.  What information is it unnecessary to
have the processes receive?  If they are getting information they don't
need, then we might be able to eliminate some of it.

Can you sketch the rows that are being inserted or modified when a port
is added?  I would expect something like this as a minimum:

        * Insert one Logical_Switch_Port row.

        * Add pointer to Logical_Switch_Port to ports column in one row
          in Logical_Switch.

In addition it sounds like currently we're seeing:

        * Add one ACL row per security group rule.

        * Add pointers to ACL rows to acls column in one row in
          Logical_Switch.

and that's the bottleneck?