MEMORY OVERHEAD OF DBAPI IN SQLALCHEMY IMPLEMENTATION

Overview:

This investigation digs deeper into SQLAlchemy and its usage. It continues with a thread that implements the database access used to query events from the database, which now shows a slight increase in memory usage.

Investigation Analysis:

In our service, the earlier suspicion was that if there were a memory leak caused by the Python source code itself, it would have leaked from the very beginning of the deployment. Another consideration is that there was no information about other services/applications running at the same time that could have affected its operation.

In addition, previous investigations showed no problems in the Python code itself, so this investigation focuses on SQLAlchemy and the database access implementation. After many experiments, it turned out that there is a memory overhead in database access when querying through SQLAlchemy.

  1. Performance of querying the whole table in SQLAlchemy

The report below shows the difference in memory usage between querying the whole table and querying a single column of the table. (Source: http://projects.curiousllc.com/examining-sqlalchemy-memory-usage.html)
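The line-by-line tables in this report are the output format of the Python memory_profiler package. As a minimal sketch of how such a measurement is produced (the function body here is a placeholder, not the original source):

    from memory_profiler import profile

    @profile
    def get_bounced_emails():
        # ...query under test goes here; memory_profiler reports per-line usage...
        return {}

    # Running the script prints the "Line #  Mem usage  Increment  Line Contents"
    # table for each @profile-decorated function:
    #     python -m memory_profiler script.py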

Scenario 1: Querying the whole table.

Line #    Mem usage    Increment   Line Contents
================================================
    56     46.0 MiB      0.0 MiB   @profile
    57                             def get_bounced_emails():
    58     48.2 MiB      2.2 MiB       emails = db.session.query(EmailAddress).\
    59     48.3 MiB      0.0 MiB           join(send_history, send_history.c.email_id == EmailAddress.email_id).\
    60    372.1 MiB    323.9 MiB           filter(send_history.c.bounce == 1).\
    61                                     all()
    62
    63    372.1 MiB      0.0 MiB       email_dict = {}
    64    374.1 MiB      2.0 MiB       for email in emails:
    65    374.1 MiB      0.0 MiB           email_dict[email.email_address] = True
    66
    67    314.9 MiB    -59.2 MiB       del emails
    68    314.9 MiB      0.0 MiB       return email_dict

Scenario 2: Querying a single column of the table.

Line #    Mem usage    Increment   Line Contents
================================================
    56     46.0 MiB      0.0 MiB   @profile
    57                             def get_bounced_emails():
    58     47.1 MiB      1.1 MiB       emails = db.session.query(EmailAddress.email_address).\
    59     47.1 MiB      0.0 MiB           join(send_history, send_history.c.email_id == EmailAddress.email_id).\
    60     81.2 MiB     34.1 MiB           filter(send_history.c.bounce == 1).\
    61                                     all()
    62
    63     81.2 MiB      0.0 MiB       email_dict = {}
    64     84.2 MiB      3.0 MiB       for email in emails:
    65     84.2 MiB      0.0 MiB           email_dict[email[0]] = True
    66
    67     65.9 MiB    -18.2 MiB       del emails
    68     65.9 MiB      0.0 MiB       return email_dict

Memory usage increases dramatically when we query the whole table (EmailAddress in this example) to fetch entire records, and it drops markedly when we query only a single field (email_address) of the table. The test below uses the cmha source code:

  2. Performance of querying all the objects in the database.

After querying all the events:

<132>Jan 21 14:52:29 stdout WARNING Line #  Mem usage  Increment  Line Contents
<132>Jan 21 14:52:29 stdout WARNING ================================================
<132>Jan 21 14:52:29 stdout WARNING 69  31.2 MiB  0.0 MiB  @profile
<132>Jan 21 14:52:29 stdout WARNING 70                     def _get_events(self):
<132>Jan 21 14:52:29 stdout WARNING 71  31.2 MiB  0.0 MiB      with self.factory.session_scope() as session:
<132>Jan 21 14:52:29 stdout WARNING 72  31.2 MiB  0.0 MiB          return session.query(Event).all()

******************************************************************************************************

<132>Jan 21 15:11:16 stdout WARNING Line #  Mem usage  Increment  Line Contents
<132>Jan 21 15:11:16 stdout WARNING ================================================
<132>Jan 21 15:11:16 stdout WARNING 69  31.3 MiB  0.0 MiB  @profile
<132>Jan 21 15:11:16 stdout WARNING 70                     def _get_events(self):
<132>Jan 21 15:11:16 stdout WARNING 71  31.3 MiB  0.0 MiB      with self.factory.session_scope() as session:
<132>Jan 21 15:11:16 stdout WARNING 72  31.3 MiB  0.0 MiB          return session.query(Event).all()

******************************************************************************************************

<132>Jan 21 15:13:38 stdout WARNING Line #  Mem usage  Increment  Line Contents
<132>Jan 21 15:13:38 stdout WARNING ================================================
<132>Jan 21 15:13:38 stdout WARNING 69  31.4 MiB  0.0 MiB  @profile
<132>Jan 21 15:13:38 stdout WARNING 70                     def _get_events(self):
<132>Jan 21 15:13:38 stdout WARNING 71  31.4 MiB  0.0 MiB      with self.factory.session_scope() as session:
<132>Jan 21 15:13:38 stdout WARNING 72  31.4 MiB  0.0 MiB          return session.query(Event).all()

The results clearly show that memory usage increases over time when the whole Event table is queried and returned as a list. This amount of memory may grow even larger when more actors access the database at the same time.

After returning the query object instead of fetching all events:

<132>Jan 22 11:38:35 stdout WARNING Line #  Mem usage  Increment  Line Contents
<132>Jan 22 11:38:35 stdout WARNING ================================================
<132>Jan 22 11:38:35 stdout WARNING 69  28.3 MiB  0.0 MiB  @profile
<132>Jan 22 11:38:35 stdout WARNING 70                     def _get_events(self):
<132>Jan 22 11:38:35 stdout WARNING 71  28.3 MiB  0.0 MiB      with self.factory.session_scope() as session:
<132>Jan 22 11:38:35 stdout WARNING 72                             #return session.query(Event).all()
<132>Jan 22 11:38:35 stdout WARNING 73  28.3 MiB  0.0 MiB          return session.query(Event)

********************************************************************************************************

<132>Jan 22 13:44:09 stdout WARNING Line #  Mem usage  Increment  Line Contents
<132>Jan 22 13:44:09 stdout WARNING ================================================
<132>Jan 22 13:44:09 stdout WARNING 69  29.0 MiB  0.0 MiB  @profile
<132>Jan 22 13:44:09 stdout WARNING 70                     def _get_events(self):
<132>Jan 22 13:44:09 stdout WARNING 71  29.0 MiB  0.0 MiB      with self.factory.session_scope() as session:
<132>Jan 22 13:44:09 stdout WARNING 72                             #return session.query(Event).all()
<132>Jan 22 13:44:09 stdout WARNING 73  29.0 MiB  0.0 MiB          return session.query(Event)

The memory allocation here is smaller than in the scenario that fetches all rows, but a slight memory increase still appears.

  3. Result Analysis:

Most DBAPIs pre-buffer all rows in memory as they are fetched, before returning them to the caller. This means that before the SQLAlchemy ORM even gets hold of the result set, the whole result is already stored in memory. Since the underlying DBAPI pre-buffers the rows, there will always be some memory overhead, even though this overhead is much less than the memory used for the ORM-mapped objects.

The DBAPI for MySQL is MySQLdb (a.k.a. mysql-python), and yes, it also pre-buffers the rows. Let's look at the example below:

(1) with self.factory.session_scope() as session:
        return session.query(Event).all()

(2) with self.factory.session_scope() as session:
        return session.query(Event)

In (1), the ORM processes all rows before returning anything: all() collects everything yielded by the query's internal generator into a list (see the all() method of Query). In (2), the ORM processes each row as soon as its data arrives and hands it to the caller, which behaves like streaming from the database. The second pattern uses less memory and has lower latency to the first result, as sketched below.
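To make the contrast concrete, here is a minimal sketch of the two patterns together with Query.yield_per(), which builds ORM objects in batches; Event and session follow the snippets above, and the server-side-cursor note is an assumption about how this particular driver is configured:

    # (1) all() materializes every Event object in one list.
    events = session.query(Event).all()

    # (2) Iterating the Query processes each row as it arrives from the DBAPI.
    for event in session.query(Event):
        pass  # handle each event here

    # yield_per(N) tells the ORM to construct mapped objects in batches of N,
    # lowering the ORM-side peak memory.
    for event in session.query(Event).yield_per(100):
        pass  # handle each event here

    # Note: the DBAPI-side buffer is separate. MySQLdb only stops pre-buffering
    # when a server-side cursor is used, e.g. (illustrative DSN):
    #
    #   import MySQLdb.cursors
    #   engine = create_engine("mysql://user:password@localhost/cmha",
    #                          connect_args={"cursorclass": MySQLdb.cursors.SSCursor})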

Alternatively, it is recommended to query individual columns (e.g. session.query(Object.column1, Object.column2)) instead of whole ORM objects. This reduces the memory overhead of loading data through the DBAPI, as in the sketch below.
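A hedged sketch of the column-based variant applied to the event query; the id and severity columns on Event are assumed names for illustration:

    # Querying columns yields lightweight tuples instead of full ORM objects.
    event_severities = {}
    for event_id, severity in session.query(Event.id, Event.severity):
        event_severities[event_id] = severity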

                              +-----------+
                          /---|   Pool    |---\
          +-------------+/    +-----------+    \   +--------+     __________
connect()-|   Engine    |x                      x--| DBAPI  |----(_database_)
          +-------------+\    +-----------+    /   +--------+
                          \---|  Dialect  |---/
                              +-----------+

                Figure 1. General structure of SQLAlchemy in database access.
                (Source: http://docs.sqlalchemy.org/en/rel_0_5/dbengine.html)
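In code, the figure corresponds roughly to the sketch below: create_engine() builds the Engine with its Pool and Dialect, and connect() checks a DBAPI connection out of the Pool (the DSN and table name are illustrative assumptions):

    from sqlalchemy import create_engine

    # The Engine owns the connection Pool and the MySQL Dialect (DSN illustrative).
    engine = create_engine("mysql://user:password@localhost/cmha")

    # connect() checks a DBAPI connection out of the Pool.
    conn = engine.connect()
    result = conn.execute("SELECT COUNT(*) FROM event")  # executed through the DBAPI
    print(result.scalar())
    conn.close()  # returns the connection to the Pool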

Summary:

1. When querying the whole table, the DBAPI still pre-buffers the rows and consumes memory even if the query is written to return ORM objects one at a time rather than as a list (note that there is no difference in DB traffic between these queries). However, it is clear that returning the query object takes less memory than returning the full list of queried objects.

2. Whether querying single columns of the table is feasible depends on how the service works with the events collected from the database, and on the design of the Event model. It does not guarantee that the memory overhead will disappear, but it does reduce memory usage up front. If full ORM objects are still required, a hedged alternative is sketched below.
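One more hedged option: SQLAlchemy 0.9 and later provide load_only(), which still returns full ORM objects but defers all columns except the listed ones; the column names here are assumptions for illustration:

    from sqlalchemy.orm import load_only

    # Load Event objects but populate only the listed columns (names assumed);
    # the remaining columns are deferred and fetched lazily on first access.
    events = session.query(Event).options(load_only("id", "severity")).all()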