Using signals in Django to log changes in models

Sometimes we need to know who made what changes to which table. This might be required for legal audit purposes or simply for organisation-level logging.

There are several Django apps available online that can log model changes for you, but there is no fun in using one of those.

Instead, we will see how to do it without a ready-made app, and learn something in the process.



Signals:

Signals let a sender notify a receiver that some event has occurred and some action needs to be performed.

For example, say we keep some data in a cache as well as in the database. We read from the cache first and fall back to the database if the data is not found there. Whenever the database is updated, the cache needs to be updated as well.

But the model might be updated from multiple views, so writing the cache-update logic in every such view is tedious and not clean.

This is where signals come into the picture. The signal system has two main components: senders and receivers.

As the names suggest, the sender dispatches a signal and the receiver receives it and performs some action. To receive a signal, we need to register a receiver function. Every time a model is updated, a signal is dispatched and the receiver performs the required action.

Django has a set of built-in signals which notify user code of certain events. You can read all about signals in the official Django documentation.
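To make the cache example above concrete, here is a minimal sketch of a post_save receiver that refreshes the cached copy of a row. The model name Article and the cache key format are assumptions used only for illustration.

from django.core.cache import cache
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import Article  # hypothetical model, for illustration only


@receiver(post_save, sender=Article)
def refresh_article_cache(sender, instance, **kwargs):
    # overwrite the cached copy whenever the row is saved
    cache.set("article:%d" % instance.pk, instance, timeout=300)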

For this article we will need the pre_save, post_save and pre_delete signals, which are part of the set of signals sent by Django models and defined in the django.db.models.signals module.



Creating a Signal:

To log changes made to a model, we need to use either the pre_save or the post_save signal. We will keep things simple first and then slowly move on to more specific scenarios.

So for now, let's say we need to log changes made to every model.

First create a file signals.py in your app's root directory.

Create a function audit_log in the signals file. This function will work as the receiver.

from django.db.models.signals import post_save, pre_delete, pre_save
from django.dispatch import receiver


# this receiver runs every time any model instance is about to be saved
@receiver(pre_save)
def audit_log(sender, instance, **kwargs):
    # code to execute before every model save
    print("Inside signal code")


To execute the receiver function for just one model, specify the sender in the @receiver decorator.

@receiver(pre_save, sender=MyModel)
   

Before we proceed further, we need to tell our app config about the signals we just created. Inside your apps.py file, import the signals:

from django.apps import AppConfig


class MyAppConfig(AppConfig):
    name = 'myapp'

    def ready(self):
        # runs once when the app registry is fully loaded
        import myapp.signals

Inside your app's __init__.py file,

default_app_config = 'myapp.apps.MyAppConfig'


Now our signal is ready. Every time we save anything to any model, we can see "Inside signal code" printed in the terminal.


Logging Model Changes:

Now that our signal is ready and working, all we need to do is write the logic to log the changes in an audit table. Create a table where we will store the changes.

The table structure may vary as per your application's requirements. My audit table looks something like this:

from django.db import models


class ModelChangeLogsModel(models.Model):
    user_id = models.BigIntegerField(null=False, blank=True, db_index=True) 
    table_name = models.CharField(max_length=132, null=False, blank=True)
    table_row = models.BigIntegerField(null=False, blank=True)
    data = models.TextField(null=False, blank=True)
    action = models.CharField(max_length=16, null=False, blank=True)  # saved or deleted
    timestamp = models.DateTimeField(null=False, blank=True)

    class Meta:
        app_label = "myapp"
        db_table = "model_change_logs"


So I will be storing which user made changes to which table and which row was edited.

The data column will contain a snapshot of the row in JSON format. The action column will store whether the row was saved or deleted.

Now, the simplest approach is to store the complete serialised model instance in the data column, but that takes a lot of space. Instead, you may store only the fields that were changed and ignore the rest. For this, we need to hit the database once more to retrieve the previous instance of the model, compare it with the current instance, and then save the changed fields.


Getting the previous instance of the model:

import logging

from django.core.exceptions import ObjectDoesNotExist


@receiver(pre_save)
def audit_log(sender, instance, **kwargs):
    try:
        # primary key field name and its value for the instance being saved
        table_pk = instance._meta.pk.name
        table_pk_value = instance.__dict__[table_pk]
        query_kwargs = {table_pk: table_pk_value}  # dynamic column name
        prev_instance = sender.objects.get(**query_kwargs)
    except ObjectDoesNotExist:
        # this instance is being created, not updated: ignore and return
        logging.getLogger("info_logger").info("Signals: creating new instance of " + str(sender))
        return


We first get the primary key of the instance and use it to fetch the older instance of the model from the database.

But if we are creating the instance for the first time rather than updating it, there is no older instance in the database, so we need to handle that case, which the try-except block above does.

You can get the table name as str(instance._meta.db_table).
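To complete the picture, here is a minimal sketch of how the changed fields could be compared and written to the audit table. It is only a sketch: the helper name log_changed_fields is an assumption, and user_id is passed in by the caller because a model signal by itself does not know which user triggered the save.

import json

from django.utils import timezone

from myapp.models import ModelChangeLogsModel


def log_changed_fields(instance, prev_instance, user_id=0):
    # collect only the fields whose value actually changed
    changed = {}
    for field in instance._meta.fields:
        old_value = getattr(prev_instance, field.name, None)
        new_value = getattr(instance, field.name, None)
        if old_value != new_value:
            changed[field.name] = {"old": str(old_value), "new": str(new_value)}

    if not changed:
        return  # nothing to log

    ModelChangeLogsModel.objects.create(
        user_id=user_id,  # assumption: resolved by the caller
        table_name=str(instance._meta.db_table),
        table_row=instance.pk,
        data=json.dumps(changed),
        action="saved",
        timestamp=timezone.now(),
    )

This helper would be called from the pre_save receiver above, right after prev_instance is fetched.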

So these are the basics of using signals to log model changes. You can optimise the approach as per your requirements, for example by tracking only specific or critical models such as payments and finances, or by storing only a few specific columns.



Comparing celery-rabbitmq docker cluster, multi-threading and scrapy framework for 1000 requests

I recently tried scraping tweets quickly using a Celery-RabbitMQ Docker cluster.

Since I was hitting the same servers, I used rotating proxies via the Tor network. It turned out this is not very fast, and using rotating proxies via Tor is not a nice thing to do anyway. I was able to scrape approximately 10000 tweets in 60 seconds, i.e. about 166 tweets per second. Not an impressive number. (But I did get Celery, RabbitMQ, a rotating proxy via the Tor network and Postgres working together in a Docker cluster.)

Since that approach was not very fast, I decided to compare the following three approaches for sending multiple requests and parsing the responses:

- Celery-RabbitMQ docker cluster
- Multi-Threading
- Scrapy framework

I planned to send requests to 1 million websites, but once I started I realised it would take a whole day to finish, so I settled for 1000 URLs.


Celery RabbitMQ docker cluster:

I started with the Celery-RabbitMQ Docker cluster. I removed Tor from the cluster because this time only one request would be sent to each of the 1000 different sites, which is perfectly fine.

Please download the code from the GitHub repo and go through the README file to start the Docker cluster.
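For orientation, a worker task in such a cluster might look roughly like the sketch below. The broker URL and task name are assumptions; the actual code in the repo may differ.

# tasks.py (sketch)
import requests
from celery import Celery

# assumes a RabbitMQ service reachable at the hostname "rabbitmq" inside the cluster
app = Celery("scraper", broker="amqp://guest:guest@rabbitmq:5672//")


@app.task
def fetch_url(url):
    # fetch a single URL and return its HTTP status code, or None on failure
    try:
        return requests.get(url, timeout=10).status_code
    except requests.RequestException:
        return None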

I first tried with 10 workers, each with a concurrency of 10. It took 53 seconds to send 1000 requests and receive the responses.

I then increased the worker count to 15 and then to 20. The results are below; time is in seconds and free memory is in MB. Every container takes some memory, so I also tracked the memory usage each time the worker count was increased.

[Results table: workers and per-worker concurrency vs. time taken (seconds) and free memory (MB)]


As you can see, 15 workers, each with a concurrency of 20, took the minimum time to send the 1000 requests, out of which 721 returned HTTP status 200. But this configuration also used the most memory.

We could achieve better performance if more memory were available; I ran this on a machine with 8 GB RAM. As the concurrency and worker count increased, fewer HTTP 200 responses were received.



Multi-Threaded Approach:

I used the simplest form of the multi-threaded approach to send multiple requests at once.

The code is available in the multithreaded.py file. Create a virtual environment, install the dependencies, run the code and measure the time.
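The gist of the multi-threaded version is a thread pool mapping a fetch function over the URL list. Below is a minimal sketch, assuming the URLs live in the one_thousand_websites.txt file; the actual multithreaded.py in the repo may differ.

import concurrent.futures

import requests


def fetch(url):
    # return the HTTP status code, or None if the request fails
    try:
        return requests.get(url, timeout=10).status_code
    except requests.RequestException:
        return None


with open("one_thousand_websites.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# thread count chosen for illustration; the experiment tries 8 to 256 threads
with concurrent.futures.ThreadPoolExecutor(max_workers=128) as executor:
    statuses = list(executor.map(fetch, urls))

print("HTTP 200 responses:", statuses.count(200))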

I started with 8 threads, then 16, and went up to 256 threads. Since memory usage wasn't an issue this time, I didn't track it. Here are the results.

[Results table: thread count vs. time taken (seconds)]


As you can see, we got the best performance with 128 threads.



Scrapy Framework:

For this I created a Scrapy project where the start_urls variable was populated from the links in a text file.

Once the Scrapy project is created, create a spider in it.

spiders/myspider.py

import scrapy
from scrapy import signals

# to run : scrapy crawl my-scraper
class MySpider(scrapy.Spider):
    name = "my-scraper"
    start_urls = [
    ]

    status_timeouts = 0
    status_success = 0
    status_others = 0
    status_exception = 0

    def parse(self, response):
        print(response.status)

        if response.status == 200:
            self.status_success += 1
        else:
            self.status_others += 1

        print(self.status_success, self.status_others)

    def spider_opened(self):
        links = list()
        with open("one_thousand_websites.txt", "r") as f:
            links = f.readlines()

        self.start_urls = [link.strip() for link in links]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider


First I ran it with the default settings, where it has to obey the robots.txt file and the concurrent requests value is the default, i.e. 16.

The experience with Scrapy was the worst. There were lots of redirections from http to https, then it would read the robots.txt file and only then fetch the response.

It had run for 1500 seconds and only 795 URLs had been processed when I aborted the process.

Now I changed the default settings.

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 64


Maximum concurrency was set to 128, but then I started receiving a "Couldn't bind: 24: Too many open files" error.

I updated the maximum number of files that can be open at a time from 1024 to 10000 using the command ulimit -n 10000.

This didn't solve the problem. I reduced the maximum concurrency to 64, but the error kept coming back after approximately 500 URLs were processed. Later on, Scrapy started throwing "DNS lookup failed: no results for hostname lookup" errors as well.

At that point I didn't bother to debug these issues and gave up.



Conclusion:

If you need to send multiple requests quickly, then the Docker cluster is the best option, provided you have enough memory at your disposal. It does take more time than the multi-threaded approach to set up and start.

If you have limited memory, you may compromise a bit on speed and use the multi-threaded approach. It is also easy to set up.

Scrapy did not work well for me, or maybe I just wasn't able to make it work for my use case.


Please comment if you have any better approach or if you think I missed something while trying these approaches.


try .. except .. else .. in Python with example

You will face at least two types of errors while coding in Python: syntax errors and exceptions.

Syntax errors are also known as parsing errors. The interpreter points out a parsing error with a small arrow.


rana@brahma:~$ python3
Python 3.5.2 (default, Nov 12 2018, 13:43:14) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> while True print("Hello")
  File "<stdin>", line 1
    while True print("Hello")
                   ^
SyntaxError: invalid syntax
>>> 


Even when your code is syntactically perfect, some lines may report an error when you try to execute them. Errors detected at runtime are known as exceptions.

For uninterrupted execution, we must handle exceptions properly. We can do so using a try-except block.

In the example below we try to add a string and an integer. When the statement is executed, it throws an exception.

>>> a = 'rana' + 10
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'int' object to str implicitly
>>> 


If we don't handle this exception, the program will exit abruptly. Handle the exception using try and except statements.


>>> 
>>> try:
...     a = 'rana' + 10
... except Exception as e:
...     print('Some useful message to debug the code')
... 
Some useful message to debug the code
>>> 

When the above piece of code is executed, the exception is thrown but is then caught by the except block, which in turn prints a useful message that helps in debugging the code.


The try ... except statement has an optional else clause. The else part is executed only when no exception is raised in the try clause, and it runs before any finally clause.
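A quick illustration of the ordering: the conversion below succeeds, so the else part runs, and the finally part always runs last.

try:
    number = int("42")
except ValueError:
    print("conversion failed")
else:
    print("no exception, got", number)
finally:
    print("finally runs last")

Output:

no exception, got 42
finally runs last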

Let's say you are opening a file in the try block (just as an example) and some error might occur while opening it; we handle that exception in the except block. If no exception is thrown, the file is opened successfully and we then have to close the file descriptor.


>>> try:
...     f = open('filename.txt','r')
... except Exception as e:
...     print('Some exception in opening the file')
... else:
...     f.close()
...


It is better to add code to the else clause instead of adding it to the try clause. This helps avoid catching an exception that was not raised by the code being protected in the try clause. For example, suppose that in addition to opening a file in the try clause, we were also converting a variable to an integer. Imagine the file opened just fine but the conversion to int threw an exception: the except block would report it as an exception in opening the file, which would be misleading. See the example code below.


>>> try:
...     f = open('filename.txt','r')
...     a = 'five'
...     b = int(a)
... except Exception as e:
...     print('can not open file')
... else:
...     f.close()
... 
can not open file
>>> 
 

The above code should be rewritten as below.

try:
    f = open('filename.txt','r')
    print('file was opened')
except Exception as e:
    print('can not open file')
else:
    print('going to close file')
    f.close()
try:
    a = 'five'
    b = int(a)
except Exception as e:
    print('exception in number conversion')

Output:

file was opened
going to close file
exception in number conversion

This makes sure that the actual exception is reported by the right except clause.


Now the question arises: should we use the except or the else block to control the execution flow?

In Python, using exceptions to control the execution flow is normal. For example, iterators use the StopIteration exception to signal the end of items.
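For example, calling next() on an exhausted iterator raises StopIteration, which a for loop catches internally to know when to stop:

>>> it = iter([1, 2])
>>> next(it)
1
>>> next(it)
2
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> 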

You can read more about exceptions by visiting the links in the references section below.

References:

1. https://stackoverflow.com/questions/16138232/is-it-a-good-practice-to-use-try-except-else-in-python
2. https://docs.python.org/3/tutorial/errors.html





