Sometimes we need to know who made what changes to which table. This might be required for legal audit purposes or simply for organisation-level logging.
There are multiple Django apps available online which can help you log model changes, but there is no fun in doing that.
We will see how to do it without using a ready-made app and hence learn something in the process.
For example, say we have some data in the cache as well as in the DB. We read data from the cache and, if it is not found there, fall back to the DB. Now whenever the DB is updated, we need to update the cache as well.
But we might update the model from multiple views, so it is tedious and not clean to write the cache-update logic in every such view.
This is where signals come into the picture. The signal system has two main components: senders and receivers.
As the names suggest, a sender dispatches signals and a receiver receives them to perform some action. To receive a signal, we need to register the receiver function. Every time a model is updated, a signal is dispatched and the required action is performed by the receiver.
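For instance, the cache-update problem described above could be handled by a single receiver. Below is a minimal sketch, assuming Django's cache framework is configured and using a hypothetical Product model purely for illustration:

```python
from django.core.cache import cache
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import Product  # hypothetical model, used only for illustration


@receiver(post_save, sender=Product)
def refresh_product_cache(sender, instance, **kwargs):
    # whenever a Product row is saved, overwrite its cache entry
    cache.set("product:%d" % instance.pk, instance, timeout=3600)
```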
Django has a set of built-in signals which send notifications to user code. You can read all about signals in the official Django documentation.
For this article, we will require the pre_save, post_save and pre_delete signals, which are part of the set of signals sent by Django models. The django.db.models.signals module defines this set of signals.
We will keep things simple and then slowly proceed to specific scenarios. So for now, let's say we need to log changes made to every model.
First, create a file signals.py in your app's root directory. Create a function save_model_changes in the signals file. This function will work as the receiver.
```python
from django.db.models.signals import post_save, pre_delete, pre_save
from django.dispatch import receiver


# this receiver is executed every time some data is about to be saved in any table
@receiver(pre_save)
def audit_log(sender, instance, **kwargs):
    # code to execute before every model save
    print("Inside signal code")
```
To execute the code in the receiver function for just one model, define the sender in the @receiver decorator.
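For example, continuing with the imports from the snippet above and using a hypothetical Payment model:

```python
@receiver(pre_save, sender=Payment)  # Payment is a hypothetical model
def payment_audit_log(sender, instance, **kwargs):
    # runs only when a Payment instance is about to be saved
    print("Inside signal code for Payment")
```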
Before we proceed further, we need to tell our app config about the signal we just created. In the apps.py file, import the signals in the ready() method.
```python
from django.apps import AppConfig


class MyAppConfig(AppConfig):
    name = 'myapp'

    def ready(self):
        # executed every time the server restarts
        import myapp.signals
```
Then, in the app's __init__.py file, point Django to this config:

```python
default_app_config = 'myapp.apps.MyAppConfig'
```
Now our signal is ready. Every time we save anything to any model, we can see "Inside signal code" being printed in the terminal.
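For example, saving any instance from the Django shell should now trigger the receiver (MyModel here is a stand-in for any model in your project, and we assume the table has at least one row):

```python
# python manage.py shell
from myapp.models import MyModel  # any model in the project

obj = MyModel.objects.first()
obj.save()  # "Inside signal code" is printed by the pre_save receiver
```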
The table structure may vary as per your application's requirements. My audit table looks something like the one below.
```python
from django.db import models


class ModelChangeLogsModel(models.Model):
    user_id = models.BigIntegerField(null=False, blank=True, db_index=True)
    table_name = models.CharField(max_length=132, null=False, blank=True)
    table_row = models.BigIntegerField(null=False, blank=True)
    data = models.TextField(null=False, blank=True)
    action = models.CharField(max_length=16, null=False, blank=True)  # saved or deleted
    timestamp = models.DateTimeField(null=False, blank=True)

    class Meta:
        app_label = "myapp"
        db_table = "model_change_logs"
```
So I will be storing which user made the changes to which table and which row was edited.
The data column will contain a snapshot of the row in JSON format. The action column will be used to store whether the row was updated or deleted.
Now the simplest thing would be to store the complete serialised model instance in the data column, but that would take a lot of space. You may store only the fields which were changed and ignore the rest. For this, we need to hit the database once more to retrieve the previous instance of the model, compare it with the current instance, and then save only the changed fields.
```python
# at the top of signals.py
import logging

from django.core.exceptions import ObjectDoesNotExist

# inside the receiver function
try:
    table_pk = instance._meta.pk.name
    table_pk_value = instance.__dict__[table_pk]
    query_kwargs = dict()
    query_kwargs[table_pk] = table_pk_value  # dynamic column name
    prev_instance = sender.objects.get(**query_kwargs)
except ObjectDoesNotExist as e:
    # this instance is being created, not updated. ignore and return
    logging.getLogger("info_logger").info("Signals: creating new instance of " + str(sender))
    return
```
We first get the primary key of the instance and try to fetch the older instance of the model from the database.
But if we are not updating the instance and are instead creating it for the first time, we will not find any older instance in the DB, hence we need to handle that case, which is done via the try-except block above.
You can get the table name as instance._meta.db_table.
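Putting the pieces together, a minimal sketch of the complete receiver could look like the following. It compares old and new field values and writes only the changed fields to ModelChangeLogsModel; user_id is left as a placeholder because attaching the request user needs extra plumbing (for example, middleware) that this article does not cover.

```python
import json
import logging

from django.core.exceptions import ObjectDoesNotExist
from django.db.models.signals import pre_save
from django.dispatch import receiver
from django.utils import timezone

from myapp.models import ModelChangeLogsModel


@receiver(pre_save)
def audit_log(sender, instance, **kwargs):
    # skip the audit table itself, otherwise every log entry would fire the signal again
    if sender is ModelChangeLogsModel:
        return
    try:
        table_pk = instance._meta.pk.name
        table_pk_value = getattr(instance, table_pk)
        prev_instance = sender.objects.get(**{table_pk: table_pk_value})
    except ObjectDoesNotExist:
        # the instance is being created, not updated
        logging.getLogger("info_logger").info("Signals: creating new instance of " + str(sender))
        return

    # keep only the fields whose values actually changed
    changed = {}
    for field in instance._meta.fields:
        old_value = getattr(prev_instance, field.name)
        new_value = getattr(instance, field.name)
        if old_value != new_value:
            changed[field.name] = {"old": str(old_value), "new": str(new_value)}

    if changed:
        ModelChangeLogsModel.objects.create(
            user_id=0,  # placeholder: populate from the request user in a real app
            table_name=instance._meta.db_table,
            table_row=table_pk_value,
            data=json.dumps(changed),
            action="saved",
            timestamp=timezone.now(),
        )
```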
So these are the basics of how to use signals to log model changes. You can optimise the approach as per your requirements, such as tracking only specific/critical models like payments and finances, or storing only a few specific columns.
I recently tried scraping tweets quickly using a Celery-RabbitMQ Docker cluster.
Since I was hitting the same servers, I used rotating proxies via the Tor network. It turned out this is not very fast, and using rotating proxies via Tor is not a nice thing to do anyway. I was able to scrape approximately 10000 tweets in 60 seconds, i.e. 166 tweets per second. Not an impressive number. (But I was able to make Celery, RabbitMQ, a rotating proxy via the Tor network and Postgres work together in a Docker cluster.)
The above approach was not very fast, hence I tried to compare the below three approaches to sending multiple requests and parsing the responses.
- Celery-RabbitMQ docker cluster
- Python multithreading
- Scrapy framework
I planned to send requests to 1 million websites, but once I started, I figured out that it would take one whole day to finish, hence I settled for 1000 URLs.
Please download the code from the GitHub repo and go through the README file to start the docker cluster.
I first tried with 10 workers, setting the concurrency in each worker to 10. It took 53 seconds to send the 1000 requests and get the responses.
I then increased the worker count to 15 and then to 20. Below is the result. Time is in seconds and free memory is in MB.
Every container takes some memory, so I tracked the memory usage each time the worker count was increased.
As you can see, 15 workers, each with a concurrency of 20, took the minimum time to send the 1000 requests, out of which 721 returned HTTP status 200. But this also used the maximum memory.
We could achieve better performance if more memory were available; I ran this on a machine with 8 GB RAM. As we increased the concurrency/workers, fewer HTTP 200 statuses were received.
The code for the multithreaded approach is available in the GitHub repo. Create a virtual environment and install the dependencies, then run the code and measure the time.
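For reference, here is a minimal sketch of this multithreaded approach (not the exact code from the repo), reading the URLs from the same one_thousand_websites.txt file that the Scrapy spider below uses:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests


def fetch(url):
    # return the HTTP status code, or None if the request failed
    try:
        return requests.get(url, timeout=10).status_code
    except requests.RequestException:
        return None


with open("one_thousand_websites.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

start = time.time()
with ThreadPoolExecutor(max_workers=128) as executor:  # vary the thread count: 8, 16, ..., 256
    statuses = list(executor.map(fetch, urls))

success = statuses.count(200)
print("Fetched %d URLs in %.2f seconds, %d returned HTTP 200"
      % (len(urls), time.time() - start, success))
```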
I started with 8 threads, then 16, and went up to 256 threads.
Since memory usage wasn't an issue this time, I didn't track it.
Here are the results.
As you can see, we got the best performance with 128 threads.
Next is the Scrapy framework. Once the Scrapy project is created, create a spider in it. The start_urls variable is populated from the links in a text file.
```python
import scrapy
from scrapy import signals


# to run: scrapy crawl my-scraper
class MySpider(scrapy.Spider):
    name = "my-scraper"
    start_urls = []

    status_timeouts = 0
    status_success = 0
    status_others = 0
    status_exception = 0

    def parse(self, response):
        print(response.status)
        if response.status == 200:
            self.status_success += 1
        else:
            self.status_others += 1
        print(self.status_success, self.status_others)

    def spider_opened(self):
        links = list()
        with open("one_thousand_websites.txt", "r") as f:
            links = f.readlines()
        self.start_urls = [link.strip() for link in links]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider
```
First I ran it with the default settings, where it has to obey the robots.txt file. The concurrent requests value was the default, i.e. 16.
The experience with Scrapy was the worst. There were lots of redirections from HTTP to HTTPS, and it would read the robots.txt file before fetching the response.
It had been running for 1500 seconds and only 795 URLs had been processed when I aborted the process.
Now I changed the default settings.
```python
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 64
```
Max concurrency was kept at 128, but then I started receiving a "Couldn't bind: 24: Too many open files" error.
I updated the maximum number of files that can be opened at a time from 1024 to 10000 using the command ulimit -n 10000.
This didn't solve the problem.
I reduced the max concurrency to 64, but the error kept coming back after approximately 500 URLs were processed. Later on, Scrapy started throwing a "DNS lookup failed: no results for hostname lookup" error as well.
At that point I didn't bother to debug these issues and gave up.
If you have limited memory, you may compromise a bit on speed and use the multithreaded approach. It is also easy to set up.
Scrapy is not that good, or maybe I was not able to make it work for me.
Please comment if you have any better approach or if you think I missed something while trying these approaches.
You will face at least two types of errors while coding with Python: syntax errors and exceptions.
Syntax errors are also known as parsing errors. The parser reports a parsing error by pointing at the offending part of the line with an arrow.
```
rana@brahma:~$ python3
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> while True print("Hello")
  File "<stdin>", line 1
    while True print("Hello")
                   ^
SyntaxError: invalid syntax
>>>
```
Even when your code is syntactically perfect, some lines may report errors when you try to execute them. Errors reported at runtime are known as exceptions.
For uninterrupted execution, we must handle exceptions properly. We can do so by using try ... except blocks.
In the example below, we are trying to add a string and an integer. When the statement is executed, it throws an exception.
```
>>> a = 'rana' + 10
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'int' object to str implicitly
>>>
```
If we don't handle this exception, the program will exit abruptly. Handle the exception using a try ... except block.
```
>>> try:
...     a = 'rana' + 10
... except Exception as e:
...     print('Some useful message to debug the code')
...
Some useful message to debug the code
>>>
```
When the above piece of code is executed, the exception is thrown but is then caught by the except block, which in turn prints a useful message that helps in debugging the code.
The try ... except statement has an optional else clause. The else part is executed when no exception is thrown in the try clause, and it runs before any finally clause.
Let's say you are trying to open a file in the try block (just an example) and some error might occur while opening the file; we handle that exception in the except block. If no exception is thrown and the file is opened successfully, we still have to close the file descriptor.
```
>>> try:
...     f = open('filename.txt','r')
... except Exception as e:
...     print('Some exception in opening the file')
... else:
...     f.close()
...
```
It's better to add code to the else clause instead of adding it to the try clause. This helps avoid catching an exception that was not raised by the code being protected in the try clause. For example, suppose that in addition to opening the file in the try clause, we were also trying to convert a variable to an integer. Imagine the file opened just fine but the conversion to int threw an exception; it would be reported by the except block as an exception in opening the file, which would be misleading. See the example code below.
```
>>> try:
...     f = open('filename.txt','r')
...     a = 'five'
...     b = int(a)
... except Exception as e:
...     print('can not open file')
... else:
...     f.close()
...
can not open file
>>>
```
The above code should be rewritten as below.
```python
try:
    f = open('filename.txt', 'r')
    print('file was opened')
except Exception as e:
    print('can not open file')
else:
    print('going to close file')
    f.close()

try:
    a = 'five'
    b = int(a)
except Exception as e:
    print('exception in number conversion')
```
```
file was opened
going to close file
exception in number conversion
```