Differences between list, set and tuple in Python, and more such comparisons

In this article we will look at the key differences between commonly used terms in Python, for example the difference between tuple and list, between range() and xrange(), and so on.


Lists and Tuples:


Lists:
- are mutable, i.e. we can add to, extend, or update a list.
- are generally homogeneous data structures.


>>> [x for x in range(10)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
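
A quick demonstration of list mutability:

>>> mylist = [1, 2, 3]
>>> mylist.append(4)        # add a single item
>>> mylist.extend([5, 6])   # add multiple items at once
>>> mylist[0] = 0           # update an item in place
>>> mylist
[0, 2, 3, 4, 5, 6]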


Tuples:
- are immutable, i.e. we cannot update a tuple.
- are generally heterogeneous data structures.
- are sequences where position has semantic value.
- are more like records: collections of a fixed number of fields.

>>> import time
>>> time.localtime()
time.struct_time(tm_year=2017, tm_mon=9, tm_mday=25, tm_hour=21, tm_min=52, tm_sec=24, tm_wday=0, tm_yday=268, tm_isdst=0)
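
Attempting to update a tuple raises an error:

>>> point = (3, 4)
>>> point[0] = 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment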
 

Set:

- is just like a mathematical set.
- is unordered.
- is mutable.
- doesn't contain duplicate values.

rana@Brahma: ~$ python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> mylist = [1,2,5,2,3]
>>> a = set(mylist)
>>> a
{1, 2, 3, 5}
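
Since sets behave like mathematical sets, the usual set operations are available:

>>> b = {2, 3, 4}
>>> a | b          # union
{1, 2, 3, 4, 5}
>>> a & b          # intersection
{2, 3}
>>> a - b          # difference
{1, 5}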



range() and xrange():


range():
- In Python 2, the range() function returns a list of integers.

rana@Brahma: ~$ python2
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

- In Python 3, range() does what xrange() used to do in Python 2, i.e. it returns a lazy range object that produces numbers on demand (for example while looping) instead of building the whole list in memory; hence the term "lazy evaluation". The Python 2 xrange() behaviour is shown below.

rana@Brahma: ~$ python2
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = xrange(10)
>>> a
xrange(10)
>>> [i for i in xrange(10)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
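
The equivalent in Python 3:

rana@Brahma: ~$ python3
>>> a = range(10)
>>> a
range(0, 10)
>>> [i for i in a]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]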

xrange():
- There is no xrange() function in Python 3. We already know what xrange() used to do in Python 2.

>>> a = xrange(10)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'xrange' is not defined

- Since we are moving towards Python 3, we will not discuss it any further. If you are really interested, visit this link.


raw_input() and input():


input():
- In Python 2, input() evaluates the user input as a Python expression. In the example below, 2^3 is parsed as bitwise XOR, hence the result 1.

rana@Brahma: ~$ python2
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> input()
2^3
1
>>> input(2**3)
8

- In Python 3, input() returns the user input as a string.


raw_input():
- In Python 2, returns the user input as a string.
- Doesn't exist in Python 3; Python 3's input() behaves like Python 2's raw_input(). To simulate Python 2's input() in Python 3, use eval(input()).
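
A quick check in Python 3:

rana@Brahma: ~$ python3
>>> input()
2**3
'2**3'
>>> eval(input())
2**3
8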



Shallow and deep copy:


Before comparing shallow and deep copy, we must know that normal assignment does not copy anything: it just points the new variable at the existing object.

rana@Brahma: ~$ python3
Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a = [1,2]
>>> b = a
>>> print(id(a) == id(b))
True


The comparison of shallow copy and deep copy is relevant for compound objects, i.e. objects that contain other objects.


Shallow copy: A shallow copy creates a new outer object and then uses references to refer to the inner objects of the original.

>>> a = [[1,2],[3,4]]
>>> b = a.copy()
>>> print(id(a) == id(b))
False
>>> print(id(a[0]) == id(b[0]))
True
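
The consequence: mutating a shared inner object through the shallow copy also changes the original.

>>> b[0].append(99)
>>> a
[[1, 2, 99], [3, 4]]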


Deep copy: A deep copy creates a new object and recursively copies the inner objects too.

>>> from copy import deepcopy
>>> a = [[1,2],[3,4]]
>>> b = deepcopy(a)
>>> print(id(a) == id(b))
False
>>> print(id(a[0]) == id(b[0]))
False

The id() function returns the identity of an object; in CPython, this is the object's address in memory.

>>> print(id(a))
139724773442120


Unicode and str:

Here is a nicely written article explaining the difference between str, unicode, and bytes.

            unicode              str                  bytes
Python 2    unicode characters   raw 8-bit values     n/a
Python 3    n/a                  unicode characters   raw 8-bit values
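
In Python 3, converting between the two is explicit:

>>> s = "café"             # str: unicode characters
>>> b = s.encode("utf-8")  # bytes: raw 8-bit values
>>> b
b'caf\xc3\xa9'
>>> b.decode("utf-8")
'café'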


*args and **kwargs:

*args:
- is used to pass a variable-length list of non-keyword arguments to a function.

**kwargs:
- is used to pass a variable-length list of keyword arguments to a function.

Always pass arguments in this order: formal arguments first, then the non-keyword arguments (*args), and then the keyword arguments (**kwargs).
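
A minimal sketch (the function name demo is illustrative):

>>> def demo(formal, *args, **kwargs):
...     print(formal, args, kwargs)
...
>>> demo(1, 2, 3, x=4, y=5)
1 (2, 3) {'x': 4, 'y': 5}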




Server Access Logging in Django using middleware

Some application admins need to know which user performed what action in the application. We also felt the need for such tracking, hence we started developing an access log system for our application.

In this article we will see how to develop a server access logging app for a Django project.


We will be storing the information below:

  • URL/link visited
  • Request method (GET or POST)
  • GET or POST data
  • Referrer
  • IP address of the visitor
  • Session ID


What is an access log:

An access log is a list of all the requests for individual files that visitors have made to the website.


Why access logs:

We can analyse the access logs and figure out multiple aspects of website visitors and their behaviour:
  • Origin of the request, i.e. the referrer
  • Location of the visitor
  • Which page or link is visited most
  • In case of an audit, which visitor clicked on which page and searched for what



Access Logs:

To start access logging, we will be using middleware. Create a Django project and create an app in the project.

We strongly recommend using a virtual environment for Python or Django project development.

To log the information, we need to create a model.


Model:
Create a model in your app's models.py file.

from django.db import models


class AccessLogsModel(models.Model):
    sys_id = models.AutoField(primary_key=True, null=False, blank=True)
    session_key = models.CharField(max_length=1024, null=False, blank=True)
    path = models.CharField(max_length=1024, null=False, blank=True)
    method = models.CharField(max_length=8, null=False, blank=True)
    data = models.TextField(null=True, blank=True)
    ip_address = models.CharField(max_length=45, null=False, blank=True)
    referrer = models.CharField(max_length=512, null=True, blank=True)
    timestamp = models.DateTimeField(null=False, blank=True)

    class Meta:
        app_label = "django_server_access_logs"
        db_table = "access_logs"

Middleware:
Now create a file logging_middleware.py in your app.

from .models import AccessLogsModel
from django.conf import settings
from django.utils import timezone


class AccessLogsMiddleware(object):

    def __init__(self, get_response=None):
        self.get_response = get_response
        # One-time configuration and initialization.

    def __call__(self, request):
        # create session
        if not request.session.session_key:
            request.session.create()

        access_logs_data = dict()

        # get the request path
        access_logs_data["path"] = request.path

        # get the client's IP address
        x_forwarded_for = request.META.get('HTTP_X_FORWARDED_FOR')
        access_logs_data["ip_address"] = x_forwarded_for.split(',')[0] if x_forwarded_for else request.META.get('REMOTE_ADDR')
        access_logs_data["method"] = request.method
        access_logs_data["referrer"] = request.META.get('HTTP_REFERER',None)
        access_logs_data["session_key"] = request.session.session_key

        data = dict()
        data["get"] = dict(request.GET.copy())
        data['post'] = dict(request.POST.copy())

        # remove password form post data for security reasons
        keys_to_remove = ["password", "csrfmiddlewaretoken"]
        for key in keys_to_remove:
            data["post"].pop(key, None)

        access_logs_data["data"] = data
        access_logs_data["timestamp"] = timezone.now()

        try:
            AccessLogsModel(**access_logs_data).save()
        except Exception:
            # never break the request cycle because of a logging failure
            pass

        response = self.get_response(request)
        return response



In the file above, we are doing the following things:

  1. Since we are using the session key to uniquely identify the visitor, we create a session if the session key doesn't exist yet.
  2. Get the path, i.e. the URL which the user/visitor visited.
  3. Collect the IP address, request method, referrer URL, and session key.
  4. Collect the POST and GET data and remove sensitive information like passwords. You may edit this logic as per your requirements.
  5. Store the data with a timestamp in the table.


Settings:

For the above code to work, we need to complete the settings below.

Add your app to the INSTALLED_APPS list in the settings.py file.

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'django_server_access_logs'
]


Add your middleware class to the MIDDLEWARE list.

MIDDLEWARE = [
    'django.middleware.security.SecurityMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.middleware.common.CommonMiddleware',
    'django.middleware.csrf.CsrfViewMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    'django.contrib.messages.middleware.MessageMiddleware',
    'django.middleware.clickjacking.XFrameOptionsMiddleware',
    'django_server_access_logs.logging_middleware.AccessLogsMiddleware',
]


Make and run the migrations to create the model's table in the database, as shown below.
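
Assuming the app is named django_server_access_logs as above:

python manage.py makemigrations django_server_access_logs
python manage.py migrate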

Now start hitting the application URLs and you will see entries appear in your table.


Complete code is available on Github.



Future Work:

You may extend the middleware code to:
  • Ignore hits on static and media URLs.
  • Log the user_id if the user is logged in.


Data Cleanup:

Since the access_logs table will take a lot of space, it is a good idea to delete old entries.

You might want to create a management command and schedule it to delete the data periodically.

Create a command and use the code below for cleanup.

import datetime

from django.core.management.base import BaseCommand
from django.utils import timezone

# management commands live in <app>/management/commands/, so import the
# model via the app path rather than a relative import
from django_server_access_logs.models import AccessLogsModel


class Command(BaseCommand):

    help = 'Clean the user access logs older than x days'

    def handle(self, *args, **options):
        days_to_keep_data = 7
        back_date = timezone.now() - datetime.timedelta(days=days_to_keep_data)
        # delete all log entries older than the cutoff date
        AccessLogsModel.objects.filter(timestamp__lt=back_date).delete()
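
Assuming the file is saved as management/commands/clean_access_logs.py inside the app (the command name is illustrative), run it manually or schedule it, e.g. daily via cron:

python manage.py clean_access_logs
# hypothetical crontab entry: run the cleanup every day at 2 AM
0 2 * * * cd /path/to/project && python manage.py clean_access_logs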

A must-read article on middleware:

How to develop a distributable Django app to block the crawling IP addresses


Comparing celery-rabbitmq docker cluster, multi-threading and scrapy framework for 1000 requests

I recently tried scraping tweets quickly using a Celery-RabbitMQ Docker cluster.

Since I was hitting the same servers repeatedly, I used rotating proxies via the Tor network. It turned out that this is not very fast, and rotating proxies via Tor is not a nice thing to do anyway. I was able to scrape approx 10000 tweets in 60 seconds, i.e. 166 tweets per second. Not an impressive number. (But I was able to make Celery, RabbitMQ, a rotating proxy via the Tor network, and Postgres work together in a docker cluster.)

The above approach was not very fast, hence I compared the three approaches below for sending multiple requests and parsing the responses.

- Celery-RabbitMQ docker cluster
- Multi-Threading
- Scrapy framework

I planned to send requests to 1 million websites, but once I started, I figured out that it would take a whole day to finish, hence I settled for 1000 URLs.


Celery RabbitMQ docker cluster:

I started with the Celery-RabbitMQ docker cluster. I removed Tor from the docker cluster because now only one request would be sent to each of the 1000 different sites, which is perfectly fine.

Please download the code from the GitHub repo and go through the README file to start the docker cluster.
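
The worker task in such a cluster boils down to something like the sketch below. This is illustrative, not the repo's exact code, and the broker URL is an assumption.

# tasks.py - a minimal sketch, not the repo's exact code
import requests
from celery import Celery

# broker URL is an assumption; point it at your RabbitMQ container
app = Celery("scraper", broker="amqp://guest:guest@rabbitmq:5672//")

@app.task
def fetch(url):
    # send one request and record the HTTP status code
    try:
        return requests.get(url, timeout=10).status_code
    except requests.RequestException:
        return None

Workers are then started with the desired concurrency, e.g. celery -A tasks worker --concurrency=10.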

I first tried with 10 workers, setting the concurrency in each worker to 10. It took 53 seconds to send the 1000 requests and get the responses.

I then increased the worker count to 15 and then to 20. Below are the results; time is in seconds and free memory is in MB. Every container takes some memory, hence I tracked the memory usage each time the worker count was increased.

[Results table: worker count and concurrency vs. time taken (seconds) and free memory (MB)]


As you can see, 15 workers, each with a concurrency of 20, took the minimum time to send the 1000 requests, of which 721 returned HTTP status 200. But this setup also used the maximum memory.

We could achieve better performance if more memory were available; I ran this on a machine with 8 GB RAM. As we increased the concurrency/worker count, fewer HTTP 200 responses were received.



Multi-Threaded Approach:

I used the simplest form of the multi-threaded approach to send multiple requests at once.

The code is available in the multithreaded.py file. Create a virtual environment, install the dependencies, run the code, and measure the time.
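
Here is a minimal sketch of the idea, not the exact multithreaded.py from the repo; it assumes the URLs live in one_thousand_websites.txt (the file name used in the Scrapy section below):

import concurrent.futures
import requests

def fetch(url):
    # send one request and return its HTTP status code
    try:
        return requests.get(url, timeout=10).status_code
    except requests.RequestException:
        return None

with open("one_thousand_websites.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# run the requests on a pool of worker threads
with concurrent.futures.ThreadPoolExecutor(max_workers=128) as executor:
    statuses = list(executor.map(fetch, urls))

print(statuses.count(200), "URLs returned HTTP 200")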

I started with 8 threads, then 16, and went up to 256 threads. Since memory usage wasn't an issue this time, I didn't track it. Here are the results.

[Results table: thread count vs. time taken (seconds)]


As you can see, we got the best performance with 128 threads.



Scrapy Framework:

For this I created a Scrapy project where the start_urls variable was populated from the links in a text file.

Once the Scrapy project is created, create a spider in it.

spiders/myspider.py

import scrapy
from scrapy import signals

# to run : scrapy crawl my-scraper
class MySpider(scrapy.Spider):
    name = "my-scraper"
    start_urls = [
    ]

    status_timeouts = 0
    status_success = 0
    status_others = 0
    status_exception = 0

    def parse(self, response):
        print(response.status)

        if response.status == 200:
            self.status_success += 1
        else:
            self.status_others += 1

        print(self.status_success, self.status_others)

    def spider_opened(self):
        # populate start_urls from the text file when the spider starts
        links = list()
        with open("one_thousand_websites.txt", "r") as f:
            links = f.readlines()

        self.start_urls = [link.strip() for link in links]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # hook spider_opened() up to Scrapy's spider_opened signal
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider


First I ran it with the default settings, where Scrapy has to obey the robots.txt file. The concurrent requests value was the default, i.e. 16.

The experience with Scrapy was the worst. There were lots of redirects from HTTP to HTTPS, and for each site it would read the robots.txt file first and only then fetch the response.

It had been running for 1500 seconds and only 795 URLs had been processed when I aborted the process.

Now I changed the default settings.

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 64


Max concurrency was initially kept at 128, but then I started receiving the "Couldn't bind: 24: Too many open files" error.

I raised the maximum number of files that can be opened at a time from 1024 to 10000 using the command ulimit -n 10000.

This didn't solve the problem. I reduced the max concurrency to 64, but the error kept coming back after approx 500 URLs had been processed. Later on, Scrapy started throwing the "DNS lookup failed: no results for hostname lookup" error as well.

At that point I didn't bother to debug these issues any further and gave up.



Conclusion:

If you need to send multiple requests quickly, then the docker cluster is the best option, provided you have enough memory at your disposal. It does take more time than the multi-threaded approach to set up and start, though.

If you have limited memory, then you may compromise a bit on speed and use the multi-threaded approach. It is also easy to set up.

Scrapy is not that good, or maybe I was not able to make it work well for this use case.


Please comment if you have any better approach or if you think I missed something while trying these approaches.


