Python Script 5: How to find most popular technologies on Stackoverflow

This script crawls Stack Overflow question pages and counts how often each tag appears, then reports the most frequent tags as a rough indicator of the most popular technologies.

Important: Please do not send too many requests.

Respect the robots.txt file. Code is also available on GitHub.
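One way to follow both pieces of advice is sketched below. It is only an illustration: the helper names build_robot_parser and polite_fetch, the one-second default delay and the fetch callback are assumptions, not part of the original script. It uses the standard library's urllib.robotparser to check each URL against the robots.txt rules and sleeps between requests.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_parser(robots_txt):
    """Build a parser from the text of a robots.txt file (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch(urls, parser, fetch, user_agent="*", delay_seconds=1.0):
    """Call fetch(url) for every URL that robots.txt allows, pausing between calls.

    `fetch` is any callable taking a URL, for example a function that
    downloads the page. The delay value is an assumption; pick one that
    suits the site you are crawling.
    """
    results = {}
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            # robots.txt disallows this URL, so skip it
            continue
        results[url] = fetch(url)
        time.sleep(delay_seconds)
    return results
```

In a crawler, the robots.txt text would be downloaded once at start-up and a page-download function would be passed in as fetch.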

You will need to install the beautifulsoup4 and requests Python packages.



Code:
# Script to find the most used tags in questions on stackoverflow
# Author - Anurag Rana
# version - 1.0.0
# usage - python script_name


import operator, os, sys
import requests
from bs4 import BeautifulSoup


# global dictionary to store the count of tags processed so far
tag_count_dict = {}


def get_soup_from_link(link):
    html_text = requests.get(link).text
    soup = BeautifulSoup(html_text, 'html.parser')
    return soup


def tag_crawler(question_url):    
    soup = get_soup_from_link(question_url)
    tag_divs = soup.find_all('div', {'class': 'post-taglist'})
    for div in tag_divs:
        for tag_link in div.find_all('a'):
            tag = tag_link.string
            if tag is not None:
                tag_count_dict[tag] = tag_count_dict.get(tag, 0) + 1
                # print a dot to indicate script is progressing.
                print(".", end="")
                sys.stdout.flush()


def get_question_links(soup):
    return soup.find_all('a', {'class': 'question-hyperlink'})


def page_crawler(max_page_count):
    starting_url = 'https://stackoverflow.com/questions?page=PAGE_NUMBER&sort=newest'
    current_page_number = 1
    while current_page_number <= max_page_count:
        current_page_url = starting_url.replace('PAGE_NUMBER',str(current_page_number))                
        soup = get_soup_from_link(current_page_url)       
        print('Working on page number : ' + str(current_page_number))  
        # get link of all question posted on current page      
        question_links = get_question_links(soup)
        for link in question_links:
            # href already starts with '/', so do not add another slash
            question_url = 'https://stackoverflow.com' + link.get('href')
            # crawl all tags on this question page
            tag_crawler(question_url)
            
        current_page_number += 1


def print_welcome_msg():
    os.system('clear')
    print("\n\n")
    print("This script will crawl Stack Overflow pages to count the occurrences of tags.")
    print("It then prints the top 10 tags as an estimate of the most popular technologies on Stack Overflow.")
    print("Next we will ask you the number of pages to crawl.")
    print("Remember: the more pages, the better the accuracy, but the longer the script will run.")
    print('\n\nHow many pages would you like to crawl : ', end="")


def start():
    print_welcome_msg()
    max_page_count = int(input().strip())
    print('Starting now....')
    page_crawler(max_page_count)    
    sorted_tag_count_dict = sorted(tag_count_dict.items(), key=operator.itemgetter(1), reverse=True)[:10]
    print("")
    for tag_count in sorted_tag_count_dict:
        print("%15s : %d" %(tag_count[0], tag_count[1]))
    
start()
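The counting and sorting steps above can also be expressed with collections.Counter from the standard library. This is a minimal sketch rather than a replacement for the script: top_tags is a hypothetical helper, and it assumes the tags have already been collected into a list.

```python
from collections import Counter


def top_tags(tags, limit=10):
    """Return the `limit` most frequent tags as (tag, count) pairs."""
    return Counter(tags).most_common(limit)


if __name__ == "__main__":
    # Made-up sample data, only to show the output format.
    sample = ["python", "java", "python", "django", "python", "java"]
    for tag, count in top_tags(sample, limit=3):
        print("%15s : %d" % (tag, count))
```

Counter.most_common() sorts by count in descending order, which is what the sorted() call with operator.itemgetter(1) does in the script.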


Other Scripts:
Opening top 10 Google search results in one hit.
Formatting and validating JSON.
Crawling all emails from a site.

How to back up a database periodically on a PythonAnywhere server

You can host your Django app effortlessly on a PythonAnywhere server. If your app uses a database, it is strongly recommended to back it up regularly to avoid data loss.

This PythonAnywhere article [1] explains how to take an SQL dump. We will extend that process to take database backups periodically and delete old files.

As explained in the article, we can use the command below to take a backup.

mysqldump -u yourusername -h yourusername.mysql.pythonanywhere-services.com 'yourusername$dbname'  > db-backup.sql

Run this command from your home directory.

To take backups periodically, we will write a small script.



Script:

First, let's create a directory in the home directory where all the backup files will reside. Let's name it mysql_backups. Then, in the script, define variables such as the backup directory name, the file prefix and the file suffix format.

To tell the backups apart, we will append a timestamp in yyyymmddHHMMSS format to the file name. Build the file name and run the backup command using Python's os module.

import os
import datetime
from zipfile import ZipFile


BACKUP_DIR_NAME = "mysql_backups"
FILE_PREFIX = "my_db_backup_"
FILE_SUFFIX_DATE_FORMAT = "%Y%m%d%H%M%S"
USERNAME = "username"
DBNAME = USERNAME+"$dbname"


# get today's date and time
timestamp = datetime.datetime.now().strftime(FILE_SUFFIX_DATE_FORMAT)
backup_filename = BACKUP_DIR_NAME+"/"+FILE_PREFIX+timestamp+".sql"

os.system("mysqldump -u "+USERNAME+" -h "+USERNAME+".mysql.pythonanywhere-services.com '"+DBNAME+"'  > "+backup_filename)
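As an alternative to building one long shell string, the same dump could be run through subprocess with the arguments passed as a list, so the '$' in the database name needs no quoting. This is only a sketch under assumptions: build_backup_filename, build_dump_command and run_backup are hypothetical helpers, and it relies on mysqldump being available and on credentials being configured the same way as for the command above.

```python
import datetime
import subprocess


def build_backup_filename(backup_dir, prefix, now, date_format="%Y%m%d%H%M%S"):
    """Return e.g. mysql_backups/my_db_backup_20190102030405.sql."""
    return backup_dir + "/" + prefix + now.strftime(date_format) + ".sql"


def build_dump_command(username, dbname):
    """Return the mysqldump command as an argument list (no shell involved)."""
    host = username + ".mysql.pythonanywhere-services.com"
    return ["mysqldump", "-u", username, "-h", host, username + "$" + dbname]


def run_backup(username, dbname, backup_filename):
    """Run mysqldump, write its output to backup_filename, return the exit code."""
    with open(backup_filename, "w") as out:
        completed = subprocess.run(build_dump_command(username, dbname), stdout=out)
    return completed.returncode
```

A non-zero return value from run_backup means the dump failed, in which case the output file should not be treated as a valid backup.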


We can test this script by running it with the Python interpreter installed in the virtual environment. I strongly recommend using a virtual environment for all your Python and Django projects.

Now, to save some space, we will zip the SQL file. For this we will use Python's zipfile module.

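# creating zip file; ZIP_DEFLATED compresses the data (the default mode only stores it)
from zipfile import ZIP_DEFLATED
zip_filename = BACKUP_DIR_NAME+"/"+FILE_PREFIX+timestamp+".zip"
with ZipFile(zip_filename, 'w', ZIP_DEFLATED) as zip_file:
    zip_file.write(backup_filename, os.path.basename(backup_filename))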


Now that the SQL file has been added to the zip file, we can delete it.

os.remove(backup_filename)
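The zip and delete steps can be combined into one small function. The sketch below is an illustration under assumptions: zip_and_remove is a hypothetical helper, and it uses ZIP_DEFLATED so that the archive is actually compressed, which requires the zlib module that standard Python builds normally include.

```python
import os
from zipfile import ZipFile, ZIP_DEFLATED


def zip_and_remove(sql_path):
    """Compress sql_path into a .zip next to it, delete the original, return the zip path."""
    zip_path = os.path.splitext(sql_path)[0] + ".zip"
    with ZipFile(zip_path, "w", ZIP_DEFLATED) as archive:
        # store only the file name inside the archive, not the directory
        archive.write(sql_path, os.path.basename(sql_path))
    # reached only if writing the archive raised no exception
    os.remove(sql_path)
    return zip_path
```

Because os.remove() runs after the with block, the SQL file is kept if creating the archive fails.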


Now we can schedule this script from the Tasks tab. Next, let's extend the script to delete old files. Define a variable that specifies how many days to keep a backup.

DAYS_TO_KEEP_BACKUP = 3


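So for now we will keep the backups from the last three days. There are two ways to delete the old files. One is to use os.stat() to check when a file was last modified and delete it if that was more than DAYS_TO_KEEP_BACKUP days ago.

cutoff = time.time() - DAYS_TO_KEEP_BACKUP * 86400  # needs `import time`
for f in os.listdir(BACKUP_DIR_NAME):
    if os.stat(os.path.join(BACKUP_DIR_NAME, f)).st_mtime < cutoff:
        os.remove(os.path.join(BACKUP_DIR_NAME, f))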


Or we can compare the timestamp in each file name, as below.

# deleting old files

list_files = os.listdir(BACKUP_DIR_NAME)

back_date = datetime.datetime.now() - datetime.timedelta(days=DAYS_TO_KEEP_BACKUP)
back_date = back_date.strftime(FILE_SUFFIX_DATE_FORMAT)

length = len(FILE_PREFIX)

# deleting files older than DAYS_TO_KEEP_BACKUP days
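for f in list_files:
    parts = f.split(".")
    # only consider "<prefix><timestamp>.zip"; skip files without an extension
    if len(parts) == 2 and parts[1] == "zip" and parts[0].startswith(FILE_PREFIX):
        suffix = parts[0][length:]
        if suffix < back_date:
            print("Deleting file : "+f)
            os.remove(BACKUP_DIR_NAME + "/" + f)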


First get the list of all files in the backup directory. Then, for each file in the list, check whether the file extension is 'zip' and compare the timestamp suffix in the file name with the back date. Because the timestamp is zero-padded and ordered from year down to second, comparing the strings gives the same result as comparing the dates. You could also use the os.path.splitext() function to get the file name and extension; use whichever approach you prefer. Feel free to tweak the script and experiment.

>>> filename, file_extension = os.path.splitext('/path/to/somefile.ext')
>>> filename
'/path/to/somefile'
>>> file_extension
'.ext'
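The cleanup logic described above could also be written as a function that only decides which files are expired, leaving the actual deletion to the caller. This is a sketch, not the original script: find_expired_backups is a hypothetical helper, and it parses the timestamp with strptime() instead of comparing strings.

```python
import datetime
import os

FILE_PREFIX = "my_db_backup_"
FILE_SUFFIX_DATE_FORMAT = "%Y%m%d%H%M%S"


def find_expired_backups(filenames, now, days_to_keep):
    """Return the backup zip files whose timestamp is older than days_to_keep days."""
    cutoff = now - datetime.timedelta(days=days_to_keep)
    expired = []
    for name in filenames:
        stem, extension = os.path.splitext(name)
        if extension != ".zip" or not stem.startswith(FILE_PREFIX):
            continue  # not one of our backup files
        try:
            taken_at = datetime.datetime.strptime(
                stem[len(FILE_PREFIX):], FILE_SUFFIX_DATE_FORMAT
            )
        except ValueError:
            continue  # suffix is not a valid timestamp, so leave the file alone
        if taken_at < cutoff:
            expired.append(name)
    return expired
```

The caller would pass os.listdir(BACKUP_DIR_NAME) and datetime.datetime.now(), then call os.remove() on each returned name joined with the backup directory.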
 


Restoring database backup:

You can restore the database backup using the command below.

mysql -u yourusername -h yourusername.mysql.pythonanywhere-services.com 'yourusername$dbname' < db-backup.sql
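Since the script stores each backup as a zip file, the SQL file has to be extracted before it can be passed to the command above. A minimal sketch of that step follows; extract_backup is a hypothetical helper and it assumes each archive holds exactly one .sql file, as the backup script produces.

```python
import os
from zipfile import ZipFile


def extract_backup(zip_path, destination_dir):
    """Extract the single .sql file from a backup archive and return its path."""
    with ZipFile(zip_path) as archive:
        sql_names = [n for n in archive.namelist() if n.endswith(".sql")]
        if len(sql_names) != 1:
            raise ValueError("expected exactly one .sql file in " + zip_path)
        return archive.extract(sql_names[0], destination_dir)
```

The returned path can then be used in place of db-backup.sql in the restore command.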

 

The complete code of the above script is available on GitHub.


References:
[1] https://help.pythonanywhere.com/pages/MySQLBackupRestore/

Python Script 1: Convert ebooks from epub to mobi format

We are starting a series of Python scripts that can be used in daily life to automate mundane tasks and save some time.

This is the first article in this series. Recently I bought Amazon's ebook reader, the Kindle Paperwhite 3.

I purchased a few books from the Kindle store and downloaded most of the books in EPUB format. The Kindle doesn't support the EPUB format, so you need to convert them to either MOBI or AZW3 format.

Converting books one by one with an online tool is extremely time-consuming and frustrating, so I searched for a tool that could perform bulk conversion and found calibre. You can install it and use it to convert books from one format to another in bulk.

But if you prefer working in the terminal, read on. First install calibre. Use its interface if you want to, or use the script below to convert the books.

from os import listdir, rename
from os.path import isfile, join
import subprocess


# return name of file to be kept after conversion.
# we are just changing the extension. azw3 here.
def get_final_filename(f):
    f = f.split(".")
    filename = ".".join(f[0:-1])
    processed_file_name = filename+".azw3"
    return processed_file_name


# return file extension. pdf or epub or mobi
def get_file_extension(f):
    return f.split(".")[-1]


# list of extensions that need to be ignored.
ignored_extensions = ["pdf"]

# here all the downloaded files are kept
mypath = "/home/user/Downloads/ebooks/"

# path where converted files are stored
mypath_converted = "/home/user/Downloads/ebooks/kindle/"

# path where processed files will be moved to, clearing the downloaded folder
mypath_processed = "/home/user/Downloads/ebooks/processed/"

raw_files = [f for f in listdir(mypath) if isfile(join(mypath, f))]
converted_files =  [f for f in listdir(mypath_converted) if isfile(join(mypath_converted, f))]

for f in raw_files:
    final_file_name = get_final_filename(f)
    extension = get_file_extension(f)
    if final_file_name not in converted_files and extension not in ignored_extensions:
        print("Converting : "+f)
        try:
            # move the original only if ebook-convert reports success (exit code 0)
            return_code = subprocess.call(["ebook-convert", mypath+f, mypath_converted+final_file_name])
            if return_code == 0:
                rename(mypath+f, mypath_processed+f)
        except Exception as e:
            print(e)
    else:
        print("Skipping (already converted or ignored extension) : "+f)
 

I have a folder 'ebooks' which contains all the downloaded ebooks.

After files are converted to the required format, they are stored in the 'ebooks/kindle' directory and the original file is moved to the 'ebooks/processed' directory. Both directories must exist before the script runs.


Once calibre is installed, a command line utility 'ebook-convert' becomes available. It takes two command line arguments: the name of the file to be converted and the name of the output file. We call this utility from our program, passing the file names one by one.
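The two arguments can be derived from the source path with pathlib. The sketch below is only an illustration: build_convert_command is a hypothetical helper, and it passes nothing beyond the two arguments described above.

```python
from pathlib import Path


def build_convert_command(source, output_dir, target_extension=".azw3"):
    """Return the ebook-convert argument list for one source file."""
    source = Path(source)
    # keep the file name, swap only the last extension
    target = Path(output_dir) / (source.stem + target_extension)
    return ["ebook-convert", str(source), str(target)]


if __name__ == "__main__":
    # Example paths taken from the script above.
    print(build_convert_command(
        "/home/user/Downloads/ebooks/some.book.epub",
        "/home/user/Downloads/ebooks/kindle",
    ))
```

The resulting list can be passed directly to subprocess.call(), as in the script.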


For now we ignore PDF files, as they take a lot of time to convert and require some settings changes in calibre.

I will leave that up to you.


© pythoncircle.com 2018-2019