Scraping data of the 2019 Indian General Election using Python Requests and BeautifulSoup and analyzing it


The results of the 2019 Indian General Election came out on 23rd May 2019 and can be viewed on the official website of the Election Commission of India.

In this article, we scrape the data for each constituency and dump it into a JSON file for further analysis.

The EC website has data for each constituency. To get the stats for a particular constituency, we need to first select the state from a dropdown and then the constituency from another dropdown, which is populated based on the state selected. There are 543 constituencies, so if we were to collect the data manually, we would have to select values from the dropdowns 543 times. Nobody has time for that.



That is when Python comes into the picture.


Creating the URL:

The home page URL for constituency-wise results is http://results.eci.gov.in/pc/en/constituencywise/ConstituencywiseU011.htm

When you change the dropdown value to another state or another constituency, the web page reloads and the URL is updated. Try changing the state and constituency a few times and you will find a pattern in the webpage URL.

The URL is built using the below logic:

Base URL, i.e. http://results.eci.gov.in/pc/en/constituencywise/Constituencywise, + the character U for union territories or S for states + the state code (01 to 29) + the constituency code (starting from 1) + .htm?ac= + the constituency code again


So, for example, if you want the data of the Muzaffarnagar constituency in the state of Uttar Pradesh, the URL becomes http://results.eci.gov.in/pc/en/constituencywise/ConstituencywiseS243.htm?ac=3, where S means state (not union territory), 24 is the code for Uttar Pradesh and 3 is the code for the Muzaffarnagar constituency.
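To make the pattern concrete, here is the same Muzaffarnagar example expressed in Python (the variable names are mine, not from the site):

base_url = "http://results.eci.gov.in/pc/en/constituencywise/Constituencywise"
state_code = "S24"       # Uttar Pradesh
constituency_code = 3    # Muzaffarnagar
url = base_url + state_code + str(constituency_code) + ".htm?ac=" + str(constituency_code)
print(url)  # http://results.eci.gov.in/pc/en/constituencywise/ConstituencywiseS243.htm?ac=3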


Also, you will notice that if you pass a wrong state or constituency code, a 404 status is returned.
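This gives us a simple stopping condition. For instance (assuming constituency code 99 does not exist for state S29):

import requests

response = requests.get("http://results.eci.gov.in/pc/en/constituencywise/ConstituencywiseS2999.htm?ac=99")
print(response.status_code)  # 404 for an invalid state/constituency combination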


Checking the response:

Press F12 in your Chrome browser to open the Chrome DevTools. Select the Network tab and check the 'Preserve log' checkbox. When you refresh the page or change the constituency, a new GET request is sent to the server.



The response of this request is plain HTML. To parse the response and fetch the required data, inspect the webpage. You will see that the results are enclosed inside a tbody tag. But there are multiple tbody tags in the HTML source code and none of them has a unique attribute like an id or a class.

Hence we will locate this tbody tag by its index from the top and then fetch the data from it.


Code time:

Let's start writing code to fetch the data.

First, create a directory inside which we will place our files. Change the working directory to the newly created directory and create a virtual environment using Python 3. Using a virtual environment while working with Python is recommended.
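A typical sequence of commands for this (a sketch, assuming a Linux/macOS shell; the directory name is my choice):

mkdir election_scraper
cd election_scraper
python3 -m venv venv
source venv/bin/activate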

Activate the virtual environment and install the dependencies from the requirements.txt file using the below command.

pip install -r requirements.txt
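The article does not list the contents of requirements.txt. Based on the imports used below, it would contain something like:

requests
beautifulsoup4
lxml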


Create a new Python file named election_data_collection.py.

Import the required packages:

import requests
from bs4 import BeautifulSoup
import json
from lxml import html
import time


Create a list of state codes: U01 to U07 for union territories and S01 to S29 for states.

states = ["U0" + str(u) for u in range(1, 8)] + ["S0" + str(s) if s < 10 else "S" + str(s) for s in range(1, 30)]


Every state has a different number of constituencies, so we will loop up to the maximum number of constituencies any state has; for now, this number is less than 100. We terminate the loop for a state as soon as we get a 404 for any of its constituencies.

base_url = "http://results.eci.gov.in/pc/en/constituencywise/Constituencywise"

results = list()

for state in states:
    for constituency_code in range(1, 99):
        url = base_url + state + str(constituency_code) + ".htm?ac=" + str(constituency_code)
        # print("URL", url)
        response = requests.get(url)


Parse the response and fetch the tbody tag inside which all the result data is enclosed.

        response_text = response.text

        soup = BeautifulSoup(response_text, 'lxml')
        tbodies = list(soup.find_all("tbody"))
        # the 11th tbody from the top is the table we need to parse
        tbody = tbodies[10]


For each row in the tbody, collect the candidate's name and vote counts into a dictionary. Repeat this for each constituency, as sketched below.
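Here is a minimal sketch of the extraction for a regular candidate row (the full code below also handles the extra "migrant votes" column for Jammu & Kashmir and the final "Total" row):

# inside the row loop of the full code: cells in a candidate row are
# serial number, name, party, EVM votes, postal votes, total votes, vote share
tds = list(tr.find_all('td'))
candidate = dict()
candidate["candidate_name"] = tds[1].text.strip().lower()
candidate["party_name"] = tds[2].text.strip().lower()
candidate["evm_votes"] = int(tds[3].text.strip())
candidate["post_votes"] = int(tds[4].text.strip())
candidate["total_votes"] = int(tds[5].text.strip())
candidate["share"] = float(tds[6].text.strip())
seat["candidates"].append(candidate)

Once all the data is collected, dump it into a JSON file to analyze further.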

# write data to file; open in write mode so re-runs do not append a second JSON document
with open("election_data.json", "w") as f:
    f.write(json.dumps(results, indent=2))


Important Note:

Please do not flood the server with too many simultaneous requests. Always send one request at a time and put some sleep between requests. I have added a 0.5 second sleep between each request.


Full Code:

"""
Author: Anurag Rana
Site: https://www.pythoncircle.com
"""

import requests
from bs4 import BeautifulSoup
import json
from lxml import html
import time

NOT_FOUND = 404
states = ["U0" + str(u) for u in range(1, 8)] + ["S0" + str(s) if s < 10 else "S" + str(s) for s in range(1, 30)]

# base_url = "http://results.eci.gov.in/pc/en/constituencywise/ConstituencywiseS2910.htm?ac=10"
base_url = "http://results.eci.gov.in/pc/en/constituencywise/Constituencywise"

results = list()

for state in states:
for constituency_code in range(1, 99):
url = base_url + state + str(constituency_code) + ".htm?ac=" + str(constituency_code)
# print("URL", url)
response = requests.get(url)

# if some state and constituency combination do not exists, 404, continue for next state
if NOT_FOUND == response.status_code:
break

response_text = response.text

soup = BeautifulSoup(response_text, 'lxml')
tbodies = list(soup.find_all("tbody"))
# 11the tbody from top is the table we need to parse
tbody = tbodies[10]
trs = list(tbody.find_all('tr'))

# all data for a constituency seat is stored in this dictionary
seat = dict()
seat["candidates"] = list()
for tr_index, tr in enumerate(trs):
# row at index 0 contains name of constituency
if tr_index == 0:
state_and_constituency = tr.find('th').text.strip().split("-")
seat["state"] = state_and_constituency[0].strip().lower()
seat["constituency"] = state_and_constituency[1].strip().lower()
continue

# first and second rows contains headers, ignore
if tr_index in [1, 2]:
continue

# for rest of the rows, get data
tds = list(tr.find_all('td'))

candidate = dict()

# if this is last row get total votes for this seat
if tds[1].text.strip().lower() == "total":
seat["evm_total"] = int(tds[3].text.strip())
if "jammu & kashmir" == seat["state"]:
seat["migrant_total"] = int(tds[4].text.strip())
seat["post_total"] = int(tds[5].text.strip())
seat["total"] = int(tds[6].text.strip())
else:
seat["post_total"] = int(tds[4].text.strip())
seat["total"] = int(tds[5].text.strip())
continue
else:
candidate["candidate_name"] = tds[1].text.strip().lower()
candidate["party_name"] = tds[2].text.strip().lower()
candidate["evm_votes"] = int(tds[3].text.strip().lower())
if "jammu & kashmir" == seat["state"]:
candidate["migrant_votes"] = int(tds[4].text.strip().lower())
candidate["post_votes"] = int(tds[5].text.strip().lower())
candidate["total_votes"] = int(tds[6].text.strip().lower())
candidate["share"] = float(tds[7].text.strip().lower())
else:
candidate["post_votes"] = int(tds[4].text.strip().lower())
candidate["total_votes"] = int(tds[5].text.strip().lower())
candidate["share"] = float(tds[6].text.strip().lower())

seat["candidates"].append(candidate)

# print(json.dumps(seat, indent=2))
results.append(seat)
print("Collected data for", seat["state"], state, seat["constituency"], constituency_code, len(results))
# Do not send too many hits to server. be gentle. wait.
time.sleep(0.5)

# write data to file
with open("election_data.json", "a+") as f:
f.write(json.dumps(results, indent=2))


The code is available on GitHub.


Running the code:

Make sure the virtual environment is active before running the code.

# to collect the data
(venv) $ python election_data_collection.py

# to analyze the data
(venv) $ python election_data_analysis.py


It took me around 7 minutes to collect the data for all 543 constituencies with a 0.5 second delay between requests (the sleeps alone account for roughly 4.5 minutes; the rest is request and parsing time).



Analyzing the data:

I have created a few sample functions to analyze the data, which are in a separate file, election_data_analysis.py.

Since we already have the data in the JSON file, we can use it instead of scraping the site again and again.

To analyze the data, load the content of the election_data.json file into a Python object.


import json

with open("election_data.json", "r") as f:
    data = f.read()

data = json.loads(data)
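A quick sanity check on the loaded data (the expected counts follow from the scraping above):

print(len(data))  # 543, one entry per constituency
print(data[0]["state"], data[0]["constituency"], len(data[0]["candidates"]))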


To get the highest number of votes any candidate got, use the below function.


def candidate_highest_votes():
    highest_votes = 0
    candidate_name = None
    constituency_name = None
    for constituency in data:
        for candidate in constituency["candidates"]:
            if candidate["total_votes"] > highest_votes:
                candidate_name = candidate["candidate_name"]
                constituency_name = constituency["constituency"]
                highest_votes = candidate["total_votes"]

    print("Highest votes:", candidate_name, "from", constituency_name, "got", highest_votes, "votes")


To get the number of NOTA votes, use the below function.


def nota_votes():
    # NOTA appears in the results table as a candidate named "nota"
    nota_votes_count = sum(
        [candidate["total_votes"] for constituency in data for candidate in constituency["candidates"]
         if candidate["candidate_name"] == "nota"])
    print("NOTA votes cast:", nota_votes_count)


The code is available on GitHub. Feel free to fork it, rewrite it or optimize it.

You may add new features, more analysis functions, or visualizations using graphs, pie charts or bar diagrams.
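For example, a party-wise seat count could be added along these lines (a sketch assuming the JSON structure produced above; party_wise_seats is my name, not part of the original script):

def party_wise_seats():
    # the winner of each seat is the candidate with the highest total votes
    seats_won = dict()
    for constituency in data:
        winner = max(constituency["candidates"], key=lambda c: c["total_votes"])
        seats_won[winner["party_name"]] = seats_won.get(winner["party_name"], 0) + 1
    # print parties sorted by seats won, highest first
    for party, seats in sorted(seats_won.items(), key=lambda item: item[1], reverse=True):
        print(party, seats)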


