Python Script 10: Collecting one million website links

I needed a collection of different website links to experiment with a Docker cluster, so I wrote this small script to collect one million website URLs.

The code is available on GitHub too.


Running the script:

Either create a new virtual environment using python3 or use an existing one on your system.
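For example, on Linux a fresh environment can be created with:

python3 -m venv venv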

Install the dependencies.

pip install requests beautifulsoup4 lxml

(The lxml package is needed because the script uses BeautifulSoup's lxml parser.)


Activate the virtual environment and run the code.

python one_million_websites.py


Code:

import requests
from bs4 import BeautifulSoup
import time


# Browser-like headers so the requests are not rejected as coming from a bot
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/64.0.3282.167 Chrome/64.0.3282.167 Safari/537.36"
}

site_link_count = 0

# Fetch each of the 200 listing pages in turn
for i in range(1, 201):
    url = "http://websitelists.in/website-list-" + str(i) + ".html"
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        # Log the failing page and move on to the next one
        print(url + " returned " + str(response.status_code))
        continue

    soup = BeautifulSoup(response.text, "lxml")
    # Each website link is an anchor inside a <td class="web_width"> cell
    sites = soup.find_all("td", {"class": "web_width"})

    links = ""
    for site in sites:
        link = site.find("a")["href"]
        links += link + "\n"
        site_link_count += 1

    # Append this page's links to the output file
    with open("one_million_websites.txt", "a") as f:
        f.write(links)

    print(str(site_link_count) + " links found")

    # Be polite to the server and avoid HTTP 429 (Too Many Requests)
    time.sleep(1)


We are scraping links from the site http://www.websitelists.in/. If you inspect the webpage, you can see an anchor tag inside a td tag with the class web_width.

We convert the page response into a BeautifulSoup object, find all such elements, and extract the href value from each.
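To see the extraction logic in isolation, here is a minimal sketch run against a made-up HTML fragment mirroring that structure (the fragment is for illustration only, not taken from the actual site):

from bs4 import BeautifulSoup

sample_html = """
<table>
  <tr><td class="web_width"><a href="http://example.com/">example.com</a></td></tr>
  <tr><td class="web_width"><a href="http://example.org/">example.org</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "lxml")
for td in soup.find_all("td", {"class": "web_width"}):
    # Pull the href attribute out of the anchor inside each cell
    print(td.find("a")["href"])

# Output:
# http://example.com/
# http://example.org/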



There is already a natural delay of more than a second between consecutive requests, which is slow but easy on the server. I still introduced an explicit one-second delay to avoid HTTP 429 (Too Many Requests) responses.
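If the server did start returning 429s despite the delay, one option would be a small retry helper with backoff. This is not part of the original script, just a sketch of how it could look:

import time
import requests

def get_with_retry(url, headers, retries=3, backoff=2):
    # Retry on HTTP 429, waiting longer before each attempt
    for attempt in range(retries):
        response = requests.get(url, headers=headers)
        if response.status_code != 429:
            return response
        # Honour the Retry-After header if present (assuming it holds
        # seconds), otherwise back off exponentially: 1s, 2s, 4s, ...
        wait = int(response.headers.get("Retry-After", backoff ** attempt))
        time.sleep(wait)
    return response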

The scraped links are appended to a text file in the same directory.
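A quick sanity check once the run finishes (or while it is still going) is to count the lines in the output file:

with open("one_million_websites.txt") as f:
    print(sum(1 for _ in f), "links collected so far")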




Featured Image Source: http://ehacking.net/

