1000 Most Active Blogs On Hashnode!

Hi there 👋, welcome to my world. Not every person who has created an account on Hashnode has actually started a blog!

Yes, that's why when you write your first blog using this beautiful tool, you receive the self-starter badge.

🔸 Hashnode

We all know that Hashnode is a blogging community for developers and people in tech that lets you publish articles on your own domain and get connected to fellow devs instantly!

🔸 Total Powered Blogs

So I wanted to find the total number of blogs that are actually active and contributing to the community. My first thought was to scrape the trending blogs, but that list is limited to 50.

I am still figuring out a more viable way.

From this tweet by Syed Fazle Rahman, we can confidently say that there are over 100K blogs on Hashnode, including custom-mapped ones.

Thrilled to share some @hashnode YoY numbers from a very successful 2021⚡️

Blogs: 17K➡️100K (+488%)
Articles: 18K➡️73K (+405%)
Reads: 5M➡️31M (+600%)
Hackathons: 0➡️4
Team: 5➡️19
Funding: 8.7M
Twitter: 43K+
Discord: 6K+

Thank you to our amazing community. 💙

— Syed Fazle Rahman (@fazlerocks) January 10, 2022

🔸 My challenge

The challenge was to scrape every page while the browser scrolls automatically, which I managed to achieve with automation.

I used Selenium to scrape the Hashnode community feed, which loads blogs through JavaScript-powered infinite scrolling.

#Ad
ScraperAPI is a web scraping API tool that works perfectly with Python, Node, Ruby, and other popular programming languages.

ScraperAPI handles billions of web scraping API requests, and if you use this link to get a product from them, I earn something.
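For context, here is a minimal sketch (not part of this article's scraper) of what a ScraperAPI call typically looks like from Python. Treat the endpoint and parameters as assumptions based on their docs, and YOUR_API_KEY as a placeholder.

import requests

# Minimal sketch: proxy a request through ScraperAPI (endpoint/params assumed from their docs)
payload = {
    "api_key": "YOUR_API_KEY",   # placeholder, get your own key from ScraperAPI
    "url": "https://hashnode.com/community",
}

response = requests.get("http://api.scraperapi.com/", params=payload)
print(response.status_code)
print(response.text[:500])  # preview the returned HTML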

Let's head to coding 🚀🚀

🔸 The Code


import pandas as pd
import time
from selenium.webdriver.common.by import By
from selenium import webdriver

Initialise the driver & target URL

driver = webdriver.Chrome()
url = "https://hashnode.com/community"

Some timing helpers to handle waits for requests and any dynamic rendering

def web_wait_time():
    return driver.implicitly_wait(5)

def web_sleep_time():
    return time.sleep(10)

Set the implicit wait and load the page with an HTTP GET request

web_wait_time()
driver.get(url)

Web Element locator function. Got the class from inspecting the site.

blogs = []

def all_blogs():
    data = driver.find_elements(By.CLASS_NAME, "css-2wkyxu")

    # Loop through the child elements of each blog card
    for data_elements in data:
        # "a:first-child" is a CSS selector, so use By.CSS_SELECTOR (By.TAG_NAME only takes a bare tag name)
        blog_data = data_elements.find_elements(By.CSS_SELECTOR, "a:first-child")

        for blog in blog_data:
            blogs.append(blog.text)

Determining the total length of the feed rendered from the server if we scrolled all the way down.

lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# Note: lenOfPage is the document's scroll height in pixels, not a page count
# print(lenOfPage) -> 7000 +

This is the main loop I used to keep scrolling and then call my scraper and locator functions to do the job.

So the logic is to start from the current page (default = 1) and auto-scroll in a loop until the actual length is met, which is CPU-intensive, besides risking being flagged by the server for too many requests.

So I decided to do only 50 pages, but you can go all in >>>

match = 1
while match < 50:

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    web_sleep_time()
    all_blogs()
    web_sleep_time()

    print(f"[+] -- We are on page {match}  out of {lenOfPage} pages of blog! --")
    match += 1

Matching blogs with their respective info (title & URL) in a list of tuples: consecutive entries in the blogs list are paired up in twos.

blog_length = len(blogs)
blog_list = []

for i in range(0, blog_length-1, 2):
    blog_info = blogs[i], blogs[i+1]
    blog_list.append(blog_info)


blog_list = list(set(blog_list))
# print(blog_list)

Finally, let's dump this data into a CSV file

blogs_dataframe = pd.DataFrame(blog_list, columns=["Blog Title", "Blog URL"])
blogs_dataframe.to_csv("all_blogs.csv")

driver.quit()

🔸 The Output


If you want to fix the minor Windows OS errors (assuming you are on Windows), please read my related article here.

🔸 1000 Blogs

When we run the code above, the final output CSV file contains all the scraped blogs. To stop at exactly 1000 blogs, you need logic that checks whether the blog list has reached 1000 entries and, if so, breaks the while loop (see the sketch after the edit note below).

The concept of 1000 blogs comes from the fact that the web feed loads recent posts from the server, much like the former Twitter timeline algorithm, hence the 'active' blogs in my title.

Note: The blog list is not in any particular order.

My CSV file output has over 100 blogs, and by 'active' I mean blogs with recent posts. So even if you are new here and have only written your first article by the time of writing this, consider yourself active 🙂.

Edit while match < 50: to while match < lenOfPage: to get all the active blogs.
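To make that concrete, here is a hedged sketch (not the exact code I ran) of the scroll loop with a 1000-blog cap bolted on. It reuses the driver, blogs, web_sleep_time() and all_blogs() pieces from above, and the in-loop de-duplication is my own assumption.

match = 1
target = 1000  # stop once we have this many unique blogs

while match < lenOfPage:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    web_sleep_time()
    all_blogs()
    web_sleep_time()

    # Pair titles with URLs and de-duplicate, mirroring the post-processing above
    unique_blogs = set(zip(blogs[0::2], blogs[1::2]))
    print(f"[+] -- Scroll {match}: {len(unique_blogs)} unique blogs collected --")

    if len(unique_blogs) >= target:
        break

    match += 1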

The "1000-blog-list" is my challenge to you. Hope You get the gist.>>>

🔸 Summary

I successfully got the latest blogs in terms of contributions, and if you let the while loop run against the real number of pages, I believe you would get all blogs that have at least one post.

I am not sure whether what I have done is against @Hashnode's terms; someone can reach out to me, but I believe the information scraped is OSINT and freely available to the public.

Check GitHub Repo

🔸 Conclusion

Once again, hope you learned some automation & scraping today. Let me know if you have scraping gigs or work.

Please consider subscribing or following me for related content, especially about Tech, Python & General Programming.

You can show extra love by buying me a coffee to support this free content. I am also open to partnerships, technical writing, collaborations, and Python-related training or roles.

Buy Ronnie A Coffee 📢 You can also follow me on Twitter ♥ ♥ Waiting for you! 🙂