Scrapy is a cool Python project that makes it easy to write web scraping bots for extracting structured information from ordinary web pages. You can use it to create an API for a site that doesn’t have one, perform periodic data exports, and so on. In this article, we’re going to make a scraper that retrieves the newest articles on Hacker News. The site already has an API, and there is even a wrapper for it, but the goal of this post is to show you how to get the data by scraping it.
Once we have the links and the article titles, we will extract the text of each article using Goose, then run sentiment analysis on it to figure out whether it’s positive or negative in tone. For this last piece, we’ll make use of the awesome TextBlob library.
A note about screen scraping
The owners of some sites do not want you to crawl their data and use it for your own purposes. You should respect their wishes. Many sites state whether or not they are open to being crawled in a robots.txt file. Hacker News’ robots.txt indicates that they are open to all crawlers, but that crawlers should wait at least 30 seconds between requests. That’s what we will do.
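To make the robots.txt check concrete, here is a minimal sketch that uses Python's standard-library urllib.robotparser to read the rules and the crawl delay. The robots.txt text, the example.com URLs, and the bot name are all made up for illustration; they are not copied from Hacker News or any real site.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (an assumption for this sketch,
# not a copy of any real site's file).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 30
"""

# Hypothetical bot name; the sample rules apply to every user agent.
BOT_NAME = "example-bot"


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without using the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

# Ask whether our bot may fetch a given URL under these rules.
newest_allowed = rules.can_fetch(BOT_NAME, "https://example.com/newest")
private_allowed = rules.can_fetch(BOT_NAME, "https://example.com/private/page")

# Number of seconds the site asks crawlers to wait between requests
# (None if the file does not specify a Crawl-delay).
delay = rules.crawl_delay(BOT_NAME)

print("newest allowed:", newest_allowed)
print("private allowed:", private_allowed)
print("crawl delay:", delay)
```

Against a live site you would instead call set_url() with the site's robots.txt address and then read() to download it, and sleep for the reported delay between requests. Scrapy also has its own settings for this, ROBOTSTXT_OBEY and DOWNLOAD_DELAY, so a hand-rolled check like this one is only needed outside of Scrapy.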
We will use pip to install all the libraries that we need.