Web scraping, article extraction and sentiment analysis with Scrapy, Goose and TextBlob

Scrapy is a cool Python project that makes it easy to write web scraping bots that extract structured information from normal web pages. You can use it to create an API for a site that doesn’t have one, perform periodic data exports, etc. In this article, we’re going to make a scraper that retrieves the newest articles on Hacker News. The site already has an API, and there is even a wrapper for it, but the goal of this post is to show you how to get the data by scraping it.

Once we have gotten the links and the article titles, we will extract the article itself using Goose, then do sentiment analysis on the article to figure out if it’s positive or negative in tone. For this last piece, we’ll make use of the awesome TextBlob library.

A note about screen scraping

The owners of some sites do not want you to crawl their data and use it for your own purposes. You should respect their wishes. Many sites state whether or not they are open to being crawled in robots.txt. Hacker News’ robots.txt indicates that the site is open to all crawlers, but that crawlers should wait at least 30 seconds between requests. That’s what we will do.
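As a rough illustration of reading a crawl policy programmatically, here is a small sketch using Python's standard urllib.robotparser module. The sample rules and the /private path are made up for the example; they are not a copy of Hacker News' actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, modelled on the policy described above.
SAMPLE_ROBOTS_TXT = """\
User-Agent: *
Crawl-delay: 30
Disallow: /private
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Is a generic crawler allowed to fetch the "newest" page?
allowed = parser.can_fetch("*", "https://news.ycombinator.com/newest")

# How many seconds should it wait between requests?
delay = parser.crawl_delay("*")

print(allowed, delay)
```

Against a live site you would point the parser at the real file with set_url() and read() instead of parsing an inline string.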

Installing dependencies

We will use pip to install all the libraries that we need.

Generating a Scrapy project

Scrapy comes with a command line tool that makes it easy to generate project scaffolding. To generate the initial layout and files for our Hacker News project, go to the directory where you want to store the project and run the scrapy startproject command followed by the project name.

Let’s use the tree command to see what has been generated:

The best way to get to grips with the project structure is to dive right in and add code to these files, so we’ll go ahead and do that.

Creating a Hacker News item

In Scrapy, the output of a crawl is a set of items which have been scraped from a website, processed by a pipeline and output into a certain format. The starting point of any Scrapy project is defining what the items are supposed to look like. We’ll make a new item called HackerNewsItem that represents a post on Hacker News with some associated data.

Open items.py and change it to this:

Each Scrapy item is a class that inherits from scrapy.Item. Each field on the item is an instance of scrapy.Field. HackerNewsItem has four fields:

  • link_title is the title of the post on Hacker News
  • url is the URL pointed to by the post
  • sentiment is a sentiment polarity score from -1.0 to 1.0
  • text is the text of the article extracted from the page pointed to by the URL

Scrapy fields carry no type information, so we do not have to declare the type of each one, as we would in a Django model, for instance.

Making a spider

The next step is to define a spider that starts crawling Hacker News from the front page and follows the “More” links at the bottom of the page down to a given depth.

In the spiders subdirectory of your project, make a file called hackernews_spider.py and change its contents to this:

Our spider inherits from scrapy.contrib.spiders.CrawlSpider (available as scrapy.spiders.CrawlSpider in Scrapy 1.0 and later, where the scrapy.contrib package is deprecated). This type of spider follows links extracted from previously crawled pages in the same manner as a normal web crawler. As you can see, we have defined several attributes at the top of the class.

The name attribute specifies the name used for the crawler when using the command line tool to start a crawl. To start a crawl with our crawler now, run the scrapy crawl command followed by that name.

The allowed_domains list holds domains that the crawler will pay attention to. Any links to domains that are not in this list will be ignored.

start_urls is a list of URLs to start the crawl at. In our case, only one is necessary: the Hacker News homepage.

The real meat of the implementation is in the rules variable and the parse_item method. rules is an iterable that contains scrapy.contrib.spiders.Rule instances. We only have one rule, which indicates that our spider should follow links that match the "news.ycombinator.com/newest" regex and pass the pages behind those links to the handler defined by the parse_item method. follow=True tells the spider that it should recursively follow links found in those pages.

parse_item uses XPath to grab a list of the articles on the page. Each article is in a table row element with a class of “athing”. We can grab a sequence of the matching elements as follows:

Then, we iterate over that sequence and, for each article, we grab the title and the URL. For now, we’re not going to fill in the article text or the sentiment. These will be filled in as part of our item processing pipeline.

What we have so far is a complete spider. If you’ve been paying attention, you might realize that it will just keep following the “More” links at the bottom of the page until it runs out of pages. We definitely don’t want that. We only want to scrape the top few pages. Let’s sort that out. Open settings.py and add a line that says:

With the depth limit set to 10, we will only crawl the first ten pages. While you’re in there, you can also specify a download delay so that our spider respects Hacker News’ robots.txt.
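The two settings discussed here could look like this in settings.py. DEPTH_LIMIT and DOWNLOAD_DELAY are standard Scrapy settings; the third line is an addition that the text does not mention, included because Scrapy randomizes the delay by default (between 0.5 and 1.5 times the configured value), which could otherwise drop the wait below 30 seconds.

```python
# settings.py (fragment)

# Do not follow links more than 10 hops away from the start URL.
DEPTH_LIMIT = 10

# Wait 30 seconds between requests to the same site.
DOWNLOAD_DELAY = 30

# Not mentioned in the text: keep the delay fixed rather than randomized,
# so that every wait is at least the full 30 seconds.
RANDOMIZE_DOWNLOAD_DELAY = False
```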

Now you can run the spider and write the scraped items to a JSON file by passing the -o option with an output filename to the scrapy crawl command.

The item pipeline

So far, we’re not extracting the text of any articles and we are not doing any sentiment analysis. First though, let’s add a filter to the item pipeline so that “self posts” – posts that are just somebody asking a question instead of linking to something – are filtered out.

Open pipelines.py and add this code:

Scrapy pipeline objects must implement a process_item method and return the item or raise an exception. Links on Hacker News that point to self posts match the regex "item\?id=[0-9]+". If the URL of the item matches the regex, we raise a DropItem exception. That, obviously enough, causes the item to be dropped from the eventual output of the spider.

Components are not registered with the item pipeline until you add them to the ITEM_PIPELINES list in the project’s settings.py. Add this code to register the DropSelfPostsPipeline component.

Extracting articles with Goose

Goose is a library that allows you to extract the main article text from any web page. It is not perfect, but it is the best solution currently available in Python. Here is an example of the basic usage:

We can add a new pipeline component to extract the article text from the scraped link and save it in the text field in the item. Open pipelines.py again and add this code:

The code here is pretty much the same as the minimal example, except we have wrapped the goose.extract call in a try-except block. There is a bug in the Goose library that can cause it to throw an IndexError when parsing the titles of certain web pages. There is an open pull request on the GitHub repository that fixes this, but it hasn’t been merged yet. We’ll just work around it for the moment.
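One way such a component might be written is sketched below. Several things here are assumptions: the class name ExtractArticlePipeline is hypothetical, the import targets the goose3 package (the original python-goose uses "from goose import Goose"), and the choice to keep the item with empty text when the IndexError occurs is a guess, since the text does not say what the workaround does. The optional extractor argument is an addition that lets the component be exercised without the library or a network connection.

```python
class ExtractArticlePipeline:
    """Fill in item["text"] with the main article text from item["url"]."""

    def __init__(self, extractor=None):
        # An extractor can be injected for testing; otherwise a Goose
        # instance is created lazily on first use.
        self._extractor = extractor

    def _get_extractor(self):
        if self._extractor is None:
            # Assumed package name; python-goose uses "from goose import Goose".
            from goose3 import Goose

            self._extractor = Goose()
        return self._extractor

    def process_item(self, item, spider):
        try:
            article = self._get_extractor().extract(url=item["url"])
            item["text"] = article.cleaned_text
        except IndexError:
            # Work around the title-parsing bug: keep the item, without text.
            item["text"] = ""
        return item
```

Scrapy instantiates pipeline components without arguments, so in a real crawl the lazily created Goose instance is what gets used.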

Add this component to the item pipeline by changing the ITEM_PIPELINES list in settings.py.

Analyzing sentiment with TextBlob

TextBlob is a library that makes it easy to do standard natural language processing tasks in Python. Check out this sentiment analysis example to see just how easy it is:

Let’s add another item pipeline component to fill in the sentiment field on our items:
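A possible shape for this component is sketched below. TextBlob(text).sentiment.polarity is the library's polarity score, a float between -1.0 and 1.0. The helper name textblob_polarity and the scorer argument are additions for illustration; they make it possible to try the component without TextBlob installed.

```python
def textblob_polarity(text):
    """Return TextBlob's polarity score for text, between -1.0 and 1.0."""
    # Imported lazily so this module can be loaded without TextBlob.
    from textblob import TextBlob

    return TextBlob(text).sentiment.polarity


class SentimentPipeline:
    """Fill in item["sentiment"] from the extracted article text."""

    def __init__(self, scorer=textblob_polarity):
        # The scorer is any function mapping a string to a polarity score.
        self._scorer = scorer

    def process_item(self, item, spider):
        # Treat a missing or empty text field as an empty string.
        text = item.get("text") or ""
        item["sentiment"] = self._scorer(text)
        return item
```

Because this component reads the text field, it has to run after the article extraction step in the pipeline.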

As before, we must add the SentimentPipeline component to the ITEM_PIPELINES list, which now looks like this:
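A sketch of the full setting follows. Recent Scrapy versions expect ITEM_PIPELINES to be a dictionary that maps each component's import path to an order value from 0 to 1000, with lower values running first, rather than a plain list. The hackernews.pipelines module path and the ExtractArticlePipeline class name are assumptions; substitute your own project and class names.

```python
# settings.py (fragment)

# Components run in ascending order of their value: filter out self posts,
# then extract the article text, then score its sentiment.
ITEM_PIPELINES = {
    "hackernews.pipelines.DropSelfPostsPipeline": 100,
    "hackernews.pipelines.ExtractArticlePipeline": 200,  # hypothetical name
    "hackernews.pipelines.SentimentPipeline": 300,
}
```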

Now the crawler is complete. Run it using the same command we used before. When it is finished, you will have a file of Hacker News articles, annotated with the sentiment.

There is much more to Scrapy than we have looked at in this article. One of its most interesting features is the ability to integrate with Django by directly populating Django models with scraped data. There are also lots of options for outputting data and managing how the spider behaves, such as scraping using Firefox (for JavaScript-heavy sites) and exporting with FTP and Amazon S3. Check out the Scrapy docs for more.