Text to speech with Python 3 on Linux and OSX

Recently I had a requirement to synthesise speech from text on two different operating systems. Here is what I came  up with.

OSX

Synthesising speech is a simple matter for OSX users because the operating system comes with the say  command. We can use subprocess  to call it.
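
A minimal sketch of that call might look like this (the text is just a placeholder):

```python
import subprocess


def say(text):
    # Invoke the built-in OSX "say" command to speak the given text aloud.
    subprocess.call(['say', text])


say('Hello from OSX')
```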

Linux

On Linux, there are a few different options. I like to use the espeak Python bindings when I can. You can install them on Ubuntu using apt-get.
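
Something like the following should work, though the exact package name (python-espeak here) can vary between Ubuntu releases:

```bash
sudo apt-get install espeak python-espeak
```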

Then use it like so:
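
A sketch of the basic usage, assuming the python-espeak bindings expose a synth function roughly like this:

```python
from espeak import espeak

# synth() queues the text for synthesis and returns immediately.
espeak.synth('Hello from Linux')
```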

espeak supports multiple languages, so if you are not dealing with English text, you need to pass in the language code. Unfortunately, it looks like the Python bindings don't support that yet, but we can still use subprocess like we did on OSX.
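
For example, a minimal sketch that shells out to the espeak binary and passes the voice code with the -v flag:

```python
import subprocess


def speak(text, language='en'):
    # Call the espeak binary directly so we can pass a language/voice code,
    # e.g. 'en', 'de' or 'fr'.
    subprocess.call(['espeak', '-v', language, text])


speak('Guten Tag', language='de')
```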

The list of available languages can be found on the espeak website.

How to write a Reddit bot in Python

Something I have seen a lot of interest in is writing bots to interact with Reddit and provide useful services to the community. In this post, I’m going to show you how to build one.

Introducing BitesizeNewsBot

The bot we’re going to write is called BitesizeNewsBot. It sits on the “new” queue in the /r/worldnews subreddit and posts summaries of the articles that people link to.

I ran it for a few days last week and, after I worked out some of the kinks, it was ticking along nicely.

[Screenshot: bitesize_news_bot]

The code

Here is the full code for the bot. It’s under 80 lines! Take a minute to read it and then we will step through what it is doing.
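
(The listing below is a minimal sketch rather than the original source: it assumes the PRAW 3-style API that was current at the time, and the account credentials and pickle file name are placeholders.)

```python
import atexit
import pickle
import time

import praw
from pytldr.summarize.textrank import TextRankSummarizer

SEEN_FILE = 'seen_posts.pickle'  # placeholder filename
SEEN = set()


def summarize_web_page(url, length=3):
    # PyTLDR's TextRank summarizer accepts either raw text or a URL.
    summarizer = TextRankSummarizer()
    sentences = summarizer.summarize(url, length=length)
    return '\n'.join(sentences)


def load_seen_posts():
    global SEEN
    try:
        with open(SEEN_FILE, 'rb') as f:
            SEEN = pickle.load(f)
    except (IOError, EOFError):
        # No saved state yet; start with an empty set.
        SEEN = set()


@atexit.register
def save_seen_posts():
    with open(SEEN_FILE, 'wb') as f:
        pickle.dump(SEEN, f)


def main():
    load_seen_posts()

    reddit = praw.Reddit(user_agent='BitesizeNewsBot v0.1')
    reddit.login('BitesizeNewsBot', 'password')  # placeholder credentials

    while True:
        submissions = reddit.get_subreddit('worldnews').get_new(limit=10)

        for submission in submissions:
            # Skip submissions we have already commented on, and self posts,
            # which do not link to a news article.
            if submission.id in SEEN or submission.is_self:
                continue

            summary = summarize_web_page(submission.url)
            if not summary.strip():
                continue

            comment = 'Bitesize summary:\n\n{0}'.format(summary)
            try:
                submission.add_comment(comment)
            except (praw.errors.APIException, praw.errors.ClientException) as e:
                print(e)
                continue

            SEEN.add(submission.id)

        # Be a good citizen of Reddit: wait ten minutes between runs.
        time.sleep(600)


if __name__ == '__main__':
    main()
```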

Connecting to Reddit using PRAW

Due to some heroic open source contributions, the Python Reddit API Wrapper is a really mature and stable library that gives you access to everything in the Reddit API. The library even has its own subreddit – /r/praw. We're going to use it to log in with the bot's account, periodically fetch the new submissions in /r/worldnews, and post comments containing compact summaries of the linked articles.

Here’s how we log in with PRAW:
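
From the sketch above (PRAW 3-style; newer PRAW versions authenticate with OAuth credentials passed to praw.Reddit instead of login()):

```python
reddit = praw.Reddit(user_agent='BitesizeNewsBot v0.1')
reddit.login('BitesizeNewsBot', 'password')  # placeholder credentials
```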

Then we enter a loop of fetching the new submissions from /r/worldnews, summarizing them, and posting them back as comments. On each iteration, we sleep for ten minutes to be a good citizen of Reddit.

This line fetches the ten newest submissions:
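
```python
# PRAW 3-style call; newer PRAW uses reddit.subreddit('worldnews').new(limit=10)
submissions = reddit.get_subreddit('worldnews').get_new(limit=10)
```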

When we get the submissions, we can iterate over them and prepare the summaries.

Summarizing text using PyTLDR

Automatic text summarization is a topic I am really interested in. I’ve implemented several summarization algorithms, but the point of this post is to show you how to make a bot, not how to do advanced natural language processing, so we’re going to use a great library called PyTLDR.

PyTLDR implements several summarization algorithms, but the one we’re going to use is TextRank. The summarization function looks like this:
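
A sketch of that function, assuming PyTLDR's TextRankSummarizer interface:

```python
from pytldr.summarize.textrank import TextRankSummarizer


def summarize_web_page(url, length=3):
    summarizer = TextRankSummarizer()
    sentences = summarizer.summarize(url, length=length)
    return '\n'.join(sentences)
```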

The summarize_web_page function takes either a string containing the article text or a URL. If we give it a URL, as we are doing here, it uses the Goose extractor behind the scenes to fetch the article text from the web page.

The function also takes a length parameter. If this is a value between zero and one, it represents the summary length as a fraction of the length of the original article. If it is greater than one, it represents a number of sentences. We have picked three as our summary length, which seems to strike the right balance between providing a useful summary and copying large pieces of the article.

The output of the summary function is a list of sentences. Before returning the summary from the function, we join them with newlines.

In the main loop, we call the function as follows:
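
```python
summary = summarize_web_page(submission.url)
```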

Commenting on submissions

Once we have got the summary, we can generate the comment and post it on the article submission. Before we do that, though, we have to do a sanity check on the summary. Because article extraction from web pages is inherently unreliable, sometimes the summarize_web_page  function will return an empty string. This piece of code in our main loop checks for that case and moves on to the next submission if we can’t generate a sensible summary for the current one:
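
In the sketch above, that check is simply:

```python
if not summary.strip():
    continue
```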

Posting the comment can fail in many ways, so we need to catch several exceptions. As we want the same handler for each one (just print the exception and move on to the next iteration of the loop), we can catch them all in one line:
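
For example, assuming PRAW 3's exception classes in praw.errors:

```python
try:
    submission.add_comment(comment)
except (praw.errors.APIException, praw.errors.ClientException) as e:
    print(e)
    continue
```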

Keeping track of seen submissions

We don’t want the bot to comment more than once on any article, so we keep track of them in a set that stores the unique identifiers of each post once the comment with the summary has been posted.

To persist the set of posts from one run of the program to the next, we will pickle it and store it in a file on disk.

At startup, we try to restore the set from disk, like so:
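
```python
def load_seen_posts():
    global SEEN
    try:
        with open(SEEN_FILE, 'rb') as f:
            SEEN = pickle.load(f)
    except (IOError, EOFError):
        # No saved state yet; start with an empty set.
        SEEN = set()
```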

At shutdown, we store the set on disk:
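
```python
@atexit.register
def save_seen_posts():
    with open(SEEN_FILE, 'wb') as f:
        pickle.dump(SEEN, f)
```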

We are using the register decorator from the atexit module to make sure that, no matter how our program quits, the save_seen_posts function is called. It will be called even if you hit Ctrl-C in the terminal.

In the main loop, we add the submission ID to the SEEN  set right after posting the comment:
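
```python
SEEN.add(submission.id)
```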

With the set in place, we check whether the bot has already commented on the submission before trying to generate a summary:
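
In the sketch above, both checks happen in one place:

```python
if submission.id in SEEN or submission.is_self:
    continue
```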

We also check if the submission is a self-post, because by definition they do not link to any news article.

And that’s really all there is to writing a simple Reddit bot. You can make it more complicated if you want, but BitesizeNewsBot demonstrates the basics.

Web scraping, article extraction and sentiment analysis with Scrapy, Goose and TextBlob

Scrapy is a cool Python project that makes it easy to write web scraping bots that extract structured information from normal web pages. You can use it to create an API for a site that doesn’t have one, perform periodic data exports, etc. In this article, we’re going to make a scraper that retrieves the newest articles on Hacker News. The site already has an API, and there is even a wrapper for it, but the goal of this post is to show you how to get the data by scraping it.

Once we have gotten the links and the article titles, we will extract the article itself using Goose, then do sentiment analysis on the article to figure out if it’s positive or negative in tone. For this last piece, we’ll make use of the awesome TextBlob library.

A note about screen scraping

The owners of some sites do not want you to crawl their data and use it for your own purposes. You should respect their wishes. Many sites put information about whether or not they are open to being crawled in robots.txt. Hacker News' robots.txt indicates that they are open to all crawlers, but that crawlers should wait at least 30 seconds between requests. That's what we will do.

Installing dependencies

We will use pip to install all the libraries that we need.
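
Something along these lines, assuming the PyPI package names in use at the time (Goose is published as goose-extractor):

```bash
pip install scrapy goose-extractor textblob
```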

Generating a Scrapy project

Scrapy comes with a command line tool that makes it easy to generate project scaffolding. To generate the initial layout and files for our Hacker News project, go to the directory where you want to store the project and type this command:
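
Assuming we call the project hackernews (the name is arbitrary):

```bash
scrapy startproject hackernews
```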

Let’s use the tree  command to see what has been generated:
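
The layout looks roughly like this (the exact files vary slightly between Scrapy versions):

```
hackernews/
├── scrapy.cfg
└── hackernews/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```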

The best way to get to grips with the project structure is to dive right in and add code to these files, so we’ll go ahead and do that.

Creating a Hacker News item

In Scrapy, the output of a crawl is a set of items which have been scraped from a website, processed by a pipeline and output into a certain format. The starting point of any Scrapy project is defining what the items are supposed to look like. We’ll make a new item called HackerNewsItem  that represents a post on Hacker News with some associated data.

Open items.py  and change it to this:
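
A version of HackerNewsItem with the four fields listed below might look like this:

```python
import scrapy


class HackerNewsItem(scrapy.Item):
    link_title = scrapy.Field()  # title of the post on Hacker News
    url = scrapy.Field()         # URL the post points to
    sentiment = scrapy.Field()   # sentiment polarity of the article text
    text = scrapy.Field()        # article text extracted from the linked page
```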

Each Scrapy item is a class that inherits from scrapy.Item . Each field on the item is an instance of scrapy.Field . HackerNewsItem  has four fields:

  • link_title  is the title of the post on HackerNews
  • url  is the URL pointed to by the post
  • sentiment  is a sentiment polarity score from -1.0 to 1.0.
  • text  is the text of the article extracted from the page pointed to by the URL

Scrapy is smart enough that we do not have to manually specify the types of these fields, as we have to do in a Django model, for instance.

Making a spider

The next step is to define a spider that starts crawling HackerNews from the front page and follows the “More” links at the bottom of the page down to a given depth.

In the spiders  subdirectory of your project, make a file called hackernews_spider.py  and change its contents to this:
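
Below is a sketch of such a spider. It uses the older scrapy.contrib import paths referenced in this section (they moved to scrapy.spiders and scrapy.linkextractors in Scrapy 1.0), and the XPath expressions assume the Hacker News markup of the time, so they may need adjusting:

```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from hackernews.items import HackerNewsItem


class HackerNewsSpider(CrawlSpider):
    name = 'hackernews'
    allowed_domains = ['news.ycombinator.com']
    start_urls = ['https://news.ycombinator.com/']

    rules = (
        # Follow the "newest" pages (the "More" links) and hand each
        # resulting page to parse_item.
        Rule(LinkExtractor(allow=[r'news\.ycombinator\.com/newest']),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Each post sits in a table row with the class "athing".
        articles = response.xpath('//tr[@class="athing"]')

        for article in articles:
            title = article.xpath('.//td[@class="title"]/a/text()').extract()
            href = article.xpath('.//td[@class="title"]/a/@href').extract()
            if not title or not href:
                continue

            item = HackerNewsItem()
            item['link_title'] = title[0]
            item['url'] = href[0]
            yield item
```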

Our spider inherits from scrapy.contrib.spiders.CrawlSpider . This type of spider follows links extracted from previously crawled pages in the same manner as a normal web crawler. As you can see, we have defined several attributes at the top of the class.

This specifies the name used for the crawler when using the command line tool to start a crawl. To start a crawl with our crawler now, you can type:
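
Assuming the spider name from the sketch above:

```bash
scrapy crawl hackernews
```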

The allowed_domains  list holds domains that the crawler will pay attention to. Any links to domains that are not in this list will be ignored.

start_urls  is a list of URLs to start the crawl at. In our case, only one is necessary: the Hacker News homepage.

The real meat of the implementation is in the rules  variable and the parse_item  method. rules  is an iterable that contains scrapy.contrib.spiders.Rule  instances. We only have one rule, which indicates that our spider should follow links that match the "news.ycombinator.com/newest"  regex and pass the pages behind those links to the handler defined by the parse_item  method. follow=True  tells the spider that it should recursively follow links found in those pages.

parse_item  uses XPath to grab a list of the articles on the page. Each article is in a table row element with a class of “athing”. We can grab a sequence of each matching element as follows:
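
```python
articles = response.xpath('//tr[@class="athing"]')
```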

Then, we iterate over that sequence and, for each article, we grab the title and the URL. For now, we’re not going to fill in the article text or the sentiment. These will be filled in as part of our item processing pipeline.

What we have so far is a complete spider. If you’ve been paying attention, you might realize that it will just keep following the “More” links at the bottom of the page until it runs out of pages. We definitely don’t want that. We only want to scrape the top few pages. Let’s sort that out. Open settings.py  and add a line that says:
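
```python
DEPTH_LIMIT = 10
```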

With the depth limit set to 10, we will only crawl the first ten pages. While you’re in there, you can also specify a download delay so that our spider respects Hacker News’ robots.txt.
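
A 30-second delay matches what Hacker News asks for in robots.txt:

```python
DOWNLOAD_DELAY = 30
```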

Now you can run the spider and output the scraped items into a JSON file like so:
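
The output file name is arbitrary (older Scrapy versions may also need an explicit -t json):

```bash
scrapy crawl hackernews -o articles.json
```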

The item pipeline

So far, we’re not extracting the text of any articles and we are not doing any sentiment analysis. First though, let’s add a filter to the item pipeline so that “self posts” – posts that are just somebody asking a question instead of linking to something –  are filtered out.

Open pipelines.py  and add this code:
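
A sketch of the filter component:

```python
import re

from scrapy.exceptions import DropItem


class DropSelfPostsPipeline(object):

    def process_item(self, item, spider):
        # Self posts link back to Hacker News itself rather than to an article.
        if re.search(r'item\?id=[0-9]+', item['url']):
            raise DropItem('Self post: {0}'.format(item['link_title']))
        return item
```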

Scrapy pipeline objects must implement a process_item method and return the item or raise an exception. Links on Hacker News that point to self posts match the regex "item\?id=[0-9]+". If the URL of the item matches the regex, we raise a DropItem exception. That, obviously enough, causes the item to be dropped from the eventual output of the spider.

Components are not registered with the item pipeline until you add them to the ITEM_PIPELINES  list in the project’s settings.py . Add this code to register the DropSelfPostsPipeline  component.
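
In recent Scrapy versions ITEM_PIPELINES is a dict mapping component paths to priorities (older versions accepted a plain list); the module path assumes the project is called hackernews:

```python
ITEM_PIPELINES = {
    'hackernews.pipelines.DropSelfPostsPipeline': 100,
}
```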

Extracting articles with Goose

Goose is a library that allows you to extract the main article text from any web page. It is not perfect, but it is the best solution currently available in Python. Here is an example of the basic usage:
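
Something like this, with a placeholder URL:

```python
from goose import Goose

g = Goose()
article = g.extract(url='http://example.com/some-news-story')
print(article.title)
print(article.cleaned_text)
```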

We can add a new pipeline component to extract the article text from the scraped link and save it in the text  field in the item. Open pipelines.py  again and add this code:
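
A sketch of that component (the class name ExtractArticleTextPipeline is just a placeholder):

```python
from goose import Goose


class ExtractArticleTextPipeline(object):

    def __init__(self):
        self.goose = Goose()

    def process_item(self, item, spider):
        try:
            article = self.goose.extract(url=item['url'])
            item['text'] = article.cleaned_text
        except IndexError:
            # Work around a Goose bug that raises IndexError when parsing
            # the titles of certain pages.
            item['text'] = ''
        return item
```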

The code here is pretty much the same as the minimal example, except we have wrapped the goose.extract  call in a try-except block. There is a bug in the Goose library that can cause it to throw an IndexError  when parsing the titles of certain web pages. There is an open pull request on the Github repository that fixes this, but it hasn’t been merged yet. We’ll just work around it for the moment.

Add this component to the item pipeline by changing the ITEM_PIPELINES  list in settings.py .
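
Again assuming the hackernews module path and the dict form of the setting:

```python
ITEM_PIPELINES = {
    'hackernews.pipelines.DropSelfPostsPipeline': 100,
    'hackernews.pipelines.ExtractArticleTextPipeline': 200,
}
```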

Analyzing sentiment with TextBlob

TextBlob is a library that makes it easy to do standard natural language processing tasks in Python. Check out this sentiment analysis example to see just how easy it is:
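
```python
from textblob import TextBlob

blob = TextBlob('I love this library. It is wonderfully easy to use!')
print(blob.sentiment.polarity)  # close to 1.0, i.e. strongly positive
```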

Let’s add another item pipeline component to fill in the sentiment field on our items:
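
A sketch of the component:

```python
from textblob import TextBlob


class SentimentPipeline(object):

    def process_item(self, item, spider):
        # Polarity ranges from -1.0 (negative) to 1.0 (positive).
        blob = TextBlob(item.get('text', ''))
        item['sentiment'] = blob.sentiment.polarity
        return item
```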

As before, we must add the SentimentPipeline  component to the ITEM_PIPELINES  list, which now looks like this:
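
```python
ITEM_PIPELINES = {
    'hackernews.pipelines.DropSelfPostsPipeline': 100,
    'hackernews.pipelines.ExtractArticleTextPipeline': 200,
    'hackernews.pipelines.SentimentPipeline': 300,
}
```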

Now the crawler is complete. Run it using the same command we used before. When it is finished, you will have a file of Hacker News articles, annotated with the sentiment.

There is much more to Scrapy than we have looked at in this article. One of its most interesting features is the ability to integrate with Django by directly populating Django models with scraped data. There are also lots of options for outputting data and managing how the spider behaves, such as scraping using Firefox (for JavaScript-heavy sites) and exporting to FTP and Amazon S3. Check out the Scrapy docs for more.

Implementing the famous ELIZA chatbot in Python

ELIZA is a conversational agent, or “chatbot”, first implemented in 1966 by Joseph Weizenbaum. It was meant to emulate a Rogerian psychologist. Since then there have been various implementations, more or less similar to the original one. Emacs ships with an ELIZA-type program built in. The CIA even experimented with computer-aided interrogation of officers using a very similar, but rather more combative, version of the program.

My implementation is based on one originally written by Joe Strout. I have updated it significantly to use a more modern and idiomatic form of Python, but the text patterns in the reflections  and psychobabble  data structures are copied essentially verbatim.

Implementation

Let’s walk through the source code. Copy this into a file called eliza.py .
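
(The listing below is an abridged sketch: only a handful of reflections and psychobabble entries are shown, and the response templates are representative rather than exhaustive.)

```python
import random
import re

# Abridged: the real tables contain many more entries.
reflections = {
    "i": "you",
    "am": "are",
    "was": "were",
    "my": "your",
    "your": "my",
    "you": "I",
    "me": "you",
}

psychobabble = [
    [r'i need (.*)',
     ["Why do you need {0}?",
      "Would it really help you to get {0}?"]],
    [r'i feel (.*)',
     ["Tell me more about feeling {0}.",
      "Do you often feel {0}?"]],
    [r'(.*)\?',
     ["Why do you ask that?",
      "What is it you really want to know?"]],
    [r'(.*)',
     ["Please tell me more.",
      "I see. Can you elaborate on that?"]],
]


def reflect(fragment):
    tokens = fragment.lower().split()
    for i, token in enumerate(tokens):
        if token in reflections:
            tokens[i] = reflections[token]
    return ' '.join(tokens)


def analyze(statement):
    for pattern, responses in psychobabble:
        match = re.match(pattern, statement.rstrip(".!"), re.IGNORECASE)
        if match:
            response = random.choice(responses)
            return response.format(*[reflect(g) for g in match.groups()])


def main():
    print("Hello. How are you feeling today?")

    while True:
        statement = input("> ")
        print(analyze(statement))

        if statement == "quit":
            break


if __name__ == "__main__":
    main()
```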

Run it with python eliza.py  and see if you can trip it up. Try not to spill your guts to your new computer therapist!

You will notice that most of the source code is taken up by a dictionary called reflections  and a list of lists called psychobabble . ELIZA is fundamentally a pattern matching program. There is not much more to it than that.

reflections maps first-person pronouns to second-person pronouns and vice versa. It is used to "reflect" a statement back at the user.

psychobabble is a list of lists in which the first element of each entry is a regular expression that matches the user's statements and the second element is a list of potential responses. Many of the potential responses contain placeholders that can be filled in with fragments that echo the user's statements.

main  is the entry point of the program. Let’s take a closer look at it.
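
From the sketch above:

```python
def main():
    print("Hello. How are you feeling today?")

    while True:
        statement = input("> ")
        print(analyze(statement))

        if statement == "quit":
            break
```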

First, we print the initial prompt, then we enter a loop of asking the user for input and passing what the user says to the analyze  function to get the therapist’s response. If at any point the user types “quit”, we break out of the loop and the program exits.

Let’s see what’s going on in analyze .
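
```python
def analyze(statement):
    for pattern, responses in psychobabble:
        match = re.match(pattern, statement.rstrip(".!"), re.IGNORECASE)
        if match:
            response = random.choice(responses)
            return response.format(*[reflect(g) for g in match.groups()])
```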

We iterate through the regular expressions in the psychobabble  array, trying to match each one with the user’s statement, from which we have stripped the final punctuation. If we find a match, we choose a response template randomly from the list of possible responses associated with the matching pattern. Then we interpolate the match groups from the regular expression into the response string, calling the reflect  function on each match group first.

There is one syntactic oddity to note here. When we use the list comprehension to generate a list of reflected match groups, we explode the list with the asterisk (*) character before passing it to the string's format method. format expects a series of positional arguments corresponding to the format placeholders – {0}, {1}, etc. – in the string. A list or a tuple can be exploded into positional arguments using a single asterisk. Double asterisks (**) can be used to explode dictionaries into keyword arguments.

Now let’s examine the reflect  function.
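
```python
def reflect(fragment):
    tokens = fragment.lower().split()
    for i, token in enumerate(tokens):
        if token in reflections:
            tokens[i] = reflections[token]
    return ' '.join(tokens)
```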

There is nothing too complicated going on in it. First, we make the statement lowercase, then we tokenize it by splitting on whitespace characters. We iterate through the list of tokens and, if the token exists in our reflections  dictionary, we replace it with the value from the dictionary. So “I” becomes “you”, “your” becomes “my”, etc.

As you can see, ELIZA is an extremely simple program. The only real intelligence in it is involved in the creation of suitably vague response templates. Try fiddling with the psychobabble  list to extend ELIZA’s conversational range and give her a different tone.

Connecting Eliza to IRC

The command line version of ELIZA is pretty fun, but wouldn’t it be cool to let her loose on the internet? I’m going to show you how to hook up the program we have already written to an IRC bot that connects to a public server, creates its own channel and carries on conversations with real human beings.

We’re going to use the SingleServerIRCBot  in the irc  package. You can install it with pip.
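
```bash
pip install irc
```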

Copy this code into a file called elizabot.py .
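
(A minimal sketch, assuming the irc package's SingleServerIRCBot API as described below; the command line argument order and the "Elizabot: message" addressing convention are assumptions.)

```python
import sys

from irc.bot import SingleServerIRCBot

import eliza


class ElizaBot(SingleServerIRCBot):

    def __init__(self, channel, nickname, server, port=6667):
        SingleServerIRCBot.__init__(self, [(server, port)], nickname, nickname)
        self.channel = channel
        self.nickname = nickname

    def on_welcome(self, connection, event):
        # Join our channel as soon as the server accepts the connection.
        connection.join(self.channel)

    def on_pubmsg(self, connection, event):
        message = event.arguments[0]
        sender = event.source.nick

        # Only respond to messages addressed to the bot, e.g. "Elizabot: hello".
        if message.startswith(self.nickname):
            statement = message.split(':', 1)[-1].strip()
            response = eliza.analyze(statement)
            connection.privmsg(self.channel, '{0}: {1}'.format(sender, response))


def main():
    if len(sys.argv) != 4:
        print('Usage: python elizabot.py <server> <channel> <nickname>')
        sys.exit(1)

    server, channel, nickname = sys.argv[1], sys.argv[2], sys.argv[3]
    bot = ElizaBot(channel, nickname, server)
    bot.start()


if __name__ == '__main__':
    main()
```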

Let’s go through it. The SingleServerIRCBot  class gives us some hooks we can use to respond to server events. We can make the bot join the given channel automatically by overriding the on_welcome  method.
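
```python
def on_welcome(self, connection, event):
    connection.join(self.channel)
```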

Now, we have to listen to messages on the channel we joined and check if they are addressed to the bot. If they are, we pass the message to analyze  from the eliza  module and write the response back to the channel, prefixed with the nick of the user who sent the message.

We do that by overriding the on_pubmsg  method.
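
```python
def on_pubmsg(self, connection, event):
    message = event.arguments[0]
    sender = event.source.nick

    if message.startswith(self.nickname):
        statement = message.split(':', 1)[-1].strip()
        response = eliza.analyze(statement)
        connection.privmsg(self.channel, '{0}: {1}'.format(sender, response))
```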

The if statement in this method checks that the received message is prefixed with the bot's nickname. Only then do we generate and send a response.

There is a little subtlety involved in sending messages. To send a message to a channel, we have to use the privmsg  method on the connection  object passed into the on_pubmsg  method, giving the name of the channel as the first argument. Fairly unintuitive, but easy once you know.

The rest of the script is straightforward. It just consists of a main  function that reads the command line arguments and starts the bot.

To run the script and connect the bot to Freenode, type this command:
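
With the hypothetical argument order from the sketch above:

```bash
python elizabot.py irc.freenode.net "#ElizaBot" Elizabot
```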

The bot will connect to the server, grab the nickname "Elizabot", and join the #ElizaBot channel.

Here’s a demo of the bot in action:

[Demo: talking_to_eliza]