Private methods and attributes in Python

Unlike Java, which enforces access restrictions on methods and attributes, Python takes the view that we are all adults and should be allowed to use the code as we see fit. Nevertheless, the language provides a few facilities to indicate which methods and attributes are public and which are private, and some ways to dissuade people from accessing and using private things.

Normal attribute access

Let’s take a look at how normal attribute access works.
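The original snippets didn't survive in this copy, so here is a minimal sketch (the Foo class and bar attribute match the names used below):

```
>>> class Foo(object):
...     def __init__(self):
...         self.bar = 'spam'
...
>>> foo = Foo()
>>> foo.bar
'spam'
>>> foo.bar = 'eggs'
>>> foo.bar
'eggs'
>>> foo.__dict__
{'bar': 'eggs'}
```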

As we can see, there are no restrictions on accessing or assigning to the bar  attribute of our instance. The attribute is also included in __dict__ .

Making it private

Now let’s make bar  “private”. We can do that by adding two leading underscores to the name.
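Something like this:

```
>>> class Foo(object):
...     def __init__(self):
...         self.__bar = 'spam'
...
>>> foo = Foo()
>>> foo.__bar
Traceback (most recent call last):
  ...
AttributeError: 'Foo' object has no attribute '__bar'
```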

What has happened here is that the name of __bar  has been changed by the  interpreter so that it is not easily accessible outside the class. If we take a look at __dict__  again, we will see that it has been renamed to _Foo__bar , and can be accessed and assigned using that name.
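Continuing in the same session:

```
>>> foo.__dict__
{'_Foo__bar': 'spam'}
>>> foo._Foo__bar
'spam'
>>> foo._Foo__bar = 'eggs'
>>> foo._Foo__bar
'eggs'
```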

This is called “name mangling”. Attributes whose names start with two underscores are renamed in the format _classname__attrname .

We only have to use the mangled name outside the class. Inside, we access the attribute in the normal way.
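For example:

```
class Foo(object):
    def __init__(self):
        self.__bar = 'spam'

    def shout(self):
        # Inside the class, the plain name works; the interpreter
        # mangles it to _Foo__bar for us.
        print(self.__bar.upper())
```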

Getters and setters

After learning about “private” attributes, sometimes new Python programmers get the idea that they can use getters and setters to manage accessing and assigning attributes, so they write something like this.
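A sketch of the Java-style pattern they tend to reach for:

```
class Foo(object):
    def __init__(self):
        self.__bar = 'spam'

    def get_bar(self):
        return self.__bar

    def set_bar(self, value):
        self.__bar = value
```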

It might work, but it’s not Python. Direct attribute access is the natural and Pythonic way to do things, so any solution to mediated attribute access should maintain that interface. There are a few ways to do it, such as overriding __getattr__  and __setattr__ , but the best way is to use managed attributes.
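A sketch using the property decorator:

```
class Foo(object):
    def __init__(self):
        self.__bar = 'spam'

    @property
    def bar(self):
        # Validation, logging, etc. can go here.
        return self.__bar

    @bar.setter
    def bar(self, value):
        self.__bar = value
```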

Here we have created a managed bar  attribute that stores its data in the private __bar  attribute. When getting and setting the value of __bar , we can run whatever code we want for validation, logging, etc., provided we go through the interface provided by the two decorated bar  functions. Useful, eh?

Private methods

Methods can be made private in the same way, by naming them with two leading underscores and no trailing underscores.
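For example:

```
>>> class Foo(object):
...     def __baz(self):
...         return 'secret'
...     def qux(self):
...         return self.__baz()
...
>>> foo = Foo()
>>> foo.__baz()
Traceback (most recent call last):
  ...
AttributeError: 'Foo' object has no attribute '__baz'
>>> foo._Foo__baz()
'secret'
>>> foo.qux()
'secret'
```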

And just like private attributes, they are accessible by name inside the class.

A word about single underscores

So far we have dealt with names that start with two underscores, but it’s quite common to see names that start with a single underscore. They are not private in the same sense. Name mangling does not occur. A single underscore is mostly just a weak indication that the thing in question is meant to be used internally and is not part of the public interface of the class, module, etc., that it is inside.

In classes, attributes and methods that start with a single underscore are treated normally.
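No mangling, no errors:

```
>>> class Foo(object):
...     def __init__(self):
...         self._bar = 'spam'
...
>>> foo = Foo()
>>> foo._bar
'spam'
>>> foo.__dict__
{'_bar': 'spam'}
```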

However, single underscores are not purely a stylistic thing. They do affect how the import  statement works.

PEP8 says:

_single_leading_underscore: weak “internal use” indicator. E.g. from M import * does not import objects whose name starts with an underscore.

This means that if we have a function called _hello_world  in a module called helloworld , and we import *  from it, then the _hello_world  function will not be pulled into the current scope.
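For example (a sketch; the module contents are assumed):

```
# helloworld.py
def _hello_world():
    return 'Hello, world!'
```

```
>>> from helloworld import *
>>> _hello_world()
Traceback (most recent call last):
  ...
NameError: name '_hello_world' is not defined
```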

It is possible to override the default hiding of objects with single leading underscores. __all__  is a list of the names of public objects exported by a module. If we add '_hello_world'  to the list, then it will be pulled in with the wildcard import.
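```
# helloworld.py
__all__ = ['_hello_world']

def _hello_world():
    return 'Hello, world!'
```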

The single underscore only affects wildcard imports, which we should avoid anyway. We can still grab the function specifically using from helloworld import _hello_world .

And that’s pretty much all you need to know about private attributes in Python!

Multi-line strings in Python

At some point, you will want to define a multi-line string and find that the obvious solutions just don’t feel clean. In this post, I’m going to take a look at three ways of defining them and give you my recommendation.

Concatenation

The first way of doing it, and the way that immediately comes to mind, is to just add the strings to each other, like so:
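Something like this (the text is just filler):

```
multiline = "The first line of the string.\n" + \
            "The second line of the string.\n" + \
            "The last line of the string."
```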

In my opinion, this looks extremely ugly. You can make it a bit better by omitting the + signs. This also works:
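```
multiline = "The first line of the string.\n" \
            "The second line of the string.\n" \
            "The last line of the string."
```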

That’s better, but still a bit of an eyesore. Let’s move on.

Multi-line string syntax

Another way is to use the built-in multi-line string syntax, like so:
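```
multiline = """The first line of the string.
The second line of the string.
The last line of the string."""
```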

That’s much better than concatenation, but it has one conspicuous wart. You can’t indent the subsequent lines to be at the same level of indentation as the first one. The space will be interpreted as part of the string. The first line will be flush with the margin and the subsequent lines will be indented.

Tuple syntax

There is another way to do it that doesn’t suffer from the ugliness of concatenation or the indentation wart of the multi-line syntax, and that is to use the tuple syntax.

I have no idea why this works, but it does:
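```
multiline = ("The first line of the string.\n"
             "The second line of the string.\n"
             "The last line of the string.")
```

For what it's worth, there is no actual tuple here: Python concatenates adjacent string literals at compile time, and the parentheses simply allow the expression to span several lines.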

Note that you have to add the line breaks into the strings, because they’re not put in automatically. Nonetheless, I think this is by far the nicest, most readable method.

The difference between input and raw_input in Python

One of the first things that people notice when they ditch Python 2 and start coding in Python 3 – apart from the fact that print  is now a function – is that the raw_input  function has disappeared. So this Python 2 code:
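```
name = raw_input("What is your name? ")
```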

must be converted to this in Python 3:
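```
name = input("What is your name? ")
```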

The change comes because the Python developers realized that they had made a dangerous mistake back in the early days. If you recall, the Python 2 version of the input  function used to be equivalent to this:
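```
def input(prompt):
    return eval(raw_input(prompt))
```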

This allowed you to easily write programs that take input from the user and evaluate it as an int or a float or whatever type it is. For example:
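```
>>> number = input("Enter a number: ")
Enter a number: 10
>>> type(number)
<type 'int'>
>>> number = input("Enter a number: ")
Enter a number: 10.5
>>> type(number)
<type 'float'>
```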

raw_input, on the other hand, returned strings:
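```
>>> number = raw_input("Enter a number: ")
Enter a number: 10
>>> type(number)
<type 'str'>
```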

In Python 3, input behaves like raw_input in Python 2, and the raw_input function does not exist, so you have to do something like this (assuming you want to accept integer values):
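```
number = int(input("Enter a number: "))
```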

Effectively, in the Python 2 version of the input  function, the string read from the prompt was evaled. To understand the danger of eval , you should take a look at this article by Ned Batchelder.

Automatically evaling whatever anybody decides to type at the prompt may make things a little easier in a teaching context, because students don’t have to learn to convert strings to their intended types, but it also leaves the program open to executing arbitrary code that the user types in, revealing private information about your system or damaging it in some way.

Take this for example:
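In a Python 2 session (the path in the output is made up, of course):

```
>>> input("Enter a number: ")
Enter a number: __import__('os').getcwd()
'/home/yourname/projects/top-secret'
```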

That will print the current working directory of your program.

Or if someone really wants to screw things up for you, they could just execute a recursive delete of your home directory. DO NOT RUN THIS CODE:
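A transcript of what that would look like (again: do not type this at a real prompt):

```
>>> input("Enter a number: ")
Enter a number: __import__('os').system('rm -rf ~')
```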

There is no need for Python to contain such footguns, regardless of their dubious teaching value.

To get the old behaviour of input  (which I hope I have convinced you that you do not want), replace your calls to it with eval(input()) . In fact, that is exactly what the automatic porting tool 2to3  does.

How to modify a list in place in Python

Did you ever see a piece of code that looks like this?
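Here is a sketch, with a starting list chosen to match the walkthrough below:

```
numbers = [1, 2, 2, 3, 4, 4, 5]

for elem in numbers:
    if elem % 2 == 0:
        numbers.remove(elem)
```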

The purpose of it is to remove any even numbers from the list numbers . And it looks like it should work, right?

The problem explained

For most of you, the mistake will already be obvious, but beginners can expect to scratch their heads when they examine the list and see that it still has some even numbers in it.

Here’s what we wanted to see:
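```
>>> numbers
[1, 3, 5]
```

(With the sketch’s starting list, what we actually end up with is [1, 2, 3, 4, 5]: one 2 and one 4 survive.)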

What is going on here? We can illuminate the problem if we write out the code without any of the syntactic niceties of the Python for  loop. (Keep in mind that this is merely for explanatory purposes and is not the kind of code you should be writing.)
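A sketch of the desugared version:

```
i = 0
while i < len(numbers):
    elem = numbers[i]
    if elem % 2 == 0:
        numbers.remove(elem)
    i += 1
```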

Be your own debugger

Let’s “step through” the while  loop to see exactly how the execution goes wrong.

  • On the first iteration, the loop counter i  is equal to 0. 1 (the value of the first element in numbers ) is assigned to elem . 1 is not divisible by 2, so the if  block is not executed.
  • On the second iteration, i  is equal to 1. 2 (the value of the second element in numbers ) is assigned to elem . 2 is clearly divisible by 2, so the if  block is executed and the second element is removed from the list. The length of the list is reduced by 1. The element that was at the third position is now at the second position, and so on. The list now looks like this.
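```
[1, 2, 3, 4, 4, 5]
```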

  • On the third iteration, i  is equal to 2. 3 (the value of the third element in numbers ) is assigned to elem . The 2 that used to be the third element, but is now the second, has been skipped entirely. It can never be removed from the list.
  • The 4 in the sixth position of the initial list is skipped in the same way.

Now that you understand how the code is going wrong, let’s see how to fix it.

How to fix it

There are several ways to refactor the code to give us the output we want. Let’s examine them.

First, we could use a list comprehension to build up a new list that contains only the elements in numbers  that are not divisible by 2. This is by far the simplest and most common way to achieve what we want. Here’s what it looks like:
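```
numbers = [n for n in numbers if n % 2 != 0]
```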

A subtlety of this method is that it creates a completely new list and makes numbers point to it. If you had another name pointing to the old list, it will still point there. Here’s what I mean:
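```
>>> numbers = [1, 2, 2, 3, 4, 4, 5]
>>> other = numbers
>>> numbers = [n for n in numbers if n % 2 != 0]
>>> numbers
[1, 3, 5]
>>> other
[1, 2, 2, 3, 4, 4, 5]
```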

That may or may not be a problem, depending on the program, but if it is crucial to modify the list in place without copying anything, we can iterate backwards. While we’re at it, we can use del with an index instead of remove, which needlessly searches through the list for the first element with the given value:
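A sketch:

```
for i in range(len(numbers) - 1, -1, -1):
    if numbers[i] % 2 == 0:
        del numbers[i]
```

Because we walk from the end of the list towards the start, deleting an element only shifts items we have already examined.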

So which one of these options should you use? If you are ok with creating a completely new list, then you should use the list comprehension. It’s by far the clearest and most idiomatic solution. Otherwise, if modifying the original list in place is crucial, you should go for the second version.

How to write a Reddit bot in Python

Something I have seen a lot of interest in is writing bots to interact with Reddit and provide useful services to the community. In this post, I’m going to show you how to build one.

Introducing BitesizeNewsBot

The bot we’re going to write is called BitesizeNewsBot. It sits on the “new” queue in the /r/worldnews subreddit and posts summaries of the articles that people link to.

I ran it for a few days last week and, after I worked out some of the kinks, it was ticking along nicely.

[Screenshot: BitesizeNewsBot’s summary comments in /r/worldnews]

The code

Here is the full code for the bot. It’s under 80 lines! Take a minute to read it and then we will step through what it is doing.
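The original listing didn’t survive in this copy, so what follows is a reconstruction assembled from the pieces discussed below. It targets the PRAW 3-era API (login, get_subreddit, add_comment, all since superseded by OAuth in PRAW 4); the credentials, file name, and comment template are placeholders, and the PyTLDR calls follow its documented TextRank interface.

```
import atexit
import pickle
import time

import praw
from pytldr.summarize.textrank import TextRankSummarizer
from requests.exceptions import HTTPError

USERNAME = 'BitesizeNewsBot'
PASSWORD = 'xxxxxxxx'
SEEN_FILE = 'seen_posts.pickle'

# Restore the set of already-handled submission IDs, if we have one.
try:
    with open(SEEN_FILE, 'rb') as f:
        SEEN = pickle.load(f)
except (IOError, EOFError):
    SEEN = set()


@atexit.register
def save_seen_posts():
    with open(SEEN_FILE, 'wb') as f:
        pickle.dump(SEEN, f)


def summarize_web_page(url, length=3):
    # Given a URL, PyTLDR fetches the page and extracts the article
    # with Goose, then returns a list of the most important sentences.
    summarizer = TextRankSummarizer()
    sentences = summarizer.summarize(url, length=length)
    return '\n'.join(sentences)


def main():
    r = praw.Reddit(user_agent='BitesizeNewsBot v0.1')
    r.login(USERNAME, PASSWORD)
    subreddit = r.get_subreddit('worldnews')

    while True:
        for submission in subreddit.get_new(limit=10):
            if submission.id in SEEN:
                continue
            if submission.is_self:
                continue

            summary = summarize_web_page(submission.url)
            if not summary:
                continue

            comment = 'Bitesize news:\n\n' + summary
            try:
                submission.add_comment(comment)
            except (praw.errors.APIException,
                    praw.errors.ClientException,
                    HTTPError) as e:
                print(e)
                continue

            SEEN.add(submission.id)

        # Be a good citizen of Reddit: wait ten minutes between passes.
        time.sleep(600)


if __name__ == '__main__':
    main()
```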

Connecting to Reddit using PRAW

Due to some heroic open source contributions, the Python Reddit API Wrapper (PRAW) is a really mature and stable library that gives you access to everything in the Reddit API. The library even has its own subreddit – /r/praw . We’re going to use it to log in with the bot’s account, periodically fetch the new submissions in /r/worldnews, and post comments on the submissions that contain compact summaries of the linked articles.

Here’s how we log in with PRAW:
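In the sketch above:

```
r = praw.Reddit(user_agent='BitesizeNewsBot v0.1')
r.login(USERNAME, PASSWORD)
```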

Then we enter a loop of fetching the new submissions from /r/worldnews, summarizing them, and posting them back as comments. On each iteration, we sleep for ten minutes to be a good citizen of Reddit.

This line fetches the ten newest submissions:
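```
submissions = r.get_subreddit('worldnews').get_new(limit=10)
```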

When we get the submissions, we can iterate over them and prepare the summaries.

Summarizing text using PyTLDR

Automatic text summarization is a topic I am really interested in. I’ve implemented several summarization algorithms, but the point of this post is to show you how to make a bot, not how to do advanced natural language processing, so we’re going to use a great library called PyTLDR.

PyTLDR implements several summarization algorithms, but the one we’re going to use is TextRank. The summarization function looks like this:
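```
def summarize_web_page(url, length=3):
    summarizer = TextRankSummarizer()
    sentences = summarizer.summarize(url, length=length)
    return '\n'.join(sentences)
```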

The summarize_web_page  function takes either a string containing the article text, or a URL. If we give it a URL, as we are doing here, it uses the Goose extractor behind the scenes to fetch the article text from the web page.

The function also takes a length parameter. If this is a value between zero and one, it represents the summary length as a fraction of the length of the original article. If it is greater than one, it represents a number of sentences. We have picked three as our summary length, which seems to strike the right balance between providing a useful summary and copying large pieces of the article.

The output of the summary function is a list of sentences. Before returning the summary from the function, we join them with newlines.

In the main loop, we call the function as follows:
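```
summary = summarize_web_page(submission.url)
```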

Commenting on submissions

Once we have got the summary, we can generate the comment and post it on the article submission. Before we do that, though, we have to do a sanity check on the summary. Because article extraction from web pages is inherently unreliable, sometimes the summarize_web_page  function will return an empty string. This piece of code in our main loop checks for that case and moves on to the next submission if we can’t generate a sensible summary for the current one:
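```
if not summary:
    continue
```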

Posting the comment can fail in many ways, so we need to catch several exceptions. As we want the same handler for each one (just print the exception and move on to the next iteration of the loop), we can catch them all in one line:
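```
try:
    submission.add_comment(comment)
except (praw.errors.APIException,
        praw.errors.ClientException,
        HTTPError) as e:
    print(e)
    continue
```

(The exact exception classes are an assumption: PRAW 3 exposed APIException and ClientException in praw.errors, and network failures surface as requests’ HTTPError.)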

Keeping track of seen submissions

We don’t want the bot to comment more than once on any article, so we keep track of them in a set that stores the unique identifiers of each post once the comment with the summary has been posted.

To persist the set of posts from one run of the program to the next, we will pickle it and store it in a file on disk.

At startup, we try to restore the set from disk, like so:
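```
try:
    with open(SEEN_FILE, 'rb') as f:
        SEEN = pickle.load(f)
except (IOError, EOFError):
    SEEN = set()
```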

At shutdown, we store the set on disk:
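```
@atexit.register
def save_seen_posts():
    with open(SEEN_FILE, 'wb') as f:
        pickle.dump(SEEN, f)
```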

We are using the register decorator from the atexit  module to make sure that, no matter how our program quits, the save_seen_posts  function is called. It will be called even if you hit Ctrl-C in the terminal.

In the main loop, we add the submission ID to the SEEN  set right after posting the comment:
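```
SEEN.add(submission.id)
```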

With the set in place, we check whether the bot has already commented on the submission before trying to generate a summary:
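```
if submission.id in SEEN:
    continue
```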

We also check if the submission is a self-post, because by definition self-posts do not link to any news article.
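```
if submission.is_self:
    continue
```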

And that’s really all there is to writing a simple Reddit bot. You can make it more complicated if you want, but BitesizeNewsBot demonstrates the basics.

The Python Help System

A few years ago I had an interview with a company using Python for their main product. At the time I was a beginner, so I wasn’t able to answer all their questions. One of the questions I choked on was this:

– If you’re given a new Python package and you don’t know how it works, how would you figure it out?

I said I would Google it and see if there was any documentation online. They followed up with:

– What would you do if you had no internet connection?

I told them I would read the code and see what I could learn from it. That wasn’t the answer they were looking for.

Python’s online help system

Python includes a built-in help system based on the pydoc module. In a terminal, type:
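For example, to read the documentation for the pickle module (any installed module will do):

```
$ pydoc pickle
```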

A help page will be printed to the console:
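Abridged, it looks something like this (the file path will vary from system to system):

```
Help on module pickle:

NAME
    pickle - Create portable serialized representations of Python objects.

FILE
    /usr/lib/python2.7/pickle.py
...
```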

All pydoc does is generate the help page based on the docstrings in the module.

Happily, you’re not stuck scrolling through the terminal, man page-style. You can start a local web server that serves the documentation in HTML format by typing:
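```
$ pydoc -p 8000
pydoc server ready at http://localhost:8000/
```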

Now if you go to http://localhost:8000  in your browser you will see an index page with links to the documentation pages for all the modules and packages installed on your system.

It will look something like this:

[Screenshot: the pydoc web server’s module index page]

Getting help in IDLE

In the REPL (IDLE or whatever alternative you are using), you can access the same help using the help  built-in function. This function is added to the built-in namespace (the things that are already defined when the interpreter starts up) by the site  module, which is automatically imported during initialization.
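For example (output abridged):

```
>>> help(len)
Help on built-in function len in module __builtin__:

len(...)
    len(object) -> integer
    ...
```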

Checking the attributes on an object

Sometimes you don’t need the full help text. You only want to see what attributes a certain object has so that you can get on with writing code. In that case, the built-in dir  function can come in handy.

dir  works in two modes. The first is when it is invoked without any arguments. In that mode, it prints out a list of the names defined in the local scope.
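```
>>> x = 10
>>> def foo():
...     pass
...
>>> dir()
['__builtins__', '__doc__', '__name__', '__package__', 'foo', 'x']
```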

The second is when it is given an object as an argument. In that mode, it tries to return a list of relevant attributes from the object passed in to it. What that means depends on whether the object is a module or a class.

If it’s a module, dir  returns a list of the module’s attributes. If it’s a class, dir  returns a list of the class attributes, and the attributes of the base classes.
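For example (the module output is abridged):

```
>>> import math
>>> dir(math)
['__doc__', '__name__', '__package__', 'acos', 'acosh', 'asin', ...]
>>> class Base(object):
...     base_attr = 'spam'
...
>>> class Child(Base):
...     child_attr = 'eggs'
...
>>> 'base_attr' in dir(Child)
True
>>> 'child_attr' in dir(Child)
True
```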

A more useful dir function

Usually when I need dir , I also want to know the types of the object’s attributes. Here’s a function that annotates the output of the normal dir  function with the string names of the types of each returned attribute:
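A sketch (the function name is my own):

```
def dir_types(obj):
    """Like dir(), but annotate each attribute with the name of its type."""
    for name in dir(obj):
        try:
            attr = getattr(obj, name)
        except AttributeError:
            continue
        print('{0:<30} {1}'.format(name, type(attr).__name__))
```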

I’ll leave it there for now. Don’t forget to play around with the help system to get a feel for it. You’ll be glad of it next time you’re stuck somewhere without an internet connection and want to do some coding.

Web scraping, article extraction and sentiment analysis with Scrapy, Goose and TextBlob

Scrapy is a cool Python project that makes it easy to write web scraping bots that extract structured information from normal web pages. You can use it to create an API for a site that doesn’t have one, perform periodic data exports, etc. In this article, we’re going to make a scraper that retrieves the newest articles on Hacker News. The site already has an API, and there is even a wrapper for it, but the goal of this post is to show you how to get the data by scraping it.

Once we have gotten the links and the article titles, we will extract the article itself using Goose, then do sentiment analysis on the article to figure out if it’s positive or negative in tone. For this last piece, we’ll make use of the awesome TextBlob library.

A note about screen scraping

The owners of some sites do not want you to crawl their data and use it for your own purposes. You should respect their wishes. Many sites put information about whether or not they are open to being crawled in robots.txt. Hacker News’ robots.txt indicates that they are open to all crawlers, but that crawlers should wait at least 30 seconds between requests. That’s what we will do.

Installing dependencies

We will use pip to install all the libraries that we need.
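Something like this should work (the Goose article extractor is published on PyPI as goose-extractor):

```
$ pip install scrapy goose-extractor textblob
```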

Generating a Scrapy project


Scrapy comes with a command line tool that makes it easy to generate project scaffolding. To generate the initial layout and files for our Hacker News project, go to the directory where you want to store the project and type this command:
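```
$ scrapy startproject hackernews
```

(The project name hackernews is assumed here and in the snippets below.)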

Let’s use the tree  command to see what has been generated:
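On the Scrapy versions of this era, the layout looks like this:

```
hackernews
├── hackernews
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
```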

The best way to get to grips with the project structure is to dive right in and add code to these files, so we’ll go ahead and do that.

Creating a Hacker News item

In Scrapy, the output of a crawl is a set of items which have been scraped from a website, processed by a pipeline and output into a certain format. The starting point of any Scrapy project is defining what the items are supposed to look like. We’ll make a new item called HackerNewsItem  that represents a post on Hacker News with some associated data.

Open items.py  and change it to this:
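```
import scrapy


class HackerNewsItem(scrapy.Item):
    link_title = scrapy.Field()
    url = scrapy.Field()
    sentiment = scrapy.Field()
    text = scrapy.Field()
```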

Each Scrapy item is a class that inherits from scrapy.Item . Each field on the item is an instance of scrapy.Field . HackerNewsItem  has four fields:

  • link_title  is the title of the post on HackerNews
  • url  is the URL pointed to by the post
  • sentiment  is a sentiment polarity score from -1.0 to 1.0.
  • text  is the text of the article extracted from the page pointed to by the URL

Scrapy is smart enough that we do not have to manually specify the types of these fields, as we have to do in a Django model, for instance.

Making a spider

The next step is to define a spider that starts crawling HackerNews from the front page and follows the “More” links at the bottom of the page down to a given depth.

In the spiders  subdirectory of your project, make a file called hackernews_spider.py  and change its contents to this:
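A reconstruction; the module paths follow the old scrapy.contrib layout the text refers to, and the XPath expressions are assumptions about Hacker News’ markup at the time:

```
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from hackernews.items import HackerNewsItem


class HackerNewsSpider(CrawlSpider):
    name = 'hackernews'
    allowed_domains = ['news.ycombinator.com']
    start_urls = ['https://news.ycombinator.com/']

    rules = (
        Rule(LinkExtractor(allow=(r'news\.ycombinator\.com/newest',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        articles = response.xpath('//tr[@class="athing"]')
        for article in articles:
            item = HackerNewsItem()
            item['link_title'] = article.xpath(
                'td[@class="title"]/a/text()').extract()[0]
            item['url'] = article.xpath(
                'td[@class="title"]/a/@href').extract()[0]
            yield item
```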

Our spider inherits from scrapy.contrib.spiders.CrawlSpider . This type of spider follows links extracted from previously crawled pages in the same manner as a normal web crawler. As you can see, we have defined several attributes at the top of the class.
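The first is the spider’s name:

```
name = 'hackernews'
```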

This specifies the name used for the crawler when using the command line tool to start a crawl. To start a crawl with our crawler now, you can type:
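```
$ scrapy crawl hackernews
```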

The allowed_domains  list holds domains that the crawler will pay attention to. Any links to domains that are not in this list will be ignored.

start_urls  is a list of URLs to start the crawl at. In our case, only one is necessary: the Hacker News homepage.

The real meat of the implementation is in the rules  variable and the parse_item  method. rules  is an iterable that contains scrapy.contrib.spiders.Rule  instances. We only have one rule, which indicates that our spider should follow links that match the "news.ycombinator.com/newest"  regex and pass the pages behind those links to the handler defined by the parse_item  method. follow=True  tells the spider that it should recursively follow links found in those pages.

parse_item  uses XPath to grab a list of the articles on the page. Each article is in a table row element with a class of “athing”. We can grab a sequence of each matching element as follows:
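```
articles = response.xpath('//tr[@class="athing"]')
```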

Then, we iterate over that sequence and, for each article, we grab the title and the URL. For now, we’re not going to fill in the article text or the sentiment. These will be filled in as part of our item processing pipeline.

What we have so far is a complete spider. If you’ve been paying attention, you might realize that it will just keep following the “More” links at the bottom of the page until it runs out of pages. We definitely don’t want that. We only want to scrape the top few pages. Let’s sort that out. Open settings.py  and add a line that says:
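```
DEPTH_LIMIT = 10
```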

With the depth limit set to 10, we will only crawl the first ten pages. While you’re in there, you can also specify a download delay so that our spider respects Hacker News’ robots.txt.
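```
DOWNLOAD_DELAY = 30
```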

Now you can run the spider and output the scraped items into a JSON file like so:
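```
$ scrapy crawl hackernews -o articles.json
```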

The item pipeline

So far, we’re not extracting the text of any articles and we are not doing any sentiment analysis. First though, let’s add a filter to the item pipeline so that “self posts” – posts that are just somebody asking a question instead of linking to something –  are filtered out.

Open pipelines.py  and add this code:
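A reconstruction matching the description below:

```
import re

from scrapy.exceptions import DropItem


class DropSelfPostsPipeline(object):
    def process_item(self, item, spider):
        if re.search(r'item\?id=[0-9]+', item['url']):
            raise DropItem('Self post: %s' % item['link_title'])
        return item
```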

Scrapy pipeline objects must implement a process_item  method and return the item or raise an exception. Links on Hacker News that point to self posts match the regex "item\?id=[0-9]+" . If the URL of the item matches the regex, we raise a DropItem  exception. That, obviously enough, causes the item to be dropped from the eventual output of the spider.

Components are not registered with the item pipeline until you add them to the ITEM_PIPELINES  list in the project’s settings.py . Add this code to register the DropSelfPostsPipeline  component.
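Note that newer Scrapy releases configure ITEM_PIPELINES as a dict mapping each component’s path to an order number, rather than a plain list:

```
ITEM_PIPELINES = {
    'hackernews.pipelines.DropSelfPostsPipeline': 100,
}
```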

Extracting articles with Goose

Goose is a library that allows you to extract the main article text from any web page. It is not perfect, but it is the best solution currently available in Python. Here is an example of the basic usage:
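```
from goose import Goose

g = Goose()
article = g.extract(url='http://example.com/some-news-story')
print(article.cleaned_text)
```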

We can add a new pipeline component to extract the article text from the scraped link and save it in the text  field in the item. Open pipelines.py  again and add this code:
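```
from goose import Goose


class ExtractArticlePipeline(object):
    def __init__(self):
        self.goose = Goose()

    def process_item(self, item, spider):
        try:
            article = self.goose.extract(url=item['url'])
            item['text'] = article.cleaned_text
        except IndexError:
            # Goose can choke on certain page titles; see below.
            item['text'] = ''
        return item
```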

The code here is pretty much the same as the minimal example, except we have wrapped the goose.extract  call in a try-except block. There is a bug in the Goose library that can cause it to throw an IndexError  when parsing the titles of certain web pages. There is an open pull request on the GitHub repository that fixes this, but it hasn’t been merged yet. We’ll just work around it for the moment.

Add this component to the item pipeline by changing the ITEM_PIPELINES  list in settings.py .
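```
ITEM_PIPELINES = {
    'hackernews.pipelines.DropSelfPostsPipeline': 100,
    'hackernews.pipelines.ExtractArticlePipeline': 200,
}
```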

Analyzing sentiment with TextBlob

TextBlob is a library that makes it easy to do standard natural language processing tasks in Python. Check out this sentiment analysis example to see just how easy it is:
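```
from textblob import TextBlob

blob = TextBlob('What a wonderful day. I love it!')
print(blob.sentiment.polarity)  # a float from -1.0 (negative) to 1.0 (positive)
```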

Let’s add another item pipeline component to fill in the sentiment field on our items:
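```
from textblob import TextBlob


class SentimentPipeline(object):
    def process_item(self, item, spider):
        blob = TextBlob(item['text'])
        item['sentiment'] = blob.sentiment.polarity
        return item
```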

As before, we must add the SentimentPipeline  component to the ITEM_PIPELINES  list, which now looks like this:
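```
ITEM_PIPELINES = {
    'hackernews.pipelines.DropSelfPostsPipeline': 100,
    'hackernews.pipelines.ExtractArticlePipeline': 200,
    'hackernews.pipelines.SentimentPipeline': 300,
}
```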

Now the crawler is complete. Run it using the same command we used before. When it is finished, you will have a file of Hacker News articles, annotated with the sentiment.

There is much more to Scrapy than we have looked at in this article. One of its most interesting features is the ability to integrate with Django by directly populating Django models with scraped data. There are also lots of options for outputting data and managing how the spider behaves, such as scraping using Firefox (for JavaScript-heavy sites) and exporting with FTP and Amazon S3. Check out the Scrapy docs for more.

How Python modules and packages work

One of the things that I wish I had found a clear explanation for when I was learning Python was packages, modules and the purpose of the __init__.py  file.

When you start out, you usually just dump everything into one script. That’s great for prototyping, and it works fine for programs up to a thousand lines or so, but beyond that point your files are just too big to work with easily. You also have to resort to copying and pasting to reuse functions from one program to the next.

Take this script, for instance. This code is saved in a file called add_nums.py .
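A sketch (Python 2 syntax, to match the print statement discussed below):

```
def add_nums(num_1, num_2):
    return num_1 + num_2

print add_nums(2, 5)
```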

This file on its own is a Python module. If you want to reuse the function add_nums  in another program, you can import the script into another program as a module:
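```
import add_nums

print add_nums.add_nums(3, 4)
```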

But there’s a problem. When a module is imported, all the code at the top level is executed, so print add_nums(2, 5)  will run and your program will print “7”. There’s a little trick we can use to prevent such unwanted behaviour. Just wrap the top-level code in a main function and only run it if add_nums.py  is being run as a script and not imported as a module.
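```
def add_nums(num_1, num_2):
    return num_1 + num_2


def main():
    print add_nums(2, 5)

if __name__ == '__main__':
    main()
```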

When your script is run directly, using ./add_nums.py  or python add_nums.py , the __name__  global variable is set to "__main__". Otherwise it is set to the name of the module. So by wrapping the invocation of the main function in an if statement, you can make your script behave differently depending on whether it is being run as a script or imported.

Ok, so what about packages?

A package is just a directory with an __init__.py  file in it. This file contains code that is run when the package is imported. A package can also contain other Python modules and even subpackages.

Let’s imagine we have a package called foo . It is composed of a directory called foo , an __init__.py  file, and another file called bar.py  that contains function definitions.
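```
foo
├── __init__.py
└── bar.py
```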

Ok. Now let’s imagine that __init__.py  is empty and there is a function called baz  defined in bar.py . People who are just getting started with packages, and don’t really understand how they work, tend to make an empty __init__.py  and then find that, magically, they can import their package. Often they are copying what they have seen the Django startapp  command do.

To be able to call baz, you have to import it like this:
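```
from foo.bar import baz
```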

That’s quite ugly. What we really want to do is this:
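```
from foo import baz
```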

What do we have to do to make that work? We have two options. Either we move the definition of baz  into __init__.py  or we just import it in __init__.py. We’ll go for the second option. Change __init__.py to this:
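```
# foo/__init__.py
from foo.bar import baz  # or, as a relative import: from .bar import baz
```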

Now, you can import baz from the package directly without referencing the bar module.

So, what kind of stuff should you put in __init__.py ? Firstly, anything to do with package initialization, for instance, reading a data file into memory. Remember that when you import a package, everything in __init__.py is executed, so it’s the perfect place for setting things up.

I also like to use it to import or define anything that makes up the package’s public interface. Although Python doesn’t have the concept of private and public methods like Java has, you should still strive to make your package API as clean as possible. Part of that is making the functions and classes you want people to use easy to access.

That’s it for now. Go make a package!

How to generate PDF reports with Jinja2 and PyQt

There are quite a few options for PDF generation in Python, but nothing fully open-source that ticks all the boxes. The Reportlab library is probably the most fully-featured solution available right now. Unfortunately, the open-source version doesn’t support templates, so unless you cough up for a license you’re going to be stuck manipulating the PDF format at a very low level.

Recently, I had a requirement to generate some simple PDF reports. I didn’t want to write lots of boilerplate and hoped I would be able to use some kind of templating. In the end I came up with a solution based on Jinja2 and Qt’s QWebView widget. First it renders the contents as HTML, then it uses the “Print as PDF” functionality of the QWebView  to save a PDF file. It’s a bit of a hack, but it gets the job done.

Here’s the main file, htmltopdf.py .
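The listing didn’t survive in this copy; here is a reconstruction that follows the walkthrough below (PyQt4 imports, and loremipsum’s get_paragraphs for the sample data; the report title and output file name are placeholders):

```
import sys

from jinja2 import Environment, PackageLoader
from loremipsum import get_paragraphs
from PyQt4.QtGui import QApplication, QPrinter
from PyQt4.QtWebKit import QWebView

env = Environment(loader=PackageLoader('htmltopdf', 'templates'))


def render_template(template_filename, **context):
    return env.get_template(template_filename).render(**context)


def print_pdf(html, destination):
    app = QApplication(sys.argv)

    webview = QWebView()
    webview.setHtml(html)

    printer = QPrinter()
    printer.setPageSize(QPrinter.A4)
    printer.setOutputFormat(QPrinter.PdfFormat)
    printer.setOutputFileName(destination)

    webview.print_(printer)
    app.exit()


def main():
    html = render_template('report.html',
                           title='Monthly Report',
                           paragraphs=get_paragraphs(5))
    print_pdf(html, 'report.pdf')

if __name__ == '__main__':
    main()
```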

I also made a subdirectory for Jinja2 templates called, originally enough, templates. Inside it are two files, base.html  and report.html .
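Sketches of the two templates:

```
<!-- templates/base.html -->
<!DOCTYPE html>
<html>
  <head>
    <title>{{ title }}</title>
  </head>
  <body>
    {% block content %}{% endblock %}
  </body>
</html>
```

```
<!-- templates/report.html -->
{% extends "base.html" %}
{% block content %}
  <h1>{{ title }}</h1>
  {% for paragraph in paragraphs %}
    <p>{{ paragraph }}</p>
  {% endfor %}
{% endblock %}
```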

These templates are just examples. This is where you can get creative with CSS, etc. All I’ve done is provide a minimal scaffold so you can see what’s going on.

Let’s walk through htmltopdf.py  and figure out how it works.

The first thing we need to do is set up the Jinja2 environment. Although it is possible to instantiate a jinja2.Template  object by passing it a string holding the template text, instantiating the environment gives us access to template inheritance and other cool features. To set up the environment, we have to pass it a loader object. For our purposes we can use the basic package loader, which takes as arguments the name of the package (or the name of the file, in this case) and the template directory inside it. Our template directory is  templates .
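```
env = Environment(loader=PackageLoader('htmltopdf', 'templates'))
```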

Here is the function that we use to render a particular template. You can get the Template  object from the given template file name by calling get_template  on the environment. Once you have it, you  call render  on it, passing in the necessary information as keyword arguments.
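```
def render_template(template_filename, **context):
    return env.get_template(template_filename).render(**context)
```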

Now let’s take a look at the print_pdf  function, which handles laying out the HTML from our render_template  function and printing it to a file.
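```
def print_pdf(html, destination):
    app = QApplication(sys.argv)

    webview = QWebView()
    webview.setHtml(html)

    printer = QPrinter()
    printer.setPageSize(QPrinter.A4)
    printer.setOutputFormat(QPrinter.PdfFormat)
    printer.setOutputFileName(destination)

    webview.print_(printer)
    app.exit()
```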

We won’t be able to do much with the QWebView  unless we instantiate QApplication , so we bookend print_pdf  by constructing a QApplication  instance and finally calling exit  on it.

Next, we create the QWebView . Seeing as we already have the HTML we want to display in it, we can call setHtml  on the webview. If you want to load an external URL, for snapshotting web pages, etc., you have to use the QWebView.load  function to set its URL. In that case, you will need to register a signal handler to listen for the loadFinished()  signal, but seeing as we are just providing the HTML directly we don’t need to bother with that.

After we have injected the HTML into the webview, we get a QPrinter  and configure it to print an A4-size PDF document, by calling its  setPageSize , setOutputFormat  and setOutputFileName  methods. Other page sizes and output formats are also supported.

That covers everything novel in this approach. The main function just ties it all together and generates some sample data for the template. I found the loremipsum  package handy for quickly getting my hands on placeholder text.

Here’s what our generated PDF looks like:

[Screenshot: the rendered PDF report]