How to write a Reddit bot in Python

Something I have seen a lot of interest in is writing bots to interact with Reddit and provide useful services to the community. In this post, I’m going to show you how to build one.

Introducing BitesizeNewsBot

The bot we’re going to write is called BitesizeNewsBot. It sits on the “new” queue in the /r/worldnews subreddit and posts summaries of the articles that people link to.

I ran it for a few days last week and, after I worked out some of the kinks, it was ticking along nicely.

[Screenshot: BitesizeNewsBot posting summary comments in /r/worldnews]

The code

The whole bot comes in at under 80 lines of Python. Let’s step through what it is doing, piece by piece.

Connecting to Reddit using PRAW

Due to some heroic open source contributions, the Python Reddit API Wrapper is a really mature and stable library that gives you access to everything in the Reddit API. The library even has its own subreddit – /r/praw. We’re going to use it to log in with the bot’s account, periodically fetch the new submissions in /r/worldnews, and post comments on the submissions that contain compact summaries of the linked articles.

Here’s how we log in with PRAW:
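
    # A sketch only: recent PRAW versions authenticate with a script-type app's
    # credentials rather than a plain username/password login, so adjust this
    # to match the PRAW release you have installed.
    import praw

    reddit = praw.Reddit(
        client_id="CLIENT_ID",
        client_secret="CLIENT_SECRET",
        username="BitesizeNewsBot",
        password="PASSWORD",
        user_agent="BitesizeNewsBot (summarizes articles posted to /r/worldnews)",
    )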

Then we enter a loop of fetching the new submissions from /r/worldnews, summarizing them, and posting them back as comments. On each iteration, we sleep for ten minutes to be a good citizen of Reddit.

This line fetches the ten newest submissions:
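
    # "reddit" is the PRAW instance created above; limit=10 gives the ten newest posts.
    submissions = reddit.subreddit("worldnews").new(limit=10)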

When we get the submissions, we can iterate over them and prepare the summaries.

Summarizing text using PyTLDR

Automatic text summarization is a topic I am really interested in. I’ve implemented several summarization algorithms, but the point of this post is to show you how to make a bot, not how to do advanced natural language processing, so we’re going to use a great library called PyTLDR.

PyTLDR implements several summarization algorithms, but the one we’re going to use is TextRank. The summarization function looks like this:
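
    # A sketch of the wrapper described below. The import path follows PyTLDR's
    # layout as I understand it; double-check it against the version you install.
    from pytldr.summarize.textrank import TextRankSummarizer

    SUMMARIZER = TextRankSummarizer()


    def summarize_web_page(text_or_url, length=3):
        # PyTLDR accepts either raw article text or a URL; given a URL it uses
        # Goose behind the scenes to pull the article text out of the page.
        sentences = SUMMARIZER.summarize(text_or_url, length=length)
        return "\n".join(sentences)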

The summarize_web_page function takes either a string containing the article text or a URL. If we give it a URL, as we are doing here, it uses the Goose extractor behind the scenes to fetch the article text from the web page.

The function also takes a length parameter. If this is a value between zero and one, it represents the summary length as a fraction of the length of the original article. If it is greater than one, it represents a number of sentences. We have picked three as our summary length, which seems to strike the right balance between providing a useful summary and copying large pieces of the article.

The output of the summary function is a list of sentences. Before returning the summary from the function, we join them with newlines.

In the main loop, we call the function as follows:
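
    # "submission" comes from the loop over the new submissions fetched above.
    summary = summarize_web_page(submission.url)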

Commenting on submissions

Once we have got the summary, we can generate the comment and post it on the article submission. Before we do that, though, we have to do a sanity check on the summary. Because article extraction from web pages is inherently unreliable, sometimes the summarize_web_page  function will return an empty string. This piece of code in our main loop checks for that case and moves on to the next submission if we can’t generate a sensible summary for the current one:
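
    # No usable summary came back; skip this submission.
    if not summary.strip():
        continue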

Posting the comment can fail in many ways, so we need to catch several exceptions. As we want the same handler for each one (just print the exception and move on to the next iteration of the loop), we can catch them all in one line:
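
    # The exact exception classes vary between PRAW versions; these are illustrative.
    try:
        submission.reply(summary)
    except (praw.exceptions.APIException, praw.exceptions.ClientException) as e:
        print(e)
        continue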

Keeping track of seen submissions

We don’t want the bot to comment more than once on any article, so we keep track of them in a set that stores the unique identifiers of each post once the comment with the summary has been posted.

To persist the set of posts from one run of the program to the next, we will pickle it and store it in a file on disk.

At startup, we try to restore the set from disk, like so:
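
    import os
    import pickle

    SEEN_FILE = "seen_posts.pickle"  # the file name is arbitrary


    def load_seen_posts():
        if os.path.exists(SEEN_FILE):
            with open(SEEN_FILE, "rb") as f:
                return pickle.load(f)
        return set()


    SEEN = load_seen_posts()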

At shutdown, we store the set on disk:
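
    import atexit


    @atexit.register
    def save_seen_posts():
        with open(SEEN_FILE, "wb") as f:
            pickle.dump(SEEN, f)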

We are using the register decorator from the atexit  module to make sure that no matter how our program quits the save_seen_posts  function is called. It will be called even if you hit Ctrl-C in the terminal.

In the main loop, we add the submission ID to the SEEN  set right after posting the comment:
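
    # Post the reply, then immediately remember the submission.
    submission.reply(summary)
    SEEN.add(submission.id)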

With the set in place, we check whether the bot has already commented on the submission before trying to generate a summary:
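
    # Skip anything we have already handled, and skip self-posts.
    if submission.id in SEEN or submission.is_self:
        continue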

We also check if the submission is a self-post, because by definition they do not link to any news article.

And that’s really all there is to writing a simple Reddit bot. You can make it more complicated if you want, but BitesizeNewsBot demonstrates the basics.

The Python Help System

A few years ago I had an interview with a company using Python for their main product. At the time I was a beginner, so I wasn’t able to answer all their questions. One of the questions I choked on was this:

– If you’re given a new Python package and you don’t know how it works, how would you figure it out?

I said I would Google it and see if there was any documentation online. They followed up with:

– What would you do if you had no internet connection?

I told them I would read the code and see what I could learn from it. That wasn’t the answer they were looking for.

Python’s online help system

Python includes a built-in help system based on the pydoc module. In a terminal, type:
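
    python -m pydoc json    # substitute whichever module you are curious about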

A help page for the module will be printed to the console.

All pydoc does is generate the help page based on the docstrings in the module.

Happily, you’re not stuck scrolling through the terminal, man page-style. You can start a local web server that serves the documentation in HTML format by typing:
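
    python -m pydoc -p 8000    # serve the documentation over HTTP on port 8000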

Now if you go to http://localhost:8000  in your browser you will see an index page with links to the documentation pages for all the modules and packages installed on your system.

It will look something like this:

[Screenshot: the pydoc web interface index page]

Getting help in IDLE

In the REPL (IDLE or whatever alternative you are using), you can access the same help using the help  built-in function. This function is added to the built-in namespace (the things that are already defined when the interpreter starts up) by the site  module, which is automatically imported during initialization.
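
For example (the json module here is just an arbitrary illustration):

    >>> import json
    >>> help(json.dumps)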

Checking the attributes on an object

Sometimes you don’t need the full help text. You only want to see what attributes a certain object has so that you can get on with writing code. In that case, the dir built-in function can come in handy.

dir works in two modes. The first is when it is invoked without any arguments. In that mode, it returns a list of the names defined in the current local scope.

The second is when it is given an object as an argument. In that mode, it tries to return a list of relevant attributes from the object passed in to it. What that means depends on whether the object is a module or a class.

If it’s a module, dir  returns a list of the module’s attributes. If it’s a class, dir  returns a list of the class attributes, and the attributes of the base classes.

A more useful dir function

Usually when I need dir , I also want to know the types of the object’s attributes. Here’s a function that annotates the output of the normal dir  function with the string names of the types of each returned attribute:
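
    # One possible version; the function name and output format are my own choices.
    def dir_with_types(obj):
        """Like dir(), but append the name of each attribute's type."""
        return [
            "{0} ({1})".format(name, type(getattr(obj, name)).__name__)
            for name in dir(obj)
        ]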

I’ll leave it there for now. Don’t forget to play around with the help system to get a feel for it. You’ll be glad of it next time you’re stuck somewhere without an internet connection and want to do some coding.

Web scraping, article extraction and sentiment analysis with Scrapy, Goose and TextBlob

Scrapy is a cool Python project that makes it easy to write web scraping bots that extract structured information from normal web pages. You can use it to create an API for a site that doesn’t have one, perform periodic data exports, etc. In this article, we’re going to make a scraper that retrieves the newest articles on Hacker News. The site already has an API, and there is even a wrapper for it, but the goal of this post is to show you how to get the data by scraping it.

Once we have gotten the links and the article titles, we will extract the article itself using Goose, then do sentiment analysis on the article to figure out if it’s positive or negative in tone. For this last piece, we’ll make use of the awesome TextBlob library.

A note about screen scraping

The owners of some sites do not want you to crawl their data and use it for your own purposes. You should respect their wishes. Many sites put information about whether or not they are open to being crawled in robots.txt. Hacker News’ robots.txt indicates that they are open to all crawlers, but that crawlers should wait at least 30 seconds between requests. That’s what we will do.

Installing dependencies

We will use pip to install all the libraries that we need.
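
Something like the following should pull in everything we need (the Python 2 port of Goose is published on PyPI as goose-extractor):

    pip install scrapy goose-extractor textblob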

Generating a Scrapy project


Scrapy comes with a command line tool that makes it easy to generate project scaffolding. To generate the initial layout and files for our Hacker News project, go to the directory where you want to store the project and type this command:
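
    scrapy startproject hackernews    # "hackernews" is just the project name used here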

Let’s use the tree  command to see what has been generated:
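
    hackernews/
        scrapy.cfg
        hackernews/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py

(The exact set of files varies a little between Scrapy versions.)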

The best way to get to grips with the project structure is to dive right in and add code to these files, so we’ll go ahead and do that.

Creating a Hacker News item

In Scrapy, the output of a crawl is a set of items which have been scraped from a website, processed by a pipeline and output into a certain format. The starting point of any Scrapy project is defining what the items are supposed to look like. We’ll make a new item called HackerNewsItem  that represents a post on Hacker News with some associated data.

Open items.py  and change it to this:
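
    # A sketch of items.py matching the four fields described below.
    import scrapy


    class HackerNewsItem(scrapy.Item):
        link_title = scrapy.Field()
        url = scrapy.Field()
        sentiment = scrapy.Field()
        text = scrapy.Field()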

Each Scrapy item is a class that inherits from scrapy.Item . Each field on the item is an instance of scrapy.Field . HackerNewsItem  has four fields:

  • link_title  is the title of the post on HackerNews
  • url  is the URL pointed to by the post
  • sentiment  is a sentiment polarity score from -1.0 to 1.0.
  • text  is the text of the article extracted from the page pointed to by the URL

Scrapy is smart enough that we do not have to manually specify the types of these fields, as we have to do in a Django model, for instance.

Making a spider

The next step is to define a spider that starts crawling HackerNews from the front page and follows the “More” links at the bottom of the page down to a given depth.

In the spiders  subdirectory of your project, make a file called hackernews_spider.py  and change its contents to this:
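
    # A sketch of the spider. The import paths follow the older scrapy.contrib layout
    # used in this post, and the link regex and XPath selectors are assumptions based
    # on the page structure described below.
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor

    from hackernews.items import HackerNewsItem


    class HackerNewsSpider(CrawlSpider):
        name = "hackernews"
        allowed_domains = ["news.ycombinator.com"]
        start_urls = ["https://news.ycombinator.com/"]

        rules = (
            Rule(
                LinkExtractor(allow=(r"news\.ycombinator\.com/newest",)),
                callback="parse_item",
                follow=True,
            ),
        )

        def parse_item(self, response):
            # Each post lives in a <tr class="athing"> row.
            for article in response.xpath('//tr[@class="athing"]'):
                titles = article.xpath('.//td[@class="title"]/a/text()').extract()
                links = article.xpath('.//td[@class="title"]/a/@href').extract()
                if not titles or not links:
                    continue
                item = HackerNewsItem()
                item["link_title"] = titles[0]
                item["url"] = links[0]
                yield item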

Our spider inherits from scrapy.contrib.spiders.CrawlSpider . This type of spider follows links extracted from previously crawled pages in the same manner as a normal web crawler. As you can see, we have defined several attributes at the top of the class.

The name attribute specifies the name used for the crawler when using the command line tool to start a crawl. To start a crawl with our spider now, you can type:
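
    scrapy crawl hackernews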

The allowed_domains  list holds domains that the crawler will pay attention to. Any links to domains that are not in this list will be ignored.

start_urls  is a list of URLs to start the crawl at. In our case, only one is necessary: the Hacker News homepage.

The real meat of the implementation is in the rules  variable and the parse_item  method. rules  is an iterable that contains scrapy.contrib.spiders.Rule  instances. We only have one rule, which indicates that our spider should follow links that match the "news.ycombinator.com/newest"  regex and pass the pages behind those links to the handler defined by the parse_item  method. follow=True  tells the spider that it should recursively follow links found in those pages.

parse_item  uses XPath to grab a list of the articles on the page. Each article is in a table row element with a class of “athing”. We can grab a sequence of each matching element as follows:
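
    articles = response.xpath('//tr[@class="athing"]')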

Then, we iterate over that sequence and, for each article, we grab the title and the URL. For now, we’re not going to fill in the article text or the sentiment. These will be filled in as part of our item processing pipeline.

What we have so far is a complete spider. If you’ve been paying attention, you might realize that it will just keep following the “More” links at the bottom of the page until it runs out of pages. We definitely don’t want that. We only want to scrape the top few pages. Let’s sort that out. Open settings.py  and add a line that says:
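
    DEPTH_LIMIT = 10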

With the depth limit set to 10, we will only crawl the first ten pages. While you’re in there, you can also specify a download delay so that our spider respects Hacker News’ robots.txt.
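
A thirty-second delay matches what their robots.txt asks for:

    DOWNLOAD_DELAY = 30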

Now you can run the spider and output the scraped items into a JSON file like so:
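
    scrapy crawl hackernews -o articles.json    # the output file name is up to you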

The item pipeline

So far, we’re not extracting the text of any articles and we are not doing any sentiment analysis. First though, let’s add a filter to the item pipeline so that “self posts” – posts that are just somebody asking a question instead of linking to something –  are filtered out.

Open pipelines.py  and add this code:
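
    # A sketch of the filter component described below.
    import re

    from scrapy.exceptions import DropItem


    class DropSelfPostsPipeline(object):
        def process_item(self, item, spider):
            # Self posts link back to Hacker News itself: item?id=<number>
            if re.search(r"item\?id=[0-9]+", item["url"]):
                raise DropItem("Self post: %s" % item["link_title"])
            return item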

Scrapy pipeline objects must implement a process_item method and return the item or raise an exception. Links on Hacker News that point to self posts match the regex "item\?id=[0-9]+". If the URL of the item matches the regex, we raise a DropItem exception. That, obviously enough, causes the item to be dropped from the eventual output of the spider.

Components are not registered with the item pipeline until you add them to the ITEM_PIPELINES  list in the project’s settings.py . Add this code to register the DropSelfPostsPipeline  component.
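
    # Note: newer Scrapy releases expect ITEM_PIPELINES to be a dict mapping each
    # component's dotted path to a priority; the plain list matches this post's era.
    ITEM_PIPELINES = [
        "hackernews.pipelines.DropSelfPostsPipeline",
    ]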

Extracting articles with Goose

Goose is a library that allows you to extract the main article text from any web page. It is not perfect, but it is the best solution currently available in Python. Here is an example of the basic usage:
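
    from goose import Goose

    g = Goose()
    article = g.extract(url="http://example.com/some-article")  # any article URL
    print(article.cleaned_text)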

We can add a new pipeline component to extract the article text from the scraped link and save it in the text  field in the item. Open pipelines.py  again and add this code:
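
    # The class name here is my own; what matters is the try-except around extract().
    from goose import Goose


    class ExtractArticleTextPipeline(object):
        def __init__(self):
            self.goose = Goose()

        def process_item(self, item, spider):
            try:
                item["text"] = self.goose.extract(url=item["url"]).cleaned_text
            except IndexError:
                # Goose can choke on some page titles; fall back to an empty body.
                item["text"] = ""
            return item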

The code here is pretty much the same as the minimal example, except we have wrapped the goose.extract  call in a try-except block. There is a bug in the Goose library that can cause it to throw an IndexError  when parsing the titles of certain web pages. There is an open pull request on the Github repository that fixes this, but it hasn’t been merged yet. We’ll just work around it for the moment.

Add this component to the item pipeline by changing the ITEM_PIPELINES  list in settings.py .

Analyzing sentiment with TextBlob

TextBlob is a library that makes it easy to do standard natural language processing tasks in Python. Check out this sentiment analysis example to see just how easy it is:
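
    from textblob import TextBlob

    blob = TextBlob("This library is wonderful and easy to use.")
    print(blob.sentiment.polarity)  # a float between -1.0 (negative) and 1.0 (positive)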

Let’s add another item pipeline component to fill in the sentiment field on our items:
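
    from textblob import TextBlob


    class SentimentPipeline(object):
        def process_item(self, item, spider):
            item["sentiment"] = TextBlob(item["text"]).sentiment.polarity
            return item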

As before, we must add the SentimentPipeline  component to the ITEM_PIPELINES  list, which now looks like this:
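
    ITEM_PIPELINES = [
        "hackernews.pipelines.DropSelfPostsPipeline",
        "hackernews.pipelines.ExtractArticleTextPipeline",
        "hackernews.pipelines.SentimentPipeline",
    ]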

Now the crawler is complete. Run it using the same command we used before. When it is finished, you will have a file of Hacker News articles, annotated with the sentiment.

There is much more to Scrapy than we have looked at in this article. One of its most interesting features is the ability to integrate with Django by directly populating Django models with scraped data. There are also lots of options for outputting data and managing how the spider behaves, such as scraping using Firefox (for Javascript-heavy sites) and exporting with FTP and Amazon S3. Check out the Scrapy docs for more.

Implementing the famous ELIZA chatbot in Python

ELIZA is a conversational agent, or “chatbot”, first implemented in 1966 by Joseph Weizenbaum. It was meant to emulate a Rogerian psychologist. Since then there have been various implementations, more or less similar to the original one. Emacs ships with an ELIZA-type program built in. The CIA even experimented with computer-aided interrogation of officers using a very similar, but rather more combative, version of the program.

My implementation is based on one originally written by Joe Strout. I have updated it significantly to use a more modern and idiomatic form of Python, but the text patterns in the reflections  and psychobabble  data structures are copied essentially verbatim.

Implementation

Let’s walk through the source code. Copy this into a file called eliza.py .
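
Here is an abridged sketch of that file. The real reflections and psychobabble tables are far longer, but the structure is the same.

    import random
    import re

    # Abridged: the real tables contain many more entries.
    reflections = {
        "am": "are",
        "was": "were",
        "i": "you",
        "my": "your",
        "your": "my",
        "you": "me",
        "me": "you",
    }

    psychobabble = [
        [r"i need (.*)",
         ["Why do you need {0}?",
          "Would it really help you to get {0}?"]],
        [r"i am (.*)",
         ["How long have you been {0}?",
          "How do you feel about being {0}?"]],
        [r"(.*) mother(.*)",
         ["Tell me more about your mother.",
          "How do you feel about your mother?"]],
        [r"(.*)",
         ["Please tell me more.",
          "Can you elaborate on that?"]],
    ]


    def reflect(fragment):
        tokens = fragment.lower().split()
        for i, token in enumerate(tokens):
            if token in reflections:
                tokens[i] = reflections[token]
        return " ".join(tokens)


    def analyze(statement):
        for pattern, responses in psychobabble:
            match = re.match(pattern, statement.rstrip(".!"), re.IGNORECASE)
            if match:
                response = random.choice(responses)
                return response.format(*[reflect(g) for g in match.groups()])


    def main():
        print("Hello. How are you feeling today?")
        while True:
            statement = raw_input("> ")  # use input() on Python 3
            if statement.lower() == "quit":
                break
            print(analyze(statement))


    if __name__ == "__main__":
        main()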

Run it with python eliza.py  and see if you can trip it up. Try not to spill your guts to your new computer therapist!

You will notice that most of the source code is taken up by a dictionary called reflections  and a list of lists called psychobabble . ELIZA is fundamentally a pattern matching program. There is not much more to it than that.

reflections  maps first-person pronouns to second-person pronouns and vice-versa. It is used to “reflect” a statement back against the user.

psychobabble  is made up of a list of lists where the first element is a regular expression that matches the user’s statements and the second element is a list of potential responses. Many of the potential responses contain placeholders that can be filled in with fragments to echo the user’s statements.

main  is the entry point of the program. Let’s take a closer look at it.

First, we print the initial prompt, then we enter a loop of asking the user for input and passing what the user says to the analyze  function to get the therapist’s response. If at any point the user types “quit”, we break out of the loop and the program exits.

Let’s see what’s going on in analyze .

We iterate through the regular expressions in the psychobabble  array, trying to match each one with the user’s statement, from which we have stripped the final punctuation. If we find a match, we choose a response template randomly from the list of possible responses associated with the matching pattern. Then we interpolate the match groups from the regular expression into the response string, calling the reflect  function on each match group first.

There is one syntactic oddity to note here. When we use the list comprehension to generate a list of reflected match groups, we explode the list with the asterisk (*) character before passing it to the string’s format  method. Format expects a series of positional arguments corresponding to the number of format placeholders – {0}, {1}, etc. – in the string. A list or a tuple can be exploded into positional arguments using a single asterisk. Double asterisks (**) can be used to explode dictionaries into keyword arguments.

Now let’s examine the reflect  function.

There is nothing too complicated going on in it. First, we make the statement lowercase, then we tokenize it by splitting on whitespace characters. We iterate through the list of tokens and, if the token exists in our reflections  dictionary, we replace it with the value from the dictionary. So “I” becomes “you”, “your” becomes “my”, etc.

As you can see, ELIZA is an extremely simple program. The only real intelligence in it is involved in the creation of suitably vague response templates. Try fiddling with the psychobabble  list to extend ELIZA’s conversational range and give her a different tone.

Connecting Eliza to IRC

The command line version of ELIZA is pretty fun, but wouldn’t it be cool to let her loose on the internet? I’m going to show you how to hook up the program we have already written to an IRC bot that connects to a public server, creates its own channel and carries on conversations with real human beings.

We’re going to use the SingleServerIRCBot  in the irc  package. You can install it with pip.
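
    pip install irc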

Copy this code into a file called elizabot.py .
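
A sketch of what that file might contain; the command line handling in main is my own arrangement.

    import sys

    from irc.bot import SingleServerIRCBot

    import eliza


    class ElizaBot(SingleServerIRCBot):
        def __init__(self, channel, nickname, server, port=6667):
            SingleServerIRCBot.__init__(self, [(server, port)], nickname, nickname)
            self.channel = channel

        def on_welcome(self, connection, event):
            # Join our channel as soon as the server lets us in.
            connection.join(self.channel)

        def on_pubmsg(self, connection, event):
            message = event.arguments[0]
            nick = event.source.nick
            bot_nick = self.connection.get_nickname()
            # Only respond to messages addressed to the bot, e.g. "Elizabot: hello".
            if message.startswith(bot_nick):
                statement = message[len(bot_nick):].lstrip(":, ")
                response = eliza.analyze(statement)
                connection.privmsg(self.channel, "%s: %s" % (nick, response))


    def main():
        if len(sys.argv) != 4:
            print("Usage: elizabot.py <server> <channel> <nickname>")
            sys.exit(1)
        server, channel, nickname = sys.argv[1], sys.argv[2], sys.argv[3]
        ElizaBot(channel, nickname, server).start()


    if __name__ == "__main__":
        main()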

Let’s go through it. The SingleServerIRCBot  class gives us some hooks we can use to respond to server events. We can make the bot join the given channel automatically by overriding the on_welcome  method.

Now, we have to listen to messages on the channel we joined and check if they are addressed to the bot. If they are, we pass the message to analyze  from the eliza  module and write the response back to the channel, prefixed with the nick of the user who sent the message.

We do that by overriding the on_pubmsg  method.

The IF statement in this method checks that the received message is prefixed with the bot’s nickname. Only then do we generate and send a response.

There is a little subtlety involved in sending messages. To send a message to a channel, we have to use the privmsg  method on the connection  object passed into the on_pubmsg  method, giving the name of the channel as the first argument. Fairly unintuitive, but easy once you know.

The rest of the script is straightforward. It just consists of a main  function that reads the command line arguments and starts the bot.

To run the script and connect the bot to Freenode, type this command:
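
    # argument order matches the sketch above: server, channel, nickname
    python elizabot.py irc.freenode.net "#ElizaBot" Elizabot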

The bot will connect to the server, grab the nickname “Elizabot”, and join the #ElizaBot channel.

Here’s a demo of the bot in action:

[Animated demo: a conversation with Elizabot on IRC]

Understand WSGI by building a microframework

When you’re learning web development in Python, it’s tempting to go straight for higher level frameworks like Django and Flask that abstract the interactions between the web server and the application. While this is certainly a productive way of getting started, it’s a good idea to go back to the lower level at some point so you understand what these frameworks are doing for you. In this post, you will learn about the Web Server Gateway Interface (WSGI) – the standard interface between web servers like nginx and Apache and Python applications. You’ll do that by working from a simple “Hello, world!” WSGI example up to a microframework that supports Flask-like URL routing with decorators, templating, and lets you code your application logic inside simple Django-like controller functions.

The simplest WSGI application

Take a look at this “Hello, world!” code:
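
    from wsgiref.simple_server import make_server


    def hello_world(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        # The body must be an iterable of bytes.
        return [b"Hello, world!"]


    server = make_server("localhost", 8000, hello_world)
    server.serve_forever()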

The make_server function imported from wsgiref.simple_server  is part of the reference WSGI implementation included in the Python standard library. It returns a web server instance that you can start by calling its serve_forever  method. make_server  takes three arguments: the hostname, the port and the WSGI application itself. In this case, the hello_world  function is the WSGI application. WSGI applications have to be Python callables, i.e. either functions or classes that implement a  __call__  method.

Now let’s take a look at hello_world. You’ll notice that it has two arguments: environ and start_response. environ is a dictionary that holds information about the execution environment and the request, such as the path, the query string, the body contents and HTTP headers. start_response is a function that starts the HTTP response by writing out the status code and the headers.

Finally, the function returns a list with a single string in it. This is because the return value of a WSGI application must be an iterable. (Strings are iterable too, of course, but iterating over a string and writing it out one character at a time is pretty slow.)

Save this code in a file called helloworld.py and run it. Then open a browser and go to http://localhost:8000 . You should see “Hello, world!”. (If you don’t, check that you don’t have anything else running on port 8000.)

What the framework will look like

So far, the application is very limited. In your browser, go to http://localhost:8000/helloworld/ . You will see “Hello, world!” again. As it stands, the application returns the same response for every path you try. It would be much nicer to be able to write something like this (MicroFramework is the framework class, which you will see later):
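
    # The decorator-based API sketched here (route regexes, Request in, Response out)
    # is an assumption that the rest of the post fleshes out.
    app = MicroFramework()


    @app.route(r"^/$")
    def home(request):
        return Response("Welcome home!")


    @app.route(r"^/helloworld/?$")
    def hello_world(request):
        return Response("Hello, world!")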

You’ll learn about the route decorators in a moment. For the moment, just take a look at what is happening in the home and hello_world controller functions.

The controller functions are taking a Request  object as a parameter and returning a Response object. This is much cleaner than trying to take everything out of the WSGI environ  dictionary and then calling start_response , setting headers manually every time.

Let’s start with the Request class. It’s just a wrapper that extracts information from the environ dictionary and makes it accessible in a more convenient way.

The Request class
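
A sketch of what it might look like; the attribute names are my own choices, guided by the description that follows.

    try:                      # Python 2
        from urlparse import parse_qs
    except ImportError:       # Python 3
        from urllib.parse import parse_qs


    class Request(object):
        def __init__(self, environ):
            self.environ = environ
            self.path = environ.get("PATH_INFO", "/")
            self.method = environ.get("REQUEST_METHOD", "GET")
            self.query_string = environ.get("QUERY_STRING", "")
            self.params = []  # filled in by the framework from route capture groups
            # HTTP headers arrive in environ with an "HTTP_" prefix.
            self.headers = {
                key[5:]: value
                for key, value in environ.items()
                if key.startswith("HTTP_")
            }
            self.GET = parse_qs(self.query_string)
            self.POST = {}
            if self.method == "POST":
                content_length = int(environ.get("CONTENT_LENGTH") or 0)
                body = environ["wsgi.input"].read(content_length)
                if isinstance(body, bytes):
                    body = body.decode("utf-8")
                self.POST = parse_qs(body)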

Most of the code in this class is just extracting keys from environ , but a few lines deserve a special mention. HTTP headers sent with the request are stored in environ in keys with the “HTTP_” prefix, so you can use a dict comprehension to extract them and store them in self.headers:
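
    self.headers = {
        key[5:]: value
        for key, value in environ.items()
        if key.startswith("HTTP_")
    }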

GET and POST data

Parsing the query string to extract GET and POST data is also interesting. The urlparse  module in the standard library contains a function called parse_qs  that takes a standard HTTP query string in the format ?key1=value1&key2=value2  and converts it into a dictionary that maps each key to a list of values. To extract GET data and store it in self.GET , you can call parse_qs(environ["QUERY_STRING"]) .

Extracting POST data is a bit more complicated as it is contained in the body of the HTTP request. First, you have to check if the HTTP method is POST, then read the content length, then get the POST query string by reading that number of bytes from the input stream. Finally, you call parse_qs  on the query string.

With the Request class in place, we can instantiate a request object like so:
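
    request = Request(environ)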

The Response class

What about Response ? It follows a similar principle:
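
    # A sketch; the defaults and argument names here are my own choices.
    class Response(object):
        def __init__(self, body="", status="200 OK", headers=None):
            self.body = body
            self.status = status
            self.headers = {"Content-Type": "text/html"}
            if headers is not None:
                self.headers.update(headers)

        @property
        def wsgi_headers(self):
            # start_response wants a list of (name, value) tuples, not a dict.
            return list(self.headers.items())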

As in the Request  class, most of the code here is just for convenience. You can assume that the default response code will be “200 OK” and allow the code to be set manually when it differs.

It is more natural to manipulate response headers as a dictionary, but they need to be passed to start_response  as a list of tuples, so the wsgi_headers  method, with the @property  decorator, returns them in that structure. You will see how this is used when you take a look at the __call__  method in the framework class.

Now you can build different types of responses by passing different keyword arguments. How about a normal “200 OK” response?
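
    response = Response("<h1>Welcome home!</h1>")  # using the Response sketch above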

What about a “404 Not Found”?
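
    response = Response("<h1>Not Found</h1>", status="404 Not Found")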

Or what if an error occurs?
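
    response = Response("<h1>Internal Server Error</h1>", status="500 Internal Server Error")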

And what if you want to issue a redirect? It is as simple as setting the “302” status code and adding a location header.
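
    # the redirect target is illustrative
    response = Response(status="302 Found", headers={"Location": "/somewhere-else/"})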

The MicroFramework class

This is where it all comes together:
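
    # A sketch of the framework class; the behaviour follows the description below.
    import re
    import traceback


    class MicroFramework(object):
        def __init__(self):
            self.routes = {}  # maps path regexes to controller functions

        def route(self, pattern):
            # Decorator that registers a controller function for the given path regex.
            def decorator(controller):
                self.routes[pattern] = controller
                return controller
            return decorator

        def __call__(self, environ, start_response):
            request = Request(environ)
            response = self.dispatch(request)
            start_response(response.status, response.wsgi_headers)
            body = response.body
            if isinstance(body, str):
                body = body.encode("utf-8")
            return [body]

        def dispatch(self, request):
            for pattern, controller in self.routes.items():
                match = re.search(pattern, request.path)
                if match:
                    request.params = list(match.groups())
                    try:
                        return controller(request)
                    except Exception:
                        traceback.print_exc()
                        return self.internal_error()
            return self.not_found()

        @staticmethod
        def not_found():
            return Response("<h1>Not Found</h1>", status="404 Not Found")

        @staticmethod
        def internal_error():
            return Response("<h1>Internal Server Error</h1>",
                            status="500 Internal Server Error")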

Look at the __call__ function. It gets back an instance of Response by first building a Request object based on environ and passing that to the framework’s dispatch method. Then it calls start_response with the status code and the headers from the response. Notice how wsgi_headers is used.

Dispatching requests to controllers

The next interesting thing in the framework class is the dispatch  method. In the constructor, the  self.routes dictionary is initialized. It contains a mapping from regular expressions that represent request paths to controller functions. The method iterates through the regular expressions until it finds one that matches the request path, then it calls the associated controller function and returns the response from it to __call__ . If no route matches the path, it returns a “404” response generated by the not_found  static method.

If an error occurs while executing the controller function, the framework grabs the stack trace, prints it, and returns a “500” response generated by the internal_error  static method.

The route decorator

How do routes get into the self.routes  dictionary in the first place? That’s where the route  decorator comes in. All it does is add a mapping from the regex provided as an argument to the decorator to the controller function itself. The regexes can also contain capture groups that are stored in the request.params  list and made available to controller functions, as in the following example:
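
    @app.route(r"^/articles/([0-9]+)/?$")
    def show_article(request):  # the controller name is illustrative
        article_id = request.params[0]
        return Response("<h1>Article {0}</h1>".format(article_id))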

In this route regex, the capture group is a sequence of one or more numeric characters. The question mark (?) after the last slash makes the slash optional.

Integrating Jinja2

So far, the framework has no support for templating, but it is easy to integrate Jinja2 or any other template engine. Here is a simple example, which assumes that you have a template directory called “templates” with a file called “helloworld.html” in it.
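
Something along these lines, assuming the app object from earlier:

    from jinja2 import Environment, FileSystemLoader

    env = Environment(loader=FileSystemLoader("templates"))


    @app.route(r"^/hello/?$")
    def hello(request):
        template = env.get_template("helloworld.html")
        return Response(template.render(name="world"))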

This post showed you how to build a super simple web framework in Python, but this is just the beginning. For instance, the framework doesn’t support cookies or sessions, and there is no database access facility. Caching, form handling and other niceties that Django and Flask provide either out of the box or as plugins would also need to be added to turn this into a fully featured framework.

Why don’t you try to implement some of these features yourself?

How to obfuscate Python source code

A lot of the Python code you will come across is open source. The whole point is to distribute it freely, share knowledge and let people play around with it and learn from it.

Sometimes, though, you might want to prevent the end-user from reading the code. Maybe you are selling commercial software or maybe you just want to share the solution to a tricky coding challenge with your friends without giving the game away.

Whatever your reasons, there are a few approaches you can take:

Using pyobfuscate

One is obfuscation. I like a package called pyobfuscate. It transforms your normal and clearly written (right?) Python source code into new source code that is hard to read, by mangling whitespace, renaming functions and variables, stripping comments, and so on.

The package doesn’t seem to be on PyPI, but you can install it from the Github repo:
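
    pip install git+https://github.com/astrand/pyobfuscate.git    # repo location at the time of writing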

Let’s try it out. Save the following code in example.py :
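
    # A small, deliberately readable example to feed to pyobfuscate.
    def greet(name):
        """Return a friendly greeting."""
        greeting = "Hello, %s!" % name
        return greeting


    if __name__ == "__main__":
        print(greet("world"))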

Obfuscate it using the pyobfuscate  command, which should be on your path now that you have installed the package:
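
    pyobfuscate example.py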

The obfuscated code is printed to the console, and it is pretty much illegible!

Unfortunately pyobfuscate only works on one source file at a time, so it’s not really suitable for large projects. It also appears to only work with Python 2 at the moment.

Distributing bytecode

Another, arguably easier, method is to just distribute the .pyc files. The Python standard library includes a compileall module that can scan your source directory and compile all of your files into Python bytecode. Then you can distribute them without the source files. The .pyc files can still be decompiled into source code, but the code will not be as readable as it was before.
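
Compiling everything under the current directory is a one-liner:

    python -m compileall .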

One problem with this method is that the initial .py script that you run cannot be compiled in this way. You can solve this problem by making a simple wrapper script that gives away no information about your program.
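
The wrapper can be as small as this; the package and entry-point names are hypothetical.

    # run.py - the only plain-text file you ship; the real logic stays compiled.
    from app.main import main  # hypothetical compiled package and entry point

    if __name__ == "__main__":
        main()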

These two methods are really just a deterrent, not a secure way of hiding the code.

If you want something a bit more robust, you should take a look at Nuitka, which compiles Python code to C++, so you can compile that and just distribute the executable. It seems to be broadly compatible with different libraries and different versions of Python.

How to validate web requests in Django using a shared secret

I was talking to one of the guys in my co-working space the other day. He’s learning Python and building a backend for a mobile app that sends JSON data to a web service endpoint he has built using Django. The issue he was running into was how to validate that the requests were actually coming from his app. This post explains how to do it with a shared secret.

Restricting access to views

Let’s say you have a Django view function that looks something like this.
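
    # A hypothetical endpoint for the app's JSON data; the view name is made up.
    import json

    from django.http import JsonResponse
    from django.views.decorators.http import require_http_methods


    @require_http_methods(["POST"])
    def receive_data(request):
        data = json.loads(request.body)
        # ... do something useful with the data ...
        return JsonResponse({"status": "ok"})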

Notice how you are using the require_http_methods  decorator to restrict the allowed HTTP methods to POST. If somebody makes a non-POST request to this view, Django will return a 405 status code.

Verifying requests using HMAC

You can use an HMAC (hash-based message authentication code) to make sure that (1) the sender of the message has the shared secret, and (2) the message contents have not been tampered with.

There are two inputs to the HMAC function: the shared secret, or “key”, and the message itself, which for us is the body of the HTTP request. The hash function is SHA-256. The Base64-encoded HMAC can then be sent as an HTTP header and used on the receiving end to verify the message.

In this case, the HMAC is sent in the “X-MY-APP-HMAC” header. The following piece of code extends the Django view function to validate the request, assuming the shared secret is stored in your Django settings file at myapp.settings.MY_APP_SHARED_SECRET .
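
The extended view might look something like this:

    import base64
    import hashlib
    import hmac
    import json

    from django.conf import settings
    from django.http import HttpResponse, JsonResponse
    from django.views.decorators.http import require_http_methods


    @require_http_methods(["POST"])
    def receive_data(request):
        received_hmac = request.META.get("HTTP_X_MY_APP_HMAC", "")
        expected_hmac = base64.b64encode(
            hmac.new(
                settings.MY_APP_SHARED_SECRET.encode("utf-8"),
                request.body,
                hashlib.sha256,
            ).digest()
        ).decode("ascii")
        if not hmac.compare_digest(received_hmac, expected_hmac):
            return HttpResponse("Invalid HMAC", status=401)

        data = json.loads(request.body)
        return JsonResponse({"status": "ok"})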

The 401 HTTP status code is used when some authentication has not been provided, so you can use it here to tell the client that the request has been denied because of the invalid HMAC.

Wrapping it up in a decorator

The code above will work fine, but it’s a bit messy. What if you have multiple views that need to verify their requests? You would end up copying and pasting code. You could move the verification code out to a function and call it first thing, but a much better option is to turn it into a decorator.
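
A sketch of such a decorator:

    import base64
    import hashlib
    import hmac
    from functools import wraps

    from django.conf import settings
    from django.http import HttpResponse


    def validate_request_hmac(view_func):
        @wraps(view_func)
        def wrapper(request, *args, **kwargs):
            received_hmac = request.META.get("HTTP_X_MY_APP_HMAC", "")
            expected_hmac = base64.b64encode(
                hmac.new(
                    settings.MY_APP_SHARED_SECRET.encode("utf-8"),
                    request.body,
                    hashlib.sha256,
                ).digest()
            ).decode("ascii")
            if not hmac.compare_digest(received_hmac, expected_hmac):
                return HttpResponse("Invalid HMAC", status=401)
            return view_func(request, *args, **kwargs)
        return wrapper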

Then you can simply decorate your view function like so:
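
    @require_http_methods(["POST"])
    @validate_request_hmac
    def receive_data(request):
        data = json.loads(request.body)
        return JsonResponse({"status": "ok"})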

Remember that stacked decorators are applied from the bottom up, so require_http_methods ends up as the outermost wrapper and its method check runs before validate_request_hmac sees the request.

That’s it. How to generate requests with the proper headers on the sender’s side depends on the platform. I’ll leave it as an exercise.

How Python modules and packages work

One of the things that I wish I had found a clear explanation for when I was learning Python was packages, modules and the purpose of the __init__.py  file.

When you start out, you usually just dump everything into one script. Although this is great for prototyping stuff, and works fine for programs up to a thousand lines or so, beyond that point your files are just too big to work with easily. You also have to resort to copying and pasting to reuse functions from one program to the next.

Take this script, for instance. This code is saved in a file called add_nums.py .
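
    # add_nums.py (following the post's Python 2 syntax)
    def add_nums(num1, num2):
        return num1 + num2

    print add_nums(2, 5)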

This file on its own is a Python module. If you want to reuse the function add_nums  in another program, you can import the script into another program as a module:
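
    import add_nums

    total = add_nums.add_nums(3, 4)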

But there’s a problem. When a module is imported, all the code at the top level is executed, so print add_nums(2, 5)  will run and your program will print “7”. There’s a little trick we can use to prevent such unwanted behaviour. Just wrap the top-level code in a main function and only run it if add_nums.py  is being run as a script and not imported as a module.
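
    # add_nums.py, reorganized with a main() function and the __name__ check.
    def add_nums(num1, num2):
        return num1 + num2


    def main():
        print add_nums(2, 5)


    if __name__ == "__main__":
        main()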

When your script is run directly, using ./add_nums.py  or python add_nums.py , the __name__  global variable is set to "__main__". Otherwise it is set to the name of the module. So by wrapping the invocation of the main function in an IF statement, you can make your script behave differently depending on whether it is being run as a script or imported.

Ok, so what about packages?

A package is just a directory with an __init__.py file in it. This file contains code that is run when the package is imported. A package can also contain other Python modules and even subpackages.

Let’s imagine we have a package called foo . It is composed of a directory called foo , an __init__.py  file, and another file called bar.py  that contains function definitions.
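
    foo/
        __init__.py
        bar.py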

Ok. Now let’s imagine that __init__.py is empty and there is a function called baz defined in bar.py. People who are just getting started making packages and don’t really understand how they work tend to make an empty __init__.py and then they magically find that they can import their package. Often they are copying what they have seen the Django startapp command do.

To be able to call baz, you have to import it like this:
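
    from foo.bar import baz

    baz()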

That’s quite ugly. What we really want to do is this:
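
    from foo import baz

    baz()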

What do we have to do to make that work? We have two options. Either we move the definition of baz  into __init__.py  or we just import it in __init__.py. We’ll go for the second option. Change __init__.py to this:
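
    # foo/__init__.py
    from .bar import baz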

Now, you can import baz from the package directly without referencing the bar module.

So, what kind of stuff should you put in __init__.py ? Firstly, anything to do with package initialization, for instance, reading a data file into memory. Remember that when you import a package, everything in __init__.py is executed, so it’s the perfect place for setting things up.

I also like to use it to import or define anything that makes up the package’s public interface. Although Python doesn’t have the concept of private and public methods like Java has, you should still strive to make your package API as clean as possible. Part of that is making the functions and classes you want people to use easy to access.

That’s it for now. Go make a package!

How to generate PDF reports with Jinja2 and PyQt

There are quite a few options for PDF generation in Python, but nothing fully open-source that ticks all the boxes. The Reportlab library is probably the most fully-featured solution available right now. Unfortunately, the open-source version doesn’t support templates, so unless you cough up for a license you’re going to be stuck manipulating the PDF format at a very low level.

Recently, I had a requirement to generate some simple PDF reports. I didn’t want to write lots of boilerplate and hoped I would be able to use some kind of templating. In the end I came up with a solution based on Jinja2 and Qt’s QWebView widget. First it renders the contents as HTML, then it uses the “Print as PDF” functionality of the QWebView  to save a PDF file. It’s a bit of a hack, but it gets the job done.

Here’s the main file, htmltopdf.py .
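
A sketch of what that file contains, pieced together from the walkthrough below; the output file name and the sample data passed to the template are my own stand-ins (the original used the loremipsum package for placeholder text).

    import sys

    from jinja2 import Environment, PackageLoader
    from PyQt4.QtGui import QApplication, QPrinter
    from PyQt4.QtWebKit import QWebView

    # The Jinja2 environment loads templates from the "templates" directory.
    env = Environment(loader=PackageLoader("htmltopdf", "templates"))


    def render_template(template_filename, **context):
        return env.get_template(template_filename).render(**context)


    def print_pdf(html, destination="report.pdf"):
        app = QApplication(sys.argv)

        webview = QWebView()
        webview.setHtml(html)

        printer = QPrinter()
        printer.setPageSize(QPrinter.A4)
        printer.setOutputFormat(QPrinter.PdfFormat)
        printer.setOutputFileName(destination)

        webview.print_(printer)
        app.exit()


    def main():
        # Placeholder content for the report template.
        html = render_template(
            "report.html",
            title="Monthly report",
            paragraphs=["Lorem ipsum dolor sit amet."] * 3,
        )
        print_pdf(html)


    if __name__ == "__main__":
        main()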

I also made a subdirectory for Jinja2 templates called, originally enough, templates. Inside it are two files, base.html  and report.html .

These templates are just examples. This is where you can get creative with CSS, etc. All I’ve done is provide a minimal scaffold so you can see what’s going on.

Let’s walk through htmltopdf.py  and figure out how it works.

The first thing we need to do is set up the Jinja2 environment. Although it is possible to instantiate a jinja2.Template  object by passing it a string holding the template text, instantiating the environment gives us access to template inheritance and other cool features. To set up the environment, we have to pass it a loader object. For our purposes we can use the basic package loader, which takes as arguments the name of the package (or the name of the file, in this case) and the template directory inside it. Our template directory is  templates .

Here is the function that we use to render a particular template. You can get the Template  object from the given template file name by calling get_template  on the environment. Once you have it, you  call render  on it, passing in the necessary information as keyword arguments.

Now let’s take a look at the print_pdf  function, which handles laying out the HTML from our render_template  function and printing it to a file.

We won’t be able to do much with the QWebView  unless we instantiate QApplication , so we bookend print_pdf  by constructing a QApplication  instance and finally calling exit  on it.

Next, we create the QWebView . Seeing as we already have the HTML we want to display in it, we can call setHtml  on the webview. If you want to load an external URL, for snapshotting web pages, etc., you have to use the QWebView.load  function to set its URL. In that case, you will need to register a signal handler to listen for the loadFinished()  signal, but seeing as we are just providing the HTML directly we don’t need to bother with that.

After we have injected the HTML into the webview, we get a QPrinter  and configure it to print an A4-size PDF document, by calling its  setPageSize , setOutputFormat  and setOutputFileName  methods. Other page sizes and output formats are also supported.

That covers everything novel in this approach. The main function just ties it all together and generates some sample data for the template. I found the loremipsum  package handy for quickly getting my hands on placeholder text.

Here’s what our generated PDF looks like:

[Screenshot: the rendered PDF report]