How to validate web requests in Django using a shared secret

I was talking to one of the guys in my co-working space the other day. He’s learning Python and building a backend for a mobile app that sends JSON data to a web service endpoint he has built using Django. The issue he was running into was how to validate that the requests were actually coming from his app. This post explains how to do it with a shared secret.

Restricting access to views

Let’s say you have a Django view function that looks something like this.

Notice how you are using the require_http_methods  decorator to restrict the allowed HTTP methods to POST. If somebody makes a non-POST request to this view, Django will return a 405 status code.

Verifying requests using HMAC

You can use an HMAC (hash-based message authentication code) to make sure that (1) the sender of the message has the shared secret, and (2) the message contents have not been tampered with.

There are two inputs to the HMAC function: the shared secret, or the “key”, and the message itself, which for us is the body of the HTTP request. The hash function is SHA256. The Base64-encoded HMAC can then be sent as an HTTP header and used on the receiving end to verify the message.
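As a rough illustration of those two inputs, here is a minimal sketch using only the standard library. The secret, the sample body and the helper name sign_body are all made up for the example.

```python
import base64
import hashlib
import hmac

# Hypothetical values, for illustration only.
SHARED_SECRET = b"not-a-real-secret"
body = b'{"score": 42}'


def sign_body(secret, body):
    """Return the Base64-encoded HMAC-SHA256 of body, keyed with secret."""
    digest = hmac.new(secret, body, hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


signature = sign_body(SHARED_SECRET, body)
print(signature)  # this is the value that would go in the HTTP header
```

The same key and body always produce the same signature, while changing a single byte of the body produces a different one.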

In this case, the HMAC is sent in the “X-MY-APP-HMAC” header. The following piece of code extends the Django view function to validate the request, assuming the shared secret is stored in your Django settings file at myapp.settings.MY_APP_SHARED_SECRET .
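The core of that check does not depend on Django, so it can be sketched with the standard library alone. The helper name is_valid_hmac is an assumption, not something from the post. In a Django view the body would come from request.body and the header from request.META["HTTP_X_MY_APP_HMAC"].

```python
import base64
import hashlib
import hmac


def is_valid_hmac(secret, body, received_header):
    """Check a Base64-encoded HMAC-SHA256 header value against the body.

    secret and body are bytes; received_header is the header value as text.
    """
    expected = base64.b64encode(hmac.new(secret, body, hashlib.sha256).digest())
    # compare_digest takes the same time wherever the first mismatch is,
    # which avoids leaking information through response timing.
    return hmac.compare_digest(expected, received_header.encode("utf-8"))


# Hypothetical values to exercise the helper.
secret = b"not-a-real-secret"
body = b'{"score": 42}'
good_header = base64.b64encode(
    hmac.new(secret, body, hashlib.sha256).digest()
).decode("ascii")

print(is_valid_hmac(secret, body, good_header))         # valid signature
print(is_valid_hmac(secret, body + b"x", good_header))  # tampered body
```

A view would return a 401 response whenever this helper returns False.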

The 401 (Unauthorized) status code tells a client that the request lacks valid authentication, so you can use it here to signal that the request was denied because of a missing or invalid HMAC.

Wrapping it up in a decorator

The code above will work fine, but it’s a bit messy. What if you have multiple views that need to verify their requests? You would end up copying and pasting code. You could move the verification code out to a function and call it first thing, but a much better option is to turn it into a decorator.
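Here is one way such a decorator could look. It is a sketch rather than the post's own listing. To keep it runnable without Django it returns bare status numbers and uses stand-in request objects. The secret is hard-coded here, whereas the post keeps it in the settings file.

```python
import base64
import functools
import hashlib
import hmac
from types import SimpleNamespace

SHARED_SECRET = b"not-a-real-secret"  # hypothetical; normally read from settings


def validate_request_hmac(view):
    """Call the view only if the request's HMAC header matches its body."""
    @functools.wraps(view)
    def wrapper(request, *args, **kwargs):
        expected = base64.b64encode(
            hmac.new(SHARED_SECRET, request.body, hashlib.sha256).digest()
        )
        received = request.META.get("HTTP_X_MY_APP_HMAC", "").encode("utf-8")
        if not hmac.compare_digest(expected, received):
            return 401  # in Django this would be HttpResponse(status=401)
        return view(request, *args, **kwargs)
    return wrapper


@validate_request_hmac
def my_view(request):
    return 200  # stand-in for a real HttpResponse


# Stand-in request objects, only to exercise the decorator.
body = b'{"score": 42}'
good = base64.b64encode(
    hmac.new(SHARED_SECRET, body, hashlib.sha256).digest()
).decode("ascii")
ok_request = SimpleNamespace(body=body, META={"HTTP_X_MY_APP_HMAC": good})
bad_request = SimpleNamespace(body=body, META={})

print(my_view(ok_request), my_view(bad_request))
```

functools.wraps keeps the view's name and docstring intact, which helps when debugging.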

Then you can simply decorate your view function like so:

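Remember that decorators are applied from the inside out (bottom to top), but when the view is called the outermost wrapper runs first. If you want the HTTP method check to happen before the HMAC check, put require_http_methods above validate_request_hmac.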
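Decorator ordering is easy to get wrong, so here is a small standalone sketch that records what happens and when. The tag decorator is hypothetical and exists only to log events.

```python
import functools

calls = []


def tag(name):
    """Hypothetical decorator factory that logs when it is applied and entered."""
    def decorator(func):
        calls.append("apply " + name)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            calls.append("enter " + name)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@tag("outer")
@tag("inner")
def view():
    calls.append("view")


view()
print(calls)
```

The decorator nearest the function is applied first, but at call time the topmost wrapper is entered first.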

That’s it. How to generate requests with the proper headers on the sender’s side depends on the platform. I’ll leave it as an exercise.

How Python modules and packages work

One of the things that I wish I had found a clear explanation for when I was learning Python was packages, modules and the purpose of the __init__.py  file.

When you start out, you usually just dump everything into one script. Although this is great for prototyping stuff, and works fine for programs up to a thousand lines or so, beyond that point your files are just too big to work with easily. You also have to resort to copying and pasting to reuse functions from one program to the next.

Take this script, for instance. This code is saved in a file called add_nums.py .

This file on its own is a Python module. If you want to reuse the function add_nums  in another program, you can import the script into another program as a module:

But there’s a problem. When a module is imported, all the code at the top level is executed, so print add_nums(2, 5)  will run and your program will print “7”. There’s a little trick we can use to prevent such unwanted behaviour. Just wrap the top-level code in a main function and only run it if add_nums.py  is being run as a script and not imported as a module.

When your script is run directly, using ./add_nums.py or python add_nums.py, the __name__ global variable is set to "__main__". Otherwise it is set to the name of the module. So by wrapping the invocation of the main function in an if statement, you can make your script behave differently depending on whether it is being run as a script or imported.
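A minimal sketch of add_nums.py with this guard in place might look like the following. It uses Python 3 print syntax, and the function body is assumed from the description above.

```python
def add_nums(a, b):
    """Return the sum of two numbers."""
    return a + b


def main():
    # Only meant to run when the file is executed as a script.
    print(add_nums(2, 5))


if __name__ == "__main__":
    main()
```

Importing this file from another program defines add_nums and main but prints nothing.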

Ok, so what about packages?

A package is just a directory with an __init__.py file in it. This file contains code that is run when the package is imported. A package can also contain other Python modules and even subpackages.

Let’s imagine we have a package called foo . It is composed of a directory called foo , an __init__.py  file, and another file called bar.py  that contains function definitions.

Ok. Now let’s imagine that __init__.py is empty and there is a function called baz defined in bar.py. People who are just getting started making packages and don’t really understand how they work tend to make an empty __init__.py and then they magically find that they can import their package. Often they are copying what they have seen the Django startapp command do.

To be able to call baz, you have to import it like this:

That’s quite ugly. What we really want to do is this:

What do we have to do to make that work? We have two options. Either we move the definition of baz  into __init__.py  or we just import it in __init__.py. We’ll go for the second option. Change __init__.py to this:

Now, you can import baz from the package directly without referencing the bar module.
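To try this without creating files by hand, the sketch below builds the foo package in a temporary directory and imports baz from it. The body of baz and the temporary location are invented for the example.

```python
import importlib
import os
import sys
import tempfile

# Build the package layout described above in a throwaway directory.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "foo")
os.mkdir(pkg_dir)

with open(os.path.join(pkg_dir, "bar.py"), "w") as f:
    f.write("def baz():\n    return 'baz called'\n")

# __init__.py re-exports baz so callers do not need to know about bar.
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("from foo.bar import baz\n")

# Make the temporary directory importable, then import from the package.
sys.path.insert(0, root)
importlib.invalidate_caches()

from foo import baz

print(baz())
```

The function still lives in foo.bar; __init__.py only exposes it under a shorter path.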

So, what kind of stuff should you put in __init__.py ? Firstly, anything to do with package initialization, for instance, reading a data file into memory. Remember that when you import a package, everything in __init__.py is executed, so it’s the perfect place for setting things up.

I also like to use it to import or define anything that makes up the package’s public interface. Although Python doesn’t have the concept of private and public methods like Java has, you should still strive to make your package API as clean as possible. Part of that is making the functions and classes you want people to use easy to access.

That’s it for now. Go make a package!

How to generate PDF reports with Jinja2 and PyQt

There are quite a few options for PDF generation in Python, but nothing fully open-source that ticks all the boxes. The Reportlab library is probably the most fully-featured solution available right now. Unfortunately, the open-source version doesn’t support templates, so unless you cough up for a license you’re going to be stuck manipulating the PDF format at a very low level.

Recently, I had a requirement to generate some simple PDF reports. I didn’t want to write lots of boilerplate and hoped I would be able to use some kind of templating. In the end I came up with a solution based on Jinja2 and Qt’s QWebView widget. First it renders the contents as HTML, then it uses the “Print as PDF” functionality of the QWebView  to save a PDF file. It’s a bit of a hack, but it gets the job done.

Here’s the main file, htmltopdf.py .

I also made a subdirectory for Jinja2 templates called, originally enough, templates. Inside it are two files, base.html  and report.html .

These templates are just examples. This is where you can get creative with CSS, etc. All I’ve done is provide a minimal scaffold so you can see what’s going on.

Let’s walk through htmltopdf.py  and figure out how it works.

The first thing we need to do is set up the Jinja2 environment. Although it is possible to instantiate a jinja2.Template  object by passing it a string holding the template text, instantiating the environment gives us access to template inheritance and other cool features. To set up the environment, we have to pass it a loader object. For our purposes we can use the basic package loader, which takes as arguments the name of the package (or the name of the file, in this case) and the template directory inside it. Our template directory is  templates .

Here is the function that we use to render a particular template. You can get the Template  object from the given template file name by calling get_template  on the environment. Once you have it, you  call render  on it, passing in the necessary information as keyword arguments.

Now let’s take a look at the print_pdf  function, which handles laying out the HTML from our render_template  function and printing it to a file.

We won’t be able to do much with the QWebView  unless we instantiate QApplication , so we bookend print_pdf  by constructing a QApplication  instance and finally calling exit  on it.

Next, we create the QWebView . Seeing as we already have the HTML we want to display in it, we can call setHtml  on the webview. If you want to load an external URL, for snapshotting web pages, etc., you have to use the QWebView.load  function to set its URL. In that case, you will need to register a signal handler to listen for the loadFinished()  signal, but seeing as we are just providing the HTML directly we don’t need to bother with that.

After we have injected the HTML into the webview, we get a QPrinter  and configure it to print an A4-size PDF document, by calling its  setPageSize , setOutputFormat  and setOutputFileName  methods. Other page sizes and output formats are also supported.

That covers everything novel in this approach. The main function just ties it all together and generates some sample data for the template. I found the loremipsum  package handy for quickly getting my hands on placeholder text.

Here’s what our generated PDF looks like:

[Figure: the rendered PDF]

A really simple guide to packaging your PyQt application with cx_Freeze

Python is great for writing programs that run on your own machine or deploy to a web server, but when you want to distribute your applications to friends or customers, things can get very annoying very quickly.

– “It doesn’t work. I don’t know how to run this.”
– “Ok, did you install the Python interpreter?”
– “No, what’s that?”
– “You have to download it from www.python.org. Get the 2.7 version.”
– “Yeah, it’s ok. I’ll just use something else.”

We’ve all been there. If you’re going to distribute your software to people who aren’t Python programmers, you had better package it in a friendly way.

My preferred solution is cx_Freeze. Unlike py2exe, it is cross platform; you can use it to build packages for Windows, OSX and Linux. This post will walk through how to package a simple PyQt4 GUI application for all three platforms. The sample application is a Tetris clone I found in the PyQt4 tutorial on www.zetcode.com.

First, go here, copy the full version of the game and save it in a file called tetris.py.

Make sure you have cx_Freeze installed:

Now go to the directory where you saved tetris.py and run the quickstart command to generate a scaffold setup.py. This is a distutils setup script that tells cx_Freeze how to package your application.

You will be prompted for some information. When it asks for the “Python file to make executable from”, type the name of the script that is the main entry point of your application. In this case it’s tetris.py. The generated setup.py will look something like this:

Not very PEP8, but we can let that pass. Let’s walk through this script to see what is going on.

Build options control what Python packages and modules, and what non-Python files (such as assets) are included in the packaged application. cx_Freeze tries to figure out what is needed on its own, but you may need to manually specify some stuff here if you are using dynamic module imports anywhere in your program.

The “packages” and “excludes” keys are included in the automatically generated buildOptions dictionary. The other main keys that you might need to add are “includes” and “include_files”. “includes” takes a list of modules that need to be included, and “include_files” takes a list of non-Python files, e.g.

The next interesting line is:

On Windows, GUI applications require a different Python base. They must be executed with the pythonw.exe interpreter, or a command prompt will open and remain open for the duration of the program.

If we read further in the generated setup.py, we see:

An instance of the cx_Freeze.Executable class must be created for the file that is the main entry point of your program. In general, you will only have one executable, but you can have more if, for instance, you are packaging a suite of command line tools.

The final section of the file contains the call to the setup function, passing in the buildOptions dictionary and the other options you specified when you ran cxfreeze-quickstart, such as the program name, version and description.

For this tetris.py script, we do not need to modify the autogenerated setup.py. You can go ahead and build the package by running

This will create a new build/ directory below your project. Below that, it will create another directory whose name starts with exe. followed by the name of your current platform. It does this so that the build command can be run on multiple platforms without overwriting earlier builds.

Inside the platform subfolder, you will see your executable.

There’s a little gotcha that I should note here. When you run the build command, it will complete even if certain imports could not be found. Sometimes it won’t matter, but you should review the output of the command to make sure that everything necessary is included.

The build command will dump everything out into the platform folder. When packaging applications for Linux I recommend that you just distribute a .tar.gz of this, but for OSX and Windows, you should make use of the platform-specific packaging commands.

On OSX, you have the option of building a .dmg or a .app, by executing one of these at the prompt:

On Windows, you can build a .msi package as follows:

Unfortunately, you can’t build packages for one platform on another, so you can only build Windows packages on Windows and OSX packages on OSX, but the setup.py is the same across platforms.

Before I go, I should say something about accessing files inside your packaged application. When your packaged executable is running, the global __file__ variable is not set. If you try to grab the current path using os.path.dirname(__file__) it won’t work. You need to use os.path.dirname(sys.executable) .
Luckily, when your packaged application is running, a “frozen” attribute in sys is set. You can use this fact to grab a handle to files regardless of whether the packaged version of the program is running or not.

The cx_Freeze documentation lists this function as a starting point.
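A helper along these lines, written from the description above rather than copied from the documentation, could look like this. The name find_data_file is an assumption.

```python
import os
import sys


def find_data_file(filename):
    """Return the path to a data file, whether or not the app is frozen."""
    if getattr(sys, "frozen", False):
        # Running as a packaged executable: look next to the executable.
        datadir = os.path.dirname(sys.executable)
    else:
        # Running from source: look next to this script.
        datadir = os.path.dirname(os.path.abspath(__file__))
    return os.path.join(datadir, filename)
```

getattr with a default of False avoids an AttributeError when the program runs unfrozen, since sys.frozen does not exist in that case.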

You might need to modify this function slightly, as it assumes that your data files are stored on the same level as your executable, not in a subdirectory.

That’s it! After working through this post, you should have more than enough information to start packaging and distributing your creations.

Resources for learning data science in Python

Over the last few years I’ve been playing around with NumPy, SciPy, scikit-learn and other Python libraries for data science and machine learning.

In the process, I’ve collected a bunch of nice resources that should be useful to anybody trying to get to grips with these topics in Python.

Tutorials

Tentative NumPy Tutorial – This is the NumPy tutorial from the SciPy wiki. It covers the basics and is written in a cookbook style, so it’s ideal for use as a reference. One to bookmark, for sure.

Python Scientific Lecture Notes – A really comprehensive set of notes that goes from basic NumPy and advanced standard Python features, to symbolic mathematics, image processing and machine learning using scikit-learn, scikit-image and SymPy.

Quantitative Economics with Python – This site not only contains an in-depth introduction to Python scientific computing with applications to quantitative economics, but also touches on Pandas and IPython Notebooks, which are quickly becoming the standard for sharing computational ideas in Python.

NumPy for Matlab Users – Although my own foray into Matlab was limited to going through the Octave code for Stanford’s Machine Learning MOOC a few years ago, this tutorial has been recommended for people making the transition from Matlab to Python.

100 NumPy Exercises – Nicolas Rougier has put together a list of 100 exercises, graded from beginner to advanced levels, to teach people how to perform matrix operations the NumPy way. A great hands-on way to get to grips with the library.

Data Manipulation in Python – Mostly brief tutorials on manipulating and visualizing data from CSV files using Pandas.

Computational Statistics in Python – Ridiculously comprehensive

Beat Detection Algorithms – Short blog post about automatically detecting the tempo of a piece of music. Not Python, but still interesting.

Beat Detection Algorithms, Part II – Second part of the above post.

Gensim Tutorial – Gensim is a Python implementation of the latent semantic analysis and latent Dirichlet allocation unsupervised topic modelling algorithms.

How to Implement a Neural Network in Python – Four-part tutorial on the basics of neural nets.

Hacker’s Guide to Neural Networks – Andrej Karpathy’s neural net tutorial.

Using pandas and scikit-learn for classification tasks – An interesting IPython Notebook published on Github by Skipper Seabold.

Books

Machine Learning in Action – Quite old now, but a fun book that shows how to implement many common machine learning algorithms in Python.

Natural Language Processing with Python – Like the Manning book, this one is showing its age, but it remains the best introduction to NLP with NLTK available.

Speech and Language Processing – Dan Jurafsky’s book about NLP. This is amazing stuff.

Python for Data Science – An introduction to many important Python scientific computing tools, including NumPy, SciPy, Pandas and IPython Notebooks, with an eclectic set of applications.

Machine Learning for Hackers – A pragmatic introduction to machine learning topics, focused on usable implementations rather than theory.

Bayesian Methods for Hackers – It is what it sounds like: an introduction to Bayesian techniques from a code-first point of view.

Model Based Machine Learning – Early access version of Christopher Bishop’s new book.

Courses

CS109 Data Science – Harvard course with lectures. Labs and solutions made using IPython Notebooks.

Learning from Data – Caltech course on the fundamentals of machine learning. Hard, un-watered-down material here.

Videos

Data School – 15 hours of videos and slides by data science experts. Math heavy.

Pandas from the Ground Up – PyCon 2015 presentation

Neural Nets for Newbies – PyCon 2015 presentation about neural networks by Melanie Warrick. Quite approachable.

Machine Learning with Scikit-Learn I – First of two PyCon 2015 videos about sklearn.

Machine Learning with Scikit-Learn II – Second of two PyCon 2015 videos about sklearn.

Deep Learning Course – Full video of a set of Oxford lectures on deep learning by Nando de Freitas.

Blogs

Andrej Karpathy Blog – Stanford PhD student who writes a lot about machine learning, especially neural nets.

Hunch.net – John Langford’s blog about machine learning theory.

NLPers – Hal Daumé III’s blog about NLP topics.

PyImageSearch – A blog about computer vision in Python.

How to read a file properly in Python

Very often beginning and even experienced Python programmers will read a file like this:
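A typical example of that pattern, sketched with a throwaway sample file so it can be run as is (the file name and contents are made up):

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

# Slurp: the whole file ends up in memory as a single string.
with open(path) as f:
    contents = f.read()

print(len(contents))
```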

In most cases code like this will work fine. The read()  method will read the whole file into memory and store it in the contents  variable. This is known as “slurping”.

Likewise, the readlines()  method reads the whole file into memory, one line at a time (using the readline()  method), appending each line to a list. So this:

will return the same thing as:

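(Note that splitting on “\n” is portable here: when a file is opened in text mode, Python translates platform-specific line endings to “\n” as it reads, so os.linesep is not needed and can actually give the wrong result on Windows.)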
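One way to see the equivalence for yourself is the sketch below, again using a made-up sample file. readlines() keeps the trailing newline on each line, so the closest one-line equivalent is read().splitlines(True), where True asks splitlines to keep the line endings.

```python
import os
import tempfile

# Throwaway sample file (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

with open(path) as f:
    lines_a = f.readlines()

with open(path) as f:
    lines_b = f.read().splitlines(True)  # True keeps the line endings

print(lines_a == lines_b)
```

Either way the whole file is held in memory at once.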

But what happens when we want to read a massive file that won’t fit in memory? It turns out we have some pretty nice options, based on whether we want to read the file line by line or in fixed-size chunks.

Firstly, here is the Pythonic way to read a file line by line without slurping.
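A minimal sketch of the idiom, using a made-up sample file: a file object is its own iterator, so looping over it yields one line at a time.

```python
import os
import tempfile

# Throwaway sample file (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("alpha\nbeta\ngamma\n")

line_count = 0
longest = ""
with open(path) as f:
    for line in f:  # only one line is held in memory at a time
        line_count += 1
        stripped = line.rstrip("\n")
        if len(stripped) > len(longest):
            longest = stripped

print(line_count, longest)
```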

If you mostly work with text files, this is probably the best general purpose file reading idiom to use. While reading files piece by piece will always be slower than slurping the whole file into memory, that probably won’t make a difference for you.

On the other hand, this method assumes that your file has line breaks, which may not be the case. If you really want to make sure that your program only reads a certain number of characters from the file, you can take advantage of the fact that the file’s read()  method takes an optional parameter, read(n) . When n is omitted, the method will read to the end of the file, but when n is present, it will only read the specified number of characters. Here’s an example:
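A sketch of such a loop, with a deliberately tiny chunk size and a made-up sample file so the behaviour is easy to follow:

```python
import os
import tempfile

CHUNK_SIZE = 4  # hypothetical; real code would use something like 4096

# Throwaway sample file with no line breaks at all.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("abcdefghij")

chunks = []
with open(path) as f:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if chunk == "":
            break  # an empty string means end of file
        chunks.append(chunk)

print(chunks)
```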

In this snippet we’re relying on a particular behaviour of the  read()  method: it returns an empty string when the end of the file is reached. When reading files of unknown length chunk by chunk, we use that fact to break out of the loop.

This code is a bit verbose, but it does what we want. Is there a more compact way to write it? It turns out that there is. The following snippet is functionally identical to the one above:
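A compact form along those lines, assuming the same kind of made-up sample file and chunk size as before:

```python
import os
import tempfile

CHUNK_SIZE = 4  # hypothetical; kept tiny so the output is easy to read

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("abcdefghij")

chunks = []
with open(path) as f:
    # Keep calling f.read(CHUNK_SIZE) until it returns the sentinel "".
    for chunk in iter(lambda: f.read(CHUNK_SIZE), ""):
        chunks.append(chunk)

print(chunks)
```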

To understand this, it would help to know what is going on with the iter()  function. The iter() function has one mandatory argument and one optional argument, and it behaves differently depending on whether one or two arguments are provided.

If only one argument is provided, then that argument must be an object that supports the iterator protocol, i.e. it must implement the __iter__() method, or the sequence protocol, i.e. it must implement the __getitem__() method.

If, on the other hand, the second “sentinel” argument is provided, then the first argument must be a callable and the iterator returned by iter(callable, sentinel)  will behave as follows:

  • When the iterator’s next() method (__next__() in Python 3) is called, the callable passed in as the first argument is called with no arguments.
  • If the value returned by the callable is equal to the sentinel, then StopIteration is raised; otherwise the value is returned.
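The two-argument form is not specific to files. As a small sketch with a made-up list, the iterator below keeps popping items until it sees the sentinel value 0:

```python
# Hypothetical data: 0 acts as an end-of-input marker.
queue = [3, 1, 4, 0, 5]

# Each step calls queue.pop(0); iteration stops when that returns 0.
collected = list(iter(lambda: queue.pop(0), 0))

print(collected, queue)
```

The sentinel itself is consumed but never yielded, and anything after it is left untouched.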

Here’s a version of the last example modified slightly to deal with binary files:
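A sketch of the binary variant, with made-up sample data: the file is opened in "rb" mode and the sentinel becomes an empty bytes object.

```python
import os
import tempfile

CHUNK_SIZE = 4  # hypothetical; kept tiny for the example

data = bytes(range(10))
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(data)

chunks = []
with open(path, "rb") as f:
    # In binary mode read() returns bytes, so the sentinel must be b"".
    for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
        chunks.append(chunk)

print(len(chunks))
```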

The ‘b’ prefix before the sentinel string denotes a Python 3 bytes literal and deserves some explanation. Here is what the Python 2.x documentation says about it:

A prefix of ‘b’ or ‘B’ is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A ‘u’ or ‘b’ prefix may be followed by an ‘r’ prefix

And the Python 3.3 docs:

Bytes literals are always prefixed with ‘b’ or ‘B’; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.

If we develop our code a little further, we can hide the details of the lambda expression, the iter  function, and the sentinel. The following generator takes a file object and an optional chunk size and lets us iterate through the file as chunks:
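One possible shape for such a generator is sketched below. The name read_in_chunks and the 4096-byte default are assumptions, and it expects a file opened in binary mode.

```python
import os
import tempfile


def read_in_chunks(file_object, size=4096):
    """Yield successive chunks of at most size bytes from a binary file."""
    for chunk in iter(lambda: file_object.read(size), b""):
        yield chunk


# Throwaway 2500-byte sample file (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(b"x" * 2500)

# Default chunk size: the whole sample fits in one chunk.
with open(path, "rb") as f:
    default_sizes = [len(chunk) for chunk in read_in_chunks(f)]

# Explicit 1k chunk size.
with open(path, "rb") as f:
    kb_sizes = [len(chunk) for chunk in read_in_chunks(f, 1024)]

print(default_sizes, kb_sizes)
```

For text-mode files the sentinel would need to be "" instead of b"".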

We can use it like so:

To specify a chunk size, just pass in the optional size argument to the generator. This reads a file in 1k chunks, for instance:

So there you have it. These basic file reading techniques cover the majority of what you will need in your day to day Python usage.