Why Is My Django/MySQL Application Showing Unicode as Question Marks?

Back up your database before you try anything here. Sometimes character set conversions can change your data in ways you don’t want. Be sensible and use mysqldump or something to safeguard it before you start messing around. Needless to say, you should try everything in a test environment first.

When you run a Django application (or any other web application, for that matter) on top of a stock MySQL install, you might hit a problem with storing Unicode characters. I saw it in a Django project that had to deal with Arabic text. Instead of the Arabic characters, it just showed a bunch of question marks.

Here’s how to fix it.

Check your MySQL character set

Out of the box, your MySQL character set is probably latin1. We’re going to change it to utf8.

First, run this command to check that you are in fact dealing with an incorrect character set:

In the output, you will probably see the following line:

If you do, keep going. We’re going to sort it out.

Edit my.cnf

The main MySQL configuration file is called my.cnf. On Ubuntu it is located at /etc/mysql/my.cnf. You can check where it is on your own system by running locate my.cnf.

The file is divided into sections and the start of each section is indicated with a name in square brackets. We’re interested in the sections [client] and [mysqld].

After making a backup of the current state of the file, open it in your text editor of choice and find the [client] section. Add the following line to it:

Next, find the [mysqld] section and add the following three lines to it:

Be careful that you add the code to the right sections. If you make a mistake here then MySQL will not start and it won’t write any useful error message to the logs.

Save my.cnf and restart MySQL. On many systems, you can do this with the service command:

Alter each table to use the new character set

First, you want to generate the script you are going to use to convert each table one by one to the new character set. Change the database name, username and password to the correct values and run this in the terminal.

It will generate the SQL you need to change each of your tables. For example, if your database contained three tables called users, comments and posts, the generated code would look like this:
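As a rough illustration of what such generated SQL could contain, here is a small Python sketch that builds one ALTER TABLE ... CONVERT TO CHARACTER SET statement per table. The function name build_alter_statements and the utf8_general_ci collation are assumptions made for this example, not part of the original script, and the table names are taken on trust rather than escaped.

```python
def build_alter_statements(tables, charset="utf8", collation="utf8_general_ci"):
    """Return one ALTER TABLE statement per table name.

    The charset and collation defaults are illustrative assumptions.
    Table names are inserted as-is, so only pass names you trust.
    """
    template = (
        "ALTER TABLE `{table}` CONVERT TO CHARACTER SET {charset} "
        "COLLATE {collation};"
    )
    return [
        template.format(table=table, charset=charset, collation=collation)
        for table in tables
    ]


statements = build_alter_statements(["users", "comments", "posts"])
for statement in statements:
    print(statement)
```

Printing the statements first gives you a chance to review them before running anything against the database.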

Run that code against the database using your tool of choice. It might take a while, depending on the size of your tables. You’ll know when you try it on your test environment. When it’s done, those question marks should be history.

How variable scope works in Python

Someone asked me to take a look at a piece of code recently and tell him why it wasn’t working. The problem was that he didn’t really understand Python variable scoping. That’s what I’m going to talk about today. It is quite basic, but you really need to have it down cold, and there are a few surprises in there too.

What you need to know

A variable in Python is defined when you assign something to it. You don’t declare it beforehand, like you can in C. You just start using it.

Any variable you declare at the top level of a file or module is in global scope. You can access it inside functions.

Before I go on I need to add a disclaimer: global variables are almost always a bad idea. Yes, sometimes you need them, but you almost always don’t. A good rule of thumb is that a variable should have the narrowest scope it needs to do its job. There’s a good discussion of global variables and the associated issues here.

Modifying the value of a global variable is less simple. Take a look at this example.

What happened? Why is the value of x 123 for the second print statement? It turns out that when we assigned the value 321 to x inside foo we actually declared a new variable called x in the local scope of that function. That x has absolutely no relation to the x in global scope. When the function ends, that variable with the value of 321 does not exist anymore.
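The behaviour described above can be reproduced with a few lines. This is a minimal sketch that reuses the names x and foo and the values 123 and 321 from the text; the return value is only there to make the local value visible.

```python
x = 123


def foo():
    # This assignment creates a new local variable called x.
    # It does not touch the global x defined above.
    x = 321
    return x


inside = foo()
print(inside)  # 321, the local x inside foo
print(x)       # 123, the global x is unchanged
```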

To get the desired effect, we have to use the global keyword.

That’s more like it.
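A sketch of the same function with the global keyword, again using the names from the text, might look like this:

```python
x = 123


def foo():
    # Tell Python that x refers to the module-level variable.
    global x
    x = 321


foo()
print(x)  # 321, the global x was rebound inside foo
```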

There is one more scope we have to worry about: the enclosing scope created by declaring one function inside another one. Watch.

What if you want to modify the value of x declared in the outer function? You’ll run into the same problem that made us use global. But we don’t want to use global here. x is not a global variable. It is in the local scope of a function.

Python 3 introduced the nonlocal keyword for this exact situation. I wrote a post about it on this page, but I’ll show you a quick example now.
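Here is a small sketch of both situations: reading a variable from the enclosing scope, and rebinding it with nonlocal. The function names are made up for this example.

```python
def outer_read():
    x = 1

    def inner():
        # Reading x from the enclosing scope works without any keyword.
        return x

    return inner()


def outer_write():
    x = 1

    def inner():
        # Without nonlocal, this assignment would create a new local x.
        nonlocal x
        x = 2

    inner()
    return x


read_result = outer_read()
write_result = outer_write()
print(read_result)   # 1
print(write_result)  # 2
```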

A simple way to remember Python scoping rules

In the book Learning Python, Mark Lutz suggests the following mnemonic for remembering how Python scoping works: LEGB

Going from the narrowest scope to the widest scope:

  • L stands for “Local”. It refers to variables that are defined in the local scope of functions.
  • E stands for “Enclosing”. It refers to variables defined in the local scope of functions wrapping other functions.
  • G stands for “Global”. These are the variables defined at the top level of files and modules.
  • B stands for “Built in”. These are the names that are loaded into scope when the interpreter starts up. You can look at them here: https://docs.python.org/3.5/library/functions.html

And that is everything you need to learn about this topic for the vast majority of Python programming tasks.

How to fix database race conditions in Django views

Today I’m going to show you how to fix an extremely common error in Django applications. My guess is about 90% of Django applications deployed in the wild suffer from this error, and like 72% of statistics I just made that one up on the spot. Seriously though, it’s pretty common.

Imagine you’ve got an online bookstore application with a Book model that has a quantity attribute. When somebody buys a copy of one of your books, you want to decrease the quantity attribute by 1. Here is the naive way to do it:

At the start when you’ve got a small load on your system, this will seem to work fine. Now imagine your bookstore grows, you open some new branches, and there are multiple updates being run on your application every second. That’s when strange things will start to happen. Here is how two concurrent updates might play out with our current code. book1 represents the first concurrent update and book2 represents the second:

At the start of both concurrent updates, an identical copy of the data in the database is loaded into memory. The inventory quantity is decreased on each copy, then the new quantity is written back to the database, with the second update clobbering the first. Result: it is as if one of the updates never happened.
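The lost update can be simulated without a database at all. In this sketch a plain dictionary stands in for the database row, and book1 and book2 are the two in-memory copies from the description above; the starting quantity of 10 is an arbitrary assumption.

```python
# A dictionary standing in for the row in the database.
database = {"quantity": 10}

# Both concurrent updates load their own copy of the same data.
book1 = dict(database)
book2 = dict(database)

# Each update decreases the quantity on its own copy.
book1["quantity"] -= 1
book2["quantity"] -= 1

# Both write back; the second write clobbers the first.
database.update(book1)
database.update(book2)

print(database["quantity"])  # 9, although two books were sold
```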

In database terms, what we need is called a SELECT FOR UPDATE. Basically, this locks the row in the database until the new information is written back, preventing a second instance from reading and modifying data that might be in the process of changing.

Since Django 1.4, implementing SELECT FOR UPDATE through the ORM is really simple:

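That will lock the row selected with get until the end of the enclosing transaction. In current Django versions, which run in autocommit mode by default, the query has to be executed inside a transaction.atomic block (or with ATOMIC_REQUESTS enabled, which wraps each request in a transaction); otherwise Django raises a TransactionManagementError.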
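As a hedged sketch of how this might look in a recent Django version, the helper below locks the row, decrements the quantity and saves it inside one transaction. The function name buy_book and the idea of passing the model class in as an argument are assumptions made so the example does not depend on a particular project layout; Django is imported inside the function for the same reason.

```python
def buy_book(book_model, book_id):
    """Decrease the quantity of one book while holding a row lock.

    book_model is assumed to be a Django model class with an integer
    quantity field, such as the Book model described above.
    """
    # Imported here so the sketch can be read without a configured project.
    from django.db import transaction

    with transaction.atomic():
        # The row stays locked until the atomic block exits.
        book = book_model.objects.select_for_update().get(pk=book_id)
        book.quantity -= 1
        book.save()
    return book
```

In a view you would call something like buy_book(Book, book_id), where Book is your own model.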

select_for_update is compatible with the postgresql_psycopg2, oracle, and mysql database backends. It doesn’t work for the sqlite backend.

Text to speech with Python 3 on Linux and OSX

Recently I had a requirement to synthesise speech from text on two different operating systems. Here is what I came up with.

OSX

Synthesising speech is a simple matter for OSX users because the operating system comes with the say command. We can use subprocess to call it.
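A minimal sketch of calling say through subprocess could look like this. The helper names build_say_command and say are made up for this example, and the demo call only runs on macOS.

```python
import subprocess
import sys


def build_say_command(text):
    # Passing a list avoids shell quoting problems with the text.
    return ["say", text]


def say(text):
    # check=True raises CalledProcessError if the command fails.
    subprocess.run(build_say_command(text), check=True)


if __name__ == "__main__" and sys.platform == "darwin":
    say("Hello from Python")
```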

Linux

On Linux, there are a few different options. I like to use the espeak Python bindings when I can. You can install them on Ubuntu using apt-get.

Then use it like so:

espeak supports multiple languages, so if you are not dealing with English text, you need to pass in the language code. Unfortunately, it looks like the Python bindings don’t support that yet, but we can still use subprocess like we did on OSX.
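Here is a sketch of the subprocess approach, using the -v option of the espeak command line tool to choose the voice. The helper names and the French sample text are assumptions for illustration only.

```python
import shutil
import subprocess


def build_espeak_command(text, language="en"):
    # -v selects the voice, which for espeak is usually a language code.
    return ["espeak", "-v", language, text]


def speak(text, language="en"):
    if shutil.which("espeak") is None:
        raise RuntimeError("The espeak command was not found on this system.")
    subprocess.run(build_espeak_command(text, language), check=True)


if __name__ == "__main__" and shutil.which("espeak"):
    speak("Bonjour tout le monde", language="fr")
```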

The list of available languages can be found on the espeak website here.

How to integrate New Relic with Django, Apache and mod_wsgi

I just finished setting up New Relic application monitoring on Recommendify. It was a little bit of a painful process and it wasn’t properly described in their docs, so I’m going to note the steps I went through here. Hopefully it will help somebody else in the same situation.

Install the New Relic agent

First, install the newrelic package using pip.

Generate the New Relic configuration

Then copy the command to generate the configuration file from your New Relic dashboard. It will look something like this:

LICENSE-KEY will be the specific license key for your account.

This command will create a file called newrelic.ini. Copy it somewhere safe on your server and make sure it has permissions to be read by the user Apache runs under.

Edit wsgi.py

The next step took some time to figure out. New Relic provides a wrapper script for Python applications, but it doesn’t work for setups using embedded interpreters, such as Apache with mod_wsgi. As this is how Recommendify is configured, I had to find another way.

I edited the wsgi.py file from my Django project to wrap the WSGI application object.

This script imports the socket module and uses the socket.gethostname() function to get the hostname of the current machine. I did this so that New Relic would only log data in production and not during development.

At the bottom of the script, I check if the application is running on the production environment. If it is, I import newrelic.agent, initialize it by passing it the path to the newrelic.ini created during the previous step, then wrap the WSGI application object using the WSGIApplicationWrapper class.

You will need to use your own Django settings module, hostname string, and path to newrelic.ini, of course.
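The wrapping logic described above can be sketched as a small helper. The function name wrap_for_production, the hostname and the path in the usage comment are all hypothetical; only socket.gethostname(), newrelic.agent.initialize and WSGIApplicationWrapper come from the description. The newrelic import is kept inside the production branch so nothing is loaded during development.

```python
import socket


def wrap_for_production(application, production_hostname, config_path):
    """Wrap a WSGI application with New Relic on the production host only."""
    if socket.gethostname() != production_hostname:
        # Development machine: return the application untouched.
        return application

    # Only imported in production, so the agent is not needed elsewhere.
    import newrelic.agent

    newrelic.agent.initialize(config_path)
    return newrelic.agent.WSGIApplicationWrapper(application)


# In wsgi.py this might be used roughly like so (values are placeholders):
# application = wrap_for_production(
#     application, "my-production-host", "/path/to/newrelic.ini"
# )
```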

Once you deploy the modified wsgi.py, you should start seeing monitoring information in New Relic after about five minutes. Here’s what it looks like for me.

[Screenshot: new_relic]

What to do after Codecademy

Today I want to offer my opinion on a question that lots of people have asked me since I started this blog. “What should I do after I have finished an introductory course on Codecademy?”

Codecademy is basic stuff

You should be proud to have finished your first course and dipped your toes into programming. It’s a great achievement, but you also have to be realistic: this is just the beginning. You have made your way to base camp, but you haven’t yet attempted to reach the summit.

The fact is that Codecademy, while providing a solid grounding in the basics, doesn’t go much farther than that. On its own, it only teaches you a fraction of the things that you need to know to program professionally (if that’s your goal).

So be proud, but be humble. There is so much to learn that you will never know everything. And that’s a good thing, as long as you like learning.

Identify your goals

Maybe it’s obvious, but the path you take from now should be dictated by what you want to be able to do. If your goal is to write web applications, you should get to grips with frameworks such as Django, Flask and Bottle. If your goal is to use Python for data science, you should be learning things like Numpy, Scipy and Pandas (and hitting the math books hard). If you want to create desktop software, you should learn PyQt and Tkinter.

Once you have made the decision about what area (or areas) to focus on, go and find some beginner materials for them. I’m not going to list any here, because there are just so many, but rest assured that they are easy to find.

(That reminds me of another point: Google is your friend. Programmers like to joke that the job is 80% Googling, and there is some truth to that, depending on what you’re building. So you’d better improve your Google-fu and learn how to ask good questions.)

Projects are key

Working through courses and tutorials, and reading technical books, are important ways of improving your knowledge, but there is really no substitute for trying to build something and picking up what you need to know as you go along. The skills of identifying bugs quickly, deciphering arcane error messages and knowing when to stop fiddling with your code and move on are as important to programming as knowledge of advanced language features or algorithms.

With that in mind, I suggest you set yourself a project. Maybe you already have something in mind, in which case, awesome! Or maybe you don’t. In that case, take a look at this list of projects and pick one that seems fun and achievable.

http://www.dreamincode.net/forums/topic/78802-martyr2s-mega-project-ideas-list/

Learn some real computer science

Assuming you’re like a lot of the people picking up coding these days and your goal is to make web applications, then you can go a long way without even thinking about the theoretical underpinnings of programming.

Don’t be that guy. Apart from the often mooted practical considerations of writing efficient code or gaining access to higher paying jobs, there is a world of elegance and beauty in programming that you may not have expected. I am going to suggest a few resources for this because people have remarked that they don’t know where to start.

Code: The Hidden Language of Computer Hardware and Software

This is the amuse-bouche. Just read it straight through for fun and try to become inspired!

NAND2Tetris

I can’t say enough good things about this course. It took me from a vague understanding of how computers executed my code to being able to conceptualize from the ground up how it all works. If you follow the course the whole way through you will build a (simulated) computer from first principles, create an assembler for it, a compiler for a Java-like language, and a basic operating system. Do it!

Coursera – Algorithms Part 1

This course, taught by Robert Sedgewick, will teach you a basic algorithmic toolbox. The algorithms it deals with are sort of like a “Greatest Hits” collection. Knowledge of these algorithms and the design techniques behind them forms a kind of lingua franca among programmers. For that reason, they are also really popular interview questions.

Phew! If you get to grips with all that stuff, you’ll be doing well!

Learn source control

I often see people advising beginners to learn Git before they do anything else. Git is a great program, and version control is one of the key skills needed to develop software at a professional level, but I can’t agree with the sentiment. The fact is that Git is extremely complicated. Even experienced programmers end up having to trawl the documentation and StackOverflow to figure out how to do certain things. So I suggest you learn Mercurial instead. It has many of the advantages of Git while being much more user-friendly. As a bonus, it’s written in Python!

The key thing you need to acquire at this stage of your education is a habit of using version control to track the development of your personal projects. You need to go through the process of changing code, breaking your whole project and being able to revert to a working state so that you can really understand why people use version control. For that purpose, the steep learning curve of Git is just going to put you off.

I’m not saying you shouldn’t learn Git later. You should, especially if you want to participate in open source projects. But at that point knowing Mercurial is only going to help, because you will already understand the concepts.

Learn an IDE and/or a text editor

Now is also the time to learn about IDEs and text editors. If you’ve been hanging around on programming forums (you should be), you will no doubt have got an inkling of the holy war between Vi and Emacs users, and the strong opinions of people who think IDEs are unnecessary bloatware. I don’t want to come down on either side of these arguments. I just want to suggest that you are now at the point where you should be investigating them for yourself.

But for my money, PyCharm is an amazing piece of software. 😀

The end

I hope I’ve given you something to think about with this article. If it all seems like a lot of work, well, it is! I’ll let Peter Norvig provide some perspective.

As always, I’m happy to answer your emails.

Comparing files in Python using difflib

Everybody knows about the diff command in Linux, but not everybody knows that the Python standard library contains a module that implements the same algorithm.

A basic diff utility

First, let’s see what a minimal diff implementation using difflib might look like:

The context_diff function takes two sequences of strings – here provided by readlines – and optional fromfile and tofile keyword arguments, and returns a generator that yields strings in the “context diff” format, which is a way of showing changes plus a few neighbouring lines for context.
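To make the description concrete, here is a sketch that calls context_diff on two short lists of lines instead of real files. The sample lines and file names are invented for the example.

```python
import difflib

# Two versions of the same three-line text, as readlines would return them.
old_lines = ["first line\n", "second line\n", "third line\n"]
new_lines = ["first line\n", "2nd line\n", "third line\n"]

diff_lines = list(
    difflib.context_diff(
        old_lines, new_lines, fromfile="file1.txt", tofile="file2.txt"
    )
)

# Each yielded string already ends with a newline.
print("".join(diff_lines), end="")
```

For real files you would pass in the result of readlines for each file in place of the two lists.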

The library also supports other diff formats, such as ndiff.

Let’s use the utility to compare two versions of F. Scott Fitzgerald’s famous conclusion to The Great Gatsby.

The exclamation marks (!) denote the lines with changes on them. file1.txt is of course the version we know and love.

Fuzzy matches

That’s not all difflib can do. It also lets you check for “close enough” matches between text sequences.
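Two functions from the module are handy for this kind of fuzzy matching: get_close_matches and SequenceMatcher.ratio. The words below are just sample data.

```python
import difflib

# Find the candidates that look most like the misspelled word.
candidates = ["ape", "apple", "peach", "puppy"]
matches = difflib.get_close_matches("appel", candidates)
print(matches)

# ratio() returns a similarity score between 0 and 1.
ratio = difflib.SequenceMatcher(None, "gatsby", "gatsbi").ratio()
print(round(ratio, 2))
```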

When I saw this first, I immediately thought “Levenshtein Distance”, but it actually uses a different algorithm. Here’s what the documentation says about it:

The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980’s by Ratcliff and Obershelp under the hyperbolic name “gestalt pattern matching”. The basic idea is to find the longest contiguous matching subsequence that contains no “junk” elements (R-O doesn’t address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.

HTML diffs

The module includes a class called HtmlDiff that can be used to generate diff tables for files. This would be useful, for instance, for building a front end to a code review tool. This is the coolest thing in the module, in my opinion.

The class also has a method called make_file that outputs an entire HTML file, not just the table.
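A short sketch of both methods, using invented sample lines rather than real files:

```python
import difflib

old_lines = ["first line\n", "second line\n", "third line\n"]
new_lines = ["first line\n", "2nd line\n", "third line\n"]

differ = difflib.HtmlDiff()

# make_table returns only the <table> markup for the side-by-side diff.
table_html = differ.make_table(
    old_lines, new_lines, fromdesc="file1.txt", todesc="file2.txt"
)

# make_file returns a complete HTML document containing the same table.
page_html = differ.make_file(
    old_lines, new_lines, fromdesc="file1.txt", todesc="file2.txt"
)

print(len(table_html), len(page_html))
```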

Here is what the rendered table looks like:

[Screenshot: difflib_html]

Go forth and diff!

There are a few other subtleties, but I have covered the main functionality in this post. Check out the official documentation for difflib here.

The bool function in Python

Python’s built-in bool function comes in pretty handy for checking the truth and falsity of different types of values.

First, let’s take a look at how True and False are represented in the language.

True and False are numeric values

In Python internals, True is stored as a 1 and False is stored as a 0. They can be used in numeric expressions, like so:

They can even be compared to their internal representation successfully.

However, this is just a numeric comparison, not a check of truthiness, so the following comparison returns False:
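The numeric behaviour described in this section can be checked directly in a few lines:

```python
# True and False behave like the integers 1 and 0 in arithmetic.
total = True + True
print(total)  # 2

# They compare equal to 1 and 0.
print(True == 1)   # True
print(False == 0)  # True

# But this is a numeric comparison, so 5 is not equal to True.
five_equals_true = 5 == True
print(five_equals_true)  # False
```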

bool to the rescue

The number 5 would normally be considered to be a truthy value. To get at its inherent truthiness, we can run it through the bool function.

The following are always considered false:

  • None
  • False
  • Any numeric zero: 0, 0.0, 0j
  • Empty sequences: "", (), []
  • Empty dictionaries: {}
  • Instances of classes whose __bool__() or __len__() methods return False or 0.

Everything else is considered true.
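The list above can be verified with bool itself. The two small classes are invented for this example to show the __len__ and __bool__ cases.

```python
class EmptyBasket:
    # An object whose length is 0 is considered false.
    def __len__(self):
        return 0


class Rejected:
    # __bool__ takes priority over __len__ when both are defined.
    def __bool__(self):
        return False


falsy_values = [None, False, 0, 0.0, 0j, "", (), [], {}, EmptyBasket(), Rejected()]
results = [bool(value) for value in falsy_values]
print(results)  # every entry is False

print(bool(5))    # True
print(bool("0"))  # True, because the string is not empty
```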

Python Regular Expression Basics

Regular expressions are one of those topics that confuse even advanced programmers. Many people have programmed professionally for years without getting to grips with them. Too often, people just copy and paste regexes from StackOverflow or other websites without really understanding what’s going on. In this article, I’m going to explain regular expressions from scratch and introduce you to Python’s implementation of them in the re module.

Regular expressions describe sets of strings

A regular expression is a description of a set of strings. Regular expression matching is a method of finding out if a given string is in the set defined by a certain regular expression. Regular expression search is a method of finding occurrences of strings belonging to that set inside a larger string. Python’s re module provides facilities for search, matching and replacing matched substrings with something else.

The simplest regular expression is just a sequence of ordinary characters. Ordinary characters are those characters that do not have a special meaning in the regular expression syntax.

The re.match function returns a match object if the text matches the pattern. Otherwise it returns None.

Notice how I put the r prefix before the pattern string. Sometimes the regular expression syntax involves backslash-escaped characters that coincide with escape sequences. To prevent those portions of the string from being interpreted as escape sequences, we use the raw r prefix. We don’t actually need it for this pattern, but I always put it in for consistency.

To search in a larger string, we use re.search.
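A short sketch of both functions with an ordinary-character pattern. Note that re.match only looks at the start of the string, which is why it fails on the longer sentence while re.search succeeds. The sample strings are made up.

```python
import re

# re.match succeeds when the pattern matches at the start of the string.
match = re.match(r"cheese", "cheese")
print(match is not None)  # True

# The sentence does not start with "cheese", so re.match returns None.
no_match = re.match(r"cheese", "I like cheese")
print(no_match)  # None

# re.search scans the whole string for the first occurrence.
found = re.search(r"cheese", "I like cheese")
print(found.start(), found.end())  # 7 13
```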

So far this is not very useful. We are just matching strings against other strings, which can be achieved more easily with == and in.

However, regular expressions really come into their own when we start using sets of characters and repetitions.

Sets of characters and repetitions

Let’s say we don’t just want to match the string "cheese", but any string of lowercase alphabetic characters that is six characters long. In that case we can use the pattern "[a-z]{6}". The bit in the square brackets – [a-z] – means that we should match any lowercase alphabetic character from a to z. The bit in the curly brackets – {6} – means that the match should repeat six times.

The dot character . matches any character except a newline.

By the way, if you want to match the dot character itself, you will have to escape it. Special characters can be escaped and made to match their ordinary equivalents by putting a backslash \ before them.
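The patterns from this section can be tried out as follows. The sketch uses re.fullmatch so that the whole string has to fit the pattern; the sample strings are arbitrary.

```python
import re

# Six lowercase letters.
six_lower = re.fullmatch(r"[a-z]{6}", "cheese")
six_mixed = re.fullmatch(r"[a-z]{6}", "Cheese")
print(six_lower is not None, six_mixed is not None)  # True False

# The dot matches any character except a newline.
dot_letter = re.fullmatch(r"c.t", "cat")
dot_newline = re.fullmatch(r"c.t", "c\nt")
print(dot_letter is not None, dot_newline is not None)  # True False

# An escaped dot only matches a literal dot.
literal_dot = re.fullmatch(r"3\.14", "3.14")
not_a_dot = re.fullmatch(r"3\.14", "3x14")
print(literal_dot is not None, not_a_dot is not None)  # True False
```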

Any other restricted set of characters can be defined, such as the set of all digits – [0-9] – and the set of all alphanumeric characters – [a-zA-Z0-9]. There are some shorthand ways of specifying common sets of characters too. For instance, \w is equivalent to the set [a-zA-Z0-9_], i.e. the set of every alphanumeric character and the underscore.

For a full list of the special character classes supported by re, use the help function.

* and +

So far we have learned how to match a set of characters a specific number of times, but what if we want to match it an indeterminate number of times? That’s where * and + come in.

* is known as the Kleene star, after Stephen Kleene, who invented this notation to describe finite state automata. It means “match the previous character or set of characters zero or more times”.

For instance, the regex a* matches the following strings:

  • ""
  • "a"
  • "aa"
  • "aaa"
  • etc.

The + character, on the other hand, means “match the previous character or set of characters one or more times”.

So, the regex b+ matches the following strings:

  • "b"
  • "bb"
  • "bbb"
  • etc.

Unlike the previous pattern, this one does not match the empty string.
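Both quantifiers can be checked against the strings listed above with re.fullmatch:

```python
import re

# a* matches zero or more repetitions, including the empty string.
star_results = [re.fullmatch(r"a*", s) is not None for s in ["", "a", "aa", "aaa"]]
print(star_results)  # [True, True, True, True]

# b+ needs at least one repetition, so the empty string fails.
plus_results = [re.fullmatch(r"b+", s) is not None for s in ["", "b", "bb", "bbb"]]
print(plus_results)  # [False, True, True, True]
```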

Repeating matches between x and y times

We’ve seen how to match characters either a definite number of times or an unlimited number of times, but we can also restrict the length of the match using the {x,y} syntax (with no space after the comma), where x is the lower limit and y is the upper limit.

The pattern a{3,5} will match strings composed of the character a repeated between three and five times.

The strings below match the pattern:

  • "aaa"
  • "aaaa"
  • "aaaaa"

However, the strings "aa" and "aaaaaa" do not match.
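A quick check of the a{3,5} pattern against strings of two to six characters:

```python
import re

pattern = r"a{3,5}"

# Try strings of 2, 3, 4, 5 and 6 repetitions of "a".
lengths = [2, 3, 4, 5, 6]
range_results = [re.fullmatch(pattern, "a" * n) is not None for n in lengths]
print(range_results)  # [False, True, True, True, False]
```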

Excluding characters

Until now, we’ve been defining sets of characters by the characters that are included in them, but we can also define sets of excluded characters. That is done using the caret ^ character inside the square brackets.

In the above example, the pattern [^abc]+ matches any string of length one or more that does not contain the characters a, b or c.
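The [^abc]+ pattern can be tried out like this, with a couple of made-up test strings:

```python
import re

pattern = r"[^abc]+"

# "xyz" contains none of the excluded characters.
allowed = re.fullmatch(pattern, "xyz")
print(allowed is not None)  # True

# "xbz" contains a "b", so the whole string cannot match.
rejected = re.fullmatch(pattern, "xbz")
print(rejected is not None)  # False
```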

Matching the start and end of strings

Regular expressions also support another feature called “anchors”. The caret ^ and the dollar sign $ are the two most common anchors, used to match the start and end of strings respectively. This feature is relevant for searching within strings rather than matching the whole string.

Consider the following example:

The first pattern – ^cheese – matches the first occurrence of the substring "cheese" within the search string. The second pattern – cheese$ – matches the second occurrence.

By using the two anchors together, we can match the whole string. Here is a pattern that will match any string starting and ending with "cheese".
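Here is a sketch of the anchors at work. The search string is an assumption chosen so that "cheese" appears at both the start and the end.

```python
import re

text = "cheese and more cheese"

# ^ anchors the match to the start of the string.
first = re.search(r"^cheese", text)
print(first.span())  # (0, 6)

# $ anchors the match to the end of the string.
last = re.search(r"cheese$", text)
print(last.span())  # (16, 22)

# Both anchors together: starts and ends with "cheese".
whole = re.search(r"^cheese.*cheese$", text)
print(whole is not None)  # True
```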

Matching this or that

Sometimes we want to build a pattern that says “match this string, or match that string”. For that, we use the pipe | character.

We can confine it to a certain region using round brackets.
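A small sketch of alternation, first across the whole pattern and then confined to a bracketed region. The example strings are invented.

```python
import re

# Without brackets, | splits the whole pattern into two alternatives.
either = re.fullmatch(r"cat|dog", "dog")
print(either is not None)  # True

# With brackets, only the bracketed part is an alternative.
confined = re.fullmatch(r"I like (cats|dogs)", "I like dogs")
print(confined is not None)  # True

other = re.fullmatch(r"I like (cats|dogs)", "I like fish")
print(other is not None)  # False
```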

Optional items

What if we wanted to match, say, both the American and English spellings of the word “harbour”? The American version has no “u” in it. Here’s where the optional character ? comes in useful.

The ? in the pattern matches the preceding character u zero or one times.
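A sketch of the optional character, assuming the pattern harbou?r for the two spellings:

```python
import re

pattern = r"harbou?r"

spellings = ["harbor", "harbour", "harbouur"]
optional_results = [re.fullmatch(pattern, word) is not None for word in spellings]
print(optional_results)  # [True, True, False]
```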

Groups and named groups

Parts of a regex pattern bounded by round brackets are called “groups”.

These groups are numbered and can be accessed using indexes, but it is also possible to create named groups. These are accessible by name rather than just by an index.
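Here is a sketch of numbered and named groups using a date string as sample input. The pattern and the group names year and month are made up for the example.

```python
import re

# Numbered groups are accessed by position, starting at 1.
numbered = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2016-03-14")
print(numbered.group(1), numbered.group(2), numbered.group(3))  # 2016 03 14

# Named groups use the (?P<name>...) syntax.
named = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2016-03")
print(named.group("year"), named.group("month"))  # 2016 03

# Named groups can still be accessed by index as well.
print(named.group(1))  # 2016
```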

Greedy and non-greedy matching

The normal way for regex searches to work is greedily, i.e. matching as much of the search string as possible. Here’s an example.

The pattern <.*> matched the whole string, right up to the second occurrence of >. However, if we only wanted to match the first <h1> tag, we can use the non-greedy qualifier *? that matches as little text as possible.

Now we’re only matching the first tag.
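The difference between the two qualifiers can be seen side by side. The HTML snippet is an assumed example.

```python
import re

text = "<h1>Title</h1>"

# Greedy: .* grabs as much as possible, up to the last >.
greedy = re.search(r"<.*>", text).group()
print(greedy)  # <h1>Title</h1>

# Non-greedy: .*? stops at the first > it can.
non_greedy = re.search(r"<.*?>", text).group()
print(non_greedy)  # <h1>
```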

The end

We certainly haven’t covered everything there is to know about regular expressions in this post, but we’ve covered enough to decipher the vast majority of patterns found in the wild, and to invent our own without falling back on cargo-cult copying and pasting.

However, there’s no need to reinvent the wheel, so if you find a good regex that does what you need, you may as well swipe it. Before you do though, test it out with a tool like Regexr or similar.

Python descriptors made simple

Descriptors, introduced in Python 2.2, provide a way to add managed attributes to objects. They are not used much in everyday programming, but it’s important to learn them to understand a lot of the “magic” that happens in the standard library and third-party packages.

The problem

Imagine we are running a bookshop with an inventory management system written in Python. The system contains a class called Book that captures the author, title and price of physical books.

Our simple Book class works fine for a while, but eventually bad data starts to creep into the system. The system is full of books with negative prices or prices that are too high because of data entry errors. We decide that we want to limit book prices to values between 0 and 100. In addition, the system contains a Magazine class that suffers from the same problem, so we want our solution to be easily reusable.


The descriptor protocol

The descriptor protocol is simply a set of methods a class must implement to qualify as a descriptor. There are three of them:

  • __get__(self, instance, owner)
  • __set__(self, instance, value)
  • __delete__(self, instance)

__get__ accesses a value stored in the object and returns it.

__set__ sets a value stored in the object and returns nothing.

__delete__ deletes a value stored in the object and returns nothing.

Using these methods, we can write a descriptor called Price that limits the value stored in it to between 0 and 100.

A few details in the implementation of Price deserve mentioning.

An instance of a descriptor must be added to a class as a class attribute, not as an instance attribute. Therefore, to store different data for each instance, the descriptor needs to maintain a dictionary that maps instances to instance-specific values. In the implementation of Price, that dictionary is self.values.

A normal Python dictionary stores references to objects it uses as keys. Those references by themselves are enough to prevent the object from being garbage collected. To prevent Book instances from hanging around after we are finished with them, we use the WeakKeyDictionary from the weakref standard module. Once the last strong reference to the instance passes away, the associated key-value pair will be discarded.

Using descriptors

As we saw in the last section, descriptors are linked to classes, not to instances, so to add a descriptor to the Book class, we must add it as a class variable.

The price constraint for books is now enforced.
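The pieces described in the last two sections can be sketched together as follows. This is one possible implementation under stated assumptions: the default price of 0 for an instance with no stored value, the error message and the sample books are choices made for this example.

```python
from weakref import WeakKeyDictionary


class Price:
    """Descriptor that limits a price to values between 0 and 100."""

    def __init__(self):
        # Maps each instance to its own price without keeping it alive.
        self.values = WeakKeyDictionary()

    def __get__(self, instance, owner):
        if instance is None:
            # Accessed on the class itself, e.g. Book.price.
            return self
        return self.values.get(instance, 0)

    def __set__(self, instance, value):
        if not 0 <= value <= 100:
            raise ValueError("Price must be between 0 and 100.")
        self.values[instance] = value

    def __delete__(self, instance):
        del self.values[instance]


class Book:
    # The descriptor is a class attribute shared by all instances.
    price = Price()

    def __init__(self, author, title, price):
        self.author = author
        self.title = title
        self.price = price  # goes through Price.__set__


gatsby = Book("F. Scott Fitzgerald", "The Great Gatsby", 12)
code = Book("Charles Petzold", "Code", 30)
print(gatsby.price, code.price)  # 12 30
```

Trying to set gatsby.price to 200 or to a negative number raises a ValueError and leaves the stored price unchanged.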

How descriptors are accessed

So far we’ve managed to implement a working descriptor that manages the price attribute on our Book class, but how it works might not be clear. It all feels a bit too magical, but not to worry. It turns out that descriptor access is quite simple:

  • When we try to evaluate b.price and retrieve the value, Python recognizes that price is a descriptor and calls Book.price.__get__.
  • When we try to change the value of the price attribute, e.g. b.price = 23, Python again recognizes that price is a descriptor and substitutes the assignment with a call to Book.price.__set__.
  • And when we try to delete the price attribute stored against an instance of Book, Python automatically interprets that as a call to Book.price.__delete__.

The number 1 descriptor gotcha

Unless we fully understand the fact that descriptors are linked to classes and not to instances, and therefore need to maintain their own mapping of instances to instance-specific values, we might be tempted to write the Price descriptor as follows:

But once we start instantiating multiple Book instances, we’re going to have a problem.

The key is to understand that there is only one instance of Price for Book, so every time the value in the descriptor is changed, it changes for all instances. That behaviour in itself is useful for creating managed class attributes, but it is not what we want in this case. To store separate instance-specific values, we need to use the WeakKeyDictionary.
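The gotcha can be demonstrated with a deliberately flawed descriptor that stores a single value on itself. The class name SharedPrice and the sample books are made up for this sketch.

```python
class SharedPrice:
    """A flawed descriptor: one stored value for every instance."""

    def __init__(self):
        self.value = 0

    def __get__(self, instance, owner):
        return self.value

    def __set__(self, instance, value):
        if not 0 <= value <= 100:
            raise ValueError("Price must be between 0 and 100.")
        # Stored on the descriptor, so it is shared by all books.
        self.value = value


class Book:
    price = SharedPrice()

    def __init__(self, title, price):
        self.title = title
        self.price = price


first = Book("The Great Gatsby", 10)
second = Book("Code", 20)

# Creating the second book overwrote the price of the first.
print(first.price)   # 20
print(second.price)  # 20
```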

The property built-in function

Another way of building descriptors is to use the property built-in function. Here is the function signature:

fget, fset and fdel are methods to get, set and delete attributes, respectively. doc is a docstring.

Instead of defining a single class-level descriptor object that manages instance-specific values, property works by combining instance methods from the class. Here is a simple example of a Publisher class from our inventory system with a managed name property. Each method passed into property has a print statement to illustrate when it is called.

If we make an instance of Publisher and access the name attribute, we can see the appropriate methods being called.
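A sketch of such a Publisher class is shown below, relying on the built-in signature property(fget=None, fset=None, fdel=None, doc=None). The method names, the printed messages and the sample publisher name are assumptions for this example.

```python
class Publisher:
    def __init__(self, name):
        # Store the real value under a private name.
        self._name = name

    def get_name(self):
        print("getting name")
        return self._name

    def set_name(self, value):
        print("setting name")
        self._name = value

    def delete_name(self):
        print("deleting name")
        del self._name

    # Combine the three methods into one managed attribute.
    name = property(get_name, set_name, delete_name, "The publisher's name.")


publisher = Publisher("Penguin")
current = publisher.name        # prints "getting name"
publisher.name = "Faber"        # prints "setting name"
updated = publisher.name        # prints "getting name"
del publisher.name              # prints "deleting name"
```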

That’s it for this basic introduction to descriptors. If you want a challenge, take what you have learned and try to reimplement the @property decorator. There is enough information in this post to allow you to figure it out.