Text to speech with Python 3 on Linux and OSX

Recently I had a requirement to synthesise speech from text on two different operating systems. Here is what I came up with.

OSX

Synthesising speech is a simple matter for OSX users because the operating system comes with the say command. We can use subprocess to call it.
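Here is a minimal sketch of what that subprocess call might look like. The helper names build_say_command and speak are my own, chosen for illustration, and speak will only do anything on a machine that actually has say installed.

```python
import subprocess


def build_say_command(text):
    """Return the argument list for the OSX say command."""
    return ["say", text]


def speak(text):
    """Speak text aloud. Only works where say is installed (OSX)."""
    # Passing a list avoids the shell, so quotes in text need no escaping.
    subprocess.run(build_say_command(text), check=True)

# On OSX: speak("Hello, world")
```

Splitting the command construction from the call makes it easy to check what will be run without making any noise.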

Linux

On Linux, there are a few different options. I like to use the espeak Python bindings when I can. You can install them on Ubuntu using apt-get.

Then use it like so:

espeak supports multiple languages, so if you are not dealing with English text, you need to pass in the language code. Unfortunately, it looks like the Python bindings don’t support that yet, but we can still use subprocess like we did on OSX.
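The subprocess approach for espeak might look something like this sketch. It assumes the espeak command-line program is installed and on your path. The function names are hypothetical, and the -v option is the one espeak uses to select a voice, which is normally a language code such as fr.

```python
import subprocess


def build_espeak_command(text, language="en"):
    """Return the argument list for the espeak command-line program."""
    # -v selects the voice, which for espeak is usually a language code.
    return ["espeak", "-v", language, text]


def speak(text, language="en"):
    """Speak text aloud in the given language. Requires espeak."""
    subprocess.run(build_espeak_command(text, language), check=True)

# On a machine with espeak installed: speak("Bonjour le monde", "fr")
```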

The list of available languages can be found on the espeak website here.

What to do after Codecademy

Today I want to offer my opinion on a question that lots of people have asked me since I started this blog. “What should I do after I have finished an introductory course on Codecademy?”

Codecademy is basic stuff

You should be proud to have finished your first course and dipped your toes into programming. It’s a great achievement, but you also have to be realistic: this is just the beginning. You have made your way to base camp, but you haven’t yet attempted to reach the summit.

The fact is that Codecademy, while providing a solid grounding in the basics, doesn’t go much farther than that. On its own, it only teaches you a fraction of the things that you need to know to program professionally (if that’s your goal).

So be proud, but be humble. There is so much to learn that you will never know everything. And that’s a good thing, as long as you like learning.

Identify your goals

Maybe it’s obvious, but the path you take from now should be dictated by what you want to be able to do. If your goal is to write web applications, you should get to grips with frameworks such as Django, Flask and Bottle. If your goal is to use Python for data science, you should be learning things like Numpy, Scipy and Pandas (and hitting the math books hard). If you want to create desktop software, you should learn PyQt and Tkinter.

Once you have made the decision about what area (or areas) to focus on, go and find some beginner materials for them. I’m not going to list any here, because there are just so many, but rest assured that they are easy to find.

(That reminds me of another point: Google is your friend. Programmers like to joke that the job is 80% Googling, and there is some truth to that, depending on what you’re building. So you’d better improve your Google-fu and learn how to ask good questions.)

Projects are key

Working through courses and tutorials, and reading technical books, are important ways of improving your knowledge, but there is really no substitute for trying to build something and picking up what you need to know as you go along. The skills of identifying bugs quickly, deciphering arcane error messages and knowing when to stop fiddling with your code and move on are as important to programming as knowledge of advanced language features or algorithms.

With that in mind, I suggest you set yourself a project. Maybe you already have something in mind, in which case, awesome! Or maybe you don’t. In that case, take a look at this list of projects and pick one that seems fun and achievable.

http://www.dreamincode.net/forums/topic/78802-martyr2s-mega-project-ideas-list/

Learn some real computer science

Assuming you’re like a lot of the people picking up coding these days and your goal is to make web applications, then you can go a long way without even thinking about the theoretical underpinnings of programming.

Don’t be that guy. Apart from the often mooted practical considerations of writing efficient code or gaining access to higher paying jobs, there is a world of elegance and beauty in programming that you may not have expected. I am going to suggest a few resources for this because people have remarked that they don’t know where to start.

Code: The Hidden Language of Computer Hardware and Software

This is the amuse-bouche. Just read it straight through for fun and try to become inspired!

NAND2Tetris

I can’t say enough good things about this course. It took me from a vague understanding of how computers executed my code to being able to conceptualize from the ground up how it all works. If you follow the course the whole way through you will build a (simulated) computer from first principles, create an assembler for it, a compiler for a Java-like language, and a basic operating system. Do it!

Coursera – Algorithms Part 1

This course, taught by Robert Sedgewick, will teach you a basic algorithmic toolbox. The algorithms it deals with are sort of like a “Greatest Hits” collection. Knowledge of these algorithms and the design techniques behind them forms a kind of lingua franca among programmers. For that reason, they are also really popular interview questions.

Phew! If you get to grips with all that stuff, you’ll be doing well!

Learn source control

I often see people advising beginners to learn Git before they do anything else. Git is a great program, and version control is one of the key skills needed to develop software at a professional level, but I can’t agree with the sentiment. The fact is that Git is extremely complicated. Even experienced programmers end up having to trawl the documentation and StackOverflow to figure out how to do certain things. So I suggest you learn Mercurial instead. It has many of the advantages of Git while being much more user-friendly. As a bonus, it’s written in Python!

The key thing you need to acquire at this stage of your education is a habit of using version control to track the development of your personal projects. You need to go through the process of changing code, breaking your whole project and being able to revert to a working state so that you can really understand why people use version control. For that purpose, the steep learning curve of Git is just going to put you off.

I’m not saying you shouldn’t learn Git later. You should, especially if you want to participate in open source projects. But at that point knowing Mercurial is only going to help, because you will already understand the concepts.

Learn an IDE and/or a text editor

Now is also the time to learn about IDEs and text editors. If you’ve been hanging around on programming forums (you should be), you will no doubt have got an inkling of the holy war between Vi and Emacs users, and the strong opinions of people who think IDEs are unnecessary bloatware. I don’t want to come down on either side of these arguments. I just want to suggest that you are now at the point where you should be investigating them for yourself.

But for my money, PyCharm is an amazing piece of software. 😀

The end

I hope I’ve given you something to think about with this article. If it all seems like a lot of work, well, it is! I’ll let Peter Norvig provide some perspective.

As always, I’m happy to answer your emails.

Comparing files in Python using difflib

Everybody knows about the diff command in Linux, but not everybody knows that the Python standard library contains a module that provides similar functionality.

A basic diff utility

First, let’s see what a minimal diff implementation using difflib might look like:

The context_diff function takes two sequences of strings – here provided by readlines – and optional fromfile and tofile keyword arguments, and returns a generator that yields strings in the “context diff” format, which is a way of showing changes plus a few neighbouring lines for context.
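To make that concrete, here is a small sketch of context_diff at work. It uses two in-memory lists of lines instead of calling readlines on real files, and the file names and the helper name context_diff_lines are placeholders of my own.

```python
import difflib


def context_diff_lines(from_lines, to_lines,
                       fromfile="file1.txt", tofile="file2.txt"):
    """Return the context diff of two lists of lines as a list of strings."""
    # context_diff returns a generator, so collect it into a list.
    return list(difflib.context_diff(from_lines, to_lines,
                                     fromfile=fromfile, tofile=tofile))


old = ["one\n", "two\n", "three\n"]
new = ["one\n", "too\n", "three\n"]

diff = context_diff_lines(old, new)
print("".join(diff), end="")
```

In a real utility you would get the two lists from open(path).readlines() and take the paths from the command line.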

The library also supports other diff formats, such as ndiff.

Let’s use the utility to compare two versions of F. Scott Fitzgerald’s famous conclusion to The Great Gatsby.

The exclamation marks (!) denote the lines with changes on them. file1.txt is of course the version we know and love.

Fuzzy matches

That’s not all difflib can do. It also lets you check for “close enough” matches between text sequences.
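Two pieces of the module are handy for this: get_close_matches and SequenceMatcher. The sketch below borrows the sample inputs used in the official difflib documentation. I can't be sure these are the calls the post originally showed.

```python
import difflib

words = ["ape", "apple", "peach", "puppy"]

# Return the candidates that are similar enough to "appel", best match first.
matches = difflib.get_close_matches("appel", words)
print(matches)  # ['apple', 'ape']

# ratio() gives a similarity score between 0.0 and 1.0.
ratio = difflib.SequenceMatcher(None, "abcd", "bcde").ratio()
print(ratio)  # 0.75
```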

When I saw this first, I immediately thought “Levenshtein Distance”, but it actually uses a different algorithm. Here’s what the documentation says about it:

The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980’s by Ratcliff and Obershelp under the hyperbolic name “gestalt pattern matching”. The basic idea is to find the longest contiguous matching subsequence that contains no “junk” elements (R-O doesn’t address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.

HTML diffs

The module includes a class called HtmlDiff that can be used to generate diff tables for files. This would be useful, for instance, for building a front end to a code review tool. This is the coolest thing in the module, in my opinion.

The class also has a method called make_file that outputs an entire HTML file, not just the table.
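A rough sketch of both methods, using short in-memory line lists and made-up file descriptions rather than the Gatsby files:

```python
import difflib

old = ["one\n", "two\n", "three\n"]
new = ["one\n", "too\n", "three\n"]

html_diff = difflib.HtmlDiff()

# make_table returns just the <table> markup as a string.
table = html_diff.make_table(old, new,
                             fromdesc="file1.txt", todesc="file2.txt")

# make_file wraps the same table in a complete HTML document.
page = html_diff.make_file(old, new,
                           fromdesc="file1.txt", todesc="file2.txt")
```

You could write page to a file and open it in a browser, or embed table in a template of your own.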

Here is what the rendered table looks like:

[Image: difflib_html]

Go forth and diff!

There are a few other subtleties, but I have covered the main functionality in this post. Check out the official documentation for difflib here.

The two ways to sort a list in Python

Today I’m going to take a look at another element of the language that tends to trip up Python beginners – the difference between sorted(my_list) and my_list.sort().

The built-in function sorted sorts the list that is passed into it, and returns a new list while preserving the old one.

On the other hand, the sort method on list objects sorts the list in place, destroying the original ordering.
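The difference is easiest to see side by side. This is a small sketch with a throwaway list:

```python
my_list = [3, 1, 2]

new_list = sorted(my_list)      # builds and returns a new list
unchanged = list(my_list)       # snapshot: my_list is still [3, 1, 2] here

return_value = my_list.sort()   # sorts in place and returns None

print(new_list, unchanged, return_value, my_list)
```

The None return value is the thing that catches people out: writing my_list = my_list.sort() throws the list away.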

Using a list’s sort method is equivalent to assigning the output of sorted back to the original list.

However, that particular way of doing things is frowned upon. Only use sorted when you need to keep the original list.

sorted and list.sort both accept the key and reverse parameters. The cmp parameter, which allowed you to pass in a custom comparator function, has been removed in Python 3. key should be used instead.
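Here is a short sketch of both parameters. The last part uses functools.cmp_to_key, which the post doesn't mention. It is the standard library's bridge for old-style comparator functions, included here as one possible migration path. The compare_lengths function is a made-up example.

```python
from functools import cmp_to_key

words = ["banana", "Cherry", "apple"]

# key is called once per element and the results are compared.
by_lowercase = sorted(words, key=str.lower)

# reverse=True sorts in descending order.
longest_first = sorted(words, key=len, reverse=True)


def compare_lengths(a, b):
    """Old-style comparator: negative, zero or positive."""
    return len(a) - len(b)


# cmp_to_key wraps a comparator so it can be passed as key.
by_length = sorted(words, key=cmp_to_key(compare_lengths))
```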

Multi-line strings in Python

At some point, you will want to define a multi-line string and find that the obvious solutions just don’t feel clean. In this post, I’m going to take a look at three ways of defining them and give you my recommendation.

Concatenation

The first way of doing it, and the way that immediately comes to mind, is to just add the strings to each other, like so:

In my opinion, this looks extremely ugly. You can make it a bit better by omitting the + signs. This also works:
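For reference, here is roughly what the two variants look like. I am assuming backslash line continuations, which is one common way to write them.

```python
# Explicit concatenation with + signs.
with_plus = "Line one\n" + \
            "Line two\n" + \
            "Line three"

# Adjacent string literals are joined automatically, so the + can go.
without_plus = "Line one\n" \
               "Line two\n" \
               "Line three"

print(with_plus == without_plus)
```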

That’s better, but still a bit of an eyesore. Let’s move on.

Multi-line string syntax

Another way is to use the built-in multi-line string syntax, like so:

That’s much better than concatenation, but it has one conspicuous wart. You can’t indent the subsequent lines to be at the same level of indentation as the first one. The space will be interpreted as part of the string. The first line will be flush with the margin and the subsequent lines will be indented.
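A quick sketch of the wart, with the string defined inside a function so that the indentation matters (get_message is just an illustrative name):

```python
def get_message():
    # The continuation lines are indented to line up with the code,
    # and those four spaces end up inside the string.
    message = """Line one
    Line two
    Line three"""
    return message


print(get_message())
```

Printing the result shows the first line flush with the margin and the other two indented by four spaces.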

Tuple syntax

There is another way to do it that doesn’t suffer from the ugliness of concatenation or the indentation wart of the multi-line syntax, and that is to use the tuple syntax.

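It works because Python joins adjacent string literals into one string, and the parentheses let the expression span several lines. Strictly speaking there is no tuple here, because there are no commas: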

Note that you have to add the line breaks into the strings, because they’re not put in automatically. Nonetheless, I think this is by far the nicest, most readable method.
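Here is a sketch of the parenthesised form. Adjacent string literals inside the parentheses are joined into a single string, so the result is a plain str. The function name is again just for illustration.

```python
def get_message():
    # No commas between the literals, so this is one string, not a tuple.
    message = (
        "Line one\n"
        "Line two\n"
        "Line three"
    )
    return message


print(get_message())
```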

How to modify a list in place in Python

Did you ever see a piece of code that looks like this?

The purpose of it is to remove any even numbers from the list numbers. And it looks like it should work, right?
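A sketch of the kind of loop I mean. The exact contents of numbers are my assumption, chosen to contain repeated even values, which is what makes the bug visible.

```python
numbers = [1, 2, 2, 3, 4, 4, 5]

for elem in numbers:
    if elem % 2 == 0:
        numbers.remove(elem)  # mutates the list while we are iterating over it

print(numbers)  # [1, 2, 3, 4, 5]
```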

The problem explained

For most of you, the mistake will already be obvious, but beginners can expect to scratch their heads when they examine the list and see that it still has some even numbers in it.

Here’s what we wanted to see:

What is going on here? We can illuminate the problem if we write out the code without any of the syntactic niceties of the Python for loop. (Keep in mind that this is merely for explanatory purposes and is not the kind of code you should be writing.)
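Spelled out with an explicit counter, the loop is roughly equivalent to the following sketch (again assuming a list with repeated even numbers):

```python
numbers = [1, 2, 2, 3, 4, 4, 5]

i = 0
while i < len(numbers):          # the length is re-checked on every pass
    elem = numbers[i]
    if elem % 2 == 0:
        numbers.remove(elem)     # everything after this element shifts left
    i += 1                       # but the counter still moves forward

print(numbers)  # [1, 2, 3, 4, 5]
```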

Be your own debugger

Let’s “step through” the while loop to see exactly how the execution goes wrong.

  • On the first iteration, the loop counter i is equal to 0. 1 (the value of the first element in numbers) is assigned to elem. 1 is not divisible by 2, so the if block is not executed.
  • On the second iteration, i is equal to 1. 2 (the value of the second element in numbers) is assigned to elem. 2 is clearly divisible by 2, so the if block is executed and the second element is removed from the list. The length of the list is reduced by 1. The element that was at the third position is now at the second position, and so on. The list now looks like this.

  • On the third iteration, i is equal to 2. 3 (the value of the third element in numbers) is assigned to elem. The 2 that used to be the third element, but is now the second, has been skipped entirely. It can never be removed from the list.
  • The 4 in the sixth position of the initial list is skipped in the same way.

Now that you understand how the code is going wrong, let’s see how to fix it.

How to fix it

There are several ways to refactor the code to give us the output we want. Let’s examine them.

First, we could use a list comprehension to build up a new list that contains only the elements in numbers that are not divisible by 2. This is by far the simplest and most common way to achieve what we want. Here’s what it looks like:

A subtlety of this method is that it creates a completely new list and makes numbers point to it. If you had another name pointing to the old list, it will still point there. Here’s what I mean:
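A sketch of the list comprehension together with the aliasing subtlety. The name alias and the contents of the list are my own choices for the example.

```python
numbers = [1, 2, 2, 3, 4, 4, 5]
alias = numbers                      # a second name for the same list

# Build a new list and rebind the name numbers to it.
numbers = [n for n in numbers if n % 2 != 0]

print(numbers)  # [1, 3, 5]
print(alias)    # [1, 2, 2, 3, 4, 4, 5], the old list is untouched
```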

That may or may not be a problem, depending on the program, but if it is crucial to modify the list in place without copying anything, we can iterate backwards. While we’re at it, we can use del with an index instead of remove, which needlessly searches through the list for the first element with the given value:
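The backwards version might look like this sketch. Because we walk from the end towards the start, deleting an element only shifts items we have already visited.

```python
numbers = [1, 2, 2, 3, 4, 4, 5]
alias = numbers                      # a second name for the same list

# Iterate over the indexes from last to first.
for i in range(len(numbers) - 1, -1, -1):
    if numbers[i] % 2 == 0:
        del numbers[i]               # delete by index, no searching needed

print(numbers)  # [1, 3, 5]
print(alias)    # [1, 3, 5], same list object, modified in place
```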

So which one of these options should you use? If you are ok with creating a completely new list, then you should use the list comprehension. It’s by far the clearest and most idiomatic solution. Otherwise, if modifying the original list in place is crucial, you should go for the second version.

The Python Help System

A few years ago I had an interview with a company using Python for their main product. At the time I was a beginner, so I wasn’t able to answer all their questions. One of the questions I choked on was this:

– If you’re given a new Python package and you don’t know how it works, how would you figure it out?

I said I would Google it and see if there was any documentation online. They followed up with:

– What would you do if you had no internet connection?

I told them I would read the code and see what I could learn from it. That wasn’t the answer they were looking for.

Python’s online help system

Python includes a built-in help system based on the pydoc module. In a terminal, type:

A help page will be printed to the console:

All pydoc does is generate the help page based on the docstrings in the module.
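You can also get at the same text from inside Python. The sketch below uses pydoc.render_doc with the plain text renderer. The json module is an arbitrary choice of mine, and render_doc is a less prominent part of pydoc, so treat this as an illustration rather than the canonical way in. From a terminal, python -m pydoc json should print the same page.

```python
import pydoc

# Render the help page for the json module as a plain string,
# without the terminal formatting pydoc normally adds.
help_text = pydoc.render_doc("json", renderer=pydoc.plaintext)

print(help_text[:300])  # show just the start of the page
```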

Happily, you’re not stuck scrolling through the terminal, man page-style. You can start a local web server that serves the documentation in HTML format by typing:

Now if you go to http://localhost:8000 in your browser you will see an index page with links to the documentation pages for all the modules and packages installed on your system.

It will look something like this:

[Image: pydoc]

Getting help in IDLE

In the REPL (IDLE or whatever alternative you are using), you can access the same help using the help built-in function. This function is added to the built-in namespace (the things that are already defined when the interpreter starts up) by the site module, which is automatically imported during initialization.

Checking the attributes on an object

Sometimes you don’t need the full help text. You only want to see what attributes a certain object has so that you can get on with writing code. In that case, the dir function can come in handy.

dir works in two modes. The first is when it is invoked without any arguments. In that mode, it returns a list of the names defined in the local scope.

The second is when it is given an object as an argument. In that mode, it tries to return a list of relevant attributes from the object passed in to it. What that means depends on whether the object is a module or a class.

If it’s a module, dir returns a list of the module’s attributes. If it’s a class, dir returns a list of the class attributes, and the attributes of the base classes.
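A short sketch of both modes. The Base and Child classes are made up for the example.

```python
import math

local_names = dir()         # no argument: names in the current scope
math_names = dir(math)      # a module: the module's attributes


class Base:
    base_attr = 1


class Child(Base):
    child_attr = 2


# A class: its own attributes plus those of its base classes.
child_names = dir(Child)

print("sqrt" in math_names)
print("base_attr" in child_names, "child_attr" in child_names)
```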

A more useful dir function

Usually when I need dir, I also want to know the types of the object’s attributes. Here’s a function that annotates the output of the normal dir function with the string names of the types of each returned attribute:
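One possible version of such a function is sketched below. The name typed_dir is my own, and the function simply pairs each name from dir with the name of its type.

```python
import math


def typed_dir(obj):
    """Return (attribute name, type name) pairs for everything in dir(obj)."""
    return [(name, type(getattr(obj, name)).__name__) for name in dir(obj)]


for name, type_name in typed_dir(math)[:5]:
    print(name, type_name)
```

Note that getattr can raise for unusual objects whose dir lists attributes they can't actually produce, so a more defensive version would guard that call.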

I’ll leave it there for now. Don’t forget to play around with the help system to get a feel for it. You’ll be glad of it next time you’re stuck somewhere without an internet connection and want to do some coding.

How to obfuscate Python source code

A lot of the Python code you will come across is open source. The whole point is to distribute it freely, share knowledge and let people play around with it and learn from it.

Sometimes, though, you might want to prevent the end-user from reading the code. Maybe you are selling commercial software or maybe you just want to share the solution to a tricky coding challenge with your friends without giving the game away.

Whatever your reasons, there are a few approaches you can take:

Using pyobfuscate

One is obfuscation. I like a package called pyobfuscate. It transforms your normal and clearly written (right?) Python source code into new source code that is hard to read, by making changes to whitespace and names, stripping comments, removing functions, etc.

The package doesn’t seem to be on PyPI, but you can install it from the Github repo:

Let’s try it out. Save the following code in example.py:

Obfuscate it using the pyobfuscate command, which should be on your path now that you have installed the package:

The obfuscated code will be printed to the console:

That’s pretty illegible!

Unfortunately pyobfuscate only works on one source file at a time, so it’s not really suitable for large projects. It also appears to only work with Python 2 at the moment.

Distributing bytecode

Another, arguably easier, method is to just distribute the .pyc files. The Python standard library includes a compileall module that can scan your source directory and compile all of your files into Python bytecode. Then you can distribute them without the source files. The .pyc files can still be decompiled into source code, but the code will not be as readable as it was before.
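Here is a sketch of the compileall step, run against a temporary directory so that it is self-contained. The module name example.py and its contents are placeholders. I pass legacy=True so that the .pyc files are written next to the source files rather than into __pycache__, which, as far as I know, is the layout Python needs in order to import them once the source has been removed.

```python
import compileall
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as source_dir:
    # A stand-in for your real source tree.
    module_path = pathlib.Path(source_dir) / "example.py"
    module_path.write_text("def greet():\n    return 'hello'\n")

    # Compile every .py file under source_dir to bytecode.
    succeeded = compileall.compile_dir(source_dir, quiet=1, legacy=True)

    compiled_names = sorted(p.name for p in pathlib.Path(source_dir).iterdir())

print(compiled_names)
```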

One problem with this method is that the initial .py script that you run cannot be compiled in this way. You can solve this problem by making a simple wrapper script that gives away no information about your program.

These two methods are really just a deterrent, not a secure way of hiding the code.

If you want something a bit more robust, you should take a look at Nuitka, which compiles Python code to C++, so you can compile that and just distribute the executable. It seems to be broadly compatible with different libraries and different versions of Python.

How to read a file properly in Python

Very often beginning and even experienced Python programmers will read a file like this:

In most cases code like this will work fine. The read() method will read the whole file into memory and store it in the contents variable. This is known as “slurping”.
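A sketch of slurping, written with a with block so the file is closed automatically. The temporary file exists only to make the example self-contained.

```python
import os
import tempfile

# Create a small throwaway file to read from.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

with open(path) as f:       # the with block closes the file for us
    contents = f.read()     # the whole file as one string

os.remove(path)
print(contents, end="")
```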

Likewise, the readlines() method reads the whole file into memory, one line at a time (using the readline() method), appending each line to a list. So this:

will return the same thing as:

(Note that in text mode Python translates platform-specific line endings to “\n” as it reads, so splitting on “\n” works across platforms. os.linesep is not needed here.)
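A quick way to check the equivalence, using an in-memory io.StringIO as a stand-in for an open text file. For the manual version this sketch uses str.splitlines(keepends=True), which is my substitution rather than the original snippet. It keeps the trailing newline on each line, the way readlines() does.

```python
import io

f = io.StringIO("first line\nsecond line\n")  # stand-in for an open file

lines = f.readlines()                # a list of lines, newlines included

f.seek(0)                            # rewind so we can read again
same_lines = f.read().splitlines(keepends=True)

print(lines == same_lines)
```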

But what happens when we want to read a massive file that won’t fit in memory? It turns out we have some pretty nice options, based on whether we want to read the file line by line or in fixed-sized chunks.

Firstly, here is the Pythonic way to read a file line by line without slurping.
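In sketch form, with an io.StringIO standing in for a file opened with open. File objects are their own iterators, so a plain for loop hands you one line at a time.

```python
import io

f = io.StringIO("first line\nsecond line\n")  # stand-in for open("big.txt")

line_lengths = []
with f:
    for line in f:                    # only one line is held at a time
        line_lengths.append(len(line))

print(line_lengths)  # [11, 12]
```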

If you mostly work with text files, this is probably the best general purpose file reading idiom to use. While reading files piece by piece will always be slower than slurping the whole file into memory, that probably won’t make a difference for you.

On the other hand, this method assumes that your file has line breaks, which may not be the case. If you really want to make sure that your program only reads a certain number of characters from the file, you can take advantage of the fact that the file’s read() method takes an optional parameter, read(n). When n is omitted, the method will read to the end of the file, but when n is present, it will only read the specified number of characters. Here’s an example:

In this snippet we’re relying on a particular behaviour of the read() method: it returns an empty string when the end of the file is reached. When reading files of unknown length chunk by chunk, we use that fact to break out of the loop.
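A sketch of that loop. The function name, the tiny chunk size and the in-memory file are all stand-ins I picked to keep the example short.

```python
import io


def read_chunks(f, chunk_size=4):
    """Collect the chunks read from f, chunk_size characters at a time."""
    chunks = []
    while True:
        chunk = f.read(chunk_size)
        if chunk == "":          # an empty string means end of file
            break
        chunks.append(chunk)     # a real program would process the chunk here
    return chunks


chunks = read_chunks(io.StringIO("abcdefghij"))
print(chunks)  # ['abcd', 'efgh', 'ij']
```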

This code is a bit verbose, but it does what we want. Is there a more compact way to write it? It turns out that there is. The following snippet is functionally identical to the one above:
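The compact form relies on the two-argument version of iter. A sketch, again with a small in-memory file and a chunk size of 4:

```python
import io

f = io.StringIO("abcdefghij")     # stand-in for an open text file

# Call f.read(4) repeatedly until it returns the sentinel, "".
chunks = list(iter(lambda: f.read(4), ""))

print(chunks)  # ['abcd', 'efgh', 'ij']
```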

To understand this, it would help to know what is going on with the iter() function. The iter() function has one mandatory argument and one optional argument, and it behaves differently depending on whether one or two arguments are provided.

If only one argument is provided, then that argument must be an object that supports the iterator protocol, i.e. it must implement the __iter__() method, or the sequence protocol, i.e. it must implement the __getitem__() method.

If, on the other hand, the second “sentinel” argument is provided, then the first argument must be a callable and the iterator returned by iter(callable, sentinel) will behave as follows:

  • When the iterator’s __next__() method is called (next() in Python 2), the callable passed in as the first argument will be called.
  • If the value returned by the callable is equal to the sentinel, then StopIteration is raised. Otherwise the value is returned.
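The two rules above can be watched in action by driving the iterator by hand with the built-in next function. This is a sketch using a six-character in-memory file.

```python
import io

f = io.StringIO("abcdef")             # stand-in for an open text file
chunk_iterator = iter(lambda: f.read(4), "")

first = next(chunk_iterator)          # the lambda is called and returns "abcd"
second = next(chunk_iterator)         # the lambda returns "ef"

try:
    next(chunk_iterator)              # the lambda returns "", the sentinel
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, exhausted)
```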

Here’s a version of the last example modified slightly to deal with binary files:
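A sketch of the binary variant. With a real file you would open it in "rb" mode. Here an io.BytesIO plays that role. The only change from the text version is that the sentinel is an empty bytes object.

```python
import io

f = io.BytesIO(b"\x00\x01\x02\x03\x04")   # stand-in for open(path, "rb")

# In binary mode read() returns bytes, so the sentinel must be b"".
chunks = list(iter(lambda: f.read(2), b""))

print(chunks)
```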

The ‘b’ prefix before the sentinel string denotes a Python 3 bytes literal and deserves some explanation. Here is what the Python 2.x documentation says about it:

A prefix of ‘b’ or ‘B’ is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A ‘u’ or ‘b’ prefix may be followed by an ‘r’ prefix

And the Python 3.3 docs:

Bytes literals are always prefixed with ‘b’ or ‘B’; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.

If we develop our code a little further, we can hide the details of the lambda expression, the iter function, and the sentinel. The following generator takes a file object and an optional chunk size and lets us iterate through the file as chunks:

We can use it like so:

To specify a chunk size, just pass in the optional size argument to the generator. This reads a file in 1k chunks, for instance:
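Here is one way such a generator and its two uses might look. The name read_in_chunks and the default size of 4096 are assumptions of mine. I have also written it with a plain while loop and an emptiness check instead of iter with a sentinel, so that the same function copes with both text and binary files.

```python
import io


def read_in_chunks(file_object, size=4096):
    """Yield successive chunks of at most size characters (or bytes)."""
    while True:
        chunk = file_object.read(size)
        if not chunk:            # "" or b"" both signal the end of the file
            return
        yield chunk


f = io.StringIO("x" * 2500)      # stand-in for a large open file

# Default chunk size: the whole 2500 characters fit in one chunk.
default_sizes = [len(chunk) for chunk in read_in_chunks(f)]

# Explicit 1k chunks.
f.seek(0)
one_k_sizes = [len(chunk) for chunk in read_in_chunks(f, 1024)]

print(default_sizes, one_k_sizes)
```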

So there you have it. These basic file reading techniques cover the majority of what you will need in your day to day Python usage.