How to read a file properly in Python

Very often, beginning and even experienced Python programmers will read a file like this:
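A minimal sketch of that pattern, assuming a hypothetical sample file that the snippet creates for itself (the path and contents are illustrative only):

```python
import os
import tempfile

# Create a small sample file so the snippet runs on its own
# (the file name and its contents are assumptions for illustration).
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

# The common idiom: read the entire file in a single call.
with open(path) as f:
    contents = f.read()
```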

In most cases code like this will work fine. The read() method will read the whole file into memory and store it in the contents variable. This is known as “slurping”.

Likewise, the readlines() method reads the whole file into memory, one line at a time (using the readline() method), appending each line to a list. So this:
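For instance, a sketch of the readlines() call, assuming a hypothetical two-line sample file that the snippet writes first:

```python
import os
import tempfile

# Illustrative sample file (an assumption, not part of any real project).
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

with open(path) as f:
    lines = f.readlines()  # each element keeps its trailing "\n"
```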

will return the same thing as:
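One way to sketch the read()-based equivalent is to slurp the text and split it while keeping the line endings. This assumes the file uses only “\n” line breaks, since str.splitlines() also splits on a few other separators such as form feeds:

```python
import os
import tempfile

# Illustrative sample file (an assumption for the sake of a runnable snippet).
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

with open(path) as f:
    lines_from_readlines = f.readlines()

with open(path) as f:
    # keepends=True preserves the "\n" on each line, as readlines() does.
    lines_from_read = f.read().splitlines(keepends=True)
```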

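(Note that a file opened in text mode has its platform-specific line endings translated to “\n” as it is read, so a “\n” literal is already portable. Splitting on os.linesep would actually fail on Windows, where it is “\r\n”.)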

But what happens when we want to read a massive file that won’t fit in memory? It turns out we have some pretty nice options, based on whether we want to read the file line by line or in fixed-size chunks.

Firstly, here is the Pythonic way to read a file line by line without slurping.
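A minimal sketch of the idiom, assuming a hypothetical sample file: the file object is its own iterator, so a for loop pulls in one line at a time instead of loading everything.

```python
import os
import tempfile

# Illustrative sample file (an assumption for the sake of a runnable snippet).
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("alpha\nbeta\ngamma\n")

stripped_lines = []
with open(path) as f:
    for line in f:  # only the current line needs to be held in memory
        stripped_lines.append(line.rstrip("\n"))
```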

If you mostly work with text files, this is probably the best general-purpose file reading idiom to use. Reading a file piece by piece can be slower than slurping the whole file into memory, but that probably won’t make a difference for you.

On the other hand, this method assumes that your file has line breaks, which may not be the case. If you really want to make sure that your program only reads a certain number of characters from the file, you can take advantage of the fact that the file’s read() method takes an optional parameter, read(n). When n is omitted, the method will read to the end of the file, but when n is present, it will read at most the specified number of characters (or bytes, for a file opened in binary mode). Here’s an example:
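A sketch of such a loop, assuming a hypothetical sample file and a deliberately tiny chunk size so the chunking is visible:

```python
import os
import tempfile

# Illustrative sample file without any line breaks (an assumption).
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("abcdefghij")

chunk_size = 4  # unrealistically small, chosen only for the demonstration
chunks = []
with open(path) as f:
    while True:
        chunk = f.read(chunk_size)
        if chunk == "":  # read() returns an empty string at end of file
            break
        chunks.append(chunk)
```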

In this snippet we’re relying on a particular behaviour of the read() method: it returns an empty string when the end of the file is reached. When reading files of unknown length chunk by chunk, we use that fact to break out of the loop.

This code is a bit verbose, but it does what we want. Is there a more compact way to write it? It turns out that there is. The following snippet is functionally identical to the one above:
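A sketch of the compact form, using the two-argument iter() with a lambda and an empty-string sentinel. The sample file and the chunk size are assumptions made for illustration:

```python
import os
import tempfile

# Illustrative sample file without any line breaks (an assumption).
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("abcdefghij")

chunk_size = 4  # unrealistically small, chosen only for the demonstration
chunks = []
with open(path) as f:
    # iter() keeps calling the lambda until it returns the sentinel "".
    for chunk in iter(lambda: f.read(chunk_size), ""):
        chunks.append(chunk)
```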

To understand this, it would help to know what is going on with the iter() function. The iter() function has one mandatory argument and one optional argument, and it behaves differently depending on whether one or two arguments are provided.

If only one argument is provided, then that argument must be an object that supports the iteration protocol, i.e. it must implement the __iter__() method, or the sequence protocol, i.e. it must implement the __getitem__() method with integer arguments starting at 0.
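Both single-argument cases can be sketched as follows. Countdown is a hypothetical class written only for this illustration; it defines __getitem__() but not __iter__():

```python
class Countdown:
    """Supports only the sequence protocol (hypothetical example class)."""

    def __init__(self, start):
        self.start = start

    def __getitem__(self, index):
        # iter() calls this with 0, 1, 2, ... until IndexError is raised.
        if index >= self.start:
            raise IndexError(index)
        return self.start - index


from_iterable = list(iter([1, 2, 3]))      # list implements __iter__()
from_sequence = list(iter(Countdown(3)))   # falls back to __getitem__()
```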

If, on the other hand, the second “sentinel” argument is provided, then the first argument must be a callable and the iterator returned by iter(callable, sentinel) will behave as follows:

  • When the iterator’s __next__() method is called (next() in Python 2), the callable passed in as the first argument is called with no arguments.
  • If the value returned by the callable is equal to the sentinel, then StopIteration is raised; otherwise the value is returned.
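The sentinel behaviour can be seen without any files at all. In this sketch the callable is the popleft method of a deque, and 0 is chosen arbitrarily as the sentinel:

```python
from collections import deque

queue = deque([3, 1, 4, 0, 5])

# popleft() is called repeatedly; iteration stops when it returns 0.
before_sentinel = list(iter(queue.popleft, 0))
```

Note that the sentinel itself is consumed from the queue but never yielded, so only the final 5 is left behind.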

Here’s a version of the last example modified slightly to deal with binary files:
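A sketch of the binary variant, assuming a hypothetical sample file: the file is opened in "rb" mode and the sentinel becomes an empty bytes literal, because read() now returns bytes rather than str.

```python
import os
import tempfile

# Illustrative binary sample file (an assumption for a runnable snippet).
path = os.path.join(tempfile.mkdtemp(), "example.bin")
with open(path, "wb") as f:
    f.write(bytes(range(10)))

chunk_size = 4  # unrealistically small, chosen only for the demonstration
chunks = []
with open(path, "rb") as f:
    # In binary mode, end of file is signalled by b"" rather than "".
    for chunk in iter(lambda: f.read(chunk_size), b""):
        chunks.append(chunk)
```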

The ‘b’ prefix before the sentinel string denotes a Python 3 bytes literal and deserves some explanation. Here is what the Python 2.x documentation says about it:

A prefix of ‘b’ or ‘B’ is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A ‘u’ or ‘b’ prefix may be followed by an ‘r’ prefix.

And the Python 3.3 docs:

Bytes literals are always prefixed with ‘b’ or ‘B’; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.

If we develop our code a little further, we can hide the details of the lambda expression, the iter function, and the sentinel. The following generator takes a file object and an optional chunk size and lets us iterate through the file as chunks:
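A sketch of such a generator. The name read_in_chunks and the 4096-byte default are assumptions, and the b"" sentinel means it expects a file opened in binary mode (with a text-mode file the sentinel would never match and the loop would not end):

```python
import io


def read_in_chunks(file_object, size=4096):
    """Yield chunks of at most `size` bytes from a binary file object.

    Hypothetical helper: the name and default size are illustrative.
    """
    for chunk in iter(lambda: file_object.read(size), b""):
        yield chunk


# Quick check against an in-memory binary stream.
demo_chunks = list(read_in_chunks(io.BytesIO(b"abcdefghij"), 4))
```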

We can use it like so:
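A usage sketch, where read_in_chunks is a hypothetical name for the generator just described (defined inline so the snippet runs on its own) and the sample file is created only for the demonstration:

```python
import os
import tempfile


def read_in_chunks(file_object, size=4096):
    """Hypothetical chunking generator; expects a binary-mode file."""
    for chunk in iter(lambda: file_object.read(size), b""):
        yield chunk


# Illustrative binary sample file (an assumption).
path = os.path.join(tempfile.mkdtemp(), "example.bin")
with open(path, "wb") as f:
    f.write(b"x" * 10000)

total_bytes = 0
with open(path, "rb") as f:
    for chunk in read_in_chunks(f):  # default chunk size
        total_bytes += len(chunk)
```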

To specify a chunk size, just pass in the optional size argument to the generator. This reads a file in 1k chunks, for instance:
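A sketch of the 1k case, again with read_in_chunks as a hypothetical name for the generator (defined inline so the snippet is self-contained) and a sample file created only for the demonstration:

```python
import os
import tempfile


def read_in_chunks(file_object, size=4096):
    """Hypothetical chunking generator; expects a binary-mode file."""
    for chunk in iter(lambda: file_object.read(size), b""):
        yield chunk


# Illustrative 2500-byte binary sample file (an assumption).
path = os.path.join(tempfile.mkdtemp(), "example.bin")
with open(path, "wb") as f:
    f.write(b"x" * 2500)

chunk_lengths = []
with open(path, "rb") as f:
    for chunk in read_in_chunks(f, 1024):  # 1k chunks
        chunk_lengths.append(len(chunk))
```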

So there you have it. These basic file reading techniques cover the majority of what you will need in your day-to-day Python usage.
