Comparing files in Python using difflib

Everybody knows about the diff command in Linux, but not everybody knows that the Python standard library contains a module that implements the same algorithm.

A basic diff utility

First, let’s see what a minimal diff implementation using difflib might look like:

The context_diff function takes two sequences of strings – here provided by readlines – and optional fromfile and tofile keyword arguments, and returns a generator that yields strings in the “context diff” format, which is a way of showing changes plus a few neighbouring lines for context.

The library also supports other diff formats, such as ndiff.

Let’s use the utility to compare two versions of F. Scott Fitzgerald’s famous conclusion to The Great Gatsby.

The exclamation marks (!) denote the lines with changes on them. file1.txt is of course the version we know and love.

Fuzzy matches

That’s not all difflib can do. It also lets you check for “close enough” matches between text sequences.

When I saw this first, I immediately thought “Levenshtein Distance”, but it actually uses a different algorithm. Here’s what the documentation says about it:

The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980’s by Ratcliff and Obershelp under the hyperbolic name “gestalt pattern matching”. The basic idea is to find the longest contiguous matching subsequence that contains no “junk” elements (R-O doesn’t address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.

HTML diffs

The module includes a class called HtmlDiff that can be used to generate diff tables for files. This would be useful, for instance, for building a front end to a code review tool. This is the coolest thing in the module, in my opinion.

The class also has a method called make_file that outputs an entire HTML file, not just the table.

Here is what the rendered table looks like:

difflib_html

Go forth and diff!

There are a few other subtleties, but I have covered the main functionality in this post. Check out the official documentation for difflib here.

Download Mastering Decorators

Mastering_decorators_cover

Enjoyed this article? Join the newsletter and get Mastering Decorators - a gentle 22-page introduction to one of the trickiest parts of Python.

Weekly-ish. No spam. Unsubscribe any time. Powered by ConvertKit