Using Regular Expressions and Generators to Tokenize a File

Generator Tricks for System Programmers really opened my eyes to the utility of Python generators.  Yesterday I took the opportunity to use one, and to bone up on my regular expression ninja skills to boot.

I received an email from a colleague containing references to reports our company had written, so that we could post them in the appropriate places on our website.  He clearly went to some pains to organize the information in a very human-readable manner:

CATEGORY 1
Report Title1 (report type year)
Report Title2 (report type year)
CATEGORY 2
...

Which is great.  The trouble is, I also need the existing URL associated with each of these reports.  And it may make sense to pull this list from a database, so I’d like to treat each one of those as a row with certain attributes:

NAME  | TYPE | YEAR | CATEGORY

Oh sure, I could copy the text into Excel and put everything into columns by hand, but where’s the fun in that?  Python to the rescue!

A Python generator is kind of like a list, but without the list.  Like a list, it is a sequence of things (ints, strings, other lists, objects, etc).  Like a list, you can iterate over each of its values.  But instead of storing the entire sequence in memory, it runs some code to produce the next value only when asked for it.  This means that generators can deal with very large sets of inputs more efficiently.  You’re not loading the whole input set into memory; instead you just ask for the ‘next’ value, and the generator decides what to spit out.  That might mean reading the next line of a file, adding or modifying an object, or calculating the next value in the Fibonacci sequence.
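As a minimal sketch of that idea (the function name `fibonacci` is just something I made up for illustration), here is a generator that works out Fibonacci numbers one at a time instead of building a list of them:

```python
from itertools import islice

def fibonacci():
    """Yield Fibonacci numbers forever, one per request."""
    a, b = 0, 1
    while True:
        yield a          # hand back one value, then pause here
        a, b = b, a + b  # resume from this point on the next request

# Nothing is computed until a value is asked for; islice asks for just ten.
first_ten = list(islice(fibonacci(), 10))
print(first_ten)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

The sequence itself is infinite, yet only the ten requested values are ever computed.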

Check out the System Admin’s Guide to Generators for a fuller explanation, or look up the documentation on the yield statement.  Python also has a nice shorthand, the generator expression, which looks very much like a list comprehension.  It often leads to pretty clean, readable code.  Here’s the key section of code from this example:


regex = re.compile(r"(?P<name>.*) \((?P<type>.*) (?P<year>\d*)\)")

with open('researchreports.txt') as infile:
    lines = (l.rstrip() for l in infile)
    matches = ((regex.search(l), l) for l in lines)
    newline = (matcherfunction(m) for m in matches)

First, we define the regular expression used to parse the line, and extract the report name, type, and year.  The next four lines do the actual work:

  • Open the file for reading
  • From the open file, spit out each line, stripping off whitespace from the end
  • For each of those lines, run the regular expression.  Spit out a tuple of (Match Object, original line).
  • For each of those tuples, run the matcher function, which spits out the tuple (name, type, year), or a one-item tuple holding the original line if that line wasn’t in the expected format.
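To try these steps without the real file, here is a small sketch that feeds the same pipeline a few made-up lines.  The report titles are invented for illustration, and the matcher function is my restatement of the behavior described in the last bullet; the regex and the three generator stages are the ones from the post.

```python
import re

regex = re.compile(r"(?P<name>.*) \((?P<type>.*) (?P<year>\d*)\)")

def matcherfunction(m):
    """Return (name, type, year) on a match, else a one-item tuple of the line."""
    match, line = m
    if match:
        return (match.group('name'), match.group('type'), match.group('year'))
    return (line,)

# Hypothetical sample input standing in for researchreports.txt.
sample = [
    "CATEGORY 1\n",
    "Widget Market Outlook (White Paper 2009)\n",
    "Gadget Survey (Brief 2008)\n",
]

# Each stage is lazy: no line is processed until something consumes `rows`.
lines = (l.rstrip() for l in sample)
matches = ((regex.search(l), l) for l in lines)
rows = (matcherfunction(m) for m in matches)

results = list(rows)  # consuming the last stage pulls each line through all three
for row in results:
    print(row)
```

Note that a generator can only be consumed once, so `rows` is empty after the call to `list`.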

That’s basically the end of the magic.  The rest is just writing out to a CSV file, and Python’s csv module takes care of that.  Here’s the whole code in case you’re interested.


#!/usr/bin/env python

from csv import writer
import re

def matcherfunction(m):
    """Given a (match, line) tuple, return the parsed fields, or the original line if there was no match."""
    match, line = m
    if match:
        return (match.group('name'), match.group('type'), match.group('year'))
    else:
        # No match (e.g. a category heading): pass the line through as a one-item row.
        return (line,)

# Matches lines of the form "Report Title (report type year)".
regex = re.compile(r"(?P<name>.*) \((?P<type>.*) (?P<year>\d*)\)")

with open('researchreports.txt') as infile:
    lines = (l.rstrip() for l in infile)
    matches = ((regex.search(l), l) for l in lines)
    newline = (matcherfunction(m) for m in matches)

    # newline='' is what the csv module expects on Python 3; it avoids blank rows on Windows.
    with open('researchreports.csv', 'w', newline='') as outf:
        csvfile = writer(outf)
        headers = ['NAME', 'TYPE', 'YEAR', 'URL', 'CATEGORIES']
        csvfile.writerow(headers)
        csvfile.writerows(newline)

Did I overcomplicate the problem?  Probably.  I could’ve just read in the whole file as a string and done a global regex search/replace.  But that would be problematic if I were dealing with a huge input file.  One advantage of this approach is that it doesn’t matter how many rows there are; it’ll march through them one at a time with no worries about memory limitations.  Second, it’ll be easier to modify and reuse this approach than a custom regex.  Finally, it apparently fits my mental model of how to solve the problem.
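As one example of that kind of modification, here is a rough sketch of how the category could be carried along with each report row.  This is not part of the script above: the function name `rows_with_category` and the sample titles are made up, and it assumes that any non-blank line that doesn’t match the report pattern is a category heading.

```python
import re

REPORT = re.compile(r"(?P<name>.*) \((?P<type>.*) (?P<year>\d*)\)")

def rows_with_category(lines):
    """Yield (name, type, year, category) for each report line.

    Assumption: a non-blank line that is not a report is a category heading.
    """
    category = ''
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        match = REPORT.search(line)
        if match:
            yield (match.group('name'), match.group('type'),
                   match.group('year'), category)
        else:
            category = line  # remember the heading for the rows that follow

# Hypothetical sample input; an open file object would work the same way.
sample = [
    "CATEGORY 1\n",
    "Report A (White Paper 2009)\n",
    "CATEGORY 2\n",
    "Report B (Brief 2008)\n",
]

category_rows = list(rows_with_category(sample))
for row in category_rows:
    print(row)
```

Because it is still a generator, it still reads one line at a time and can be handed straight to `writerows`.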

What I really want is for someone to show me how to do this in one line with awk/sed.  =)
