Archive for January 2009
Using Regular Expressions and Generators to Tokenize a File
Generator Tricks for System Programmers really opened my eyes to the utility of Python generators. Yesterday I took the opportunity to use one, and to bone up on my regular expression ninja skills to boot.
I received an email from a colleague containing some references to reports our company had written, so that we can post them in appropriate places on our website. He clearly went through some pains to organize the information in a very human-readable manner:
CATEGORY 1
Report Title1 (report type year)
Report Title2 (report type year)
CATEGORY 2
...
Which is great. The trouble is, I also need the existing URL associated with each of these reports. And it may make sense to pull this list from a database, so I’d like to treat each one of those as a row with certain attributes:
NAME | TYPE | YEAR | CATEGORY
Oh sure, I could copy the text into Excel and put everything into columns by hand, but where’s the fun of that? Python to the rescue!
A Python generator is kind of like a list, but without the list. Like a list, it is a sequence of things (ints, strings, other lists, objects, etc.), and like a list, you can iterate through each of its values. But instead of storing the entire sequence in memory, a generator evaluates some function to produce the next value only when you ask for it. This means generators can deal with very large sets of inputs more efficiently: you’re not loading the whole input set into memory, you just ask for the ‘next’ value, and the generator decides what to spit out. That might mean reading the next line of a file, adding or modifying an object, or calculating the next value in the Fibonacci sequence.
Check out the System Admin’s Guide to Generators for a fuller explanation, or look up the documentation on the yield statement. Python also has a nice shorthand, generator expressions, with a syntax very similar to list comprehensions. It often leads to pretty clean, readable code. Here’s the key section of code from this example:
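As a minimal sketch of that idea (the function name `fibonacci` is only an illustration, not part of the report-parsing script), here is a generator that computes Fibonacci numbers on demand. Nothing is calculated until a value is requested, so the sequence can be endless without using endless memory.

```python
from itertools import islice


def fibonacci():
    """Yield Fibonacci numbers one at a time, forever."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


# Only the values actually requested are ever computed.
first_ten = list(islice(fibonacci(), 10))
print(first_ten)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

# A generator expression can filter another generator, still lazily.
evens = (n for n in fibonacci() if n % 2 == 0)
first_four_evens = list(islice(evens, 4))
```

The `islice` call is what keeps the infinite generator from running forever: it stops asking for values after the requested count.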
regex = re.compile(r"(?P<name>.*) \((?P<type>.*) (?P<year>\d*)\)")
with open('researchreports.txt') as infile:
    lines = (l.rstrip() for l in infile)
    matches = ((regex.search(l), l) for l in lines)
    newline = (matcherfunction(m) for m in matches)
First, we define the regular expression used to parse the line, and extract the report name, type, and year. The next four lines do the actual work:
- Open the file for reading
- From the open file, spit out each line, stripping off whitespace from the end
- For each of those lines, run the regular expression. Spit out a tuple of (Match Object, original line).
- For each of those tuples, run the matcher function, which spits out the tuple (name, type, year), or the original line if it wasn’t in the right format.
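To see what the regular expression does with individual lines, here is a small sketch that applies the same pattern to two sample lines. The sample report title is made up for illustration, and `parse_line` is a hypothetical helper that folds the search and the matcher function into one step.

```python
import re

# Same pattern as in the script: name, then "(type year)" at the end.
regex = re.compile(r"(?P<name>.*) \((?P<type>.*) (?P<year>\d*)\)")


def parse_line(line):
    """Return (name, type, year) for a report line, else the line itself."""
    match = regex.search(line)
    if match:
        return (match.group('name'), match.group('type'), match.group('year'))
    # One-element tuple so every result can be written as a CSV row.
    return (line,)


# Hypothetical input lines in the format described above.
sample = ["CATEGORY 1", "Widget Market Outlook (white paper 2008)"]
rows = [parse_line(line) for line in sample]
for row in rows:
    print(row)
```

Category headings fall through unchanged because they contain no parenthesized type and year, while report lines are split into three fields.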
That’s basically the end of the magic. The rest is just writing out to a csv file. Python’s CSV module to the rescue. Here’s the whole code in case you’re interested.
#!/usr/bin/env python
from csv import writer
import re

def matcherfunction(m):
    """If we have a MatchObject, return the parsed output; if not, return the original line."""
    if m[0]:
        return (m[0].group('name'), m[0].group('type'), m[0].group('year'))
    else:
        # One-element tuple, so the csv writer treats the line as a single cell
        return (m[1],)

regex = re.compile(r"(?P<name>.*) \((?P<type>.*) (?P<year>\d*)\)")
with open('researchreports.txt') as infile:
    # Generators are lazy: nothing is read until writerows() pulls rows through
    lines = (l.rstrip() for l in infile)
    matches = ((regex.search(l), l) for l in lines)
    newline = (matcherfunction(m) for m in matches)
    # Keep the input file open while writing, since rows are produced on demand
    with open('researchreports.csv', 'w') as outf:
        csvfile = writer(outf)
        headers = ['','TYPE','YEAR', 'URL', 'CATEGORIES']
        csvfile.writerow(headers)
        csvfile.writerows(newline)
Did I overcomplicate the problem? Probably. I could’ve just read in the whole file as a string, and then done a global regex search/replace. But that would be problematic if I were dealing with a huge input file. The first advantage of this approach is that it doesn’t matter how many rows there are; it’ll march through them with no worries about memory limitations. Second, it’ll be easier to modify and reuse this approach than a custom regex. Finally, it apparently fits my mental model of how to solve the problem.
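As an example of that kind of modification, here is a rough sketch of a generator function that carries the most recent category heading along with each report, giving the NAME | TYPE | YEAR | CATEGORY rows described earlier. This is not what the script above does. It assumes that any non-blank line that fails the pattern is a category heading, and the name `rows_with_category` and the sample lines are made up.

```python
import re

REPORT = re.compile(r"(?P<name>.*) \((?P<type>.*) (?P<year>\d*)\)")


def rows_with_category(lines):
    """Yield (name, type, year, category), remembering the latest heading."""
    category = ''
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        match = REPORT.search(line)
        if match:
            yield (match.group('name'), match.group('type'),
                   match.group('year'), category)
        else:
            # Assumption: anything that is not a report line is a heading.
            category = line


# Hypothetical input; a real run would pass the open file object instead.
sample = [
    "CATEGORY 1\n",
    "Report Title1 (white paper 2007)\n",
    "CATEGORY 2\n",
    "Report Title2 (case study 2008)\n",
]
result = list(rows_with_category(sample))
```

Because it is still a generator, the output can be handed straight to a csv writer’s writerows() without building the full list first.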
What I really want is for someone to show me how to do this in one line with awk/sed. =)
You can’t make a Ferrari out of an El Camino…a UX analogy
Customer-centric focus can be a bit of a culture change in organizations that are not used to it, so it requires a concentrated and sustained effort to educate decision makers and project stakeholders on what exactly is user experience design, and how to design for, achieve, and measure good user experience.
We’ve had some success communicating the need for user experience design work in our company, but it seems we need to be clearer about the fact that it’s not something that can be tacked on at the end. I still hear people saying things like, “We’ll get the requirements, build the back end, and then you can come in and put a good user experience on it.”
UX practitioners see the problems with that kind of thinking: 1) User experience is not something you can ‘tack on’ at the end of a project. Rather, it requires a focused, coordinated effort between the UX team, development team, and business stakeholders to ensure product features meet user needs and expectations. 2) User experience is not a discrete part of a product that you bolt on during assembly. We can single out the user interface, the part of a product that users can see and feel, but that’s only part of the user experience. User experience is the quality of experience that a person has with a particular design, so it includes the user interface as well as database performance, information architecture, business processes, site metaphors, and delivery mechanism.
I’ve been trying to think of some type of analogy that encapsulates why we can’t come in and put a good user experience on a project. Here’s the best thing I’ve come up with so far:
Let’s say I’ve got a busted up, rusty, 1973 El Camino. I can paint it cherry red, put in leather seats, and paint a horse on the front, but I’m still not going to have the same experience (speed, handling, prestige) as if I were driving a Ferrari. To get the Ferrari experience, all the parts have to be designed to work together to deliver the speed, handling, and prestige that its customers expect. Similarly, our development, UX, and client stakeholders have to coordinate to understand and deliver the experience that our customers expect.
Do you think that’s a valid analogy?
Simple Web Response Time Testing with Python
For my day job, I’m creating a series of HTML pages that each have a table that shows how our various services and solutions map onto problems our customers are likely to have. The main site is currently thousands of static HTML pages, with a bit of PHP thrown in a few pages to do page footers. We’re working on upgrading to a dynamic CMS type site. In the meantime, I used the opportunity to learn a bit more about PHP and I wrote a small function to generate the table HTML given a JSON document describing the table headers, rows, and content.
As I was debugging the sites, I felt like there was sometimes a noticeable delay in rendering the page that wasn’t there on the existing static pages. Was this my imagination, or something that our users might notice and complain about? Hmm, I don’t have any web profiling software, and I couldn’t find anything that I could quickly install and run. And I had some time. Looks like I have to write some code. In the immortal words of Leeroy Jenkins, “Let’s Do This!”
Python timeit Module
Python’s mantra is Batteries Included, implying that for whatever coding task you have, there’s probably something in the standard library that will do much of what you want. You shouldn’t have to go and write something completely from scratch. I knew about Python’s time module. I was planning on using it to mark the time before fetching my webpage, mark the time after fetching the page, and compare the two. But I stumbled onto the timeit module, which makes it even a bit easier. Timeit basically wraps up that logic of marking time before and after some bit of code in a convenient package. You give the timeit.Timer() class a bit of code that you want to time. The timeit() method will run the code a specified number of times (default 1,000,000) and return the total time those executions took, in seconds. The repeat() method will call timeit() a specified number of times, and return a list of those totals.
In action, it looks like this:
import timeit
# Request the page 100 times, time the response time
t = timeit.Timer("h.request('http://PAGE/URL',headers={'cache-control':'no-cache'})", "from httplib2 import Http; h=Http()")
times_p1 = t.repeat(100,1)
Three lines of code…not bad. The Timer() class takes two strings as parameters: 1) the Python code you would like repeated and timed, and 2) setup code that runs once before each timing run. If you’re familiar with unit testing, the second parameter is like the setUp() method. Notice I’m using the httplib2 library instead of the standard urllib library. I like httplib2 for requesting URLs because I’m familiar with it, it combines requesting the URL and reading its contents, and it’s really good about dealing with caching. In this case, I don’t want a cached response, hence the no-cache header.
The last line instructs my Timer() to run 100 sets of my test code, with 1 trial per set. The output is a list of 100 times.
The documentation for timeit.repeat() gives some good advice on how much stock to put into these numbers, and on using mean/standard deviation to describe the performance. But what I really wanted to know was whether or not my page took significantly longer to load than a similar page with no dynamic content. I expanded my code to repeatedly time a second, static page, and write the two lists to two columns of a CSV file.
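To make the difference between timeit() and repeat() concrete without touching the network, here is a small sketch that times a local statement instead of a web request. The statement `sum(data)` and the counts are arbitrary choices for illustration. With more than one execution per run, you divide the returned total by the execution count to get a per-call figure.

```python
import timeit

NUMBER = 1000  # executions per timing run (arbitrary, for illustration)

# Statement to time, plus setup code that runs once per timing run.
t = timeit.Timer("sum(data)", "data = list(range(1000))")

# timeit() returns the TOTAL seconds for NUMBER executions.
total = t.timeit(number=NUMBER)
per_call = total / NUMBER

# repeat() calls timeit() several times and returns a list of totals.
runs = t.repeat(repeat=5, number=NUMBER)
best = min(runs) / NUMBER

print("per call: %.3g s, best of 5: %.3g s" % (per_call, best))
```

In the web test above each run executes the request only once, so each entry in the list is already the time for a single request.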
import timeit
from csv import writer
# Hit the dynamic page 100 times, time the response time
t = timeit.Timer("h.request('http://PAGE1/URL',headers={'cache-control':'no-cache'})","from httplib2 import Http; h=Http()")
times_p1 = t.repeat(100,1)
# Now hit a similar static page 100 times
t = timeit.Timer("h.request('http://PAGE2/URL', headers={'cache-control':'no-cache'})","from httplib2 import Http; h=Http()")
times_p2 = t.repeat(100,1)
# Write the times to a CSV file, one row per pair of measurements
times = zip(times_p1, times_p2)
with open('times.csv','w') as f:
    w = writer(f)
    w.writerows(times)
Note we’re using the with statement from Python 2.5+, which encapsulates some of the try/except/finally logic you’d normally write when opening a file. Because I had even more spare time, I imported my new times.csv file into a statistics program (SPSS) to calculate the means and perform a t-test to see whether the means of the two columns are statistically different. I also could have used various statistics scripting tools, such as scipy or R. But I didn’t have THAT much time.
There was a statistically significant difference. The dynamic page was, on average, about 1.2 ms slower than the static page. This makes practically no difference to the user experience of the page, and makes my development life much easier (and also illustrates how practical significance may differ from statistical significance). I’ll continue to generate pages dynamically.
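For the curious, the comparison can be roughed out with only the standard library. The sketch below is not the analysis I ran in SPSS: it computes Welch’s t statistic (one common form of the two-sample t-test, which may differ from the SPSS variant), the function name `welch_t` is made up, and the timing values are invented placeholders rather than my measurements. It also stops at the statistic, since turning it into a p-value needs a t-distribution table or a statistics library.

```python
from math import sqrt
from statistics import mean, stdev


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic for two independent samples."""
    var_a = stdev(sample_a) ** 2 / len(sample_a)
    var_b = stdev(sample_b) ** 2 / len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / sqrt(var_a + var_b)


# Invented response times in seconds, standing in for the two CSV columns.
dynamic = [0.0512, 0.0498, 0.0521, 0.0505, 0.0517]
static = [0.0500, 0.0489, 0.0507, 0.0494, 0.0503]

difference_ms = (mean(dynamic) - mean(static)) * 1000
t_stat = welch_t(dynamic, static)
print("mean difference: %.2f ms, t = %.2f" % (difference_ms, t_stat))
```

A large t statistic with a tiny mean difference is exactly the situation described above: statistically detectable, practically irrelevant.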
Four Steps for Selling UX to Management and IT Teams
An increasing part of my role as the company’s User Experience Designer is to educate coworkers and managers on what exactly User Experience is and is not, what good User Experience is, how you know if you’ve got it, and how you get it. At least, my take on these issues. I recently had the opportunity to present some of these ideas at an all-day conference for senior management, technology leaders, and project managers within the company who are likely to retain the services of our internal UX/website team. This was my first major audience and wide exposure within the company, and as a new hire in a relatively high-visibility position it was important that I make a good first impression. I also knew that ‘User Experience’ meant a lot of different things to different people in the organization…some people think it is just graphic design, others think it is usability testing, fewer think it is about user- or customer-centered design. I also got the feeling that people expected me to ‘own user experience,’ in that I would come in, do my UX design thing, and allow IT to focus on coding and delivering ‘the real system.’ So it was also an opportunity to level set, and get some feedback on what their expectations were for me.
I’m happy to report that the presentation was a ‘success,’ which I measured by 1) me giving a fluid, polished presentation, and 2) people seeking me out afterwards to say ‘good presentation,’ or to ask for follow-up information. I can think of a few reasons why this went as well as it did. Let’s discuss them in terms of steps to help you sell UX in an IT organization:
1) Know your audience
What does the audience want or need to hear? I asked a lot of questions about who was going to be attending the meeting (various senior IT managers, senior business analysts, project leads), and their reasons for attending (mandatory for many). I even went to some of those people and asked them what they wanted to know and hear about user experience. The better you can tailor your talk to address the needs, questions, and concerns of the people in the room, the more you increase your chances of success.
2) Don’t bury the lead
I received the audiobook Made to Stick for my birthday this year. This book is about generating and presenting compelling and memorable messages that the audience can take with them and easily share with others. In other words, ideas that stick. One of their recommendations is to make sure you lead off with the main or most important point. You want to communicate the main point up front, and follow with supporting materials. They draw the analogy with how journalists write newspaper articles: the lead sentence/paragraph gives the key who-what-when-where-how information, and the rest adds detail and context.
I spent an hour creating my lead — the one sentence version of my talk. The key idea that I thought would be most important to the audience.
Because I was in the middle of a long day of speakers, I think it was especially important to put the key ideas up front; you probably don’t have people’s attention for very long. Which brings me to the next point.
3) Help them pay attention
John Medina, a molecular biologist at the University of Washington, wrote a fascinating and easily readable book describing what science knows about how the brain works. He distilled that knowledge into 12 Brain Rules which can be applied to how we work, learn, and live. Brain Rule #4 is “We don’t pay attention to boring things.” He points to research showing that humans can maintain attention on boring things (things that don’t move, things that don’t generate an emotional response, things we can’t mate with) for about 10 minutes, after which our attention drops precipitously. This can be combated by presenting something dynamic (and relevant), such as a funny or emotionally appealing story. About 10 minutes into my talk, I showed a picture of my face photoshopped onto the body of Abby Cadabby, to make the point that There Is No Magic UX Fairy. Just as I was about to lose them, I did a little something to bring them back on task for a bit. I also helped people pay attention by using a wireless microphone and walking around the room. I was the only speaker who did this, so it was novel and interesting in itself. It also let me engage directly with people, which encouraged them to look back at me (and stop texting for a couple of seconds).
4) Practice, refine, practice, cut, practice
I cleared my schedule the day before to generate and refine the presentation content and graphics, and practice the talk. After my hour creating the lead, I created an outline for the talk, and ran the early outline by a few coworkers. Only then did I start working on the presentation slides: finding good images, writing text, revising text, cutting text. Once I finished the first draft of the slides, I found an empty practice room and went through the talk out loud, with transitions. The important point here, I think, is that you go through the entire talk one time, nonstop. Even if you stumble, or the words on the screen aren’t quite right, wait to make changes until you’re done. I think it is valuable at this point to evaluate how well you are communicating your thoughts. Where do you stumble? Where do you ramble on? Where do you read directly from the slides instead of engaging the audience? Go through it once, then go back and revise.
Then I put it away for an hour or so. I had lunch, and did something completely unrelated. I came back to it, and practiced the talk in the room in which I would be giving the presentation. I invited my team to come watch me practice and give feedback (to my surprise, they all came, and they gave excellent feedback). I revised my slides again. By this time, I had given the talk to myself or others roughly 5-7 times. I had a good command of the key ideas I wanted to communicate, and the pacing and phrasing I wanted.
When it came time to give the talk, I was confident in the material and the ideas I wanted to communicate. Had there been technical difficulties and the presentation system failed, I think I still would have given a compelling presentation.
I’ve been asked to lead the discussion on User Experience for an upcoming meeting with yet another set of company vice presidents. I’m following the same general guidelines for shaping my thoughts (yes, I spent another hour on a one sentence lead). I’ll let you know how it goes.