Optimizing a Python Program

These days, I’ve been optimizing a Python program I wrote. Optimizing is a fun task, but very difficult. Most of the time, the first solution I think is even worse than the previous situation. I need more experience.

Some processes were too slow in my program and I realized it was because I was performing too much disk I/O operations. I thought a solution could be read more data in memory and operate there. Now I have excessive memory consumption.

Here is a very simplified description of my memory consumption problem:

I have a text file. Each line in the file represents an item of a large list. Each line has two string values separated by a character. Something like a CSV file. I have to read the file content and put it in a list.

A line in the file looks like this:

Content of the first value|Content of the second value

The separator is '|'

Here is a simple Python program that read the file:

class Field:
    def __init__(self, line):
        self.value1, self.value2 = line.split('|')

fields = []

with open('test_data') as file_:
    for line in file_:
        fields.append(Field(line))

Running this program with a test file of about 42 MB gives this results:

Execution time (time): 0m4.108s
Memory consumed (pmap): 166652K

I was surprised by the high memory usage of the program. If the file is 42MiB, I thought the program should use a similar amount of memory, obviously a higher amount, but not almost four times the size of the file.

An equivalent program in C (error checking is omitted):

#include<stdio.h>
#include<stdlib.h>
#include<string.h>

#define VALUE1_MAX 30
#define VALUE2_MAX 80
#define LINE_SIZE VALUE1_MAX + VALUE2_MAX + 3
#define BUFFER 10000

typedef struct
{
    char value1[VALUE1_MAX+1];
    char value2[VALUE2_MAX+1];
} field;

int main()
{
    FILE *file = fopen("test_data", "r");

    field *fields = (field*) malloc(BUFFER*sizeof(field));
    char line[LINE_SIZE];
    char *part;
    long i=0;
    long size = BUFFER;
    while(fgets(line, LINE_SIZE, file) != NULL) {
        part = strtok(line, "|");
        strcpy(fields[i].value1, part);
        part = strtok(NULL, "|");
        strcpy(fields[i].value2, part);

        i++;
        if (i == size) {
            size += BUFFER;
            fields = (field*) realloc(fields, size*sizeof(field));
        }
    }
    fclose(file);
    free(fields);
    return 0;
}

Results for the C program:

Execution time (time): 0m0.537s
Memory consumed (pmap): 57548K

This is much better.

The problem with the Python program seems to be the Field objects using more memory than they need. Testing the program without the Field creations, changing fields.append(Field(line)) withfields.append(line) seems to perform better:

Execution time (time): 0m0.575s
Memory consumed (pmap): 66808K

Clearly, the Field object is the bottleneck both in memory consumption and execution time. This is probably because of some default memory allocations that Python makes for the object and its fields. Python is a really cool language, but it doesn’t let you control the way the memory is used. This is a positive thing in most of the cases, but in some of them, like this one, is negative.

Most of the times, there are only very small parts of a program that really need to be optimized. And a programmer is much more productive with Python than with C. It doesn’t make sense to rewrite the program in C. Instead, a C module could be written for the bottlenecks.

I was too lazy to learn how to use the Python C API, so I looked a this project called Cython. Cython is a language designed for writing Python extensions. It’s very similar to Python, but is translated to C and compiled to an easy to use Python module. Cython also lets you mix C code and Python code easily. It lets you use high level python objects or low level C data types as you need and mix them properly.

I decided to rewrite the Field class in Cython:

#field.pyx 
DEF VALUE1_MAX = 30
DEF VALUE2_MAX = 80

cdef extern from "string.h": 
    char *strcpy(char *dest, char *src)

cdef class Field: 
    cdef readonly char value1[VALUE1_MAX+1] 
    cdef readonly char value2[VALUE2_MAX+1] 
    def __init__(self, line):
        v1, v2 = line.split('|')
        strcpy(self.value1, v1)
        strcpy(self.value2, v2)

This extension type can be used almost in the same way than a real Python object:

>>> f = Field('Hello|World')
>>> f.value1
'Hello'
>>> f.value2
'World'
>>>

I had to modify the original Python script to use the new module:

from field import Field

fields = []

with open('test_data') as file_:
    for line in file_:
        fields.append(Field(line))

Results of the new program:

Execution time (time): 0m1.257s
Memory consumed (pmap): 69800K

This is a huge improvement. With a very small change, the program now consumes almost 100MB less memory and it runs three times faster. I could write more parts in Cython, using strtok() instead of str.split(), or even rewriting the entire list and reading process. I would probable get a performance very similar to the C program. But I’m comfortable with the results now. I’m still surprised with the small effort compared to the awesome results.

If you want to do your own tests. Here is a simple script to generate a test file with 500k values:

import string
import random

with open('test_data', 'w') as f:
    for i in range(500000):
        value1 = ''.join(random.choice(string.letters)
                         for s in range(random.randint(15, 30)))
        value2 = ''.join(random.choice(string.letters)
                         for s in range(random.randint(50, 80)))
        f.write(value1 + '|' + value2 + '\n')

Summer of Code: Lunar Eclipse Final Report

I had a lot of fun participating in the Google Summer of Code. As I said before, I worked improving Lunar Eclipse, an open source visual designer for Silverlight/Moonlight. This is what I achieved during the summer:

Tools & Handle Support:

I worked on editing support for the following Silverlight Elements:

Circle, Ellipse, Rectangle and Square: These are the basic boxed shapes. Previous version of Lunar Eclipse already had support for this elements. Anyway I almost refactored the entire Handle & Tool subsystem. This helped me to adapt it to more complex shapes like bezier paths.

Line and PolyLine: Simple two point line and multiple point line.

Bezier Path and Pen Tool: I worked on two tools for creating Path figures. One let you create a path using Bezier Segments. This is working fine but there are a couple of bugs regarding translations of points. I also implemented a Pen tool for creating paths based on manual movement, it’s working, but it needs to be optimized to produce less nodes.

TextBox: I have partial TextBox support. I couldn’t implement graphical text editing because the Entry control is not yet implemented in Moonlight.

I also tried to add Image support. But it was impossible to use the Downloader and set media to objects using the GtkSilver widget (used for creating Desktop Moonlight applications).

Other features

Selection Rect – Select All – Clear Selection – Delete Selection: I improved the selection rectangle and all the selection subsystem. This was important in order to implement other operations such as ordering and alignment. The use of keys for selection (like control to add to selection) is not working because the Keyboard class was not implemented in Moonlight at the time.

Property Panel: The properties panel was fixed. I believe that in the future, Lunar Eclipse will use the toolbox system of MonoDevelop, but some of the internal parts of this work, such as property introspection, can be reused in the future.

Infinite Undo Redo: Undo and Redo was implemented in previous versions of Lunar Eclipse, anyway, I fixed a lot of bugs related with this and implemented Undo – Redo support for all tools and operations.

File Open / Save: As simple as that. Files can now be saved and loaded 🙂

Zooming and Scrolling: Zooming is now possible. There are some ugly effects caused by the GtkSilver implementation. I guess this will be fixed in the future. Scrolling is working good but it needs improvements to be more usable.

Ordering and Alignment: Send to Front, Send Backwards, Align Right, etc.

Serialization and Back: Serialization has been fixed. Loading from XAML is working too so you can move back and forth between Xaml and Design. Serialization still need a lot of love. Output is too verbose and the text indentation structure is not ‘remembered’ by the serlializator. I want something similar to the MS Expression Blend’s system, which is awesome.

Copy – Paste – Clone: Clipboard operations. Copy was a bit difficult because Silverlight elements don’t have a clone method. I Implemented my own Clone method based on the serialization work.

New interface: This wasn’t a critical feature, but I rewrote the GTK# based user interface. This was an easy task thanks to the awesome MonoDevelop.

Video Demo:

Here is a demo video showing some of the features of Lunar Eclipse:

http://video.google.com/googleplayer.swf?docid=7961478585740522243&hl=en&fs=true

Video en alta resolución

Future

It was really difficult to work on Lunar Eclipse the last weeks of the Summer of Code. Moonlight hackers started to work heavily on the 2.0 version. Moonlight became very unstable and its API changed a lot, (besides Silverlight 2 Beta 2 API changed too). All these changes broke Lunar Eclipse. The last objective, animation support, was not completed. Anyway, I definitely want to keep working on this project. I decided to freeze Lunar Eclipse while Moonlight gets more stable, but I started to work on adding support for Silverlight projects to MonoDevelop. After that, I want to fix bugs in LE and start isolating the design surface to be easily plugged in applications such as MonoDevelop or a Web Based Lunar Eclipse.

Python vs C#: Queries

One of the most beloved C# 3.0 features is Linq. Linq brings great power to C#, it allows you to easily write structured queries over collections or remote data sources. Now with C# is possible to make queries as easy as with other languages like Python. I decided to compare the way you make queries with C# and with Python. I found a great page showing 101 Linq examples, I decided to write Python versions of this examples. Which version do you like more?

Where – Simple 1

C# version:

int[] numbers = { 5, 4, 1, 3, 9, 8, 6, 7, 2, 0 };
var lowNums = from n in numbers where n < 5 select n;

Python version:

numbers = [5, 4, 1, 3, 9, 8, 6, 7, 2, 0]
low_nums = (n for n in numbers if n < 5)

 

Where – Indexed

C# version:

string[] digits = { "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine" };
var shortDigits = digits.Where((digit, index) => digit.Length < index);

Python version:

digits = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine']
short_digits = (digit for index, digit in enumerate(digits) if len(digit) < index)

 

Select – Simple 1

C# version:

var numsPlusOne = from n in numbers select n + 1;

Python version:

nums_plus_one = (n + 1 for n in numbers)

 

Select – Anonymous Types 1

C# version:

string[] words = { "aPPLE", "BlUeBeRrY", "cHeRry" };

var upperLowerWords =
    from w in words
    select new {Upper = w.ToUpper(), Lower = w.ToLower()};

Python version:

The exact Python version would be something like:

words = ['aPPLE', 'BlUeBeRrY', 'cHeRry']

upper_lower_words = ( type('', (), {'upper': w.upper(), 'lower': w.upper() })
                      for w in words)

But I feel more Pythonic this:

upper_lower_words = ( (w.lower(), w.upper()) for w in words)

Or even this:

upper_lower_words = ( {'upper': w.upper(), 'lower': w.upper() }
                      for w in words)

SelectMany – Compound from 1

C# version:

int[] numbersA = { 0, 2, 4, 5, 6, 8, 9 };
int[] numbersB = { 1, 3, 5, 7, 8 };

var pairs =
    from a in numbersA,
         b in numbersB
    where a < b
    select new {a, b};

Python version:

numbersA = [0, 2, 4, 5, 6, 8, 9]
numbersB = [1, 3, 5, 7, 8 ]

pairs = ( (a, b) for a in numbersA 
                 for b in numbersB 
                 if a < b)

SelectMany – from Assignment

C# version:

var orders = from c in customers,
                  o in c.Orders,
                  total = o.Total
             where total >= 2000.0M
             select new {c.CustomerID, o.OrderID, total};

Python version:

I couldn’t find how to make the assignment in Python, so the version is:

orders = ( {'customer_id': c.customer_id,
            'order_id': o.order_id,
            'total': o.total }
           for c in customers
           for o in c.orders
           if o.total > 2000)

SelectMany – Multiple from

C# version:

var orders = from c in customers
             where c.Region == "WA"
             from o in c.Orders
             where o.OrderDate >= cutoffDate
             select new {c.CustomerID, o.OrderID};

Python version:

orders = ( (c.customer_id, o.order_id)
           for c in customers if c.region == 'WA'
           for o in c.orders if o.date >= cutoff_date)

Take Simple

C# version:

var first3Numbers = numbers.Take(3);

Python version:

if we are working with something like a list, we could do:

first_3_numbers = numbers[:3]

but, if we are working with iterators, we must do:

first_3_numbers = itertools.islice(numbers, None, 3)

Skip – Simple

C# version:

var allButFirst4Numbers = numbers.Skip(4);

Python version:

all_but_fist_4_numbers = numbers[4:] # list version all_but_fist_4_numbers = itertools.islice(numbers, 4, None) # iterator version 

TakeWhile – Simple

C# version:

var firstNumbersLessThan6 = numbers.TakeWhile(n => n < 6);

Python version:

fist_numbers_less_that_6 = itertools.takewhile(lambda x: x < 6, numbers)

SkipWhile – Simple

C# version:

var allButFirst3Numbers = numbers.SkipWhile(n => n % 3 != 0);

Python version:

all_but_first_3_numbers = itertools.dropwhile(lambda x: x % 3 != 0, numbers)

First & Last

C# version:

numbers.First()
numbers.Last()

Python version:

numbers[0]  # first for a list numbers[-1] # last for a list 
numbers.next()   # first for iterator list(numbers)[0] # first for iterator 
list(numbers)[-1] # last for iterator 

First – Indexed

C# version:

int evenNum = numbers.First((num, index) => (num % 2 == 0) && (index % 2 == 0));

Python version:

even_num = [n for i, n in enumerate(numbers) if n%2 == 0 and i%2 == 0][0]

or:

even_num = (n for i, n in enumerate(numbers) if n%2 == 0 and i%2 == 0).next()

to be continued…

PyBotLearn

PyWeek is finished. Unfortunately, I didn’t have enough time for it this year, so I’m almost a DNF. The theme for this year was Robot. My idea ended up being more like an application than a game. I wanted to create an educational environment for teaching programing to children. Something similar to Logo. The idea was to build a game where your could program a small robot. It should be interactive so children could experiment on the fly.

In other words, It should be something like this: A window with a robot and a world, bellow a Python console. The users should be able to write commands in the console and the robot should execute them. The robot should be able to interact with it’s environment and with the user too. Error, warnings, etc, should be given to the programmer in a friendly way. Additionally, there should be various challenges that the programmers should solve. This should be the fun factor for the game.

But, unfortunately, as I said before, I didn’t have more that a couple of hours every day since Thursday, so I’ve failed on most of my objectives, and the only thing I have is a barely working demo. No challenges, no fun, no objects, no cool environment. Any way, I still like the idea, and I will try to continue with this project. I would love to hear more opinions about it.

My final submission was called “PyBotLearn”. You can download it from my PyWeek 6 Entry Page.

Here is a small video of the demo:

A higher resolution video can be downloaded from here

Effective Text Editing

I saw the video 7 Habits For Effective Text Editing 2.0 that arhuaco recommended months ago. I was tempted to start learning Vim, but after thinking for a while, I came to the conclusion that there is no good reason for learning Vim. I still don’t get why people likes Vim that much. Most of the features that Bram Mooleenar showed in the video have been present in other tools for years and, in my humble opinion, they work much better than in Vim. Other things Bram talked about are just too crazy for me (He suggested that word processing would be more productive if we edit every paragraph in Vim and then copy-paste it on Microsoft Word… WTF!!)

Here are my habits for effective text editing:

Golden Rule: Use the best tool for the Job. I have learned that using a generic text editor for everything leads to be very unproductive. I like to use JEdit as my generic Text Editor. I love it for things like XML, C/C++, this weblog, etc. It has all the advanced features I need in a text editor. I used to use it for C# and Python but I discovered that using an specialized IDE for these languages is thousands of times more productive. Now I use MonoDevelop for C# and PyDev for Python.

The most important features I need when editing source code that can be hardly founded in a generic text editor are:

  • Good Code Completion. Note the “Good” word is remarked. Some generic text editors have support for code completion but most of the time is very, very poor. For C#, MonoDevelop is the open source tool with the best code completion out there (ok, maybe #develop has good code completion too). For Python it’s a bit more difficult due the dynamic nature of the language. I have tested many editors and IDEs and I think PyDev has the best code completion (I’ve heard that Wing IDE has good support too). Code completion is specially useful when you’re working with large libraries like GTK+.
  • Refactoring Operations. Rename, go to definition, go to parent definition, encapsulate, look for references, etc. They are essential features, it’s impossible to be effective without them.
  • Integrated Debugger. This one is missing from MonoDevelop. Hopefully will be there for the next release. It’s lovely how with a simple click you can create a break point in your code, run it, and it will stop right there, then you can easily watch every object state and even add some code (on dynamic languages).
  • Version Control Integration. Automatic add, remove, rename subversion files. Easily track the changes on the files. I love the ChangeLog integration of MonoDevelop. Eclipse has a good support for Subversion too.
  • Integrated UI Editors. It saves a lot of time if you’re writing GUI applications.
  • Task list. If you put notes on your source code comments such as: TODO, FIXME, NOTE, HACK, etc. It will generate a list with all notes present on all files, so you can look for pending things easily. This feature is present in MonoDevelop, PyDev and JEdit. I love it.
  • Integrated Help. Put the cursor on a word, hit F1 and automatically open the help browser with the page for the class or method description for the object or definition you selected.

There are other cool features that I like and most of the time are present in a good generic text editor:

  • Code folding.
  • Easy comment, uncomment of lines.
  • Advanced search and replace.
  • Diff viewers. JEdit has a Diff plug in, but I prefer Meld.
  • Splits. Actually I have not seen a tool with a split support as good as JEdit.

Involúcrate+GNOME

Last week I went to Lima (Perú) to attend Involucrate+GNOME. I really had a good time. Special Thanks toDiego, who invited me. Of course, I’ve got everything what I’ve seen in peruvian tv for years and never had the opportunity to taste. Inca-Cola, Chicha Morada, Cremolada, etc. I also verified that no one can eat a Margarita cookie as a normal cookie, leafs must be eaten first 😛

En Peru

Photo Stolen directly from Tatiana’s FlickrMore Photos on Flickr

Comming soon…

  • PyWeek: March 30th. Of course, I’m not going to miss it.
  • Summer of Code 2008. Receiving applications on March 24th. I’m Trying again. Last year I couldn’t apply because I was in London and I suspended my studies. I hope the third one is going to be the winning one.
  • GUADEC 2008 Istanbul. Registration and sponsorship requests open!

Game bits

Making an Indie MMO
I found the video of Daniel James‘s presentation for the Game Developers Conference Independant Game Summit. Very interesting. I’m impressed to see how Puzzle Pirates was made by only six people. There are also various interesting tips for MMO developers out there.

Igda scholarships
Igda will award 25 scholarships for the Game Developer Conference 2007. It will be in San Francisco on February 18-22.

Open Liero
Liero was one of my favorite freeware games of all times. I was very pleased to see that there is an open source implementation around similar to what OpenTTD is for Transport Tycoon Deluxe.

Urban Terror
These days, I can be found playing Urban Terror, a tactical like Counter Strike. Urban Terror is based on Quake III Arena and is completely open source. My nick name is Kira. Yes is because of Death Note.

PyWeek Ended

PyWeek is over. It was absolutelly fun!. My final entry is not what I would call a finished product, but it’s not bad. A couple of hours before the challenge end, the pyweek.org server went down. We had to send a md5 sum of our final entries to one of the event’s coordinators via e-mail.

Video of my game:

http://video.google.com/googleplayer.swf?docId=-3566000892545301155&hl=en

My code and more detailed comments in my PyWeek Entry Page.