Object-oriented programming and Python

Object-oriented programming is a technique that was developed more than 30 years ago, but has become a generally accepted and applied method only during the last decade. It is a design and organization strategy for non-trivial computer code. It differs from procedural programming, the older technique that is still prevalent in scientific programming, by shifting the focus of code structuring from algorithms (functions) to data.

This sounds much more complicated than it is in practice. In fact, object-oriented programming is a much more natural approach for many applications, including all kinds of modelling and simulations. In the following the basic features of object-oriented programming will be demonstrated using examples from molecular modelling.

Objects and classes

An object is the computer representation of some real-life entity. Basic objects of computation are for example integers, floating-point numbers, character strings, or data files. Basic objects of geometry are vectors, lines, planes, spheres, etc. Basic objects of molecular modelling are atoms, molecules, molecular complexes, etc. All these objects are characterized by specific properties (the value of an integer, the constituent atoms in a molecule) and a number of operations that can be performed with them (integers can be added, atoms can be moved, molecules can be asked for their center of mass). The specific properties are called the attributes of an object, and the operations are referred to as methods. An object type, or class, is specified by a class definition that contains all the methods, and in some object-oriented programming languages (but not in Python) also a list of all attributes. (In Python attributes can be added or removed at any time, so there is no point in listing them.)

An object type "atom" could for example be defined as follows:

class Atom:

    def __init__(self, element):
        self.element = element

    def mass(self):
        return self.element.mass

(The actual definition of an atom in MMTK is of course more complicated.) With this definition we could then write

c1 = Atom(carbon)

to create an atom of element carbon and assign it to the variable c1. (Of course carbon is a variable referring to some other object, which we will describe later.) What actually happens is that Python creates a new object of class Atom and then calls the method __init__ to initialize it (i.e. set its attributes). The method __init__ has two parameters: the newly created object (self) and the argument given for its construction (element). All it does is to remember the second parameter in the attribute element, referred to as self.element inside the method. Once initialization is completed, the new object is assigned to the variable c1. Without this assignment, there would be no way to use the new object later on.

After creating the atom, we can ask it for its mass:

print c1.mass()

This will call the method mass defined in the class of the object (i.e. Atom) with the actual object as the first parameter (and no further parameters because none were specified). The method simply looks up the attribute element of the object it was called for (our carbon atom) and returns the attribute mass of that element. You may wonder at this point how the method can know that the element actually has an attribute called mass. The answer is that it doesn't - if the attribute turns out not to exist, then Python will report an error. Note also that any object with an attribute called mass can be used in the creation of an atom to define its element, since no other method or attribute is ever used. This demonstrates one of the advantages of object-oriented programming: code that uses an object need only know how the object can be used (i.e. its interface). It does not have to know how the object's methods work internally, meaning that they can be changed (e.g. by implementing a more efficient algorithm) without affecting external code as long as they still do what they are supposed to do.

You have probably guessed that the method __init__ is something special, since Python calls it automatically. There are a few more of these special methods, whose names begin and end with two underscores. For a complete list consult the Python manual. You may also have noticed that the first parameter of all methods is called self. This is a convention followed by most Python programmers, but by no means a requirement of Python. If you want to be more precise, you could call it the_object_that_this_method_is_called_for, but typing this name repeatedly will soon convince you that self is a better choice! (Smalltalk programmers will of course consider this choice obvious.)
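As a hedged illustration of two other special methods, the following sketch gives a small container class a __len__ and a __repr__ method, which Python calls automatically when len(...) and repr(...) are applied to an object. The class AtomGroup and its attributes name and atoms are hypothetical names chosen for this example; they are not part of MMTK.

```python
class AtomGroup:
    # Hypothetical class for illustration only; not an MMTK class.

    def __init__(self, name, atoms):
        self.name = name
        self.atoms = atoms

    def __len__(self):
        # Called by Python when len(group) is evaluated.
        return len(self.atoms)

    def __repr__(self):
        # Called by Python when repr(group) is evaluated, and when the
        # object is displayed by the interactive interpreter.
        return 'AtomGroup(%s, %d atoms)' % (self.name, len(self.atoms))


water = AtomGroup('water', ['O', 'H1', 'H2'])
number_of_atoms = len(water)     # uses __len__
description = repr(water)        # uses __repr__
```

Neither method is ever called by name in the last three lines; Python finds them through their special names, just as it finds __init__ when the object is created.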

To make this short example complete, we have to provide the definition of carbon used above. Here is the code that provides it:

class Element:

    def __init__(self, name, mass, nuclear_charge):
        self.name = name
        self.mass = mass
        self.nuclear_charge = nuclear_charge

carbon = Element('carbon', 12, 6)
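Putting the two definitions together gives a complete program. The sketch below repeats Atom and Element from above and adds a hypothetical class Isotope (an assumption made for this example, not an MMTK class) to show that any object with an attribute mass can serve as the element of an atom. It also shows an attribute being added to and removed from an existing object. The form print(...) with a single argument is used because it works in old as well as current Python versions.

```python
class Element:

    def __init__(self, name, mass, nuclear_charge):
        self.name = name
        self.mass = mass
        self.nuclear_charge = nuclear_charge


class Atom:

    def __init__(self, element):
        self.element = element

    def mass(self):
        return self.element.mass


class Isotope:
    # Hypothetical class: it shares no code with Element, but it has
    # the one attribute (mass) that Atom actually uses.

    def __init__(self, name, mass):
        self.name = name
        self.mass = mass


carbon = Element('carbon', 12, 6)
c1 = Atom(carbon)
c13 = Atom(Isotope('carbon-13', 13))

print(c1.mass())     # prints 12
print(c13.mass())    # prints 13

# Attributes can be added and removed at any time.
c1.position = (0.0, 0.0, 0.0)
has_position = hasattr(c1, 'position')   # true at this point
del c1.position
```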

Inheritance

An important technique in object-oriented programming is called inheritance. It means that a class definition can be written as an extension (derived class) of another class definition (base class) whose methods it inherits. Inheritance permits specialization without unnecessarily replicating code. This is best illustrated in another example, protein modelling.

Proteins are complexes of peptide chains and perhaps other molecules. Peptide chains are, of course, molecules. But they are special molecules with unique features. A useful operation on a peptide chain is for example the extraction of a subchain, which however makes no sense for molecules in general. Therefore a full-featured class representing peptide chains must be different from the general class representing molecules, while still retaining all operations defined on molecules. This is done by making the peptide chain class a derived class of the molecule class:

class Molecule:

    ...
    # much code left out
    ...

    def centerOfMass(self):
        # return center of mass vector (body left out)
        pass


class PeptideChain(Molecule):

    ...
    # much code left out
    ...

    def extractSubchain(self, first_residue, last_residue):
        # return the subchain (body left out)
        pass

An object of class PeptideChain would then offer both the method centerOfMass and the method extractSubchain, whereas an object of class Molecule would allow only the first, but not the second method.
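The following is a minimal runnable sketch of this inheritance relation, not the MMTK implementation. It makes several simplifying assumptions that are not in the text above: an atom is represented by a pair (mass, position), a residue is a list of such pairs, and last_residue is taken to be included in the subchain.

```python
class Molecule:

    def __init__(self, atoms):
        # Assumption: atoms is a list of (mass, (x, y, z)) pairs.
        self.atoms = atoms

    def centerOfMass(self):
        total_mass = 0.0
        sums = [0.0, 0.0, 0.0]
        for mass, position in self.atoms:
            total_mass = total_mass + mass
            for i in range(3):
                sums[i] = sums[i] + mass * position[i]
        return tuple([s / total_mass for s in sums])


class PeptideChain(Molecule):

    def __init__(self, residues):
        # Assumption: residues is a list of lists of atoms.
        self.residues = residues
        atoms = []
        for residue in residues:
            atoms = atoms + residue
        # Let the base class store the flat list of atoms.
        Molecule.__init__(self, atoms)

    def extractSubchain(self, first_residue, last_residue):
        # Assumption: both residue indices are included.
        return PeptideChain(self.residues[first_residue:last_residue + 1])


chain = PeptideChain([[(12.0, (0.0, 0.0, 0.0))],
                      [(12.0, (2.0, 0.0, 0.0))],
                      [(12.0, (4.0, 0.0, 0.0))]])
center = chain.centerOfMass()          # inherited from Molecule
subchain = chain.extractSubchain(1, 2)
sub_center = subchain.centerOfMass()
```

PeptideChain does not define centerOfMass itself; the call is resolved in the base class. A plain Molecule object, on the other hand, has no method extractSubchain.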

It is even possible for a class to inherit methods from more than one base class; this is called multiple inheritance. However, this is rarely used and will not be discussed here. A more detailed discussion of this and other advanced techniques of object-oriented programming can be found in any textbook on the subject.

Python

Python is the foundation on which MMTK is built. The majority of the MMTK code is written in Python, with only a small part written in C for speed. More importantly for users, all applications are Python programs, and even the definition files for molecules etc. are nothing but simple Python programs. This means that MMTK users will have to learn the basics of Python, but that is all they have to learn. There are no "input files" requiring some cryptic and ill-defined command line syntax, as in many other modelling packages.

Since Python is a well-documented language (it comes with a full set of manuals, and there are also books on Python), there is no need to repeat all the details here. The following description should be sufficient for understanding the application examples and writing similar ones. However, getting more familiar with Python is strongly recommended. It is a very consistent language without many subtleties, so it can be learned in a relatively short time. And it is useful for many tasks in computational science that are not covered by MMTK.

Python is an interpreted language, which means that you can type code directly into the interpreter without having to write it to a file or to compile it. In practice you will nevertheless want to write all but one-line code into text files so that you can rerun it. There are two ways to make interactive use of Python much more convenient: the Python mode for Emacs, which comes with the interpreter, and the Tk-based user interface kit PTUI. Both allow typing code, executing pieces of it, and saving it in a file.

Modules and names

Python syntax is generally similar to that of most other modern programming languages. Source code is packaged into modules; each source code file constitutes one module, whose name equals the file name without the (compulsory) suffix .py. The contents of the file define names (variables) in the module. Note that every name is a variable, including the names of classes, functions, and modules. The reason is that in Python everything is an object that can be assigned to variables and passed to functions as a parameter. A function definition creates a function object and assigns it to the function name, which then differs in nothing from a name that received its value in any other way. This has numerous practical advantages that you will probably discover when using Python.
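A short sketch of what "every name is a variable" means in practice: a function object can be assigned to a second name and passed to another function as a parameter. The names square, f, and apply_twice are made up for this illustration.

```python
def square(x):
    return x * x

# The function object gets a second name; no call takes place here.
f = square

def apply_twice(function, value):
    # The parameter function can be any callable object.
    return function(function(value))

result = apply_twice(f, 3)    # square(square(3))
```

After these statements f and square refer to the same object, and result has the value 81.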

Names can be of any length and consist of letters, digits, and underscores; the first character must not be a digit. Upper and lower case letters are distinct. Names starting with two underscores are treated in a special way; beginners should avoid them. Names are not tied to a particular data type: the same name can first refer to an integer and later be redefined to designate a peptide chain. However, this is usually not a good idea; programmers should always try to keep their code clear and readable.

A module can freely access all the names that are defined within itself. The only way to access names defined in other modules is to import the module or specific names of the module. This means that you never have to worry about name clashes between different modules, which are a serious problem when writing large programs in languages without modules. The only names that must be globally unique are module names (and there is even a way to remove this condition).
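The two forms of import can be sketched with the standard module math. The local function sqrt is a made-up example that shows why name clashes do not occur: it coexists with the function of the same name inside the module math.

```python
import math            # makes the module available as math
from math import pi    # imports one specific name from the module

def sqrt(x):
    # A local function that happens to be called sqrt; it does not
    # interfere with math.sqrt, which lives in another module.
    return 'local sqrt called with %s' % x

root = math.sqrt(16.0)        # the function defined in module math
message = sqrt(16.0)          # the function defined in this module
circumference = 2 * pi * 1.0  # pi was imported by name
```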

The most unusual feature of Python syntax for people used to other languages is its use of indentation to indicate code structure, as you have probably noticed from the examples in the last section. Indentation serves the purpose of curly brackets in C, begin/end statements in Pascal, and endif statements in Fortran. The amount of indentation is arbitrary (although most people seem to use four characters) but must be consistent within a subblock, i.e. all its lines must begin in the same position. Normally each line contains one statement. If a statement becomes too long, it can be split across several lines by terminating all but the last line with a backslash. This is not necessary if there are any opened and not yet closed brackets or parentheses - that actually covers most practical needs.
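The rules for long statements can be illustrated as follows; the variable names are arbitrary.

```python
# A statement split with a backslash at the end of the first line.
total = 1 + 2 + 3 + \
        4 + 5

# No backslash is needed inside open parentheses or brackets.
same_total = (1 + 2 + 3 +
              4 + 5)
values = [1, 2, 3,
          4, 5]

# The lines of a block must all start in the same column.
count = 0
for value in values:
    count = count + 1
    total = total + value
```

The indentation of the continuation lines in the first three statements is free; only the indentation of the two lines inside the for loop is significant.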

Predefined data types

Python has a range of predefined basic data types. The most important ones, each described below, are numbers, character strings, lists, tuples, arrays, and dictionaries.

The number types come with the standard set of arithmetic expressions, with exponentiation denoted by a double asterisk, as in Fortran. Standard mathematical functions (like square root, logarithm, or trigonometric operations) are provided in a module called Numeric. (This module is part of the numerics extension and more powerful than the standard module math.) Character strings are delimited by either single or double quotes and can be concatenated with the addition operator; substrings can be extracted by indexing (see below). The module string contains many useful standard operations; the module re provides more advanced searching and substitution operations.
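A few of these operations in a short sketch. It uses the standard module math instead of Numeric, an assumption made only so that the example runs without the numerics extension installed.

```python
import math

power = 2 ** 10             # exponentiation with a double asterisk
root = math.sqrt(16.0)      # a standard mathematical function

# Single and double quotes are equivalent; + concatenates strings.
text = 'abc' + "def"
first_character = text[0]   # indexing starts at zero
substring = text[1:3]       # characters 1 and 2
```

After these statements power is 1024, root is 4.0, text is 'abcdef', first_character is 'a', and substring is 'bc'.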

Lists are ordered collections of arbitrary objects. They are denoted by square brackets:

list = ['abc', 3, 4, [Atom(carbon), 'whatever you want']]

List elements can be extracted by indexing: list[0] is the string 'abc'. It is also possible to extract a sublist by providing two indices: list[1:3] is the list [3, 4], because the element at the second index is not included. You have probably noticed that the first element of a list has the number 0. It is also possible to count from the end by using negative indices: list[-1] is the last element. And of course items can be added to or deleted from a list. Another frequent operation is iteration over a list:

for x in list:
    print x

This will print each item in the list on a separate line. Lists are the most important composite data type in Python and users are strongly recommended to get familiar with them.
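The list operations mentioned above can be tried out with the following sketch. The list is called items here (an arbitrary name) so that the built-in function list stays available, and a plain string stands in for the atom object.

```python
items = ['abc', 3, 4, ['an atom', 'whatever you want']]

first = items[0]        # indexing: the first element
sublist = items[1:3]    # the elements with index 1 and 2
last = items[-1]        # counting from the end
inner = items[-1][0]    # indexing into the nested list

items.append(5.5)       # add an item at the end
del items[0]            # delete the first item

lengths = []
for x in items:
    # Iteration visits the remaining items in order.
    lengths.append(x)
```

At the end items contains 3, 4, the nested list, and 5.5, in that order.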

Tuples seem at first glance to be the same as lists: they too are ordered collections of arbitrary objects, and they allow much the same operations. The only (but important) difference is that tuples are immutable, i.e. they cannot be modified after their creation. That makes tuples usable in situations where lists are not allowed, e.g. as keys in dictionaries (see below).
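The difference between tuples and lists can be seen in a short sketch. The try/except statements belong to Python's exception handling; they are used here only so that the example keeps running after the two deliberate errors. All names are made up for the illustration.

```python
position = (1.0, 2.0, 3.0)
x = position[0]              # reading works as for lists

try:
    position[0] = 5.0        # modifying a tuple is an error
    modified = True
except TypeError:
    modified = False

# A tuple can be used as a dictionary key ...
grid = {(0, 0): 'origin', (1, 0): 'right neighbour'}
label = grid[(0, 0)]

# ... but a list cannot.
try:
    bad = {[0, 0]: 'origin'}
    list_key_accepted = True
except TypeError:
    list_key_accepted = False
```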

Arrays are similar to arrays in languages like C or Fortran in that they are multidimensional collections of objects of identical type. However, Python arrays support much more powerful operations; they are better compared to arrays in array languages like APL or (to a more limited degree) Matlab. It should be noted that Python array operations on the standard number types have been optimized extensively, such that numerical algorithms expressible in terms of these array operations are almost as fast as hand-written C code. Arrays are created from nested lists or tuples by the function array from the module Numeric:

from Numeric import array
a = array([[1,2],[3,4]])

There are other ways to create arrays with predefined contents (e.g. all zero). The full set of array operations is too large to be listed here. Users who are likely to profit from array operations (e.g. everyone wanting to write analysis routines) should consult the tutorial and documentation for the Python numerics extension.

Dictionaries are probably less familiar to scientists, but they are extremely useful in many applications. They store keys with associated values. Values can be arbitrary objects, keys are slightly restricted (they must be immutable objects, i.e. lists are not allowed, but tuples are). Dictionaries are denoted by curly brackets. They are most commonly used to associate properties with objects:

charge = {o: -0.834, h1: 0.417, h2: 0.417}

This dictionary defines a charge for three atoms; o, h1, and h2 are variables referring to atom objects defined elsewhere. The values can be accessed by indexing with a key: charge[h1] is 0.417.
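To make the dictionary example self-contained, the sketch below first creates the three atom objects. The class Atom used here is a minimal stand-in with a single attribute name; it is an assumption for this example and not the MMTK atom class.

```python
class Atom:
    # Minimal stand-in for an atom object (assumption for this example).

    def __init__(self, name):
        self.name = name


o = Atom('O')
h1 = Atom('H1')
h2 = Atom('H2')

charge = {o: -0.834, h1: 0.417, h2: 0.417}

h1_charge = charge[h1]          # look up a value by its key

# Sum the charges of all atoms stored in the dictionary.
total_charge = 0.0
for atom in charge.keys():
    total_charge = total_charge + charge[atom]

del charge[h2]                  # remove an entry
remaining = len(charge)
```

The atoms themselves are not changed by any of this; the dictionary only associates a number with each of them.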

Control structures

Python's control structures are those that most modern programming languages offer. There are decisions:

if a > b:
    print 'a is larger'
elif a == b:
    print 'a and b are equal'
else:
    print 'b is larger'

The elif and else parts are optional. There are also conditional loops:

while x < 1000:
    x = x*x

Finally, as has been shown above in the description of lists, there is a construct to iterate over the elements of sequences (which can be lists, tuples, character strings, arrays, or user-defined sequence types):

for letter in 'abcdefg':
    print letter

This iteration is also used for the frequent "iterate n times" construct:

for i in range(10):
    print i

This will print the first 10 integers, starting from zero, i.e. the numbers 0 to 9 (not the number 10!). The function range simply constructs a list of numbers. Both conditional loops and sequence iterations can be terminated by executing the statement break.
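The control structures above can be combined into one small runnable sketch. The function compare and all variable names are made up for this illustration, and results are stored in variables rather than printed.

```python
def compare(a, b):
    # Decision with if / elif / else.
    if a > b:
        return 'a is larger'
    elif a == b:
        return 'a and b are equal'
    else:
        return 'b is larger'

# Conditional loop: x takes the values 2, 4, 16, 256, 65536.
x = 2
while x < 1000:
    x = x * x

# Sequence iteration, terminated early by break.
first_vowel = None
for letter in 'bcdefg':
    if letter in 'aeiou':
        first_vowel = letter
        break

# "Iterate n times": add up the integers 0 to 9.
total = 0
for i in range(10):
    total = total + i
```

The while loop stops with x equal to 65536, the first value that is not smaller than 1000. The for loop over the string stops at 'e', and total ends up as 45.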

Another important control structure, exception handling, is used to deal with error conditions in a systematic way. Since this construct is not needed for simple MMTK applications, it will not be discussed here. Refer to the Python manuals for more information.

The standard library

Python comes with a large library of standard modules that deal with frequent data types and algorithms. They are described in detail in the Library Reference Manual. MMTK users will probably never need many of these modules, but it is still a good idea to get an overview of what's there, rather than reinventing the wheel later on. Of particular interest are the following modules:

Even more important is the module Numeric, which is part of the numerics extension and therefore has its own manual and tutorial. It provides all array operations and also standard mathematical functions. Other parts of the numerics extension deal with specific numerical algorithms. Of particular importance is the module LinearAlgebra, which provides functions for solving systems of linear equations, inverting matrices, solving eigenvalue problems, and solving least-squares problems.

