Speeding up Python (Was: FICL has a nice VM interpreter loop...)
- Subject: Speeding up Python (Was: FICL has a nice VM interpreter loop...)
- From: Greg & Cindy Gritton <gritton(at)ibm.net>
- Date: Sat, 09 Jan 1999 17:31:20 -0800
- Newsgroups: comp.lang.python
- Organization: The Gritton Family
- References: <slrn79aj3e.1a1.wtanksle@tursiops.escher.ml.org> <000b01be3adf$562f21c0$ed9e2299@tim> <36960B84.E9CD9394@equi4.com> <3696687A.D806138A@appliedbiometrics.com>
- Reply-To: gritton(at)ibm.net
There are likely a number of ways to speed up Python.
Many would require extensive changes to the Python source code,
and might be good candidates for Python 2.
One good source of techniques is Smalltalk.
Like Python, Smalltalk's variables can store any type,
and the type can change at any time.
Smalltalk has been going down the learning curve
of implementing dynamically typed languages for over
18 years, and some reasonably fast implementations
have been developed. Many of the techniques have
been documented in the literature.
The slowest widely used version of Smalltalk is Squeak,
a highly portable open source bytecode interpreter that
runs about 3 times as fast as Python in small C-style
benchmarks. (See benchmark results at the end of this
message.) The fastest is likely to be VisualWorks,
which uses dynamic (JIT) compilation and some other
tricks to end up running at 10 times the speed of Squeak.
Some of the techniques used to speed up Smalltalk include:
1. Garbage collection
Some of the early implementations of Smalltalk used
reference counting. However, they found that when they
implemented actual garbage collection, in addition
to being able to collect cycles, they reduced overhead.
Garbage collection actually sped up the implementation.
Modern garbage collection techniques can often
have lower overhead than manual memory allocation
using malloc and free, even if you don't include
the reference counting overhead.
Garbage collection doesn't have to be complicated
or non-portable. Conservative garbage collectors
aimed at C or C++ are likely to have porting
problems because they try to add garbage collection
on top of a language implementation that doesn't
expect it, and because optimizing compilers can
hide information from the collector.
By contrast, Squeak has a simple, efficient, yet
highly portable garbage collector because it is
built into the language implementation.
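To make the idea concrete, here is a minimal
mark-and-sweep collector sketched in C for a toy heap
of two-slot objects. It illustrates tracing collection
in general, not Squeak's actual collector; the Obj
layout and fixed roots array are invented for the
example:

    #include <stdlib.h>

    typedef struct Obj {
        struct Obj *slot[2];   /* references to other objects (may be NULL) */
        struct Obj *next;      /* intrusive list of every allocation */
        int marked;
    } Obj;

    static Obj *all_objects = NULL;
    static Obj *roots[16];     /* e.g. the VM stack and globals */
    static int nroots = 0;

    static Obj *new_obj(void) {
        Obj *o = calloc(1, sizeof *o);   /* no reference count to maintain */
        o->next = all_objects;
        all_objects = o;
        return o;
    }

    /* A real collector would avoid deep recursion here. */
    static void mark(Obj *o) {
        if (o == NULL || o->marked) return;
        o->marked = 1;      /* ordinary pointer stores need no bookkeeping */
        mark(o->slot[0]);
        mark(o->slot[1]);
    }

    static void sweep(void) {
        Obj **p = &all_objects;
        while (*p) {
            Obj *o = *p;
            if (o->marked) {        /* reachable: clear mark for next cycle */
                o->marked = 0;
                p = &o->next;
            } else {                /* unreachable, cycles included */
                *p = o->next;
                free(o);
            }
        }
    }

    static void collect(void) {
        for (int i = 0; i < nroots; i++)
            mark(roots[i]);
        sweep();
    }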
2. Tagged values / "Unboxed" integers
In most implementations of Smalltalk, a reference
is a 32-bit value that can be either a "small"
integer or a pointer. One example might be
using the lowest bit to differentiate between
the two. If it is a 1, it means the other 31 bits
contain the integer value. If it is 0, it means
the value is a pointer. Since pointers must be
aligned on a 4-byte boundary, they are used in
their natural form without shifting.
In Smalltalk, this allows integers of up to 31 bits
to be manipulated and stored without the overhead
of allocating a real object.
The virtual machine checks for overflow and converts
the "small" integer to a large (unlimited-precision)
integer if necessary, so there is no overflow error.
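In C, the scheme just described looks roughly like
this. The macro names are mine, and real VMs differ
in details such as which tag value means integer:

    #include <stdint.h>

    typedef uintptr_t oop;    /* an object reference: tagged int or pointer */

    #define IS_SMALL_INT(v)  ((v) & 1)               /* low bit 1 => integer */
    #define TAG_INT(n)       ((((oop)(intptr_t)(n)) << 1) | 1)
    #define UNTAG_INT(v)     ((intptr_t)(v) >> 1)    /* arithmetic shift assumed */

    /* Pointers are 4-byte aligned, so their low bit is already 0
       and they are stored unshifted. */
    #define IS_POINTER(v)    (((v) & 1) == 0)

    /* Adding two small integers without allocating anything. On overflow
       a real VM would fall back to a boxed, unlimited-precision integer. */
    oop add_small(oop a, oop b, int *overflow) {
        intptr_t sum = UNTAG_INT(a) + UNTAG_INT(b);
        if (sum > 0x3FFFFFFFL || sum < -0x40000000L) {  /* outside 31 bits */
            *overflow = 1;
            return 0;          /* caller converts to a large integer */
        }
        *overflow = 0;
        return TAG_INT(sum);
    }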
3. Fast instance variable access
Instance variables in objects are accessed by offset.
No search is required. This is the same as in C++.
However, it does mean that a class's methods have
to be recompiled when the instance variables of a
class change. (The Smalltalk implementation handles
this automatically.)
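In C terms, the compiled access looks something like
the following. The object layout is invented for the
example; Python 1.5, by contrast, hashes the attribute
name into the instance dictionary on every access:

    #include <stdint.h>

    typedef uintptr_t oop;

    typedef struct Object {
        struct Class *class;   /* the object's class (its layout descriptor) */
        oop slots[1];          /* instance variables, in declaration order
                                  (variable-length in practice) */
    } Object;

    /* The compiler turns the variable name into a slot index once, at
       compile time, so the bytecode only carries an integer: */
    #define GET_IVAR(obj, index)       ((obj)->slots[(index)])
    #define SET_IVAR(obj, index, val)  ((obj)->slots[(index)] = (val))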
4. Fast global variable access
Global and class variables are accessed through a dictionary.
However, instead of being formed as a set of parallel
key/value arrays, the dictionary is formed by a set of
"association" objects. Each association contains a key
and value pair. The compiled code contains references
directly to the association. So, the program can
change the global value just like in Python, but no
search is required to find that global value.
The only difference is that if the global value is
deleted from the global (or class) dictionary, code
already compiled will keep using the old value rather
than raise a "Name error". If the global is later
redefined, previously compiled code will still use
the old value until it is recompiled.
A way around this would be to store an error value
in the association before deleting it from the dictionary.
The LOAD_GLOBAL bytecode could check for the error value,
look up a new value if one has been defined, or raise
a name error if it hasn't. This check should be faster
than a hash-table search, which requires a miss check
anyway.
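Here is a rough C sketch of the association scheme,
including the error-value fix proposed above. The
names (Assoc, UNBOUND, lookup_or_name_error) are my
own invention, not any particular implementation's:

    #include <stdint.h>

    typedef uintptr_t oop;

    typedef struct Assoc {
        oop key;               /* the variable's name (a symbol) */
        oop value;             /* its current binding */
    } Assoc;

    #define UNBOUND ((oop)0)   /* sentinel stored when the global is deleted */

    extern oop lookup_or_name_error(oop key);   /* the slow path */

    /* The compiler resolves the name to its association once and embeds
       the association's address in the compiled method, so LOAD_GLOBAL
       becomes: */
    oop load_global(Assoc *a) {
        oop v = a->value;              /* one load; no hash search */
        if (v == UNBOUND)              /* rare: deleted since compilation */
            v = lookup_or_name_error(a->key);
        return v;
    }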
5. Fast method dispatch
In Smalltalk, as in Python, the actual code to call
is based on the function (method) name and the type
(class) of the object receiving that call. Finding
the actual method to call is called method dispatch.
There are two ways Smalltalk implementations do
method dispatch. One is to maintain a global hash
table indexed by a simple function (such as XOR)
of the class and method name. The hash table is
generally not large enough to include all combinations,
but rather serves as a cache of the more commonly
used ones. The implementation checks the hash table
against the class and method name, and if it finds
a match uses the associated code pointer.
If it doesn't find a match, it searches the
class's method dictionary, then its superclasses, etc.
Another way is to associate with each call site
storage for a method name, a last-used class, and
a method address.
At each call, the code checks
to see if the class of the current object is the
same as it was last time. If it is, it calls
the same method it used last time. If not, it
calls the generalized lookup routine (which may
include the method cache described above).
In the common case, where the object's type doesn't
change between calls, this code executes about as
quickly as a C++ vtable lookup, while remaining
much more general.
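Here is a C sketch combining both techniques: a
per-call-site cache (an "inline cache") backed by the
global method cache. The structure names, cache size,
and hash function are all invented for illustration:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct Class  Class;
    typedef struct Method Method;
    typedef struct Object { Class *class; /* ... instance data ... */ } Object;

    /* Technique 1: a global cache hashed by (class, selector). It is not
       big enough to hold every combination; it only caches recent ones. */
    #define CACHE_SIZE 1024    /* a power of two */
    typedef struct { Class *class; void *selector; Method *method; } Entry;
    static Entry method_cache[CACHE_SIZE];

    /* Slow path: the class's method dictionary, then its superclasses. */
    extern Method *search_class_chain(Class *c, void *selector);

    static Method *cached_lookup(Class *c, void *sel) {
        size_t i = (((uintptr_t)c ^ (uintptr_t)sel) >> 2) & (CACHE_SIZE - 1);
        Entry *e = &method_cache[i];
        if (e->class != c || e->selector != sel) {   /* miss: full search */
            e->class = c;
            e->selector = sel;
            e->method = search_class_chain(c, sel);
        }
        return e->method;
    }

    /* Technique 2: per-call-site storage. */
    typedef struct {
        void   *selector;      /* the method name, fixed at compile time */
        Class  *last_class;    /* the receiver's class on the previous call */
        Method *last_method;   /* the method found last time */
    } CallSite;

    Method *dispatch(CallSite *site, Object *receiver) {
        if (receiver->class == site->last_class)  /* usual case: about as */
            return site->last_method;             /* cheap as a vtable call */
        site->last_class  = receiver->class;
        site->last_method = cached_lookup(receiver->class, site->selector);
        return site->last_method;
    }

When a call site really does see many different
receiver classes, the per-site check fails and the
cost falls back to the global cache, so the scheme
degrades gracefully.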
6. Above and beyond
Beyond these techniques, and quite a bit of tuning,
two additional techniques can give radical speedups.
The first is dynamic (JIT) compilation, or compiling
to native code. VisualWorks uses this, maintaining
a native code cache of recently executed methods,
achieving a 10x speedup over Squeak and a 5x speedup
over other commercial Smalltalks I have tried.
Of course, you have to implement code generation for
each system you want to port to. The language
overhead still doesn't completely disappear.
Even with native compilation, VisualWorks is still
10x slower on my simple benchmarks than optimized C.
A group developing the Self language, a derivative
of Smalltalk, developed a number of techniques
for even faster execution. Their techniques
involved guessing the type of a variable, later
profiling the code and observing the actual types,
and compiling accordingly. Using this information,
they were able to get within a factor of 2-4 of
optimized C, but they used a lot of memory in the
process. They describe their techniques at
http://self.sunlabs.com. The papers there are
worth reading.
The Self group started out at Stanford and later
moved to Sun. They then formed a company, Animorphic,
to commercialize the technology. (Apparently, they
got the memory consumption back down to a reasonable
level.) But before they could bring it to market,
they were bought by Sun. Sun's HotSpot compiler for
Java is supposed to be based on this technology.
Smalltalk implementations probably represent the state
of the art in implementing dynamic object-oriented languages.
Java has borrowed a number of these same techniques.
If you look closely at how the bytecodes are defined
(especially including the "quick" bytecodes) you will
see much the same setup, with the unfortunate exception
of tagged values. Also, unfortunately, the early Java
implementations had poor memory management, generally
using conservative collectors.
So, even though in theory Java should have less
overhead than Smalltalk, it doesn't run any faster.
Many of these techniques could be applied to Python.
I would probably start with Garbage Collection.
In addition to speed, this would allow cycles to be
collected. Not only is this a boon to programmers,
it broadens the techniques that can be used within
Python to implement language extensions, etc.
Finally, it shouldn't be too complex to implement.
(I might try it myself if I have time.)
Of course, smaller optimizations also help.
Avoiding the actual generation of a list when "range"
is used only to control a loop is one of the most
obvious.
As a side note, when considering the cost and benefit
of a low-level optimization, don't just count
instructions. Different instructions have different
costs. An approximate ordering, from cheapest to
costliest, would be:
cheap   Add, subtract, shift, logical operations on registers
  |     Load, store, FP operations (except divide/remainder),
  |       cheap operations on memory (Intel)
  |     Integer multiply on some processors (Intel)
  |     Function call (Intel, Sparc)
  |     Integer multiply (others)
  |     Function call (others)
  v     Mispredicted branch*
costly  Divide, remainder, square root, etc. (around 30x cheap)
Of course, it is best to actually measure the cost.
(But you all knew that.)
Anyway, I hope these ideas are useful.
Good Pythoning.
Greg Gritton
gritton@ibm.net
* The bytecode interpreter loop is a likely example
of a mispredicted branch. Because the branch target
changes from one iteration to the next, the processor
will have a difficult time predicting it.
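In outline, and heavily simplified (the opcodes here
are made up), the loop looks like this:

    #include <stdint.h>

    enum { OP_HALT, OP_INC, OP_DEC };

    /* A single indirect branch dispatches every bytecode, so its target
       changes almost every iteration and the predictor has little to
       work with. */
    int run(const uint8_t *code) {
        int acc = 0;
        const uint8_t *pc = code;
        for (;;) {
            switch (*pc++) {           /* typically one indirect jump */
            case OP_INC:  acc++; break;
            case OP_DEC:  acc--; break;
            case OP_HALT: return acc;
            default:      return -1;   /* unknown opcode */
            }
        }
    }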
---------------------- Benchmark results -----------------------
All runs are on a 100MHz Cyrix 6x86 PR-120 with 512K
L2 cache and 24M RAM, made from December 1998 to
January 1999. The times are in milliseconds.
System                      Bubble   Quick    Perm  Towers
C (Visual C++ "release")        21      11      21      22
VisualWorks Smalltalk 3.0      206      99     120      79
Smalltalk MT 1.5               259     105     239     215
Dolphin Smalltalk 2.1          805     364     701     538
Squeak 2.0                    1349     632     961    1266
Python 1.5                    3577    1555    2844    3115
These benchmarks are derived from the Stanford Integer
benchmarks. They tend to test low-level operations
such as array access, comparisons, and procedure calls.
They were originally written in C, and because of this
and their low-level nature they tend to favor C, so
they don't necessarily give an idea of how fast it is
to use a language or system in real life. However,
they do give somewhat of a worst case.
As an example of this, two of the benchmarks test sorting speed.
When quicksort, for example, is replaced with Python's native sort
function, the entire benchmark is reduced from 1555ms to 483ms.
And most of that time is now spent initializing the
array; the sorting itself takes only 98ms.
I can send or post source code if anyone wishes.