Hacker News
Global Interpreter Lock, or how to kill it (morepypy.blogspot.com)
165 points by kingkilr on June 29, 2011 | hide | past | favorite | 43 comments


I'm confused: Software Transactional Memory may be new, but optimistic transactions are not. In 1993 I was using the technique at Sun in network protocol design, and I did so because people suggested they had used it before and it worked well (so it was known earlier than my use of it).

Basically, if you have a system where contention is possible but unexpected, you 'tag in' before you start to do things that would be wrong if done under contention, and you 'tag out' when you're done. The system keeps track of a mutable state value, which gets updated when state is mutated, along with the 'tag id' of whoever mutated it. When you tag out, if yours is the only tag id that has been mutating the system, you're done; otherwise the system resets your changes and you re-do them. This 'wins' if most of the time you won't be contended. It's 'safe' because you always detect when you were contended and restart from a safe starting point. You generally don't 'roll back', because if you lost the tag race it's because someone else "won" it, and the new state is again consistent.
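The tag-in/tag-out scheme described above can be sketched in a few lines. This is an illustrative toy (all names are made up, not from any real library), using a version counter in place of tag ids: readers snapshot the version, do their work, and commit only if nobody mutated in between, redoing the work otherwise.

```python
# Minimal sketch of optimistic "tag in / tag out" concurrency,
# using a version counter instead of tag ids.
import threading

class VersionedCell:
    def __init__(self, value):
        self._value = value
        self._version = 0
        self._lock = threading.Lock()  # guards only the commit, not the work

    def read(self):
        """Tag in: snapshot the value and the version we saw."""
        with self._lock:
            return self._value, self._version

    def try_commit(self, seen_version, new_value):
        """Tag out: commit only if nobody mutated since we tagged in."""
        with self._lock:
            if self._version != seen_version:
                return False  # contended: caller must redo its work
            self._value = new_value
            self._version += 1
            return True

def optimistic_update(cell, fn):
    """Retry loop: redo the computation from a fresh snapshot on contention."""
    while True:
        value, version = cell.read()
        if cell.try_commit(version, fn(value)):
            return

cell = VersionedCell(0)
threads = [threading.Thread(target=optimistic_update, args=(cell, lambda v: v + 1))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cell.read()[0])  # 8: no increment is lost, despite no lock around fn
```

Note the "win" condition: the lock is held only for the commit check, not for the (possibly expensive) computation, so uncontended updates pay almost nothing.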

Surely there is some seminal paper on this somewhere.


That sounds an awful lot like some bus contention schemes from the late 70s/80s. It's been a while since my grad class that covered that though, so I can't provide any citations.

I notice that we're still stuck in that time period insofar as real software innovations go, sigh.


It's also used in lock-free algorithms nowadays.

I remember being delighted by it when I first read about it. Rather than assume contention, it just deals with it only when it comes up. A refreshing change of mindset.


Yep, there is a seminal paper on this subject. Start by reviewing coretalk.net and directly contact me.


This seems to be conflating two levels of concurrency. One is at the user level, where the user may want to write a function that atomically removes an element from one list and adds it to another, and one is at the interpreter level, where it needs to be able to complete all operations involved in, say, adding the element to the list, without another thread coming in and stomping on the first thread, with all the associated hazards.

At the interpreter level, as the article mentions, it suffices to lock the structures down as Jython does. STM would just seem to add overhead that isn't necessary.
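The Jython-style locking mentioned above amounts to guarding each mutable container with its own lock so interpreter-level operations stay atomic. A rough sketch (Jython actually relies on Java's synchronized collections; this wrapper just illustrates the idea):

```python
# Illustrative sketch of coarse per-structure locking, Jython-style:
# every operation on the container takes that container's lock.
import threading

class LockedList:
    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def append(self, item):
        with self._lock:
            self._items.append(item)

    def pop(self):
        with self._lock:
            return self._items.pop()

    def __len__(self):
        with self._lock:
            return len(self._items)

lst = LockedList()
for i in range(5):
    lst.append(i)
print(len(lst), lst.pop())  # 5 4
```

Each individual operation is now safe against concurrent stomping, which is exactly the interpreter-level guarantee; it says nothing about user-level atomicity across two operations.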

At the user level, STM + imperative(/uncontrolled effects) is basically a known failure. A lot of effort has been spent on it, with people with similar levels of control over their VM (like the C# attempt), and it just doesn't really work.

If you've got the control necessary to automatically STM things, you might as well just equally-automatically copy the Jython-style locking. STM is "nifty" but I don't see that it actually adds anything useful. Either that or you have to rigidly control effects in user code, and that's not Pythonic in any sense whatsoever (philosophical or practical).


At the user level, STM + imperative(/uncontrolled effects) is basically a known failure. A lot of effort has been spent on it, with people with similar levels of control over their VM (like the C# attempt)

Agreed. The only project that I know of that's even close to a "working" implementation of STM is Haskell and that's because Haskell doesn't have uncontrolled stateful code.


Clojure is designed for the user level and offers some unique advantages over Haskell's STM [1]:

  * MVCC snapshots avoid transaction restarts on read invalidation.
  * Ensured references ("ensure") on read-writes provide a kind of
    manual control over resource acquisition order.
  * Explicit "commute" reduces retries on commutative writes.
[1] http://stackoverflow.com/questions/4560605/how-does-clojure-...

EDIT: I didn't read the OP closely enough. Yes, grafting fine-grained STM onto an imperative language hasn't borne much fruit.


Isn't Clojure still mostly functional and mostly based on persistent data structures?


Yes very much so...and then some.


> At the interpreter level, as the article mentions, it suffices to lock the structures down

That's a sufficient, but not a necessary, condition. The idea of STM is to provide just the necessary protection and nothing more, so that multiple threads can manipulate the same structure simultaneously, as long as they don't manipulate the same properties/parts of that structure.

Having said that, I don't claim that, the way (P|J)ython is written, there is any unnecessary protection going on. Furthermore, to fully ensure that only the necessary protection is in place, a lot of code may have to be rewritten.


"If you've got the control necessary to automatically STM things, you might as well just equally-automatically copy the Jython-style locking."

So far, all thinking about how to really do that automatically has failed. Do you feel like providing an idea?


Failed for CPython or failed for PyPy? I'm just going on the claim that PyPy is in a position to automatically add STM, per that blog post. If in fact it can't automatically add locks, it isn't going to be automatically adding STM either and this is all a moot discussion.


Adding fine-grained locks without risking deadlock is pretty tricky. Perhaps too tricky for automatic addition.

As I (shallowly) understand it, STM doesn't introduce deadlock threats, so it requires less-amazing feats to add it automatically.

Note, again, that the FA does admit that using STM will perform more poorly than fine-grained locking, and he only hopes that it will scale well enough that extra cores will pay for it.

As I understand it, one of the reasons for the GIL is that it makes it easier to write C extensions (especially as compared with fine-grained locking, which is not friendly to extensions), and I'm not sure, but I think the STM scheme preserves this property?


Thoughts on how this can be made efficient at all? For example, Clojure has refs to limit the number of things which need to be tracked by STM, as well as fast persistent immutable data structures to avoid the overhead of copying data.


"All of PyPy, CPython and IronPython have a GIL"

IronPython actually does NOT have a GIL: http://wiki.python.org/moin/IronPython


I've had this on my mind for some time, and this seems a good time to ask: why wouldn't Python have an "unsafe" mode? If the internals were threadsafe, why couldn't we have a version of Python that will explode on threading mistakes, where it's up to the user to make sure the proper places are locked? Basically, the same level of guarantee that C provides. I'd be happy to use that version in places where it's needed. The same kind of threading was possible in Perl (use it if what you're doing is threadsafe, expect coredumps otherwise).


That's the problem: the internals of Python are not thread safe. Essentially every library backing up Python is written without thread safety in mind, and this is precisely why the GIL is necessary.


Would the concept of 'unsafe' Python fit into the ideology of the language?

I'm wondering myself, not sure either way I'd choose. I'm tempted to say, "no, it would not fit the language ideology" but ctypes lets you shoot yourself in the foot if you go down that path. :)


I'm not clear on the desired result of this project. Is it to make currently-unsafe Python code automatically correct? Or is it to keep the GIL semantics, but make it faster?

If it is the former, then I worry. Software transactional memory is hard to get right in languages without explicit and trustworthy annotations for side-effecting code (i.e., types) [1].

1) http://www.bluebytesoftware.com/blog/2010/01/03/ABriefRetros... (currently down, but Google has a cached copy)


python doesn't scale well with multiple threads/cores[1]. the high level motivation is to fix that. since the GIL is the main source of the problems, that has to go.

[1] the best current solution is to use the multiprocessing package which runs a completely separate python instance on each core, but obviously that doesn't support simple shared memory access (you can do it, but it's not "natural").
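The multiprocessing approach mentioned in [1] looks something like this: each worker runs in a separate interpreter with its own GIL, so CPU-bound work can actually use multiple cores, and results come back via pickling (which is the "not natural" part when you'd rather just share memory). A small illustrative example:

```python
# Each Pool worker is a separate Python process with its own GIL,
# so this CPU-bound work runs in parallel across cores.
from multiprocessing import Pool

def count_below(args):
    # Toy CPU-bound task: count numbers below a threshold in a range.
    start, stop, threshold = args
    return sum(1 for n in range(start, stop) if n < threshold)

if __name__ == "__main__":
    chunks = [(0, 250_000, 100_000), (250_000, 500_000, 300_000)]
    with Pool(processes=2) as pool:
        results = pool.map(count_below, chunks)  # arguments and results are pickled
    print(sum(results))  # 150000
```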


The goal is to remove the GIL from PyPy. RPython (the language our VM is written in) has excellent annotations on things like side-effects.


> Is it to make currently-unsafe Python code automatically correct? Or is it to keep the GIL semantics, but make it faster?

Neither. It's to improve concurrency in the interpreter (there currently is none whatsoever due to the GIL), so that multithreaded software can scale on multiple cores performing Python bytecode execution.

Currently, the GIL means the Python interpreter can only execute Python code in a single OS thread at a time (C extensions such as I/O systems can release the GIL). The result is that, even with a number of (OS) threads spawned in the interpreter, you don't get much parallelism benefit. And even if only one thread performs computation and the rest do I/O, there are churning issues with the GIL (I recommend checking David Beazley's research and presentations on the subject[0]).

[0] http://dabeaz.com/blog.html section "The GIL"
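You can see this effect directly with a CPU-bound task: splitting the same work across two threads takes roughly as long as doing it sequentially, because only one thread executes bytecode at a time (exact numbers depend on the machine and interpreter, so treat the timings as illustrative only):

```python
# Demonstrates the GIL's effect on CPU-bound threading: two threads
# splitting the work do not finish meaningfully faster than one.
import threading
import time

def spin(n):
    while n > 0:
        n -= 1

N = 2_000_000

t0 = time.perf_counter()
spin(N)
sequential = time.perf_counter() - t0

t0 = time.perf_counter()
threads = [threading.Thread(target=spin, args=(N // 2,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - t0

print(f"sequential: {sequential:.2f}s, two threads: {threaded:.2f}s")
```

This is the setup Beazley's GIL talks instrument in much more depth, including the pathological cases where the threaded version is slower than the sequential one.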


is the speed hit (factor 2-5) really that bad? do haskell programmers have experience that confirms that?

and would it be possible to somehow switch this on and off dynamically, so that a single "pypy" can adapt automatically if multiple threads start?

and how will this affect a stable, well supported [edit: full library], "final" release of pypy (especially, p3)? is it going to remove effort/resources from a GIL version? i get the impression it's getting close to stable/easy to use and it would be a pity to lose that.

[edit: ps, otherwise, this sounds most excellent]


> is the speed hit (factor 2-5) really that bad? do haskell programmers have experience that confirms that?

The penalty was pretty bad in the earliest STMs, but the Simons have done so much work on multi-core stuff and the STM libraries that it's hard to say.

(I will note that lack of purity might be, as the comments point out, a real problem. That was what sunk the .Net/C# folks trying to add STM.)


I'm not sure what that speed hit is supposed to be for.

If it's about time taken on the synchronization operations themselves, I haven't found the benchmarks but I seem to recall STM being reasonably competitive (maybe 10%-50% slowdown) with locking implementations of fairly simple things like MVar or Chan.

That's certainly much larger than the hit for running under GHC's threaded runtime. The original version described in "Haskell on a Shared-Memory Multiprocessor" shows a performance hit usually under 10% for sequential code. I expect it's only gotten better with tuning since then. (Of course, purity helps a lot. Re-evaluating a pure thunk produces the same result, so two threads entering the same code just costs performance rather than correctness, and it's usually enough to use simple reads and writes to narrow the window).


PyPy is stable/easy to use for python2, with a limited selection of C extensions.


I think processes could replace threads in most cases, but common OSs are hampering us.

Fork is slow in Windows. In unix, having lots of processes crowds ps (creating a disincentive), and if you want to effectively manage a tree of processes you have to do fiddly work managing a process group and (if you want to be fast) wrap your head and software around shared-memory IPC.

I think that if support for multiple processes in mainstream OSs were more effective than it is, we'd both spend less time worrying about threads and write more stable software.


Two answers really:

* sharing memory - sometimes you have lots of immutable data, like modules, graphs, whatnot. Yes, there is copy-on-write and no, it doesn't work well on any python implementation out there. Also sometimes it's mostly-immutable data, but not quite.

* serializing lots of data is a mess and even if feasible is usually a big performance hit if you want to exchange actual objects.


With Plan 9's rfork() (and linux's less nice clone()), you can create new processes that share the memory of the old process, which addresses those two answers. Though, I'm not sure if this reinforces OP's point or just indicates that the distinction between threads and processes can be fuzzy.


What are the uses of a process that shares address space with its parent?


I wondered too, so I had to look it up. http://cm.bell-labs.com/magic/man2html/2/fork shows that it is not the whole address space:

RFMEM If set, the child and the parent will share data and bss segments. Otherwise, the child inherits a copy of those segments. Other segment types, in particular stack segments, will be unaffected. May be set only with RFPROC.

So, it basically is a way to start a process that shares all its globals (including static variables, I think) with another process, but not other memory. That is more secure than having threads, but also more restricted, as one cannot share heap-allocated structures between such processes. I guess this feature gets used most in Fortran code where nothing gets allocated dynamically.

It also makes it easier to selectively kill a thread of execution from the command line, but I do not see when that might be useful.
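The closest portable analogue in Python to this RFMEM-style sharing of a data segment is multiprocessing's shared memory: a child process sees, and can mutate, the same bytes as the parent, without pickling. A sketch (Python 3.8+):

```python
# Parent and child share the same memory segment by name; a write in
# the child is visible to the parent, unlike ordinary fork()ed data.
from multiprocessing import Process, shared_memory

def child(name):
    shm = shared_memory.SharedMemory(name=name)  # attach to existing segment
    shm.buf[0] = 42                              # mutation is visible to the parent
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=8)
    p = Process(target=child, args=(shm.name,))
    p.start()
    p.join()
    print(shm.buf[0])  # 42
    shm.close()
    shm.unlink()
```

As with RFMEM, only this explicitly designated region is shared; ordinary heap objects in each process remain private, so the restriction the parent comment describes still applies.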


I think you're wrong, reasoning in the wrong direction. I think that green threads / fibers / lightweight threads are going to replace threads. Haskell uses them, for example. They are much more lightweight, you can get additional correctness guarantees from them (1: bind the "runner" threads to specific CPUs, you have to worry about memory fences less, 2: if other fibers that use the same resources run on the same "runner" thread, you can "lock" shared data simply by preventing a (lightweight) context switch), the only problem is that they are hard to implement in C(++) (you need stack switching if you want to be reasonably fast), so not many compilers/interpreters implement them.


I see this sentiment everywhere. Processes don't solve every problem, nor do threads, and nor do fibers. They're each useful in different places.

Despite the GIL, threads are still useful in Python for handling asynchronous operations. One or more threads pull items off a work queue, process them, and then put the results somewhere else. The GIL is a problem only when operations are CPU-bound, but Python is pretty damn slow at that anyway. Alternatively, you can sidestep the GIL by creating a thread in a C extension that does the heavy-lifting, then calls back into Python with the result.
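The worker-thread pattern described above is just a queue and a handful of threads; it works fine under the GIL as long as the per-item work is I/O-bound or done in C code that releases the GIL. A minimal sketch:

```python
# Worker threads pull items off a work queue, process them, and put
# the results somewhere else -- useful under the GIL for I/O-bound work.
import queue
import threading

work, results = queue.Queue(), queue.Queue()

def worker():
    while True:
        item = work.get()
        if item is None:          # sentinel: shut this worker down
            break
        results.put(item * item)  # stand-in for real (I/O-bound) processing
        work.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for n in range(10):
    work.put(n)
for _ in threads:
    work.put(None)
for t in threads:
    t.join()

print(sorted(results.queue))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```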


Fork doesn't work for much in Python; for example, if you want to create a bunch of data and then fork a bunch of processes to calculate stuff using the data (say the data is a bunch of word-count indexes or something), you can't even read the data from the Python processes without triggering copy-on-write. This is because reference counts are scattered throughout the data, and they get changed just by reading.


This doesn't quite apply to PyPy (which doesn't have refcounting), but the point stands. In case of PyPy this is GC flags and moving GC.


I find it strange that more attention hasn't been paid to the more obvious path: Multiple interpreters in one process. You can safely run a separate Python interpreter in each thread of a process. More overhead than threading, but less than multiprocessing, and takes care of both the Windows forking problem and the general unix "housekeeping" problem.


Here is the paper mentioned (but not linked to) in the article:

A comprehensive strategy for contention management in software transactional memory

http://portal.acm.org/citation.cfm?doid=1594835.1504199


It seems like the real solution would be to introduce low level guarantees and memory access (similar to Java) and the option to disable the GIL and C API extensions in CPython (until a new API is introduced?).

Also, give the developers access to some real synchronization primitives, that would be sweet.

I'm not a CPython developer, but the last of the points on the wish list[0] seem very unfeasible to me. Not even STM solves the "Speed" requirement, but PyPy does away with native extensions, so it's halfway there!

[0] http://wiki.python.org/moin/GlobalInterpreterLock


If you are not familiar with the GIL, this video was really informative: [video] http://blip.tv/file/2232410


Wouldn't this negatively impact power consumption (over fine-grained, Jython-style locks) by having a processor repeating tasks instead of blocking and waiting for a task? It seems like this is intended to allow Python to scale to multiple cores better, but that is most often useful in a data center environment, where wasting power seems like a bad idea.


As the author wrote at the end of the article, instead of removing the GIL once and for all, they will rather add an option to use STM instead of the GIL, as a compile-time option or maybe even at runtime, since it would not only affect power consumption but obviously speed as well.

That way the user can decide whether they need multi-core scalability or simply speed.


Why can't we use stackless python?


Stackless python has a GIL as well. What we need is lockless python :-)



