I've been playing around with some Greenarray[1] chips lately, and I have to admit that the core ideas are fascinating. Their particular variant of Forth is actually almost pleasant to use--certainly better than MIPS, which is my closest point of reference. (Ignoring crazy things like the color of identifiers controlling your program's semantics.) The chips are also extremely power efficient.
I've found having a mental model of a stack-based system far easier than even a simple register-based one. However, this might also be true because the Greenarray chips themselves are far simpler than any other CPU I've ever looked at.
And, of course, the fact that you have 144 cores on a tiny chip despite a cheap manufacturing process is also very neat.
Oh, please write about your experiences programming those chips. I'm sure I'm not the only one who'd like to hear all about it. I'd particularly like to know how you deploy code to the 144 cores and how they talk to one another.
Edit: that request is to anyone who has programmed the GreenArrays chips.
Yes, please write this! I'd normally just upvote your comment, but I feel like the sorts of people who play around with GreenArray chips are not that prone to blogging, at least by my googling...
Are you able to boot all the cores and get them functional? I like the design of these chips too, especially how each core "blocks" on its message queue by instantly going to sleep, taking zero power -- and then instantly wakes up the moment a message appears.
Mini book review: On the off chance you're planning on implementing a stack machine, you must read this book. A few years ago I was working on a stack machine, and I didn't know what to read because virtually all of the references for fundamental material are decades old. Let me save you the trouble and tell you this is the one you want. There are newer papers you should read, of course, but if you're only going to read one thing, it should be this.
Why would you want a stack machine in this day and age? They're small. Really, really, small. The stack processor we made was so small that it fit in an empty space in the floorplan of our "real" (x86) processor.
I'm using stack machines now because, like you said, they're small. Meaning, you can fit a ton of them on a chip, and if you can get them to talk to each other, you have yourself a pretty powerful parallel processor. The chip I'm working with has 144 cores running at ~700MHz, and it costs $20. It's insane how much being small buys you.
The very same. It's working really well—initially I was intimidated about the chips because of a few of the constraints the chip has in order to be as good as it is. Three challenges:
1) Figuring out how to work within the very low (but reasonable) amount of RAM each core has. They're independent computers, so each core needs to store its code and its data, all within 64 18-bit bytewords, which roughly corresponds to a tweet of information if you tweeted in UTF-8.
2) Dealing with the IDE: the IDE is attacking a problem nobody's solved: parallel compilation. That being said, it has been the second greatest challenge so far, after
3) Getting the hardware set up. My co-project-doer is light-years ahead of me in the EE department and makes most of the EE decisions, with some input from me about what capabilities the chip needs to have (a minimum amount of off-die RAM, say). Sourcing parts is something I would not have been able to do on my own, so I'd say this project requires a good deal of electronic literacy, if not dexterity.
That being said, though it's a bit of a prima-donna, it has every right and reason to be. It's an amazing, amazing chip. It's fast, powerful, unclocked, can work with all kinds of devices, is a crazy bargain both in hardware cost and in energy cost, and best of all, for me at least, an endless source of fun, hard problems. It's a fascinating chip.
It's great to see people using Chuck's work, and in particular it's wonderful to see your reaction. I think it would be easy to recoil at the strange tools, not to mention the difficulty of programming the thing. I agree it's a fascinating device.
Can you give a hint as to what we can expect to do with this chip?
I was a Forth enthusiast and read the book in OP (in early 2000's) and followed Chuck's work from afar, but I am not sure what exactly I could do with this chip, given its peculiar limitations.
With low RAM like that, what you want to do is math, not UI. That means electrical engineering or number crunching, AFAIK.
OK, so first, it has a lot of I/O ports of different kinds on just one chip. In a lab, you could run lots of analogue knobs and many servos attached to just one chip. This means a lot less hardware debugging, which should be a Godsend.
Another possibility is HPC. You can tile a motherboard with these, cool them all down to -50 very easily and cheaply, and then overclock them as you perform number crunching that you couldn't do with a GPU. After all, these cores are truly independent, and can therefore branch independently, unlike GPU cores. That opens the door to a wide variety of algorithms that aren't GPGPU accessible.
I've always thought it was a nifty chip, but one I couldn't think of an application for. Things like the 18-bit word size can throw some developers off -- the last time I looked at the chip some audio application guys were kvetching that they really need 24 bits for their applications. I'm skeptical that the architecture is a good fit for number crunching, but perhaps I'm just blind to all the numeric apps that 18-bit integers are nifty for.
yet like said at the beginning, it doesn't stop me from wishing I had an excuse to play with them. Blazing speed with lots of cores, there ought to be some applications that for that sort of parallel horsepower.
Don't think of the F18 computers in isolation- think about doing stream processing with a network of them. You probably can't fit your application on one F18, but you could easily fit one or two functions in 64 words, especially given that a single machine word can contain up to four instructions.
It sure looks like a fine chip for a very small niche, but they don't even have floating point instructions AFAIK, so you'd have to emulate them in software. So I'm extremely skeptical about their number crunching capabilities.
No floating point. I believe design complexity of the chip would increase manifold if it was. Emulation is not an option too with those extremely limited resources (64 words of memory). The design philosophy of this chip is that 18 bits of integer precision should be enough for your applications, period. I just feel like I'm not smart enough to figure out how to fit a useful application into that.
Yes, I'm using Color Forth. It's not the best part of the package...it took a really long time to learn, and I had to learn its keyboard layout to do so. My philosophy, in general, is, "do whatever you want, but don't change my editor or keyboard." After all, I spent several weeks being unable to type because I was learning Dvorak, years ago. And emacs had a steep learning curve too.
The UI, I have to say, would not pass Apple's muster. It is a completely different way of programming, and I preferred to program on paper for the longest time rather than learn all the weird controls.
Hmmm interesting... If it uses that much less silicon then I imagine you could save huge amounts of power. A lot of people would be interested.
There must be some tradeoff then. Companies are burning huge amounts of cash buying data centers in remote areas because of power. Presumably it's not very hard to compile C to stack code. (Or maybe it won't be as efficient as... Forth?)
Or maybe most applications aren't really CPU limited. I remember reading about Chuck Moore's low power processors, but they seem to have mostly specialized applications. In that case it's probably not fair to compare them to x86.
The book actually discusses the problems with running C code on stack machines. To sum up, yes- you absolutely can run C code on a stack machine, but you lose most of the advantages of the architecture. Stack machines make procedure calls very inexpensive and encourage the use of very short procedures with a small number of variables in scope at a given time. The stack operators that are available provide quick access to a few elements at the top of the stacks at the expense of random access to elements that are deeper. Stack-oriented languages like Forth make the programmer aware of this and encourage breaking tasks down in this manner. Idiomatic C has procedures which are at least an order of magnitude longer and have many more values in scope at a given time. This provides a compiler with a fairly uncomfortable and difficult optimization task.
Speaking as someone who works with compilers and really likes stack machines, I don't think they will provide much practical value unless programmers are willing to adapt their languages and their approach to programming to the strengths and weaknesses of the paradigm.
"The algorithm have I developed for intra-block stack scheduling seems to be quite effective, eliminating 91% to 100% of redundant local variable accesses within basic blocks for the small programs studied. Hand-performed global optimization results indicate that significantly better stack scheduling can be done if variables are kept on the stack across basic block boundaries."
What is also possible is to discover identical code and automatically generate procedures for it.
Yeah that makes sense. I think programming is becoming more and more heterogeneous. So it's possible that stack machines will make a resurgence for some applications, driven by huge power savings.
I like Forth but feels a bit "old" to me, i.e. in that it doesn't have common data structures like hash tables. I wonder if a higher level stack language like Joy would run better on a stack machine vs. x86.
But once you are higher level, the procedure call time doesn't really matter. I haven't ever programmed any application where procedure call time is a factor... it almost seems like the last thing to optimize.
How small is really, really small? Something that is small in a x86 sense is not at all small in an embedded world. If you could elaborate a bit on the performance requirements for the application I'd be really happy. I am curious how it compares to an equivalent ARM (f. x. the 12kgate Cortex-M0+) in terms of area, power, performance.
One domain where stack machines shine is code density because of no need for register encoding. Instruction lengths aren't measured in bytes, but rather in bits. With some form of binary coding, one can reduce code size even further, 1-7 bits per instruction is not uncommon with rather elaborate instruction set which includes a handful of "macro" instructions for common tasks such as initializing array of data and random number generation and some others.
If implementing a stack based VM within a program, the stack machine has some overhead(because it has to be implemented with all the functionality and interfacing with the rest of the program) but that is mainly a curiosity - one can implement a less-well scaling ISA without binary coding and fixed opcode length with far fewer bytes for the VM, but then the cheap cost will be offset with worse code density. On the other hand a more advanced VM takes far more space, but the greater code densty pays off after certain amount of code.
Bit-aligned instructions are a really horrible idea if you ever want to make a CPU implementing that instruction set actually fast. When you want to decode multiple instructions in parallel, having more possible locations for the beginning of the next instruction adds a lot of latency and power on a very critical path. Getting more cache to fit instructions and more bandwidth from it to the cpu is a problem that can be solved by Moore's law, having to put more muxes in front of your decoder isn't, now that Moore's law just buys you more transistors and less power, but not faster transistors.
x86 is now considered to be extremely hard to decode because it's byte-aligned instructions, and basically all modern instruction sets prefer fixed-width instructions. ARM used to tout how space-efficient it's THUMB-2/ARMv7 instruction set was compared to other RISC instruction sets because it had 2 possible instruction lengths (16b and 32b), but now that they got to (had to) reboot the ISA for 64-bit, ARMv8 is fixed-width.
For clarity for those who aren't exactly clear with all this, this only applies when the instruction set is actually compressed somehow - not for stack based machines in general.
In a more general sense, I see benefit outside of just demoscene. If code density makes programs small enough to fit in cache it makes diskless, RAM-based systems potentially faster.
One could easily think so, but it's not quite true.
While x86 processors have relied on caching the instructions for some two decades now the overhead of decoding the VM bytecode instruction bitstream of the VM is certainly too high(especially if it's compressed) to outweigh the possibly reduced stalls due to instruction cache miss because of greater code density.
Of course, one could translate the VM bytecode bitstream to equivalent x86 code upon running, but without some really smart heuristic optimizer there wouldn't be much to gain in speed over simple x86 implementation, although it'd very probably be faster than the VM.
Either my comment was mistunderstood (e.g. we are each referring to different caches) or I am suffering from a fundamental misunderstanding.
Let's say you have a diskless RAM-based system. Swapless.
Now you launch a large program, say firefox or some other "modern web browser". That is going to require page caching, correct? This program cannot fit into the CPU cache, specifically what the Wikipedia entry (to use a common point of reference) calls the "data cache" (cf. what it calls the "instruction cache").
Now say you launch another large program. Depending on how much main memory you have, the OS is going to have to do some decision-making. Which program (or parts thereof) need to be "immediately available"?
Now imagine you have a small program constructed from very dense code, say a few K in size, and it fits entirely within the data cache. Correct me if I'm wrong, but in that case there is no page caching or OS decision-making. The program is "immediately available" as long as it's in the data cache.
OK, now I'm sure someone will take this comment apart piece by piece. But keep in mind the general idea I am suggesting is: code density lends itself to smaller programs, smaller programs lend themselves to fitting in the data cache, and fitting entirely within the CPU's data cache lends itself to running faster in a diskless RAM-based system. If this is wrong, then you need to explain why, specifically.
You seem to be mixing up several different levels of caching. Firstly, instructions are primarily fit in the instruction cache, not the data cache.
Secondly, there is no signifcant address translation overhead and the os doesn't need to do any decision making (or well, it does, but it is cached in the TLB and code TLBs are so efficienct nowadays that you can expect the cost to be null) so long as the resident set fits in the main ram -- whether it fits in the cache or not is irrelevant.
Thirdly, just how does the disklessness of the system have any influence? I am not entirely sure but it seems you are somehow mixing up the page cache and the CPU cache, which are completely different things.
Since then the result seems to be that stack vs registers, it doesn't really matter as much as people thought. A modern processor that's out of order [1] has to reconstruct the dependencies in the dataflow graph anyway. Out of all the things going on in a processor, doing so from a register ISA or from a stack ISA is not that different. A bare bones stack machine may be smaller than a bare bones register machine, but nobody wants to run those anyway.
One of problems of stack machines is that it is mostly impossible to build useful out-of-order stack machine (or even in-order multiple issue). Short or non-existent pipelines and low CPI of stack machines tend to offset this for certain classes of code (lots of branching and so), but cannot rival performance of modern register based designs for other classes of code (number crunching).
It is straightforward to convert stack instructions to 3-address instructions, by keeping a counter -- a stack pointer -- that points into the register file. As instructions are decoded the stack pointer is read and the current reading is delivered on down the pipeline with the instruction itself.
From there on, the cpu is executing an instruction that calls out explicit register numbers. The stack counter is kept with the decoder at the very start of the pipeline. This "register number decoder" needs to run sequentially, and it needs to run a faster cycle time than the rest of the cpu.
Idea: Notch should use a stack machine as the basis for his new 0x10c game. (A big part of the game involves programming an in-game 16bit computer.) It'd be a great excuse for a lot of people to explore this architecture.
I do hope more old technical books make it to a new life online. Otherwise whole areas will get increasingly hard to access, as there are only a limited number of library copies.
[1]: http://www.greenarraychips.com/
I've found having a mental model of a stack-based system far easier than even a simple register-based one. However, this might also be true because the Greenarray chips themselves are far simpler than any other CPU I've ever looked at.
And, of course, the fact that you have 144 cores on a tiny chip despite a cheap manufacturing process is also very neat.