Author Archives: walkerb

x86/x64 vs ARM: What’s the difference anyhow?

You may have heard about Windows RT vs Windows 8. They’re, like, almost the same, but also really different? It’s confusing. Well, here’s the difference:

Windows 8 can only run on x86/x64 processors. Windows RT can only run on ARM processors.

Cool. But why does x86/x64 vs ARM matter? Why does each processor require different versions of Windows? Let’s drill down.

x86/x64 processors: They’re fast and powerful, but they require a lot of electricity. So, they’re used in desktop computers that can plug into the wall. Every version of Windows except Windows RT runs on x86/x64.

ARM processors: They’re weak but low-power processors for smartphones and other devices that aren’t plugged into the wall. Mobile iOS and Android operating systems run on ARM.

The two processor architectures are incompatible: a program that’s built for x86/x64 can’t run natively on ARM, and vice versa. (Emulation can bridge the gap, but slowly.)

Since the 90s, these architectures have existed in parallel worlds: ARM for phones and small PDAs, x86/x64 for desktops and big laptops. But in the past few years, the market’s gotten all hot and bothered for tablets that are bigger and more powerful than phones, but simpler than laptops — like the iPad or Kindle. Tablets have to be small and light, which means tiny batteries, which means ARM processors. But tablets have USB ports, full web browsers, and word processing and photo editing apps, which means ARM-based portables have become direct challengers to x86/x64-based desktops.

Windows has always been a desktop-only operating system, so it’s only been available for x86/x64 processors. But Microsoft sees everyone moving to tablets, and it doesn’t want to lose all its future revenue, so it entered the mobile arena with Windows RT and the Surface. [update 2016: Windows RT sorta failed and the Surface is becoming a brand for x86/x64 processor powered laptops, with the Surface Pro and Surface Book]

x86/x64 processor manufacturers are potentially the most harmed by the rise of tablets. But there’s an easy way for them to stay relevant: make an x86/x64 processor that’s low-power enough to place in tablets. Intel’s doing that with the Atom processors, which give up processing power, x64 support (on some models), and the newest high-speed computing features (like the latest SSE extensions) in return for super-reduced power consumption.

BUT. GET THIS. Atom processors still can consume 2x the electricity of an ARM processor, even at the same processor count/clock speed. What?

Well, it’s because of an inherent difference between the processors. See, “ARM” stands for “Advanced RISC Machine”. RISC stands for ‘Reduced Instruction Set Computing’, and, befitting an acronym that’s part of your entire brand name, it’s what makes ARM so low-power.

You may remember that the instruction set refers to the set of commands that the processor can execute. Well, most code only requires a few instructions — read/write memory, do arithmetic, jump, boolean logic, not much more. ARM processors only offer these basic instructions. Thus, a reduced instruction set.

x86/x64 processors are CISC, or ‘Complex Instruction Set Computing’. Although almost all code can be represented by the basic instructions in RISC, certain patterns of instructions are common — for instance, “write this byte to memory then look at the immediately following byte”. CISC processors offer combo-instructions (previous example being STOSB) that handle these common instruction patterns super-efficiently. However, support for these combo-instructions requires extra hardware — and that hardware costs electricity.

That difference in hardware is why ARM processors use less power than x86/x64 processors at the same clock speed. Mind you, it also means that some programs run faster on x86/x64 processors than they do on ARM processors with the same specs: an algorithm that takes 3 cycles on an ARM processor can take 1 cycle on an x86/x64 processor if it’s been wrapped into a CISC combo-instruction. It’s also part of why programs built for x86/x64 can’t run on ARM. Once you compile a program for x86/x64, it’s hardcoded to use x86/x64 instructions (combo-instructions included), and there’s no cheap translation to ARM from there.

So what do the specs of two similar-release-date processors look like? Let’s compare the ARM AM3359 and the x86 Atom Z650.

SPEC                 ARM AM3359    Atom Z650
INTRODUCED IN        Q3 2011       Q2 2011
# CORES              1             1
CLOCK SPEED          720MHz        1.2GHz
L1 CACHE             64KB          56KB
L2 CACHE             256KB         512KB
POWER CONSUMPTION    0.7W          3W

The Atom Z650 is definitely more powerful, with a 67% clock speed increase — but it consumes over 400% the electricity. That said, it theoretically could run a program over 4X faster than the ARM AM3359, if that program uses a ton of CISC instructions.

All the same, I’d eat my whole Beanie Baby collection if a real-world program can get more than a 2x speed increase. And that’s a lot of Beanie Babies.

WHAT IS THIS ABOUT ASSEMBLY NOW

“Dang Ben,” you say, “it’s absurd how good you are at Starcraft 2!” Well, yeah, you’re right. But what’s almost as absurd is how cool assembly is.

Assembly is the code that your C/C++ gets compiled to (other languages too, but fuck ’em). It’s a super low-level, close-to-the-metal language, where each line of code represents exactly one task for the processor. There’s a bunch of different flavors of assembly, depending on your processor, but we’re talking about 32-bit x86 assembly here.

Let’s see what the code int firstNum = 10; int secondNum = 31; int thirdNum = firstNum + secondNum; looks like in assembly:

mov dword ptr [firstNum],0Ah
mov dword ptr [secondNum],1Fh
mov eax,dword ptr [firstNum]
add eax,dword ptr [secondNum]
mov dword ptr [thirdNum],eax

As you can see, lines of code in assembly are structured command var1, var2.

  • command is one of a preset list of commands, called the instruction set. These instructions are the only things your processor can do; all your code is expressed in terms of these instructions.
  • var1 is the destination, and var2 is the source.
  • [x] means “don’t look at x, look at the memory at the address held in x”. So, it’s a pointer-dereference, like * in C.
  • dword ptr means x is 32 bits long, or double the size of a 16-bit word ptr.
  • mov means “move”, and 0Ah means “hexadecimal 0A” (10 in decimal), so mov dword ptr [firstNum],0Ah writes 0x0A into firstNum.
  • Sometimes var1 is used as a source as well as a destination: add var1, var2 means var1 = var1 + var2

Cool! So that tells us everything, except… what is eax? Remember that processors can only do arithmetic and logic on data in registers. eax isn’t a variable, it’s a handle to a physical register! x86 assembly has only eight registers that you can read and modify at will, and eax is the one you’ll see most (because it’s favored for arithmetic). The eight registers aren’t interchangeable, though (oh my god are there differences).

Anyhow, assembly isn’t just some academic concept. You can read the assembly your code gets compiled into, and even insert your own assembly in-line with C/C++ code for sick micro-optimizations (sort of — there’s caveats).

disassemblin
In Visual Studio, stick a breakpoint in some code and hit alt+8 when you hit it. Congratulations! You’re looking at assembly! You can even step through individual instructions to get some hot debugging action. This is a really powerful tool for learning low-level architecture, and I totally encourage you to play with it. There’s no abstractions left when you’re reading assembly. Check out how for and while loops are actually implemented — it’s all just GOTO instructions (well, the instruction is called JMP).

If you want to write in assembly, you can do that too! Maybe. You can write inline assembly for 32-bit x86 builds, but Microsoft’s Visual C++ compiler doesn’t accept inline assembly when targeting x64, and recommends you use a predefined set of highly-optimized, low-level intrinsic functions instead.

This isn’t because of any hardware changes in x64. Instead, it’s because inline assembly isn’t necessarily a speed boost. Having inline assembly defeats a ton of compile-time optimizations, since it means the compiler doesn’t get full control over what data is in which registers at any time. You can string intrinsics together to get the speedy low-level behavior you want, and you aren’t fighting the compiler by doing so.

So, you may not want to use inline assembly as a performance tool, since support for it is going away and it can hurt your perf by ruining compiler optimizations. However, it’s still a great learning tool, so don’t be afraid to try it out! To add inline assembly, just use the __asm{ ... } command. For instance:

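#include <iostream>
using namespace std;

int main() {
   int myNum = 10;

   // Visual C++ syntax; only compiles for 32-bit x86 targets
   __asm {
      mov eax, dword ptr [myNum]   // eax = myNum
      mov ebx, 20                  // ebx = 20
      add eax, ebx                 // eax = eax + ebx
      mov dword ptr [myNum], eax   // myNum = eax
   }

   if (myNum == 30)
      cout << "OH DAAAAAAMN";
}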

Anyhow, that's enough. I'm off to perfect my reaper-into-battlecruiser build. Happy coding!

Let’s Talk Processor Architecture

“Hey Ben Walker”, you say, “you’re really good looking, but can you explain how my computer’s processor works?”. Well, I’m double trouble, and by the end of this post, you’re gonna understand processor architecture.

Actually, you’re gonna understand single-core scalar (as opposed to superscalar) processor architecture. These processors went obsolete in, like, 1993. So you’ll be twenty years out of date. Ladies.



Examine the schematic above (read left-to-right), and then let’s dive in!

HOLD UP, WHAT’S THIS ABOUT CODE VERSUS DATA? Your processor needs two things. It needs data (like array<string> myAnimes, a list of the 500 animes you own), and it needs instructions that act on that data, or code (like myAnimes.eraseAll() please). Mind you, code is just data. It’s stored on your hard drive as bytes, same as anything else. However, your processor knows which bytes represent code and which represent data, and it handles the two very differently. Anyhow.

SYSTEM BUS: So code and data are just bytes. Problem is, those bytes lie in your hard drive or RAM (generally, ‘main memory’) — they aren’t stored in the processor itself. The system bus’ job is to take requests from the processor to grab specific bytes, get them from main memory, and forward those bytes around when received.

IF THE SYSTEM BUS RECEIVES DATA: It forwards that data to the memory management unit, which will in turn forward it to the registers.

MEMORY MANAGEMENT UNIT: Called the MMU. It’s a clearinghouse for bytes. It receives requests for code/data from the rest of the processor, figures out where to look in main memory, and tells the system bus to do so. It also forwards received data to the registers, and determines which register to store the data in. It also has a cache for instructions and data, so it can fulfill requests without going to the system bus and main memory.

REGISTERS: Registers are the only memory that can be read and written to by the processor. There are very few registers. Modern x64 processors have 16 general-purpose registers that can hold 8 bytes each, meaning 128 bytes of memory. That’s not enough memory to store a paragraph of text. Because there’s so little memory, it’s very important that the processor is efficient: it can only load data immediately before that data gets used, and once that data is used, it needs to be replaced as soon as possible.

IF THE SYSTEM BUS RECEIVES CODE: It forwards code to the instruction pre-fetcher, which in turn goes to the decoder, the sequencer, and the ALU.

INSTRUCTION PRE-FETCH: It figures out what instructions we’re going to execute in a few cycles, and sends requests for those instructions to the MMU right now so we’ll have them on hand when the time comes. In other words, it keeps the instructions flowing. Whenever your code branches ( if(x > 0) DoThis(); else DoThat(); ), the instruction pre-fetch has the interesting task of trying to predict whether to pre-fetch DoThis() or DoThat() before we’ve run all the instructions that determine if x>0. That logic is called a branch predictor.

INSTRUCTION DECODE: Remember how instructions are just stored as bytes, same as everything else? The instruction decode unit is what takes the data 0x01c1 and decodes it as ADD [REGISTER0] TO [REGISTER2]. If the decoder decodes an instruction and finds out that it references data that isn’t in our registers yet, the decoder requests that data from the MMU.

INSTRUCTION SEQUENCING AND CONTROL: It manages out-of-order execution. Imagine your code says myAnimes[315].MarkWatched(); ++numAnimesWatched; but the processor has yet to load your 315th anime into registers. The instruction sequencer recognizes instructions that we can execute on immediately, and jumps them ahead of instructions that are still waiting for data. So, it allows numAnimesWatched to increment even though we’re still waiting to load myAnimes[315]. Heck, the instruction sequencer will allow any instructions that aren’t affected by the result of myAnimes[315].MarkWatched() to skip ahead in line, keeping the processor as busy as possible. To save money and power, some processors — including the Xbox 360 processor — don’t include this unit, and can only process instructions in order. Those instructions are passed to the arithmetic / logic unit.

ARITHMETIC / LOGIC UNIT: Also called the ALU. This is the core of the processor. It receives commands such as ADD THESE NUMBERS TOGETHER (arithmetic) or SAY '1' IF THIS NUMBER IS BIGGER THAN THAT NUMBER, OTHERWISE SAY '0' (logic), and it does them. The results get written into the registers, or get sent to the memory management unit to be written out to main memory.

And hey, you’re done! Don’t get me wrong, this is an absurdly simplified overview. The actual block diagram of an Intel 80386 processor handles plenty of issues I ignored, such as handling overflow/underflow in arithmetic, switching between 16/32 bit operating modes, integer vs. floating point pipelines, and pretty much everything else. But you know what? You did good. Give yourself a cookie. Or email me at walkerb@walkerb.net and complain about everything I did wrong. And happy coding!

Multiply Like A Pro

I’m going to teach you how to multiply any two numbers between 1 and 100, in your head, as quickly as it takes to input them in a calculator. And it’s dead simple.

(It does require some memorization though)

So, I figured out this technique – probably somebody else figured it out before me, but I’ve never heard of it before – that reduces the task of multiplying any two numbers together into some addition, some subtraction, and one division by two. It’s pretty simple. In fact, it all just boils down to one equation:

(x + n)(x – n) = x² – n²

If your eyes glazed over at the above equation, be strong! Do the FOIL method and check it out for yourself. This is about as hard as the math gets.

Cool! What does that equation mean, though? Well, any two numbers (we’ll call them ‘a’ and ‘b’, with ‘a’ the larger one) can be rewritten in terms of ‘x’ and ‘n’:

n = (a – b) / 2
x = (a + b) / 2

That is, we say that n is half of the difference between our numbers a and b, and that x is the number halfway between a and b. Now, b = x – n and a = x + n! Therefore, for any a and b,

a × b = (x + n)(x – n) = x² – n²

We’re going to run through these formulas using some example numbers, to keep you following along, before we talk about how to solve x² – n². Let’s take the following two sets of numbers:

23 x 41
49 x 74

These numbers are big enough that most people wouldn’t even bother trying to multiply them in their heads. Is it easier with this new method? (Yes.)

First, find n, half of the difference between the two numbers…

23 x 41:
41 – 23 = 18
n = (18 / 2) = 9

74 x 49:
74 – 49 = 25
n = (25 / 2) = 12.5

So the halfway point between 23 and 41 is

23+9 = 32

And the halfway point between 49 and 74 is

49 + 12.5 = 61.5

This means that…

23 x 41 = 32² – 9²
49 x 74 = 61.5² – 12.5²

But what are 61.5², 12.5², 32² and 9²?

This is the hard part of the method. I don’t expect you to be able to solve 61.5² in your head. Instead, you’re going to have to memorize 100 squares: the square of every integer from 0 to 99. Yeah.

It’s a lot of work, but it’s entirely doable. There’s only 100 numbers, and you probably already know the squares of, like, 30 of them. If you’re thinking of giving up now, just think of all the sexy mathematician ladies (or dudes) you’ll be picking up at parties once you’ve memorized these and can multiply in a second!

If sexy mathematician ladies (or dudes) aren’t enough to make you memorize all 100 squares, though, there are tricks you can use to calculate squares very quickly in your head. I’m not going to talk about them here. I’m not supporting your laziness. But if you’re interested, you should read up on Vedic Mathematics, it’s got some pretty cool stuff about mental algebra.

But seriously, go the hardcore route, memorize this table. If you do so, you’ll already know the squares of all those integers-plus-one-half like 61.5² (but more on that later):

x x² x x² x x² x x²
0 0 25 625 50 2500 75 5625
1 1 26 676 51 2601 76 5776
2 4 27 729 52 2704 77 5929
3 9 28 784 53 2809 78 6084
4 16 29 841 54 2916 79 6241
5 25 30 900 55 3025 80 6400
6 36 31 961 56 3136 81 6561
7 49 32 1024 57 3249 82 6724
8 64 33 1089 58 3364 83 6889
9 81 34 1156 59 3481 84 7056
10 100 35 1225 60 3600 85 7225
11 121 36 1296 61 3721 86 7396
12 144 37 1369 62 3844 87 7569
13 169 38 1444 63 3969 88 7744
14 196 39 1521 64 4096 89 7921
15 225 40 1600 65 4225 90 8100
16 256 41 1681 66 4356 91 8281
17 289 42 1764 67 4489 92 8464
18 324 43 1849 68 4624 93 8649
19 361 44 1936 69 4761 94 8836
20 400 45 2025 70 4900 95 9025
21 441 46 2116 71 5041 96 9216
22 484 47 2209 72 5184 97 9409
23 529 48 2304 73 5329 98 9604
24 576 49 2401 74 5476 99 9801

Now, here’s a cool trick: even though we need the squares of integers-plus-one-half like 61.5 for this method, we already know them if we know the square of that number’s closest integer (rounded down). This is because:

(x + 0.5)²
= (x + 0.5)(x + 0.5)
= x² + x + 0.25

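And for bonus points, we can ignore that “+ 0.25” completely! The midpoint and the half-difference are always either both whole numbers or both integers-plus-one-half, so whenever a 0.25 shows up in the first square, the same 0.25 gets subtracted out by the second half of our equation. Forget about it! Therefore…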

61.5² (minus the 0.25) = 61² + 61 = 3721 + 61 = 3782
12.5² (minus the 0.25) = 12² + 12 = 144 + 12 = 156

Now, it should be easy for you to solve those two problems.

23 x 41 = 32² – 9² = 1024 – 81 = 943
49 x 74 = 61.5² – 12.5² = (3721 + 61) – (144 + 12) = 3782 – 156 = 3626

And there you go! To recap this method:

  • Find half of the difference between the two numbers you’re trying to multiply
  • Add that to the smaller number to get the midpoint between your two numbers
  • Midpoint squared minus half-of-difference squared equals your result!
  • If the number is an integer-plus-one-half, remember that (x + 0.5)² = x² + x (for our purposes)

There’s no reason you can’t use this method to multiply two numbers over 100, either! You just have to memorize your squares tables up to the highest number you’re willing to multiply.

I hope you get tons of use out of this and impress all the sexy mathematician ladies (or dudes)!