Category Archives: Hardware

Going Down The List: RAM Stats Spec Sheet

So, we’ve already discussed CPUs, GPUs, and motherboards. But we haven’t discussed RAM! LET’S FIX THAT.

RAM stands for “Random Access Memory”. The name, like many in computer hardware, is antiquated and emphasizes aspects of memory that nobody cares about anymore. Although almost every component in a computer has some amount of random-access memory attached, most people are referring to the “main memory” in computers when they talk about RAM. For this conversation, we’re only talking about SDRAM, or Synchronous Dynamic RAM, which is the type of RAM used in desktop computers.

RAM only has two jobs: it holds a ton of data, and when the CPU asks for a specific piece of data, it finds that piece and returns it to the CPU as efficiently as possible (both in terms of bandwidth — megabytes transferred per second — and latency — time between the CPU asking for data and the RAM returning it). Although that’s not too much to do, there’s still plenty of jargon to analyze when discussing RAM.

We’re going to split this up into the EASY MODE terminology about RAM (self-evident information), and HARD MODE terminology (which requires understanding the nitty-gritty details of how RAM works).

EASY MODE

CAPACITY: This represents how much information can be stored in memory at once. More is better.

PIN COUNT: It’s the number of connections made between the RAM and the motherboard (if you count the little gold tabs at the bottom of your stick of RAM, that’s your pin count). Your motherboard only supports a certain type of RAM (usually 240-pin DDR3). Obnoxiously, two different generations of RAM can have the same pin count but not both be usable by the same motherboard, so make sure both pin count AND generation of DDR match between your RAM and your motherboard’s supported spec sheet.

DDR GENERATION (DDR1 vs DDR2 vs DDR3): Changes in generation represent major shifts in the internal design of RAM. Each generation is incompatible with the others, so you can’t install DDR2 RAM on a motherboard that supports DDR3. Later generations consume less energy and offer higher clock speeds (they also add more latency, but the higher clock speeds offset that). The “DDR” itself stands for “double data rate”, and it refers to the fact that all DDR memory can do two operations per clock cycle.

VOLTAGE: How much voltage the RAM runs at, which in turn determines how much power it consumes. In general, DDR consumes 2.5V, DDR2 consumes 1.8V, and DDR3 RAM consumes 1.5V. You can buy RAM that doesn’t conform fully to this spec, but most RAM will match those numbers.

BUFFERED/REGISTERED RAM: Buffers/registers help the RAM during periods of prolonged access and keep RAM more stable, but they cost more money to purchase and add latency. You don’t need it if you aren’t working on a server.

DDR[X]-[Y] PC[X]-[Z]: You’ll see this format appear on some RAM spec sheets. The “X” represents which generation of DDR SDRAM is used (it will always be the same as the number after “PC”). “Y” represents the effective clock speed, or, how many operations it can perform per second. (Note that this number is actually double the real clock speed, due to DDR’s two-actions-per-cycle methodology). Finally, “Z” represents the maximum theoretical transfer rate in megabytes/second, or, its max bandwidth. Since DDR RAM transfers 8 bytes per operation, this number is always just 8 * the effective clock speed in megahertz.

HARD MODE

As you may know, RAM only thinks of data in terms of addresses — for instance, instead of asking RAM what is the value of myInt, you’d look up the address of myInt, see that it’s 0x0f3c, and then ask the RAM for the data located at 0x0f3c. Well, internally, RAM memory banks are stored as a 2-dimensional table. So, once it receives that instruction, the RAM may internally split address 0x0f3c into 0x0f and 0x3c, in order to create the instruction read the data at row 0x0f, column 0x3c. Memory is stored in row major order, meaning that you read memory along rows, not along columns. With that in mind, let’s look at how to determine RAM latency.

TIMING: You’ll often see timing numbers that look like “A-B-C-D” or “A-B-C-D-E”. Each number represents the latency in performing certain operations. The smaller the numbers, the less latency, the better your RAM is. In order: A represents CAS Latency, B represents RAS to CAS delay, C represents RAS precharge delay, and D represents Row Active Time. If listed, E represents Command Rate. We’re going to define each term in a different order than it’s listed in the timing specs.

RAS AND CAS: These stand for “Row Address Strobe” and “Column Address Strobe”. Basically, it means “this latency appears when we look at a different row” or “this latency appears when we look at a different column”.

RAS PRECHARGE DELAY: Whenever you need to look at a new row of memory, you have to wait a certain number of clock cycles for that row to be prepared. If your current memory read is off the same row as your previous memory read, that row is already prepared and you don’t need to pay this cost. Referred to sometimes as tRP.

RAS to CAS DELAY: This represents the number of clock cycles the RAM has to wait between defining which row to read and which column to read. Referred to sometimes as tRCD.

CAS LATENCY: This represents the number of clock cycles the RAM has to wait, after the row and column are specified, before the data at that row/column is ready to send out. Referred to sometimes as tCAS. This is the most well-known source of latency, but really, RAS to CAS delay is just as important. So pay attention to your entire timing specs, not just a separately-listed CAS Latency spec.

ROW ACTIVE TIME: This represents the number of clock cycles the RAM has to wait between activating a row and de-activating it in order to access a new row. Ideally, it should equal CAS latency + RAS-to-CAS delay + 2 clock cycles, or, the amount of time taken to read data from a row after it’s activated plus two clock cycles to push out the memory to the CPU.

COMMAND RATE: Often not represented in the timing numbers, because it’s not that important. Represents the time between activating a memory chip and it being able to receive its first command.

LATENCY: Your actual latency between deciding you want memory at a certain address and receiving it will vary depending on what memory has been accessed previously, but at worst, it is equal to RAS precharge + RAS to CAS delay + CAS latency (or, the sum of the first three numbers in your timing specs).

Motherboards: Where Are Their Ladyparts??

Continuing in a series of explanations of computer hardware, let’s look at motherboards! How do you tell motherboards apart? If you can hook all your hardware up to two motherboards, which one is better?

BASIC STUFF: THE MOTHERBOARD AS CONNECTIVE TISSUE

The most important thing about the motherboard is that it connects all the individual parts of your computer. The motherboard contains the wires that let data flow between CPU, GPU, RAM, HDD, your keyboard and mouse, etc. — it is the spinal cord of your computer.

When you’re buying a motherboard, the most important spec is the “socket type” of CPU that it supports. “Socket type” refers to the physical connection between the pins on the CPU and the sockets on the motherboard. A processor built for one socket type will physically not fit in a motherboard built to accept another socket type. Intel and AMD processors use different socket types. Furthermore, both companies create new socket types every few years, requiring motherboard upgrades to use the (presumably better) CPUs built to the new socket type. So, buying an Intel LGA 1155 socket-type motherboard locks you out of all AMD chips and all Intel chips older or newer than that socket type. Chances are, if you have to buy a new motherboard, it’s because new CPUs can’t work in your old one.

Number and type of expansion slots is the next most important thing. You may want to use many sticks of RAM or 2+ graphics cards, and some motherboards can’t support that. Furthermore, RAM, USB ports, hard drives, and GPUs are also built to standards that evolve over time. Although these standards don’t change as frequently as CPU chipsets (and Intel and AMD socket motherboards both support the same standards for all these things), you’ve still got to make sure that they’re supported.

Most desktop motherboards will support 240-pin DDR3 RAM, but different motherboards support different speeds of RAM (this is the 1066, 1333, 1600, etc.). Higher is better. Make sure to look up the number of memory slots too — fewer slots of RAM isn’t a dealbreaker, but more slots is better. For instance, you can save money by buying 16GB of RAM as 4x 4GB sticks instead of 2x 8GB sticks.

You’ll also see PCI (“Peripheral Component Interconnect”) expansion slots listed on motherboard specs. These slots are where you insert specialized hardware as needed — audio cards, network cards, video cards, and even some SSDs use PCI slots. There are multiple standards of PCI, and again, bigger numbers (and the word “express”) are better. Stuff that can afford to be slow, such as network cards, only needs PCI. Nice sound cards and PCI-based SSDs will want PCI Express, which allows faster data transfer. Video cards are the only things that really need PCI Express 2.0+, since they transfer absurd amounts of data.

There are other considerations like form factor (desktops are ATX), built-in audio/network/video cards (use expansion-slot cards if you can, but these get the job done), and HDD connectors (almost everything runs on SATA 6Gb/s nowadays). These are all boring and don’t change very much, so we’re glossing over them.

GETTING FANCIER: THE MOTHERBOARD AS BRAIN

You can go out and buy a usable motherboard just based on the information above, and you’ll be fine. But stopping now is for losers! Motherboards are more than just the wires connecting components.

First, the motherboard contains the BIOS. BIOS stands for “Basic Input/Output System” (pretend like it means “Built-in Operating System” — Neal Stephenson’s idea — because that’s a better name). It’s a super-low-level system where you can see hardware stats and change them. This is the settings screen that you get when you hit F11 or Delete while booting, where you do things like RAID together hard drives, control voltage/timings, and specify whether to boot off HDD or CD. Honestly, 95% of motherboard manufacturers’ BIOS utilities feature 95% of the things you care about, so it’s not a factor in your motherboard purchase.

Less visibly, but more importantly, the motherboard contains the Northbridge and Southbridge chips. These chips manage communication between each part of your computer, and they are vital. The Northbridge manages access to high-importance / high-data-transfer-rate parts of the computer, like RAM and video cards in PCIe slots (also, the Northbridge is being phased out of existence, as more system-on-chips wrap the Northbridge into the CPU). Southbridges manage access to lower-importance / lower-data-transfer-rate peripherals in PCI, USB, or SATA slots (i.e. audio cards, keyboards/mice, slow hard drives).

Since a ton of data transfers happen that don’t involve the CPU at all (RAM <-> video card; USB stick <-> printer), and the CPU is usually held back by memory transfer speed anyway, the Northbridge and Southbridge play an important role: not slowing the CPU down by making it a middleman in these transfers. A good Northbridge and Southbridge are the caretakers that make your entire machine flow smoothly.

What’s the difference between a good and a bad Southbridge? The Intel Z77 chipset can communicate at 5GB/s with 8 PCIe 2.0 lanes and 6Gb/s with 6 SATA ports. However, Intel’s H61 can only handle 6 PCIe 2.0 lanes, and only 4 SATA ports at 3Gb/s — even if you installed a USB 3.0 card on an H61 motherboard, it wouldn’t run at full speed. (In general, ‘H’ means ‘budget’ and ‘Z’ means ‘performant’ for Intel).

These chips are the things that make one motherboard more expensive than another with the same slots — and although a budget Northbridge/Southbridge can hold its own if you aren’t pushing the boundaries of your CPU, certain PC builds will see stunning increases by swapping out the motherboard alone.

Video Cards Have So Many Stats!

If you research video cards, because you’re buying one or something, you’re gonna see a TON of stats. And let’s be honest, you won’t understand all of them. This blog post will fix that problem! Maybe. Hopefully.

This is pretty much an info dump of all stats mentioned in NewEgg, AnandTech, and TomsHardware listings. Stats are split up by general category: whether they make the video card EXIST AS A HUNK OF METAL, MOVE DATA AROUND, or DO CALCULATIONS.

THESE MAKE THE VIDEO CARD EXIST AS A HUNK OF METAL

MANUFACTURING PROCESS: Measured in nanometers. This measures how small the semiconductors in the video card are (semiconductors are the building blocks of, like, all electronic devices). The smaller the semiconductors, the less heat/electricity they consume, and the more you can pack on a card.

TRANSISTOR COUNT: Transistors are made of semiconductors, so transistor count is inversely proportional to manufacturing process. Again, more transistors = more better.

THERMAL DESIGN POWER (TDP): Measured in watts. Measures how much power the video card expects to consume. Most overclocking software lets you increase wattage beyond TDP, but you’ll need to upgrade the stock fans to dissipate the extra heat, and you probably won’t get as good performance as just buying a card with a greater TDP. TDP should be close to load power, or how much power the card consumes when running Crysis or something. Most video cards have TDPs in the 200W range — which, for the record, is beastly, 3x+ the power of a good x64 CPU.

THESE MAKE THE VIDEO CARD MOVE DATA AROUND

PHYSICAL INTERFACE: The physical part that hooks in to the motherboard and lets data move between your video card and your motherboard. Whatever your video card’s interface is, make sure your motherboard has a slot of that interface type. New video cards are usually PCIe 2.0 x16 (which can transfer 8 gigabytes / second) or PCIe 3.0 x16 (almost 16 GB/s!). Transfer speeds like that may sound like overkill for a 4GB video game, but in addition to textures, etc., the computer is sending a LOT of data about game state to the GPU 30 times a second, so it’s needed.

RAMDAC: Stands for “Random Access Memory Digital-to-Analog Converter”. It takes a rendered frame and pushes pixels to your monitor to display. The DAC isn’t used if you’re using digital interfaces for your monitor (like HDMI), and the information held in the RAM isn’t used in modern full-color displays. So everything about the name ‘RAMDAC’ is outdated. Most RAMDACs run at 400MHz, meaning they can output 400 million RGB pixel sets per second, enough to drive a 2560×1600 monitor at 97fps. Probably good enough for you.

MEMORY SIZE: How much data the video card can store in memory. Although video cards can communicate with the main computer and therefore save/load data in the computer’s RAM / hard drive, memory that resides inside the video card can be accessed with less latency and higher bandwidth. Bandwidth is one of the biggest bottlenecks (and therefore one of the most important measures) for graphics cards.

MEMORY TYPE: Probably GDDR 2/3/4/5. ‘DDR’ stands for “double data rate”, because DDR memory performs transfers twice per clock cycle. The ‘G’ stands for ‘Graphics’ — since memory access patterns differ between GPUs (who want lots of data / can wait for it) and CPUs (who want little data / can’t wait for it), GDDR memory and computer DDR memory went down separate upgrade paths. Higher numbers represent new architectures that allow more memory transfers per clock cycle.

MEMORY INTERFACE: Measured in bits. Represents how much data is carried per individual data transfer.

MEMORY CLOCK: Measured in MHz/GHz. Represents how many memory-transfer cycles occur per second (although more than one memory-transfer can occur per cycle). Sometimes you’ll see “effective memory clock” listed, which means “real clock speed * number of memory transfers per clock cycle afforded by our memory type”.

MEMORY BANDWIDTH: How many bytes of data can be transferred between the memory on the graphics card and the GPU itself, per second. Measured as real clock speed * transfers per clock cycle (2 for GDDR2/GDDR3/GDDR4, 4 for GDDR5) * interface size in bytes. Or, more simply, effective clock speed * interface size in bytes. This is one of the most important numbers for comparing graphics cards.

THESE MAKE THE VIDEO CARD DO CALCULATIONS

CORE CLOCK: Measured in MHz/GHz. Represents how many computation cycles occur per second. If you see references to the shader clock — it’s tied to the core clock.

BOOST CLOCK: Measured in MHz/GHz. If your GPU detects that it’s running at full capacity but not using much power (which happens when you’re not using all parts of the card — i.e. GPGPU computing that doesn’t render anything, or poorly optimized games), it’ll overclock itself until it consumes the extra power. It may overclock itself to a frequency below or above boost clock frequency, based on how little power it’s using, so boost clock is a nebulous measurement. Although only Nvidia uses the term ‘Boost Clock’, AMD offers the same controls, called ‘PowerTune’.

SHADER CORE COUNT: Sometimes called “CUDA cores” for Nvidia or “Stream Processors” for ATI. It’s the number of cores, similar to processor count in CPUs. Note that, compared to CPUs, GPUs generally have 100x the cores at 0.3x the clock speed (this is still, obviously, a win). That architecture makes GPUs ideal for running the same operation thousands of times on thousands of different sets of data, which is exactly what video games need (for instance, figuring out where every vertex in a 3d model is located on screen).

TEXTURE UNITS: Also called texture mapping units or TMUs. Video games have 3D models and need to apply textures to them. However, there are many different issues that arise when you try texturing a model at an angle, or far away, or super-close up. When your GPU asks for a given pixel in a given texture, the texture unit handles that request and solves all these problems before passing it back to the GPU. More texture units means you can look up more textures per second!

TEXTURE FILL RATE: Measured as number of texture units * core clock speed. Represents the number of pixels in textures that the card can lookup every second. If you’re playing a game where every object on screen is textured (which is most games), this should be higher than the resolution of your screen * desired framerate, because one on-screen pixel can be determined by many textures (i.e. diffuse + specular + normal).

ROPs: Stands for “Raster Operations Pipelines” (also called render output units). These units receive the final color value for a given pixel and write it to the output image, to be passed to the RAMDAC to be rendered on your monitor.

PIXEL FILL RATE: Measured as number of ROPs * core clock speed. Represents the number of pixels that can be written to the output image to display on your monitor, per second. This also has to be higher than screen resolution * desired framerate, because one on-screen pixel can be determined by the output color of many 3d models (i.e. looking at a mountain through a semi-transparent pane of glass requires 2 ROPs per pixel, one for the mountain, one for the pane of glass in front of it). If you see “fill rate”, it usually refers to this instead of texture fill rate.

FLOPs: Measured in megaflops/gigaflops. Stands for “floating point operations per second”, and represents how many times a video card can multiply/divide/add/subtract two floating point numbers per second (most video card calculations are done with floats instead of integers).

Whew! So that’s a pretty intense crash course in video card specs. Hope that helps!

x86/x64 vs ARM: What’s the difference anyhow?

You may have heard about Windows RT vs Windows 8. They’re, like, almost the same, but also really different? It’s confusing. Well, here’s the difference:

Windows 8 can only run on x86/x64 processors. Windows RT can only run on ARM processors.

Cool. But why does x86/x64 vs ARM matter? Why does each processor require different versions of Windows? Let’s drill down.

x86/x64 processors: They’re fast and powerful, but they require a lot of electricity. So, they’re used in desktop computers that can plug into the wall. All versions of Windows run on x86/x64.

ARM processors: They’re weak but low-power processors for smartphones and other devices that aren’t plugged into the wall. Mobile iOS and Android operating systems run on ARM.

The two processor architectures are mutually exclusive: a program that’s built for x86/x64 can’t run on ARM under any circumstances, and vice versa.

Since the 90s, these architectures have existed in parallel worlds: ARM for phones and small PDAs, x86/x64 for desktops and big laptops. But in the past few years, the market’s gotten all hot and bothered for tablets that are bigger and more powerful than phones, but simpler than laptops — like the iPad or Kindle. Tablets have to be small and light, which means tiny batteries, which means ARM processors. But tablets have USB ports, full web browsers, and word processing and photo editing apps, which means ARM-based portables have become direct challengers to x86/x64-based desktops.

Windows has always been a desktop-only operating system, so it’s only been available for x86/x64 processors. But Microsoft sees everyone moving to tablets, and it doesn’t want to lose all its future revenue, so it entered the mobile arena with Windows RT and the Surface. [update 2016: Windows RT sorta failed and the Surface is becoming a brand for x86/x64 processor powered laptops, with the Surface Pro and Surfacebook]

x86/x64 processor manufacturers are potentially the most harmed by the rise of tablets. But there’s an easy way for them to stay relevant — make an x86/x64 processor that’s low-power enough to place in tablets. Intel’s doing that with the Atom processors, which give up processing power, x64 support, and high-speed computing features like SSE in return for super-reduced power consumption.

BUT. GET THIS. Atom processors can still consume 2x the electricity of an ARM processor, even at the same core count/clock speed. What?

Well, it’s because of an inherent difference between the processors. See, “ARM” stands for “Advanced RISC Machine”. RISC stands for ‘Reduced Instruction Set Computing’, and, befitting an acronym that’s part of your entire brand name, it’s what makes ARM so low-power.

You may remember that the instruction set refers to the set of commands that the processor can execute. Well, most code only requires a few instructions — read/write memory, do arithmetic, jump, boolean logic, not much more. ARM processors only offer these basic instructions. Thus, a reduced instruction set.

x86/x64 processors are CISC, or ‘Complex Instruction Set Computing’. Although almost all code can be represented by the basic instructions in RISC, certain patterns of instructions are common — for instance, “write this byte to memory then look at the immediately following byte”. CISC processors offer combo-instructions (previous example being STOSB) that handle these common instruction patterns super-efficiently. However, support for these combo-instructions requires extra hardware — and that hardware costs electricity.

That difference in hardware is why ARM processors use less power than x86/x64 processors at the same clock speed. Mind you, it also means that some programs run faster in x86/x64 processors than they do in ARM processors with the same specs — an algorithm that takes 3 cycles on an ARM processor can take 1 cycle on an x86/x64 processor if it’s been wrapped into a CISC combo-instruction. It’s also why programs built for x86/x64 can’t run in ARM — once you compile a program for x86/x64, it’s hardcoded to use these combo-instructions, and there’s no translation to ARM from there.

So what do the specs of two similar-release-date processors look like? Let’s compare the ARM AM3359 and the x86 Atom Z650.

                      ARM AM3359    Atom Z650
INTRODUCED IN         Q3 2011       Q2 2011
# CORES               1             1
CLOCK SPEED           720MHz        1.2GHz
L1 CACHE              64KB          56KB
L2 CACHE              256KB        512KB
POWER CONSUMPTION     0.7W          3W

The Atom Z650 is definitely more powerful, with a 67% clock speed increase — but it consumes over 400% the electricity. That said, it theoretically could run a program over 4X faster than the ARM AM3359, if that program uses a ton of CISC instructions.

All the same, I’d eat my whole Beanie Baby collection if a real-world program can get more than a 2x speed increase. And that’s a lot of Beanie Babies.

WHAT IS THIS ABOUT ASSEMBLY NOW

“Dang Ben,” you say, “it’s absurd how good you are at Starcraft 2!” Well, yeah, you’re right. But what’s almost as absurd is how cool assembly is.

Assembly is the code that your C/C++ gets compiled to (other languages too, but fuck ’em). It’s a super low-level, close-to-the-metal language, where each line of code represents exactly one task for the processor. There’s a bunch of different flavors of assembly, depending on your processor, but we’re talking about 32-bit x86 assembly here.

Let’s see what the code int firstNum = 10; int secondNum = 31; int thirdNum = firstNum + secondNum; looks like in assembly:

mov dword ptr [firstNum],0Ah
mov dword ptr [secondNum],1Fh
mov eax,dword ptr [firstNum]
add eax,dword ptr [secondNum]
mov dword ptr [thirdNum],eax

As you can see, lines of code in assembly are structured command var1, var2.

  • command is one of a preset list of commands, called the instruction set. These instructions are the only things your processor can do; all your code is expressed in terms of these instructions.
  • var1 is the destination, and var2 is the source.
  • [x] means “don’t look at x, look at the memory at the address held in x“. So, it’s a pointer-dereference, like * in C.
  • dword ptr means x is 32 bits long, or double the size of a 16-bit word ptr.
  • mov means “move”, and 0Ah means “hex byte 0A”, so mov dword ptr [firstNum],0Ah writes 0x0A into firstNum.
  • Sometimes var1 is used as a source as well as a destination — add var1 var2 means var1 = var1 + var2

Cool! So that tells us everything, except… what is eax? Remember that processors can only do arithmetic and logic on data in registers. eax isn’t a variable, it’s a handle to a physical register! x86 assembly has only eight general-purpose registers that you can read and modify at will, and eax is the one you’ll see most (because it’s favored for arithmetic). There are real differences between the eight registers (oh my god are there differences), but that’s a whole aside of its own.

Anyhow, assembly isn’t just some academic concept. You can read the assembly your code gets compiled into, and even insert your own assembly in-line with C/C++ code for sick micro-optimizations (sort of — there’s caveats).

In Visual Studio, stick a breakpoint in some code and hit alt+8 when you hit it. Congratulations! You’re looking at assembly! You can even step through individual instructions to get some hot debugging action. This is a really powerful tool for learning low-level architecture, and I totally encourage you to play with it. There’s no abstractions left when you’re reading assembly. Check out how for and while loops are actually implemented — it’s all just GOTO instructions (well, the instruction is called JMP).

If you want to write in assembly, you can do that too! Maybe. You can write inline assembly for x86 processors, but compilers for newer x64 processors don’t accept inline assembly and recommend you use a predefined set of highly-optimized, low-level intrinsic functions instead.

This isn’t because of any hardware changes in x64. Instead, it’s because inline assembly isn’t necessarily a speed boost. Having inline assembly defeats a ton of compile-time optimizations, since it means the compiler doesn’t get full control over what data is in which registers at any time. You can string intrinsics together to get the speedy low-level behavior you want, and you aren’t fighting the compiler by doing so.

So, you may not want to use inline assembly as a performance tool, since support for it is going away and it can hurt your perf by ruining compiler optimizations. However, it’s still a great learning tool, so don’t be afraid to try it out! To add inline assembly, just use the __asm{ ... } command. For instance:

#include <iostream>

int main() {
   int myNum = 10;

   // MSVC-only inline assembly; compiles for 32-bit x86 targets
   __asm {
      mov eax, dword ptr [myNum]
      mov ebx, 20
      add eax, ebx
      mov dword ptr [myNum], eax
   }

   if (myNum == 30)
      std::cout << "OH DAAAAAAMN";
}

Anyhow, that's enough. I'm off to perfect my reaper-into-battlecruiser build. Happy coding!

Let’s Talk Processor Architecture

“Hey Ben Walker”, you say, “you’re really good looking, but can you explain how my computer’s processor works?”. Well, I’m double trouble, and by the end of this post, you’re gonna understand processor architecture.

Actually, you’re gonna understand single-core scalar (as opposed to superscalar) processor architecture. These processors went obsolete in, like, 1993. So you’ll be twenty years out of date. Ladies.


[Schematic: block diagram of a single-core scalar processor]

Examine the schematic above (read left-to-right), and then let’s dive in!

HOLD UP, WHAT’S THIS ABOUT CODE VERSUS DATA? Your processor needs two things. It needs data (like array<string> myAnimes, a list of the 500 animes you own), and it needs instructions that act on that data, or code (like myAnimes.eraseAll() please). Mind you, code is just data. It’s stored on your hard drive as bytes, same as anything else. However, your processor knows which bytes represent code and which represent data, and it handles the two very differently. Anyhow.

SYSTEM BUS: So code and data are just bytes. Problem is, those bytes lie in your hard drive or RAM (generally, ‘main memory’) — they aren’t stored in the processor itself. The system bus’ job is to take requests from the processor to grab specific bytes, get them from main memory, and forward those bytes around when received.

IF THE SYSTEM BUS RECEIVES DATA: It forwards that data to the memory management unit, which will in turn forward it to the registers.

MEMORY MANAGEMENT UNIT: Called the MMU. It’s a clearinghouse for bytes. It receives requests for code/data from the rest of the processor, figures out where to look in main memory, and tells the system bus to do so. It also forwards received data to the registers, and determines which register to store the data in. It also has a cache for instructions and data, so it can fulfill requests without going to the system bus and main memory.

REGISTERS: Registers are the only memory that can be read and written to by the processor. There are very few registers. Modern processors have 16 registers that can hold 8 bytes each, meaning 128 bytes of memory. That’s not enough memory to store a paragraph of text. Because there’s so little memory, it’s very important that the processor is efficient — it can only load data immediately before that data gets used, and once that data is used, it needs to be replaced as soon as possible.

IF THE SYSTEM BUS RECEIVES CODE: It forwards code to the instruction pre-fetcher, which passes it along to the decoder, then the sequencer, then the ALU.

INSTRUCTION PRE-FETCH: It figures out what instructions we’re going to execute in a few cycles, and sends requests for those instructions to the MMU right now so we’ll have them on hand when the time comes. In other words, it keeps the instructions flowing. Whenever your code branches ( if(x > 0) DoThis(); else DoThat(); ), the instruction pre-fetch has the interesting task of trying to predict whether to pre-fetch DoThis() or DoThat() before we’ve run all the instructions that determine if x>0. That logic is called a branch predictor.

INSTRUCTION DECODE: Remember how instructions are just stored as bytes, same as everything else? The instruction decode unit is what takes the data 0x01c1 and decodes it as ADD [REGISTER0] TO [REGISTER2]. If the decoder decodes an instruction and finds out that it references data that isn’t in our registers yet, the decoder requests that data from the MMU.

INSTRUCTION SEQUENCING AND CONTROL: It manages out-of-order execution. Imagine your code says myAnimes[315].MarkWatched(); ++numAnimesWatched; but the processor has yet to load your 315th anime into registers. The instruction sequencer recognizes instructions that we can execute on immediately, and jumps them ahead of instructions that are still waiting for data. So, it allows numAnimesWatched to increment even though we’re still waiting to load myAnimes[315]. Heck, the instruction sequencer will allow any instructions that aren’t affected by the result of myAnimes[315].MarkWatched() to skip ahead in line, keeping the processor as busy as possible. To save money and power, some processors — including the Xbox 360 processor — don’t include this unit, and can only process instructions in order. Either way, sequenced instructions are then passed to the arithmetic / logic unit.

ARITHMETIC / LOGIC UNIT: Also called the ALU. This is the core of the processor. It receives commands such as ADD THESE NUMBERS TOGETHER (arithmetic) or SAY '1' IF THIS NUMBER IS BIGGER THAN THAT NUMBER, OTHERWISE SAY '0' (logic), and it does them. The results get written into the registers, or get sent to the memory management unit to be written out to main memory.

And hey, you’re done! Don’t get me wrong, this is an absurdly simplified overview. The actual block diagram of an Intel 80386 processor handles plenty of issues I ignored, such as handling overflow/underflow in arithmetic, switching between 16/32 bit operating modes, integer vs. floating point pipelines, and pretty much everything else. But you know what? You did good. Give yourself a cookie. Or email me at walkerb@walkerb.net and complain about everything I did wrong. And happy coding!