ARM Processor

History

The ARM design was started in 1983 as a development project at Acorn Computers Ltd to build a compact RISC CPU. The team was led by Sophie Wilson and Steve Furber, and a key design goal was achieving low-latency input/output (interrupt) handling like that of the MOS Technology 6502 used in Acorn's existing computer designs. The 6502's memory access architecture allowed developers to produce fast machines without the use of costly direct memory access hardware. The team completed development samples called ARM1 by April 1985^[2], and the first "real" production systems as ARM2 the following year.

The ARM2 featured a 32-bit data bus, a 26-bit (64 Mbyte) address space and sixteen 32-bit registers. Program code had to lie within the first 64 Mbyte of the memory, as the program counter was limited to 26 bits because the top 6 bits of the 32-bit register served as status flags. The ARM2 was possibly the simplest useful 32-bit microprocessor in the world, with only 30,000 transistors (compare with Motorola's six-year-older 68000 model with around 70,000 transistors). Much of this simplicity comes from not having microcode (which represents about one-quarter to one-third of the 68000) and, like most CPUs of the day, not including any cache. This simplicity led to its low power usage, while performing better than the Intel 80286.^[3] A successor, ARM3, was produced with a 4 KB cache, which further improved performance.
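The sharing of one register between the program counter and the status flags can be sketched in C. This is a minimal illustration, not a description of any particular toolchain: the layout assumed here (six flag bits in bits 31 to 26, the word-aligned program counter in bits 25 to 2, and two processor-mode bits in bits 1 to 0) is the commonly documented 26-bit ARM arrangement, the mode bits are not mentioned in the text above, and all type and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical view of the combined PC/status register (R15) on a
 * 26-bit ARM such as the ARM2. Field names are illustrative only. */
typedef struct {
    uint32_t pc;     /* byte address of the instruction, always word aligned */
    unsigned flags;  /* six status bits taken from the top of the register   */
    unsigned mode;   /* two processor-mode bits (assumed layout)             */
} R15Fields;

/* Split a raw 32-bit R15 value into its assumed fields. */
static R15Fields decode_r15(uint32_t r15) {
    R15Fields f;
    f.pc    = r15 & 0x03FFFFFCu;    /* bits 25..2: 24-bit word address */
    f.flags = (r15 >> 26) & 0x3Fu;  /* bits 31..26: status flags       */
    f.mode  = r15 & 0x3u;           /* bits 1..0: processor mode       */
    return f;
}

/* Highest instruction address that fits in the register:
 * the last word below 64 Mbyte (2^26 bytes). */
static uint32_t max_code_address(void) {
    return 0x03FFFFFCu;
}
```

Because only 26 bits of address survive the masking, no instruction can be fetched from above the 64 Mbyte boundary, which is the limit described above.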

In the late 1980s Apple Computer and VLSI Technology started working with Acorn on newer versions of the ARM core. The work was so important that Acorn spun off the design team in 1990 into a new company called Advanced RISC Machines Ltd. For this reason, ARM is sometimes expanded as Advanced RISC Machine instead of Acorn RISC Machine. Advanced RISC Machines became ARM Ltd when its parent company, ARM Holdings plc, floated on the London Stock Exchange and NASDAQ in 1998.^[4]

The new Apple-ARM work would eventually turn into the ARM6, first released in 1991. Apple used the ARM6-based ARM 610 as the basis for their Apple Newton PDA. In 1994, Acorn used the ARM 610 as the main CPU in their Risc PC computers. DEC licensed the ARM6 architecture (which caused some confusion because they also produced the DEC Alpha) and produced the StrongARM. At 233 MHz this CPU drew only 1 watt of power (more recent versions draw far less). This work was later passed to Intel as a part of a lawsuit settlement, and Intel took the opportunity to supplement their aging i960 line with the StrongARM. Intel later developed its own high performance implementation known as XScale which it has since sold to Marvell.

The ARM core has remained largely the same size throughout these changes. ARM2 had 30,000 transistors, while the ARM6 grew to only 35,000. ARM's business has always been to sell IP cores, which licensees use to create microcontrollers and CPUs based on this core. The most successful implementation has been the ARM7TDMI with hundreds of millions sold in almost every kind of microcontroller equipped device. The idea is that the Original Design Manufacturer combines the ARM core with a number of optional parts to produce a complete CPU, one that can be built on old semiconductor fabs and still deliver substantial performance at a low cost. As of January 2008, over 10 billion ARM cores have been built, and iSuppli predicts that 5 billion a year will ship in 2011.^[5]

The common architecture supported on smartphones, Personal Digital Assistants and other handheld devices is ARMv4. XScale and ARM926 processors are ARMv5TE, and are now more numerous in high-end devices than the StrongARM, ARM925T and ARM7TDMI based ARMv4 processors.

ARM cores

Family	Architecture Version	Core	Feature	Cache (I/D)/MMU	Typical MIPS @ MHz	In application
ARM1	ARMv1	ARM1		None		ARM Evaluation System second processor for BBC Micro
ARM2	ARMv2	ARM2	Architecture 2 added the MUL (multiply) instruction	None	4 MIPS @ 8 MHz 0.33 DMIPS/MHz	Acorn Archimedes, Chessmachine
ARM2	ARMv2a	ARM250	Integrated MEMC (MMU), Graphics and IO processor. Architecture 2a added the SWP and SWPB (swap) instructions.	None, MEMC1a	7 MIPS @ 12 MHz	Acorn Archimedes
ARM3	ARMv2a	ARM2a	First use of a processor cache on the ARM.	4K unified	12 MIPS @ 25 MHz 0.50 DMIPS/MHz	Acorn Archimedes
ARM6	ARMv3	ARM60	v3 architecture first to support addressing 32 bits of memory (as opposed to 26 bits)	None	10 MIPS @ 12 MHz	3DO Interactive Multiplayer, Zarlink GPS Receiver
		ARM600	Cache and coprocessor bus (for FPA10 floating-point unit).	4K unified	28 MIPS @ 33 MHz
		ARM610	Cache, no coprocessor bus.	4K unified	17 MIPS @ 20 MHz 0.65 DMIPS/MHz	Acorn Risc PC 600, Apple Newton 100 series
ARM7	ARMv3	ARM700		8 KB unified	40 MHz	Acorn Risc PC prototype CPU card
		ARM710		8 KB unified	40 MHz	Acorn Risc PC 700
		ARM710a		8 KB unified	40 MHz 0.68 DMIPS/MHz	Acorn Risc PC 700, Apple eMate 300
		ARM7100	Integrated SoC.	8 KB unified	18 MHz	Psion Series 5
		ARM7500	Integrated SoC.	4 KB unified	40 MHz	Acorn A7000
		ARM7500FE	Integrated SoC. "FE" Added FPA and EDO memory controller.	4 KB unified	56 MHz 0.73 DMIPS/MHz	Acorn A7000+
ARM7TDMI	ARMv4T	ARM7TDMI(-S)	3-stage pipeline, Thumb	none	15 MIPS @ 16.8 MHz	Game Boy Advance, Nintendo DS, iPod, Lego NXT, Atmel AT91SAM7, Juice Box
		ARM710T		8 KB unified, MMU	36 MIPS @ 40 MHz	Psion Series 5mx, Psion Revo/Revo Plus/Diamond Mako
		ARM720T		8 KB unified, MMU	60 MIPS @ 59.8 MHz	Zipit Wireless Messenger
		ARM740T		MPU
	ARMv5TEJ	ARM7EJ-S	Jazelle DBX, Enhanced DSP instructions, 5-stage pipeline	none
StrongARM	ARMv4	SA-110		16 KB/16 KB, MMU	203 MHz 1.0 DMIPS/MHz	Apple Newton 2x00 series, Acorn Risc PC, Rebel/Corel Netwinder, Chalice CATS, Psion Netbook
StrongARM	ARMv4	SA-1110		16 KB/16 KB, MMU	233 MHz	LART, Intel Assabet, Ipaq H36x0, Balloon2, Zaurus SL-5x00, HP Jornada 7xx
ARM8	ARMv4	ARM810^[6]	5-stage pipeline, static branch prediction, double-bandwidth memory	8 KB unified, MMU	84 MIPS @ 72 MHz 1.16 DMIPS/MHz	Acorn Risc PC prototype CPU card
ARM9TDMI	ARMv4T	ARM9TDMI	5-stage pipeline	none
		ARM920T		16 KB/16 KB, MMU	200 MIPS @ 180 MHz	Armadillo, GP32, GP2X (first core), Tapwave Zodiac (Motorola i.MX1), Hewlett-Packard HP-49/50 Calculators, Sun SPOT, Cirrus Logic EP9315, Samsung s3c2442 (HTC TyTN, FIC Neo FreeRunner^[7])
		ARM922T		8 KB/8 KB, MMU
		ARM940T		4 KB/4 KB, MPU		GP2X (second core), Meizu M6 Mini Player^[8] ^[9]
ARM9E	ARMv5TE	ARM946E-S	Enhanced DSP instructions	variable, tightly coupled memories, MPU		Nintendo DS, Nokia N-Gage, Conexant 802.11 chips
		ARM966E-S		no cache, TCMs		ST Micro STR91xF, includes Ethernet [1]
		ARM968E-S		no cache, TCMs
	ARMv5TEJ	ARM926EJ-S	Jazelle DBX, Enhanced DSP instructions	variable, TCMs, MMU	220 MIPS @ 200 MHz,	Mobile phones: Sony Ericsson (K, W series); Siemens and Benq (x65 series and newer); Texas Instruments OMAP1710, OMAP1610, OMAP1611, OMAP1612; Qualcomm MSM6100, MSM6125, MSM6225, MSM6245, MSM6250, MSM6255A, MSM6260, MSM6275, MSM6280, MSM6300, MSM6500, MSM6800; Freescale i.MX21, i.MX27, Atmel AT91SAM9
	ARMv5TE	ARM996HS	Clockless processor, Enhanced DSP instructions	no caches, TCMs, MPU
ARM10E	ARMv5TE	ARM1020E	(VFP), 6-stage pipeline, Enhanced DSP instructions	32 KB/32 KB, MMU
	ARMv5TE	ARM1022E	(VFP)	16 KB/16 KB, MMU
	ARMv5TEJ	ARM1026EJ-S	Jazelle DBX, Enhanced DSP instructions	variable, MMU or MPU
XScale	ARMv5TE	80200/IOP310/IOP315	I/O Processor, Enhanced DSP instructions
		80219			400/600 MHz	Thecus N2100
		IOP321			600 BogoMips @ 600 MHz	Iyonix
		IOP33x
		IOP34x	1-2 core, RAID Acceleration	32K/32K L1, 512K L2, MMU
		PXA210/PXA250	Applications processor, 7-stage pipeline			Zaurus SL-5600, iPAQ H3900
		PXA255		32 KB/32 KB, MMU	400 BogoMips @ 400 MHz	Gumstix basix & connex, Palm Tungsten E2, Mentor Ranger & Stryder
		PXA26x			default 400 MHz, up to 624 MHz	Palm Tungsten T3
		PXA27x	Applications processor	32 KB/32 KB, MMU	800 MIPS @ 624 MHz	Gumstix verdex, HTC Universal, HP hx4700, Zaurus SL-C1000, 3000, 3100, 3200, Dell Axim x30, x50, and x51 series, Motorola Q, Balloon3, Trolltech Greenphone, Palm TX, Motorola Ezx Platform A728, A780, A910, A1200, E680, E680i, E680g, E690, E895, Rokr E2, Rokr E6, Fujitsu Siemens LOOX N560, Toshiba Portégé G500, Treo 650-755p
		PXA800(E)F
		Monahans			1000 MIPS @ 1.25 GHz
		PXA900				Blackberry 8700, Blackberry Pearl (8100)
		IXC1100	Control Plane Processor
		IXP2400/IXP2800
		IXP2850
		IXP2325/IXP2350
		IXP42x				NSLU2
		IXP460/IXP465
ARM11	ARMv6	ARM1136J(F)-S	SIMD, Jazelle DBX, (VFP), 8-stage pipeline	variable, MMU	740 @ 532-665 MHz (i.MX31 SoC), 400-528 MHz	Texas Instruments OMAP2420 (Nokia E90, Nokia N93, Nokia N95, Nokia N82), Zune, BUGbase, Nokia N800, Nokia N810, Qualcomm MSM7200 (with integrated ARM926EJ-S Coprocessor@274MHz, used in HTC TyTN II (Kaiser), HTC Nike), Freescale i.MX31
	ARMv6T2	ARM1156T2(F)-S	SIMD, Thumb-2, (VFP), 9-stage pipeline	variable, MPU
	ARMv6KZ	ARM1176JZ(F)-S	SIMD, Jazelle DBX, (VFP)	variable, MMU+TrustZone		Apple iPhone, Apple iPod touch, Conexant CX2427X, Motorola RIZR Z8, Motorola RIZR Z10
	ARMv6K	ARM11 MPCore	1-4 core SMP, SIMD, Jazelle DBX, (VFP)	variable, MMU		Nvidia APX 2500
Cortex	ARMv7-A	Cortex-A8	Application profile, VFP, NEON, Jazelle RCT, Thumb-2, 13-stage superscalar pipeline	variable (L1+L2), MMU+TrustZone	up to 2000 (2.0 DMIPS/MHz in speed from 600 MHz to greater than 1 GHz)	Texas Instruments OMAP3, Pandora
		Cortex-A9	Application profile, (VFP), (NEON), Jazelle RCT and DBX, Thumb-2, Out-of-order speculative issue superscalar	MMU+TrustZone	2.0 DMIPS/MHz
		Cortex-A9 MPCore	As Cortex-A9, 1-4 core SMP	MMU+TrustZone	2.0 DMIPS/MHz
	ARMv7-R	Cortex-R4(F)	Embedded profile, (FPU)	variable cache, MPU optional	600 DMIPS	Broadcom is a user, TMS570 from Texas Instruments
	ARMv7-M	Cortex-M3	Microcontroller profile, Thumb-2 only.	no cache, (MPU)	125 DMIPS @ 100 MHz	Luminary Micro[2] microcontroller family, ST Microelectronics STM32[3]
	ARMv6-M	Cortex-M1	FPGA targeted, Microcontroller profile, Thumb-2 (BL, MRS, MSR, ISB, DSB, and DMB).	None, tightly coupled memory optional.	Up to 136 DMIPS @ 170 MHz^[10] (0.8 DMIPS/MHz^[11], MHz achievable FPGA-dependent)	"Actel ProASIC3 and Actel Fusion PSC devices will sample in Q3 2007"^[12]

Design notes

To keep the design clean, simple and fast, it was hardwired without microcode, like the much simpler 8-bit 6502 processor used in prior Acorn microcomputers.

The ARM architecture includes the following RISC features:

Load/store architecture
No support for misaligned memory accesses (now supported in ARMv6 cores)
Uniform 16 × 32-bit register file
Fixed instruction width of 32 bits to ease decoding and pipelining, at the cost of decreased code density. (Later, "Thumb mode" increased code density.)
Mostly single-cycle execution

To compensate for the simpler design, compared with contemporary processors like the Intel 80286 and Motorola 68020, some unique design features were used:

Conditional execution of most instructions, reducing branch overhead and compensating for the lack of a branch predictor
Arithmetic instructions alter condition codes only when desired
32-bit barrel shifter which can be used without performance penalty with most arithmetic instructions and address calculations
Powerful indexed addressing modes
A link register for fast leaf function calls.
Simple, but fast, 2-priority-level interrupt subsystem with switched register banks

An interesting addition to the ARM design is the use of a 4-bit condition code on the front of every instruction, meaning that execution of every instruction is optionally conditional. Other CPU architectures typically only have condition codes on branch instructions.

This cuts down significantly on the encoding bits available for displacements in memory access instructions, but on the other hand it avoids branch instructions when generating code for small if statements. The standard example of this is the Euclidean algorithm:

In the C programming language, the loop is:

    while (i != j)
    {
        if (i > j)
            i -= j;
        else
            j -= i;
    }

In ARM assembly, the loop is:


loop    CMP    Ri, Rj       ; set condition "NE" if (i != j),
                            ;               "GT" if (i > j),
                            ;            or "LT" if (i < j)
        SUBGT  Ri, Ri, Rj   ; if "GT", i = i-j;
        SUBLT  Rj, Rj, Ri   ; if "LT", j = j-i;
        BNE    loop         ; if "NE", then loop

which avoids the branches around the then and else clauses.
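To make the role of the condition codes concrete, the assembly loop above can be modelled in C. This is a sketch under stated assumptions: the flag rules used (NE is "Z clear", GT is "Z clear and N equals V", LT is "N differs from V") are the standard ARM condition definitions, while the type and function names are hypothetical and chosen only for illustration. Each C line corresponds to one instruction of the loop.

```c
#include <stdint.h>
#include <stdbool.h>

/* Condition flags as left behind by CMP (a subtraction whose result
 * is discarded). */
typedef struct { bool n, z, c, v; } Flags;

static Flags cmp(uint32_t a, uint32_t b) {
    uint32_t r = a - b;
    Flags f;
    f.n = ((r >> 31) & 1u) != 0;                      /* result negative  */
    f.z = (r == 0);                                   /* result zero      */
    f.c = (a >= b);                                   /* carry: no borrow */
    f.v = ((((a ^ b) & (a ^ r)) >> 31) & 1u) != 0;    /* signed overflow  */
    return f;
}

/* The three conditions used by the loop. */
static bool cond_ne(Flags f) { return !f.z; }
static bool cond_gt(Flags f) { return !f.z && (f.n == f.v); }
static bool cond_lt(Flags f) { return f.n != f.v; }

/* Subtraction-based Euclidean algorithm, one line per ARM instruction.
 * The conditional subtractions do not update the flags, so all three
 * tests see the flags produced by the single CMP. */
static uint32_t gcd_arm_style(uint32_t ri, uint32_t rj) {
    Flags f;
    do {
        f = cmp(ri, rj);               /* CMP   Ri, Rj     */
        if (cond_gt(f)) ri -= rj;      /* SUBGT Ri, Ri, Rj */
        if (cond_lt(f)) rj -= ri;      /* SUBLT Rj, Rj, Ri */
    } while (cond_ne(f));              /* BNE   loop       */
    return ri;
}
```

For example, gcd_arm_style(48, 18) steps through (30, 18), (12, 18), (12, 6) and (6, 6) before the NE condition fails. The model assumes both inputs are positive and below 2^31, since GT and LT are signed comparisons.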

Another unique feature of the instruction set is the ability to fold shifts and rotates into the "data processing" (arithmetic, logical, and register-register move) instructions, so that, for example, the C statement

a += (j << 2);

could be rendered as a single word, single cycle instruction on the ARM.

ADD Ra, Ra, Rj, LSL #2

This results in the typical ARM program being denser than expected with less memory access; thus the pipeline is used more efficiently. Even though the ARM runs at what many would consider to be low speeds, it nevertheless competes quite well with much more complex CPU designs.
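The folded shift can be modelled in C as a second operand that passes through a shifter before reaching the adder. The sketch below is illustrative only: the four shift types (LSL, LSR, ASR, ROR) are the standard ARM ones, the shift amount is deliberately restricted to 1 to 31 to avoid the special cases of the real encoding, and the function and type names are hypothetical.

```c
#include <stdint.h>

typedef enum { SHIFT_LSL, SHIFT_LSR, SHIFT_ASR, SHIFT_ROR } ShiftType;

/* Value produced by the barrel shifter for a register operand.
 * Assumes 1 <= amount <= 31. */
static uint32_t shifter_operand(uint32_t rm, ShiftType type, unsigned amount) {
    switch (type) {
    case SHIFT_LSL:  /* logical shift left */
        return (uint32_t)(rm << amount);
    case SHIFT_LSR:  /* logical shift right, zero fill */
        return rm >> amount;
    case SHIFT_ASR:  /* arithmetic shift right, sign fill */
        return (rm >> amount)
             | ((rm & 0x80000000u) ? (uint32_t)~(0xFFFFFFFFu >> amount) : 0u);
    case SHIFT_ROR:  /* rotate right */
        return (rm >> amount) | (uint32_t)(rm << (32u - amount));
    }
    return rm;
}

/* Model of "ADD Rd, Rn, Rm, <shift> #amount": shift and add in one step. */
static uint32_t add_shifted(uint32_t rn, uint32_t rm,
                            ShiftType type, unsigned amount) {
    return rn + shifter_operand(rm, type, amount);
}
```

With a = 10 and j = 3, add_shifted(10, 3, SHIFT_LSL, 2) yields 22, matching a += (j << 2).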

The ARM processor also has some features rarely seen on other RISC architectures, such as PC-relative addressing (indeed, on the ARM the PC is one of its 16 registers) and pre- and post-increment addressing modes.

The instruction set has also grown somewhat over time. Some early ARM processors (prior to the ARM7TDMI), for example, have no instruction to store a two-byte quantity. Strictly speaking, it is therefore not possible to generate code for them that behaves the way one would expect for C objects of type "volatile short".^[citation needed]

The ARM7 and earlier designs have a three stage pipeline; the stages being fetch, decode, and execute. Higher performance designs, such as the ARM9, have a five stage pipeline. Additional changes for higher performance include a faster adder, and more extensive branch prediction logic.

The architecture provides a non-intrusive way of extending the instruction set using "coprocessors", which can be addressed from software using the MCR, MRC, MRRC and MCRR instructions. The coprocessor space is divided logically into 16 coprocessors, numbered 0 to 15, with coprocessor 15 (cp15) reserved for typical control functions such as managing the caches and MMU operation (on processors that have one).
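As an illustration of how a coprocessor access is expressed, the following C sketch builds the 32-bit machine word for an unconditional MRC or MCR instruction. The bit positions follow the ARM encoding for these two instructions as commonly documented; the helper name and parameter names are hypothetical, and MRRC and MCRR (which use a different format) are not covered.

```c
#include <stdint.h>

/* Encode a coprocessor register transfer with the "always" condition.
 * to_arm != 0 gives MRC (coprocessor register -> ARM register),
 * to_arm == 0 gives MCR (ARM register -> coprocessor register). */
static uint32_t encode_cp_transfer(int to_arm, unsigned coproc, unsigned opc1,
                                   unsigned rt, unsigned crn, unsigned crm,
                                   unsigned opc2) {
    return 0xEE000010u                            /* cond=AL, 1110, bit 4 set */
         | ((uint32_t)(opc1 & 0x7u) << 21)        /* first opcode             */
         | ((uint32_t)(to_arm ? 1u : 0u) << 20)   /* direction bit            */
         | ((uint32_t)(crn & 0xFu) << 16)         /* coprocessor register CRn */
         | ((uint32_t)(rt & 0xFu) << 12)          /* ARM register             */
         | ((uint32_t)(coproc & 0xFu) << 8)       /* coprocessor number 0..15 */
         | ((uint32_t)(opc2 & 0x7u) << 5)         /* second opcode            */
         | (uint32_t)(crm & 0xFu);                /* coprocessor register CRm */
}
```

For instance, encode_cp_transfer(1, 15, 0, 0, 0, 0, 0) gives 0xEE100F10, the word for "MRC p15, 0, r0, c0, c0, 0". The four-bit coprocessor field is what limits the space to 16 coprocessors.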

In ARM-based machines, peripheral devices are usually attached to the processor by mapping their physical registers into ARM memory space or into the coprocessor space, or by connecting them to another device (a bus) which in turn attaches to the processor. Coprocessor accesses have lower latency, so some peripherals (for example, the XScale interrupt controller) are designed to be accessible in both ways (through memory and through coprocessors).

Thumb

To improve compiled code-density, processors since the ARM7TDMI have featured the Thumb mode. When in this mode, the processor executes 16-bit instructions. Most of these 16-bit-wide Thumb instructions are directly mapped to normal ARM instructions. The space-saving comes from making some of the instruction operands implicit and limiting the number of possibilities compared to the full ARM mode instruction.

In Thumb, the smaller opcodes have less functionality. For example, only branches can be conditional, and many opcodes are restricted to accessing only half of all of the CPU's general purpose registers. The shorter opcodes give improved code density overall, even though some operations require extra instructions. In situations where the memory port or bus width is constrained to less than 32 bits, the shorter Thumb opcodes allow increased performance compared with 32-bit ARM code, as less program code may need to be loaded into the processor over the constrained memory bandwidth.

Embedded hardware, such as the Game Boy Advance, typically has a small amount of RAM accessible with a full 32-bit datapath; the majority is accessed via a 16-bit or narrower secondary datapath. In this situation, it usually makes sense to compile Thumb code and hand-optimise a few of the most CPU-intensive sections using full 32-bit ARM instructions, placing these wider instructions into the memory accessible over the 32-bit bus.

The first processor with a Thumb instruction decoder was the ARM7TDMI. All ARM9 and later families, including XScale, have included a Thumb instruction decoder.

DSP Enhancement Instructions

To improve the ARM architecture for digital signal processing and multimedia applications, a few new instructions were added to the set [4]. These are signified by the "E" in the names of the ARMv5TE and ARMv5TEJ architectures.

The new instructions are common in digital signal processor architectures. They are variations on signed multiply-accumulate, saturated add and subtract, and count leading zeros.
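The behaviour of two of these operation types can be sketched in portable C. This is only a model of the arithmetic, not of the instructions' encodings or flag effects: saturation clamps a result to the signed 32-bit range instead of letting it wrap, and a leading-zero count of zero is taken to be 32. The function names are hypothetical.

```c
#include <stdint.h>

/* Saturated signed add: clamp to [INT32_MIN, INT32_MAX] instead of wrapping. */
static int32_t saturating_add(int32_t a, int32_t b) {
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Saturated signed subtract, clamped the same way. */
static int32_t saturating_sub(int32_t a, int32_t b) {
    int64_t diff = (int64_t)a - (int64_t)b;
    if (diff > INT32_MAX) return INT32_MAX;
    if (diff < INT32_MIN) return INT32_MIN;
    return (int32_t)diff;
}

/* Count leading zeros: number of zero bits above the highest set bit. */
static unsigned count_leading_zeros(uint32_t x) {
    unsigned n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}
```

Saturating arithmetic matters in signal processing because a clipped sample is far less audible or visible than one that wraps from a large positive value to a large negative one.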

Jazelle

A technology called Jazelle DBX (Direct Bytecode eXecution) allows recent ARM architectures to execute some Java bytecode in hardware as a third execution state alongside the existing ARM and Thumb modes.

The most prominent use of Jazelle is by manufacturers of mobile phones to increase the execution speed of Java ME games and applications.

A Jazelle-aware Java Virtual Machine (JVM) will attempt to run Java bytecodes in hardware, while returning to the software for more complicated, or lesser-used bytecode operations. ARM claim that approximately 95% of bytecode in typical program usage ends up being directly processed in the hardware.

Jazelle functionality was specified in the ARMv5TEJ architecture^[13] and the first processor with Jazelle technology was the ARM926EJ-S^[14]: Jazelle is denoted by a 'J' appended to the CPU name.

The published specifications are very incomplete, being sufficient only for writing operating system code that can support a JVM that uses Jazelle. The declared intent is that only the JVM software needs to (or is allowed to) depend on the hardware interface details. This tight binding allows the hardware and JVM to evolve together without affecting other software. In effect, this gives ARM Ltd. considerable control over which JVMs are able to exploit Jazelle.

Implementation

The Jazelle extension is implemented as an extra stage between the fetch and decode stages in the processor pipeline. Recognised bytecodes are converted into a string of one or more native ARM instructions.

The Jazelle mode moves JVM interpretation into hardware for the most common simple JVM instructions. This is intended to significantly reduce the cost of interpretation. Among other things, this reduces the need for JIT and other JVM accelerating techniques^[15]. JVM instructions that are not implemented in Jazelle hardware cause appropriate routines in the Jazelle-aware JVM implementation to be invoked. Details are not published.

Jazelle mode is entered via the BXJ instruction. A hardware implementation of Jazelle will only cover a subset of JVM bytecodes. For unhandled bytecodes, or if overridden by the operating system, the hardware will invoke the software JVM. The system is designed so that the software JVM does not need to know which bytecodes are implemented in hardware, and a software fallback is provided by the software JVM for the full set of bytecodes.

Instruction set

The instruction set used in Jazelle mode is documented, since it is simply Java bytecode. However, ARM have chosen to remain quiet on the exact details of the execution environment; the documentation provided with Sun's HotSpot Java Virtual Machine goes as far as to state: "For the avoidance of doubt, distribution of products containing software code to exercise the BXJ instruction and enable the use of the ARM Jazelle architecture extension without [..] agreement from ARM is expressly forbidden."^[16]

Employees of ARM have in the past published several white papers that do give some good pointers about the processor extension. Versions of the ARM Architecture Reference Manual available from 2008 have included pseudocode for the 'BXJ' (Branch and eXchange to Java) instruction, but with the finer details being shown as "SUB-ARCHITECTURE DEFINED" and documented elsewhere.

Application binary interface (ABI)

The Jazelle state relies on an agreed calling convention between the JVM and the Jazelle hardware state. This application binary interface is not published by ARM, rendering Jazelle an undocumented feature for most users and Free Software JVMs.

The entire VM state is held within normal ARM registers, allowing compatibility with existing operating systems and interrupt handlers unmodified. Restarting a bytecode (such as following a return from interrupt) will re-execute the complete sequence of related ARM instructions.

Specific registers are designated to hold the most important parts of the JVM state: registers r0-r3 hold an alias of the top of the Java stack, r4 holds Java local operand zero (pointer to *this) and r6 contains the Java stack pointer.^[17]

Jazelle reuses the existing Program Counter register r15^[18]. A pointer to the next bytecode goes in r14^[19], so the use of the PC is not generally user-visible except during debugging.

CPSR: Mode indication

Java bytecode is indicated as the current instruction set by a combination of two bits in the ARM CPSR (Current Program Status Register). The 'T' bit must be cleared and the 'J' bit set.^[20]

Bytecodes are decoded by the hardware in two stages (versus a single stage for Thumb and ARM code), and switching between hardware and software decoding (Jazelle mode and ARM mode) takes ~4 clock cycles.^[21]

For entry to the Jazelle hardware state to succeed, the JE (Jazelle Enable)^[13] bit in the CP14:c0(c2)[bit 0] register must be set. Clearing the JE bit from a privileged operating system provides a high-level override that prevents application programs from using the hardware Jazelle acceleration.^[22] Additionally, the CV (Configuration Valid) bit^[13] found in CP14:c0(c1)[bit 1]^[22] must be set to show that there is a consistent Jazelle state setup for the hardware to use.
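The mode indication and the two enable bits described in this section can be summarised in a short C sketch. Several details are assumptions rather than statements from the text: the CPSR bit positions used (T at bit 5, J at bit 24) are the commonly documented ones, the reading of "J and T both set" as ThumbEE applies to later cores, and all names are hypothetical. The register values are passed in as plain integers; on real hardware they would be read with coprocessor instructions.

```c
#include <stdint.h>
#include <stdbool.h>

#define CPSR_T_BIT (1u << 5)    /* Thumb state bit (assumed position)   */
#define CPSR_J_BIT (1u << 24)   /* Jazelle state bit (assumed position) */

typedef enum { ISET_ARM, ISET_THUMB, ISET_JAZELLE, ISET_THUMBEE } InstructionSet;

/* Decode the current instruction set from the J and T bits of the CPSR. */
static InstructionSet current_instruction_set(uint32_t cpsr) {
    bool t = (cpsr & CPSR_T_BIT) != 0;
    bool j = (cpsr & CPSR_J_BIT) != 0;
    if (j) return t ? ISET_THUMBEE : ISET_JAZELLE;
    return t ? ISET_THUMB : ISET_ARM;
}

/* Preconditions for entering the Jazelle hardware state, as described above:
 * JE is bit 0 of CP14:c0(c2) and CV is bit 1 of CP14:c0(c1). */
static bool jazelle_entry_permitted(uint32_t main_config, uint32_t os_control) {
    bool je = (main_config & (1u << 0)) != 0;   /* Jazelle Enable      */
    bool cv = (os_control & (1u << 1)) != 0;    /* Configuration Valid */
    return je && cv;
}
```

An operating system that wants to disable the accelerator only needs to clear JE; the second function then returns false regardless of the CV bit.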

BXJ: Branch to Java

The BXJ instruction attempts to switch to Jazelle state and, if allowed and successful, sets the 'J' bit in the CPSR; otherwise it "falls through" and acts as a standard BX (Branch) instruction.^[13] The only time when an operating system or debugger must be fully aware of the Jazelle mode is when decoding a faulted or trapped instruction. The Java PC pointing to the next instruction must be placed in the Link Register (r14) before executing the BXJ branch request because, regardless of hardware or software processing, the system must know where to begin decoding.

Because the current state is held in the CPSR, the bytecode instruction set is automatically reselected after task-switching and processing of the current Java bytecode is restarted.^[17]

Following entry into the Jazelle state, bytecodes can be processed in one of three ways: decoded and executed natively in hardware, handled in software (with optimised ARM/ThumbEE JVM code), or treated as an invalid/illegal opcode. The third case will cause a branch to an ARM exception mode, as will a Java bytecode of 0xff, which is used for setting JVM breakpoints^[23].

Execution will continue in hardware until an unhandled bytecode is encountered, or an exception occurs. Between 134 and 149 bytecodes (out of 203 bytecodes specified in the JVM specification) are translated and executed directly in the hardware.

Low-level registers

Low-level configuration registers for the hardware virtual machine are held in the ARM coprocessor register "CP14 register c0", which allows software to detect, enable or disable the hardware accelerator, if it is available.^[24]

The Jazelle Identity Register in register CP14:c0(c0) is read-only accessible in all modes.
The Jazelle OS Control Register at CP14:c0(c1) is only accessible in kernel mode and will cause an exception when accessed in user-mode.
The Jazelle Main Configuration Register at CP14:c0(c2) is write-only in user-mode and read-write in kernel mode.

A "trivial" hardware implementation of Jazelle (as found in the QEMU emulator) is only required to support the BXJ opcode itself, treating BXJ as a normal BX instruction^[13], and to return RAZ (Read-As-Zero) for all of the CP14:c0 Jazelle-related registers^[25].

Thumb-2

Thumb-2 technology made its debut in the ARM1156 core, announced in 2003. Thumb-2 extends the limited 16-bit instruction set of Thumb with additional 32-bit instructions to give the instruction set more breadth. The resulting stated aim for Thumb-2 is to achieve code density similar to Thumb with performance similar to the ARM instruction set on 32-bit memory.

Thumb-2 also extends both the ARM and Thumb instruction sets with yet more instructions, including bit-field manipulation, table branches, and conditional execution.

All ARMv7 chips support the Thumb-2 instruction set. Some chips, such as the Cortex-M3, support only the Thumb-2 instruction set. Other chips in the Cortex and ARM11 series support both "ARM instruction set mode" and "Thumb-2 instruction set mode" [5] [6] [7].

Thumb Execution Environment (ThumbEE)

ThumbEE, also known as Thumb-2EE, and marketed as Jazelle RCT (Runtime Compilation Target), was announced in 2005, first appearing in the Cortex-A8 processor. ThumbEE provides a small extension to the Thumb-2 extended Thumb instruction set, making the instruction set particularly suited to code generated at runtime (e.g. by JIT compilation) in managed Execution Environments. ThumbEE is a target for languages such as Limbo, Java, C#, Perl and Python, and allows JIT compilers to output smaller compiled code without impacting performance.

New features provided by ThumbEE include automatic null pointer checks on every load and store instruction, an instruction to perform an array bounds check, access to registers r8-r15 (where the Jazelle/DBX Java VM state is held), and the ability to branch to handlers: small sections of frequently called code, commonly used to implement a feature of a high-level language, such as allocating memory for a new object.
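The checks that ThumbEE folds into the instruction stream are the ones a managed runtime would otherwise emit as explicit code around every array access. The C sketch below shows those explicit checks only as a conceptual model: it returns a status code where ThumbEE would branch to a handler, and the type and function names are hypothetical, not part of any ARM interface.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum {
    ACCESS_OK,
    ACCESS_NULL_POINTER,    /* would be caught by the automatic null check */
    ACCESS_OUT_OF_BOUNDS    /* would be caught by the bounds-check step    */
} AccessStatus;

/* Load array[index] with the null and bounds checks written out by hand. */
static AccessStatus checked_array_load(const int32_t *array, uint32_t length,
                                       uint32_t index, int32_t *out) {
    if (array == NULL)
        return ACCESS_NULL_POINTER;
    if (index >= length)
        return ACCESS_OUT_OF_BOUNDS;
    *out = array[index];
    return ACCESS_OK;
}
```

Code generated by a JIT compiler contains many such accesses, so removing the explicit compare-and-branch sequences is where the smaller compiled code comes from.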

Advanced SIMD (NEON)

The Advanced SIMD extension, marketed as NEON technology, is a combined 64- and 128-bit SIMD (Single Instruction Multiple Data) instruction set that provides standardized acceleration for media and signal processing applications. NEON can execute MP3 audio decoding on CPUs running at 10 MHz and can run the GSM AMR (Adaptive Multi-Rate) speech codec at no more than 13 MHz. It features a comprehensive instruction set, separate register files and independent execution hardware. NEON supports 8-, 16-, 32- and 64-bit integer and single-precision floating-point data, and applies SIMD operations to audio/video processing as well as graphics and gaming workloads. A single NEON instruction can perform up to 16 operations at the same time.
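The figure of 16 simultaneous operations follows from dividing a 128-bit vector into 8-bit elements. The sketch below models this in plain C; it is a scalar stand-in written for illustration, it does not use real NEON intrinsics, and the type and function names are hypothetical.

```c
#include <stdint.h>

#define VECTOR_BITS 128u

/* Number of elements of a given width that fit in one 128-bit vector. */
static unsigned lanes_per_vector(unsigned element_bits) {
    return VECTOR_BITS / element_bits;
}

/* A 128-bit vector viewed as sixteen 8-bit lanes. */
typedef struct { uint8_t lane[16]; } Vec16x8;

/* Lane-wise add. SIMD hardware would perform all sixteen additions with
 * one instruction; this loop only models the result. */
static Vec16x8 add_u8x16(Vec16x8 a, Vec16x8 b) {
    Vec16x8 r;
    for (unsigned i = 0; i < 16; i++)
        r.lane[i] = (uint8_t)(a.lane[i] + b.lane[i]);   /* wraps modulo 256 */
    return r;
}
```

The same division gives 8 lanes for 16-bit data, 4 lanes for 32-bit data and 2 lanes for 64-bit data.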

VFP

VFP technology is a coprocessor extension to the ARM architecture. It provides low-cost single-precision and double-precision floating-point computation fully compliant with the ANSI/IEEE Std 754-1985 Standard for Binary Floating-Point Arithmetic. VFP provides floating-point computation suitable for a wide spectrum of applications such as PDAs, smartphones, voice compression and decompression, three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications. The VFP architecture also supports execution of short vector instructions allowing SIMD (Single Instruction Multiple Data) parallelism. This is useful in graphics and signal-processing applications by reducing code size and increasing throughput.

Other floating-point and/or SIMD coprocessors found in ARM-based processors include FPA, FPE and iwMMXt. They provide some of the same functionality as VFP but are not opcode-compatible with it.

Security Extensions (TrustZone)

The Security Extensions, marketed as TrustZone Technology, are found in ARMv6KZ and later application profile architectures. They provide a low-cost alternative to adding a dedicated security core to a SoC, by providing two virtual processors backed by hardware-based access control. This enables the application core to switch between two states, referred to as worlds (to reduce confusion with other names for capability domains), in such a manner that information can be prevented from leaking from the more trusted world to the less trusted world. This world switch is generally orthogonal to all other capabilities of the processor, so each world can operate independently of the other while using the same core. Memory and peripherals are then made aware of the operating world of the core and may use this to provide access control to secrets and code on the device. A typical application of TrustZone Technology is to run a rich operating system in the less trusted world, and smaller security-specialized code in the more trusted world.

ARM licensees

ARM Ltd does not manufacture and sell CPU devices based on its own designs, but rather licenses the processor architecture to interested parties. ARM offers a variety of licensing terms, varying in cost and deliverables. To all licensees, ARM provides an integratable hardware description of the ARM core, as well as a complete software development toolset (compiler, debugger, SDK), and the right to sell manufactured silicon containing the ARM CPU. Fabless licensees, who wish to integrate an ARM core into their own chip design, are usually only interested in acquiring a ready-to-manufacture verified IP core. For these customers, ARM delivers a gate netlist description of the chosen ARM core, along with an abstracted simulation model and test programs to aid design integration and verification. More ambitious customers, including integrated device manufacturers (IDM) and foundry operators, choose to acquire the processor IP in synthesizable RTL (Verilog) form. With the synthesizable RTL, the customer has the ability to perform architectural-level optimizations and extensions. This allows the designer to achieve exotic design goals not otherwise possible with an unmodified netlist (high clock speed, very low power consumption, instruction set extensions, etc.). While ARM does not grant the licensee the right to resell the ARM architecture itself, licensees may freely sell manufactured products (chip devices, evaluation boards, complete systems, etc.). Merchant foundries can be a special case; not only are they allowed to sell finished silicon containing ARM cores, they generally hold the right to remanufacture ARM cores for other customers.

Like most IP vendors, ARM prices its IP based on perceived value. In architectural terms, the lower-performance ARM cores command a lower license cost than the higher-performance cores. In terms of silicon implementation, a synthesizable core is more expensive than a hard macro (blackbox) core. Complicating price matters, a merchant foundry that holds an ARM license (such as Samsung or Fujitsu) can offer reduced licensing costs to its fab customers. In exchange for acquiring the ARM core through the foundry's in-house design services, the customer can reduce or eliminate payment of ARM's upfront license fee. Compared to dedicated semiconductor foundries (such as TSMC and UMC) without in-house design services, Fujitsu and Samsung charge 2 to 3 times more per manufactured wafer. For low- to mid-volume applications, a design service foundry offers lower overall pricing (through subsidization of the license fee). For high-volume mass-produced parts, the long-term cost reduction achievable through lower wafer pricing reduces the impact of ARM's NRE (Non-Recurring Engineering) costs, making the dedicated foundry a better choice.

Many semiconductor and IC design firms hold ARM licenses, including Analog Devices, Atmel, Broadcom, Cirrus Logic, Faraday Technology, Freescale (spun off from Motorola in 2004), Fujitsu, Intel (through its settlement with Digital Equipment Corporation), IBM, Infineon Technologies, Nintendo, NXP Semiconductors (spun off from Philips in 2006), OKI, Qualcomm, Samsung, Sharp, STMicroelectronics, Texas Instruments and VLSI. Although ARM's license terms are covered by NDA, ARM is widely known within the IP industry to be among the most expensive CPU cores. A single customer product containing a basic ARM core can incur a one-time license fee in excess of US$200,000. Where significant quantity and architectural modification are involved, the license fee can exceed US$10M.^[citation needed]

ARM believes that its base of 200+ semiconductor licensees gives it a chance to succeed in the ongoing controversies regarding the use of ARM or Intel architectures in mobile computers.

Approximate licensing costs

ARM's 2006 annual report and accounts state that royalties totalling 88.7 million GBP (164.1 million USD) were the result of licensees shipping 2.45 billion units^[26]. This is equivalent to 0.036 GBP (0.067 USD) per unit shipped. However, this is averaged across all cores, including expensive new cores and inexpensive older cores.

In the same year, ARM's licensing revenues for processor cores were 65.2 million GBP (119.5 million USD)^[27], and 65 processor licenses were signed^[28], an average of 1 million GBP (1.84 million USD) per license. Again, this is averaged across both new and old cores.
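The two averages quoted so far can be reproduced with a few lines of C. The sketch uses only the 2006 figures given above; the helper names are hypothetical, and the results are simple averages with the same caveats as in the text.

```c
/* Revenue per shipped unit, given revenue in millions and units in billions. */
static double revenue_per_unit(double revenue_millions, double units_billions) {
    return (revenue_millions * 1e6) / (units_billions * 1e9);
}

/* Average revenue per license, in millions, given the number of licenses. */
static double average_per_license(double revenue_millions, unsigned licenses) {
    return revenue_millions / (double)licenses;
}
```

With the figures above, revenue_per_unit(88.7, 2.45) comes to about 0.036 GBP and revenue_per_unit(164.1, 2.45) to about 0.067 USD, while average_per_license(65.2, 65) is about 1.00 million GBP and average_per_license(119.5, 65) about 1.84 million USD.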

Given that ARM's 2006 income from processor cores was approximately 60% from royalties and 40% from licenses, ARM makes the equivalent of 0.06 GBP (0.11 USD) per unit shipped including both royalties and licenses. However, as one-off licenses are typically bought for new technologies, unit sales (and hence royalties) are dominated by more established products. Hence, these figures above do not reflect the true costs of any single ARM product.