CPU stands for Central Processing Unit.
GPU stands for Graphics Processing Unit.
TPU stands for Tensor Processing Unit, Google's custom machine-learning chip.
NPU stands for Neural network Processing Unit.
To summarize the differences among them:
Although a CPU has multiple cores, there are generally only a few. Each core has a large cache and ample arithmetic and logic units. The CPU needs strong versatility to handle all kinds of data types, and logic decisions introduce a large number of branch jumps and interrupts, so considerable hardware is devoted to accelerating branch handling and even more complex logic;
GPUs have far more cores than CPUs and are called many-core processors (NVIDIA's Fermi has 512 cores). Each core has a relatively small cache, and its arithmetic logic units are small and simple (early GPUs were weaker than CPUs at floating-point computation). GPUs face large volumes of data that are highly uniform in type and independent of one another, in a pure computing environment that needs no interruption.
A TPU is a chip customized for machine learning, tailored to deep learning workloads, and it delivers higher performance per watt. Roughly speaking, it offers a seven-year lead over processors of its day: it is more tolerant of reduced computational precision, squeezes more operations per second out of the chip, and lets more complex and powerful machine-learning models be deployed faster, so users get smarter results more quickly.
The NPU, or neural network processor, uses circuits to mimic the structure of human neurons and synapses.
Readers who want a deeper understanding can read on for the detailed explanation. If you would rather not, skip to the end of the article for a quick summary:
The central processing unit (CPU, Central Processing Unit) is one of the main components of a computer and its core part. Its job is to interpret computer instructions and process the data in software. As the core component, the CPU is responsible for fetching, decoding, and executing every instruction the computer runs.
The CPU mainly consists of an arithmetic logic unit (ALU, Arithmetic and Logic Unit), a control unit (CU, Control Unit), registers, a cache, and the buses that carry data, control, and status signals among them.
The CPU follows the von Neumann architecture, whose core ideas are the stored program and sequential execution.
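The stored-program, sequential-execution cycle can be illustrated with a toy fetch-decode-execute loop. This is only a sketch; the two-instruction machine, its opcodes, and the register names are invented for illustration:

```python
# A toy von Neumann machine: the program lives in memory,
# and instructions are fetched and executed strictly one at a time.
def run(memory):
    """memory: a list of (opcode, operand) tuples."""
    acc = 0   # accumulator register
    pc = 0    # program counter
    while True:
        opcode, operand = memory[pc]   # fetch
        pc += 1                        # sequential execution: next instruction
        if opcode == "LOAD":           # decode + execute
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "HALT":
            return acc

program = [("LOAD", 2), ("ADD", 3), ("HALT", None)]
print(run(program))  # → 5
```

The point of the sketch is the single `pc` stepping through memory one instruction at a time: the "regular butler" described below does exactly one thing per step.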
Because of the von Neumann architecture (stored program, sequential execution), the CPU is like a disciplined butler: it does exactly what it is told, step by step. But as demands for larger scale and faster processing grew, the butler gradually became overwhelmed.
So everyone wondered: could multiple processors be put on the same chip and made to work together, improving efficiency?
That’s right, the GPU was born.
Before formally introducing the GPU, let's discuss a concept mentioned above: parallel computing.
Parallel computing is the process of using multiple computing resources to solve a problem simultaneously, and it is an effective means of improving a computer system's speed and processing capacity. Its basic idea is to have multiple processors jointly solve the same problem: the problem is decomposed into parts, and each part is computed in parallel by an independent processor.
Parallel computing can be divided into parallelism in time and parallelism in space.
Parallelism in time refers to pipelining. For example, when a factory processes food, the work is divided into four steps: washing, disinfecting, cutting, and packaging.
Without a pipeline, the next item is processed only after one item has finished all four steps, which wastes time and hurts efficiency. With pipelining, four items can be in process at the same time. This is time parallelism in parallel algorithms: starting two or more operations simultaneously greatly improves computing performance.
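Assuming each of the four steps takes one time unit, the speedup from pipelining can be sketched with simple arithmetic (the one-unit-per-stage assumption is mine, for illustration):

```python
def unpipelined_time(items, stages):
    # each item must pass through every stage before the next item starts
    return items * stages

def pipelined_time(items, stages):
    # the first item takes `stages` steps to fill the pipeline;
    # after that, one finished item emerges per step
    return stages + (items - 1)

# four foods through wash -> disinfect -> cut -> package
print(unpipelined_time(4, 4))  # → 16 time units
print(pipelined_time(4, 4))    # → 7 time units
```

As the number of items grows, pipelined throughput approaches one item per step, roughly a four-fold speedup for a four-stage pipeline.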
Parallelism in space refers to the concurrent execution of computation by multiple processors: two or more processors connected by a network simultaneously compute different parts of the same task, or solve large-scale problems that a single processor cannot.
For example, suppose Xiao Li plans to plant three trees on Arbor Day. Working alone would take him six hours, so he calls his good friends Xiao Hong and Xiao Wang, and on Arbor Day the three of them start digging holes and planting at the same time. Two hours later, each has planted one tree and the job is done. This is spatial parallelism in parallel algorithms: dividing a large task into multiple identical subtasks speeds up problem solving.
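The tree-planting example maps directly onto splitting a task list across workers. A minimal sketch using Python's standard library (the `plant_tree` function and its return values are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def plant_tree(tree_id):
    # stand-in for the hours of digging and planting one tree
    return f"tree {tree_id} planted"

# one worker per tree -- Xiao Li, Xiao Hong, and Xiao Wang
# all digging at the same time
with ThreadPoolExecutor(max_workers=3) as workers:
    results = list(workers.map(plant_tree, [1, 2, 3]))

print(results)  # → ['tree 1 planted', 'tree 2 planted', 'tree 3 planted']
```

`ThreadPoolExecutor.map` preserves input order, so the results come back as if planted sequentially, even though the subtasks ran concurrently.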
Therefore, if a CPU performed the tree-planting task, it would plant the trees one by one over six hours, whereas letting a GPU plant them is like several people planting at the same time.
The full name of GPU is Graphics Processing Unit. As the name suggests, the GPU was originally a microprocessor for running graphics workloads on personal computers, workstations, game consoles, and some mobile devices (such as tablets and smartphones).
Why are GPUs so good at processing image data? Because every pixel in an image needs to be processed, and the procedure is nearly identical for each pixel, which is exactly where the GPU shines.
But the GPU cannot work alone; it must be controlled by the CPU. The CPU handles complex logic and varied data types on its own, and when a large amount of uniformly typed data needs processing, it calls on the GPU for parallel computation.
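The key property that makes image work GPU-friendly is that the same operation is applied to every pixel, and no pixel depends on any other. A minimal sketch in pure Python (on a real GPU, each pixel's computation would run on its own hardware thread; the `brighten` kernel here is invented for illustration):

```python
def brighten(pixel, amount=40):
    # the same simple kernel is applied to every pixel,
    # clamping each channel to the 0-255 range
    r, g, b = pixel
    return (min(r + amount, 255),
            min(g + amount, 255),
            min(b + amount, 255))

image = [(10, 20, 30), (250, 100, 0), (0, 0, 0)]
# every call is independent of the others, so all of them
# could execute simultaneously on a many-core GPU
brightened = [brighten(p) for p in image]
print(brightened)  # → [(50, 60, 70), (255, 140, 40), (40, 40, 40)]
```

This "one kernel, millions of independent data items" shape is exactly the uniform, interrupt-free workload described above.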
Most GPU work is computationally intensive but not intellectually demanding, and it must be repeated many, many times.
To borrow an answer from Zhihu: suppose you have a job that requires a hundred million simple additions, subtractions, multiplications, and divisions. The best approach is to hire dozens of elementary school students and have each compute a share; the work has no intellectual depth, it is pure manual labor. The CPU, by contrast, is like an old professor who can work out integrals and differentials, but whose salary is worth twenty elementary school students. If you were Foxconn, which would you hire?
It should be emphasized that although the GPU was born for image processing, the introduction above shows that it has no component structurally dedicated to images; it is simply an optimized and tuned variant of the CPU's structure. So today GPUs shine not only in image processing but also in scientific computing, password cracking, numerical analysis, massive data processing (sorting, MapReduce, etc.), financial analysis, and other fields requiring large-scale parallel computation.
A tensor processing unit (TPU) is a custom ASIC chip designed from the ground up by Google and dedicated to machine-learning workloads. TPUs power Google's main products, including Translate, Photos, Search, Assistant, and Gmail. Cloud TPU offers the TPU as a scalable cloud computing resource, providing compute to all developers and data scientists running cutting-edge ML models on Google Cloud.
As mentioned above, CPUs and GPUs are relatively general-purpose chips, but there is an old saying: a universal tool will never be as efficient as a dedicated one.
As computing needs became more and more specialized, people wanted chips better matched to their particular workloads. This is where the concept of the ASIC (Application-Specific Integrated Circuit) was born.
An ASIC is an integrated circuit customized to a particular product requirement, designed and manufactured for a specific user and a specific electronic system.
The TPU (Tensor Processing Unit) is a chip Google developed specifically to accelerate deep neural network computation, and it is in fact an ASIC.
Compared with CPUs and GPUs of the same period, the TPU is said to provide a 15-30x performance boost and a 30-80x efficiency (performance per watt) boost. The first-generation TPU could only do inference, relying on Google Cloud to collect data and produce results in real time, with training requiring additional resources; the second-generation TPU can be used both for training neural networks and for inference.
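One source of the first-generation TPU's inference efficiency is its tolerance of reduced precision: neural-network weights and activations can be quantized to 8-bit integers, which need far fewer transistors per multiply than 32-bit floats. A rough sketch of the idea (the scale factor and rounding scheme here are simplified assumptions, not the TPU's actual quantization pipeline):

```python
def quantize(values, scale=127.0):
    # map floats in [-1, 1] to 8-bit integers in [-127, 127]
    return [round(v * scale) for v in values]

def int8_dot(x_q, w_q, scale=127.0):
    # integer multiply-accumulate, then rescale back to float
    acc = sum(x * w for x, w in zip(x_q, w_q))
    return acc / (scale * scale)

x = [0.5, -0.25, 1.0]   # activations (made-up values)
w = [0.8, 0.4, -0.1]    # weights (made-up values)

exact = sum(a * b for a, b in zip(x, w))           # full-precision result
approx = int8_dot(quantize(x), quantize(w))        # 8-bit integer result
print(exact, approx)  # the quantized result is close to the exact one
```

For neural networks, this small loss of precision typically does not change the prediction, which is why trading precision for throughput pays off.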
The NPU (Neural network Processing Unit), or neural network processor, simulates human neuron and synapse structures with circuits.
In a neural network, storage and processing are integrated, both embodied in the synaptic weights. In the von Neumann architecture, storage and processing are separate, implemented by the memory and the arithmetic unit respectively. The difference between the two is enormous. When classical von Neumann machines (such as x86 processors and NVIDIA GPUs) run neural-network applications, they are inevitably constrained by this separation of storage and processing, which hurts efficiency. This is one reason that chips built specifically for artificial intelligence hold an inherent advantage over traditional ones.
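The contrast can be made concrete with a single artificial neuron: its synaptic weights are simultaneously the stored state and the parameters of the computation, whereas a von Neumann machine must keep those weights in memory and shuttle them to the ALU for every operation. A minimal sketch (the weights, inputs, and bias are made-up values):

```python
def neuron(inputs, weights, bias):
    # each synaptic weight both *stores* the learned state and
    # *participates directly* in the computation -- in a neural
    # network, storage and processing are one and the same
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, weighted_sum)  # ReLU activation

print(neuron([1.0, 0.5], [0.2, -0.4], 0.1))  # → 0.1
```

An NPU instruction that "processes a group of neurons" amortizes this pattern across many such units at once, keeping the weight data on chip instead of round-tripping it through external memory.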
Typical NPUs include China's Cambricon chips and IBM's TrueNorth. Taking Cambricon as an example, its DianNaoYu instruction set directly targets the processing of large-scale neurons and synapses: a single instruction can process a group of neurons, and the chip provides a range of dedicated support for moving neuron and synapse data on chip.
In numbers, CPUs, GPUs, and NPUs can differ by more than 100x in performance or energy efficiency. Take the DianNao paper jointly published by the Cambricon team and Inria as an example: DianNao is a single-core processor clocked at 0.98 GHz, with a peak of 452 billion basic neural-network operations per second, a power draw of 0.485 W in a 65 nm process, and an area of 3.02 mm².
BPU (Brain Processing Unit) is an embedded artificial-intelligence processor architecture proposed by Horizon Robotics. The first generation is the Gaussian architecture, the second the Bernoulli architecture, and the third the Bayesian architecture. Horizon has so far realized the first-generation Gaussian architecture, and jointly launched an ADAS (Advanced Driver Assistance System) with Intel at CES 2017.
DPU (Deep learning Processing Unit) was first proposed by China's DeePhi Tech (Shenjian Technology). Building on the reconfigurability of Xilinx FPGAs, DeePhi designed a dedicated deep-learning processing unit (built from existing logic cells, with multipliers and logic circuits designed for parallelism and efficiency, and offered as IP), and abstracted a custom instruction set and compiler (rather than using OpenCL), enabling rapid development and product iteration. In effect, the DPU DeePhi proposed is a semi-custom FPGA.
APU – Accelerated Processing Unit, AMD's accelerated processor line combining a CPU and graphics acceleration on one chip.
BPU – Brain Processing Unit, the embedded processor architecture led by Horizon Robotics.
CPU – Central Processing Unit, the mainstream core of today's PCs.
DPU – Deep learning Processing Unit, first proposed by China's DeePhi Tech; also Dataflow Processing Unit, the AI architecture proposed by Wave Computing; and Data storage Processing Unit, Shenzhen Dapuwei's intelligent solid-state-drive processor.
FPU – Floating-point Processing Unit, the floating-point arithmetic module in a general-purpose processor.
GPU – Graphics Processing Unit, a graphics processor using a multi-threaded SIMD architecture for graphics workloads.
HPU – Holographic Processing Unit, the holographic computing chip in Microsoft's holographic devices.
IPU – Intelligence Processing Unit, an AI processor from Graphcore, a company in which DeepMind has invested.
MPU/MCU – Microprocessor Unit / Microcontroller Unit, RISC architectures generally used for low-compute applications, such as the ARM Cortex-M series.
NPU – Neural network Processing Unit, a general term for processors built around neural-network algorithms and acceleration, such as the DianNao series from the Institute of Computing Technology, Chinese Academy of Sciences / Cambricon.
RPU – Radio Processing Unit, Imagination Technologies' single-chip processor integrating Wi-Fi/Bluetooth/FM processing.
TPU – Tensor Processing Unit, Google's special-purpose processor for accelerating AI algorithms; the first generation targeted inference, the second generation training as well.
VPU – Vision Processing Unit, the accelerated computing core of the dedicated image- and AI-processing chips from Movidius (acquired by Intel).
WPU – Wearable Processing Unit, wearable system-on-chip products from Ineda Systems, integrating a GPU, MIPS CPU, and other IP.
XPU – an FPGA-based intelligent cloud accelerator with 256 cores, announced by Baidu and Xilinx at the 2017 Hot Chips conference.
ZPU – Zylin Processing Unit, a 32-bit open-source processor from the Norwegian company Zylin.