Design of a Shared High-Speed Memory Module for a Baseband Processing Chip

Abstract: The cache, a small, fast memory placed between the central processing unit (CPU) and main memory, bridges the mismatch between their data-processing speeds and improves overall system performance. Symmetric multiprocessor (SMP) systems support the caching of both shared and private data, and a cache coherency protocol maintains consistency when multiple processors share data. This paper discusses a shared cache design suitable for a 64-bit multi-core processor, including how multi-processor cache coherency is achieved and its fully custom back-end implementation.

Key words: shared high-speed memory; multi-core processor; AMBA bus

Chinese Library Classification Number: TP332    Document identification code: A


0 Introduction

This article presents the design of a shared high-speed memory module that enables data exchange between the cores of a multi-core processor while occupying a small circuit area. Compared with traditional methods of inter-core data exchange, this design improves system performance and is a commercially competitive circuit structure.

1 Shared cache structure design

1.1 Overall considerations

The shared cache in a multi-core CPU is mainly responsible for caching data for the processor cores, handling miss requests for that data, and issuing requests to the DRAM controller to fetch the data returned by DRAM. The shared cache is interconnected with each processor core through a crossbar bus, over which communication packets are forwarded. The shared cache is divided into four banks, each of which uses set-associative address mapping. Each processor core can send a packet to any bank, and a packet can likewise travel in the opposite direction from any bank to any core.

The shared cache uses four-way set-associative mapping and is divided into 1024 sets. The physical address of a cache block is divided into three fields: the tag, the index, and the offset within the block. The index field selects the set in which the cache block resides. Comparing the tag field of the physical address with the four tags in the selected set determines whether the access hits or misses. On a hit, the comparison result is sent to the data array as a way-selection vector, and the cache line is located by the way-selection vector together with the set-selection vector.
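As an illustration, the following C sketch decomposes a physical address into the three fields described above and performs the tag comparison. The 1024 sets and four ways come from the design; the 64-byte block size, the 64-bit address width, and the function names are assumptions made for the example.

#include <stdint.h>

/* Field widths: 1024 sets -> 10 index bits (from the text);
 * a 64-byte block -> 6 offset bits (assumed for illustration). */
#define OFFSET_BITS 6
#define INDEX_BITS  10
#define NUM_WAYS    4

typedef struct {
    uint64_t tag;      /* upper address bits, compared against stored tags */
    uint32_t index;    /* selects one of the 1024 sets                     */
    uint32_t offset;   /* byte offset within the cache block               */
} cache_addr_t;

cache_addr_t decode_addr(uint64_t paddr)
{
    cache_addr_t a;
    a.offset = paddr & ((1u << OFFSET_BITS) - 1);
    a.index  = (paddr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    a.tag    = paddr >> (OFFSET_BITS + INDEX_BITS);
    return a;
}

/* Compare the address tag with the four tags stored in the selected set.
 * On a hit, the match position forms the one-hot way-selection vector. */
int tag_lookup(const uint64_t stored_tags[NUM_WAYS],
               const uint8_t valid[NUM_WAYS],
               uint64_t tag, unsigned *way_sel)
{
    for (unsigned w = 0; w < NUM_WAYS; w++) {
        if (valid[w] && stored_tags[w] == tag) {
            *way_sel = 1u << w;   /* one-hot way-selection vector */
            return 1;             /* hit */
        }
    }
    return 0;                     /* miss */
}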

1.2 Cache coherency

In a symmetric shared-memory multiprocessor system, the cache subsystems of the processors share the same physical memory and are connected by a bus. All processors take the same time to access memory, that is, uniform memory access (UMA). A symmetric shared-memory system supports the caching of both shared and private data. Private data is used by a single processor, while shared data is used by multiple processors, and communication between processors is carried out by reading and writing shared data. Shared data is replicated in multiple caches, which reduces access latency, lowers memory-bandwidth requirements, and reduces contention when multiple processors read shared data. However, shared data also introduces the cache coherency problem. The key to achieving cache coherency is tracking the state of every shared data block. Two classes of protocol are widely used for this purpose: directory-based and snooping. This design adopts a directory-based cache coherency protocol: the sharing state of each block of physical memory is kept in a directory table, and the directory records which caches own a copy of each second-level cache block. The first-level caches are write-through, so they only require invalidation messages; the shared cache is write-back, so data can always be retrieved from it. To reduce directory overhead, the directory is placed in the cache rather than in memory.

When a block has not been cached, there are 2 possible directory requests (a state-machine sketch covering all of the directory transitions follows the final case below):

1) Read miss: The shared cache sends the requested data back to the requesting processor, and the requesting node becomes the only sharer. The state of the block is set to shared.

2) Write miss: Send the data back to the requesting processor, which becomes the owner. The block is set to the exclusive state, indicating that this is the only valid cached copy, and the owner is recorded in the sharer set.

When a data block is in the shared state, the value in the shared cache is up to date, and there are 2 possible directory requests:

1) Read miss: The shared cache sends the requested data back to the requesting processor, and the requesting processor is added to the sharer set.

2) Write miss: Send the data back to the requesting processor, invalidate the cached copies held by the processors in the sharer set, keep only the identifier of the requesting processor in the sharer set, and set the block to the exclusive state.

When a data block is in the exclusive state, the current value of the block is held in the cache of the processor identified by the sharer set (the owner). There are 3 possible directory requests:

1) Read miss: A fetch message is sent to the owner processor, and the state of the block is changed to shared. The owner sends the data to the directory, where it is written to the shared cache and forwarded to the requesting processor. The requesting processor is then added to the sharer set, which still contains the former owner.

2) Data write-back: The owner writes the block back, the copy in the shared cache is updated, and the sharer set becomes empty.

3) Write miss: The block gets a new owner. A message is sent to the old owner, causing its cache to invalidate the block and send the value to the directory, from which it is forwarded to the requesting processor; the requesting processor becomes the new owner. The sharer set retains only the identity of the new owner, and the block remains in the exclusive state.
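The directory transitions just described can be condensed into a small state machine. The following C sketch is a minimal illustration, not the chip's actual implementation: the sharer set is modeled as a bit vector, and the message-sending helpers (send_data, send_invalidate, send_fetch) are hypothetical stand-ins for the crossbar interface.

#include <stdint.h>

typedef enum { UNCACHED, SHARED, EXCLUSIVE } dir_state_t;
typedef enum { READ_MISS, WRITE_MISS, WRITE_BACK } dir_req_t;

typedef struct {
    dir_state_t state;
    uint32_t    sharers;           /* bit i set => core i holds a copy   */
} dir_entry_t;

/* Hypothetical message helpers; names and signatures are assumptions.  */
void send_data(int core);           /* reply with the data block         */
void send_invalidate(uint32_t set); /* invalidate every core in the set  */
void send_fetch(int owner);         /* ask the owner to return the block */

static int owner_of(uint32_t sharers)
{
    int i = 0;
    while (!(sharers & (1u << i)))
        i++;
    return i;                       /* index of the single owner bit     */
}

void directory_handle(dir_entry_t *e, dir_req_t req, int core)
{
    switch (e->state) {
    case UNCACHED:                          /* block not cached anywhere */
        send_data(core);
        e->sharers = 1u << core;            /* requester is sole holder  */
        e->state   = (req == READ_MISS) ? SHARED : EXCLUSIVE;
        break;

    case SHARED:                            /* shared cache holds latest */
        if (req == READ_MISS) {
            send_data(core);
            e->sharers |= 1u << core;       /* add requester to sharers  */
        } else {                            /* WRITE_MISS                */
            send_invalidate(e->sharers & ~(1u << core));
            send_data(core);
            e->sharers = 1u << core;        /* requester becomes owner   */
            e->state   = EXCLUSIVE;
        }
        break;

    case EXCLUSIVE:                         /* owner holds latest value  */
        if (req == READ_MISS) {
            send_fetch(owner_of(e->sharers));
            send_data(core);                /* forwarded after the fetch */
            e->sharers |= 1u << core;       /* former owner stays in set */
            e->state   = SHARED;
        } else if (req == WRITE_BACK) {
            e->sharers = 0;                 /* shared-cache copy updated */
            e->state   = UNCACHED;
        } else {                            /* WRITE_MISS                */
            send_fetch(owner_of(e->sharers));
            send_invalidate(e->sharers);    /* old owner drops its copy  */
            send_data(core);
            e->sharers = 1u << core;        /* ownership moves; state    */
        }                                   /* remains EXCLUSIVE         */
        break;
    }
}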

2 High-speed shared cache module

The user RAM is 2 MB in size and is attached to the AHB bus shared by the two cores; the address regions accessible to each core can be configured arbitrarily. Internally it consists of an SRAM array and an AHB slave interface circuit, as shown in Figure 2-1. A read access has a one-cycle delay, while a write access has no delay. The read and write timing is shown in Figure 2-2 and Figure 2-3. Both reads and writes support byte, half-word, and word accesses.

The address space range of user RAM is 0xA0000000 ~ 0xA01FFFFF.


Figure 2-1 Schematic diagram of the user RAM structure

Suppose CPU0 writes data to the user RAM and CPU1 then reads it. CPU0 first writes the data and then sets a flag variable to 1, indicating that the data in the user RAM has been updated; the flag variable itself resides within the user RAM address range. CPU1 then reads the flag variable: if it is 1, CPU1 reads the data written by CPU0 from the corresponding address in the user RAM and clears the flag to 0; if the flag is 0, the data in the user RAM has already been read by CPU1.

This method realizes data exchange between the cores. Because only one master on the AHB bus can perform a read or write at any given time, the atomicity of each access is guaranteed; the flag variable cannot be accessed by CPU0 and CPU1 simultaneously, which ensures its validity.
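A minimal C sketch of this flag-based handshake is given below. The user RAM base address 0xA0000000 comes from the text; placing the flag variable in the first word and the payload immediately after it, as well as the function names, are assumptions for illustration.

#include <stdint.h>

#define USER_RAM_BASE 0xA0000000u   /* from the user-RAM address map      */
#define FLAG (*(volatile uint32_t *)(USER_RAM_BASE))       /* assumed slot */
#define DATA ((volatile uint32_t *)(USER_RAM_BASE + 4))    /* assumed slot */

/* CPU0 (producer): write the payload, then raise the flag. */
void cpu0_send(const uint32_t *buf, unsigned n)
{
    while (FLAG == 1)          /* previous block not yet consumed by CPU1  */
        ;
    for (unsigned i = 0; i < n; i++)
        DATA[i] = buf[i];      /* write data into user RAM over the AHB bus */
    FLAG = 1;                  /* single AHB write: seen atomically by CPU1 */
}

/* CPU1 (consumer): read the payload only when the flag is raised. */
int cpu1_recv(uint32_t *buf, unsigned n)
{
    if (FLAG != 1)
        return 0;              /* flag is 0: data already read, nothing new */
    for (unsigned i = 0; i < n; i++)
        buf[i] = DATA[i];      /* read data written by CPU0                 */
    FLAG = 0;                  /* mark the block as consumed                */
    return 1;
}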


Figure 2-2 User RAM read timing

Figure 2-3 User RAM write timing
