The data could be expected to be retained by which of the following types of memory?

Data storage device

Computers process information stored in their memory, which consists of data storage units. Storage devices such as CD and DVD drives are called external or auxiliary storage units, whereas the principal memory devices directly accessible by the computer are called internal or main storage units; these rely on semiconductor memory chips.
There are mainly two types of semiconductor memory: random-access memory (RAM) and read-only memory (ROM). RAM is a temporary data storage domain, whereas ROM serves as a semi-permanent storage domain. If RAM is likened to notebooks or memo pads, then ROM is comparable to dictionaries and textbooks.

RAM—a memory device for reading/writing data

Since random-access memory (RAM) is principally used as temporary storage for the operating system and the applications, it matters little that some types of RAM lose data when they are powered off. What matters more is the cost and the read/write speed. There are mainly two types of RAM: one is DRAM (dynamic RAM), and the other is SRAM (static RAM). DRAM stores information in capacitors, and since the capacitors slowly discharge, the information fades away unless the capacitor charge is refreshed periodically. In practice, the data on DRAM need to be read and rewritten (i.e., refreshed) dozens of times per second. In contrast, SRAM needs no refreshing because it uses flip-flop circuits* to preserve the data. SRAM is more expensive than DRAM because of the complex circuitry involved, but it is also faster.

*Flip-flop circuit: An electronic circuit that stores a single bit of data that represents either 0 or 1.
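As a rough sketch of how a flip-flop can hold a bit without any refreshing (in contrast to DRAM's periodic refresh), the following model treats it as an SR latch built from two cross-coupled NOR gates. The class name and the gate-level behaviour are illustrative assumptions, not a circuit taken from this text.

    #include <iostream>

    // Assumed model: an SR latch built from two cross-coupled NOR gates,
    // holding one bit for as long as it is "powered" (i.e., the object exists).
    struct SRLatch {
        bool q = false, qBar = true;              // complementary outputs

        // Drive the set/reset inputs and let the gates settle.
        void step(bool set, bool reset) {
            for (int i = 0; i < 4; ++i) {         // iterate until the outputs stabilize
                bool newQ    = !(reset || qBar);  // NOR gate 1
                bool newQBar = !(set   || q);     // NOR gate 2
                q = newQ;
                qBar = newQBar;
            }
        }
    };

    int main() {
        SRLatch bit;
        bit.step(true, false);                    // set: the latch now stores 1
        bit.step(false, false);                   // inputs released: the value is retained
        std::cout << bit.q << '\n';               // prints 1, no refresh needed
        bit.step(false, true);                    // reset: the latch now stores 0
        std::cout << bit.q << '\n';               // prints 0
    }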

ROM—a read-only memory device

Read-only memory (ROM) is used for retrieving stored data that are permanently fixed and cannot be rewritten. Many home appliances such as washing machines and rice cookers use ROM devices to store pre-set programs.
ROM is non-volatile memory, meaning that the data stored on ROM are not lost even when the power is shut off. ROM is designed specifically for reading data. It may be possible to erase or write data on ROM, but it takes an inordinately long time to do so. To correct this shortcoming, new kinds of devices have emerged in recent years that serve as a cross between ROM and RAM, including flash memory and EPROM.

Cache memory

G.R. Wilson, in Embedded Systems and Computer Architecture, 2002

15.2.1 Memory write operations

When the microprocessor performs a memory write operation, and the word is not in the cache, the new data is simply written into main memory. However, when the word is in the cache, both the word in main memory and the cache must be written in order to keep them the same. The question is: when is the main memory to be written? The simplest answer is to write to both the cache and main memory at the same time. This is called a write-through policy. The main memory then always contains the same data as the cache. This is important if there are any other devices in the computer that also access main memory. We can overcome the slowing down due to the main memory write operation by providing that subsequent cache reads proceed concurrently with the write to main memory. Since more than 70% of memory references are read operations, it is likely that the cache can continue to be read while the write to main memory proceeds.
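As a hedged illustration of the write-through policy just described, the sketch below models a tiny direct-mapped cache in front of a word-addressed main memory; the class and member names are assumptions made for this example, not definitions from the chapter.

    #include <cstdint>
    #include <vector>

    // Write-through sketch: a write always updates main memory, and also
    // updates the cache line when the word happens to be cached, so the
    // two never diverge. (Direct-mapped, one word per line, assumed layout.)
    struct WriteThroughCache {
        struct Line { bool valid = false; uint32_t tag = 0; uint32_t data = 0; };

        std::vector<Line> lines;
        std::vector<uint32_t>& memory;            // backing main memory

        WriteThroughCache(std::vector<uint32_t>& mem, size_t numLines)
            : lines(numLines), memory(mem) {}

        void write(uint32_t addr, uint32_t value) {
            memory[addr] = value;                 // always write main memory
            Line& line = lines[addr % lines.size()];
            uint32_t tag = addr / lines.size();
            if (line.valid && line.tag == tag)
                line.data = value;                // word is cached: keep it identical
        }
    };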

An alternative policy, called write-back or copy-back, is to write the new data to the cache only. At the same time, a flag in the cache line is set to indicate that the line has been modified. Immediately before the cache location is replaced with new words from main memory, if the flag is set, the line will be copied back into main memory. Of course, if the flag is not set, this copying is unnecessary.
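The copy-back behaviour can be sketched in the same assumed model (again, all names are illustrative): writes go to the cache only and set a dirty flag, and a modified line is copied back to main memory just before it is replaced.

    #include <cstdint>
    #include <vector>

    // Write-back (copy-back) sketch: direct-mapped, one word per line.
    struct WriteBackCache {
        struct Line { bool valid = false; bool dirty = false; uint32_t tag = 0; uint32_t data = 0; };

        std::vector<Line> lines;
        std::vector<uint32_t>& memory;            // backing main memory

        WriteBackCache(std::vector<uint32_t>& mem, size_t numLines)
            : lines(numLines), memory(mem) {}

        uint32_t read(uint32_t addr) {
            Line& line = lines[addr % lines.size()];
            uint32_t tag = addr / lines.size();
            if (!(line.valid && line.tag == tag)) {   // miss: this line will be replaced
                if (line.valid && line.dirty)         // flag set: copy the old line back
                    memory[line.tag * lines.size() + addr % lines.size()] = line.data;
                line.valid = true;                    // fetch the new word from main memory
                line.dirty = false;
                line.tag = tag;
                line.data = memory[addr];
            }
            return line.data;
        }

        void write(uint32_t addr, uint32_t value) {
            read(addr);                               // ensure the line is present
            Line& line = lines[addr % lines.size()];
            line.data = value;                        // write the cache only...
            line.dirty = true;                        // ...and flag the line as modified
        }
    };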


URL: https://www.sciencedirect.com/science/article/pii/B978075065064950016X

Domain 2

Eric Conrad, ... Joshua Feldman, in Eleventh Hour CISSP® (Third Edition), 2017

Cache memory

Cache memory is the fastest system memory, required to keep up with the CPU as it fetches and executes instructions. The data most frequently used by the CPU is stored in cache memory. The fastest portion of the CPU cache is the register file, which contains multiple registers. Registers are small storage locations used by the CPU to store instructions and data.

The next fastest form of cache memory is Level 1 cache, located on the CPU itself. Finally, Level 2 cache is connected to (but outside of) the CPU. Static random-access memory (SRAM) is used for cache memory.


URL: https://www.sciencedirect.com/science/article/pii/B9780128112489000024

Performance improvement methods

Shigeyuki Takano, in Thinking Machines, 2021

6.7 Summary of performance improvement methods

Cache memory in traditional CPUs and GPUs is unsuitable for deep learning tasks. There is a mismatch between the locality of the data and the locality handling of the cache memory architecture. In addition, cache memory does not support exclusive storage at each level of the memory hierarchy, which means data blocks must be copied multiple times across the levels, making for an inefficient storage hierarchy.

Table 6.7 summarizes the hardware performance improvement methods, excluding deeper pipelining of the logic circuit. The applied phase is categorized into inference, training, or a combination of both, and the applied target into parameters, activations, the input vector, or any data used. For quantization, the quantization error is an issue for the accuracy of the inference phase; several proposals add an interpolation term to the loss function so that the learned parameters account for the quantization error. Approximation must likewise add an interpolation term to the loss function because it directly affects the inference error. Co-design of the hardware and software is one recent trend.

Table 6.7. Summary of hardware performance improvement methods.

Method | Description | Applied Phase | Applying Target | Purpose, etc.
Model Compression | Reduction in the total number of parameters and/or activation functions | | |
Pruning | Pruning edges and units that do not contribute to accuracy | Training | Activations, Parameters | Computation complexity and data reductions
Dropout | Invalidating activations by probability | Training | Activations | Computation complexity and data reductions
DropConnect | Invalidating edges by probability | Training | Parameters | Computation complexity and data reductions
PCA | Lowering input vector by finding uncorrelated elements | Inference | Input Data (Vector) | Data reduction
Distillation | Knowledge distillation from a larger model to a smaller model | Training | Parameters | Computation complexity and data reductions
Tensor Factorization | Tensor approximation with low-rank tensors | Training | Parameters | Computation complexity and data reductions
Weight-sharing | Sharing the weights intra-layer and/or inter-layer | After training | Parameters | Data reduction
Numerical Compression | Parameter size reduction | | |
Quantization | Quantizing numerical representation | Inference, Training | Activations, Parameters | Interpolation term needed for loss function (option)
Edge-cutting | Value cutting under threshold value | Inference, After training | | Computation complexity and data reduction with zero-skipping
Encoding | Compression/decompression by data encoding | Inference, Training | Activations, Parameters, Temporal Variables | Data reduction
Zero-Skipping | Operation skipping when operand has zero | Inference and Training | Operation | Computation complexity and data reductions
Approximation | Approximate functions and operators | Inference | Activations, Parameters | Interpolation term is needed for loss function
Optimization | | | |
Model Optimization | Constraining to model topology | After training | Topology | Computation complexity and data reductions
Data-Flow Optimization | Use of data recycling and/or locality | Inference and Training | All Data-Flows | Memory access optimization


URL: https://www.sciencedirect.com/science/article/pii/B9780128182796000165

CACHES

ANDREW N. SLOSS, ... CHRIS WRIGHT, in ARM System Developer's Guide, 2004

12.2.1 BASIC ARCHITECTURE OF A CACHE MEMORY

A simple cache memory is shown on the right side of Figure 12.4. It has three main parts: a directory store, a data section, and status information. All three parts of the cache memory are present for each cache line.

The cache must know where the information stored in a cache line originates from in main memory. It uses a directory store to hold the address identifying where the cache line was copied from main memory. The directory entry is known as a cache-tag.

A cache memory must also store the data read from main memory. This information is held in the data section (see Figure 12.4).

The size of a cache is defined as the actual code or data the cache can store from main memory. Not included in the cache size is the cache memory required to support cache-tags or status bits.

There are also status bits in cache memory to maintain state information. Two common status bits are the valid bit and dirty bit. A valid bit marks a cache line as active, meaning it contains live data originally taken from main memory and is currently available to the processor core on demand. A dirty bit defines whether or not a cache line contains data that is different from the value it represents in main memory. We explain dirty bits in more detail in Section 12.3.1.
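The three parts described above can be pictured as a per-line record. The sketch below is an assumed layout only; the 32-byte data section and the field names are illustrative, not values taken from a particular ARM cache.

    #include <cstdint>

    // One cache line: directory entry (cache-tag), data section, and status bits.
    struct CacheLine {
        uint32_t tag;       // directory store: identifies where in main memory the line came from
        uint8_t  data[32];  // data section: the copy of code or data from main memory
        bool     valid;     // status bit: line holds live data available to the processor core
        bool     dirty;     // status bit: line differs from its main-memory copy
    };

Note that, as stated above, only the data sections count toward the cache size; the storage for the cache-tags and status bits is excluded.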


URL: https://www.sciencedirect.com/science/article/pii/B9781558608740500137

Dataflow Processing

Zivojin Sustran, ... Veljko Milutinovic, in Advances in Computers, 2015

5 Problem Statement for the Analysis

Traditional cache memory architectures are based on the locality property of common memory reference patterns. This means that a part of the content of the main memory is replicated in smaller and faster memories closer to the processor. The processor can then access this data in a nearby fast cache, without suffering long penalties of waiting for main memory access.

Increasing the performance of a cache system is typically done by enlarging the cache. This has led to caches that consume an ever-growing part of modern microprocessor chips. However, bigger caches induce longer latencies. Thus, making a cache larger, beyond a certain point, becomes counter-productive. Beyond that point, further increasing the performance of the cache is a difficult problem, as indicated in Ref. [17].


URL: https://www.sciencedirect.com/science/article/pii/S0065245814000060

Performance Estimation of Embedded Software with Instruction Cache Modeling

YAU-TSUN STEVEN LI, ... ANDREW WOLFE, in Readings in Hardware/Software Co-Design, 2002

4.1 Modified Cost Function

With cache memory, each instruction fetch will result in either a cache hit or a cache miss, which may in turn result in two very different instruction execution times. The simple microarchitecture model, in which each instruction takes a constant time to execute, no longer models this situation accurately. We need to subdivide the original instruction counts into counts of cache hits and misses. If we can determine these counts, and the hit and miss execution times of each instruction, then a tighter bound on the execution time of the program can be established.

As in the previous section, we can group adjacent instructions together. We define a new type of atomic structure for analysis, the line-block or simply l-block. An l-block is defined as a contiguous sequence of code within the same basic block that is mapped to the same cache set in the instruction cache. In other words, the l-blocks are formed by the intersection of basic blocks with cache line boundaries. All instructions within an l-block are always executed together in sequence. Further, since the cache controller always loads a line of code into the cache, these instructions are either in the cache completely or not in the cache at all. These are denoted as a cache hit or a cache miss, respectively, of the l-block.
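As a hedged illustration of this construction, the sketch below splits one basic block into l-blocks given an assumed cache line size and number of cache sets; the function name and parameters are illustrative and do not come from the paper.

    #include <cstdint>
    #include <vector>

    // An l-block: a contiguous piece of a basic block that lies within a
    // single cache line and therefore maps to a single cache set.
    struct LBlock {
        uint32_t startAddr;   // address of the first instruction in the l-block
        uint32_t size;        // number of bytes covered by the l-block
        uint32_t cacheSet;    // cache set the l-block maps to
    };

    // Partition a basic block [blockStart, blockStart + blockSize) at cache
    // line boundaries (assumed direct-mapped instruction cache).
    std::vector<LBlock> partitionIntoLBlocks(uint32_t blockStart, uint32_t blockSize,
                                             uint32_t lineSize, uint32_t numSets) {
        std::vector<LBlock> lblocks;
        uint32_t addr = blockStart;
        uint32_t end = blockStart + blockSize;
        while (addr < end) {
            uint32_t lineEnd = (addr / lineSize + 1) * lineSize;   // next cache line boundary
            uint32_t pieceEnd = lineEnd < end ? lineEnd : end;
            lblocks.push_back({addr, pieceEnd - addr, (addr / lineSize) % numSets});
            addr = pieceEnd;
        }
        return lblocks;
    }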

Figure 2(i) shows a CFG with 3 basic blocks. Suppose that the instruction cache has 4 cache sets. Since the starting address of each basic block can be determined from the program's executable code, we can find all cache sets that each basic block is mapped to, and add an entry on these cache lines in the cache table (Figure 2(ii)). The boundary of each l-block is shown by the solid line rectangle. Suppose a basic block $B_i$ is partitioned into $n_i$ l-blocks. We denote these l-blocks $B_{i.1}, B_{i.2}, \ldots, B_{i.n_i}$.


Fig. 2. An example showing how the l-blocks are constructed. Each rectangle in the cache table represents an l-block.

Any two l-blocks that are mapped to the same cache set will conflict with each other if they have different address tags. The execution of one l-block will displace the cache content of the other. For instance, l-block $B_{1.1}$ conflicts with l-block $B_{3.1}$ in Figure 2. There are also cases where two l-blocks do not conflict with each other. This situation happens when the basic block boundary is not aligned with the cache line boundary. For instance, l-blocks $B_{1.3}$ and $B_{2.1}$ in Figure 2 each occupy a partial cache line, so they do not conflict with each other. They are called nonconflicting l-blocks.

Since l-block $B_{i.j}$ is inside the basic block $B_i$, its execution count is equal to $x_i$. The cache hit and the cache miss counts of l-block $B_{i.j}$ are denoted $x_{i.j}^{hit}$ and $x_{i.j}^{miss}$, respectively, and

$$x_i = x_{i.j}^{hit} + x_{i.j}^{miss}, \qquad j = 1, 2, \ldots, n_i. \tag{12}$$

The new total execution time (cost function) is given by

$$\text{total execution time} = \sum_{i=1}^{N} \sum_{j=1}^{n_i} \left( c_{i.j}^{hit}\, x_{i.j}^{hit} + c_{i.j}^{miss}\, x_{i.j}^{miss} \right) \tag{13}$$

where $c_{i.j}^{hit}$ and $c_{i.j}^{miss}$ are, respectively, the hit cost and the miss cost of the l-block $B_{i.j}$.

Equation (12) links the new cost function (13) with the program structural constraints and the program functionality constraints, which remain unchanged. In addition, the cache behavior can now be specified in terms of the new variables $x_{i.j}^{hit}$ and $x_{i.j}^{miss}$.
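Once the hit and miss counts are known, evaluating the cost function (13) is a simple accumulation. The sketch below only illustrates that arithmetic; the data layout and names are assumptions, not part of the paper's formulation.

    #include <vector>

    // Per-l-block counts and costs: x_{i.j}^{hit}, x_{i.j}^{miss}, c_{i.j}^{hit}, c_{i.j}^{miss}.
    struct LBlockCost {
        long hitCount, missCount;
        long hitCost, missCost;
    };

    // Evaluate (13): sum over basic blocks i = 1..N and l-blocks j = 1..n_i.
    long totalExecutionTime(const std::vector<std::vector<LBlockCost>>& blocks) {
        long total = 0;
        for (const auto& basicBlock : blocks)
            for (const auto& lb : basicBlock)
                total += lb.hitCost * lb.hitCount + lb.missCost * lb.missCount;
        return total;
    }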


URL: https://www.sciencedirect.com/science/article/pii/B9781558607026500156

Modern Architectures

Bertil Schmidt, ... Moritz Schlarb, in Parallel Programming, 2018

Cache Algorithms

The cache memory is a resource that does not need to be explicitly managed by the user. Instead, the cache is managed by a set of cache replacement policies (also called cache algorithms) that determine which data is stored in the cache during the execution of a program. To be both cost-effective and efficient, caches are usually several orders-of-magnitude smaller than main memory (e.g., there are typically a few KB of L1-cache and a few MB of L3-cache versus many GB or even a few TB of main memory). As a consequence, the dataset that we are currently working on (the working set) can easily exceed the cache capacity for many applications. To handle this limitation, cache algorithms are required that address the questions of

Which data do we load from main memory and where in the cache do we store it?

If the cache is already full, which data do we evict?

If the CPU requests a data item during program execution, it is first determined whether it is already stored in cache. If this is the case, the request can be serviced by reading from the cache without the need for a time-consuming main memory transfer. This is called a cache hit. Otherwise, we have a cache miss. Cache algorithms aim at optimizing the hit ratio; i.e. the percentage of data requests resulting in a cache hit. Their design is guided by two principles:

Spatial locality. Many algorithms access data from contiguous memory locations with high spatial locality. Consider the following code fragment to determine the maximum value of an array a of size n (whereby the elements of a[] are stored contiguously):

     for (int i = 0; i < n; i++)
         maximum = max(a[i], maximum);

Assume the cache is initially empty. In the first iteration the value a[0] is requested resulting in a cache miss. Thus, it needs to be loaded from main memory. Instead of requesting only a single value, an entire cache line is loaded with values from neighboring addresses. Assuming a typical cache line size of 64 B and double-precision floating-point values, this would mean that the eight consecutive values a[0], a[1], a[2], a[3], a[4], a[5], a[6], and a[7] are loaded into cache. The next seven iterations will then result in cache hits. The subsequent iteration requests a[8] resulting again in a cache miss, and so on. Overall, the hit ratio in our example is as high as 87.5% thanks to the exploitation of spatial locality.
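The arithmetic above can be checked with a short sketch. The array length below is an arbitrary assumption; the calculation simply charges one miss per 64 B cache line touched by the sequential scan and counts every other access as a hit.

    #include <iostream>

    int main() {
        const int n = 1024;                                    // array length (assumed)
        const int elementsPerLine = 64 / int(sizeof(double));  // 8 doubles per 64 B cache line

        int misses = (n + elementsPerLine - 1) / elementsPerLine;  // one miss per cache line
        int hits = n - misses;                                     // remaining accesses hit

        std::cout << "hit ratio: " << 100.0 * hits / n << "%\n";   // prints 87.5%
    }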

Temporal locality. The cache is organized into a number of blocks (cache lines) of fixed size (e.g. 64 B). The cache mapping strategy decides in which location in the cache a copy of a particular entry of main memory will be stored. In a direct-mapped cache, each block from main memory can be stored in exactly one cache line. Although this mode of operation can be easily implemented, it generally suffers from a high miss rate. In a two-way set associative cache, each block from main memory can be stored in one of two possible cache lines (as illustrated in Fig. 3.3). A commonly used policy in order to decide which of the two possible locations to choose is based on temporal locality and is called least-recently used (LRU). LRU simply evicts the least recently accessed entry. Going from a direct-mapped cache to a two-way set associative cache can improve the hit ratio significantly [2]. A generalization of the two-way set associative cache is called fully associative. In this approach, the replacement strategy is free to choose any cache line to hold the copy from main memory. Even though the hit rate might be improved further, the costs associated with implementing a fully associative cache are often prohibitive. Therefore, n-way associative cache designs with n = 2, 4, or 8 are usually preferred in practice.


Figure 3.3. Illustration of (A) a direct-mapped cache and (B) a two-way associative cache.
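A hedged sketch of a two-way set associative cache with LRU replacement, in the spirit of the description above, follows; the set count, line size, and class name are assumptions made for illustration.

    #include <cstdint>
    #include <vector>

    // Two-way set associative cache with LRU replacement: each main-memory
    // block maps to one set and may live in either of that set's two ways;
    // on a miss the least recently used way is evicted.
    class TwoWaySetAssociativeCache {
        struct Way { bool valid = false; uint64_t tag = 0; };
        struct Set { Way ways[2]; int lruWay = 0; };   // index of the next eviction victim

        std::vector<Set> sets;
        uint64_t lineSize;

    public:
        TwoWaySetAssociativeCache(size_t numSets, uint64_t lineBytes)
            : sets(numSets), lineSize(lineBytes) {}

        // Returns true on a cache hit; on a miss, fills the LRU way and returns false.
        bool access(uint64_t address) {
            uint64_t block = address / lineSize;
            Set& set = sets[block % sets.size()];
            uint64_t tag = block / sets.size();

            for (int w = 0; w < 2; ++w) {
                if (set.ways[w].valid && set.ways[w].tag == tag) {
                    set.lruWay = 1 - w;                // the other way is now the LRU one
                    return true;                       // cache hit
                }
            }
            Way& victim = set.ways[set.lruWay];        // cache miss: evict the LRU way
            victim.valid = true;
            victim.tag = tag;
            set.lruWay = 1 - set.lruWay;               // filled way becomes most recently used
            return false;
        }
    };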


URL: https://www.sciencedirect.com/science/article/pii/B9780128498903000034

The Cache Layer

Bruce Jacob, ... David T. Wang, in Memory Systems, 2008

22.2.1 Desirable Features of Cache Organization

The disk cache memory space is relatively small, so it is important to use it efficiently. The more user data that can be stored in the cache, the more likely a read request can get a cache hit out of it. A good cache organization should have the following features:

High space utilization: Cache space allocation should not result in lots of wasted space. This may seem obvious, but if a relatively large-size cache allows for only a limited number of segments, then each segment may be much bigger than what is required to hold the data.

Single copy of data: There should not be more than one copy of data for any LBA in the cache. This is a generalization of the shared memory space between the reads and writes concept discussed in the previous section. Again, this may seem obvious, but in practice it does happen with a number of cache organizations that the same LBA can appear in multiple segments. A cache organization allowing such duplication must ensure correctness of behavior, and the wasted space should be the cost of increased performance (e.g., similar duplication is possible in trace caches; see “Another Dynamic Cache Block: Trace Caches” in Chapter 2, Section 2.6.3).

One cache memory space: The cache space should not be partitioned into a read cache and a write cache, as discussed in the previous section.

Ease of allocation: Finding space for a new segment should be a simple task, as it is part of the overhead for handling a command and affects the command's response time.

In this section, the basic structures of three main classes of organization will be discussed. They are the segmentation, circular buffer, and virtual memory schemes.


URL: https://www.sciencedirect.com/science/article/pii/B9780123797513500242

Which of the following does a computer memory retain?

The device marked as (i) is RAM, which is the primary memory of the computer. It retains data only while the computer is switched on.

What are the types of memory used to store data?

There are mainly two types of semiconductor memory: random-access memory (RAM) and read-only memory (ROM). RAM is a temporary data storage domain, whereas ROM serves as a semi-permanent storage domain. If RAM is likened to notebooks or memo pads, then ROM is comparable to dictionaries and textbooks.

Which memory holds the information when the power supply is switched off?

EEPROM (electrically erasable programmable read-only memory) holds the information when the power supply is switched off.

What kind of memory holds data only when the power is on?

Volatile memory is a type of memory that maintains its data only while the device is powered. If the power is interrupted for any reason, the data is lost.