第五章(优化程序性能)

编译器优化的限制

generally useful optimization

code motion:减少计算的频次,把相同的重复计算移动到循环外
reduction in strength:将昂贵的计算换成更便宜的(如乘变加和移位)
share common subexpressions:重复使用部分表达式

Opmization on Blocker

#1: Procedure Calls
- 实际上该代码的运行及其低效：每次循环都算一次strlen(s),程序是平方增长的
- 前面的编译器优化程度开高点都行,但是这个优化不了:因为编译器必须假定函数调用是黑盒,它无法确定该函数的调用会产生什么副作用
  - 可以内联来搞,也不算太good
#2:Memory
- 编译器不能确定有无别名(aliase,即两个变量指向同一片地址),所以只能一遍一遍地从内存中读出来,算完再放回去
aliase实际上在c中很容易发生:可以对指针进行运算操作,同时直接访问存储结构,程序员能做的就是习惯引入局部变量

Exploing Instruction-Level Parallelism (指令级并行)

benchmark的两个小技巧:使用抽象的data_type 和operation实现代码复用

superscalar processor:先读取一系列指令然后根据依赖关系算
流水线:(原本3*3个周期下降到7个周期)
但是有些时候你的计算过程具有顺序依赖性(时间取决于latency(执行时间))
- way:loop unrolling and reassociation
  - 事实证明这样极其有效,但是坏消息是计算浮点数很可能不能这么干
unrolling and accumulating的推广
还有一种矢量加法优化器
分支预测:猜某个分支然后运行(但是并不真正改变寄存器和内存的值)

memory hierarchy

storage technologies and trends

RAM(random-access memory):一组芯片,单元为cell(one bit per cell),许多RAM就形成了内存,分为SRAM和DRAM
- sram更复杂,高速,无需错误检查什么的,很贵,当缓存用
- dram用作主存,frame buffers,需要refresh(持续供电以保持数据)
- dram sram都是volatile memory(易失内存,断电丢失)
non volatile:ROM(read-only memories)闪存也是一种ROM
- 用处:存bios等,solid state disks(SSD),disk caches

load operation: movq A %rax
- cpu 把地址放在memory bus上
- 主存从bus上读取A,检索相应数据x,再放回bus
- cpu从bus上读取并将其复制到寄存器中(store基本同理)
磁盘:
- 容量:经典虚标(用十进制来算,密度:长乘宽)
- 磁盘访问
  - 主要时间瓶颈是seek和rotation latency
- 现代磁盘为cpu提供了简单逻辑块抽象(sector),对其进行编号,然后控制负责维护从逻辑块到实际物理区域的映射
  - 有一部分cylinders是用来备用的,故障的时候把当前数据复制到备用的就可以接着用,所以format capacity 要小于actual capacity
- 从磁盘中读取:
  - cpu初始化了一个三元组:(指令逻辑块编号目标内存地址)并发送给磁盘控制器
  - 磁盘控制器读取并存放到内存中,当它读完了才会发送一个interupt信号提示cpu读取完毕
  - 这种设计可以让cpu在等待读取完成的同时干别的

locality of reference

principle of locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
- temporal locality:最近引用过的数据有可能在将来再次被引用
- spatial locality:引用过的数据的相邻位置的数据更有可能被引用
作为程序员最重要的能力:qulitative estimation of locality
- - 步长为1的线性访问模式,非常好的空间局部性

caching in memory hierarchy

缓存允许我们以最高层的速度访问处理最底层容量的数据
数据在缓存中:缓存命中(hit),反之则是(miss)

cache memories

Cache memory organization and operation

read
- 当e=1时,称direct mapped cache,实际上相当之低效
- 块的设计初衷就是为了利用空间局部性,但是大小设计相当棘手(trade-off)
- 缓存中的替换:很多种算法,最近最少,你要重新写回内存因为可能发生变化
write
衡量缓存的指标

performance impact of caches

memory mountain
- read throughput:number of bytes read from memory per second
- memory mountain:measured read throughput as a function of spatial and temperal locality
- 增大stride:减小空间局部性,增大size:减小时间局部性(因为size更大了之后高级缓存放不下了)
rearranging loop for better spatial locality
- 这种有额外的存储开销:但是在现代系统中写比读容易得多,不会成为瓶颈
blocking

cachelab

getopt():包含在unistd.h中的函数,用于解析程序的命令行参数
- int getopt(int argc,char* const argv[],const char* opstring)
- opstring:"abc:"表示-a -b普通选项加上一个必须有参数的-c选项
- 两个冒号为可选参数
模拟内存访问

#include "cachelab.h"
#include <unistd.h>
#include <getopt.h>
#include <stdlib.h>
#include <stdio.h>
#include <limits.h> // 为了使用 INT_MAX

typedef struct cache_line {
    int validbit;
    unsigned long tags; // 地址是64位的，建议用 unsigned long
    int LRU;            // 记录时间戳
} cache_line;

typedef struct cache {
    cache_line** item;
    int S;      // 组的数量 (2^s)
    int E;      // 每组的行数 (Associativity)
    int B;      // 块大小 (2^b)
    int s_bits; // 保存原始的 s 值，用于位移计算
    int b_bits; // 保存原始的 b 值，用于位移计算
    int time;   // 全局时间计数器
} cache;

typedef struct state {
    int hits;
    int misses;
    int evictions;
} state;

/**
 * 核心缓存访问逻辑
 * 合并了 load/store/modify 的公共部分
 */
void access_cache(unsigned long address, state* s, cache* c) {
    // 1. 计算组索引 (Set Index)
    // 逻辑：(地址 >> b位) & (S - 1)
    unsigned long set_index = (address >> c->b_bits) & (c->S - 1);
    
    // 2. 计算标记 (Tag)
    // 逻辑：地址 >> (s位 + b位)
    unsigned long tag = address >> (c->s_bits + c->b_bits);
    
    cache_line* current_set = c->item[set_index];
    
    // --- 第一步：检查 Hit ---
    for (int i = 0; i < c->E; i++) {
        if (current_set[i].validbit == 1 && current_set[i].tags == tag) {
            s->hits++;
            current_set[i].LRU = c->time; // 更新 LRU 时间
            return;
        }
    }

    // --- 第二步：处理 Miss ---
    s->misses++;
    
    int empty_idx = -1; // 空闲行的索引
    int lru_idx = -1;   // 需要被驱逐行的索引
    int min_lru = INT_MAX;

    // 寻找空行 或者 LRU 最小的行
    for (int i = 0; i < c->E; i++) {
        if (current_set[i].validbit == 0) {
            empty_idx = i;
            break; // 找到空行直接停止，优先使用空行
        }
        if (current_set[i].LRU < min_lru) {
            min_lru = current_set[i].LRU;
            lru_idx = i;
        }
    }

    if (empty_idx != -1) {
        // 有空行，直接填入
        current_set[empty_idx].validbit = 1;
        current_set[empty_idx].tags = tag;
        current_set[empty_idx].LRU = c->time;
    } else {
        // 没有空行，发生驱逐 (Eviction)
        s->evictions++;
        current_set[lru_idx].tags = tag;
        current_set[lru_idx].LRU = c->time;
        // validbit 已经是 1 了，无需修改
    }
}

int main(int argc, char* argv[]) {
    state* sta = (state*)malloc(sizeof(state));
    sta->hits = 0;
    sta->misses = 0;
    sta->evictions = 0;

    int option;
    int s_in = 0, e_in = 0, b_in = 0; // 接收输入的参数
    FILE* f = NULL;

    // 解析命令行参数
    while ((option = getopt(argc, argv, "hvs:E:b:t:")) != -1) {
        switch (option) {
            case 'h':
                printf("Usage: ./csim -s <num> -E <num> -b <num> -t <file>\n");
                exit(0);
            case 'v':
                // verbose 模式，如有需要可添加打印逻辑
                break;
            case 's':
                s_in = atoi(optarg);
                break;
            case 'E':
                e_in = atoi(optarg);
                break;
            case 'b':
                b_in = atoi(optarg);
                break;
            case 't':
                f = fopen(optarg, "r");
                break;
        }
    }

    if (s_in == 0 || e_in == 0 || b_in == 0 || f == NULL) {
        printf("Missing required arguments.\n");
        return 1;
    }

    // 初始化 Cache 结构体
    cache* cache_si = (cache*)malloc(sizeof(cache));
    cache_si->s_bits = s_in;       // 保存位数 s
    cache_si->b_bits = b_in;       // 保存位数 b
    cache_si->S = 1 << s_in;       // 计算 S = 2^s
    cache_si->E = e_in;
    cache_si->B = 1 << b_in;       // 计算 B = 2^b
    cache_si->time = 0;

    // 动态分配二维数组
    cache_si->item = (cache_line**)malloc(sizeof(cache_line*) * cache_si->S);
    for (int i = 0; i < cache_si->S; i++) {
        cache_si->item[i] = (cache_line*)malloc(sizeof(cache_line) * cache_si->E);
        for (int j = 0; j < e_in; j++) {
            cache_si->item[i][j].LRU = 0;
            cache_si->item[i][j].tags = 0;
            cache_si->item[i][j].validbit = 0;
        }
    }

    char operation;
    unsigned long address;
    int size;

    // 读取 trace 文件
    // 注意 " %c" 前面的空格，用于跳过空白符
    while (fscanf(f, " %c %lx,%d", &operation, &address, &size) != -1) {
        
        // 通常忽略指令加载 'I'
        if (operation == 'I') continue;

        cache_si->time++; // 增加全局时间

        switch (operation) {
        case 'L': // Load 数据加载
            access_cache(address, sta, cache_si);
            break;
        
        case 'S': // Store 数据存储
            access_cache(address, sta, cache_si);
            break;

        case 'M': // Modify 数据修改 (先读后写)
            access_cache(address, sta, cache_si); // 第一次访问（可能 Miss/Evict）
            sta->hits++;                          // 第二次访问（必定 Hit）
            break;
        }
    }

    printSummary(sta->hits, sta->misses, sta->evictions);

    // 释放内存 (良好的编程习惯)
    fclose(f);
    for (int i = 0; i < cache_si->S; i++) {
        free(cache_si->item[i]);
    }
    free(cache_si->item);
    free(cache_si);
    free(sta);

    return 0;
}

针对特定size的矩阵的block优化

void transpose_submit(int M, int N, int A[N][M], int B[M][N])
{
    for (int i = 0; i < 32; i += 8) {
    for (int j = 0; j < 32; j += 8) {
        for (int ii = i; ii < i + 8; ii++) {
            for (int jj = j; jj < j + 8; jj++) {
                B[jj][ii] = A[ii][jj];
            }
        }
    }
}

void transpose_64x64(int M, int N, int A[N][M], int B[M][N])
{
    int a_0, a_1, a_2, a_3, a_4, a_5, a_6, a_7;
    for (int i = 0; i < 64; i += 8){
        for (int j = 0; j < 64; j += 8){
            for (int k = i; k < i + 4; k++){
                // 得到A的第1,2块
                a_0 = A[k][j + 0];
                a_1 = A[k][j + 1];
                a_2 = A[k][j + 2];
                a_3 = A[k][j + 3];
                a_4 = A[k][j + 4];
                a_5 = A[k][j + 5];
                a_6 = A[k][j + 6];
                a_7 = A[k][j + 7];
                // 复制给B的第1,2块
                B[j + 0][k] = a_0;
                B[j + 1][k] = a_1;
                B[j + 2][k] = a_2;
                B[j + 3][k] = a_3;
                B[j + 0][k + 4] = a_4;
                B[j + 1][k + 4] = a_5;
                B[j + 2][k + 4] = a_6;
                B[j + 3][k + 4] = a_7;
            }
            for (int k = j; k < j + 4; k++){
                // 得到B的第2块
                a_0 = B[k][i + 4];
                a_1 = B[k][i + 5];
                a_2 = B[k][i + 6];
                a_3 = B[k][i + 7];
                // 得到A的第3块
                a_4 = A[i + 4][k];
                a_5 = A[i + 5][k];
                a_6 = A[i + 6][k];
                a_7 = A[i + 7][k];
                // 复制给B的第2块
                B[k][i + 4] = a_4;
                B[k][i + 5] = a_5;
                B[k][i + 6] = a_6;
                B[k][i + 7] = a_7;
                // B原来的第2块移动到第3块
                B[k + 4][i + 0] = a_0;
                B[k + 4][i + 1] = a_1;
                B[k + 4][i + 2] = a_2;
                B[k + 4][i + 3] = a_3;
            }
            for (int k = i + 4; k < i + 8; k++)
            {
                // 处理第4块
                a_4 = A[k][j + 4];
                a_5 = A[k][j + 5];
                a_6 = A[k][j + 6];
                a_7 = A[k][j + 7];
                B[j + 4][k] = a_4;
                B[j + 5][k] = a_5;
                B[j + 6][k] = a_6;
                B[j + 7][k] = a_7;
            }
        }
    }
}

其它的分块思路:先分8*8再在小块里面分4*4

csapp_cachelab

第五章(优化程序性能)

generally useful optimization

Opmization on Blocker

Exploing Instruction-Level Parallelism (指令级并行)

memory hierarchy

storage technologies and trends

locality of reference

caching in memory hierarchy

cache memories

Cache memory organization and operation

performance impact of caches

cachelab