《FPGA并行编程-- 以HLS实现信号处理为例》学习(一)

该书的中文翻译版本见FPGA并行编程

LAB和代码见Parallel Programming for FPGAs: Projects and Labs

前言

  • 设计者必须充分理解内存层级(应该是多级缓存)和带宽空间局部性时间局部性并行结构计算与存储之间的取舍与平衡。

第一章 简介

  • FPGA由一个可编程逻辑模块的矩阵和与之相连的内存组成,通常这些模块是以**查找表(LUT)**的形式存在。(目前主流FPGA都采用了基于SRAM工艺的查找表结构【断电程序就没了】,也有一些军品和宇航级FPGA采用Flash或者熔丝与反熔丝工艺的查找表结构【断电后程序依然存在】)
  • 目前,FPGA市场占有率最高的两大公司Xilinx和Altera生产的FPGA都是基于SRAM(latching circuitry (flip-flop) to store each bit)工艺的,需要在使用时外接一个片外存储器以保存程序。上电时,FPGA将外部存储器中的数据读入片内RAM,完成配置后,进入工作状态;掉电后FPGA恢复为白片,内部逻辑消失。
  • 查找表(Look-Up-Table, LUT):
    • 当用户通过原理图或HDL语言描述了一个逻辑电路以后,PLD/FPGA开发软件会自动计算逻辑电路的所有可能的结果,并把结果事先写入RAM,这样,每输入一个信号进行逻辑运算就等于输入一个地址进行查表,找出地址对应的内容,然后输出即可。
    • 实际中的FPGA大多使用4-6位输入的查找表作为运算基础, 大型FPGA内甚至有几百万个这一级别的查找表。
  • 触发器(flip-flop, FF):
    • 最基本的内存单位, 貌似学过SR、D、T、JK触发器(锁存器)。
      • SR触发器两个与非门或者两个或非门都可以组成,只有这个没有CLK信号控制:
        • 当R与S皆为低电位,Q与Q(Q的反相)保持于一个固定的状态。
        • 当S(Set)为高电位,R(Reset)为低电位时,输出Q会被强制设置为高电位;
        • 相反的,当S为低电位,R为高电位时,输出Q会被强制设置为低电位。
        • 同时为高电位,是没有定义的行为。
      • D触发器
        • Qnext=DQ_{next}=D
        • 当时脉由0转为1时,输出的值会和输入的值相等。此类触发器可用于防止因为噪声所带来的错误,以及通过管线增加处理资料的数量。
      • JK触发器
        • Qnext=KQ+JQQ_{next}=\overline{K} Q + J \overline{Q}
        • JK触发器是带了时钟的SR触发器,J=S,K=R,当然也多了一个同时为高电平的逻辑。
      • T 触发器
        • Qnext=TQ+TQ=QTQ_{next}=\overline{T} Q + T \overline{Q} = Q \oplus T
        • 把JK触发器的J和K输入点连接在一起,即构成一个T触发器

第二章 FIR滤波器

  • ​有限脉冲响应(FIR)是数字信号处理中应用最广泛的运算。因为它们可以采用高度优化的体系结构,所以它们非常适合于硬件实现。
  • 卷积公式:y[i]=j=0N1h[j]x[ij]y[i]=\sum_{j=0}^{N-1} h[j] \cdot x[i-j]
  • 不同变量类型使用 typedef; 方便地更改数据类型。

最初版本代码

#define N 11
#include "ap_int.h"

typedef int coef_t;
typedef int data_t;
typedef int acc_t;

/**
 * @description: 
 * @param {data_t} *y 经过滤波器的信号输出。
 * @param {data_t} x 信号输入。
 * @return {*}
 */
void fir(data_t *y, data_t x)
{
    coef_t C[N] = {
        53, 0, -91, 0, 313, 500, 313, 0, -91, 0, 53};
    // 可以把fir看成一个多次调用的函数。
    // 其中的static保证了每次进入不会初始化。
    // 这是声明一个放在RAM里面的数组。
    // shift_reg 是反着存的, 0:最近的一次的x
    // 这是一个11阶滤波器,我们必须存储之前的10个x的数据
    static data_t shift_reg[N];
    acc_t acc; // 累加量, accumulation
    int i;
    acc = 0;
Shift_Accum_Loop:
    for (i = N - 1; i >= 0; i--)
    {
        if (i == 0)
        {
            // 最后一次就需要累加acc+保存该x值
            acc += x * C[0];
            shift_reg[0] = x;
        }
        else
        {
            // 对shift做一个向右移位操作
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * C[i];
        }
    }
    // 只输出一个int,感觉应该可以直接return呀?
    *y = acc;
}

上述代码它是串行执行的,并且使用了大量不必要的控制逻辑。

其综合结果为:

去除for循环内部的if/else

  • for循环内部的if/else语句效率很低。可以将条件提出来,拆成两个循环:
TDL:
    for (i = N - 1; i > 0; i--)
    {
        shift_reg[i] = shift_reg[i - 1];
    }
    shift_reg[0] = x;

    acc = 0;
MAC:
    for (i = N - 1; i >= 0; i--)
    {
        acc += shift_reg[i] * C[i];
    }

其综合结果为:

可以注意到,这个结果比起最初结果:

  • II(initial interval) 降低为1,即吞吐量提高了。
  • 触发器(FF)数量显著降低。

手动将循环展开

  • 将TDL部分将循环展开。
TDL:
    for (i = N - 1; i > 1; i = i - 2)
    {
        shift_reg[i] = shift_reg[i - 1];
        shift_reg[i - 1] = shift_reg[i - 2];
    }
    if (i == 1)
    {
        shift_reg[1] = shift_reg[0];
    }
    shift_reg[0] = x;
  • 结果如下, 改进不大:

最终版本

  • 直接对函数pipeline,中间的循环都会自动unroll。
  • 将两个数组完全拆分。
  • 使用ap_int类型。
#include "ap_int.h"
const int N=11;

typedef ap_int<10> coef_t;
typedef ap_int<18> data_t;
typedef ap_int<18> acc_t;

void fir(data_t *y, data_t x)
{
	const coef_t C[N] = {
		53, 0, -91, 0, 313, 500, 313, 0, -91, 0, 53};
	static data_t shift_reg[N];
	acc_t acc;
	int i;


TDL:
	for (i = N - 1; i > 0; i--)
	{
		shift_reg[i] = shift_reg[i - 1];
	}
	shift_reg[0] = x;

	acc = 0;
MAC:
	for (i = N - 1; i >= 0; i--)
	{
		acc += shift_reg[i] * C[i];
	}
	*y = acc;
}



第三章 CORDIC

简介

  • CORDIC核心思想是在一个二维平面上高效地执行一组矢量旋转。
  • CORDIC已经广泛应用于信号处理、机器人技术、通信和许多科学计算领域。
  • 由于CORDIC占用资源少,所以常用在FPGA设计中。
  • 旋转矩阵
    R(θ)=[cosθsinθsinθcosθ]R(\theta)=\left[\begin{array}{cc}\cos \theta & -\sin \theta \\ \sin \theta & \cos \theta\end{array}\right]
  • 高效地矩阵旋转
    • 经过下列变换
      cos(θi)=11+tan2(θi)sin(θi)=tan(θi)1+tan2(θi)\begin{aligned} \cos \left(\theta_{i}\right) &=\frac{1}{\sqrt{1+\tan ^{2}\left(\theta_{i}\right)}} \\ \sin \left(\theta_{i}\right) &=\frac{\tan \left(\theta_{i}\right)}{\sqrt{1+\tan ^{2}\left(\theta_{i}\right)}} \end{aligned}
    • 矩阵变成:
      Ri=11+tan2(θi)[1tan(θi)tan(θi)1]R_{i}=\frac{1}{\sqrt{1+\tan ^{2}\left(\theta_{i}\right)}}\left[\begin{array}{cc}1 & - \tan \left(\theta_{i}\right) \\ \tan \left(\theta_{i}\right) & 1\end{array}\right]
    • 限制tan(θi)\tan \left(\theta_{i}\right)是2的幂次,就可以将旋转变成数据移位的运算了。令tan(θi)=2i\tan \left(\theta_{i}\right)=2^{-i}, 则有
      vi=Ki[12i2i1][xi1yi1]v_{i}=K_{i}\left[\begin{array}{cc}1 & -2^{-i} \\ 2^{-i} & 1\end{array}\right]\left[\begin{array}{l}x_{i-1} \\ y_{i-1}\end{array}\right]
      Ki=11+22iK_{i}=\frac{1}{\sqrt{1+2^{-2 i}}}
    • 上述只能旋转一些固定角度,但是可以用多次旋转去修正错误。其中σ\sigma为1或-1表示修正
      vi=Ki[1σi2iσi2i1][xi1yi1]v_{i}=K_{i}\left[\begin{array}{cc}1 & -\sigma_{i} 2^{-i} \\ \sigma_{i} 2^{-i} & 1\end{array}\right]\left[\begin{array}{l}x_{i-1} \\ y_{i-1}\end{array}\right]
      K(n)=i=0n1Ki=i=0n111+22iK(n)=\prod_{i=0}^{n-1} K_{i}=\prod_{i=0}^{n-1} \frac{1}{\sqrt{1+2^{-2 i}}}
    • 上述的K(n)是可以预先计算并储存的。
    • 向左右移位在硬件上基本是无消耗的。

直角坐标系转极坐标系:第一版代码

#define NO_ITER 16

typedef float data_t;

data_t Kvalues[NO_ITER] = {1, 0.500000000000000, 0.250000000000000, 0.125000000000000, 0.0625000000000000, 0.0312500000000000, 0.0156250000000000, 0.00781250000000000, 0.00390625000000000, 0.00195312500000000, 0.000976562500000000, 0.000488281250000000, 0.000244140625000000, 0.000122070312500000, 6.10351562500000e-05, 3.05175781250000e-05};

data_t angles[NO_ITER] = {0.785398163397448, 0.463647609000806, 0.244978663126864, 0.124354994546761, 0.0624188099959574, 0.0312398334302683, 0.0156237286204768, 0.00781234106010111, 0.00390623013196697, 0.00195312251647882, 0.000976562189559320, 0.000488281211194898, 0.000244140620149362, 0.000122070311893670, 6.10351561742088e-05, 3.05175781155261e-05};

const data_t pi = 3.1415926535897932384626;

/**
 * @description:  coordinate to polar axis
 * @param {data_t} x  
 * @param {data_t} y
 * @param {data_t} *r
 * @param {data_t} *theta : radian system
 * @return {*}
 */
void cordiccart2pol(data_t x, data_t y, data_t *r, data_t *theta)
{
	// ratation degree
	data_t angle;
	data_t x_copy;
	data_t y_copy;
	// make the initial vector into I or IV Quadrant
	if (y > 0)
	{
		x_copy = y;
		y_copy = -x;
		angle = 90 * pi / 180;
	}
	else
	{
		x_copy = -y;
		y_copy = x;
		angle = -90 * pi / 180;
	}

	// start rotating
	int sigma;
ROTATE:
	for (int i = 0; i < NO_ITER; i++)
	{
		if (y_copy > 0)
		{
			sigma = -1;
		}
		else
		{
			sigma = 1;
		}
		data_t y_temp = y_copy;
		data_t x_temp = x_copy;
		data_t K_temp = Kvalues[i];
		x_copy -= sigma * y_temp * K_temp;
		y_copy += sigma * x_temp * K_temp;
		angle -= sigma * angles[i];
	}

	*theta = angle;
	*r = x_copy * 0.607253;
}
  • ROTATE循环的条件判断中的y_copy是需要上一次的结果,所以没法切流水。
  • 有一个对循环做Pipeline的优化, 但是综合并没有实现pipeline。
  • 综合结果分析中显示, y_temp * K_temp并不是用移位操作做的,而是直接调用的浮点数计算,所以没有达到简化的目标。
  • 综合结果如下:

直角坐标系转极坐标系:第二版代码

#define NO_ITER 16

typedef ap_fixed<18, 3, AP_RND> data_t;

data_t Kvalues[NO_ITER] = {1, 0.500000000000000, 0.250000000000000, 0.125000000000000, 0.0625000000000000, 0.0312500000000000, 0.0156250000000000, 0.00781250000000000, 0.00390625000000000, 0.00195312500000000, 0.000976562500000000, 0.000488281250000000, 0.000244140625000000, 0.000122070312500000, 6.10351562500000e-05, 3.05175781250000e-05};

data_t angles[NO_ITER] = {0.785398163397448, 0.463647609000806, 0.244978663126864, 0.124354994546761, 0.0624188099959574, 0.0312398334302683, 0.0156237286204768, 0.00781234106010111, 0.00390623013196697, 0.00195312251647882, 0.000976562189559320, 0.000488281211194898, 0.000244140620149362, 0.000122070311893670, 6.10351561742088e-05, 3.05175781155261e-05};

const data_t pi = 3.1415926535897932384626;

/**
 * @description:  coordinate to polar axis
 * @param {data_t} x  
 * @param {data_t} y
 * @param {data_t} *r
 * @param {data_t} *theta : radian system
 * @return {*}
 */
void cordiccart2pol(data_t x, data_t y, data_t *r, data_t *theta)
{
	// ratation degree
	data_t angle;
	data_t x_copy;
	data_t y_copy;
	// make the initial vector into I or IV Quadrant
	if (y > 0)
	{
		x_copy = y;
		y_copy = -x;
		angle = pi >> 1;
	}
	else
	{
		x_copy = -y;
		y_copy = x;
		angle = -pi >> 1;
	}

	// start rotating
	int sigma;
ROTATE:
	for (int i = 0; i < NO_ITER; i++)
	{
		if (y_copy > 0)
		{
			sigma = -1;
		}
		else
		{
			sigma = 1;
		}
		data_t y_temp = y_copy;
		data_t x_temp = x_copy;
		x_copy -= sigma * y_temp >> i;
		y_copy += sigma * x_temp >> i;
		angle -= sigma * angles[i];
	}


	data_t scale = 0.607253;
	*theta = angle;
	*r = scale * x_copy;
}

  • 结果测试的误差大概在1e-4这个量级。
  • 定点数采用的是18位宽,3位整数的浮点数,中间的乘法运算改成移位运算。
  • 有一个对循环做Pipeline的优化
  • 综合结果如下:
  • 相比于浮点数,减少了80% +的器件和时间消耗。
  • 分析结果如下:
  • HLS自动对sigma进行了优化,所以if应该不用手动改成两套不同电路。
  • 因为y_copy += sigma * x_temp >> i; 计算速度很快, 所以可以在一个周期内返回值,因此可以将该循环的II = 1。
  • 由于angles的读取速度较慢 ( > 1 周期), 导致了angle -= sigma * angles[i]; 需要在第二个状态才能计算,从而使得latency = 2.

直角坐标系转极坐标系:直接对函数进行pipeline

  • 结果如下:
  • II = 1, latency = 13,资源占用率有一定的提升(相当于从一个模块处理变成了16个模块同时处理)。

第四章 离散傅里叶变换(DFT)

基础

  • DFT 变换起始类似与矩阵乘法。
  • 将离散信号乘以下列矩阵就可以得到DFT结果:
    S=[11111ss2sN11s2s4s2(N1)1s3s6s3(N1)1sN1s2(N1)s(N1)(N1)]S=\left[\begin{array}{ccccc}1 & 1 & 1 & \cdots & 1 \\ 1 & s & s^{2} & \cdots & s^{N-1} \\ 1 & s^{2} & s^{4} & \cdots & s^{2(N-1)} \\ 1 & s^{3} & s^{6} & \cdots & s^{3(N-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & s^{N-1} & s^{2(N-1)} & \cdots & s^{(N-1)(N-1)}\end{array}\right]
  • 其中的:
    s=ej2πNs=e^{\frac{-j 2 \pi}{N}}
  • 通常我们也会把2πN\frac{2 \pi}{N}记为ω\omega
  • DFT 的矩阵运算即是:
    G=SgG=S \cdot g
  • GG 表示的就是DFT变换之后的结果。也即是:
    G[k]=n=0N1g[n]sknG[k]=\sum_{n=0}^{N-1} g[n] s^{k n} for k=0,,N1k=0, \ldots, N-1
  • 在计算时候为了简便,通常使用欧拉公式,将复数运算变成三角运算:
    eix=cosx+isinxe^{i x}=\cos x+i \sin x

Baseline

#include<math.h> //Required for cos and sin functions
#include "dft.h"
#define N 256                            // DFT Size

void dft(IN_TYPE sample_real[N], IN_TYPE sample_imag[N]) {
    int i, j;
    TEMP_TYPE w;
    TEMP_TYPE c, s;

    // Temporary arrays to hold the intermediate frequency domain results
    TEMP_TYPE temp_real[N];
    TEMP_TYPE temp_imag[N];

    // Calculate each frequency domain sample iteratively
    for (i = 0; i < N; i += 1) {
        temp_real[i] = 0;
        temp_imag[i] = 0;

        // (2 * pi * i)/N
        w = (-2.0 * 3.141592653589  / N) * (TEMP_TYPE)i;

        // Calculate the jth frequency sample sequentially
        for (j = 0; j < N; j += 1) {
            // Utilize HLS tool to calculate sine and cosine values
            c = cos(j * w);
            s = sin(j * w);

            // Multiply the current phasor with the appropriate input sample and keep
            // running sum
            temp_real[i] += (sample_real[j] * c - sample_imag[j] * s);
            temp_imag[i] += (sample_real[j] * s + sample_imag[j] * c);
        }
    }

    // Perform an inplace DFT, i.e., copy result into the input arrays
    for (i = 0; i < N; i += 1) {
        sample_real[i] = temp_real[i];
        sample_imag[i] = temp_imag[i];
    }
}

  • 作者提供代码中w计算少了一个负号。
  • 综合结果:
  • latency 和 area 均很高

优化0: 查表代替计算:

			// 由于已经储存了[0,2pi) 的 cos 和 -sin 的值 (共256个)
			// 所以只需要查表找到该值即可
			int index = (i * j) % N;
			c = cos_coefficients_table[index];
			s = sin_coefficients_table[index];
  • 综合结果:
  • latency 和 area均有大幅度降低。

优化1:定点数代替浮点数

typedef ap_fixed<32, 18, AP_RND> DTYPE;
typedef ap_fixed<32, 18, AP_RND> IN_TYPE;        // Data type for the input signal
typedef ap_fixed<32, 18, AP_RND> TEMP_TYPE; // Data type for the temporary variables
  • 使用32位宽,18位整数的定点数,可以满足误差要求。
  • 能够极大减少器件数量, 一定程度上减少计算时间。
  • 综合结果:

优化2: 循环交换或流水线交织处理 + 内层循环流水线

  • 瓶颈在temp_real[i]和sample_real[j] 的读取上。
  • 考虑到矩阵乘法实际上以列来做更好,每一列都乘以一个同样的元素,可以将内层外层循环交换,即先列后行循环。
  • 最终代码:
#include "dft.h"

const DTYPE cos_coefficients_table[]={1.000000,0.999699,0.998795,0.997290,0.995185,0.992480,0.989177,0.985278,0.980785,0.975702,0.970031,0.963776,0.956940,0.949528,0.941544,0.932993,0.923880,0.914210,0.903989,0.893224,0.881921,0.870087,0.857729,0.844854,0.831470,0.817585,0.803208,0.788346,0.773010,0.757209,0.740951,0.724247,0.707107,0.689541,0.671559,0.653173,0.634393,0.615232,0.595699,0.575808,0.555570,0.534998,0.514103,0.492898,0.471397,0.449611,0.427555,0.405241,0.382683,0.359895,0.336890,0.313682,0.290285,0.266713,0.242980,0.219101,0.195090,0.170962,0.146730,0.122411,0.098017,0.073565,0.049068,0.024541,0.000000,-0.024541,-0.049068,-0.073565,-0.098017,-0.122411,-0.146730,-0.170962,-0.195090,-0.219101,-0.242980,-0.266713,-0.290285,-0.313682,-0.336890,-0.359895,-0.382683,-0.405241,-0.427555,-0.449611,-0.471397,-0.492898,-0.514103,-0.534998,-0.555570,-0.575808,-0.595699,-0.615232,-0.634393,-0.653173,-0.671559,-0.689541,-0.707107,-0.724247,-0.740951,-0.757209,-0.773010,-0.788346,-0.803208,-0.817585,-0.831470,-0.844854,-0.857729,-0.870087,-0.881921,-0.893224,-0.903989,-0.914210,-0.923880,-0.932993,-0.941544,-0.949528,-0.956940,-0.963776,-0.970031,-0.975702,-0.980785,-0.985278,-0.989177,-0.992480,-0.995185,-0.997290,-0.998795,-0.999699,-1.000000,-0.999699,-0.998795,-0.997290,-0.995185,-0.992480,-0.989177,-0.985278,-0.980785,-0.975702,-0.970031,-0.963776,-0.956940,-0.949528,-0.941544,-0.932993,-0.923880,-0.914210,-0.903989,-0.893224,-0.881921,-0.870087,-0.857729,-0.844854,-0.831470,-0.817585,-0.803208,-0.788346,-0.773010,-0.757209,-0.740951,-0.724247,-0.707107,-0.689541,-0.671559,-0.653173,-0.634393,-0.615232,-0.595699,-0.575808,-0.555570,-0.534998,-0.514103,-0.492898,-0.471397,-0.449611,-0.427555,-0.405241,-0.382683,-0.359895,-0.336890,-0.313682,-0.290285,-0.266713,-0.242980,-0.219101,-0.195090,-0.170962,-0.146730,-0.122411,-0.098017,-0.073565,-0.049068,-0.024541,-0.000000,0.024541,0.049068,0.073565,0.098017,0.122411,0.146730,0.170962,0.195090,0.219101,0.242980,0.266713,0.290285,0.313682,0.336890,0.359895,0.382683,0.405241,0.427555,0.449611,0.471397,0.492898,0.514103,0.534998,0.555570,0.575808,0.595699,0.615232,0.634393,0.653173,0.671559,0.689541,0.707107,0.724247,0.740951,0.757209,0.773010,0.788346,0.803208,0.817585,0.831470,0.844854,0.857729,0.870087,0.881921,0.893224,0.903989,0.914210,0.923880,0.932993,0.941544,0.949528,0.956940,0.963776,0.970031,0.975702,0.980785,0.985278,0.989177,0.992480,0.995185,0.997290,0.998795,0.999699};
const DTYPE sin_coefficients_table[]={0.000000,-0.024541,-0.049068,-0.073565,-0.098017,-0.122411,-0.146730,-0.170962,-0.195090,-0.219101,-0.242980,-0.266713,-0.290285,-0.313682,-0.336890,-0.359895,-0.382683,-0.405241,-0.427555,-0.449611,-0.471397,-0.492898,-0.514103,-0.534998,-0.555570,-0.575808,-0.595699,-0.615232,-0.634393,-0.653173,-0.671559,-0.689541,-0.707107,-0.724247,-0.740951,-0.757209,-0.773010,-0.788346,-0.803208,-0.817585,-0.831470,-0.844854,-0.857729,-0.870087,-0.881921,-0.893224,-0.903989,-0.914210,-0.923880,-0.932993,-0.941544,-0.949528,-0.956940,-0.963776,-0.970031,-0.975702,-0.980785,-0.985278,-0.989177,-0.992480,-0.995185,-0.997290,-0.998795,-0.999699,-1.000000,-0.999699,-0.998795,-0.997290,-0.995185,-0.992480,-0.989177,-0.985278,-0.980785,-0.975702,-0.970031,-0.963776,-0.956940,-0.949528,-0.941544,-0.932993,-0.923880,-0.914210,-0.903989,-0.893224,-0.881921,-0.870087,-0.857729,-0.844854,-0.831470,-0.817585,-0.803208,-0.788346,-0.773010,-0.757209,-0.740951,-0.724247,-0.707107,-0.689541,-0.671559,-0.653173,-0.634393,-0.615232,-0.595699,-0.575808,-0.555570,-0.534998,-0.514103,-0.492898,-0.471397,-0.449611,-0.427555,-0.405241,-0.382683,-0.359895,-0.336890,-0.313682,-0.290285,-0.266713,-0.242980,-0.219101,-0.195090,-0.170962,-0.146730,-0.122411,-0.098017,-0.073565,-0.049068,-0.024541,-0.000000,0.024541,0.049068,0.073565,0.098017,0.122411,0.146730,0.170962,0.195090,0.219101,0.242980,0.266713,0.290285,0.313682,0.336890,0.359895,0.382683,0.405241,0.427555,0.449611,0.471397,0.492898,0.514103,0.534998,0.555570,0.575808,0.595699,0.615232,0.634393,0.653173,0.671559,0.689541,0.707107,0.724247,0.740951,0.757209,0.773010,0.788346,0.803208,0.817585,0.831470,0.844854,0.857729,0.870087,0.881921,0.893224,0.903989,0.914210,0.923880,0.932993,0.941544,0.949528,0.956940,0.963776,0.970031,0.975702,0.980785,0.985278,0.989177,0.992480,0.995185,0.997290,0.998795,0.999699,1.000000,0.999699,0.998795,0.997290,0.995185,0.992480,0.989177,0.985278,0.980785,0.975702,0.970031,0.963776,0.956940,0.949528,0.941544,0.932993,0.923880,0.914210,0.903989,0.893224,0.881921,0.870087,0.857729,0.844854,0.831470,0.817585,0.803208,0.788346,0.773010,0.757209,0.740951,0.724247,0.707107,0.689541,0.671559,0.653173,0.634393,0.615232,0.595699,0.575808,0.555570,0.534998,0.514103,0.492898,0.471397,0.449611,0.427555,0.405241,0.382683,0.359895,0.336890,0.313682,0.290285,0.266713,0.242980,0.219101,0.195090,0.170962,0.146730,0.122411,0.098017,0.073565,0.049068,0.024541};

#define N 256                            // DFT Size

void dft(IN_TYPE sample_real[N], IN_TYPE sample_imag[N]) {
    int i, j;
    TEMP_TYPE w;
    TEMP_TYPE c, s;

    // Temporary arrays to hold the intermediate frequency domain results
    TEMP_TYPE temp_real[N] = {0};
    TEMP_TYPE temp_imag[N] = {0};

    // Calculate each frequency domain sample iteratively
    OUTER:for (j = 0; j < N; j += 1) {

        // (2 * pi * i)/N
        // w = (-2.0 * 3.141592653589  / N) * (TEMP_TYPE)i;

        // Calculate the jth frequency sample sequentially
        INNER:for (i = 0; i < N; i += 1) {
            // Utilize HLS tool to calculate sine and cosine values
            // c = cos(j * w);
            // s = sin(j * w);
			// 由于已经储存了[0,2pi) 的 cos 和 -sin 的值 (共256个)
			// 所以只需要查表找到该值即可
			int index = (i * j) % N;
			c = cos_coefficients_table[index];
			s = sin_coefficients_table[index];


            // Multiply the current phasor with the appropriate input sample and keep
            // running sum
            temp_real[i] += (sample_real[j] * c - sample_imag[j] * s);
            temp_imag[i] += (sample_real[j] * s + sample_imag[j] * c);
        }
    }

    // Perform an inplace DFT, i.e., copy result into the input arrays
    for (i = 0; i < N; i += 1) {
        sample_real[i] = temp_real[i];
        sample_imag[i] = temp_imag[i];
    }
}

  • 综合结果:
  • 可以实现II=1的流水
  • 这个双层循环已经是一个完美嵌套循环(可以更好优化),所以HLS在处理中是当作一个循环来处理的。

hls::stream 实现

void dft_stream(hls::stream<IN_TYPE> &sample_real, hls::stream<IN_TYPE> &sample_imag,
		hls::stream<IN_TYPE> &output_real, hls::stream<IN_TYPE> &output_imag) {
    int i, j;
    TEMP_TYPE w;
    TEMP_TYPE c, s;

    // Temporary arrays to hold the intermediate frequency domain results
    TEMP_TYPE temp_real[N] = {0};
    TEMP_TYPE temp_imag[N] = {0};

    // Calculate each frequency domain sample iteratively
    OUTER:for (j = 0; j < N; j += 1) {

    	IN_TYPE sample_real_temp = sample_real.read();
    	IN_TYPE sample_imag_temp = sample_imag.read();
        // Calculate the jth frequency sample sequentially
        INNER:for (i = 0; i < N; i += 1) {
			// 由于已经储存了[0,2pi) 的 cos 和 -sin 的值 (共256个)
			// 所以只需要查表找到该值即可
			int index = (i * j) % N;
			c = cos_coefficients_table[index];
			s = sin_coefficients_table[index];

            // Multiply the current phasor with the appropriate input sample and keep
            // running sum
            temp_real[i] += (sample_real_temp * c - sample_imag_temp * s);
            temp_imag[i] += (sample_real_temp * s + sample_imag_temp * c);
        }
    }

    // Perform an inplace DFT, i.e., copy result into the input arrays
    for (i = 0; i < N; i += 1) {
    	output_real.write(temp_real[i]);
    	output_imag.write(temp_imag[i]);
    }
}
  • 综合结果和优化2结果类似。

第五章 快速傅里叶变换(FFT)

  • 当取样样本数量为N时,直接使用矩阵向量乘法来执行离散傅里叶变换需要O(n2)\mathcal{O}(n^2)次乘法和加法操作。
  • S矩阵具有大量冗余
  • 快速傅立叶变换(FFT)使用基于S矩阵对称性的分块处理方法,它需要O(nlogn)\mathcal{O}(n \log n)复杂度
  • 两点FFT:
  • 四点FFT:
  • 八点FFT(其中的bit reverse 只需要将bit左右颠倒就好了 001 -> 100):
  • 递归关系:

代码

#include "fft.h"


void bit_reverse(DTYPE X_R[SIZE], DTYPE X_I[SIZE], DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE]);
template<int STAGE>
void fft_stages (DTYPE X_R[SIZE], DTYPE X_I[SIZE], DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE]);

void fft(DTYPE X_R[SIZE], DTYPE X_I[SIZE], DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE])
{
	

	//Call fft
	DTYPE Stage0_R[SIZE], Stage0_I[SIZE];
	DTYPE Stage1_R[SIZE], Stage1_I[SIZE];
	DTYPE Stage2_R[SIZE], Stage2_I[SIZE];
	DTYPE Stage3_R[SIZE], Stage3_I[SIZE];
	DTYPE Stage4_R[SIZE], Stage4_I[SIZE];
	DTYPE Stage5_R[SIZE], Stage5_I[SIZE];
	DTYPE Stage6_R[SIZE], Stage6_I[SIZE];
	DTYPE Stage7_R[SIZE], Stage7_I[SIZE];
	DTYPE Stage8_R[SIZE], Stage8_I[SIZE];
	DTYPE Stage9_R[SIZE], Stage9_I[SIZE];

	bit_reverse(X_R, X_I, Stage0_R, Stage0_I);
	fft_stages<1>(Stage0_R, Stage0_I, Stage1_R, Stage1_I);
	fft_stages<2>(Stage1_R, Stage1_I, Stage2_R, Stage2_I);
	fft_stages<3>(Stage2_R, Stage2_I, Stage3_R, Stage3_I);
	fft_stages<4>(Stage3_R, Stage3_I, Stage4_R, Stage4_I);
	fft_stages<5>(Stage4_R, Stage4_I, Stage5_R, Stage5_I);
	fft_stages<6>(Stage5_R, Stage5_I, Stage6_R, Stage6_I);
	fft_stages<7>(Stage6_R, Stage6_I, Stage7_R, Stage7_I);
	fft_stages<8>(Stage7_R, Stage7_I, Stage8_R, Stage8_I);
	fft_stages<9>(Stage8_R, Stage8_I, Stage9_R, Stage9_I);
	fft_stages<10>(Stage9_R, Stage9_I, OUT_R, OUT_I);
}

unsigned int reverse_bits(unsigned int input)
{
	int i, rev = 0;
	for (i = 0; i < M; i++)
	{
		rev = (rev << 1) | (input & 1);
		input = input >> 1;
	}
	return rev;
}

void bit_reverse(DTYPE X_R[SIZE], DTYPE X_I[SIZE], DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE])
{
	unsigned int reversed;
	unsigned int i;
	DTYPE temp;

	for (i = 0; i < SIZE; i++)
	{
		reversed = reverse_bits(i); // Find the bit reversed index
		OUT_R[reversed] = X_R[i];
		OUT_I[reversed] = X_I[i];
	}
}
/*=======================BEGIN: FFT=========================*/


//stages
template<int STAGE>
void fft_stages (DTYPE X_R[SIZE], DTYPE X_I[SIZE], DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE])
{
	//Insert your code here
	int DFTpts = 1 << STAGE; // DFT = 2^stage = points in sub DFT
	int numBF = DFTpts >> 1;	 // Butterfly WIDTHS in sub-DFT
	DTYPE e = -6.283185307178 / DFTpts;
// Perform butterflies for j-th stage

butterfly_loop:
	for (int j = 0; j < (1 << (STAGE - 1)); j++)
	{
	// Compute butterflies that use same W**k
	dft_loop:
		for (int k = 0; k < (1 << (M - STAGE)); k++)
		// for (int i = j; i < SIZE; i += (1 << STAGE))
		{
			int i = j + k * (1 << STAGE);

			DTYPE c = W_real[j << (M - STAGE)];
			DTYPE s = W_imag[j << (M - STAGE)];
			int i_lower = i + (1 << (STAGE - 1)); // index of lower point in butterfly
			DTYPE temp_R = X_R[i_lower] * c - X_I[i_lower] * s;
			DTYPE temp_I = X_I[i_lower] * c + X_R[i_lower] * s;
			
			OUT_R[i_lower] = X_R[i] - temp_R;
			OUT_I[i_lower] = X_I[i] - temp_I;
			OUT_R[i] = X_R[i] + temp_R;
			OUT_I[i] = X_I[i] + temp_I;
		}
	}
}

/*=======================END: FFT=========================*/
  • 使用dataflow优化时候,不能同时读写一个函数的参数。否则会报错:Argument 'X_R' failed dataflow checking: Cannot read as well as write over function parameter.
  • 循环边界最好为常数,不然无法统计trip_count。
  • 设置内层II=1,最终结果是6212 CLK