DMA throughput measurement on a Kinetis MCU


I measured DMA data transfer throughput on a Kinetis MCU to find out how fast it really is.

Kinetis microcontrollers support two types of transfer, RAM-to-RAM and ROM-to-RAM. There is actually one more, internal register-to-RAM, but it is more or less the same as a RAM-to-RAM transfer.

On a Kinetis MCU, DMA accesses to RAM need no wait states, whereas accesses to ROM should need one wait state.

I wanted to know how much difference this makes between the two types of transfer, so I measured the throughput of both.

Here is what I did.

What I want to do

It is simple. I want to measure the throughput of two transfers: RAM-to-RAM and FlashROM-to-RAM.

Of course, there is no RAM-to-ROM transfer, because ROM is read-only and cannot be written.

The main purpose is to measure and compare the DMA throughput when reading from FlashROM and when reading from RAM.

Evaluation board

FRDM-K64F

Source code

My source code is on GitHub. Please download it from here if you want.
GitHub: https://github.com/mcuthings/dma-throughput


/*******************************************************************************
 * Definitions
 ******************************************************************************/
//#define TRANSFER_ROM_TO_RAM
#define TRANSFER_RAM_TO_RAM

#define EXAMPLE_DMA DMA0
#define EXAMPLE_DMAMUX DMAMUX0
#define DMA_CH0 0
#define DMA_CH1 1
#define DMA_CH2 2
#define DMA_CH3 3
#define TRANSFER_MODE 
#define BUFF_LENGTH 8192UL

/* Systick Counter */
#define GET_TICKS() (0xFFFFFF - SysTick->VAL)
#define STOP_COUNTING() (SysTick->CTRL &= (~SysTick_CTRL_ENABLE_Msk))


/*******************************************************************************
 * Prototypes
 ******************************************************************************/
void SysTick_Handler(void);
void dmaTransfer(void* srcAddr, void* destAddr);
/*******************************************************************************
 * Variables
 ******************************************************************************/
/*DMA Variables*/
edma_handle_t* g_EDMA_Handle[4];
edma_handle_t g_EDMA_Handle_ch0; //for the test of 1byte(8bits) transfer
edma_handle_t g_EDMA_Handle_ch1; //for the test of 4byte(32bits) transfer
edma_handle_t g_EDMA_Handle_ch2; //for the test of 16byte transfer
edma_handle_t g_EDMA_Handle_ch3; //for the test of 32byte transfer

static edma_transfer_config_t transferConfig;
static edma_config_t userConfig;
static uint32_t dma_start,dma_end;
static void* srcAddr; //Source addr for DMA
extern char _binary_randomData_bin_start[]; //Source addr in case of ROM-to-RAM transfer 
static uint32_t SysIsrcount=0;

__attribute__ ((section (".data.$SRAM_LOWER") )) volatile bool g_Transfer_Done = false;
__attribute__ ((section(".data.SRAM_UPPER"))) __attribute__ ((aligned(32))) uint8_t srcRAM[BUFF_LENGTH]={0}; //source table array for RAM-to-RAM
__attribute__ ((section(".data.SRAM_UPPER"))) __attribute__ ((aligned(32))) uint8_t destRAM[BUFF_LENGTH] = {0};




/*******************************************************************************
 * Code
 ******************************************************************************/

/* User callback function for EDMA transfer. */
void EDMA_Callback(edma_handle_t *handle, void *param, bool transferDone, uint32_t tcds)
{
    
    dma_end = GET_TICKS();
    STOP_COUNTING();
    
    if (transferDone)
    {
        g_Transfer_Done = true;
    }
}

void SysTick_Handler(void){

    SysIsrcount++;
}

__attribute__ ((long_call, section (".ramfunc.$SRAM_LOWER") )) void dma_polling(void) 
{
    /* Wait for EDMA transfer finish */
    while (g_Transfer_Done != true)
    {
    }
}


/*!
 * @brief Main function
 */
int main(void)
{

    BOARD_InitPins();
    BOARD_BootClockRUN();
    BOARD_InitDebugConsole();

    /* Print source buffer */
    PRINTF("EDMA memory to memory transfer example begin.\r\n\r\n");

    /* For RAM-to-RAM transfer test
     * Prepare source data */
    for (uint32_t i = 0; i < BUFF_LENGTH; i++)
    {
        srcRAM[i] = i%256; //mod(256) to store a byte
    }

    /* Configure DMAMUX */
    DMAMUX_Init(EXAMPLE_DMAMUX);
#if defined(FSL_FEATURE_DMAMUX_HAS_A_ON) && FSL_FEATURE_DMAMUX_HAS_A_ON
    DMAMUX_EnableAlwaysOn(EXAMPLE_DMAMUX, 0, true);
#else
    for (uint8_t i=0; i<4;i++){
        DMAMUX_SetSource(EXAMPLE_DMAMUX, i, 63);
    }
#endif /* FSL_FEATURE_DMAMUX_HAS_A_ON */
    for (uint8_t i=0; i<4;i++){
        DMAMUX_EnableChannel(EXAMPLE_DMAMUX, i);
    }
    /* Configure EDMA one shot transfer */
    /*
     * userConfig.enableRoundRobinArbitration = false;
     * userConfig.enableHaltOnError = true;
     * userConfig.enableContinuousLinkMode = false;
     * userConfig.enableDebugMode = false;
     */
    

    EDMA_GetDefaultConfig(&userConfig);
    EDMA_Init(EXAMPLE_DMA, &userConfig);
    /*DMA handle creation*/
    EDMA_CreateHandle(&g_EDMA_Handle_ch0, EXAMPLE_DMA, DMA_CH0);
    EDMA_CreateHandle(&g_EDMA_Handle_ch1, EXAMPLE_DMA, DMA_CH1);
    EDMA_CreateHandle(&g_EDMA_Handle_ch2, EXAMPLE_DMA, DMA_CH2);
    EDMA_CreateHandle(&g_EDMA_Handle_ch3, EXAMPLE_DMA, DMA_CH3);
        
    g_EDMA_Handle[0] = &g_EDMA_Handle_ch0;
    g_EDMA_Handle[1] = &g_EDMA_Handle_ch1;
    g_EDMA_Handle[2] = &g_EDMA_Handle_ch2;
    g_EDMA_Handle[3] = &g_EDMA_Handle_ch3;

    /* DMA Transfer throughput test */    
    PRINTF("\r\nDMA Transfer from RAM to RAM\r\n");
    dmaTransfer(srcRAM, destRAM);

    PRINTF("\r\nDMA Transfer from ROM to RAM\r\n");
    dmaTransfer(_binary_randomData_bin_start, destRAM);

    PRINTF("\r\n\r\nEDMA memory to memory transfer example finish.\r\n\r\n");

    while (1)
    {
    }
}

void dmaTransfer(void* srcAddr, void* destAddr){
    /* DMA Throughput counter */
    uint32_t transferByte; 
    volatile uint32_t cnt;
    volatile uint32_t ret;
    double coreClock;
    double scalingFactor;
    uint32_t result;

    for (uint8_t i = 0; i < 4; i++) { // loop over ch0 - ch3
        EDMA_SetCallback(g_EDMA_Handle[i], EDMA_Callback, NULL);

        switch (i) {
            case 0: transferByte = (uint32_t) 1; // 1-byte transfer
                break;
            case 1: transferByte = (uint32_t) 4; // 4-byte transfer
                break;
            case 2: transferByte = (uint32_t) 16; // 16-byte transfer
                break;
            case 3: transferByte = (uint32_t) 32; // 32-byte transfer
                break;
            default: break;
        }

        /* BUFF_LENGTH bytes per request, BUFF_LENGTH*512 bytes in total (512 major-loop iterations) */
        EDMA_PrepareTransfer(&transferConfig, srcAddr, transferByte, destAddr, transferByte,
                             (uint32_t)(BUFF_LENGTH), (uint32_t)(BUFF_LENGTH*512), kEDMA_MemoryToMemory);
        EDMA_SubmitTransfer(g_EDMA_Handle[i], &transferConfig);
        /* Wrap the source and destination addresses within an 8KB window */
        EDMA_SetModulo(EXAMPLE_DMA, g_EDMA_Handle[i]->channel, kEDMA_Modulo8Kbytes, kEDMA_Modulo8Kbytes);


        ret= SysTick_Config(0xFFFFFF);/*<---- Here starts the cycle count */
        
        if(ret){
            PRINTF("SysTick configuration failed.\r\n");
            while(1);
        }
    
        dma_start = GET_TICKS();
        g_Transfer_Done = false;
        SysIsrcount = 0;
        EDMA_StartTransfer(g_EDMA_Handle[i]);
        
        dma_polling(); //Polling the DMA transfer complete status
        
        cnt = dma_end - dma_start;
        cnt += (SysIsrcount*0xFFFFFF);
        coreClock = CLOCK_GetCoreSysClkFreq();
        scalingFactor = (double)cnt/coreClock;
        result=(BUFF_LENGTH*512/scalingFactor)/(1024*1024);//Unit is [MB/Sec]
        
        /* Print out result */
        PRINTF("DMA throughput (Transfer size %dByte) is %d MB/Sec\r\n",transferByte ,result);          

    }
    
}



Core clock and System clock

The core and system clocks are both configured to 100MHz, which makes the results easy to calculate and to compare with other systems.

With no wait states, this should give a throughput of 200MB/Sec.
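One way to arrive at the 200MB/Sec figure is sketched below. It is only a rough model and rests on assumptions that do not come from the post: a 32-bit (4-byte) bus, and one read cycle plus one write cycle for every word copied. The function name and its parameters are made up for this illustration.

```c
#include <stdint.h>

/* Ideal copy throughput in bytes per second.
 * Assumption: every beat moves bus_width_bytes and costs
 * cycles_per_beat bus cycles (read + write, no wait states). */
static double ideal_dma_bytes_per_sec(double bus_clock_hz,
                                      uint32_t bus_width_bytes,
                                      uint32_t cycles_per_beat)
{
    return bus_clock_hz * (double)bus_width_bytes / (double)cycles_per_beat;
}

/* Same value expressed in units of 1024*1024 bytes per second,
 * which is the unit the measurement code divides by. */
static double to_mib_per_sec(double bytes_per_sec)
{
    return bytes_per_sec / (1024.0 * 1024.0);
}
```

With 100MHz, 4 bytes and 2 cycles, this model gives 200,000,000 bytes per second. Because the measurement code divides by 1024*1024, the same ideal value would show up as roughly 190.7 in the printed results, which is worth keeping in mind when comparing measured numbers against 200.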

The data table to be transferred is 8KB in size. One copy is placed in RAM and another in ROM, and each is used as a transfer source.

If you want to know how variables (or functions) are placed in RAM, please take a look at my previous post.

Ref.: How to place RAM func in GCC environment???

The Kinetis MCU has two banks of SRAM, LOWER and UPPER. To make accesses faster, DMA accesses should be kept separate from the core's accesses to RAM. The Kinetis SRAM can be accessed simultaneously because it is configured as dual-port SRAM.

So the DMA should access the UPPER side of SRAM, while the core accesses the LOWER side.

The transfer buffer size is BUFF_LENGTH = 8KB, and the total transfer size is 8KB x 512 transfers = 4MB.

BUFF_LENGTH : 8KB
Total transfer size : 4MB

DMA config

Basically, I didn’t change the configuration from the SDK defaults.

The DMA source and destination are configured with an 8KB modulo.

There is not enough RAM to hold 4MB, so the 8KB buffer is transferred 512 times, which comes to 4MB in total.
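The effect of an 8KB modulo can be modeled on a PC with a few lines of C. This is a sketch of power-of-two modulo addressing in general, not the eDMA hardware itself: the low 13 address bits (2^13 = 8192) are allowed to change while the upper bits stay frozen. The function name and the example addresses are hypothetical.

```c
#include <stdint.h>

/* Next address under power-of-two modulo addressing.
 * Only the low modulo_bits of the address may change;
 * the upper bits keep their original value.
 * Assumes modulo_bits is smaller than 32. */
static uint32_t modulo_next_addr(uint32_t addr, uint32_t offset,
                                 uint32_t modulo_bits)
{
    uint32_t mask = ((uint32_t)1 << modulo_bits) - 1u;
    return (addr & ~mask) | ((addr + offset) & mask);
}
```

In this model, stepping 4 bytes from 0x20001FFC with 13 modulo bits wraps back to 0x20000000, so the address never leaves the 8KB window. Note that the window starts at an 8KB-aligned address. If the hardware behaves like this model, a buffer that is only 32-byte aligned, as in the listing, would wrap to the start of the surrounding 8KB window rather than to the start of the buffer. That should not matter much for a pure throughput test, but it may be worth checking if the copied data itself is important.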

EDMA_SetCallback():

When the major loop count expires (the transfer completes), the specified callback function is called.

EDMA_PrepareTransfer():

This configures the DMA source and destination addresses, the transfer size, and the total number of bytes to transfer.

EDMA_SubmitTransfer():

The transfer configuration is written to the TCD (Transfer Control Descriptor).

EDMA_SetModulo():

The 8KB modulo is configured here.

EDMA_StartTransfer():

Lastly, this starts the DMA.

Polling for DMA transfer completion

An interrupt is generated when the DMA transfer has finished. Meanwhile, the core polls a transfer-done flag (the dma_polling() function). While polling, the core accesses only the LOWER side of SRAM to avoid conflicts with the DMA accesses.

The core accesses SRAM_LOWER, and the DMA accesses SRAM_UPPER.

Measuring the transfer time

To measure the transfer time, I used the SysTick timer this time. Measuring cycle counts is easy if you use CMSIS. If you are interested in measuring with the SysTick timer, you can refer to my previous post below.

Ref: Let’s use CMSIS! Easy way to measure cycle count.

The SysTick timer has only a 24-bit counter, so you need to pay attention when you measure a relatively long time.

If the measurement takes longer than the 24-bit counter can hold, the counter reaches zero and reloads before the DMA transfer is done.
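To get a feeling for how soon that happens, the wrap period can be worked out from the reload value and the clock. The helper names below are hypothetical and the numbers assume the 100MHz clock and the 0xFFFFFF period used in this post.

```c
#include <stdint.h>

/* Time in seconds for a down-counter to run through period_ticks ticks. */
static double wrap_period_sec(uint32_t period_ticks, double clock_hz)
{
    return (double)period_ticks / clock_hz;
}

/* Slowest average rate (bytes per second) at which total_bytes
 * still finishes before the counter wraps once. */
static double min_rate_without_wrap(double total_bytes,
                                    uint32_t period_ticks, double clock_hz)
{
    return total_bytes / wrap_period_sec(period_ticks, clock_hz);
}
```

At 100MHz, a period of 0xFFFFFF ticks lasts about 0.168 seconds. For the 4MB transfer, this means the counter wraps at least once whenever the average rate drops below roughly 25 million bytes per second (about 23.8 in the 1024*1024 units used by the listing).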

So this time I count how many times the timer reaches zero (SysIsrcount) in the SysTick timer interrupt handler, as below.

void SysTick_Handler(void){
    SysIsrcount++;
}

At the end of the measurement, the count value read when the DMA started is subtracted from the count value read when the DMA transfer finished. Then SysIsrcount (the number of times the SysTick timer expired) multiplied by the timer period is added to the result.

Lastly, you get the DMA transfer time by dividing cnt by the system clock frequency.


cnt = dma_end - dma_start;
cnt += (SysIsrcount*0xFFFFFF);
coreClock = CLOCK_GetCoreSysClkFreq();
scalingFactor = (double)cnt/coreClock;
result=(BUFF_LENGTH*512/scalingFactor)/(1024*1024);//Unit is [MB/Sec]
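The same calculation can be tried out on a PC with the sketch below. It is not the code from the repository: the function names and the SYSTICK_PERIOD macro are invented here, and a 64-bit cycle count is used so that long measurements cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks per SysTick wrap; assumed to match the value passed to SysTick_Config(). */
#define SYSTICK_PERIOD 0xFFFFFFu

/* Elapsed ticks between two up-counting readings, plus whole wraps. */
static uint64_t elapsed_cycles(uint32_t start, uint32_t end, uint32_t overflows)
{
    /* Without a wrap, the end reading cannot be smaller than the start reading. */
    assert(overflows > 0 || end >= start);
    return (uint64_t)overflows * SYSTICK_PERIOD + end - start;
}

/* Throughput in units of 1024*1024 bytes per second. */
static double throughput_mib_per_sec(uint64_t total_bytes, uint64_t cycles,
                                     double core_hz)
{
    assert(cycles > 0 && core_hz > 0.0);
    double seconds = (double)cycles / core_hz;
    return (double)total_bytes / seconds / (1024.0 * 1024.0);
}
```

For example, 4,194,304 bytes moved in 2,000,000 cycles at 100MHz take 0.02 seconds, which this sketch reports as 200 in the listing's MB/Sec unit.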

Measurement result

Kinetis K64 DMA throughput measurement result

For the RAM-to-RAM transfer with a 32-byte transfer size, the result came to 169MB/Sec. This seems to be lower than the value given in the reference manual.

I think there may be a wait even when the DMA accesses RAM. For the ROM-to-RAM transfer, the result came to 152MB/Sec. From this, I think there are more waits than in the RAM-to-RAM case.

Summary

I measured the DMA transfer throughput on a Kinetis K64 MCU. Using DMA effectively improves system performance, and I would like to use it.

However, I couldn’t reach the full performance stated in the reference manual.

I think the DMA on Kinetis MCUs is highly flexible and rich in features. I would encourage you to consider using it.