
I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA fabric that has DMA capability (on an AXI bus). We've developed a circuit and are running Linux on the ARM cores. We're having performance problems accessing a DMA buffer from user space after it's been filled by hardware.

Summary:

We have pre-reserved at boot time a section of DRAM for use as a large DMA buffer. We're apparently using the wrong APIs to map this buffer, because it appears to be uncached, and the access speed is terrible.

Even using it as a bounce buffer is untenably slow. As I understand it, ARM caches are not DMA-coherent, so I would really appreciate some insight on how to do the following:

  1. Map a region of DRAM into the kernel virtual address space but ensure that it is cacheable.
  2. Ensure that mapping it into userspace doesn't also have an undesirable effect, even if that requires that we provide the mmap call in our own driver.
  3. Explicitly invalidate a region of physical memory from the cache hierarchy before doing a DMA, to ensure coherency.

More info:

I've been trying to research this thoroughly before asking. Unfortunately, this being an ARM SoC/FPGA, there's very little information available on this, so I have to ask the experts directly.

Since this is an SoC, a lot of stuff is hard-coded for u-boot. For instance, the kernel and a ramdisk are loaded to specific places in DRAM before handing control over to the kernel. We've taken advantage of this to reserve a 64MB section of DRAM for a DMA buffer (it does need to be that big, which is why we pre-reserve it). There isn't any worry about conflicting memory types or the kernel stomping on this memory, because the boot parameters tell the kernel what region of DRAM it has control over.

Initially, we tried to map this physical address range into kernel space using ioremap, but that appears to mark the region uncacheable, and the access speed is horrible, even if we try to use memcpy to make it a bounce buffer. We also use /dev/mem to map this region into userspace, and I've timed memcpy there at around 70MB/sec.
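
For reference, the userspace side of that measurement looks roughly like this (the physical base address below is just a placeholder, not our real one):

    /* Rough sketch of our current (slow) access path via /dev/mem.
     * PHYS_BASE and BUF_SIZE are placeholders for the reserved region. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    #define PHYS_BASE 0x1C000000UL   /* hypothetical start of the reserved 64MB */
    #define BUF_SIZE  (64UL << 20)

    int main(void)
    {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        /* With O_SYNC, /dev/mem gives an uncached mapping on ARM, which is
         * a large part of why memcpy from it is so slow. */
        void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, PHYS_BASE);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        void *dst = malloc(BUF_SIZE);
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, buf, BUF_SIZE);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.1f MB/s\n", (double)(BUF_SIZE >> 20) / s);

        munmap(buf, BUF_SIZE);
        free(dst);
        close(fd);
        return 0;
    }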

Based on a fair amount of searching on this topic, it appears that although half the people out there want to use ioremap like this (which is probably where we got the idea from), ioremap is not supposed to be used for this purpose and that there are DMA-related APIs that should be used instead. Unfortunately, it appears that DMA buffer allocation is totally dynamic, and I haven't figured out how to tell it, "here's a physical address already allocated -- use that."

One document I looked at is this one, but it's way too x86 and PC-centric: https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt

And this question also comes up at the top of my searches, but there's no real answer: get the physical address of a buffer under Linux

Looking at the standard calls, dma_set_mask_and_coherent and family won't take a pre-defined address, and they want a device structure for PCI. I don't have such a structure, because this is an ARM SoC without PCI. I could manually populate such a structure, but that smells to me like abusing the API, not using it as intended.

BTW: This is a ring buffer, where we DMA data blocks into different offsets, but we align to cache line boundaries, so there is no risk of false sharing.

Thanks a million for any help you can provide!

UPDATE: It appears that there's no such thing as a cacheable DMA buffer on ARM if you do it the normal way. Maybe if I skip the ioremap call, the region won't be marked uncacheable, but then I have to figure out how to do cache management on ARM, which I haven't been able to work out. One of the problems is that memcpy from this buffer in userspace performs terribly. Is there a memcpy implementation optimized for uncached memory that I can use? Maybe I could write one. I also need to find out whether this processor has NEON.


4 Answers


Have you tried implementing your own char device with mmap() that remaps the buffer as cacheable (via remap_pfn_range())?
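
A minimal sketch of that approach (untested; dmabuf_phys and dmabuf_size stand in for the pre-reserved region). The key point is to leave vma->vm_page_prot alone rather than applying pgprot_noncached() before calling remap_pfn_range():

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/module.h>

    static phys_addr_t dmabuf_phys = 0x1C000000;       /* hypothetical */
    static size_t      dmabuf_size = 64 * 1024 * 1024;

    static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        unsigned long len = vma->vm_end - vma->vm_start;

        if (len > dmabuf_size)
            return -EINVAL;

        /* Leaving vm_page_prot untouched keeps the userspace mapping
         * cacheable (unlike ioremap()/pgprot_noncached()). */
        return remap_pfn_range(vma, vma->vm_start,
                               dmabuf_phys >> PAGE_SHIFT,
                               len, vma->vm_page_prot);
    }

    static const struct file_operations mydrv_fops = {
        .owner = THIS_MODULE,
        .mmap  = mydrv_mmap,
    };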

Answered 2016-01-26

If you want a cached mapping, I believe you need a driver that implements mmap().

For this we use two device drivers: portalmem and zynqportal. In the Connectal Project, we call the connection between user-space software and FPGA logic a "portal". These drivers depend on dma-buf, which has been stable for us since Linux kernel version 3.8.x.

The portalmem driver provides an ioctl to allocate a reference-counted chunk of memory, returning a file descriptor associated with that memory. This driver implements dma-buf sharing. It also implements mmap() so that user-space applications can access the memory.

At allocation time, the application can choose a cached or uncached mapping of the memory. On x86, the mapping is always cached. Our mmap() implementation currently starts at line 173 of the portalmem driver. If the mapping is uncached, it modifies vma->vm_page_prot with pgprot_writecombine(), which enables write buffering but disables caching.
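
In rough outline (a sketch, not the actual portalmem source; struct portal_buf is a hypothetical stand-in for the driver's per-buffer state):

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/types.h>

    struct portal_buf {           /* hypothetical per-buffer state */
        phys_addr_t phys;
        bool        cached;
    };

    static int portal_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        struct portal_buf *buf = filp->private_data;
        unsigned long len = vma->vm_end - vma->vm_start;

        /* Cached mapping: leave vm_page_prot as-is.
         * Uncached mapping: write-combining (buffered writes, no caching). */
        if (!buf->cached)
            vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

        return remap_pfn_range(vma, vma->vm_start, buf->phys >> PAGE_SHIFT,
                               len, vma->vm_page_prot);
    }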

The portalmem driver also provides an ioctl to invalidate, and optionally write back, data cache lines.
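
For illustration, such a cache-maintenance ioctl could be built on the streaming DMA API, roughly like this (all names here are hypothetical, and it assumes the buffer was previously mapped with dma_map_single() so a DMA handle is available):

    #include <linux/dma-mapping.h>
    #include <linux/fs.h>
    #include <linux/ioctl.h>
    #include <linux/uaccess.h>

    struct sync_req {
        __u64 offset;
        __u64 length;
    };

    #define MYDRV_SYNC_FOR_DEVICE _IOW('p', 1, struct sync_req) /* clean/writeback */
    #define MYDRV_SYNC_FOR_CPU    _IOW('p', 2, struct sync_req) /* invalidate */

    struct mydrv_dev {
        struct device *dev;        /* the DMA-capable device */
        dma_addr_t     dma_handle; /* from dma_map_single() on the buffer */
    };

    static long mydrv_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
    {
        struct mydrv_dev *drv = filp->private_data;
        struct sync_req req;

        if (copy_from_user(&req, (void __user *)arg, sizeof(req)))
            return -EFAULT;

        switch (cmd) {
        case MYDRV_SYNC_FOR_DEVICE:
            /* CPU is done writing: write dirty lines back before the device reads. */
            dma_sync_single_for_device(drv->dev, drv->dma_handle + req.offset,
                                       req.length, DMA_TO_DEVICE);
            return 0;
        case MYDRV_SYNC_FOR_CPU:
            /* Device has written: invalidate stale lines before the CPU reads. */
            dma_sync_single_for_cpu(drv->dev, drv->dma_handle + req.offset,
                                    req.length, DMA_FROM_DEVICE);
            return 0;
        default:
            return -ENOTTY;
        }
    }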

The portalmem driver knows nothing about the FPGA. For that, we use the zynqportal driver, which provides an ioctl for transferring a translation table to the FPGA, so that we can use logically contiguous addresses on the FPGA and translate them to the actual DMA addresses. The allocation scheme used by portalmem is designed to produce compact translation tables.

We use the same portalmem driver with pcieportal for PCI Express-attached FPGAs, with no changes to the user software.

Answered 2016-02-02

The Zynq has NEON instructions. An assembly memcpy implementation that uses NEON instructions and copies with cache-line (32-byte) alignment will reach rates of 300 MB/s or higher.
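
For illustration, a rough C equivalent using NEON intrinsics rather than hand-written assembly (it assumes both pointers are 32-byte aligned and the length is a multiple of 32; build with -mfpu=neon for the Zynq's Cortex-A9):

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Copy len bytes, 32 bytes (one cache line) per iteration. */
    static void neon_copy(uint8_t *dst, const uint8_t *src, size_t len)
    {
        for (size_t i = 0; i < len; i += 32) {
            uint8x16_t lo = vld1q_u8(src + i);        /* 16-byte NEON load */
            uint8x16_t hi = vld1q_u8(src + i + 16);
            vst1q_u8(dst + i, lo);                    /* 16-byte NEON store */
            vst1q_u8(dst + i + 16, hi);
        }
    }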

Answered 2017-07-20

I struggled with this for a while with udmabuf, and found that the answer was as simple as adding dma_coherent; to its entry in the device tree. I saw a dramatic speedup in access time from this simple step, although I still need to add code to invalidate/flush whenever ownership is transferred to or from the device.

Answered 2022-02-01