Need Direct IO for ramfs or tmpfs or is there another Linux RAM disk that supports Direct IO

My use cases include loading disk images into a RAM disk from Samsung 960 Pro NVMe SSDs. The disk images are typically 40 GB or more, but the RAM disks in Linux are very slow. I’m getting about 2.2 GB/sec write speed from the Samsung 960 Pros (2 in RAID 0) to ramfs and tmpfs, while I was expecting speeds to exceed 6 GB/sec. Using dd to benchmark ramfs and tmpfs gave similar numbers (i.e. about 2 GB/sec). In Windows 10, the benchmark results I got from CrystalDiskMark on a RAM disk created by ImDisk were about 7 GB/sec for sequential reads (queue depth=32, threads=1) and about 10 GB/sec for sequential writes. The only reason I can see that would make the Linux RAM disks slower is their lack of support for direct IO.

I’ve found this mailing list https://lists.gt.net/linux/kernel/720702 discussing direct IO support for tmpfs, and most people there were actually in favor of it, so I don’t understand why, in the 12 years since the last post in that thread, direct IO support still hasn’t been added to tmpfs…except that Linus Torvalds seems to hate(?) it. I’ve also found an exchange between Linus Torvalds and Dave Chinner from about 5 months ago where Linus declared BS on Dave’s statement, “That said, the page cache is still far, far slower than direct IO”.

See Linus Still Based and Caches Are Faster than Direct IO - LinuxReviews

Dave is right, and I know that from personal experience. For example, qemu has a “writethrough” cache mode, which uses the host cache, and a “none” cache mode, which uses direct IO semantics. When writing to a WD Black 7200 rpm hard disk with the “writethrough” cache mode I get speeds of about 30 MB/sec, but using the “none” cache mode I get about 200 MB/sec. These speeds are reported by the Windows Explorer file copy in a Windows 10 VM running under qemu 4.0. This is an easily replicable test that Linus can do to verify that Dave is indeed right.
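For reference, the two cache modes correspond to qemu drive options along these lines (the image filename and format here are just placeholders, not my actual setup):

-drive file=win10.img,format=qcow2,cache=writethrough   # goes through the host page cache; writes are flushed through to storage
-drive file=win10.img,format=qcow2,cache=none           # bypasses the host page cache (the image is opened with O_DIRECT)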

Is there a different Linux RAM disk (i.e. besides ramfs and tmpfs) I can use that supports direct IO? And can someone please explain to me why, after 12 years, direct IO still isn’t supported for tmpfs or ramfs, and what the chances are of direct IO support for ramfs and/or tmpfs happening if I request it?

I think you’re comparing apples with oranges here.

First off, the benchmark is a bit flawed for two reasons:

  • cache=none actually still has a disk cache on the guest end.
  • The guest OS is already performing its own caching anyway, so having a caching system running on top of another caching system isn’t very useful.

Here’s the bigger issue though: if you read through that thread, you’ll actually see that O_DIRECT, if implemented, would do absolutely nothing. The proposal was to have it be accepted but be a noop, vs the current situation where any calls that use O_DIRECT on a tmpfs file will fail.

The reason for this is that the purpose of O_DIRECT is to bypass the page cache while working with files. However, a tmpfs itself lives inside the page cache, so there’s nothing that can be bypassed. Therefore, O_DIRECT on a tmpfs can’t really do…anything. As the Debian wiki describes ramfs/tmpfs:

Basically, you’re mounting the disk cache as a filesystem.
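You can see this directly. Assuming /dev/shm is a tmpfs mount (it usually is), an O_DIRECT write attempt is simply refused:

dd if=/dev/zero of=/dev/shm/testfile oflag=direct bs=1M count=16
# fails immediately: tmpfs rejects the O_DIRECT open with EINVAL (“Invalid argument”)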

It’s hard to say why you’re seeing a performance difference, but some guesses:

  • You didn’t specify what you’re using to benchmark it on Linux, and I’m guessing it’s not CrystalDiskMark. Different tools may manipulate the ramdisk in different ways, so it’s hard to make an accurate comparison here.
  • ImDisk may store files in a different area of RAM without relying on a cache implementation. Therefore, direct I/O performance (not sure how CrystalDiskMark benchmarks, but I’m guessing it tries to bypass caches) will be faster…but general-purpose use will be slower, because you have the Windows page cache running on top. That results in more back-and-forth copying in RAM, vs. Linux’s implementation, where the page cache is the tmpfs.

cache=none actually still has a disk cache on the guest end.

The important thing is that cache=none is a lot faster than cache=writethrough, so Dave is right. Whether the guest performs its own caching is irrelevant, as long as it does (or doesn’t) for both cache modes, leaving the only determining factor to be whether direct IO or the host cache is faster.

Here’s the bigger issue though: if you read through that thread, you’ll actually see that O_DIRECT, if implemented, would do absolutely nothing.

…really?

However, a tmpfs itself lives inside the page cache, so there’s nothing that can be bypassed. Therefore, O_DIRECT on a tmpfs can’t really do…anything.

Oh…OK…what about ramfs? Does ramfs also live inside the page cache?

You didn’t specify what you’re using to benchmark it on Linux, and I’m guessing it’s not CrystalDiskMark.

I already mentioned I was using dd to write to ramfs and tmpfs. I also tested it manually by timing how long it took to copy a large file to tmpfs/ramfs, and the results were about the same as with dd.
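Something along these lines (the paths and block size here are illustrative, not necessarily exactly what I ran):

dd if=/path/to/disk_image.img of=/mnt/tmpfs/disk_image.img bs=64M status=progress
# oflag=direct can’t be used on the tmpfs side (it gets rejected), so the writes go through the page cache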

ImDisk may store files in a different area of RAM without relying on a cache implementation.

Whatever ImDisk is doing, I need it done in Linux, whether via updates to tmpfs or ramfs or a newly created tool. Can someone please point me to a page that explains how I can request a new tool/feature for Linux (I’m new to Linux)?

Therefore, direct I/O performance (not sure how CrystalDiskMark benchmarks, but I’m guessing it tries to bypass caches) will be faster…but general-purpose use will be slower, because you have the Windows page cache running on top.

hmm…that’s something I can test out tomorrow.

Indeed, on Linux both tmpfs and ramfs are basically direct access to the page cache.
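(For anyone following along: both are mounted straight out of kernel memory, e.g. something like the following, where the mount points and size are just examples:)

sudo mount -t tmpfs -o size=64G tmpfs /mnt/tmpfs-disk   # size-limited, pages can be swapped out
sudo mount -t ramfs ramfs /mnt/ramfs-disk               # no size limit enforced, never swaps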

Welp, must’ve missed that part :sweat_smile: Do note that sequential operations on a single large chunk of data usually aren’t too indicative of the overall performance.

Unfortunately, I don’t think this is generally desired, or simple enough for someone to implement, most notably since it would essentially be an entirely new in-memory filesystem that’s similar to tmpfs in most respects…

Yep, I’m aware that sequential operations on a single large chunk of data usually aren’t too indicative of overall performance. General responsiveness of a system is better reflected by low-queue-depth, low-thread-count disk IO benchmark figures, but having the virtual machines run from a RAM disk that supports direct IO does greatly increase responsiveness, by speeding up disk IO operations since they bypass the host’s page cache.

I really have 2 problems here related to ramfs/tmpfs lacking direct IO support (or whatever it is that ImDisk is doing). My first problem is the length of time it takes to copy disk images from my SSDs to ramfs/tmpfs, which as I noted earlier is about 2 GB/sec, whereas with ImDisk I’m getting speeds of up to 13.7 GB/sec. The 13.7 GB/sec was obtained by copying a 45 GiB file from one RAM disk to another, with both RAM disks created using ImDisk. To do the copy of the 45 GiB file, I use the command “Robocopy R: S: /J /MT”, where the “/J” option lets me do a direct (unbuffered) IO copy.

My second problem is that the virtual machines are nowhere near as responsive as they should be when running from ramfs/tmpfs. Since neither ramfs nor tmpfs supports direct IO, I am forced to use “cache=writethrough” for the virtual machines, and when the virtual machines are benchmarked with CrystalDiskMark I get about 65 MB/sec for 4 KiB (queue depth=1, thread count=1) reads and about 48 MB/sec for the corresponding writes, whereas the same tests on an ImDisk RAM disk give about 400 MB/sec for both reads and writes.

I can’t understand why this wouldn’t generally be desired. Having a faster RAM disk in Linux sounds like a win, and as the figures I’ve given show, ramfs and tmpfs are really slow.

I think it’s more of a thing where the current state is “fast enough” for most purposes. That being said, I’m not that familiar with Linux virtualization in particular, so I’d also recommend asking around on qemu/KVM-related communities to see if there’s anything else that could be used to speed things up or stuff that might be a bottleneck.

There are alternative Linux ramdisk implementations out there. Many are commercial/proprietary, but there’s at least RapidDisk on the FLOSS end of things. I know nothing about RapidDisk other than that it exists, but it’s an example of a ramdisk implementation that isn’t constrained by the tmpfs model. (Whether any advantages it purports to offer over tmpfs are real or merely benchmark-goosing, I couldn’t say.)

Without having read any of those kernel discussions, my guess is that the objection to worrying about tmpfs performance is that it’s not intended to be a high-performance storage system. It’s meant to be a convenient temporary storage system.

If your (any) application is worried about cache R/W performance, it probably shouldn’t be caching on a filesystem at all. It should be caching directly in memory, where it will not be subject to the performance constraints of tmpfs. If an application wants to take advantage of the lazy convenience of caching on a filesystem, then it’s not really worried about cache performance. As @refi64 said, for what it’s actually for, tmpfs is “good enough”.


Thank you, I’ll test that out later :smiley:

I am trying to place the disk images directly in memory, but I’m not sure how to do that without using a RAM disk.

hmm…how do I do that in Linux?

I’m not sure what you mean by “lazy convenience”, but what I’m after is somewhere (i.e. a RAM disk, or directly in memory if possible) I can copy disk images to without bottlenecking the 2 Samsung 960 Pros in RAID 0 that the disk images are stored on, and that supports direct IO, so that I can use tools that write directly to it (bypassing the page cache) and also run the disk images in Qemu with the “cache=none” option.

I mean, being able to use filesystem semantics for memory I/O. Being able to allocate a chunk of memory and assign it a pathname where it can be transparently accessed by any code that operates on filesystem paths. A ram disk is a “trick” to do file I/O without the file. It’s always going to be less efficient than genuine memory operations because it still goes through the disk I/O subsystem.

Store data in memory? malloc and memcpy.

Point is, you don’t do it as an enduser, you do it in the application. In this case, Qemu. The application needs to implement its own direct memory access. And Qemu does, in very advanced ways.

Look, I have no idea why anyone would want to do what you’re trying to do, but as execution plans go, copying multi-gigabyte disk images into the host ramdisk, then pointing Qemu’s filesystem drivers at those images as uncached block devices, does not feel like the right way to go about it to me. Qemu has a vast array of excellent tools for allocating and managing guest resources in an efficient manner, and even where its existing tools are lacking the whole platform is highly extensible. I’d advise making use of those tools, rather than trying to bypass or manipulate them. Even if there’s zero practical justification for this and all you care about is completely meaningless benchmark numbers, I guaran-damn-tee you’ll get better ones that way.

One possible approach might be to use a Qemu ramdisk driver, one that loads and runs qcow2 images entirely in memory, passing the guest a handle to that memory region for access. Such a thing may very well exist (it wouldn’t surprise me), and if not it could certainly be written. But it would make far more sense, and work far better, than copying disk images into a host ramdisk and passing their paths to the Qemu filesystem drivers. That’s… just not right.

So I again agree with @refi64 — you should be talking to Qemu people about the right way to approach this. (But perhaps you already have, and you didn’t like their answers, so now you’re here?)

Regardless, I’m out. Good luck with whatever you’re trying to achieve.


I’m not trying to use filesystem semantics for memory I/O.

Well, it’s not that I need it accessible to “any” code that operates on filesystem paths. I just need Qemu to be able to access the disk image in memory, by whatever means, in a way that supports direct IO and doesn’t bottleneck the speed of copying the disk image off the SSDs it’s stored on.

I’m not trying to use a RAM disk as a “trick” to do I/O without the file, and of course it’s always going to be less efficient than genuine memory operations because it still goes through the disk I/O subsystem. I’m just trying to use a RAM disk like any other disk; the issue with ramfs and tmpfs is that they are very slow compared to other RAM disks.

I don’t like the sound of studying the Qemu source, modifying it and then building it…that sounds extremely complicated and time consuming.

I’ve already explained what I am doing in my previous posts, along with the advantages. Basically, running disk images from a RAM disk as opposed to slower storage devices like SSDs and hard drives greatly increases speed in the virtual machines, in terms of both bandwidth and latency. Furthermore, running the disk images from a RAM disk has the added advantage that intense disk write activity will not degrade the RAM backing the RAM disk, unlike an SSD or hard drive, which degrades the more it is written to.

Yep, I’m aware. Qemu has a “cache=none” option for setting up disks for virtual machines, which enables direct IO disk writes, BUT this option cannot be used on disk images stored on a filesystem that does not support direct IO, like ramfs and tmpfs. I’ve already mentioned this in an earlier post, and if you check the previous posts you can see the huge speed difference between writing to a disk that supports direct IO and one that doesn’t.
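To be concrete, something like this (the paths here are placeholders) fails at startup, because with cache=none Qemu opens the image with O_DIRECT and tmpfs refuses that open:

qemu-system-x86_64 -m 8G -drive file=/mnt/tmpfs/disk_image.img,format=raw,cache=none
# refused: the O_DIRECT open of the image fails on tmpfs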

I’ve never heard of a “Qemu ramdisk driver” before and I just did a quick search for it and couldn’t find anything. I’m pretty sure it doesn’t exist and I don’t think I’ll be writing Qemu RAM disk drivers because that’s a pretty steep learning curve plus I’ve got plenty of other development projects on my plate.

What? That makes no sense. It’s a lot more troublesome to download the Qemu source, study it, modify it, and then build Qemu from the modified source code than to just give Qemu a path to the disk images stored in a RAM disk.

hmm…I think you guys are really confused by what I’m trying to do. I’m just trying to use a RAM disk like any other storage device. Other storage devices provide support for direct IO, and I need a RAM disk (in Linux) that supports direct IO too. Storage devices with direct IO support allow far greater write speeds than storage devices without it. This is not an issue with Qemu; it’s a general Linux issue, since the 2 RAM disks that come with Linux do not support direct IO, so it doesn’t make sense to specifically ask the Qemu people.

I just finished testing out RapidDisk, which @ferdnyc suggested I look at a few posts ago, and it looks exactly like what I need. However, the copy speed of the disk image from my 2 Samsung 960 Pro SSDs in RAID 0 to a RapidDisk RAM disk isn’t too far from that of just copying the disk image to ramfs or tmpfs. Here are my findings.

Copying from the SSDs to a RapidDisk RAM disk using,

sudo dd iflag=direct oflag=direct if=/run/media/adam/15551c6d-3eb4-4352-97b9-2f0cdeb0b8cb/vm/windows_server_2019_libvert/server_2019.img of=/mnt/temp2/randomfile bs=67108864

gives me between 2.3 GB/s and 3.4 GB/s from 3 runs.

So I was thinking there might be a bottleneck somewhere, so I checked the read speed of RapidDisk using,

sudo dd iflag=direct if=/mnt/temp2/randomfile of=/dev/null bs=1048576

and I got between 13.1 GB/s and 14.1 GB/s from 3 runs, which is on par with what I was seeing from ImDisk on Windows.

I then checked the write speed of RapidDisk using,

sudo dd oflag=direct if=/dev/zero of=/mnt/temp2/randomfile bs=2097152 count=7680

and got between 5.2 GB/s and 5.5 GB/s from 3 runs.

So the write speed of RapidDisk is indeed a bottleneck, but just to check, I also did a test copying the disk image from one RapidDisk RAM disk to another using the following command,

sudo dd iflag=direct oflag=direct if=/mnt/temp/randomfile of=/mnt/temp2/randomfile bs=2097152

and got between 4.5 GB/s and 4.6 GB/s from 3 runs. The RapidDisk write speeds are a lot slower than its read speeds, which explains the RapidDisk-to-RapidDisk copy speed, but that still doesn’t explain the slow speeds when copying the disk image from my SSDs in RAID 0 to a RapidDisk RAM disk…this particular bottleneck could be coming from mdadm. I used Linux’s mdadm tool to create the RAID 0. In theory, 2 Samsung 960 Pro SSDs in RAID 0 should provide sequential read speeds exceeding 6 GB/s, but I’m not getting anywhere near that, or even the roughly 4.5 GB/s I’d expect given the bottleneck imposed by RapidDisk’s write speed.
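(One way to isolate whether mdadm is the bottleneck would be to read the md array device directly with direct IO, something like the following, where /dev/md0 is just an assumption for whatever device node mdadm actually created:)

sudo dd if=/dev/md0 of=/dev/null bs=32M count=1024 iflag=direct
# reads 32 GiB straight off the array, skipping the filesystem and page cache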

EDIT: I decided to test the read speed of the Samsung 960 Pro SSDs in RAID 0 using,

sudo dd iflag=direct if=/run/media/adam/15551c6d-3eb4-4352-97b9-2f0cdeb0b8cb/vm/windows_server_2019_libvert/server_2019.img of=/dev/null bs=33554432

I ran the test 14 times and the results are:
Test 1 = 5.1 GB/s
Test 2 = 5.0 GB/s
Test 3 = 5.4 GB/s
Test 4 = 5.1 GB/s
Test 6 = 6.3 GB/s
Test 7 = 6.4 GB/s
Test 8 = 6.4 GB/s
Test 9 = 6.3 GB/s
Test 10 = 4.7 GB/s
Test 11 = 2.5 GB/s
Test 13 = 6.4 GB/s
Test 14 = 6.4 GB/s

The speeds greater than 6 GB/s look right, but why are there lower speeds, and why is test 11 so slow?

Also, I used the following command to clear the page cache after each test, even though it shouldn’t be necessary since I was using the “oflag=direct” and “iflag=direct” options of dd,

sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'

To find an optimal block size to use with the dd commands, I used the dd_ibs_test.sh script.

From the SSD read tests, it doesn’t appear that the SSDs are the bottleneck the majority of the time…I’m confused…maybe the dd command is the problem? Is there another Linux copy tool that supports direct IO that I can test with?

Hello, I was looking into something similar (why tmpfs is “slow”), and most answers point to its dynamic allocation.
I came across nbdkit, particularly its memory plugin (man page).
Of particular interest might be the option allocator=malloc,mlock=true.
Even though this is part of a Network Block Device, it can be mounted on localhost, and you can preload qemu images.
In theory, the OP could preload the image into a locked RAM disk, then use nbd-client on localhost for the VM. (I haven’t tested that yet.)
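A rough, untested sketch of that idea (the size, device node, and image path are placeholders, and locking that much memory will likely require raising the memlock ulimit):

sudo nbdkit memory 64G allocator=malloc,mlock=true   # serve a fully allocated, locked, RAM-backed 64G disk over NBD
sudo modprobe nbd                                    # make the nbd client kernel module available
sudo nbd-client localhost /dev/nbd0                  # attach the export as a local block device
sudo dd if=/path/to/disk_image.img of=/dev/nbd0 bs=64M iflag=direct oflag=direct   # preload the image
# then point Qemu at /dev/nbd0 with cache=none, since block devices accept O_DIRECT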