qemu with hax to log dma reads & writes jcs.org/2018/11/12/vfio

block: posix: Always allocate the first block

When creating an image with preallocation "off" or "falloc", the first
block of the image is typically not allocated. When using Gluster
storage backed by an XFS filesystem, reading this block using direct I/O
succeeds regardless of request length, fooling alignment detection.

In this case we fall back to a safe value (4096) instead of the optimal
value (512), which may lead to unneeded data copying when aligning
requests. Allocating the first block avoids the fallback.
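
The probing failure described above can be illustrated with a minimal sketch. It is not QEMU's raw_probe_alignment(); the helper name probe_request_align and the list of candidate lengths are assumptions made for illustration. The idea is to try reads of increasing length at offset 0 and take the first length that succeeds. If even a 1-byte read succeeds, the probe has learned nothing and can only return the safe value.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define SAFE_ALIGN 4096 /* the safe fallback value named in the commit message */

/*
 * Hypothetical helper, not QEMU code: return the smallest request length
 * for which a read at offset 0 succeeds. If a 1-byte read succeeds, the
 * probe was fooled (or fd is not using direct I/O), so return SAFE_ALIGN.
 */
static size_t probe_request_align(int fd)
{
    static const size_t lengths[] = {1, 512, 1024, 2048, 4096};
    size_t result = SAFE_ALIGN;
    void *buf;

    assert(fd >= 0);

    /* Use an aligned buffer so that only the request length varies. */
    if (posix_memalign(&buf, SAFE_ALIGN, SAFE_ALIGN) != 0) {
        return SAFE_ALIGN;
    }

    for (size_t i = 0; i < sizeof(lengths) / sizeof(lengths[0]); i++) {
        if (pread(fd, buf, lengths[i], 0) >= 0) {
            /* A successful 1-byte read carries no alignment information. */
            result = (lengths[i] == 1) ? SAFE_ALIGN : lengths[i];
            break;
        }
    }

    free(buf);
    return result;
}
```

On a file descriptor opened without O_DIRECT every length succeeds, so the sketch returns 4096. That is the same outcome the commit message describes for an unallocated first block on Gluster backed by XFS.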

Since we allocate the first block even with preallocation=off, we no
longer create images with zero disk size:

    $ ./qemu-img create -f raw test.raw 1g
    Formatting 'test.raw', fmt=raw size=1073741824

    $ ls -lhs test.raw
    4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw

And converting the image requires an additional cluster (the required
size grows by 64 KiB, from 393216 to 458752 bytes):

    $ ./qemu-img measure -f raw -O qcow2 test.raw
    required size: 458752
    fully allocated size: 1074135040

When using a format like vmdk with multiple files per image, we allocate
one block per file:

    $ ./qemu-img create -f vmdk -o subformat=twoGbMaxExtentFlat test.vmdk 4g
    Formatting 'test.vmdk', fmt=vmdk size=4294967296 compat6=off hwversion=undefined subformat=twoGbMaxExtentFlat

    $ ls -lhs test*.vmdk
    4.0K -rw-r--r--. 1 nsoffer nsoffer 2.0G Aug 27 03:23 test-f001.vmdk
    4.0K -rw-r--r--. 1 nsoffer nsoffer 2.0G Aug 27 03:23 test-f002.vmdk
    4.0K -rw-r--r--. 1 nsoffer nsoffer  353 Aug 27 03:23 test.vmdk
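
The non-zero disk size in the listings above comes from writing one block of zeroes at offset 0 of an otherwise sparse file. The following is a minimal sketch of that effect, not the patch itself. The helper names and the fixed 4096-byte write are assumptions, and the file descriptor is assumed to be opened without O_DIRECT, so the buffer does not need the alignment that allocate_first_block() takes care of.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define FIRST_BLOCK 4096 /* assumed write size for this sketch */

/* Hypothetical helper: 512-byte units allocated to fd, or -1 on error. */
static long long allocated_units(int fd)
{
    struct stat st;

    return fstat(fd, &st) == 0 ? (long long)st.st_blocks : -1;
}

/*
 * Hypothetical helper: grow an empty file to `size` bytes without
 * allocating data, then write zeroes to the first block.
 * Returns 0 on success, -errno on failure.
 */
static int create_with_first_block(int fd, off_t size)
{
    char *zeroes;
    ssize_t n;
    int ret;

    assert(size >= FIRST_BLOCK);

    if (ftruncate(fd, size) != 0) {
        return -errno;
    }

    zeroes = calloc(1, FIRST_BLOCK);
    if (zeroes == NULL) {
        return -ENOMEM;
    }

    /* Retry if the write is interrupted by a signal. */
    do {
        n = pwrite(fd, zeroes, FIRST_BLOCK, 0);
    } while (n == -1 && errno == EINTR);

    /* Capture errno before free() has a chance to change it. */
    ret = (n == -1) ? -errno : 0;

    free(zeroes);
    return ret;
}
```

Calling allocated_units() before and after create_with_first_block() shows the change that ls -s reports. The exact count depends on the filesystem's block size and on when it accounts for newly written data.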

I did a quick performance test, copying disks with qemu-img convert to a
new raw target image on Gluster storage with a sector size of 512 bytes:

    for i in $(seq 10); do
        rm -f dst.raw
        sleep 10
        time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
    done

Here is a table comparing the total time spent:

    Type    Before(s)   After(s)    Diff(%)
    ---------------------------------------
    real      530.028    469.123      -11.4
    user       17.204     10.768      -37.4
    sys        17.881      7.011      -60.7

We can see a very clear improvement in CPU usage: user time drops by
37.4% and sys time by 60.7%.
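
The Diff(%) column is the relative change from Before to After. A minimal sketch of that arithmetic follows, with percent_change as a hypothetical helper name. The figures in the table match its results truncated to one decimal place.

```c
#include <assert.h>

/*
 * Hypothetical helper: relative change from `before` to `after`, in
 * percent. Negative values mean the run got faster.
 */
static double percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

For example, percent_change(17.881, 7.011) gives the sys row's reduction of roughly 60.8 percent, which the table lists as -60.7.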

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Message-id: 20190827010528.8818-2-nsoffer@redhat.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>

authored by Nir Soffer and committed by Max Reitz
3a20013f b503de61

+99 -21
+51
block/file-posix.c
···
     return ret;
 }
 
+/*
+ * Help alignment probing by allocating the first block.
+ *
+ * When reading with direct I/O from unallocated area on Gluster backed by XFS,
+ * reading succeeds regardless of request length. In this case we fallback to
+ * safe alignment which is not optimal. Allocating the first block avoids this
+ * fallback.
+ *
+ * fd may be opened with O_DIRECT, but we don't know the buffer alignment or
+ * request alignment, so we use safe values.
+ *
+ * Returns: 0 on success, -errno on failure. Since this is an optimization,
+ * caller may ignore failures.
+ */
+static int allocate_first_block(int fd, size_t max_size)
+{
+    size_t write_size = (max_size < MAX_BLOCKSIZE)
+        ? BDRV_SECTOR_SIZE
+        : MAX_BLOCKSIZE;
+    size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
+    void *buf;
+    ssize_t n;
+    int ret;
+
+    buf = qemu_memalign(max_align, write_size);
+    memset(buf, 0, write_size);
+
+    do {
+        n = pwrite(fd, buf, write_size, 0);
+    } while (n == -1 && errno == EINTR);
+
+    ret = (n == -1) ? -errno : 0;
+
+    qemu_vfree(buf);
+    return ret;
+}
+
 static int handle_aiocb_truncate(void *opaque)
 {
     RawPosixAIOData *aiocb = opaque;
···
                 /* posix_fallocate() doesn't set errno. */
                 error_setg_errno(errp, -result,
                                  "Could not preallocate new data");
+            } else if (current_length == 0) {
+                /*
+                 * posix_fallocate() uses fallocate() if the filesystem
+                 * supports it, or fallback to manually writing zeroes. If
+                 * fallocate() was used, unaligned reads from the fallocated
+                 * area in raw_probe_alignment() will succeed, hence we need to
+                 * allocate the first block.
+                 *
+                 * Optimize future alignment probing; ignore failures.
+                 */
+                allocate_first_block(fd, offset);
             }
         } else {
             result = 0;
···
         if (ftruncate(fd, offset) != 0) {
             result = -errno;
             error_setg_errno(errp, -result, "Could not resize file");
+        } else if (current_length == 0 && offset > current_length) {
+            /* Optimize future alignment probing; ignore failures. */
+            allocate_first_block(fd, offset);
         }
         return result;
     default:
+1 -1
tests/qemu-iotests/059.out
···
 image: TEST_DIR/t.vmdk
 file format: vmdk
 virtual size: 0.977 TiB (1073741824000 bytes)
-disk size: 16 KiB
+disk size: 1.97 MiB
 Format specific information:
     cid: XXXXXXXX
     parent cid: XXXXXXXX
tests/qemu-iotests/150.out → tests/qemu-iotests/150.out.qcow2
+12
tests/qemu-iotests/150.out.raw
···
+QA output created by 150
+
+=== Mapping sparse conversion ===
+
+Offset          Length          File
+0               0x1000          TEST_DIR/t.IMGFMT
+
+=== Mapping non-sparse conversion ===
+
+Offset          Length          File
+0               0x100000        TEST_DIR/t.IMGFMT
+*** done
+13 -6
tests/qemu-iotests/175
···
 # the file size. This function hides the resulting difference in the
 # stat -c '%b' output.
 # Parameter 1: Number of blocks an empty file occupies
-# Parameter 2: Image size in bytes
+# Parameter 2: Minimal number of blocks in an image
+# Parameter 3: Image size in bytes
 _filter_blocks()
 {
     extra_blocks=$1
-    img_size=$2
+    min_blocks=$2
+    img_size=$3
 
-    sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing allocated/" \
-        -e "s/blocks=$((extra_blocks + img_size / 512))\\(\$\\|[^0-9]\\)/everything allocated/"
+    sed -e "s/blocks=$min_blocks\\(\$\\|[^0-9]\\)/min allocation/" \
+        -e "s/blocks=$((extra_blocks + img_size / 512))\\(\$\\|[^0-9]\\)/max allocation/"
 }
 
 # get standard environment, filters and checks
···
 touch "$TEST_DIR/empty"
 extra_blocks=$(stat -c '%b' "$TEST_DIR/empty")
 
+# We always write the first byte; check how many blocks this filesystem
+# allocates to match empty image alloation.
+printf "\0" > "$TEST_DIR/empty"
+min_blocks=$(stat -c '%b' "$TEST_DIR/empty")
+
 echo
 echo "== creating image with default preallocation =="
 _make_test_img $size | _filter_imgfmt
-stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $size
+stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $min_blocks $size
 
 for mode in off full falloc; do
     echo
     echo "== creating image with preallocation $mode =="
     IMGOPTS=preallocation=$mode _make_test_img $size | _filter_imgfmt
-    stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $size
+    stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $min_blocks $size
 done
 
 # success, all done
+4 -4
tests/qemu-iotests/175.out
···
 
 == creating image with default preallocation ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
-size=1048576, nothing allocated
+size=1048576, min allocation
 
 == creating image with preallocation off ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=off
-size=1048576, nothing allocated
+size=1048576, min allocation
 
 == creating image with preallocation full ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=full
-size=1048576, everything allocated
+size=1048576, max allocation
 
 == creating image with preallocation falloc ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=falloc
-size=1048576, everything allocated
+size=1048576, max allocation
 *** done
+2 -2
tests/qemu-iotests/178.out.qcow2
···
 == raw input image with data (human) ==
 
 Formatting 'TEST_DIR/t.qcow2', fmt=IMGFMT size=1073741824
-required size: 393216
+required size: 458752
 fully allocated size: 1074135040
 wrote 512/512 bytes at offset 512
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
···
 
 Formatting 'TEST_DIR/t.qcow2', fmt=IMGFMT size=1073741824
 {
-    "required": 393216,
+    "required": 458752,
     "fully-allocated": 1074135040
 }
 wrote 512/512 bytes at offset 512
+8 -4
tests/qemu-iotests/221.out
···
 === Check mapping of unaligned raw image ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=65537
-[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
-[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 61952, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 61952, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 wrote 1/1 bytes at offset 65536
 1 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 61440, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
 { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
-[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 61440, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
 { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 *** done
+8 -4
tests/qemu-iotests/253.out
···
 === Check mapping of unaligned raw image ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048575
-[{ "start": 0, "length": 1048576, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
-[{ "start": 0, "length": 1048576, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 1044480, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 1044480, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 wrote 65535/65535 bytes at offset 983040
 63.999 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-[{ "start": 0, "length": 983040, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 978944, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 983040, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
-[{ "start": 0, "length": 983040, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 978944, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 983040, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
 *** done