qcow2: Default to 4KB for the qcow2 cache entry size

QEMU 2.12 (commit 1221fe6f636754ab5f2c1c87caa77633e9123622) introduced
a new setting called l2-cache-entry-size that allows making entries on
the qcow2 L2 cache smaller than the cluster size.

I have been performing several tests with different cluster and entry
sizes and all of them show that reducing the entry size (aka L2 slice)
consistently improves I/O performance, notably during random I/O (all
tests done with sequential I/O show similar results). This is to be
expected because loading and evicting an L2 slice is more expensive
the larger the slice is.
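
To make the cost argument concrete, here is a rough sketch of how much
one slice holds and maps. It assumes 8-byte L2 entries (the same factor
of 8 used in the max_l2_cache computation in block/qcow2.c). The helper
names are made up for illustration and are not QEMU functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one L2 entry is 8 bytes and maps exactly one cluster. */
#define L2_ENTRY_BYTES 8

/* Number of L2 entries held by one cache entry (slice). */
static uint64_t entries_per_slice(uint64_t slice_size)
{
    assert(slice_size >= L2_ENTRY_BYTES);
    return slice_size / L2_ENTRY_BYTES;
}

/* Bytes of guest data mapped by one slice. */
static uint64_t slice_coverage_bytes(uint64_t slice_size,
                                     uint64_t cluster_size)
{
    return entries_per_slice(slice_size) * cluster_size;
}

/* Number of slices that fit in an L2 cache of the given size. */
static uint64_t slices_in_cache(uint64_t l2_cache_size, uint64_t slice_size)
{
    assert(slice_size > 0);
    return l2_cache_size / slice_size;
}
```

With 64 KB clusters, a 4 KB slice holds 512 entries and maps 32 MB of
guest data, while a 64 KB slice holds 8192 entries and maps 512 MB. A
1 MB cache holds 256 of the former but only 16 of the latter. The total
amount of mapped data is the same in both cases, so any gain has to come
from moving fewer bytes per cache miss.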

Here are some numbers on fully populated 40GB qcow2 images. The
rightmost column represents the maximum L2 cache size in both cases.

Cluster size = 64 KB
|-------------+--------------+--------------+--------------|
|             | 1MB L2 cache | 3MB L2 cache | 5MB L2 cache |
|-------------+--------------+--------------+--------------|
| 4KB slices  |    6545 IOPS |   12045 IOPS |   55680 IOPS |
| 16KB slices |    5177 IOPS |    9798 IOPS |   56278 IOPS |
| 64KB slices |    2718 IOPS |    5326 IOPS |   57355 IOPS |
|-------------+--------------+--------------+--------------|

Cluster size = 256 KB
|--------------+----------------+--------------+-----------------|
|              | 512KB L2 cache | 1MB L2 cache | 1280KB L2 cache |
|--------------+----------------+--------------+-----------------|
| 4KB slices   |      8539 IOPS |   21071 IOPS |      55417 IOPS |
| 64KB slices  |      3598 IOPS |    9772 IOPS |      57687 IOPS |
| 256KB slices |      1415 IOPS |    4120 IOPS |      58001 IOPS |
|--------------+----------------+--------------+-----------------|

As can be seen in the numbers, the only exception to the rule is when
the cache is large enough to hold all L2 tables. This is also to be
expected because in this case no cache entry is ever evicted so
reducing its size doesn't bring any benefit.
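
The rightmost column of each table can be checked with the formula that
block/qcow2.c uses for max_l2_cache, namely
virtual_disk_size / (cluster_size / 8). The following is a minimal
standalone sketch of that arithmetic. The function name is hypothetical,
and 40GB is taken to mean 40 * 2^30 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Each L2 entry is 8 bytes and maps one cluster of guest data. */
#define L2_ENTRY_BYTES 8

/* Bytes of L2 cache needed to map the whole virtual disk. */
static uint64_t max_l2_cache_bytes(uint64_t virtual_disk_size,
                                   uint64_t cluster_size)
{
    assert(cluster_size >= L2_ENTRY_BYTES);
    return virtual_disk_size / (cluster_size / L2_ENTRY_BYTES);
}
```

For a 40GB image this gives 5MB with 64 KB clusters and 1280KB with
256 KB clusters, matching the rightmost column headers above.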

This patch sets the default L2 cache entry size to 4KB except when the
cache is large enough for the whole disk.
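
The selection rule can be sketched as a standalone function. This is an
illustration of the logic only, not the code in block/qcow2.c: the
function name and the user_entry_size parameter are hypothetical, and
the fallback to the cluster size reflects the default described in
docs/qcow2-cache.txt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Pick the L2 cache entry size.
 *  - An explicit l2-cache-entry-size setting is used as given.
 *  - Otherwise, if the cache cannot cover the whole disk, use 4KB
 *    (or the cluster size if that is smaller).
 *  - Otherwise keep the entry size equal to the cluster size.
 */
static uint64_t pick_l2_entry_size(uint64_t l2_cache_size,
                                   uint64_t max_l2_cache,
                                   uint64_t cluster_size,
                                   bool entry_size_set,
                                   uint64_t user_entry_size)
{
    assert(cluster_size > 0);

    if (entry_size_set) {
        return user_entry_size;
    }
    if (l2_cache_size < max_l2_cache) {
        return cluster_size < 4096 ? cluster_size : 4096;
    }
    return cluster_size;
}
```

For example, a 1MB cache on an image that needs 5MB of L2 cache gets
4KB entries with 64 KB clusters, whereas a 5MB cache on the same image
keeps 64 KB entries.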

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

Authored by Alberto Garcia and committed by Kevin Wolf
af39bd0d b74b1ade

+23 -6
+12
block/qcow2.c
···
     BDRVQcow2State *s = bs->opaque;
     uint64_t combined_cache_size, l2_cache_max_setting;
     bool l2_cache_size_set, refcount_cache_size_set, combined_cache_size_set;
+    bool l2_cache_entry_size_set;
     int min_refcount_cache = MIN_REFCOUNT_CACHE_SIZE * s->cluster_size;
     uint64_t virtual_disk_size = bs->total_sectors * BDRV_SECTOR_SIZE;
     uint64_t max_l2_cache = virtual_disk_size / (s->cluster_size / 8);
···
     combined_cache_size_set = qemu_opt_get(opts, QCOW2_OPT_CACHE_SIZE);
     l2_cache_size_set = qemu_opt_get(opts, QCOW2_OPT_L2_CACHE_SIZE);
     refcount_cache_size_set = qemu_opt_get(opts, QCOW2_OPT_REFCOUNT_CACHE_SIZE);
+    l2_cache_entry_size_set = qemu_opt_get(opts, QCOW2_OPT_L2_CACHE_ENTRY_SIZE);

     combined_cache_size = qemu_opt_get_size(opts, QCOW2_OPT_CACHE_SIZE, 0);
     l2_cache_max_setting = qemu_opt_get_size(opts, QCOW2_OPT_L2_CACHE_SIZE,
···
             }
         }
     }
+
+    /*
+     * If the L2 cache is not enough to cover the whole disk then
+     * default to 4KB entries. Smaller entries reduce the cost of
+     * loads and evictions and increase I/O performance.
+     */
+    if (*l2_cache_size < max_l2_cache && !l2_cache_entry_size_set) {
+        *l2_cache_entry_size = MIN(s->cluster_size, 4096);
+    }
+
     /* l2_cache_size and refcount_cache_size are ensured to have at least
      * their minimum values in qcow2_update_options_prepare() */

+11 -6
docs/qcow2-cache.txt
···

 Using smaller cache entries
 ---------------------------
-The qcow2 L2 cache stores complete tables by default. This means that
-if QEMU needs an entry from an L2 table then the whole table is read
-from disk and is kept in the cache. If the cache is full then a
-complete table needs to be evicted first.
+The qcow2 L2 cache can store complete tables. This means that if QEMU
+needs an entry from an L2 table then the whole table is read from disk
+and is kept in the cache. If the cache is full then a complete table
+needs to be evicted first.

 This can be inefficient with large cluster sizes since it results in
 more disk I/O and wastes more cache memory.
···

    -drive file=hd.qcow2,l2-cache-size=2097152,l2-cache-entry-size=4096

+Since QEMU 4.0 the value of l2-cache-entry-size defaults to 4KB (or
+the cluster size if it's smaller).
+
 Some things to take into account:

  - The L2 cache entry size has the same restrictions as the cluster
···

  - Try different entry sizes to see which one gives faster performance
    in your case. The block size of the host filesystem is generally a
-   good default (usually 4096 bytes in the case of ext4).
+   good default (usually 4096 bytes in the case of ext4, hence the
+   default).

  - Only the L2 cache can be configured this way. The refcount cache
    always uses the cluster size as the entry size.
···
    (as explained in the "Choosing the right cache sizes" and "How to
    configure the cache sizes" sections in this document) then none of
    this is necessary and you can omit the "l2-cache-entry-size"
-   parameter altogether.
+   parameter altogether. In this case QEMU makes the entry size
+   equal to the cluster size by default.


 Reducing the memory usage