From: Suparna Bhattacharya I noticed a problem with the way do_generic_mapping_read and readahead works for the case of large reads, especially random reads. This was leading to very inefficient behaviour for a stream for AIO reads. (See the results a little later in this note) 1) We should be reading ahead at least the pages that are required by the current read request (even if the ra window is maximally shrunk). I think I've seen this in 2.4 - we seem to have lost that in 2.5. The result is that sometimes (large random reads) we end up doing reads one page at a time waiting for it to complete being reading the next page and so on, even for a large read. (until we buildup a readahead window again) 2) Once the ra window is maximally shrunk, the responsibility for reading the pages and re-building the window is shifted to the slow path in read, which breaks down in the case of a stream of AIO reads where multiple iocbs submit reads to the same file rather than serialise the wait for i/o completion. So here is a patch that fixes this by making sure we do (1) and pushing up the handle_ra_miss calls for the maximally shrunk case before the loop that waits for I/O completion. Does it make a difference ? A lot, actually. Here's a summary of 64 KB random read throughput results running aio-stress for a 1GB file on 2.6.0-test2, 2.6.0-test2-mm4 and 2.6.0-test-mm4 on a 4way test system with this patch: ./aio-stress -o3 testdir/rwfile5 file size 1024MB, record size 64KB, depth 64, ios per iteration 8 max io_submit 8, buffer alignment set to 4KB 2.6.0-test2: ----------- random read on testdir/rwfile5 (8.54 MB/s) 1024.00 MB in 119.91s (not true aio btw - here aio_read is actually fully synchronous since buffered fs aio support isn't in) 2.6.0-test2-mm4: --------------- random read on testdir/rwfile5 (2.40 MB/s) 1024.00 MB in 426.10s (and this is with buffered fs aio support i.e. true aio which demonstrates just how bad the problem is) And with 2.6.0-test2-mm4 + the attached patch (read-speedup.patch) ------------------------------------- random read on testdir/rwfile5 (21.45 MB/s) 1024.00 MB in 47.74s (Throughput is now 2.5x that in vanilla 2.6.0-test2) Just as a reference, here are the throughput results for O_DIRECT aio ( ./aio-stress -O -o3 testdir/rwfile5) on the same kernel: random read on testdir/rwfile5 (17.71 MB/s) 1024.00 MB in 57.81s Note: aio-stress is available at Chris Mason's ftp site ftp.suse.com/pub/people/mason/utils/aio-stress.c Now, another way to solve the problem would be to modify page_cache_readahead to let it know about the size of the request, and to make it handle re-enablement of readahead, and have handle_ra_miss only deal with misses due to VM eviction. Perhaps this is what we should do in the long run ? 25-akpm/mm/filemap.c | 16 ++++++++++++---- 1 files changed, 12 insertions(+), 4 deletions(-) diff -puN mm/filemap.c~aio-readahead-speedup mm/filemap.c --- 25/mm/filemap.c~aio-readahead-speedup Thu Oct 9 02:05:36 2003 +++ 25-akpm/mm/filemap.c Thu Oct 9 02:05:36 2003 @@ -661,13 +661,13 @@ void do_generic_mapping_read(struct addr read_actor_t actor) { struct inode *inode = mapping->host; - unsigned long index, offset, last, end_index; + unsigned long index, offset, first, last, end_index; struct page *cached_page; loff_t isize = i_size_read(inode); int error; cached_page = NULL; - index = *ppos >> PAGE_CACHE_SHIFT; + first = *ppos >> PAGE_CACHE_SHIFT; offset = *ppos & ~PAGE_CACHE_MASK; last = (*ppos + desc->count) >> PAGE_CACHE_SHIFT; @@ -685,11 +685,19 @@ void do_generic_mapping_read(struct addr * Let the readahead logic know upfront about all * the pages we'll need to satisfy this request */ - for (; index < last; index++) + for (index = first ; index < last; index++) page_cache_readahead(mapping, ra, filp, index); - index = *ppos >> PAGE_CACHE_SHIFT; + + if (ra->next_size == -1UL) { + /* the readahead window was maximally shrunk */ + /* explicitly readahead at least what is needed now */ + for (index = first; index < last; index++) + handle_ra_miss(mapping, ra, index); + do_page_cache_readahead(mapping, filp, first, last - first); + } done_readahead: + index = first; for (;;) { struct page *page; unsigned long nr, ret; _