commit b0807bc10a6ac95ab8bf3bbf57703a0f2edd9aa9
Author: Jiri Slaby <jslaby@suse.cz>
Date:   Fri Sep 26 11:55:45 2014 +0200

    Linux 3.12.30

commit 99ed1bd0c77355d65de5f112eb92d79f9bace84f
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:45 2014 +0100

    mm: page_alloc: reduce cost of the fair zone allocation policy
    
    commit 4ffeaf3560a52b4a69cc7909873d08c0ef5909d4 upstream.
    
    The fair zone allocation policy round-robins allocations between zones
    within a node to avoid age inversion problems during reclaim.  If the
    first allocation fails, the batch counts are reset and a second attempt
    made before entering the slow path.
    
    One assumption made with this scheme is that batches expire at roughly
    the same time and the resets each time are justified.  This assumption
    does not hold when zones reach their low watermark as the batches will
    be consumed at uneven rates.  Allocation failure due to watermark
    depletion results in additional zonelist scans for the reset and another
    watermark check before hitting the slowpath.
    
    On UMA, the benefit is negligible -- around 0.25%.  On a 4-socket NUMA
    machine it's variable due to the variability of measuring overhead with
    the vmstat changes.  The system CPU overhead comparison looks like
    
              3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
                 vanilla   vmstat-v5 lowercost-v5
    User          746.94      774.56      802.00
    System      65336.22    32847.27    40852.33
    Elapsed     27553.52    27415.04    27368.46
    
    However it is worth noting that the overall benchmark still completed
    faster and intuitively it makes sense to take as few passes as possible
    through the zonelists.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 51aad0a51582e4147380137ba34785663a1b5f93
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:44 2014 +0100

    mm: page_alloc: abort fair zone allocation policy when remotes nodes are encountered
    
    commit f7b5d647946aae1647bf5cd26c16b3a793c1ac49 upstream.
    
    The purpose of numa_zonelist_order=zone is to preserve lower zones for
    use with 32-bit devices.  If locality is preferred then the
    numa_zonelist_order=node policy should be used.
    
    Unfortunately, the fair zone allocation policy overrides this by
    skipping zones on remote nodes until the lower one is found.  While this
    makes sense from a page aging and performance perspective, it breaks the
    expected zonelist policy.  This patch restores the expected behaviour
    for zone-list ordering.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit fc915114c8369c660b6876b06519902de3bd10c6
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:43 2014 +0100

    mm: vmscan: only update per-cpu thresholds for online CPU
    
    commit bb0b6dffa2ccfbd9747ad0cc87c7459622896e60 upstream.
    
    When kswapd is awake reclaiming, the per-cpu stat thresholds are lowered
    to get more accurate counts to avoid breaching watermarks.  This
    threshold update iterates over all possible CPUs which is unnecessary.
    Only online CPUs need to be updated.  If a new CPU is onlined,
    refresh_zone_stat_thresholds() will set the thresholds correctly.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 4a4ede23dd902513b3a17d3e61cef9baf650d33e
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:42 2014 +0100

    mm: move zone->pages_scanned into a vmstat counter
    
    commit 0d5d823ab4e608ec7b52ac4410de4cb74bbe0edd upstream.
    
    zone->pages_scanned is a write-intensive cache line during page reclaim
    and it's also updated during page free.  Move the counter into vmstat to
    take advantage of the per-cpu updates and do not update it in the free
    paths unless necessary.
    
    On a small UMA machine running tiobench the difference is marginal.  On
    a 4-node machine the overhead is more noticeable.  Note that automatic
    NUMA balancing was disabled for this test as otherwise the system CPU
    overhead is unpredictable.
    
              3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
                 vanillarearrange-v5   vmstat-v5
    User          746.94      759.78      774.56
    System      65336.22    58350.98    32847.27
    Elapsed     27553.52    27282.02    27415.04
    
    Note that the overhead reduction will vary depending on where exactly
    pages are allocated and freed.
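
    The win comes from vmstat-style per-cpu counters, where each CPU
    accumulates a small local delta and only folds it into the shared
    counter when a threshold is crossed.  A minimal userspace analogue of
    that technique (thread-local deltas folding into one atomic; all names
    and numbers invented) is:

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        #define THRESHOLD          32
        #define PER_THREAD_EVENTS  1000
        #define NR_THREADS         4

        static atomic_long global_pages_scanned;  /* shared, contended counter */

        /* Each "CPU" batches updates locally and only touches the shared
         * cache line once per THRESHOLD events, the way per-cpu vmstat
         * deltas do. */
        static void *scanner(void *arg)
        {
                long local = 0;

                (void)arg;
                for (int i = 0; i < PER_THREAD_EVENTS; i++) {
                        local++;
                        if (local >= THRESHOLD) {
                                atomic_fetch_add(&global_pages_scanned, local);
                                local = 0;
                        }
                }
                if (local)              /* fold the remainder */
                        atomic_fetch_add(&global_pages_scanned, local);
                return NULL;
        }

        int main(void)
        {
                pthread_t t[NR_THREADS];

                for (int i = 0; i < NR_THREADS; i++)
                        pthread_create(&t[i], NULL, scanner, NULL);
                for (int i = 0; i < NR_THREADS; i++)
                        pthread_join(t[i], NULL);
                printf("pages scanned: %ld\n", atomic_load(&global_pages_scanned));
                return 0;
        }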
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit b4fc580f75325271de2841891bb5816cea5ca101
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:41 2014 +0100

    mm: rearrange zone fields into read-only, page alloc, statistics and page reclaim lines
    
    commit 3484b2de9499df23c4604a513b36f96326ae81ad upstream.
    
    The arrangement of struct zone has changed over time and now it has
    reached the point where there is some inappropriate sharing going on.
    On x86-64 for example
    
    o The zone->node field is shared with the zone lock and zone->node is
      accessed frequently from the page allocator due to the fair zone
      allocation policy.
    
    o span_seqlock is almost never used but shares a line with free_area
    
    o Some zone statistics share a cache line with the LRU lock so
      reclaim-intensive and allocator-intensive workloads can bounce the cache
      line on a stat update
    
    This patch rearranges struct zone to put read-only and read-mostly
    fields together and then splits the page allocator intensive fields, the
    zone statistics and the page reclaim intensive fields into their own
    cache lines.  Note that the type of lowmem_reserve changes due to the
    watermark calculations being signed and avoiding a signed/unsigned
    conversion there.
    
    On the test configuration I used the overall size of struct zone shrunk
    by one cache line.  On smaller machines, this is not likely to be
    noticeable.  However, on a 4-node NUMA machine running tiobench the
    system CPU overhead is reduced by this patch.
    
              3.16.0-rc3  3.16.0-rc3
                 vanillarearrange-v5r9
    User          746.94      759.78
    System      65336.22    58350.98
    Elapsed     27553.52    27282.02
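
    The underlying trick is simply to group fields by access pattern and
    start each write-intensive group on its own cache line.  A small,
    self-contained illustration of the idea (not the actual struct zone
    layout; field names are invented) is:

        #include <stddef.h>
        #include <stdio.h>

        #define CACHELINE 64

        /* Read-mostly data first, then each write-intensive group starts
         * on its own cache line so allocator and reclaim traffic do not
         * false-share. */
        struct zone_like {
                /* read-mostly: consulted on every allocation, rarely written */
                unsigned long watermark_min;
                unsigned long watermark_low;
                int node;

                /* written constantly by the page allocator */
                unsigned long nr_free __attribute__((aligned(CACHELINE)));
                unsigned long alloc_batch;

                /* written constantly by page reclaim */
                unsigned long lru_lock __attribute__((aligned(CACHELINE)));
                unsigned long nr_scanned;
        } __attribute__((aligned(CACHELINE)));

        int main(void)
        {
                printf("read-mostly at %zu, allocator at %zu, reclaim at %zu\n",
                       offsetof(struct zone_like, watermark_min),
                       offsetof(struct zone_like, nr_free),
                       offsetof(struct zone_like, lru_lock));
                return 0;
        }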
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 59d4b11371a0a23eef41138c122074aabaeaff4b
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:40 2014 +0100

    mm: pagemap: avoid unnecessary overhead when tracepoints are deactivated
    
    commit 24b7e5819ad5cbef2b7c7376510862aa8319d240 upstream.
    
    This was formerly the series "Improve sequential read throughput" which
    noted some major differences in performance of tiobench since 3.0.
    While there are a number of factors, two that dominated were the
    introduction of the fair zone allocation policy and changes to CFQ.
    
    The behaviour of the fair zone allocation policy makes more sense than
    tiobench does as a benchmark, and the CFQ defaults were not changed due
    to insufficient benchmarking.
    
    This series is what's left.  It's one functional fix to the fair zone
    allocation policy when used on NUMA machines and a reduction of overhead
    in general.  tiobench was used for the comparison despite its flaws as
    an IO benchmark because in this case we are primarily interested in the
    overhead of the page allocator and page reclaim activity.
    
    On UMA, it makes little difference to overhead
    
              3.16.0-rc3   3.16.0-rc3
                 vanilla lowercost-v5
    User          383.61      386.77
    System        403.83      401.74
    Elapsed      5411.50     5413.11
    
    On a 4-socket NUMA machine it's a bit more noticeable
    
              3.16.0-rc3   3.16.0-rc3
                 vanilla lowercost-v5
    User          746.94      802.00
    System      65336.22    40852.33
    Elapsed     27553.52    27368.46
    
    This patch (of 6):
    
    The LRU insertion and activate tracepoints take PFN as a parameter
    forcing the overhead to the caller.  Move the overhead to the tracepoint
    fast-assign method to ensure the cost is only incurred when the
    tracepoint is active.
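
    The mechanism is the tracepoint's TP_fast_assign() block, which only
    runs when the tracepoint is enabled.  A simplified, kernel-style sketch
    of the shape of such an event definition (illustrative only, not the
    exact event from the patch):

        TRACE_EVENT(mm_lru_insertion,
                TP_PROTO(struct page *page),    /* callers pass only the page */
                TP_ARGS(page),
                TP_STRUCT__entry(
                        __field(struct page *, page)
                        __field(unsigned long, pfn)
                ),
                TP_fast_assign(
                        __entry->page = page;
                        /* the PFN lookup now happens here, only when the
                         * tracepoint is actually enabled and fires */
                        __entry->pfn = page_to_pfn(page);
                ),
                TP_printk("page=%p pfn=%lu", __entry->page, __entry->pfn)
        );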
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 04bab05a95fece32015d897d4058880bbb5c65eb
Author: Jerome Marchand <jmarchan@redhat.com>
Date:   Thu Aug 28 19:35:39 2014 +0100

    memcg, vmscan: Fix forced scan of anonymous pages
    
    commit 2ab051e11bfa3cbb7b24177f3d6aaed10a0d743e upstream.
    
    When memory cgroups are enabled, the code that decides to force a scan
    of anonymous pages in get_scan_count() compares global values (free,
    high_watermark) to a value that is restricted to a memory cgroup (file).
    This makes the code over-eager to force an anon scan.
    
    For instance, it will force an anon scan when scanning a memcg that is
    mainly populated by anonymous pages, even when there are plenty of file
    pages to get rid of in other memcgs, even when swappiness == 0.  It
    breaks the user's expectation about swappiness and hurts performance.

    This patch makes sure that a forced anon scan only happens when there
    are not enough file pages for the whole zone, not just in one random
    memcg.
    
    [hannes@cmpxchg.org: cleanups]
    Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
    Acked-by: Michal Hocko <mhocko@suse.cz>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 788a2f69f9b8b77b30ace8d1ef9380fa4ea5c6ec
Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date:   Thu Aug 28 19:35:38 2014 +0100

    vmalloc: use rcu list iterator to reduce vmap_area_lock contention
    
    commit 474750aba88817c53f39424e5567b8e4acc4b39b upstream.
    
    Richard Yao reported a month ago that his system had trouble with
    vmap_area_lock contention during performance analysis via /proc/meminfo.
    Andrew asked why his analysis polls /proc/meminfo so aggressively, but
    he didn't answer.
    
      https://lkml.org/lkml/2014/4/10/416
    
    Although I'm not sure whether this is the right usage, there is a
    solution that reduces vmap_area_lock contention with no side effects:
    simply use the RCU list iterator in get_vmalloc_info().

    RCU can be used in this function because the RCU protocol is already
    respected by the writers, since Nick Piggin's commit db64fe02258f1 ("mm:
    rewrite vmap layer") back in linux-2.6.28.
    
    Specifically:
       insertions use list_add_rcu(),
       deletions use list_del_rcu() and kfree_rcu().
    
    Note the rb tree is not used from rcu reader (it would not be safe),
    only the vmap_area_list has full RCU protection.
    
    Note that __purge_vmap_area_lazy() already uses this rcu protection.
    
            rcu_read_lock();
            list_for_each_entry_rcu(va, &vmap_area_list, list) {
                    if (va->flags & VM_LAZY_FREE) {
                            if (va->va_start < *start)
                                    *start = va->va_start;
                            if (va->va_end > *end)
                                    *end = va->va_end;
                            nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
                            list_add_tail(&va->purge_list, &valist);
                            va->flags |= VM_LAZY_FREEING;
                            va->flags &= ~VM_LAZY_FREE;
                    }
            }
            rcu_read_unlock();
    
    Peter:
    
    : While rcu list traversal over the vmap_area_list is safe, this may
    : arrive at different results than the spinlocked version. The rcu list
    : traversal version will not be a 'snapshot' of a single, valid instant
    : of the entire vmap_area_list, but rather a potential amalgam of
    : different list states.
    
    Joonsoo:
    
    : Yes, you are right, but I don't think that we should be strict here.
    : Meminfo is already not a 'snapshot' at specific time.  While we try to get
    : certain stats, the other stats can change.  And, although we may arrive at
    : different results than the spinlocked version, the difference would not be
    : large and would not make serious side-effect.
    
    [edumazet@google.com: add more commit description]
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Reported-by: Richard Yao <ryao@gentoo.org>
    Acked-by: Eric Dumazet <edumazet@google.com>
    Cc: Peter Hurley <peter@hurleysoftware.com>
    Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Andi Kleen <andi@firstfloor.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 84a6c7694aad1e1fe41ee7f66b9142e6c6b0347d
Author: Jerome Marchand <jmarchan@redhat.com>
Date:   Thu Aug 28 19:35:37 2014 +0100

    mm: make copy_pte_range static again
    
    commit 21bda264f4243f61dfcc485174055f12ad0530b4 upstream.
    
    Commit 71e3aac0724f ("thp: transparent hugepage core") added the
    copy_pte_range() prototype to huge_mm.h.  I'm not sure why (or whether)
    this function has ever been used outside of memory.c, but it currently
    isn't.  This patch makes copy_pte_range() static again.
    
    Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 6a01f8dd2508cf79abbdccc44a6a41b2e17fb3cb
Author: David Rientjes <rientjes@google.com>
Date:   Thu Aug 28 19:35:36 2014 +0100

    mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode
    
    commit 14a4e2141e24304fff2c697be6382ffb83888185 upstream.
    
    Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
    node") improved the previous khugepaged logic which allocated a
    transparent hugepages from the node of the first page being collapsed.
    
    However, it is still possible to collapse pages to remote memory which
    may suffer from additional access latency.  With the current policy, it
    is possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed
    remotely if the majority are allocated from that node.
    
    When zone_reclaim_mode is enabled, it means the VM should make every
    attempt to allocate locally to prevent NUMA performance degradation.  In
    this case, we do not want to collapse hugepages to remote nodes that
    would suffer from increased access latency.  Thus, when
    zone_reclaim_mode is enabled, only allow collapsing to nodes with
    RECLAIM_DISTANCE or less.
    
    There is no functional change for systems that disable
    zone_reclaim_mode.
    
    Signed-off-by: David Rientjes <rientjes@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Bob Liu <bob.liu@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 1d08848674752a433634f5c144500447de8ab727
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Aug 28 19:35:35 2014 +0100

    mm/memory.c: use entry = ACCESS_ONCE(*pte) in handle_pte_fault()
    
    commit c0d73261f5c1355a35b8b40e871d31578ce0c044 upstream.
    
    Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or
    orig_pte upon which all subsequent decisions and pte_same() tests will
    be made.
    
    I have no evidence that its lack is responsible for the mm/filemap.c:202
    BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by
    trinity, and I am not optimistic that it will fix it.  But I have found
    no other explanation, and ACCESS_ONCE() here will surely not hurt.
    
    If gcc does re-access the pte before passing it down, then that would be
    disastrous for correct page fault handling, and certainly could explain
    the page_mapped() BUGs seen (concurrent fault causing page to be mapped
    in a second time on top of itself: mapcount 2 for a single pte).
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Sasha Levin <sasha.levin@oracle.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Konstantin Khlebnikov <koct9i@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 335886a3689bb3be6bc37f6601006ee20ccd80a0
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Aug 28 19:35:34 2014 +0100

    shmem: fix init_page_accessed use to stop !PageLRU bug
    
    commit 66d2f4d28cd030220e7ea2a628993fcabcb956d1 upstream.
    
    Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU)
    in isolate_lru_pages() at mm/vmscan.c:1281!
    
    Commit 2457aec63745 ("mm: non-atomically mark page accessed during page
    cache allocation where possible") looks like interrupted work-in-progress.
    
    mm/filemap.c's call to init_page_accessed() is fine, but not mm/shmem.c's
    - shmem_write_begin() is clearly wrong to use it after shmem_getpage(),
    when the page is always visible in radix_tree, and often already on LRU.
    
    Revert change to shmem_write_begin(), and use init_page_accessed() or
    mark_page_accessed() appropriately for SGP_WRITE in shmem_getpage_gfp().
    
    SGP_WRITE also covers shmem_symlink(), which did not mark_page_accessed()
    before; but since many other filesystems use [__]page_symlink(), which did
    and does mark the page accessed, consider this as rectifying an oversight.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Prabhakar Lad <prabhakar.csengg@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 9057c212b93709fa54e62d434502a64591cfa5e6
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:33 2014 +0100

    mm: avoid unnecessary atomic operations during end_page_writeback()
    
    commit 888cf2db475a256fb0cda042140f73d7881f81fe upstream.
    
    If a page is marked for immediate reclaim then it is moved to the tail of
    the LRU list.  This occurs when the system is under enough memory pressure
    for pages under writeback to reach the end of the LRU but we test for this
    using atomic operations on every writeback.  This patch uses an optimistic
    non-atomic test first.  It'll miss some pages in rare cases but the
    consequences are not severe enough to warrant such a penalty.
    
    While the function does not dominate profiles during a simple dd test,
    the cost of it is reduced.
    
    73048     0.7428  vmlinux-3.15.0-rc5-mmotm-20140513 end_page_writeback
    23740     0.2409  vmlinux-3.15.0-rc5-lessatomic     end_page_writeback
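
    The pattern is to peek at the flag with a plain read and only pay for
    the atomic read-modify-write when the flag appears to be set.  A small,
    self-contained userspace sketch of the pattern (the flag name is
    invented, not the kernel's page flag API):

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        #define PG_RECLAIM (1UL << 0)

        /* Returns true if PG_RECLAIM was set and we cleared it.  The
         * unlocked peek can miss a flag set concurrently, which is the
         * rare, tolerable case the changelog describes. */
        static bool test_clear_reclaim(atomic_ulong *flags)
        {
                if (!(atomic_load_explicit(flags, memory_order_relaxed) & PG_RECLAIM))
                        return false;   /* common case: no atomic op at all */
                return atomic_fetch_and(flags, ~PG_RECLAIM) & PG_RECLAIM;
        }

        int main(void)
        {
                atomic_ulong flags = PG_RECLAIM;

                printf("first call cleared: %d\n", test_clear_reclaim(&flags));
                printf("second call cleared: %d\n", test_clear_reclaim(&flags));
                return 0;
        }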
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit d618a27c7808608e376de803a4fd3940f33776c2
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:32 2014 +0100

    mm: non-atomically mark page accessed during page cache allocation where possible
    
    commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream.
    
    aops->write_begin may allocate a new page and make it visible only to
    have mark_page_accessed called almost immediately after.  Once the page
    is visible the atomic operations are necessary, which is noticeable
    overhead when writing to an in-memory filesystem like tmpfs but should
    also be noticeable with fast storage.  The objective of the patch is to
    initialise the accessed information with non-atomic operations before
    the page is visible.
    
    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page.  This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but
    may be called before the page is visible and can be done non-atomically.
    
    The primary APIs of concern in this case are the following and are used
    by most filesystems.
    
    	find_get_page
    	find_lock_page
    	find_or_create_page
    	grab_cache_page_nowait
    	grab_cache_page_write_begin
    
    All of them are very similar in detail, so the patch creates a core
    helper, pagecache_get_page(), which takes a flags parameter that affects
    its behaviour, such as whether the page should be marked accessed or
    not.  The old API is preserved but is basically a set of thin wrappers
    around this core function.
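
    The consolidation pattern is one core lookup function that takes
    behaviour flags, with the old entry points kept as thin wrappers.  A
    purely illustrative toy version of that shape (the names and flags here
    are invented, not the kernel API):

        #include <stdio.h>
        #include <stdlib.h>

        #define GP_CREAT    0x1         /* allocate if not present */
        #define GP_ACCESSED 0x2         /* initialise accessed state */

        struct tpage { int accessed; };

        static struct tpage *slots[16]; /* toy stand-in for a page cache */

        /* One core helper; the behaviour differences become flags. */
        static struct tpage *get_cache_page(unsigned int index, unsigned int flags)
        {
                struct tpage *p = slots[index % 16];

                if (!p && (flags & GP_CREAT)) {
                        p = calloc(1, sizeof(*p));
                        if (!p)
                                return NULL;
                        if (flags & GP_ACCESSED)
                                p->accessed = 1; /* set before the page becomes
                                                    "visible", so no atomics */
                        slots[index % 16] = p;
                }
                return p;
        }

        /* The old entry points survive as thin wrappers around the helper. */
        static struct tpage *find_get_page_like(unsigned int index)
        {
                return get_cache_page(index, 0);
        }

        static struct tpage *find_or_create_page_like(unsigned int index)
        {
                return get_cache_page(index, GP_CREAT | GP_ACCESSED);
        }

        int main(void)
        {
                printf("lookup miss: %p\n", (void *)find_get_page_like(3));
                printf("create:      %p\n", (void *)find_or_create_page_like(3));
                return 0;
        }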
    
    Each of the filesystems is then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job.  There is a slight snag in that the timing of the
    mark_page_accessed() has now changed, so in rare cases it's possible a
    page gets to the end of the LRU as PageReferenced whereas previously it
    might have been repromoted.  This is expected to be rare but it's worth
    the filesystem people thinking about it in case they see a problem with
    the timing change.  It is also the case that some filesystems may now be
    marking pages accessed that they previously did not, but it makes sense
    that filesystems have consistent behaviour in this regard.
    
    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iteration.  The size of the
    file is 1/10th of physical memory to avoid dirty page balancing.  In the
    async case it is possible that the workload completes without even
    hitting the disk and will have variable results, but it highlights the
    impact of mark_page_accessed for async IO.  The sync results are
    expected to be more stable.  The exception is tmpfs, where the normal
    case is for the "IO" to not hit the disk.
    
    The test machine was single socket and UMA to avoid any scheduling or
    NUMA artifacts.  Throughput and wall times are presented for sync IO;
    only wall times are shown for async as the granularity reported by dd
    and the variability are unsuitable for comparison.  As async results
    were variable due to writeback timings, I'm only reporting the maximum
    figures.  The sync results were stable enough to make the mean and
    stddev uninteresting.
    
    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.
    
    async dd
                                        3.15.0-rc3            3.15.0-rc3
                                           vanilla           accessed-v2
    ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
    tmpfs   Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
    btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
    ext4    Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
    xfs     Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)
    
    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.
    
            samples percentage
    ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
    ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
    ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
    xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
    btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
    tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    
    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 967e64285aef3670019bd05bef74085d9ea98104
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:31 2014 +0100

    fs: buffer: do not use unnecessary atomic operations when discarding buffers
    
    commit e7470ee89f003634a88e7b5e5a7b65b3025987de upstream.
    
    Discarding buffers uses a bunch of atomic operations because ...  I
    can't think of a reason why.  Use a cmpxchg loop to clear all the
    necessary flags.  In most (all?) cases this will be a single atomic
    operation.
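
    The point of the cmpxchg loop is to batch several flag clears into what
    is normally a single atomic operation instead of one atomic bitop per
    flag.  A self-contained sketch of the pattern using C11 atomics (the
    flag names are made up, not the buffer_head flags):

        #include <stdatomic.h>
        #include <stdio.h>

        #define BH_UPTODATE (1UL << 0)
        #define BH_DIRTY    (1UL << 1)
        #define BH_MAPPED   (1UL << 2)
        #define BH_REQ      (1UL << 3)

        #define FLAGS_DISCARD (BH_UPTODATE | BH_DIRTY | BH_MAPPED | BH_REQ)

        /* Clear every flag in FLAGS_DISCARD with (normally) a single
         * compare-and-exchange instead of one atomic bitop per flag. */
        static void discard_flags(atomic_ulong *state)
        {
                unsigned long old = atomic_load(state);

                while (!atomic_compare_exchange_weak(state, &old,
                                                     old & ~FLAGS_DISCARD))
                        ;       /* 'old' was refreshed, retry with it */
        }

        int main(void)
        {
                atomic_ulong state = BH_UPTODATE | BH_DIRTY | (1UL << 8);

                discard_flags(&state);
                printf("remaining flags: 0x%lx\n", atomic_load(&state));
                return 0;
        }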
    
    [akpm@linux-foundation.org: move BUFFER_FLAGS_DISCARD into the .c file]
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 6ffef5d8bfc16845e25a7ee784426382b5c82c20
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:30 2014 +0100

    mm: do not use unnecessary atomic operations when adding pages to the LRU
    
    commit 6fb81a17d21f2a138b8f424af4cf379f2b694060 upstream.
    
    When adding pages to the LRU we clear the active bit unconditionally.
    As the page could be reachable from other paths we cannot use unlocked
    operations without risk of corruption, such as a parallel
    mark_page_accessed.  This patch tests whether it is necessary to clear
    the active flag before using an atomic operation.  This potentially
    opens a tiny race when PageActive is checked, as mark_page_accessed
    could be called after PageActive was checked.  The race already exists
    but this patch changes it slightly.  The consequence is that a page may
    be promoted to the active list that might have been left on the
    inactive list before the patch.  It's too tiny a race and too marginal
    a consequence to always use atomic operations for.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit fa6d2dd2223cafd36fcd901d83eed2f134c8b793
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:29 2014 +0100

    mm: do not use atomic operations when releasing pages
    
    commit e3741b506c5088fa8c911bb5884c430f770fb49d upstream.
    
    By the time a page is released there should be no references to it any
    more and a parallel mark should not be reordered against us.  Use the
    non-locked variant to clear the page active flag.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 6d3e133532cd49fa925fe5c2a949c5900afd601c
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:28 2014 +0100

    mm: shmem: avoid atomic operation during shmem_getpage_gfp
    
    commit 07a427884348d38a6fd56fa4d78249c407196650 upstream.
    
    shmem_getpage_gfp uses an atomic operation to set the SwapBacked field
    before the page is even added to the LRU or visible.  This is
    unnecessary, as what could it possibly race against?  Use an unlocked
    variant.
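
    The distinction being exploited is between an atomic (locked) bit
    operation and a plain store that is safe only while no other context
    can see the page.  A userspace sketch of the two variants (names
    invented, loosely mirroring the kernel's SetPageSwapBacked and
    __SetPageSwapBacked helpers):

        #include <stdatomic.h>
        #include <stdio.h>

        #define PG_SWAPBACKED (1UL << 3)

        /* Locked (atomic) variant: needed once the page is on the LRU or
         * otherwise reachable, because other CPUs may update flags too. */
        static void set_swapbacked_atomic(atomic_ulong *flags)
        {
                atomic_fetch_or(flags, PG_SWAPBACKED);
        }

        /* Non-locked variant: a plain read-modify-write, valid only while
         * the page is still private to the thread that allocated it. */
        static void set_swapbacked_nonlocked(atomic_ulong *flags)
        {
                unsigned long v = atomic_load_explicit(flags, memory_order_relaxed);

                atomic_store_explicit(flags, v | PG_SWAPBACKED,
                                      memory_order_relaxed);
        }

        int main(void)
        {
                atomic_ulong flags = 0;

                set_swapbacked_nonlocked(&flags);  /* page not yet visible */
                set_swapbacked_atomic(&flags);     /* after publication */
                printf("flags: 0x%lx\n", atomic_load(&flags));
                return 0;
        }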
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Rik van Riel <riel@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit f161eedc71da293a9bcfcf3d7f6c1da070a61ef0
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:27 2014 +0100

    mm: page_alloc: lookup pageblock migratetype with IRQs enabled during free
    
    commit cfc47a2803db42140167b92d991ef04018e162c7 upstream.
    
    get_pageblock_migratetype() is called during free with IRQs disabled.
    This is unnecessary and disables IRQs for longer than necessary.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 3e7379c0f4fae4e35784a1a3954bc43683b86308
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:26 2014 +0100

    mm: page_alloc: convert hot/cold parameter and immediate callers to bool
    
    commit b745bc85f21ea707e4ea1a91948055fa3e72c77b upstream.
    
    cold is used as a bool, so make it one.  Make the likely case the "if"
    part of the block instead of the "else", as according to the
    optimisation manual this is preferred.
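
    Putting the likely case in the "if" arm, optionally annotated for the
    compiler, is the micro-optimisation being described.  A tiny
    illustration (the free-path function here is invented, not the kernel
    code):

        #include <stdbool.h>
        #include <stdio.h>

        #define likely(x) __builtin_expect(!!(x), 1)

        static void free_page_stub(int page, bool cold)
        {
                /* hot frees are the common case, so they take the 'if' arm */
                if (likely(!cold))
                        printf("page %d added to the head of the pcp list\n", page);
                else
                        printf("page %d added to the tail of the pcp list\n", page);
        }

        int main(void)
        {
                free_page_stub(1, false);
                free_page_stub(2, true);
                return 0;
        }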
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit c01947d6dfa1a3fee7bd54523dd59414e1d1fefc
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:25 2014 +0100

    mm: page_alloc: reduce number of times page_to_pfn is called
    
    commit dc4b0caff24d9b2918e9f27bc65499ee63187eba upstream.
    
    In the free path we calculate page_to_pfn multiple times. Reduce that.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit da530fd87d14f640f23d1ef51333951493f508af
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:24 2014 +0100

    mm: page_alloc: use unsigned int for order in more places
    
    commit 7aeb09f9104b760fc53c98cb7d20d06640baf9e6 upstream.
    
    X86 prefers the use of unsigned types for iterators and there is a
    tendency to mix whether a signed or unsigned type is used for page
    order.  This converts a number of sites in mm/page_alloc.c to use
    unsigned int for order where possible.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 35515de99f95c55f6c416fc7cc2b16832b9f58ee
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:23 2014 +0100

    mm: page_alloc: take the ALLOC_NO_WATERMARK check out of the fast path
    
    commit 5dab29113ca56335c78be3f98bf5ddf2ef8eb6a6 upstream.
    
    ALLOC_NO_WATERMARK is set in a few cases.  Always by kswapd, always for
    __GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc.  Each of these
    cases is a relatively rare event but the ALLOC_NO_WATERMARK check is an
    unlikely branch in the fast path.  This patch moves the check out of the
    fast path and after it has been determined that the watermarks have not
    been met.  This helps the common fast path at the cost of making the slow
    path slower and hitting kswapd with a performance cost.  It's a reasonable
    tradeoff.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit d6217476e187b28df3c99100e6f9a054b6ba80c6
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:22 2014 +0100

    mm: page_alloc: only check the alloc flags and gfp_mask for dirty once
    
    commit a6e21b14f22041382e832d30deda6f26f37b1097 upstream.
    
    Currently it's calculated once per zone in the zonelist.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 6a85a5ad29a2582ae347ac3f38baa9e2e5daab5b
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:21 2014 +0100

    mm: page_alloc: only check the zone id check if pages are buddies
    
    commit d34c5fa06fade08a689fc171bf756fba2858ae73 upstream.
    
    A node/zone index is used to check if pages are compatible for merging
    but this happens unconditionally even if the buddy page is not free. Defer
    the calculation as long as possible. Ideally we would check the zone boundary
    but nodes can overlap.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Rik van Riel <riel@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit dc2786f0c19a779395ef69189dd5e7df2573b29b
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:20 2014 +0100

    mm: page_alloc: calculate classzone_idx once from the zonelist ref
    
    commit d8846374a85f4290a473a4e2a64c1ba046c4a0e1 upstream.
    
    There is no need to calculate zone_idx(preferred_zone) multiple times
    or use the pgdat to figure it out.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Rik van Riel <riel@redhat.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit ee1760b2b4920841f45b2ad07a1c9f99e08568e7
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:19 2014 +0100

    mm: page_alloc: use jump labels to avoid checking number_of_cpusets
    
    commit 664eeddeef6539247691197c1ac124d4aa872ab6 upstream.
    
    If cpusets are not in use then we still check a global variable on every
    page allocation.  Use jump labels to avoid the overhead.
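
    A kernel-style sketch of how a static key turns the disabled-case check
    into a patched no-op branch (the names here are approximate and the
    code is not quoted from the patch):

        #include <linux/jump_label.h>

        static struct static_key cpusets_key = STATIC_KEY_INIT_FALSE;

        /* With the key disabled this compiles to a patched no-op branch
         * rather than a load and compare of a global counter on every
         * allocation. */
        static inline bool cpusets_active(void)
        {
                return static_key_false(&cpusets_key);
        }

        /* Flipped as cpusets are created and destroyed. */
        static inline void cpuset_inc(void)
        {
                static_key_slow_inc(&cpusets_key);
        }

        static inline void cpuset_dec(void)
        {
                static_key_slow_dec(&cpusets_key);
        }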
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit f99bfd27bda2e71932f41115098aacb16263988c
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:18 2014 +0100

    include/linux/jump_label.h: expose the reference count
    
    commit ea5e9539abf1258f23e725cb9cb25aa74efa29eb upstream.
    
    This patch exposes the jump_label reference count in preparation for the
    next patch.  The cpuset code cares about both the jump_label being
    enabled and how many users of cpusets there currently are.
    
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 605cf7a17583cbbdbb58a99c6752133ad0f5c378
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:17 2014 +0100

    mm: page_alloc: do not treat a zone that cannot be used for dirty pages as "full"
    
    commit 800a1e750c7b04c2aa2459afca77e936e01c0029 upstream.
    
    If a zone cannot be used for a dirty page then it gets marked "full" which
    is cached in the zlc and later potentially skipped by allocation requests
    that have nothing to do with dirty zones.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 55dcadc2753639b07efed5d8baafeab47d4613a7
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:16 2014 +0100

    mm: page_alloc: do not update zlc unless the zlc is active
    
    commit 65bb371984d6a2c909244eb749e482bb40b72e36 upstream.
    
    The zlc is used on NUMA machines to quickly skip over zones that are
    full.  However it is always updated, even for the first zone scanned
    when the zlc might not even be active.  As it's a write to a bitmap
    that potentially bounces a cache line it's deceptively expensive, even
    though most machines will not care.  Only update the zlc if it was
    active.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 69aa12f2961a0310d58375815fa391e5528a79b3
Author: Jianyu Zhan <nasa4836@gmail.com>
Date:   Thu Aug 28 19:35:15 2014 +0100

    mm/swap.c: clean up *lru_cache_add* functions
    
    commit 2329d3751b082b4fd354f334a88662d72abac52d upstream.
    
    In mm/swap.c, __lru_cache_add() is exported, but actually there are no
    users outside this file.
    
    This patch unexports __lru_cache_add() and makes it static.  It also
    exports lru_cache_add_file(), as it is used by cifs and fuse, which can
    be loaded as modules.
    
    Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Shaohua Li <shli@kernel.org>
    Cc: Bob Liu <bob.liu@oracle.com>
    Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Rafael Aquini <aquini@redhat.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Acked-by: Rik van Riel <riel@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 4cd64dcede969cf30f47a6e6ba8e378e74d0790d
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Thu Aug 28 19:35:14 2014 +0100

    mm/page_alloc: prevent MIGRATE_RESERVE pages from being misplaced
    
    commit 5bcc9f86ef09a933255ee66bd899d4601785dad5 upstream.
    
    For MIGRATE_RESERVE pages, it is useful when they do not get misplaced
    on the free_list of another migratetype, otherwise they might get
    allocated prematurely and e.g.  fragment the MIGRATE_RESERVE pageblocks.
    While this cannot be avoided completely when allocating new
    MIGRATE_RESERVE pageblocks in min_free_kbytes sysctl handler, we should
    prevent the misplacement where possible.
    
    Currently, it is possible for the misplacement to happen when a
    MIGRATE_RESERVE page is allocated on the pcplist through rmqueue_bulk()
    as a fallback for another desired migratetype, and then later freed
    back through free_pcppages_bulk() without actually being used.
    because free_pcppages_bulk() uses get_freepage_migratetype() to choose
    the free_list, and rmqueue_bulk() calls set_freepage_migratetype() with
    the *desired* migratetype and not the page's original MIGRATE_RESERVE
    migratetype.
    
    This patch fixes the problem by moving the call to
    set_freepage_migratetype() from rmqueue_bulk() down to
    __rmqueue_smallest() and __rmqueue_fallback(), where the actual page's
    migratetype (e.g.  which free_list the page is taken from) is used.
    Note that this migratetype might be different from the pageblock's
    migratetype due to freepage stealing decisions.  This is OK, as page
    stealing never uses MIGRATE_RESERVE as a fallback, and also takes care
    to leave all MIGRATE_CMA pages on the correct freelist.
    
    Therefore, as an additional benefit, the call to
    get_pageblock_migratetype() from rmqueue_bulk() when CMA is enabled, can
    be removed completely.  This relies on the fact that MIGRATE_CMA
    pageblocks are created only during system init, and the above.  The
    related is_migrate_isolate() check is also unnecessary, as memory
    isolation has other ways to move pages between freelists, and drain pcp
    lists containing pages that should be isolated.  The buffered_rmqueue()
    can also benefit from calling get_freepage_migratetype() instead of
    get_pageblock_migratetype().
    
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
    Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
    Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Suggested-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Minchan Kim <minchan@kernel.org>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Michal Nazarewicz <mina86@mina86.com>
    Cc: "Wang, Yalin" <Yalin.Wang@sonymobile.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 2ce666c175bb502cae050c97364acf48449b9d6a
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:35:13 2014 +0100

    mm: vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY
    
    commit 1a501907bbea8e6ebb0b16cf6db9e9cbf1d2c813 upstream.
    
    Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
    ensured that file/anon lists were scanned proportionally for reclaim from
    kswapd but ignored it for direct reclaim.  The intent was to minimse
    direct reclaim latency but Yuanhan Liu pointer out that it substitutes one
    long stall for many small stalls and distorts aging for normal workloads
    like streaming readers/writers.  Hugh Dickins pointed out that a
    side-effect of the same commit was that when one LRU list dropped to zero
    that the entirety of the other list was shrunk leading to excessive
    reclaim in memcgs.  This patch scans the file/anon lists proportionally
    for direct reclaim to similarly age page whether reclaimed by kswapd or
    direct reclaim but takes care to abort reclaim if one LRU drops to zero
    after reclaiming the requested number of pages.
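
    A self-contained toy sketch of proportional scanning with an early
    abort in the spirit described above (the numbers and names are
    invented, not the reclaim code):

        #include <stdio.h>

        /* Scan the file and anon lists in proportion to their sizes, but
         * stop early once the reclaim target is met or either list runs
         * dry, rather than scanning the survivor down to nothing. */
        static unsigned long shrink_lists(unsigned long nr_file, unsigned long nr_anon,
                                          unsigned long nr_to_reclaim)
        {
                unsigned long total = nr_file + nr_anon;
                unsigned long want_file, want_anon, reclaimed = 0;

                if (!total)
                        return 0;
                want_file = nr_to_reclaim * nr_file / total;
                want_anon = nr_to_reclaim - want_file;

                while (want_file || want_anon) {
                        if (want_file && nr_file) {
                                nr_file--; want_file--; reclaimed++;
                        }
                        if (want_anon && nr_anon) {
                                nr_anon--; want_anon--; reclaimed++;
                        }
                        if (reclaimed >= nr_to_reclaim || !nr_file || !nr_anon)
                                break;
                }
                return reclaimed;
        }

        int main(void)
        {
                printf("reclaimed %lu pages\n", shrink_lists(900, 100, 32));
                return 0;
        }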
    
    Based on ext4 and using the Intel VM scalability test
    
                                                  3.15.0-rc5            3.15.0-rc5
                                                    shrinker            proportion
    Unit  lru-file-readonce    elapsed      5.3500 (  0.00%)      5.4200 ( -1.31%)
    Unit  lru-file-readonce time_range      0.2700 (  0.00%)      0.1400 ( 48.15%)
    Unit  lru-file-readonce time_stddv      0.1148 (  0.00%)      0.0536 ( 53.33%)
    Unit lru-file-readtwice    elapsed      8.1700 (  0.00%)      8.1700 (  0.00%)
    Unit lru-file-readtwice time_range      0.4300 (  0.00%)      0.2300 ( 46.51%)
    Unit lru-file-readtwice time_stddv      0.1650 (  0.00%)      0.0971 ( 41.16%)
    
    The test cases are running multiple dd instances reading sparse files.
    The results are within the noise for the small test machine.  The
    impact of the patch is more noticeable from the vmstats
    
                                3.15.0-rc5  3.15.0-rc5
                                  shrinker  proportion
    Minor Faults                     35154       36784
    Major Faults                       611        1305
    Swap Ins                           394        1651
    Swap Outs                         4394        5891
    Allocation stalls               118616       44781
    Direct pages scanned           4935171     4602313
    Kswapd pages scanned          15921292    16258483
    Kswapd pages reclaimed        15913301    16248305
    Direct pages reclaimed         4933368     4601133
    Kswapd efficiency                  99%         99%
    Kswapd velocity             670088.047  682555.961
    Direct efficiency                  99%         99%
    Direct velocity             207709.217  193212.133
    Percentage direct scans            23%         22%
    Page writes by reclaim        4858.000    6232.000
    Page writes file                   464         341
    Page writes anon                  4394        5891
    
    Note that there are fewer allocation stalls even though the amount
    of direct reclaim scanning is very approximately the same.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Tim Chen <tim.c.chen@linux.intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
    Cc: Bob Liu <bob.liu@oracle.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 2d37a72e406e226656ff9009bb7913e6ff9c3025
Author: Tim Chen <tim.c.chen@linux.intel.com>
Date:   Thu Aug 28 19:35:12 2014 +0100

    fs/superblock: avoid locking counting inodes and dentries before reclaiming them
    
    commit d23da150a37c9fe3cc83dbaf71b3e37fd434ed52 upstream.
    
    We remove the call to grab_super_passive() from super_cache_count().
    It becomes a scalability bottleneck when multiple threads are trying to
    do memory reclamation, e.g. when we are doing a large amount of file
    reads and the page cache is under pressure.  The cached objects quickly
    get reclaimed down to 0 and we abort the cache_scan() reclaim, but the
    counting creates a log jam acquiring the sb_lock.
    
    We are holding the shrinker_rwsem which ensures the safety of call to
    list_lru_count_node() and s_op->nr_cached_objects.  The shrinker is
    unregistered now before ->kill_sb() so the operation is safe when we are
    doing unmount.
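
    As a minimal user-space sketch of the locking idea (the counters and the
    rwlock are invented stand-ins, not the kernel API): the counting side
    relies on a read lock its caller already holds instead of taking another
    per-superblock lock just to read a couple of counters.

      #include <pthread.h>
      #include <stdio.h>

      static pthread_rwlock_t shrinker_rwsem = PTHREAD_RWLOCK_INITIALIZER;
      static unsigned long cached_dentries = 1000, cached_inodes = 500;

      /* Model of the count path: the caller already holds shrinker_rwsem for
       * read, which keeps the "superblock" stable, so no extra lock is taken
       * here just to sum a couple of counters. */
      static unsigned long super_cache_count_model(void)
      {
              return cached_dentries + cached_inodes;
      }

      int main(void)
      {
              unsigned long freeable;

              pthread_rwlock_rdlock(&shrinker_rwsem);
              freeable = super_cache_count_model();
              pthread_rwlock_unlock(&shrinker_rwsem);
              printf("freeable objects: %lu\n", freeable);
              return 0;
      }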
    
    The impact will depend heavily on the machine and the workload but for a
    small machine using postmark tuned to use 4xRAM size the results were
    
                                      3.15.0-rc5            3.15.0-rc5
                                         vanilla         shrinker-v1r1
    Ops/sec Transactions         21.00 (  0.00%)       24.00 ( 14.29%)
    Ops/sec FilesCreate          39.00 (  0.00%)       44.00 ( 12.82%)
    Ops/sec CreateTransact       10.00 (  0.00%)       12.00 ( 20.00%)
    Ops/sec FilesDeleted       6202.00 (  0.00%)     6202.00 (  0.00%)
    Ops/sec DeleteTransact       11.00 (  0.00%)       12.00 (  9.09%)
    Ops/sec DataRead/MB          25.97 (  0.00%)       29.10 ( 12.05%)
    Ops/sec DataWrite/MB         49.99 (  0.00%)       56.02 ( 12.06%)
    
    ffsb running in a configuration that is meant to simulate a mail server showed
    
                                     3.15.0-rc5             3.15.0-rc5
                                        vanilla          shrinker-v1r1
    Ops/sec readall           9402.63 (  0.00%)      9567.97 (  1.76%)
    Ops/sec create            4695.45 (  0.00%)      4735.00 (  0.84%)
    Ops/sec delete             173.72 (  0.00%)       179.83 (  3.52%)
    Ops/sec Transactions     14271.80 (  0.00%)     14482.81 (  1.48%)
    Ops/sec Read                37.00 (  0.00%)        37.60 (  1.62%)
    Ops/sec Write               18.20 (  0.00%)        18.30 (  0.55%)
    
    Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
    Cc: Bob Liu <bob.liu@oracle.com>
    Cc: Jan Kara <jack@suse.cz>
    Acked-by: Rik van Riel <riel@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 2f11e3a821aff1a50c8ebc2fadbeb22c1ea1adbd
Author: Dave Chinner <david@fromorbit.com>
Date:   Thu Aug 28 19:35:11 2014 +0100

    fs/superblock: unregister sb shrinker before ->kill_sb()
    
    commit 28f2cd4f6da24a1aa06c226618ed5ad69e13df64 upstream.
    
    This series is aimed at regressions noticed during reclaim activity.  The
    first two patches are shrinker patches that were posted ages ago but never
    merged for reasons that are unclear to me.  I'm posting them again to see
    if there was a reason they were dropped or if they just got lost.  Dave?
    Tim?  The last patch adjusts proportional reclaim.  Yuanhan Liu, can you
    retest the vm scalability test cases on a larger machine?  Hugh, does this
    work for you on the memcg test cases?
    
    Based on ext4, I get the following results but unfortunately my larger
    test machines are all unavailable so this is based on a relatively small
    machine.
    
    postmark
                                      3.15.0-rc5            3.15.0-rc5
                                         vanilla       proportion-v1r4
    Ops/sec Transactions         21.00 (  0.00%)       25.00 ( 19.05%)
    Ops/sec FilesCreate          39.00 (  0.00%)       45.00 ( 15.38%)
    Ops/sec CreateTransact       10.00 (  0.00%)       12.00 ( 20.00%)
    Ops/sec FilesDeleted       6202.00 (  0.00%)     6202.00 (  0.00%)
    Ops/sec DeleteTransact       11.00 (  0.00%)       12.00 (  9.09%)
    Ops/sec DataRead/MB          25.97 (  0.00%)       30.02 ( 15.59%)
    Ops/sec DataWrite/MB         49.99 (  0.00%)       57.78 ( 15.58%)
    
    ffsb (mail server simulator)
                                     3.15.0-rc5             3.15.0-rc5
                                        vanilla        proportion-v1r4
    Ops/sec readall           9402.63 (  0.00%)      9805.74 (  4.29%)
    Ops/sec create            4695.45 (  0.00%)      4781.39 (  1.83%)
    Ops/sec delete             173.72 (  0.00%)       177.23 (  2.02%)
    Ops/sec Transactions     14271.80 (  0.00%)     14764.37 (  3.45%)
    Ops/sec Read                37.00 (  0.00%)        38.50 (  4.05%)
    Ops/sec Write               18.20 (  0.00%)        18.50 (  1.65%)
    
    dd of a large file
                                    3.15.0-rc5            3.15.0-rc5
                                       vanilla       proportion-v1r4
    WallTime DownloadTar       75.00 (  0.00%)       61.00 ( 18.67%)
    WallTime DD               423.00 (  0.00%)      401.00 (  5.20%)
    WallTime Delete             2.00 (  0.00%)        5.00 (-150.00%)
    
    stutter (times mmap latency during large amounts of IO)
    
                                3.15.0-rc5            3.15.0-rc5
                                   vanilla       proportion-v1r4
    Unit >5ms Delays  80252.0000 (  0.00%)  81523.0000 ( -1.58%)
    Unit Mmap min         8.2118 (  0.00%)      8.3206 ( -1.33%)
    Unit Mmap mean       17.4614 (  0.00%)     17.2868 (  1.00%)
    Unit Mmap stddev     24.9059 (  0.00%)     34.6771 (-39.23%)
    Unit Mmap max      2811.6433 (  0.00%)   2645.1398 (  5.92%)
    Unit Mmap 90%        20.5098 (  0.00%)     18.3105 ( 10.72%)
    Unit Mmap 93%        22.9180 (  0.00%)     20.1751 ( 11.97%)
    Unit Mmap 95%        25.2114 (  0.00%)     22.4988 ( 10.76%)
    Unit Mmap 99%        46.1430 (  0.00%)     43.5952 (  5.52%)
    Unit Ideal  Tput     85.2623 (  0.00%)     78.8906 (  7.47%)
    Unit Tput min        44.0666 (  0.00%)     43.9609 (  0.24%)
    Unit Tput mean       45.5646 (  0.00%)     45.2009 (  0.80%)
    Unit Tput stddev      0.9318 (  0.00%)      1.1084 (-18.95%)
    Unit Tput max        46.7375 (  0.00%)     46.7539 ( -0.04%)
    
    This patch (of 3):
    
    We would like to unregister the sb shrinker before ->kill_sb().  This will
    allow cached objects to be counted without calling grab_super_passive() to
    update the ref count on the sb.  We want to avoid locking during memory
    reclamation, especially when we are skipping the memory reclaim because we
    are out of cached objects.
    
    This is safe because grab_super_passive does a try-lock on the
    sb->s_umount now, and so if we are in the unmount process, it won't ever
    block.  That means the deadlock and races we used to avoid by using
    grab_super_passive() now look like this:
    
            shrinker                        umount
    
            down_read(shrinker_rwsem)
                                            down_write(sb->s_umount)
                                            shrinker_unregister
                                              down_write(shrinker_rwsem)
                                                <blocks>
            grab_super_passive(sb)
              down_read_trylock(sb->s_umount)
                <fails>
            <shrinker aborts>
            ....
            <shrinkers finish running>
            up_read(shrinker_rwsem)
                                              <unblocks>
                                              <removes shrinker>
                                              up_write(shrinker_rwsem)
                                            ->kill_sb()
                                            ....
    
    So it is safe to deregister the shrinker before ->kill_sb().
    
    Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
    Cc: Bob Liu <bob.liu@oracle.com>
    Cc: Jan Kara <jack@suse.cz>
    Acked-by: Rik van Riel <riel@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 94109cd2a5dbda0eb858ebe8d4d709150e4e04f0
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Aug 28 19:35:10 2014 +0100

    mm: fix direct reclaim writeback regression
    
    commit 8bdd638091605dc66d92c57c4b80eb87fffc15f7 upstream.
    
    Shortly before 3.16-rc1, Dave Jones reported:
    
      WARNING: CPU: 3 PID: 19721 at fs/xfs/xfs_aops.c:971
               xfs_vm_writepage+0x5ce/0x630 [xfs]()
      CPU: 3 PID: 19721 Comm: trinity-c61 Not tainted 3.15.0+ #3
      Call Trace:
        xfs_vm_writepage+0x5ce/0x630 [xfs]
        shrink_page_list+0x8f9/0xb90
        shrink_inactive_list+0x253/0x510
        shrink_lruvec+0x563/0x6c0
        shrink_zone+0x3b/0x100
        shrink_zones+0x1f1/0x3c0
        try_to_free_pages+0x164/0x380
        __alloc_pages_nodemask+0x822/0xc90
        alloc_pages_vma+0xaf/0x1c0
        handle_mm_fault+0xa31/0xc50
      etc.
    
     970   if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
     971                   PF_MEMALLOC))
    
    I did not respond at the time, because a glance at the PageDirty block
    in shrink_page_list() quickly shows that this is impossible: we don't do
    writeback on file pages (other than tmpfs) from direct reclaim nowadays.
    Dave was hallucinating, but it would have been disrespectful to say so.
    
    However, my own /var/log/messages now shows similar complaints
    
      WARNING: CPU: 1 PID: 28814 at fs/ext4/inode.c:1881 ext4_writepage+0xa7/0x38b()
      WARNING: CPU: 0 PID: 27347 at fs/ext4/inode.c:1764 ext4_writepage+0xa7/0x38b()
    
    from stressing some mmotm trees during July.
    
    Could a dirty xfs or ext4 file page somehow get marked PageSwapBacked,
    so fail shrink_page_list()'s page_is_file_cache() test, and so proceed
    to mapping->a_ops->writepage()?
    
    Yes, 3.16-rc1's commit 68711a746345 ("mm, migration: add destination
    page freeing callback") has provided compaction with such a way: if
    migrating a SwapBacked page fails, its newpage may be put back on the
    list for later use with PageSwapBacked still set, and nothing will clear
    it.
    
    Whether that can do anything worse than issue WARN_ON_ONCEs, and get
    some statistics wrong, is unclear: easier to fix than to think through
    the consequences.
    
    Fixing it here, before the put_new_page(), addresses the bug directly,
    but is probably the worst place to fix it.  Page migration is doing too
    many parts of the job on too many levels: fixing it in
    move_to_new_page() to complement its SetPageSwapBacked would be
    preferable, except why is it (and newpage->mapping and newpage->index)
    done there, rather than down in migrate_page_move_mapping(), once we are
    sure of success? Not a cleanup to get into right now, especially not
    with memcg cleanups coming in 3.17.
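
    As a rough user-space model of the shape of the fix (an invented page-flag
    bitfield, not the actual migrate.c change): the stale flag is cleared on
    the unused destination page before it is handed back.

      #include <stdio.h>

      #define PG_swapbacked (1u << 0)

      struct fake_page { unsigned int flags; };

      /* Model of the failure path: migration did not succeed, so newpage is
       * returned for later reuse; clear the flag that was copied onto it. */
      static void put_back_unused_page(struct fake_page *newpage)
      {
              newpage->flags &= ~PG_swapbacked;
              /* ...then hand the page back to whoever supplied it... */
      }

      int main(void)
      {
              struct fake_page newpage = { .flags = PG_swapbacked };

              put_back_unused_page(&newpage);
              printf("PG_swapbacked still set? %s\n",
                     (newpage.flags & PG_swapbacked) ? "yes" : "no");
              return 0;
      }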
    
    Reported-by: Dave Jones <davej@redhat.com>
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit c32064785ad77722a76437f7d0826f063158d3f2
Author: Shaohua Li <shli@kernel.org>
Date:   Thu Aug 28 19:35:09 2014 +0100

    x86/mm: In the PTE swapout page reclaim case clear the accessed bit instead of flushing the TLB
    
    commit b13b1d2d8692b437203de7a404c6b809d2cc4d99 upstream.
    
    We use the accessed bit to age a page at page reclaim time,
    and currently we also flush the TLB when doing so.
    
    But in some workloads TLB flush overhead is very heavy. In my
    simple multithreaded app with a lot of swap to several pcie
    SSDs, removing the tlb flush gives about 20% ~ 30% swapout
    speedup.
    
    Fortunately just removing the TLB flush is a valid optimization:
    on x86 CPUs, clearing the accessed bit without a TLB flush
    doesn't cause data corruption.
    
    It could cause incorrect page aging and the (mistaken) reclaim of
    hot pages, but the chance of that should be relatively low.
    
    So as a performance optimization don't flush the TLB when
    clearing the accessed bit, it will eventually be flushed by
    a context switch or a VM operation anyway. [ In the rare
    event of it not getting flushed for a long time the delay
    shouldn't really matter because there's no real memory
    pressure for swapout to react to. ]
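
    A toy user-space model of the before/after (an invented pte structure, no
    real TLB involved): the accessed bit is test-and-cleared and the expensive
    flush step is simply skipped.

      #include <stdbool.h>
      #include <stdio.h>

      struct fake_pte { bool accessed; };

      /* Old behaviour: clear the accessed bit and flush the TLB entry.
       * New behaviour: just clear the bit; a later context switch or other
       * flush picks up the change eventually. */
      static bool clear_young_no_flush(struct fake_pte *pte)
      {
              bool was_young = pte->accessed;

              pte->accessed = false;
              /* no flush_tlb_page() equivalent here; that is the point */
              return was_young;
      }

      int main(void)
      {
              struct fake_pte pte = { .accessed = true };

              printf("page was young: %d\n", clear_young_no_flush(&pte));
              printf("page is young now: %d\n", pte.accessed);
              return 0;
      }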
    
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Shaohua Li <shli@fusionio.com>
    Acked-by: Rik van Riel <riel@redhat.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Hugh Dickins <hughd@google.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: linux-mm@kvack.org
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Link: http://lkml.kernel.org/r/20140408075809.GA1764@kernel.org
    [ Rewrote the changelog and the code comments. ]
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit b25fd5de48e3ad079abd40a4ea946f126a8bceb0
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Thu Aug 28 19:35:08 2014 +0100

    mm, compaction: properly signal and act upon lock and need_sched() contention
    
    commit be9765722e6b7ece8263cbab857490332339bd6f upstream.
    
    Compaction uses compact_checklock_irqsave() function to periodically check
    for lock contention and need_resched() to either abort async compaction,
    or to free the lock, schedule and retake the lock.  When aborting,
    cc->contended is set to signal the contended state to the caller.  Two
    problems have been identified in this mechanism.
    
    First, compaction also calls cond_resched() directly in both scanners when
    no lock is yet taken.  This call neither aborts async compaction nor sets
    cc->contended appropriately.  This patch introduces a new
    compact_should_abort() function to achieve both.  In isolate_freepages(),
    the check frequency is reduced to once per SWAP_CLUSTER_MAX pageblocks to
    match what the migration scanner does in the preliminary page checks.  In
    case a pageblock is found suitable for calling isolate_freepages_block(),
    the checks within it are done at a higher frequency.
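
    A rough user-space model of what such a check does (plain booleans stand
    in for the scheduler and lock state; this is not the kernel function):

      #include <stdbool.h>
      #include <stdio.h>

      struct cc { bool async; bool contended; };

      static bool fake_need_resched;          /* stand-in for need_resched() */

      /* Model: if a reschedule is pending, async compaction records the
       * contention and asks the caller to abort; sync compaction would just
       * yield the CPU and carry on. */
      static bool should_abort(struct cc *cc)
      {
              if (!fake_need_resched)
                      return false;
              if (cc->async) {
                      cc->contended = true;
                      return true;
              }
              /* sync case: pretend we called cond_resched() and continue */
              return false;
      }

      int main(void)
      {
              struct cc cc = { .async = true };

              fake_need_resched = true;
              printf("abort=%d contended=%d\n", should_abort(&cc), cc.contended);
              return 0;
      }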
    
    Second, isolate_freepages() does not check if isolate_freepages_block()
    aborted due to contention, and advances to the next pageblock.  This
    violates the principle of aborting on contention, and might result in
    pageblocks not being scanned completely, since the scanning cursor is
    advanced.  This problem has been noticed in the code by Joonsoo Kim when
    reviewing related patches.  This patch makes isolate_freepages_block()
    check the cc->contended flag and abort.
    
    In case isolate_freepages() has already isolated some pages before
    aborting due to contention, page migration will proceed, which is OK since
    we do not want to waste the work that has been done, and page migration
    has its own checks for contention.  However, we do not want another
    isolation attempt by either of the scanners, so a cc->contended flag check
    is also added to compaction_alloc() and compact_finished() to make sure
    compaction is aborted right after the migration.
    
    The outcome of the patch should be reduced lock contention by async
    compaction and lower latencies for higher-order allocations where direct
    compaction is involved.
    
    [akpm@linux-foundation.org: fix typo in comment]
    Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
    Cc: Michal Nazarewicz <mina86@mina86.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Rik van Riel <riel@redhat.com>
    Acked-by: Michal Nazarewicz <mina86@mina86.com>
    Tested-by: Shawn Guo <shawn.guo@linaro.org>
    Tested-by: Kevin Hilman <khilman@linaro.org>
    Tested-by: Stephen Warren <swarren@nvidia.com>
    Tested-by: Fabio Estevam <fabio.estevam@freescale.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 71a5b801344ae1f03e2fb6ddad600c554f243257
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Thu Aug 28 19:35:07 2014 +0100

    mm/compaction: avoid rescanning pageblocks in isolate_freepages
    
    commit e9ade569910a82614ff5f2c2cea2b65a8d785da4 upstream.
    
    The compaction free scanner in isolate_freepages() currently remembers PFN
    of the highest pageblock where it successfully isolates, to be used as the
    starting pageblock for the next invocation.  The rationale behind this is
    that page migration might return free pages to the allocator when
    migration fails and we don't want to skip them if the compaction
    continues.
    
    Since migration now returns free pages back to compaction code where they
    can be reused, this is no longer a concern.  This patch changes
    isolate_freepages() so that the PFN for restarting is updated with each
    pageblock where isolation is attempted.  Using stress-highalloc from
    mmtests, this resulted in 10% reduction of the pages scanned by the free
    scanner.
    
    Note that the somewhat similar functionality that records highest
    successful pageblock in zone->compact_cached_free_pfn, remains unchanged.
    This cache is used when the whole compaction is restarted, not for
    multiple invocations of the free scanner during single compaction.
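
    A small user-space sketch of the changed bookkeeping (pageblock contents
    are invented): the restart position now advances with every pageblock
    attempted, not only with every successful isolation.

      #include <stdio.h>

      #define NR_PAGEBLOCKS 8

      int main(void)
      {
              /* free pages found in each pageblock (invented numbers) */
              int freepages[NR_PAGEBLOCKS] = { 0, 3, 0, 0, 7, 0, 2, 0 };
              int needed = 8, found = 0;
              int next_start = NR_PAGEBLOCKS;   /* the free scanner walks downwards */

              for (int block = NR_PAGEBLOCKS - 1; block >= 0 && found < needed;
                   block--) {
                      /* New behaviour: remember every pageblock attempted, so the
                       * next invocation never rescans it.  The old code updated
                       * next_start only when freepages[block] was non-zero. */
                      next_start = block;
                      found += freepages[block];
              }
              printf("found %d pages, restart at pageblock %d\n", found, next_start);
              return 0;
      }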
    
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
    Acked-by: Michal Nazarewicz <mina86@mina86.com>
    Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Rik van Riel <riel@redhat.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 5e4084a627820bd6a8d94bca6acb878e5308716c
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Thu Aug 28 19:35:06 2014 +0100

    mm/compaction: do not count migratepages when unnecessary
    
    commit f8c9301fa5a2a8b873c67f2a3d8230d5c13f61b7 upstream.
    
    During compaction, update_nr_listpages() has been used to count remaining
    non-migrated and free pages after a call to migrate_pages().  The
    freepages counting has become unnecessary, and it turns out that
    migratepages counting is also unnecessary in most cases.
    
    The only situation when it's needed to count cc->migratepages is when
    migrate_pages() returns with a negative error code.  Otherwise, the
    non-negative return value is the number of pages that were not migrated,
    which is exactly the count of remaining pages in the cc->migratepages
    list.
    
    Furthermore, any non-zero count is only interesting for the tracepoint of
    mm_compaction_migratepages events, because after that all remaining
    unmigrated pages are put back and their count is set to 0.
    
    This patch therefore removes update_nr_listpages() completely, and changes
    the tracepoint definition so that the manual counting is done only when
    the tracepoint is enabled, and only when migrate_pages() returns a
    negative error code.
    
    Furthermore, migrate_pages() and the tracepoints won't be called when
    there's nothing to migrate.  This potentially avoids some wasted cycles
    and reduces the volume of uninteresting mm_compaction_migratepages events
    where "nr_migrated=0 nr_failed=0".  In the stress-highalloc mmtest, this
    was about 75% of the events.  The mm_compaction_isolate_migratepages event
    is better for determining that nothing was isolated for migration, and
    this one was just duplicating the info.
    
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
    Acked-by: Michal Nazarewicz <mina86@mina86.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Rik van Riel <riel@redhat.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 0d9b79244228f7ce9d2b2d3a16418a1927669ace
Author: David Rientjes <rientjes@google.com>
Date:   Thu Aug 28 19:35:05 2014 +0100

    mm, compaction: terminate async compaction when rescheduling
    
    commit aeef4b83806f49a0c454b7d4578671b71045bee2 upstream.
    
    Async compaction terminates prematurely when need_resched(), see
    compact_checklock_irqsave().  This can never trigger, however, if the
    cond_resched() in isolate_migratepages_range() always takes care of the
    scheduling.
    
    If the cond_resched() actually triggers, then terminate this pageblock
    scan for async compaction as well.
    
    Signed-off-by: David Rientjes <rientjes@google.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 812fcdf3758425af5bd8c8ec32ffdb7a9615e00c
Author: David Rientjes <rientjes@google.com>
Date:   Thu Aug 28 19:35:04 2014 +0100

    mm, compaction: embed migration mode in compact_control
    
    commit e0b9daeb453e602a95ea43853dc12d385558ce1f upstream.
    
    We're going to want to manipulate the migration mode for compaction in the
    page allocator, and currently compact_control's sync field is only a bool.
    
    Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
    depending on the value of this bool.  Convert the bool to enum
    migrate_mode and pass the migration mode in directly.  Later, we'll want
    to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault path to
    avoid unnecessary latency.
    
    This also alters compaction triggered from sysfs, either for the entire
    system or for a node, to force MIGRATE_SYNC.
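
    A minimal sketch of the data-structure change (a stripped-down struct for
    illustration; the real compact_control has many more fields):

      /* The three migration modes named above. */
      enum migrate_mode {
              MIGRATE_ASYNC,
              MIGRATE_SYNC_LIGHT,
              MIGRATE_SYNC,
      };

      struct compact_control_sketch {
              /* before: bool sync; only two of the three modes could be expressed */
              enum migrate_mode mode;         /* after: all three modes fit */
              /* ... scanner positions, zone pointer, etc. elided ... */
      };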
    
    [akpm@linux-foundation.org: fix build]
    [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
    Signed-off-by: David Rientjes <rientjes@google.com>
    Suggested-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 7e95430e4232f55bfd8f1728fb620957912c32f4
Author: David Rientjes <rientjes@google.com>
Date:   Thu Aug 28 19:35:03 2014 +0100

    mm, compaction: add per-zone migration pfn cache for async compaction
    
    commit 35979ef3393110ff3c12c6b94552208d3bdf1a36 upstream.
    
    Each zone has a cached migration scanner pfn for memory compaction so that
    subsequent calls to memory compaction can start where the previous call
    left off.
    
    Currently, the compaction migration scanner only updates the per-zone
    cached pfn when pageblocks were not skipped for async compaction.  This
    creates a dependency on calling sync compaction to avoid subsequent calls
    to async compaction scanning an enormous number of non-MOVABLE pageblocks
    each time it is called.  On large machines, this could be
    potentially very expensive.
    
    This patch adds a per-zone cached migration scanner pfn only for async
    compaction.  It is updated every time a pageblock has been scanned in its
    entirety and when no pages from it were successfully isolated.  The cached
    migration scanner pfn for sync compaction is updated only when called for
    sync compaction.
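
    One way to picture the bookkeeping (a sketch with invented names; the real
    struct zone differs):

      #include <stdbool.h>
      #include <stdio.h>

      /* Sketch: one cached migration-scanner position per compaction mode. */
      struct zone_sketch {
              unsigned long cached_migrate_pfn[2];    /* [0] = async, [1] = sync */
      };

      /* Each mode remembers its own restart position, as described above. */
      static void update_cached_pfn(struct zone_sketch *z, bool sync,
                                    unsigned long pfn)
      {
              z->cached_migrate_pfn[sync ? 1 : 0] = pfn;
      }

      int main(void)
      {
              struct zone_sketch z = { { 0, 0 } };

              update_cached_pfn(&z, false, 2048);     /* async pass only */
              printf("async=%lu sync=%lu\n",
                     z.cached_migrate_pfn[0], z.cached_migrate_pfn[1]);
              return 0;
      }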
    
    Signed-off-by: David Rientjes <rientjes@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit b264e9ab205b3038b032a18df181fd6ed7a77ce3
Author: David Rientjes <rientjes@google.com>
Date:   Thu Aug 28 19:35:02 2014 +0100

    mm, compaction: return failed migration target pages back to freelist
    
    commit d53aea3d46d64e95da9952887969f7533b9ab25e upstream.
    
    Greg reported that he found isolated free pages were returned back to the
    VM rather than the compaction freelist.  This will cause holes behind the
    free scanner and cause it to reallocate additional memory if necessary
    later.
    
    He detected the problem at runtime seeing that ext4 metadata pages (esp
    the ones read by "sbi->s_group_desc[i] = sb_bread(sb, block)") were
    constantly visited by compaction calls of migrate_pages().  These pages
    had a non-zero b_count which caused fallback_migrate_page() ->
    try_to_release_page() -> try_to_free_buffers() to fail.
    
    Memory compaction works by having a "freeing scanner" scan from one end of
    a zone which isolates pages as migration targets while another "migrating
    scanner" scans from the other end of the same zone which isolates pages
    for migration.
    
    When page migration fails for an isolated page, the target page is
    returned to the system rather than the freelist built by the freeing
    scanner.  This may require the freeing scanner to continue scanning memory
    after suitable migration targets have already been returned to the system
    needlessly.
    
    This patch returns destination pages to the freeing scanner freelist when
    page migration fails.  This prevents unnecessary work done by the freeing
    scanner but also encourages memory to be as compacted as possible at the
    end of the zone.
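
    A toy user-space model of the freelist handling (an array-backed freelist
    and an invented page type, not the kernel list API):

      #include <stdbool.h>
      #include <stdio.h>

      struct fake_page { int id; };

      /* The free scanner's private stash of isolated migration targets. */
      static struct fake_page *freelist[16];
      static int nr_free;

      static void compaction_free_model(struct fake_page *page)
      {
              /* New behaviour: a target page whose migration failed goes back
               * onto the compaction freelist for reuse, not to the system. */
              freelist[nr_free++] = page;
      }

      int main(void)
      {
              struct fake_page target = { .id = 42 };
              bool migration_failed = true;   /* pretend the copy failed */

              if (migration_failed)
                      compaction_free_model(&target);
              printf("pages available for the next migration: %d\n", nr_free);
              return 0;
      }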
    
    Signed-off-by: David Rientjes <rientjes@google.com>
    Reported-by: Greg Thelen <gthelen@google.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit ee92d4d6d28561c5019da10ed62c5bb8bdcd3c1e
Author: David Rientjes <rientjes@google.com>
Date:   Thu Aug 28 19:35:01 2014 +0100

    mm, migration: add destination page freeing callback
    
    commit 68711a746345c44ae00c64d8dbac6a9ce13ac54a upstream.
    
    Memory migration uses a callback defined by the caller to determine how to
    allocate destination pages.  When migration fails for a source page,
    however, it frees the destination page back to the system.
    
    This patch adds a memory migration callback defined by the caller to
    determine how to free destination pages.  If a caller, such as memory
    compaction, builds its own freelist for migration targets, this can reuse
    already freed memory instead of scanning additional memory.
    
    If the caller provides a function to handle freeing of destination pages,
    it is called when page migration fails.  If the caller passes NULL then
    freeing back to the system will be handled as usual.  This patch
    introduces no functional change.
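
    A user-space sketch of the callback shape (the types and the migrate loop
    are simplified stand-ins, not the kernel's migrate_pages() signature):

      #include <stdio.h>
      #include <stdlib.h>

      struct fake_page { int id; };

      /* Caller-supplied hooks: how to get a destination page and, new with
       * this change's idea, how to give an unused one back. */
      typedef struct fake_page *new_page_fn(unsigned long private);
      typedef void free_page_fn(struct fake_page *page, unsigned long private);

      static int migrate_one(struct fake_page *src, new_page_fn *get_new,
                             free_page_fn *put_new, unsigned long private)
      {
              struct fake_page *dst = get_new(private);
              int rc = -1;                    /* pretend the migration failed */

              (void)src;                      /* the copy itself is not modelled */
              if (rc != 0) {
                      if (put_new)
                              put_new(dst, private);  /* caller reclaims the page */
                      else
                              free(dst);              /* old behaviour: back to the system */
              }
              return rc;
      }

      static struct fake_page *alloc_target(unsigned long private)
      {
              (void)private;
              return malloc(sizeof(struct fake_page));
      }

      static void stash_target(struct fake_page *page, unsigned long private)
      {
              (void)private;
              printf("destination page handed back to the caller\n");
              free(page);
      }

      int main(void)
      {
              struct fake_page src = { .id = 1 };

              migrate_one(&src, alloc_target, stash_target, 0);
              return 0;
      }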
    
    Signed-off-by: David Rientjes <rientjes@google.com>
    Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Greg Thelen <gthelen@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit e644c10bf85cd655e616a5985a42b907bc7f6f19
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Thu Aug 28 19:35:00 2014 +0100

    mm/compaction: cleanup isolate_freepages()
    
    commit c96b9e508f3d06ddb601dcc9792d62c044ab359e upstream.
    
    isolate_freepages() is currently somewhat hard to follow thanks to many
    similarly named pfn variables.  In particular, the 'high_pfn' variable
    looks like it is related to the 'low_pfn' variable, but in fact it is not.
    
    This patch renames the 'high_pfn' variable to a hopefully less confusing
    name, and slightly changes its handling without a functional change.  A
    comment made obsolete by recent changes is also updated.
    
    [akpm@linux-foundation.org: comment fixes, per Minchan]
    [iamjoonsoo.kim@lge.com: cleanups]
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
    Cc: Michal Nazarewicz <mina86@mina86.com>
    Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Dongjun Shin <d.j.shin@samsung.com>
    Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 87db4a8abdf503c5e67eca7373dadb177bd0d715
Author: Heesub Shin <heesub.shin@samsung.com>
Date:   Thu Aug 28 19:34:59 2014 +0100

    mm/compaction: clean up unused code lines
    
    commit 13fb44e4b0414d7e718433a49e6430d5b76bd46e upstream.
    
    Remove code lines currently not in use or never called.
    
    Signed-off-by: Heesub Shin <heesub.shin@samsung.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Dongjun Shin <d.j.shin@samsung.com>
    Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
    Cc: Michal Nazarewicz <mina86@mina86.com>
    Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Dongjun Shin <d.j.shin@samsung.com>
    Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 32e8fcae4ace22d053805fe258a2ae78973d4a85
Author: Fabian Frederick <fabf@skynet.be>
Date:   Thu Aug 28 19:34:58 2014 +0100

    mm/readahead.c: inline ra_submit
    
    commit 29f175d125f0f3a9503af8a5596f93d714cceb08 upstream.
    
    Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left
    ra_submit with a single function call.
    
    Move ra_submit to internal.h and inline it to save some stack.  Thanks
    to Andrew Morton for commenting different versions.
    
    Signed-off-by: Fabian Frederick <fabf@skynet.be>
    Suggested-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 72ef5b50e048aae663fe9b8a3702646a773ea414
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Thu Aug 28 19:34:57 2014 +0100

    callers of iov_copy_from_user_atomic() don't need pagecache_disable()
    
    commit 9e8c2af96e0d2d5fe298dd796fb6bc16e888a48d upstream.
    
    ... it does that itself (via kmap_atomic())
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 4fb08e5ab578244164bd2269bd0d5d14669ed35a
Author: Sasha Levin <sasha.levin@oracle.com>
Date:   Thu Aug 28 19:34:56 2014 +0100

    mm: remove read_cache_page_async()
    
    commit 67f9fd91f93c582b7de2ab9325b6e179db77e4d5 upstream.
    
    This patch removes read_cache_page_async() which wasn't really needed
    anywhere and simplifies the code around it a bit.
    
    read_cache_page_async() is useful when we want to read a page into the
    cache without waiting for it to complete.  This happens when the
    appropriate callback 'filler' doesn't complete its read operation and
    releases the page lock immediately, and instead queues a different
    completion routine to do that.  This never actually happened anywhere in
    the code.
    
    read_cache_page_async() had 3 different callers:
    
    - read_cache_page() which is the sync version, it would just wait for
      the requested read to complete using wait_on_page_read().
    
    - JFFS2 would call it from jffs2_gc_fetch_page(), but the filler
      function it supplied doesn't do any async reads, and would complete
      before the filler function returns - making it actually a sync read.
    
    - CRAMFS would call it using the read_mapping_page_async() wrapper, with
      a similar story to JFFS2 - the filler function doesn't do anything that
      resembles async reads and would always complete before the filler function
      returns.
    
    To sum it up, the code in mm/filemap.c never took advantage of having
    read_cache_page_async().  While there are filler callbacks that do async
    reads (such as the block one), we always called them via
    read_cache_page().
    
    This patch adds a mandatory wait for read to complete when adding a new
    page to the cache, and removes read_cache_page_async() and its wrappers.
    
    Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 5c3ce5b2b6bd0c685914b68c1e6106278845add0
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Thu Aug 28 19:34:55 2014 +0100

    mm: madvise: fix MADV_WILLNEED on shmem swapouts
    
    commit 55231e5c898c5c03c14194001e349f40f59bd300 upstream.
    
    MADV_WILLNEED currently does not read swapped out shmem pages back in.
    
    Commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page
    cache radix trees") made find_get_page() filter exceptional radix tree
    entries but failed to convert all find_get_page() callers that WANT
    exceptional entries over to find_get_entry().  One of them is shmem swap
    readahead in madvise, which now skips over any swap-out records.
    
    Convert it to find_get_entry().
    
    Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees")
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Reported-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit e714f0cf03c3adfba331c85deca6844e47b60cc5
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Thu Aug 28 19:34:54 2014 +0100

    mm + fs: prepare for non-page entries in page cache radix trees
    
    commit 0cd6144aadd2afd19d1aca880153530c52957604 upstream.
    
    shmem mappings already contain exceptional entries where swap slot
    information is remembered.
    
    To be able to store eviction information for regular page cache, prepare
    every site dealing with the radix trees directly to handle entries other
    than pages.
    
    The common lookup functions will filter out non-page entries and return
    NULL for page cache holes, just as before.  But provide a raw version of
    the API which returns non-page entries as well, and switch shmem over to
    use it.
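
    A toy model of the filtered vs. raw lookup split (a tagged-pointer
    convention invented for illustration; the kernel uses radix-tree
    exceptional entries):

      #include <stdint.h>
      #include <stdio.h>

      /* Convention for the model: low bit set means "not a page" (e.g. a swap
       * slot or, later, an eviction shadow). */
      #define ENTRY_EXCEPTIONAL 0x1UL

      static void *slots[4] = {
              (void *)0x1000,                                 /* a "page" */
              (void *)(0x52UL << 1 | ENTRY_EXCEPTIONAL),      /* a swap cookie */
              0,                                              /* a hole */
              (void *)0x2000,
      };

      /* Raw lookup: hands back whatever is in the slot, page or not. */
      static void *find_get_entry_model(unsigned long index)
      {
              return slots[index];
      }

      /* Filtered lookup: callers that only understand pages see NULL for
       * both holes and non-page entries, just as before. */
      static void *find_get_page_model(unsigned long index)
      {
              void *entry = find_get_entry_model(index);

              if ((uintptr_t)entry & ENTRY_EXCEPTIONAL)
                      return NULL;
              return entry;
      }

      int main(void)
      {
              printf("index 1: raw=%p filtered=%p\n",
                     find_get_entry_model(1), find_get_page_model(1));
              return 0;
      }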
    
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Reviewed-by: Minchan Kim <minchan@kernel.org>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Bob Liu <bob.liu@oracle.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Luigi Semenzato <semenzato@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Metin Doslu <metin@citusdata.com>
    Cc: Michel Lespinasse <walken@google.com>
    Cc: Ozgun Erdogan <ozgun@citusdata.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Roman Gushchin <klamm@yandex-team.ru>
    Cc: Ryan Mallon <rmallon@gmail.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 3721b421230f4e8546bff908156187cba5b49215
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Thu Aug 28 19:34:53 2014 +0100

    mm: filemap: move radix tree hole searching here
    
    commit e7b563bb2a6f4d974208da46200784b9c5b5a47e upstream.
    
    The radix tree hole searching code is only used for page cache, for
    example the readahead code trying to get a picture of the area
    surrounding a fault.
    
    It sufficed to rely on the radix tree definition of holes, which is
    "empty tree slot".  But this is about to change, though, as shadow page
    descriptors will be stored in the page cache after the actual pages get
    evicted from memory.
    
    Move the functions over to mm/filemap.c and make them native page cache
    operations, where they can later be adapted to handle the new definition
    of "page cache hole".
    
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Reviewed-by: Minchan Kim <minchan@kernel.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Bob Liu <bob.liu@oracle.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Luigi Semenzato <semenzato@google.com>
    Cc: Metin Doslu <metin@citusdata.com>
    Cc: Michel Lespinasse <walken@google.com>
    Cc: Ozgun Erdogan <ozgun@citusdata.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Roman Gushchin <klamm@yandex-team.ru>
    Cc: Ryan Mallon <rmallon@gmail.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit a3d18e49a8fdaed9fba4b65b00ca6dffc0418f45
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Thu Aug 28 19:34:52 2014 +0100

    mm: shmem: save one radix tree lookup when truncating swapped pages
    
    commit 6dbaf22ce1f1dfba33313198eb5bd989ae76dd87 upstream.
    
    Page cache radix tree slots are usually stabilized by the page lock, but
    shmem's swap cookies have no such thing.  Because the overall truncation
    loop is lockless, the swap entry is currently confirmed by a tree lookup
    and then deleted by another tree lookup under the same tree lock region.
    
    Use radix_tree_delete_item() instead, which does the verification and
    deletion with only one lookup.  This also allows removing the
    delete-only special case from shmem_radix_tree_replace().
    
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Minchan Kim <minchan@kernel.org>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Bob Liu <bob.liu@oracle.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Luigi Semenzato <semenzato@google.com>
    Cc: Metin Doslu <metin@citusdata.com>
    Cc: Michel Lespinasse <walken@google.com>
    Cc: Ozgun Erdogan <ozgun@citusdata.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Roman Gushchin <klamm@yandex-team.ru>
    Cc: Ryan Mallon <rmallon@gmail.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 50c4613d2293063a18c05dccfc8f4335be21b3e4
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Thu Aug 28 19:34:51 2014 +0100

    lib: radix-tree: add radix_tree_delete_item()
    
    commit 53c59f262d747ea82e7414774c59a489501186a0 upstream.
    
    Provide a function that does not just delete an entry at a given index,
    but also allows passing in an expected item.  Delete only if that item
    is still located at the specified index.
    
    This is handy when lockless tree traversals want to delete entries as
    well because they don't have to do a second, locked lookup to verify
    the slot has not changed under them before deleting the entry.
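
    A toy array-backed model of the compare-and-delete semantics (not the
    radix tree implementation itself):

      #include <stdio.h>

      static void *slots[8];

      /* Delete the entry at index, but only if it is still the expected item.
       * A NULL item means delete unconditionally.  Returns the deleted entry,
       * or NULL if something else (or nothing) was found there. */
      static void *delete_item_model(unsigned long index, void *item)
      {
              void *entry = slots[index];

              if (!entry || (item && entry != item))
                      return NULL;
              slots[index] = NULL;
              return entry;
      }

      int main(void)
      {
              int cookie = 42, other = 7;

              slots[3] = &cookie;
              printf("wrong item: %p\n", delete_item_model(3, &other));
              printf("right item: %p\n", delete_item_model(3, &cookie));
              return 0;
      }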
    
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Minchan Kim <minchan@kernel.org>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Bob Liu <bob.liu@oracle.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Luigi Semenzato <semenzato@google.com>
    Cc: Metin Doslu <metin@citusdata.com>
    Cc: Michel Lespinasse <walken@google.com>
    Cc: Ozgun Erdogan <ozgun@citusdata.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Roman Gushchin <klamm@yandex-team.ru>
    Cc: Ryan Mallon <rmallon@gmail.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit c05ac84a8523a88e23dfd89f5bb96934409cda8a
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Aug 28 19:34:50 2014 +0100

    mm: don't pointlessly use BUG_ON() for sanity check
    
    commit 50f5aa8a9b248fa4262cf379863ec9a531b49737 upstream.
    
    BUG_ON() is a big hammer, and should be used _only_ if there is some
    major corruption that you cannot possibly recover from, making it
    imperative that the current process (and possibly the whole machine) be
    terminated with extreme prejudice.
    
    The trivial sanity check in the vmacache code is *not* such a fatal
    error.  Recovering from it is absolutely trivial, and using BUG_ON()
    just makes it harder to debug for no actual advantage.
    
    To make matters worse, the placement of the BUG_ON() (only if the range
    check matched) actually makes it harder to hit the sanity check to begin
    with, so _if_ there is a bug (and we just got a report from Srivatsa
    Bhat that this can indeed trigger), it is harder to debug not just
    because the machine is possibly dead, but because we don't have better
    coverage.
    
    BUG_ON() must *die*.  Maybe we should add a checkpatch warning for it,
    because it is simply just about the worst thing you can ever do if you
    hit some "this cannot happen" situation.
    
    Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Davidlohr Bueso <davidlohr@hp.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 9c0073071075474cb066cca50b7b8574c1314e3d
Author: Davidlohr Bueso <davidlohr@hp.com>
Date:   Thu Aug 28 19:34:49 2014 +0100

    mm: per-thread vma caching
    
    commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.
    
    This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random,
    thus further comparisons with other approaches were needed.  There are
    two things to consider when dealing with this, the cache hit rate and
    the latency of find_vma().  Improving the hit-rate does not necessarily
    translate into finding the vma any faster, as the overhead of any fancy
    caching schemes can be too high to consider.
    
    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality.  On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.
    
    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number.  The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed.  Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question.  Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:
    
    1) System bootup: Most programs are single threaded, so the per-thread
       scheme does improve ~50% hit rate by just adding a few more slots to
       the cache.
    
    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 50.61%   | 19.90            |
    | patched        | 73.45%   | 13.58            |
    +----------------+----------+------------------+
    
    2) Kernel build: This one is already pretty good with the current
       approach as we're dealing with good locality.
    
    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 75.28%   | 11.03            |
    | patched        | 88.09%   | 9.31             |
    +----------------+----------+------------------+
    
    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
    
    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 70.66%   | 17.14            |
    | patched        | 91.15%   | 12.57            |
    +----------------+----------+------------------+
    
    4) Ebizzy: There's a fair amount of variation from run to run, but this
       approach always shows nearly perfect hit rates, while baseline is just
       about non-existent.  The number of cycles can fluctuate anywhere from
       ~60 to ~116 for the baseline scheme, but this approach
       reduces it considerably.  For instance, with 80 threads:
    
    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 1.06%    | 91.54            |
    | patched        | 99.97%   | 14.18            |
    +----------------+----------+------------------+
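
    A user-space sketch of the lookup/replacement policy described above (the
    slot count, shift, and names are invented; the real per-thread cache and
    its invalidation hooks live in the kernel):

      #include <stdint.h>
      #include <stdio.h>

      #define VMACACHE_SLOTS 4

      struct fake_vma { uintptr_t start, end; };

      struct thread_cache {
              uint32_t seqnum;                        /* invalidation generation */
              struct fake_vma *slots[VMACACHE_SLOTS];
      };

      /* Slot chosen from the page number containing the address. */
      static unsigned int slot_for(uintptr_t addr)
      {
              return (addr >> 12) % VMACACHE_SLOTS;
      }

      static struct fake_vma *cache_find(struct thread_cache *c, uint32_t mm_seq,
                                         uintptr_t addr)
      {
              struct fake_vma *vma;

              if (c->seqnum != mm_seq)        /* address space changed: miss */
                      return NULL;
              vma = c->slots[slot_for(addr)];
              if (vma && addr >= vma->start && addr < vma->end)
                      return vma;
              return NULL;
      }

      static void cache_update(struct thread_cache *c, uint32_t mm_seq,
                               uintptr_t addr, struct fake_vma *vma)
      {
              c->seqnum = mm_seq;             /* revalidate this thread's cache */
              c->slots[slot_for(addr)] = vma;
      }

      int main(void)
      {
              struct fake_vma vma = { 0x400000, 0x600000 };
              struct thread_cache tc = { 0 };
              uint32_t mm_seq = 1;            /* bumped on every unmap/remap */

              cache_update(&tc, mm_seq, 0x401000, &vma);
              printf("hit: %p\n", (void *)cache_find(&tc, mm_seq, 0x401234));
              mm_seq++;                       /* e.g. a munmap() invalidates */
              printf("after invalidation: %p\n",
                     (void *)cache_find(&tc, mm_seq, 0x401234));
              return 0;
      }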
    
    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reviewed-by: Michel Lespinasse <walken@google.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Tested-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit edce92fcf24868572b57678126f17de622ae660d
Author: Christoph Lameter <cl@linux.com>
Date:   Thu Aug 28 19:34:48 2014 +0100

    vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state()
    
    commit 83da7510058736c09a14b9c17ec7d851940a4332 upstream.
    
    reclaim_clean_pages_from_list() seems to be called with preemption
    enabled.  Therefore it must use mod_zone_page_state() instead of the
    preemption-unsafe __mod_zone_page_state().
    
    Signed-off-by: Christoph Lameter <cl@linux.com>
    Reported-by: Grygorii Strashko <grygorii.strashko@ti.com>
    Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 4b946426a0af6a745684f38d365ef505f29f7188
Author: Vladimir Davydov <vdavydov@parallels.com>
Date:   Thu Aug 28 19:34:47 2014 +0100

    mm: vmscan: shrink_slab: rename max_pass -> freeable
    
    commit d5bc5fd3fcb7b8dfb431694a8c8052466504c10c upstream.
    
    The name `max_pass' is misleading, because this variable actually keeps
    the estimate number of freeable objects, not the maximal number of
    objects we can scan in this pass, which can be twice that.  Rename it to
    reflect its actual meaning.
    
    Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit b3b0bd3970bbe5c770a3f0764c4da9111605ad47
Author: Vladimir Davydov <vdavydov@parallels.com>
Date:   Thu Aug 28 19:34:46 2014 +0100

    mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim
    
    commit 99120b772b52853f9a2b829a21dd44d9b20558f1 upstream.
    
    When direct reclaim is executed by a process bound to a set of NUMA
    nodes, we should scan only those nodes when possible, but currently we
    will scan kmem from all online nodes even if the kmem shrinker is NUMA
    aware.  As things stand, binding a process to a particular NUMA node won't
    prevent it from shrinking inode/dentry caches from other nodes, which is
    not good.  Fix this.
    
    Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Dave Chinner <dchinner@redhat.com>
    Cc: Glauber Costa <glommer@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 5a6ad555bd1cc9c3126df8b8d69ea332b9a4ec8b
Author: Jens Axboe <axboe@fb.com>
Date:   Thu Aug 28 19:34:45 2014 +0100

    mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT
    
    commit 7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d upstream.
    
    In some testing I ran today (some fio jobs that spread over two nodes),
    we end up spending 40% of the time in filemap_check_errors().  That
    smells fishy.  Looking further, this is basically what happens:
    
    blkdev_aio_read()
        generic_file_aio_read()
            filemap_write_and_wait_range()
                if (!mapping->nr_pages)
                    filemap_check_errors()
    
    and filemap_check_errors() always attempts two test_and_clear_bit() on
    the mapping flags, thus dirtying it for every single invocation.  The
    patch below tests each of these bits before clearing them, avoiding this
    issue.  In my test case (4-socket box), performance went from 1.7M IOPS
    to 4.0M IOPS.
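
    A user-space sketch of the test-before-clear pattern (plain flag words
    stand in for the mapping flags; no atomics or cross-CPU contention is
    modelled here):

      #include <stdio.h>

      #define AS_EIO          (1u << 0)
      #define AS_ENOSPC       (1u << 1)

      static int test_and_clear(unsigned int *flags, unsigned int bit)
      {
              int was_set = !!(*flags & bit);

              *flags &= ~bit;         /* always writes: "dirties" the word */
              return was_set;
      }

      static int check_errors(unsigned int *flags)
      {
              int ret = 0;

              /* Test first (read-only in the common case) and clear only if
               * the bit is actually set, so the hot path no longer writes. */
              if ((*flags & AS_ENOSPC) && test_and_clear(flags, AS_ENOSPC))
                      ret = -1;       /* stands in for -ENOSPC */
              if ((*flags & AS_EIO) && test_and_clear(flags, AS_EIO))
                      ret = -2;       /* stands in for -EIO */
              return ret;
      }

      int main(void)
      {
              unsigned int flags = 0;

              printf("clean mapping: %d\n", check_errors(&flags));
              flags |= AS_EIO;
              printf("mapping with an error: %d\n", check_errors(&flags));
              return 0;
      }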
    
    Signed-off-by: Jens Axboe <axboe@fb.com>
    Acked-by: Jeff Moyer <jmoyer@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 337c9823cf0b93c8f4d1c4654bd93cf24e5b837b
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:34:44 2014 +0100

    mm: optimize put_mems_allowed() usage
    
    commit d26914d11751b23ca2e8747725f2cae10c2f2c1b upstream.
    
    Since put_mems_allowed() is strictly optional (it is a seqcount retry),
    we don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb(), some loads and comparisons on some
    relatively fast paths.
    
    Since the naming of get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.
    
    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
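
    A user-space sketch of the resulting allocation-retry pattern (a plain
    counter stands in for the seqcount and the page allocation is faked):

      #include <stdbool.h>
      #include <stdio.h>

      static unsigned int mems_allowed_seq;   /* bumped when the mask changes */

      static unsigned int read_begin(void)   { return mems_allowed_seq; }
      static bool read_retry(unsigned int s) { return s != mems_allowed_seq; }

      static void *try_alloc(void)
      {
              static int attempts;

              if (++attempts == 1) {
                      mems_allowed_seq++;     /* the mask changed mid-allocation */
                      return NULL;            /* so this failure may be stale */
              }
              return "page";
      }

      int main(void)
      {
              void *page;
              unsigned int seq;

              do {
                      seq = read_begin();
                      page = try_alloc();
                      /* Only a failed allocation pays for the retry check; a
                       * successful one skips it entirely. */
              } while (!page && read_retry(seq));

              printf("allocated: %s\n", (char *)page);
              return 0;
      }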
    
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 8010da4957163cce03a9a27b7f3dbeb5a77a3954
Author: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Date:   Thu Aug 28 19:34:43 2014 +0100

    mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages
    
    commit 6d2be915e589b58cb11418cbe1f22ff90732b6ac upstream.
    
    Currently max_sane_readahead() returns zero on a CPU whose NUMA node
    has no local memory, which leads to readahead failure.  Fix this
    readahead failure by returning the minimum of (requested pages, 512).
    Users running applications that need readahead, such as streaming
    applications, on a memoryless CPU see a considerable boost in performance.
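    
    The shape of the fix, as described above, is just a clamp on the requested
    page count; a minimal sketch (standalone C, with the 512-page cap written
    out literally rather than in the PAGE_SIZE-independent form mentioned in
    the bracketed note further down):
    
        #include <stdio.h>
        
        /* Cap readahead at 512 pages (2MB with 4K pages) instead of scaling
         * it by node-local free memory, which is zero on a memoryless node. */
        static unsigned long max_sane_readahead(unsigned long nr)
        {
                return nr < 512 ? nr : 512;
        }
        
        int main(void)
        {
                printf("%lu\n", max_sane_readahead(32));    /* 32  */
                printf("%lu\n", max_sane_readahead(4096));  /* 512 */
                return 0;
        }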
    
    Result:
    
    fadvise experiments with FADV_WILLNEED on a PPC machine having a memoryless
    CPU, with a 1GB testfile (12 iterations), yielded around a 46.66% improvement.
    
    fadvise experiments with FADV_WILLNEED on an x240 NUMA machine (32GB * 4G
    RAM) with a 1GB testfile (12 iterations) showed no impact on the normal
    NUMA cases with the patch.
    
      Kernel       Avg  Stddev
      base      7.4975   3.92%
      patched   7.4174   3.26%
    
    [Andrew: making return value PAGE_SIZE independent]
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
    Acked-by: Jan Kara <jack@suse.cz>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit c4199ba1a1d8802a69ed9f65026bc8781a72c708
Author: David Rientjes <rientjes@google.com>
Date:   Thu Aug 28 19:34:42 2014 +0100

    mm, compaction: ignore pageblock skip when manually invoking compaction
    
    commit 91ca9186484809c57303b33778d841cc28f696ed upstream.
    
    The cached pageblock hint should be ignored when triggering compaction
    through /proc/sys/vm/compact_memory so that all eligible memory is isolated.
    Manually invoking compaction is known to be expensive and is done mainly
    for debugging, so there's no need to skip pageblocks based on heuristics.
    
    Signed-off-by: David Rientjes <rientjes@google.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit d36e7004412992303202433b25bdbb1e7c929f57
Author: David Rientjes <rientjes@google.com>
Date:   Thu Aug 28 19:34:41 2014 +0100

    mm, compaction: determine isolation mode only once
    
    commit da1c67a76f7cf2b3404823d24f9f10fa91aa5dc5 upstream.
    
    The conditions that control the isolation mode in
    isolate_migratepages_range() do not change during the iteration, so
    extract them out and only define the value once.
    
    This actually does have an effect; gcc doesn't optimize it itself because
    of cc->sync.
    
    Signed-off-by: David Rientjes <rientjes@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Acked-by: Rik van Riel <riel@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 0a1802eaa49d84dcd7052afb4c334a0c94af1832
Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date:   Thu Aug 28 19:34:40 2014 +0100

    mm/compaction: clean-up code on success of balloon isolation
    
    commit b6c750163c0d138f5041d95fcdbd1094b6928057 upstream.
    
    This is just a clean-up to reduce code size and improve readability.
    There is no functional change.
    
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit fc8dc0a9f5b062d2b5fbfe368e4d9cdff23ca76b
Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date:   Thu Aug 28 19:34:39 2014 +0100

    mm/compaction: check pageblock suitability once per pageblock
    
    commit c122b2087ab94192f2b937e47b563a9c4e688ece upstream.
    
    isolation_suitable() and migrate_async_suitable() are used to make sure
    that a pageblock range is fine to be migrated.  They don't need to be
    called on every page.  The current code does well when a pageblock is not
    suitable, but not when it is suitable.
    
    1) It re-checks isolation_suitable() on each page of a pageblock that was
       already established as suitable.
    2) It re-checks migrate_async_suitable() on each page of a pageblock that
       was not entered through the next_pageblock: label, because
       last_pageblock_nr is not otherwise updated.
    
    This patch fixes the situation by 1) calling isolation_suitable() only once
    per pageblock and 2) always updating last_pageblock_nr to the pageblock
    that was just checked.
    
    Additionally, move the PageBuddy() check after the pageblock unit check,
    since the pageblock check is the first thing we should do and this makes
    things simpler.
    
    [vbabka@suse.cz: rephrase commit description]
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit e45dcd3deabda867dbb5ceb05706d32bb4267d79
Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date:   Thu Aug 28 19:34:38 2014 +0100

    mm/compaction: change the timing to check to drop the spinlock
    
    commit be1aa03b973c7dcdc576f3503f7a60429825c35d upstream.
    
    It is odd to drop the spinlock when we scan the (SWAP_CLUSTER_MAX - 1)th
    pfn page.  This may result in the situation below while isolating
    migratepages.
    
    1. Try to isolate pfn pages 0x0 ~ 0x200.
    2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
       the spinlock.
    3. Then, to complete the isolation, retry to acquire the lock.
    
    It is better to use the SWAP_CLUSTER_MAXth pfn as the criterion for
    dropping the lock.  This does no harm for pfn 0x0, because, at that
    point, the locked variable would be false.
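    
    The shift in cadence is easy to see by printing which pfns would trigger
    the periodic unlock under each condition.  A standalone check (taking
    SWAP_CLUSTER_MAX as 32; the pfn 0x0 match of the new form is harmless, as
    noted above, because no lock is held yet at that point):
    
        #include <stdio.h>
        
        #define SWAP_CLUSTER_MAX 32UL
        
        int main(void)
        {
                unsigned long low_pfn;
        
                for (low_pfn = 0; low_pfn < 0x80; low_pfn++) {
                        /* old: fires at pfn 0x1f, 0x3f, ... one page early */
                        int old_check = ((low_pfn + 1) % SWAP_CLUSTER_MAX) == 0;
                        /* new: fires at pfn 0x0, 0x20, 0x40, ... on cluster
                         * boundaries, i.e. after a full cluster was scanned */
                        int new_check = (low_pfn % SWAP_CLUSTER_MAX) == 0;
        
                        if (old_check || new_check)
                                printf("pfn %#lx: old=%d new=%d\n",
                                       low_pfn, old_check, new_check);
                }
                return 0;
        }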
    
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 093b8ab7059129701849d3816f3d16f7b57fbfe0
Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date:   Thu Aug 28 19:34:37 2014 +0100

    mm/compaction: do not call suitable_migration_target() on every page
    
    commit 01ead5340bcf5f3a1cd2452c75516d0ef4d908d7 upstream.
    
    suitable_migration_target() checks whether a pageblock is suitable as a
    migration target.  In isolate_freepages_block(), it is called on every
    page, which is inefficient.  So make it called only once per pageblock.
    
    suitable_migration_target() also checks whether a page is high-order or
    not, but its criterion for high-order is the pageblock order.  So calling
    it once within a pageblock range causes no problem.
    
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit d9c7a696a1bec88a4c5e4a54f8ae3919957c208f
Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date:   Thu Aug 28 19:34:36 2014 +0100

    mm/compaction: disallow high-order page for migration target
    
    commit 7d348b9ea64db0a315d777ce7d4b06697f946503 upstream.
    
    The purpose of compaction is to get a high-order page.  Currently, if we
    find a high-order page while searching for a migration target page, we
    break it into order-0 pages and use them as migration targets.  That is
    contrary to the purpose of compaction, so disallow high-order pages from
    being used as migration targets.
    
    Additionally, clean up the logic in suitable_migration_target() to simplify
    the code.  There are no functional changes from this clean-up.
    
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 5939eba4f79c4d689957bd64877aef2da9310286
Author: David Rientjes <rientjes@google.com>
Date:   Thu Aug 28 19:34:35 2014 +0100

    mm, compaction: avoid isolating pinned pages
    
    commit 119d6d59dcc0980dcd581fdadb6b2033b512a473 upstream.
    
    Page migration will fail for memory that is pinned in memory with, for
    example, get_user_pages().  In this case, it is unnecessary to take
    zone->lru_lock or isolate the page and pass it to page migration, which
    will ultimately fail.
    
    This is a racy check; the page can still change from under us, but in
    that case we'll just fail later when attempting to move the page.
    
    This avoids very expensive memory compaction when faulting transparent
    hugepages after pinning a lot of memory with a Mellanox driver.
    
    On a 128GB machine, pinning ~120GB of memory, before this patch we
    see an enormous disparity in the number of page migration failures
    because of the pinning (from /proc/vmstat):
    
    	compact_pages_moved 8450
    	compact_pagemigrate_failed 15614415
    
    0.05% of pages isolated are successfully migrated and explicitly
    triggering memory compaction takes 102 seconds.  After the patch:
    
    	compact_pages_moved 9197
    	compact_pagemigrate_failed 7
    
    99.9% of pages isolated are now successfully migrated in this
    configuration and memory compaction takes less than one second.
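    
    The general pattern is a cheap, lockless pre-check that skips the
    expensive path when it is already known to fail, accepting that the check
    may race.  A standalone illustration of that pattern only (the refs vs.
    maps comparison and the struct below are illustrative, not the kernel's
    exact test):
    
        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>
        
        struct page_like {
                _Atomic int refs;       /* total references, like page_count()      */
                _Atomic int maps;       /* mapping references, like page_mapcount() */
        };
        
        /* Racy, lockless pre-check: extra references mean the page is pinned
         * and migration would fail anyway, so skip isolating it.  If the
         * counts change right after the read, we merely fall back to the old
         * behaviour of failing later in migration proper. */
        static bool worth_isolating(struct page_like *p)
        {
                return atomic_load(&p->refs) <= atomic_load(&p->maps);
        }
        
        int main(void)
        {
                struct page_like mapped = { 1, 1 };  /* only mapping references */
                struct page_like pinned = { 3, 1 };  /* 2 extra pins, e.g. from get_user_pages() */
        
                printf("mapped page: isolate=%d\n", worth_isolating(&mapped));
                printf("pinned page: isolate=%d\n", worth_isolating(&pinned));
                return 0;
        }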
    
    Signed-off-by: David Rientjes <rientjes@google.com>
    Acked-by: Hugh Dickins <hughd@google.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit c824c468fd7660207f10cd333b2a1fa188d533b0
Author: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Date:   Thu Aug 28 19:34:34 2014 +0100

    mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserve
    
    commit 943dca1a1fcbccb58de944669b833fd38a6c809b upstream.
    
    Yasuaki Ishimatsu reported that memory hot-add spent more than 5 _hours_ on
    a 9TB memory machine because onlining memory sections is too slow.  And we
    found out that setup_zone_migrate_reserve() spent >90% of that time.
    
    The problem is that setup_zone_migrate_reserve() scans all pageblocks
    unconditionally, but that is only necessary if the number of reserved
    blocks was reduced (i.e.  memory hot remove).
    
    Moreover, the maximum number of MIGRATE_RESERVE pageblocks per zone is
    currently 2.  It means that the number of reserved pageblocks is almost
    always unchanged.
    
    This patch adds zone->nr_migrate_reserve_block to maintain the number of
    MIGRATE_RESERVE pageblocks and it reduces the overhead of
    setup_zone_migrate_reserve() dramatically.  The following table shows the
    time to online a memory section.
    
      Amount of memory     | 128GB | 192GB | 256GB|
      ---------------------------------------------
      linux-3.12           |  23.9 |  31.4 | 44.5 |
      This patch           |   8.3 |   8.3 |  8.6 |
      Mel's proposal patch |  10.9 |  19.2 | 31.3 |
      ---------------------------------------------
                                       (millisecond)
    
      128GB : 4 nodes and each node has 32GB of memory
      192GB : 6 nodes and each node has 32GB of memory
      256GB : 8 nodes and each node has 32GB of memory
    
      (*1) Mel proposed his idea in the following thread:
           https://lkml.org/lkml/2013/10/30/272
    
    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
    Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
    Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 760265401f255337347c42bc8fb0b9c1c682d904
Author: Vladimir Davydov <vdavydov@parallels.com>
Date:   Thu Aug 28 19:34:33 2014 +0100

    mm: vmscan: call NUMA-unaware shrinkers irrespective of nodemask
    
    commit ec97097bca147d5718a5d2c024d1ec740b10096d upstream.
    
    If a shrinker is not NUMA-aware, shrink_slab() should call it exactly
    once with nid=0, but currently it is not true: if node 0 is not set in
    the nodemask or if it is not online, we will not call such shrinkers at
    all.  As a result some slabs will be left untouched under some
    circumstances.  Let us fix it.
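    
    A minimal sketch of the dispatch rule being fixed (standalone C with a
    toy shrinker type and a bitmask for the nodemask; the names are
    illustrative, not the kernel structures):
    
        #include <stdbool.h>
        #include <stdio.h>
        
        #define MAX_NODES 4
        
        struct shrinker_like {
                const char *name;
                bool numa_aware;
        };
        
        static void scan_node(struct shrinker_like *s, int nid)
        {
                printf("%s: scanning objects on nid=%d\n", s->name, nid);
        }
        
        /* NUMA-aware shrinkers get one call per allowed node; NUMA-unaware
         * shrinkers must get exactly one call with nid=0, even when node 0 is
         * not in the mask (or not online), otherwise their objects are never
         * reclaimed. */
        static void shrink_one(struct shrinker_like *s, unsigned nodemask)
        {
                if (!s->numa_aware) {
                        scan_node(s, 0);
                        return;
                }
                for (int nid = 0; nid < MAX_NODES; nid++)
                        if (nodemask & (1u << nid))
                                scan_node(s, nid);
        }
        
        int main(void)
        {
                struct shrinker_like per_node = { "numa-aware shrinker", true };
                struct shrinker_like global  = { "numa-unaware shrinker", false };
                unsigned mask = (1u << 2) | (1u << 3);  /* node 0 not in the mask */
        
                shrink_one(&per_node, mask);
                shrink_one(&global, mask);
                return 0;
        }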
    
    Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
    Reported-by: Dave Chinner <dchinner@redhat.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Glauber Costa <glommer@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 671133cd6bd95dd7c752655b4041f05645d3bc27
Author: Vladimir Davydov <vdavydov@parallels.com>
Date:   Thu Aug 28 19:34:32 2014 +0100

    mm: vmscan: shrink all slab objects if tight on memory
    
    commit 0b1fb40a3b1291f2f12f13f644ac95cf756a00e6 upstream.
    
    When reclaiming kmem, we currently don't scan slabs that have fewer than
    batch_size objects (see shrink_slab_node()):
    
            while (total_scan >= batch_size) {
                    shrinkctl->nr_to_scan = batch_size;
                    shrinker->scan_objects(shrinker, shrinkctl);
                    total_scan -= batch_size;
            }
    
    If there are only a few shrinkers available, such behavior won't cause
    any problems, because the batch_size is usually small, but if we have a
    lot of slab shrinkers, which is perfectly possible since FS shrinkers
    are now per-superblock, we can end up with hundreds of megabytes of
    practically unreclaimable kmem objects.  For instance, mounting a
    thousand ext2 FS images with a hundred files in each and iterating
    over all the files using du(1) will result in about 200 MB of FS caches
    that cannot be dropped even with the aid of the vm.drop_caches sysctl!
    
    This problem was initially pointed out by Glauber Costa [*].  Glauber
    proposed to fix it by making shrink_slab() always take at least one
    pass; to put it simply, turning the scan loop above into a do{}while()
    loop.  However, this proposal was rejected, because it could result in
    more aggressive and frequent slab shrinking even under low memory
    pressure, when total_scan is naturally very small.
    
    This patch is a slightly modified version of Glauber's approach.
    Similarly to Glauber's patch, it lets shrink_slab() scan fewer than
    batch_size objects, but only if the total number of objects we want to
    scan (total_scan) is greater than the total number of objects available
    (max_pass).  Since total_scan is biased down to half of max_pass if the
    current delta change is small:
    
            if (delta < max_pass / 4)
                    total_scan = min(total_scan, max_pass / 2);
    
    this is only possible if we are scanning at high priority.  That said,
    this patch shouldn't change the vmscan behaviour if the memory pressure
    is low, but if we are tight on memory, we will do our best by trying to
    reclaim all available objects, which sounds reasonable.
    
    [*] http://www.spinics.net/lists/cgroups/msg06913.html
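    
    The effect of the changed loop condition can be simulated in isolation.
    The sketch below is a userspace model of the idea only (using >= for the
    "total_scan covers everything available" test), not the kernel function:
    the old form never touches a cache smaller than batch_size, while the new
    form takes one pass once total_scan reaches max_pass.
    
        #include <stdio.h>
        
        /* Simulate one pass over a small cache.  total_scan: how much we want
         * to scan; max_pass: objects available; batch_size: normal scan unit. */
        static unsigned long scan_old(unsigned long total_scan, unsigned long batch_size)
        {
                unsigned long scanned = 0;
        
                while (total_scan >= batch_size) {
                        scanned += batch_size;
                        total_scan -= batch_size;
                }
                return scanned;
        }
        
        static unsigned long scan_new(unsigned long total_scan, unsigned long max_pass,
                                      unsigned long batch_size)
        {
                unsigned long scanned = 0;
        
                while (total_scan >= batch_size || total_scan >= max_pass) {
                        unsigned long nr = total_scan < batch_size ? total_scan : batch_size;
        
                        scanned += nr;
                        total_scan -= nr;
                }
                return scanned;
        }
        
        int main(void)
        {
                /* A small cache (64 objects) under high memory pressure, where
                 * total_scan has been driven up to everything available. */
                unsigned long max_pass = 64, batch = 128, want = 64;
        
                printf("old loop scans %lu objects\n", scan_old(want, batch));
                printf("new loop scans %lu objects\n", scan_new(want, max_pass, batch));
                return 0;
        }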
    
    Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Dave Chinner <dchinner@redhat.com>
    Cc: Glauber Costa <glommer@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit df94648c0144550a0d6205ad8260fe626ec39caa
Author: Shaohua Li <shli@kernel.org>
Date:   Thu Aug 28 19:34:31 2014 +0100

    swap: add a simple detector for inappropriate swapin readahead
    
    commit 579f82901f6f41256642936d7e632f3979ad76d4 upstream.
    
    This is a patch to improve the swap readahead algorithm.  It's from Hugh,
    and I changed it slightly.
    
    Hugh's original changelog:
    
    swapin readahead does a blind readahead, whether or not the swapin is
    sequential.  This may be ok on a harddisk, because large reads have
    relatively small costs, and if the readahead pages are unneeded they can
    be reclaimed easily - though, what if their allocation forced reclaim of
    useful pages?  But on SSD devices large reads are more expensive than
    small ones: if the readahead pages are unneeded, reading them in causes
    significant overhead.
    
    This patch adds very simplistic random read detection.  Stealing the
    PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
    vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
    simply looks at readahead's current success rate, and narrows or widens
    its readahead window accordingly.  There is little science to its
    heuristic: it's about as stupid as can be whilst remaining effective.
    
    The table below shows elapsed times (in centiseconds) when running a
    single repetitive swapping load across a 1000MB mapping in 900MB ram
    with 1GB swap (the harddisk tests had taken painfully too long when I
    used mem=500M, but SSD shows similar results for that).
    
    Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
    Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
    which Shaohua showed to be defective; HughNew this Nov 14 patch, with
    page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
    patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
    (1-page reads: no readahead).
    
    HDD for swapping to harddisk, SSD for swapping to VertexII SSD.  Seq for
    sequential access to the mapping, cycling five times around; Rand for
    the same number of random touches.  Anon for a MAP_PRIVATE anon mapping;
    Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.
    
    One weakness of Shaohua's vma/anon_vma approach was that it did not
    optimize Shmem: seen below.  Konstantin's approach was perhaps mistuned,
    50% slower on Seq: did not compete and is not shown below.
    
    HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
    Seq Anon     73921   76210   75611   76904   78191  121542
    Seq Shmem    73601   73176   73855   72947   74543  118322
    Rand Anon   895392  831243  871569  845197  846496  841680
    Rand Shmem 1058375 1053486  827935  764955  764376  756489
    
    SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
    Seq Anon     24634   24198   24673   25107   21614   70018
    Seq Shmem    24959   24932   25052   25703   22030   69678
    Rand Anon    43014   26146   28075   25989   26935   25901
    Rand Shmem   45349   45215   28249   24268   24138   24332
    
    These tests are, of course, two extremes of a very simple case: under
    heavier mixed loads I've not yet observed any consistent improvement or
    degradation, and wider testing would be welcome.
    
    Shaohua Li:
    
    Tests show Vanilla is slightly better in the sequential workload than
    Hugh's patch.  I observed that with Hugh's patch the readahead size is
    sometimes shrunk too fast (from 8 to 1 immediately) in the sequential
    workload if there is no hit.  And in such a case, continuing to do
    readahead is actually good.
    
    I didn't prepare a sophisticated algorithm for the sequential workload
    because so far we can't guarantee sequentially accessed pages are swapped
    out sequentially.  So I slightly changed Hugh's heuristic: don't shrink
    the readahead size too fast.
    
    Here is my test result (unit second, 3 runs average):
    	Vanilla		Hugh		New
    Seq	356		370		360
    Random	4525		2447		2444
    
    The attached graph is the swapin/swapout throughput I collected with
    'vmstat 2'.  The first part is running a random workload (till around
    1200 on the x-axis) and the second part is running a sequential workload.
    Swapin and swapout throughput are almost identical in steady state in
    both workloads, which is the expected behavior.  In Vanilla, by contrast,
    swapin is much bigger than swapout, especially in the random workload
    (because of wrong readahead).
    
    Original patches by: Shaohua Li and Konstantin Khlebnikov.
    
    [fengguang.wu@intel.com: swapin_nr_pages() can be static]
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Shaohua Li <shli@fusionio.com>
    Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 3d3de516335991816962c2027830f88d8a412a4a
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Thu Aug 28 19:34:30 2014 +0100

    mm: compaction: reset scanner positions immediately when they meet
    
    commit 55b7c4c99f6a448f72179297fe6432544f220063 upstream.
    
    Compaction used to start its migrate and free page scanners at the zone's
    lowest and highest pfn, respectively.  Later, caching was introduced to
    remember the scanners' progress across compaction attempts so that
    pageblocks are not re-scanned uselessly.  Additionally, pageblocks where
    isolation failed are marked to be quickly skipped when encountered again
    in future compactions.
    
    Currently, both the reset of cached pfn's and the clearing of the pageblock
    skip information for a zone are done in __reset_isolation_suitable().
    This function gets called when:
    
     - compaction is restarting after being deferred
     - compact_blockskip_flush flag is set in compact_finished() when the scanners
       meet (and not again cleared when direct compaction succeeds in allocation)
       and kswapd acts upon this flag before going to sleep
    
    This behavior is suboptimal for several reasons:
    
     - when direct sync compaction is called after async compaction fails (in the
       allocation slowpath), it will effectively do nothing, unless kswapd
       happens to process the compact_blockskip_flush flag meanwhile. This is racy
       and goes against the purpose of sync compaction to more thoroughly retry
       the compaction of a zone where async compaction has failed.
       The restart-after-deferring path cannot help here as deferring happens only
       after the sync compaction fails. It is also done only for the preferred
       zone, while the compaction might be done for a fallback zone.
    
     - the mechanism of marking pageblocks to be skipped has little value since the
       cached pfn's are reset only together with the pageblock skip flags. This
       effectively limits pageblock skip usage to parallel compactions.
    
    This patch changes compact_finished() so that cached pfn's are reset
    immediately when the scanners meet.  Clearing pageblock skip flags is
    unchanged, as well as the other situations where cached pfn's are reset.
    This allows the sync-after-async compaction to retry pageblocks not
    marked as skipped, such as !MIGRATE_MOVABLE blocks that async
    compaction now skips without marking them.
    
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Rik van Riel <riel@redhat.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit b488972c6afdf534055b9413f4da0359087db991
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Thu Aug 28 19:34:29 2014 +0100

    mm: compaction: do not mark unmovable pageblocks as skipped in async compaction
    
    commit 50b5b094e683f8e51e82c6dfe97b1608cf97e6c0 upstream.
    
    Compaction temporarily marks pageblocks where it fails to isolate pages
    as to-be-skipped in further compactions, in order to improve efficiency.
    One of the reasons to fail isolating pages is that isolation is not
    attempted in pageblocks that are not of MIGRATE_MOVABLE (or CMA) type.
    
    The problem is that blocks skipped due to not being MIGRATE_MOVABLE in
    async compaction become skipped due to the temporary mark also in future
    sync compaction.  Moreover, this may follow quite soon during
    __alloc_pages_slowpath(), without much time for kswapd to clear the
    pageblock skip marks.  This goes against the idea that sync compaction
    should try to scan these blocks more thoroughly than the async
    compaction.
    
    The fix is to ensure in async compaction that these !MIGRATE_MOVABLE
    blocks are not marked to be skipped.  Note this should not affect
    performance or locking impact of further async compactions, as skipping
    a block due to being !MIGRATE_MOVABLE is done soon after skipping a
    block marked to be skipped, both without locking.
    
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Rik van Riel <riel@redhat.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 6e3335e29177af49e8e4b8ba67452aa7afd91d7e
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Thu Aug 28 19:34:28 2014 +0100

    mm: compaction: encapsulate defer reset logic
    
    commit de6c60a6c115acaa721cfd499e028a413d1fcbf3 upstream.
    
    Currently there are several functions to manipulate the deferred
    compaction state variables.  The remaining case where the variables are
    touched directly is when a successful allocation occurs in direct
    compaction, or is expected to be successful in the future by kswapd.
    Here, the lowest order that is expected to fail is updated, and in the
    case of successful allocation, the deferred status and counter is reset
    completely.
    
    Create a new function compaction_defer_reset() to encapsulate this
    functionality and make it easier to understand the code.  No functional
    change.
    
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 280f6d65ae22f07d7a8e36b006e2bb39c2d287bb
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:34:27 2014 +0100

    mm: compaction: trace compaction begin and end
    
    commit 0eb927c0ab789d3d7d69f68acb850f69d4e7c36f upstream.
    
    The broad goal of the series is to improve allocation success rates for
    huge pages through memory compaction, while trying not to increase the
    compaction overhead.  The original objective was to reintroduce
    capturing of high-order pages freed by the compaction, before they are
    split by concurrent activity.  However, several bugs and opportunities
    for simple improvements were found in the current implementation, mostly
    through extra tracepoints (which are however too ugly for now to be
    considered for sending).
    
    The patches mostly deal with two mechanisms that reduce compaction
    overhead, which is caching the progress of migrate and free scanners,
    and marking pageblocks where isolation failed to be skipped during
    further scans.
    
    Patch 1 (from mgorman) adds tracepoints that allow calculating the time spent
            in compaction and potentially debugging scanner pfn values.
    
    Patch 2 encapsulates some of the functionality for handling deferred
            compaction for better maintainability, without a functional change.
    
    Patch 3 fixes a bug where cached scanner pfn's are sometimes reset only after
            they have been read to initialize a compaction run.
    
    Patch 4 fixes a bug where scanners meeting is sometimes not properly detected
            and can lead to multiple compaction attempts quitting early without
            doing any work.
    
    Patch 5 improves the chances of sync compaction to process pageblocks that
            async compaction has skipped due to being !MIGRATE_MOVABLE.
    
    Patch 6 improves the chances of sync direct compaction to actually do anything
            when called after async compaction fails during allocation slowpath.
    
    The impact of the patches was validated using mmtests' stress-highalloc
    benchmark on an x86_64 machine with 4GB of memory.
    
    Due to instability of the results (mostly related to the bugs fixed by
    patches 2 and 3), 10 iterations were performed, taking min,mean,max
    values for success rates and mean values for time and vmstat-based
    metrics.
    
    First, the default GFP_HIGHUSER_MOVABLE allocations were tested with the
    patches stacked on top of v3.13-rc2.  Patch 2 is OK to serve as baseline
    due to no functional changes in 1 and 2.  Comments below.
    
    stress-highalloc
                                 3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2
                                  2-nothp               3-nothp               4-nothp               5-nothp               6-nothp
    Success 1 Min          9.00 (  0.00%)       10.00 (-11.11%)       43.00 (-377.78%)       43.00 (-377.78%)       33.00 (-266.67%)
    Success 1 Mean        27.50 (  0.00%)       25.30 (  8.00%)       45.50 (-65.45%)       45.90 (-66.91%)       46.30 (-68.36%)
    Success 1 Max         36.00 (  0.00%)       36.00 (  0.00%)       47.00 (-30.56%)       48.00 (-33.33%)       52.00 (-44.44%)
    Success 2 Min         10.00 (  0.00%)        8.00 ( 20.00%)       46.00 (-360.00%)       45.00 (-350.00%)       35.00 (-250.00%)
    Success 2 Mean        26.40 (  0.00%)       23.50 ( 10.98%)       47.30 (-79.17%)       47.60 (-80.30%)       48.10 (-82.20%)
    Success 2 Max         34.00 (  0.00%)       33.00 (  2.94%)       48.00 (-41.18%)       50.00 (-47.06%)       54.00 (-58.82%)
    Success 3 Min         65.00 (  0.00%)       63.00 (  3.08%)       85.00 (-30.77%)       84.00 (-29.23%)       85.00 (-30.77%)
    Success 3 Mean        76.70 (  0.00%)       70.50 (  8.08%)       86.20 (-12.39%)       85.50 (-11.47%)       86.00 (-12.13%)
    Success 3 Max         87.00 (  0.00%)       86.00 (  1.15%)       88.00 ( -1.15%)       87.00 (  0.00%)       87.00 (  0.00%)
    
                3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
                 2-nothp     3-nothp     4-nothp     5-nothp     6-nothp
    User         6437.72     6459.76     5960.32     5974.55     6019.67
    System       1049.65     1049.09     1029.32     1031.47     1032.31
    Elapsed      1856.77     1874.48     1949.97     1994.22     1983.15
    
                                  3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
                                   2-nothp     3-nothp     4-nothp     5-nothp     6-nothp
    Minor Faults                 253952267   254581900   250030122   250507333   250157829
    Major Faults                       420         407         506         530         530
    Swap Ins                             4           9           9           6           6
    Swap Outs                          398         375         345         346         333
    Direct pages scanned            197538      189017      298574      287019      299063
    Kswapd pages scanned           1809843     1801308     1846674     1873184     1861089
    Kswapd pages reclaimed         1806972     1798684     1844219     1870509     1858622
    Direct pages reclaimed          197227      188829      298380      286822      298835
    Kswapd efficiency                  99%         99%         99%         99%         99%
    Kswapd velocity                953.382     970.449     952.243     934.569     922.286
    Direct efficiency                  99%         99%         99%         99%         99%
    Direct velocity                104.058     101.832     153.961     143.200     148.205
    Percentage direct scans             9%          9%         13%         13%         13%
    Zone normal velocity           347.289     359.676     348.063     339.933     332.983
    Zone dma32 velocity            710.151     712.605     758.140     737.835     737.507
    Zone dma velocity                0.000       0.000       0.000       0.000       0.000
    Page writes by reclaim         557.600     429.000     353.600     426.400     381.800
    Page writes file                   159          53           7          79          48
    Page writes anon                   398         375         345         346         333
    Page reclaim immediate             825         644         411         575         420
    Sector Reads                   2781750     2769780     2878547     2939128     2910483
    Sector Writes                 12080843    12083351    12012892    12002132    12010745
    Page rescued immediate               0           0           0           0           0
    Slabs scanned                  1575654     1545344     1778406     1786700     1794073
    Direct inode steals               9657       10037       15795       14104       14645
    Kswapd inode steals              46857       46335       50543       50716       51796
    Kswapd skipped wait                  0           0           0           0           0
    THP fault alloc                     97          91          81          71          77
    THP collapse alloc                 456         506         546         544         565
    THP splits                           6           5           5           4           4
    THP fault fallback                   0           1           0           0           0
    THP collapse fail                   14          14          12          13          12
    Compaction stalls                 1006         980        1537        1536        1548
    Compaction success                 303         284         562         559         578
    Compaction failures                702         696         974         976         969
    Page migrate success           1177325     1070077     3927538     3781870     3877057
    Page migrate failure                 0           0           0           0           0
    Compaction pages isolated      2547248     2306457     8301218     8008500     8200674
    Compaction migrate scanned    42290478    38832618   153961130   154143900   159141197
    Compaction free scanned       89199429    79189151   356529027   351943166   356326727
    Compaction cost                   1566        1426        5312        5156        5294
    NUMA PTE updates                     0           0           0           0           0
    NUMA hint faults                     0           0           0           0           0
    NUMA hint local faults               0           0           0           0           0
    NUMA hint local percent            100         100         100         100         100
    NUMA pages migrated                  0           0           0           0           0
    AutoNUMA cost                        0           0           0           0           0
    
    Observations:
    
    - The "Success 3" line is allocation success rate with system idle
      (phases 1 and 2 are with background interference).  I used to get stable
      values around 85% with vanilla 3.11.  The lower min and mean values came
      with 3.12.  This was bisected to commit 81c0a2bb ("mm: page_alloc: fair
      zone allocator policy") As explained in comment for patch 3, I don't
      think the commit is wrong, but that it makes the effect of compaction
      bugs worse.  From patch 3 onwards, the results are OK and match the 3.11
      results.
    
    - Patch 4 also clearly helps phases 1 and 2, and exceeds any results
      I've seen with 3.11 (I didn't measure it that thoroughly then, but it
      was never above 40%).
    
    - Compaction cost and the number of scanned pages are higher, especially due
      to patch 4.  However, keep in mind that patches 3 and 4 fix existing
      bugs in the current design of compaction overhead mitigation, they do
      not change it.  If overhead is found unacceptable, then it should be
      decreased differently (and consistently, not due to random conditions)
      than the current implementation does.  In contrast, patches 5 and 6
      (which are not strictly bug fixes) do not increase the overhead (but
      also not success rates).  This might be a limitation of the
      stress-highalloc benchmark as it's quite uniform.
    
    Another set of results is from configuring stress-highalloc to allocate
    with similar flags as THP uses:
     (GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD)
    
    stress-highalloc
                                 3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2
                                    2-thp                 3-thp                 4-thp                 5-thp                 6-thp
    Success 1 Min          2.00 (  0.00%)        7.00 (-250.00%)       18.00 (-800.00%)       19.00 (-850.00%)       26.00 (-1200.00%)
    Success 1 Mean        19.20 (  0.00%)       17.80 (  7.29%)       29.20 (-52.08%)       29.90 (-55.73%)       32.80 (-70.83%)
    Success 1 Max         27.00 (  0.00%)       29.00 ( -7.41%)       35.00 (-29.63%)       36.00 (-33.33%)       37.00 (-37.04%)
    Success 2 Min          3.00 (  0.00%)        8.00 (-166.67%)       21.00 (-600.00%)       21.00 (-600.00%)       32.00 (-966.67%)
    Success 2 Mean        19.30 (  0.00%)       17.90 (  7.25%)       32.20 (-66.84%)       32.60 (-68.91%)       35.70 (-84.97%)
    Success 2 Max         27.00 (  0.00%)       30.00 (-11.11%)       36.00 (-33.33%)       37.00 (-37.04%)       39.00 (-44.44%)
    Success 3 Min         62.00 (  0.00%)       62.00 (  0.00%)       85.00 (-37.10%)       75.00 (-20.97%)       64.00 ( -3.23%)
    Success 3 Mean        66.30 (  0.00%)       65.50 (  1.21%)       85.60 (-29.11%)       83.40 (-25.79%)       83.50 (-25.94%)
    Success 3 Max         70.00 (  0.00%)       69.00 (  1.43%)       87.00 (-24.29%)       86.00 (-22.86%)       87.00 (-24.29%)
    
                3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
                   2-thp       3-thp       4-thp       5-thp       6-thp
    User         6547.93     6475.85     6265.54     6289.46     6189.96
    System       1053.42     1047.28     1043.23     1042.73     1038.73
    Elapsed      1835.43     1821.96     1908.67     1912.74     1956.38
    
                                  3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
                                     2-thp       3-thp       4-thp       5-thp       6-thp
    Minor Faults                 256805673   253106328   253222299   249830289   251184418
    Major Faults                       395         375         423         434         448
    Swap Ins                            12          10          10          12           9
    Swap Outs                          530         537         487         455         415
    Direct pages scanned             71859       86046      153244      152764      190713
    Kswapd pages scanned           1900994     1870240     1898012     1892864     1880520
    Kswapd pages reclaimed         1897814     1867428     1894939     1890125     1877924
    Direct pages reclaimed           71766       85908      153167      152643      190600
    Kswapd efficiency                  99%         99%         99%         99%         99%
    Kswapd velocity               1029.000    1067.782    1000.091     991.049     951.218
    Direct efficiency                  99%         99%         99%         99%         99%
    Direct velocity                 38.897      49.127      80.747      79.983      96.468
    Percentage direct scans             3%          4%          7%          7%          9%
    Zone normal velocity           351.377     372.494     348.910     341.689     335.310
    Zone dma32 velocity            716.520     744.414     731.928     729.343     712.377
    Zone dma velocity                0.000       0.000       0.000       0.000       0.000
    Page writes by reclaim         669.300     604.000     545.700     538.900     429.900
    Page writes file                   138          66          58          83          14
    Page writes anon                   530         537         487         455         415
    Page reclaim immediate             806         655         772         548         517
    Sector Reads                   2711956     2703239     2811602     2818248     2839459
    Sector Writes                 12163238    12018662    12038248    11954736    11994892
    Page rescued immediate               0           0           0           0           0
    Slabs scanned                  1385088     1388364     1507968     1513292     1558656
    Direct inode steals               1739        2564        4622        5496        6007
    Kswapd inode steals              47461       46406       47804       48013       48466
    Kswapd skipped wait                  0           0           0           0           0
    THP fault alloc                    110          82          84          69          70
    THP collapse alloc                 445         482         467         462         539
    THP splits                           6           5           4           5           3
    THP fault fallback                   3           0           0           0           0
    THP collapse fail                   15          14          14          14          13
    Compaction stalls                  659         685        1033        1073        1111
    Compaction success                 222         225         410         427         456
    Compaction failures                436         460         622         646         655
    Page migrate success            446594      439978     1085640     1095062     1131716
    Page migrate failure                 0           0           0           0           0
    Compaction pages isolated      1029475     1013490     2453074     2482698     2565400
    Compaction migrate scanned     9955461    11344259    24375202    27978356    30494204
    Compaction free scanned       27715272    28544654    80150615    82898631    85756132
    Compaction cost                    552         555        1344        1379        1436
    NUMA PTE updates                     0           0           0           0           0
    NUMA hint faults                     0           0           0           0           0
    NUMA hint local faults               0           0           0           0           0
    NUMA hint local percent            100         100         100         100         100
    NUMA pages migrated                  0           0           0           0           0
    AutoNUMA cost                        0           0           0           0           0
    
    There are some differences from the previous results for THP-like allocations:
    
    - Here, the bad result for the unpatched kernel in phase 3 is much more
      consistently between 65-70% and not related to the "regression" in
      3.12.  Still there is the improvement from patch 4 onwards, which brings
      it on par with simple GFP_HIGHUSER_MOVABLE allocations.
    
    - Compaction costs have increased, but nowhere near as much as in the
      non-THP case.  Again, the patches should be worth the gained
      determinism.
    
    - Patches 5 and 6 somewhat increase the number of migrate-scanned pages.
      This is most likely due to the __GFP_NO_KSWAPD flag, which means the
      cached pfn's and pageblock skip bits are not reset by kswapd that often
      (at least in phase 3, where no concurrent activity would wake up kswapd)
      and the patches thus help the sync-after-async compaction.  It doesn't,
      however, show that the sync compaction would help so much with success
      rates, which can again be seen as a limitation of the benchmark
      scenario.
    
    This patch (of 6):
    
    Add two tracepoints for compaction begin and end of a zone.  Using this it
    is possible to calculate how much time a workload is spending within
    compaction and potentially debug problems related to cached pfns for
    scanning.  In combination with the direct reclaim and slab trace points it
    should be possible to estimate most allocation-related overhead for a
    workload.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit b33660ee981e0667fdd1c3ef3902dcb1684bf64c
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:34:26 2014 +0100

    x86/mm: Eliminate redundant page table walk during TLB range flushing
    
    commit 71b54f8263860a37dd9f50f81880a9d681fd9c10 upstream.
    
    When choosing between doing an address space or ranged flush,
    the x86 implementation of flush_tlb_mm_range takes into account
    whether there are any large pages in the range.  A per-page
    flush typically requires fewer entries than would be covered by a
    single large page, so the check is redundant.
    
    There is one potential exception.  THP migration flushes single
    THP entries and it conceivably would benefit from flushing a
    single entry instead of the mm.  However, this flush is after a
    THP allocation, copy and page table update potentially with any
    other threads serialised behind it.  In comparison to that, the
    flush is noise.  It makes more sense to optimise balancing to
    require fewer flushes than to optimise the flush itself.
    
    This patch deletes the redundant huge page check.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Tested-by: Davidlohr Bueso <davidlohr@hp.com>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: Alex Shi <alex.shi@linaro.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Link: http://lkml.kernel.org/n/tip-sgei1drpOcburujPsfh6ovmo@git.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 05f7ec4c8472b294cfec973540c9a583f66ee980
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:34:25 2014 +0100

    x86/mm: Clean up inconsistencies when flushing TLB ranges
    
    commit 15aa368255f249df0b2af630c9487bb5471bd7da upstream.
    
    NR_TLB_LOCAL_FLUSH_ALL is not always accounted for correctly and
    the comparison with total_vm is done before taking
    tlb_flushall_shift into account.  Clean it up.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Tested-by: Davidlohr Bueso <davidlohr@hp.com>
    Reviewed-by: Alex Shi <alex.shi@linaro.org>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Hugh Dickins <hughd@google.com>
    Link: http://lkml.kernel.org/n/tip-Iz5gcahrgskIldvukulzi0hh@git.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 1af7e30f9a0fdaa4e73579cf76ff14159fadb51c
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Aug 28 19:34:24 2014 +0100

    mm, x86: Account for TLB flushes only when debugging
    
    commit ec65993443736a5091b68e80ff1734548944a4b8 upstream.
    
    Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm:
    vmstats: tlb flush counters") as causing overhead problems.
    
    The counters are undeniably useful but how often do we really
    need to debug TLB flush related issues?  It does not justify
    taking the penalty everywhere, so make it a debugging option.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Tested-by: Davidlohr Bueso <davidlohr@hp.com>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Alex Shi <alex.shi@linaro.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit e961e1e65ada5c41c1f03280c9929bb8c5ac3651
Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date:   Thu Aug 28 19:34:23 2014 +0100

    mm: __rmqueue_fallback() should respect pageblock type
    
    commit 0cbef29a782162a3896487901eca4550bfa397ef upstream.
    
    When __rmqueue_fallback() doesn't find a free block of the required size,
    it splits a larger page and puts the rest of the page onto the free list.
    
    But it has one serious mistake.  When putting the remainder back,
    __rmqueue_fallback() always uses start_migratetype if the type is not CMA.
    However, __rmqueue_fallback() is only called when the start_migratetype
    queue is empty.  In other words, __rmqueue_fallback() always puts memory
    back on the wrong queue, except when try_to_steal_freepages() changed the
    pageblock type (i.e. the requested size is smaller than half of a
    pageblock).  The end result is that the anti-fragmentation framework
    increases fragmentation instead of decreasing it.
    
    Mel's original anti-fragmentation code does the right thing.  But commit
    47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.
    
    This patch restores sane and old behavior.  It also removes an incorrect
    comment which was introduced by commit fef903efcf0c ("mm/page_alloc.c:
    restructure free-page stealing code and fix a bug").
    
    Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Nazarewicz <mina86@mina86.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 0a0809a87b57b648be1dec3c311a5bfb34ea3b14
Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date:   Thu Aug 28 19:34:22 2014 +0100

    mm: get rid of unnecessary overhead of trace_mm_page_alloc_extfrag()
    
    commit 52c8f6a5aeb0bdd396849ecaa72d96f8175528f5 upstream.
    
    In general, every tracepoint should have zero overhead if it is disabled.
    However, trace_mm_page_alloc_extfrag() is an exception.  It evaluates
    "new_type == start_migratetype" even if the tracepoint is disabled.
    
    However, the code can be moved into the tracepoint's TP_fast_assign(),
    and TP_fast_assign() exists for exactly this purpose.  This patch does that.
    
    Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit cd5900b2c8e9b9a4c71541d7ec158a3b579cbb67
Author: Damien Ramonda <damien.ramonda@intel.com>
Date:   Thu Aug 28 19:34:21 2014 +0100

    readahead: fix sequential read cache miss detection
    
    commit af248a0c67457e5c6d2bcf288f07b4b2ed064f1f upstream.
    
    The kernel's readahead algorithm sometimes interprets random read
    accesses as sequential and triggers unnecessary data prefetching from
    the storage device (impacting average random read latency).
    
    In order to identify sequential cache read misses, the readahead
    algorithm intends to check whether offset - previous offset == 1
    (trivial sequential reads) or offset - previous offset == 0 (sequential
    reads not aligned on page boundary):
    
      if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)
    
    The current offset is stored in the "offset" variable of type "pgoff_t"
    (unsigned long), while the previous offset is stored in "ra->prev_pos" of
    type "loff_t" (long long).  Therefore, the operands of the if statement
    are implicitly converted to type long long.  Consequently, when the
    previous offset > current offset (which happens on random patterns), the
    if condition is true and the access is wrongly interpreted as sequential.
    Unnecessary data prefetching is triggered, impacting the average random
    read latency.
    
    Storing the previous offset value in a "pgoff_t" variable (unsigned
    long) fixes the sequential read detection logic.
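    
    The conversion problem can be reproduced in a few lines of standalone C.
    The fixed-width types below stand in for a 32-bit pgoff_t and a 64-bit
    loff_t, and the arithmetic is done directly in page units for brevity:
    
        #include <stdint.h>
        #include <stdio.h>
        
        typedef uint32_t pgoff32_t;     /* stand-in for a 32-bit pgoff_t */
        
        int main(void)
        {
                pgoff32_t offset = 100;         /* current read at page 100       */
                int64_t prev_pos_page = 5000;   /* previous read was at page 5000 */
        
                /* Mixed unsigned 32-bit / signed 64-bit arithmetic: offset is
                 * promoted to int64_t, the difference is a small negative
                 * number, and the random backward jump looks "sequential". */
                if (offset - prev_pos_page <= 1)
                        puts("buggy check: treated as sequential");
        
                /* Keeping the previous offset in the same unsigned page type
                 * makes the backward jump wrap to a huge value instead. */
                pgoff32_t prev_offset = (pgoff32_t)prev_pos_page;
                if (offset - prev_offset <= 1)
                        puts("fixed check: treated as sequential");
                else
                        puts("fixed check: treated as random");
        
                return 0;
        }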
    
    Signed-off-by: Damien Ramonda <damien.ramonda@intel.com>
    Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
    Acked-by: Pierre Tardy <pierre.tardy@intel.com>
    Acked-by: David Cohen <david.a.cohen@linux.intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 56f7f3617f64dd1530671600faf44fbe90dd2407
Author: Dan Streetman <ddstreet@ieee.org>
Date:   Thu Aug 28 19:34:20 2014 +0100

    swap: change swap_list_head to plist, add swap_avail_head
    
    commit 18ab4d4ced0817421e6db6940374cc39d28d65da upstream.
    
    Originally get_swap_page() started iterating through the singly-linked
    list of swap_info_structs using swap_list.next or highest_priority_index,
    both of which were intended to point to the highest priority active
    swap target that was not full.  The first patch in this series changed
    the singly-linked list to a doubly-linked list and removed the logic
    to start at the highest priority non-full entry; it now starts
    scanning at the highest priority entry each time, even if that entry
    is full.
    
    Replace the manually ordered swap_list_head with a plist, swap_active_head.
    Add a new plist, swap_avail_head.  The original swap_active_head plist
    contains all active swap_info_structs, as before, while the new
    swap_avail_head plist contains only swap_info_structs that are active and
    available, i.e. not full.  Add a new spinlock, swap_avail_lock, to protect
    the swap_avail_head list.
    
    Mel Gorman suggested using plists since they internally handle ordering
    the list entries based on priority, which is exactly what swap was doing
    manually.  All the ordering code is now removed, and swap_info_struct
    entries are simply added to their corresponding plist and are
    automatically ordered correctly.
    
    Using a new plist for available swap_info_structs simplifies and
    optimizes get_swap_page(), which no longer has to iterate over full
    swap_info_structs.  Using a new spinlock for swap_avail_head plist
    allows each swap_info_struct to add or remove itself from the plist
    when it becomes full or not-full; previously it could not do so
    because swap_info_struct->lock is held when the full<->not-full
    transition happens, and the swap_lock protecting the main
    swap_active_head must be taken before any swap_info_struct->lock.
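
    A simplified sketch of the two lists and the new lock (field names
    such as avail_list are illustrative and may not match the upstream
    structure exactly):

      static PLIST_HEAD(swap_active_head);     /* all active (swapon'ed) devices */
      static PLIST_HEAD(swap_avail_head);      /* active and not-full devices    */
      static DEFINE_SPINLOCK(swap_avail_lock); /* protects swap_avail_head only  */

      /* A device that just became full drops out of the avail plist;
       * only swap_avail_lock is needed, so this can be done while the
       * device's own lock is already held. */
      static void swap_device_became_full(struct swap_info_struct *si)
      {
              spin_lock(&swap_avail_lock);
              plist_del(&si->avail_list, &swap_avail_head);
              spin_unlock(&swap_avail_lock);
      }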
    
    Signed-off-by: Dan Streetman <ddstreet@ieee.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Shaohua Li <shli@fusionio.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dan Streetman <ddstreet@ieee.org>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
    Cc: Weijie Yang <weijieut@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Bob Liu <bob.liu@oracle.com>
    Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit af1f48ee7af9d57dbcb3fc19a488eae4a22f4922
Author: Dan Streetman <ddstreet@ieee.org>
Date:   Thu Aug 28 19:34:19 2014 +0100

    lib/plist: add plist_requeue
    
    commit a75f232ce0fe38bd01301899ecd97ffd0254316a upstream.
    
    Add plist_requeue(), which moves the specified plist_node after all other
    same-priority plist_nodes in the list.  This is essentially an optimized
    plist_del() followed by plist_add().
    
    This is needed by swap, which (with the next patch in this set) uses a
    plist of available swap devices.  When a swap device (either a swap
    partition or swap file) is added to the system with swapon(), the device
    is added to a plist, ordered by the swap device's priority.  When swap
    needs to allocate a page from one of the swap devices, it takes the page
    from the first swap device on the plist, which is the highest priority
    swap device.  The swap device is left in the plist until all its pages are
    used, and then removed from the plist when it becomes full.
    
    However, as described in man 2 swapon, swap must allocate pages from swap
    devices with the same priority in round-robin order; to do this, on each
    swap page allocation, swap uses a page from the first swap device in the
    plist, and then calls plist_requeue() to move that swap device entry to
    after any other same-priority swap devices.  The next swap page allocation
    will again use a page from the first swap device in the plist and requeue
    it, and so on, resulting in round-robin usage of equal-priority swap
    devices.
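
    A hedged sketch of that round-robin pattern (names, locking and the
    avail_list member are simplified assumptions, not the exact
    get_swap_page() code):

      struct swap_info_struct *si;

      spin_lock(&swap_avail_lock);
      if (!plist_head_empty(&swap_avail_head)) {
              si = plist_first_entry(&swap_avail_head,
                                     struct swap_info_struct, avail_list);
              /* Move this device behind any other device of the same
               * priority, so the next allocation picks a different one. */
              plist_requeue(&si->avail_list, &swap_avail_head);
              spin_unlock(&swap_avail_lock);
              /* ... allocate the swap page from si ... */
      } else {
              spin_unlock(&swap_avail_lock);
      }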
    
    Also add a plist_test_requeue() test function, used by plist_test() to
    exercise plist_requeue().
    
    Signed-off-by: Dan Streetman <ddstreet@ieee.org>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Shaohua Li <shli@fusionio.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dan Streetman <ddstreet@ieee.org>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
    Cc: Weijie Yang <weijieut@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Bob Liu <bob.liu@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 80e85acd070d4fce00ac5d75012e0182efbc066e
Author: Dan Streetman <ddstreet@ieee.org>
Date:   Thu Aug 28 19:34:18 2014 +0100

    lib/plist: add helper functions
    
    commit fd16618e12a05df79a3439d72d5ffdac5d34f3da upstream.
    
    Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
    define and initialize a struct plist_head.
    
    Add plist_for_each_continue() and plist_for_each_entry_continue(),
    equivalent to list_for_each_continue() and list_for_each_entry_continue(),
    to iterate over a plist continuing after the current position.
    
    Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
    and ->next, implemented by list_prev_entry() and list_next_entry(), to
    access the prev/next struct plist_node entry.  These are needed because
    unlike struct list_head, direct access of the prev/next struct plist_node
    isn't possible; the list must be navigated via the contained struct
    list_head.  e.g.  instead of accessing the prev by list_prev_entry(node,
    node_list) it can be accessed by plist_prev(node).
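
    A short usage sketch of the new helpers (the item type here is made
    up purely for illustration):

      #include <linux/plist.h>

      struct item {
              struct plist_node node;
              int value;
      };

      static PLIST_HEAD(item_list);

      /* Sum the values of the entries that follow 'cur' in the plist. */
      static int sum_after(struct item *cur)
      {
              struct item *pos = cur;
              int sum = 0;

              plist_for_each_entry_continue(pos, &item_list, node)
                      sum += pos->value;

              return sum;
      }

      /* Access the neighbouring plist_nodes without touching the
       * contained node_list directly. */
      static int neighbour_prios(struct item *cur)
      {
              struct plist_node *next = plist_next(&cur->node);
              struct plist_node *prev = plist_prev(&cur->node);

              return next->prio + prev->prio;
      }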
    
    Signed-off-by: Dan Streetman <ddstreet@ieee.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Shaohua Li <shli@fusionio.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dan Streetman <ddstreet@ieee.org>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
    Cc: Weijie Yang <weijieut@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Bob Liu <bob.liu@oracle.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 75b1f2d3ed3169675b69b2f68217ebb839414657
Author: Dan Streetman <ddstreet@ieee.org>
Date:   Thu Aug 28 19:34:17 2014 +0100

    swap: change swap_info singly-linked list to list_head
    
    commit adfab836f4908deb049a5128082719e689eed964 upstream.
    
    The logic controlling the singly-linked list of swap_info_struct entries
    for all active, i.e.  swapon'ed, swap targets is rather complex, because:
    
     - it stores the entries in priority order
     - there is a pointer to the highest priority entry
     - there is a pointer to the highest priority not-full entry
     - there is a highest_priority_index variable set outside the swap_lock
     - swap entries of equal priority should be used equally
    
    This complexity leads to bugs such as
    https://lkml.org/lkml/2014/2/13/181, where different-priority swap
    targets are incorrectly used equally.
    
    That bug probably could be solved with the existing singly-linked lists,
    but I think it would only add more complexity to the already difficult to
    understand get_swap_page() swap_list iteration logic.
    
    The first patch changes from a singly-linked list to a doubly-linked list
    using list_heads; the highest_priority_index and related code are removed
    and get_swap_page() starts each iteration at the highest priority
    swap_info entry, even if it's full.  While this does introduce unnecessary
    list iteration (i.e.  Schlemiel the painter's algorithm) in the case where
    one or more of the highest priority entries are full, the iteration and
    manipulation code is much simpler and behaves correctly re: the above bug;
    and the fourth patch removes the unnecessary iteration.
    
    The second patch adds some minor plist helper functions; nothing new
    really, just functions to match existing regular list functions.  These
    are used by the next two patches.
    
    The third patch adds plist_requeue(), which is used by get_swap_page() in
    the next patch - it performs the requeueing of same-priority entries
    (which moves the entry to the end of its priority in the plist), so that
    all equal-priority swap_info_structs get used equally.
    
    The fourth patch converts the main list into a plist, and adds a new plist
    that contains only swap_info entries that are both active and not full.
    As Mel suggested using plists allows removing all the ordering code from
    swap - plists handle ordering automatically.  The list naming is also
    clarified now that there are two lists, with the original list changed
    from swap_list_head to swap_active_head and the new list named
    swap_avail_head.  A new spinlock is also added for the new list, so
    swap_info entries can be added or removed from the new list immediately as
    they become full or not full.
    
    This patch (of 4):
    
    Replace the singly-linked list tracking active, i.e.  swapon'ed,
    swap_info_struct entries with a doubly-linked list using struct
    list_heads.  Simplify the logic iterating and manipulating the list of
    entries, especially get_swap_page(), by using standard list_head
    functions, and removing the highest priority iteration logic.
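
    A simplified sketch of how the iteration looks with a plain,
    priority-ordered list_head (member names are illustrative; flag
    checks and the actual slot allocation are elided):

      struct swap_info_struct *si;

      spin_lock(&swap_lock);
      list_for_each_entry(si, &swap_list_head, list) {
              /* Entries are kept in priority order, so the first writable
               * entry with free slots is the one to use. */
              if (!(si->flags & SWP_WRITEOK))
                      continue;
              /* ... try to allocate a swap slot from si ... */
      }
      spin_unlock(&swap_lock);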
    
    The change fixes the bug:
    https://lkml.org/lkml/2014/2/13/181
    in which different priority swap entries after the highest priority entry
    are incorrectly used equally in pairs.  The swap behavior is now as
    advertised, i.e. different priority swap entries are used in order, and
    equal priority swap targets are used concurrently.
    
    Signed-off-by: Dan Streetman <ddstreet@ieee.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Shaohua Li <shli@fusionio.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dan Streetman <ddstreet@ieee.org>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
    Cc: Weijie Yang <weijieut@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Bob Liu <bob.liu@oracle.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit f14c889d082607a242c3f960c57937f6f659fb9d
Author: Michal Hocko <mhocko@suse.cz>
Date:   Thu Aug 28 19:34:15 2014 +0100

    mm: exclude memoryless nodes from zone_reclaim
    
    commit 70ef57e6c22c3323dce179b7d0d433c479266612 upstream.
    
    We had a report about strange OOM killer strikes on a PPC machine
    although there was plenty of free swap and tons of anonymous memory
    which could have been swapped out.  In the end it turned out that the
    OOM was a side effect of zone reclaim, which wasn't unmapping and
    swapping out, and so the system was pushed to OOM.  Although this
    sounds like a bug somewhere in the kswapd vs. zone reclaim vs. direct
    reclaim interaction, numactl on the said hardware suggests that zone
    reclaim should not have been enabled in the first place:
    
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
      node 0 size: 0 MB
      node 0 free: 0 MB
      node 2 cpus:
      node 2 size: 7168 MB
      node 2 free: 6019 MB
      node distances:
      node   0   2
      0:  10  40
      2:  40  10
    
    So all the CPUs are associated with Node0, which doesn't have any
    memory, while Node2 contains all the available memory.  The node
    distances cause zone_reclaim_mode to be enabled automatically.

    Zone reclaim is intended to keep allocations local, but this doesn't
    make any sense on memoryless nodes.  So let's exclude such nodes in
    init_zone_allows_reclaim(), which evaluates the zone reclaim behaviour
    and the suitable reclaim_nodes.
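
    Roughly, the change amounts to iterating only over nodes that have
    memory (a hedged sketch, simplified from the upstream function):

      static void init_zone_allows_reclaim(int nid)
      {
              int i;

              /* Only nodes that actually have memory are considered when
               * deciding whether zone reclaim makes sense for 'nid'. */
              for_each_node_state(i, N_MEMORY)
                      if (node_distance(nid, i) <= RECLAIM_DISTANCE)
                              node_set(i, NODE_DATA(nid)->reclaim_nodes);
                      else
                              zone_reclaim_mode = 1;
      }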
    
    Signed-off-by: Michal Hocko <mhocko@suse.cz>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
    Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit a32d86748d0b0de86eddeb4f544a41caa413000c
Author: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Date:   Thu Aug 28 19:34:14 2014 +0100

    hugetlb: ensure hugepage access is denied if hugepages are not supported
    
    commit 457c1b27ed56ec472d202731b12417bff023594a upstream.
    
    Currently, I am seeing the following when I `mount -t hugetlbfs /none
    /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`.  I think it's
    related to the fact that hugetlbfs is not correctly setting itself up
    in this state:
    
      Unable to handle kernel paging request for data at address 0x00000031
      Faulting instruction address: 0xc000000000245710
      Oops: Kernel access of bad area, sig: 11 [#1]
      SMP NR_CPUS=2048 NUMA pSeries
      ....
    
    In KVM guests on Power, in a guest not backed by hugepages, we see the
    following:
    
      AnonHugePages:         0 kB
      HugePages_Total:       0
      HugePages_Free:        0
      HugePages_Rsvd:        0
      HugePages_Surp:        0
      Hugepagesize:         64 kB
    
    HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
    are not supported at boot-time, but this is only checked in
    hugetlb_init().  Extract the check to a helper function, and use it in a
    few relevant places.
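
    A hedged sketch of the helper and one of its call sites (the exact
    definition, message and error code may differ from the upstream
    patch):

      static inline bool hugepages_supported(void)
      {
              /* HPAGE_SHIFT == 0 means no hugepage size was set up at boot. */
              return HPAGE_SHIFT != 0;
      }

      static int __init init_hugetlbfs_fs(void)
      {
              if (!hugepages_supported()) {
                      pr_info("hugetlbfs: no supported hugepage sizes, disabling\n");
                      return -ENOTSUPP;
              }
              /* ... register the filesystem as before ... */
              return 0;
      }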
    
    This does make hugetlbfs not supported (not registered at all) in this
    environment.  I believe this is fine, as there are no valid hugepages
    and that won't change at runtime.
    
    [akpm@linux-foundation.org: use pr_info(), per Mel]
    [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
    Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
    Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 75858faa45de3bef7519b046f5453c2d7a3527c7
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Aug 28 19:34:13 2014 +0100

    mm: fix bad rss-counter if remap_file_pages raced migration
    
    commit 887843961c4b4681ee993c36d4997bf4b4aa8253 upstream.
    
    Fix some "Bad rss-counter state" reports on exit, arising from the
    interaction between page migration and remap_file_pages(): zap_pte()
    must count a migration entry when zapping it.
    
    And yes, it is possible (though very unusual) to find an anon page or
    swap entry in a VM_SHARED nonlinear mapping: coming from that horrid
    get_user_pages(write, force) case which COWs even in a shared mapping.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Tested-by: Sasha Levin <sasha.levin@oracle.com>
    Tested-by: Dave Jones <davej@redhat.com>
    Cc: Cyrill Gorcunov <gorcunov@gmail.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 73210d0adac4c6f283eb825e85febdc12dbd3e36
Author: Han Pingtian <hanpt@linux.vnet.ibm.com>
Date:   Thu Aug 28 19:34:12 2014 +0100

    mm: prevent setting of a value less than 0 to min_free_kbytes
    
    commit da8c757b080ee84f219fa2368cb5dd23ac304fc0 upstream.
    
    If echo -1 > /proc/sys/vm/min_free_kbytes, the system will hang.
    Changing proc_dointvec() to proc_dointvec_minmax() in
    min_free_kbytes_sysctl_handler() prevents this from happening.
    
    mhocko said:
    
    : You can still do echo $BIG_VALUE > /proc/sys/vm/min_free_kbytes and make
    : your machine unusable but I agree that proc_dointvec_minmax is more
    : suitable here as we already have:
    :
    : 	.proc_handler   = min_free_kbytes_sysctl_handler,
    : 	.extra1         = &zero,
    :
    : It used to work properly but then 6fce56ec91b5 ("sysctl: Remove references
    : to ctl_name and strategy from the generic sysctl table") has removed
    : sysctl_intvec strategy and so extra1 is ignored.
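
    A minimal sketch of the handler after the change (simplified; the
    real handler may do a little more):

      int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
                                         void __user *buffer, size_t *length,
                                         loff_t *ppos)
      {
              int rc;

              /* proc_dointvec_minmax() honours table->extra1 (&zero), so a
               * negative value is rejected instead of being applied. */
              rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
              if (rc)
                      return rc;

              if (write)
                      setup_per_zone_wmarks();

              return 0;
      }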
    
    Signed-off-by: Han Pingtian <hanpt@linux.vnet.ibm.com>
    Acked-by: Michal Hocko <mhocko@suse.cz>
    Acked-by: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit 3ddc614ca27c84a91d3b51ab57e50c68fdf4889b
Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date:   Thu Aug 28 19:34:11 2014 +0100

    slab: correct pfmemalloc check
    
    commit 73293c2f900d0adbb6a415b312cd57976d5ae242 upstream.
    
    We check pfmemalloc per slab, not per page; you can see this in
    is_slab_pfmemalloc().  So the other pages of a slab don't need the
    pfmemalloc flag set or cleared.

    Therefore we should check the pfmemalloc page flag of the slab's first
    page, but the current implementation doesn't do that.
    virt_to_head_page(obj) just returns the 'struct page' of that object,
    not the slab's first page, since SLAB doesn't use __GFP_COMP when
    CONFIG_MMU is enabled.  To get the 'struct page' of the first page, we
    first get the slab and look it up via virt_to_head_page(slab->s_mem).
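
    A hedged sketch of the corrected lookup (the helper name is made up;
    the upstream change folds this into several existing SLAB paths):

      /* Check the PfMemalloc flag on the slab's first page, not on the
       * page that happens to back this particular object. */
      static bool obj_is_pfmemalloc(void *objp)
      {
              struct slab *slabp = virt_to_slab(objp);
              struct page *firstpage = virt_to_head_page(slabp->s_mem);

              return PageSlabPfmemalloc(firstpage);
      }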
    
    Acked-by: Andi Kleen <ak@linux.intel.com>
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Pekka Enberg <penberg@iki.fi>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit fc2dd02e0959cb07fd49ab57662ad755d5b3fed2
Author: Bob Liu <lliubbo@gmail.com>
Date:   Thu Aug 28 19:34:10 2014 +0100

    mm: thp: khugepaged: add policy for finding target node
    
    commit 9f1b868a13ac36bd207a571f5ea1193d823ab18d upstream.
    
    Khugepaged will scan/free HPAGE_PMD_NR normal pages and replace them
    with a hugepage allocated from the node of the first scanned normal
    page, but this policy is too rough and may end up with unexpected
    results for its users.

    The problem is that the original page balancing among all nodes will
    be broken after khugepaged starts.  Consider the case where the first
    scanned normal page is allocated from node A while most of the other
    scanned normal pages are allocated from node B or C.  khugepaged will
    then always allocate the hugepage from node A, which causes extra
    memory pressure on node A, a situation that did not exist before
    khugepaged started.
    
    This patch tries to fix the problem by making khugepaged allocate the
    hugepage from the node with the maximum count of scanned normal pages,
    so that the effect on the original page balancing is minimized.

    The other problem is that if the scanned normal pages are allocated
    equally from nodes A, B and C, node A will still suffer extra memory
    pressure after khugepaged starts.
    
    Andrew Davidoff reported a related issue several days ago.  He wanted
    his application to interleave among all nodes and used "numactl
    --interleave=all ./test" to run the testcase, but the result wasn't as
    expected.
    
      cat /proc/2814/numa_maps:
      7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435 N3=50098
    
    The end result showed that most pages came from Node3 instead of being
    interleaved among nodes 0-3, which was unreasonable.
    
    This patch also fixes that issue by allocating the hugepage round-robin
    from all nodes that have the same count; after this patch the result
    was as expected:
    
      7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235 N3=12722
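
    A hedged sketch of the node selection (close in spirit to the
    khugepaged_find_target_node() mentioned in the note below; the
    per-node hit counter array name is illustrative):

      static int khugepaged_node_load[MAX_NUMNODES]; /* hits per node this scan */

      static int khugepaged_find_target_node(void)
      {
              static int last_khugepaged_target_node = NUMA_NO_NODE;
              int nid, target_node = 0, max_value = 0;

              /* Find the first node with the maximum number of hits. */
              for (nid = 0; nid < MAX_NUMNODES; nid++)
                      if (khugepaged_node_load[nid] > max_value) {
                              max_value = khugepaged_node_load[nid];
                              target_node = nid;
                      }

              /* If several nodes share that maximum, rotate among them so
               * interleaved workloads keep their balance. */
              if (target_node <= last_khugepaged_target_node)
                      for (nid = last_khugepaged_target_node + 1;
                           nid < MAX_NUMNODES; nid++)
                              if (khugepaged_node_load[nid] == max_value) {
                                      target_node = nid;
                                      break;
                              }

              last_khugepaged_target_node = target_node;
              return target_node;
      }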
    
    The simple testcase is like this:
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    
    int main(void)
    {
            char *p;
            int i;
            int j;
    
            for (i = 0; i < 200; i++) {
                    p = malloc(1048576);            /* 1 MB per iteration */
                    printf("malloc done\n");
    
                    if (p == NULL) {
                            printf("Out of memory\n");
                            return 1;
                    }
                    for (j = 0; j < 1048576; j++)
                            p[j] = 'A';             /* touch every page */
                    printf("touched memory\n");
    
                    sleep(1);
            }
            printf("enter sleep\n");
            while (1)
                    sleep(100);
    }
    
    [akpm@linux-foundation.org: make last_khugepaged_target_node local to khugepaged_find_target_node()]
    Reported-by: Andrew Davidoff <davidoff@qedmf.net>
    Tested-by: Andrew Davidoff <davidoff@qedmf.net>
    Signed-off-by: Bob Liu <bob.liu@oracle.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
    Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>

commit c3bd31a1700393488d631dce29e38a71bf1c413f
Author: Bob Liu <lliubbo@gmail.com>
Date:   Thu Aug 28 19:34:09 2014 +0100

    mm: thp: cleanup: mv alloc_hugepage to better place
    
    commit 10dc4155c7714f508fe2e4667164925ea971fb25 upstream.
    
    Move alloc_hugepage() to a better place; there is no need for a
    separate #ifndef CONFIG_NUMA.
    
    Signed-off-by: Bob Liu <bob.liu@oracle.com>
    Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Cc: Andrew Davidoff <davidoff@qedmf.net>
    Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>