commit 89161fe91f2fd1049bcc38f5d4b814acab7b83f5
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Thu Oct 9 12:21:39 2014 -0700

    Linux 3.14.21

commit c56af023a5fe80252b4dd555926eeaa294d9112e
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Mon Apr 28 14:24:09 2014 -0700

    mm: don't pointlessly use BUG_ON() for sanity check
    
    commit 50f5aa8a9b248fa4262cf379863ec9a531b49737 upstream.
    
    BUG_ON() is a big hammer, and should be used _only_ if there is some
    major corruption that you cannot possibly recover from, making it
    imperative that the current process (and possibly the whole machine) be
    terminated with extreme prejudice.
    
    The trivial sanity check in the vmacache code is *not* such a fatal
    error.  Recovering from it is absolutely trivial, and using BUG_ON()
    just makes it harder to debug for no actual advantage.
    
    To make matters worse, the placement of the BUG_ON() (only if the range
    check matched) actually makes it harder to hit the sanity check to begin
    with, so _if_ there is a bug (and we just got a report from Srivatsa
    Bhat that this can indeed trigger), it is harder to debug not just
    because the machine is possibly dead, but because we don't have better
    coverage.
    
    BUG_ON() must *die*.  Maybe we should add a checkpatch warning for it,
    because it is simply just about the worst thing you can ever do if you
    hit some "this cannot happen" situation.
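
    For illustration only (user-space stand-ins with made-up names; not the
    vmacache code), here is the difference between the two approaches to a
    failed sanity check:
    
    #include <assert.h>
    #include <stdio.h>
    
    /* Crude user-space stand-in for BUG_ON(): take the whole process down. */
    #define BUG_ON(cond) assert(!(cond))
    
    struct vma { unsigned long start, end; };
    
    static struct vma the_vma = { 0x1000, 0x5000 };
    static struct vma *cached_vma = &the_vma;
    static int cache_generation = 1, current_generation = 2; /* deliberately stale */
    
    struct vma *lookup_fatal(void)
    {
            BUG_ON(cache_generation != current_generation); /* "cannot happen" */
            return cached_vma;
    }
    
    struct vma *lookup_recoverable(void)
    {
            if (cache_generation != current_generation)
                    return NULL;    /* trivially recoverable: just a cache miss */
            return cached_vma;
    }
    
    int main(void)
    {
            printf("recoverable lookup: %p (caller falls back to the slow path)\n",
                   (void *)lookup_recoverable());
            /* lookup_fatal() would abort() the whole program here instead. */
            return 0;
    }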
    
    Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Davidlohr Bueso <davidlohr@hp.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit efb5fea23009a0223996e699b54cc9533e2070e9
Author: Davidlohr Bueso <davidlohr@hp.com>
Date:   Mon Apr 7 15:37:25 2014 -0700

    mm: per-thread vma caching
    
    commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.
    
    This patch is a continuation of efforts to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random,
    so further comparison with other approaches was needed.  There are
    two things to consider when dealing with this: the cache hit rate and
    the latency of find_vma().  Improving the hit rate does not necessarily
    translate into finding the vma any faster, as the overhead of any fancy
    caching scheme can be too high to be worthwhile.
    
    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality.  On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.
    
    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number.  The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed.  Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question.  Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:
    
    1) System bootup: Most programs are single threaded, so the per-thread
       scheme improves on the ~50% baseline hit rate just by adding a few
       more slots to the cache.
    
    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 50.61%   | 19.90            |
    | patched        | 73.45%   | 13.58            |
    +----------------+----------+------------------+
    
    2) Kernel build: This one is already pretty good with the current
       approach as we're dealing with good locality.
    
    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 75.28%   | 11.03            |
    | patched        | 88.09%   | 9.31             |
    +----------------+----------+------------------+
    
    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
    
    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 70.66%   | 17.14            |
    | patched        | 91.15%   | 12.57            |
    +----------------+----------+------------------+
    
    4) Ebizzy: There's a fair amount of variation from run to run, but this
       approach always shows nearly perfect hit rates, while the baseline's
       are just about non-existent.  The cycle counts fluctuate anywhere
       from ~60 to ~116 for the baseline scheme, but this approach reduces
       them considerably.  For instance, with 80 threads:
    
    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 1.06%    | 91.54            |
    | patched        | 99.97%   | 14.18            |
    +----------------+----------+------------------+
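
    For illustration only, here is a minimal user-space sketch of the scheme
    described above: a few per-thread slots indexed by the page number,
    invalidated in O(1) by bumping a shared sequence number.  The names and
    sizes are made up for the example and concurrency is ignored; this is
    not the kernel code.
    
    #include <stdint.h>
    #include <stdio.h>
    
    #define CACHE_SLOTS 4           /* a handful of slots per thread */
    #define PAGE_SHIFT  12
    
    struct vma { unsigned long start, end; };
    
    static uint32_t mm_seqnum;      /* bumped on every "address space" change */
    
    static __thread struct {
            uint32_t seqnum;        /* mm_seqnum value the cache was filled under */
            struct vma *slot[CACHE_SLOTS];
    } cache;
    
    static void cache_invalidate(void)
    {
            mm_seqnum++;            /* O(1): every thread's cache goes stale */
    }
    
    static struct vma *cache_find(unsigned long addr)
    {
            unsigned idx = (addr >> PAGE_SHIFT) % CACHE_SLOTS;
            struct vma *v;
    
            if (cache.seqnum != mm_seqnum)
                    return NULL;    /* stale: treat it as a miss */
            v = cache.slot[idx];
            if (v && addr >= v->start && addr < v->end)
                    return v;
            return NULL;
    }
    
    static void cache_update(unsigned long addr, struct vma *v)
    {
            unsigned idx = (addr >> PAGE_SHIFT) % CACHE_SLOTS;
    
            if (cache.seqnum != mm_seqnum) {
                    for (unsigned i = 0; i < CACHE_SLOTS; i++)
                            cache.slot[i] = NULL;
                    cache.seqnum = mm_seqnum;
            }
            cache.slot[idx] = v;    /* replacement keyed on the page number */
    }
    
    int main(void)
    {
            struct vma v = { 0x1000, 0x5000 };
    
            cache_update(0x1234, &v);
            printf("hit:  %p\n", (void *)cache_find(0x1234));
            cache_invalidate();     /* e.g. after an munmap() */
            printf("miss: %p\n", (void *)cache_find(0x1234));
            return 0;
    }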
    
    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reviewed-by: Michel Lespinasse <walken@google.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Tested-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 264a8ae739402ab7e14d1794fd4bbce0e339d415
Author: Christoph Lameter <cl@linux.com>
Date:   Fri Apr 18 15:07:10 2014 -0700

    vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state()
    
    commit 83da7510058736c09a14b9c17ec7d851940a4332 upstream.
    
    reclaim_clean_pages_from_list() seems to be called with preemption
    enabled.  It must therefore use mod_zone_page_state() instead of the
    variant that assumes preemption is disabled.
    
    Signed-off-by: Christoph Lameter <cl@linux.com>
    Reported-by: Grygorii Strashko <grygorii.strashko@ti.com>
    Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 8e524793fdfb061fc31936d9adaa026d7bdc916c
Author: Vladimir Davydov <vdavydov@parallels.com>
Date:   Thu Apr 3 14:47:32 2014 -0700

    mm: vmscan: shrink_slab: rename max_pass -> freeable
    
    commit d5bc5fd3fcb7b8dfb431694a8c8052466504c10c upstream.
    
    The name `max_pass' is misleading, because this variable actually keeps
    the estimated number of freeable objects, not the maximal number of
    objects we can scan in this pass, which can be twice that.  Rename it to
    reflect its actual meaning.
    
    Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 12f2f0bab442ecdff45d67ff1c3d0b21f46794bd
Author: Vladimir Davydov <vdavydov@parallels.com>
Date:   Thu Apr 3 14:47:19 2014 -0700

    mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim
    
    commit 99120b772b52853f9a2b829a21dd44d9b20558f1 upstream.
    
    When direct reclaim is executed by a process bound to a set of NUMA
    nodes, we should scan only those nodes when possible, but currently we
    will scan kmem from all online nodes even if the kmem shrinker is NUMA
    aware.  In other words, binding a process to a particular NUMA node won't
    prevent it from shrinking inode/dentry caches on other nodes, which is
    not good.  Fix this.
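
    Schematically, the intended behaviour looks like this (a user-space
    sketch with a plain bitmask standing in for the NUMA policy nodemask and
    made-up helpers; not the shrinker code):
    
    #include <stdio.h>
    
    #define MAX_NODES 8
    
    static void shrink_node(int nid)
    {
            printf("shrinking caches on node %d\n", nid);
    }
    
    static void shrink_slab(unsigned long allowed_nodes)
    {
            for (int nid = 0; nid < MAX_NODES; nid++) {
                    if (!(allowed_nodes & (1UL << nid)))
                            continue;       /* respect the caller's node mask */
                    shrink_node(nid);
            }
    }
    
    int main(void)
    {
            /* A process bound to nodes 0 and 2: only those get scanned. */
            shrink_slab((1UL << 0) | (1UL << 2));
            return 0;
    }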
    
    Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Dave Chinner <dchinner@redhat.com>
    Cc: Glauber Costa <glommer@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 6591981a30a8c752b1c537211f7fe4b9530fea41
Author: Jens Axboe <axboe@fb.com>
Date:   Thu May 22 11:54:16 2014 -0700

    mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT
    
    commit 7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d upstream.
    
    In some testing I ran today (some fio jobs that spread over two nodes),
    we end up spending 40% of the time in filemap_check_errors().  That
    smells fishy.  Looking further, this is basically what happens:
    
    blkdev_aio_read()
        generic_file_aio_read()
            filemap_write_and_wait_range()
                if (!mapping->nr_pages)
                    filemap_check_errors()
    
    and filemap_check_errors() always attempts two test_and_clear_bit()
    operations on the mapping flags, thus dirtying the flags word on every
    single invocation.  The patch below tests each of these bits before
    clearing them, avoiding this issue.  In my test case (4-socket box),
    performance went from 1.7M IOPS to 4.0M IOPS.
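
    For illustration, a minimal stand-alone sketch of that pattern (GCC
    atomic builtins and made-up flag values standing in for the kernel
    primitives; not the filemap code): the atomic read-modify-write, which
    dirties the word, is only issued when the bit is actually set.
    
    #include <stdio.h>
    
    #define AS_ENOSPC 0             /* illustrative values, not the kernel's */
    #define AS_EIO    1
    
    static unsigned long mapping_flags;
    
    static int check_and_clear(int bit)
    {
            if (!(mapping_flags & (1UL << bit)))
                    return 0;       /* cheap, non-dirtying fast path */
            return (__atomic_fetch_and(&mapping_flags, ~(1UL << bit),
                                       __ATOMIC_SEQ_CST) & (1UL << bit)) ? 1 : 0;
    }
    
    int main(void)
    {
            mapping_flags |= 1UL << AS_EIO;
            printf("EIO?    %d\n", check_and_clear(AS_EIO));    /* 1, clears it */
            printf("again?  %d\n", check_and_clear(AS_EIO));    /* 0, no write */
            printf("ENOSPC? %d\n", check_and_clear(AS_ENOSPC)); /* 0, no write */
            return 0;
    }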
    
    Signed-off-by: Jens Axboe <axboe@fb.com>
    Acked-by: Jeff Moyer <jmoyer@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 29c2a88157819d1e68ffea8b7d80117b332c8efe
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Apr 3 14:47:24 2014 -0700

    mm: optimize put_mems_allowed() usage
    
    commit d26914d11751b23ca2e8747725f2cae10c2f2c1b upstream.
    
    Since put_mems_allowed() is strictly optional (it's a seqcount retry), we
    don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb(), some loads and comparisons on some
    relatively fast paths.
    
    Since the get/put_mems_allowed() naming does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.
    
    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
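
    As a rough user-space model of the renamed interface (a simplified
    seqcount with made-up helpers; not the kernel implementation), the point
    is that the retry check is only evaluated when the allocation failed:
    
    #include <stdio.h>
    
    static unsigned int mems_allowed_seq;   /* bumped around mask updates */
    static unsigned long mems_allowed = 0x1;
    static char fake_page;                  /* stand-in for an allocated page */
    
    static unsigned int read_begin(void)
    {
            unsigned int seq;
    
            do {
                    seq = __atomic_load_n(&mems_allowed_seq, __ATOMIC_ACQUIRE);
            } while (seq & 1);      /* odd: an update is in progress */
            return seq;
    }
    
    static int read_retry(unsigned int seq)
    {
            /* Note the inverted sense: non-zero means "retry needed". */
            return __atomic_load_n(&mems_allowed_seq, __ATOMIC_ACQUIRE) != seq;
    }
    
    static void *try_alloc(unsigned long allowed)
    {
            return (allowed & 0x1) ? &fake_page : NULL;
    }
    
    int main(void)
    {
            unsigned int seq;
            void *page;
    
            do {
                    seq = read_begin();
                    page = try_alloc(mems_allowed);
                    /* The retry check (and its barrier) is only paid on failure. */
            } while (!page && read_retry(seq));
    
            printf("allocated: %p\n", page);
            return 0;
    }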
    
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit d4995db1ea96e5f357b92469c9b6c3ecc6bdfbaa
Author: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Date:   Thu Apr 3 14:48:23 2014 -0700

    mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages
    
    commit 6d2be915e589b58cb11418cbe1f22ff90732b6ac upstream.
    
    Currently max_sane_readahead() returns zero on a cpu whose NUMA node
    has no local memory, which leads to readahead failure.  Fix this
    readahead failure by returning the minimum of (requested pages, 512).
    Users running applications that need readahead, such as streaming
    applications, on a memoryless cpu see a considerable boost in
    performance.
    
    Result:
    
    fadvise experiment with FADV_WILLNEED on a PPC machine having memoryless
    CPU with 1GB testfile (12 iterations) yielded around 46.66% improvement.
    
    An fadvise experiment with FADV_WILLNEED on an x240 NUMA machine
    (32GB * 4G RAM) with a 1GB testfile (12 iterations) showed no impact on
    the normal NUMA cases with the patch.
    
      Kernel       Avg  Stddev
      base      7.4975   3.92%
      patched   7.4174   3.26%
    
    [Andrew: making return value PAGE_SIZE independent]
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
    Acked-by: Jan Kara <jack@suse.cz>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 67c58ce60024b9b3d9cf203da994bf265aa7e369
Author: David Rientjes <rientjes@google.com>
Date:   Thu Apr 3 14:47:23 2014 -0700

    mm, compaction: ignore pageblock skip when manually invoking compaction
    
    commit 91ca9186484809c57303b33778d841cc28f696ed upstream.
    
    The cached pageblock hint should be ignored when triggering compaction
    through /proc/sys/vm/compact_memory so that all eligible memory is
    isolated.  Manually invoking compaction (mainly done for debugging) is
    already known to be expensive, so there's no need to skip pageblocks
    based on heuristics.
    
    Signed-off-by: David Rientjes <rientjes@google.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit e8cd5b562a4bfa351d07a5e4c5758b1afe6a4c0b
Author: David Rientjes <rientjes@google.com>
Date:   Mon Apr 7 15:37:34 2014 -0700

    mm, compaction: determine isolation mode only once
    
    commit da1c67a76f7cf2b3404823d24f9f10fa91aa5dc5 upstream.
    
    The conditions that control the isolation mode in
    isolate_migratepages_range() do not change during the iteration, so
    extract them out and only define the value once.
    
    This actually does have an effect: gcc doesn't perform this optimization
    itself because of cc->sync.
    
    Signed-off-by: David Rientjes <rientjes@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Acked-by: Rik van Riel <riel@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit ca82ea2e650adab8d03ffdca6f1cccd3c0bd40ec
Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date:   Mon Apr 7 15:37:07 2014 -0700

    mm/compaction: clean-up code on success of balloon isolation
    
    commit b6c750163c0d138f5041d95fcdbd1094b6928057 upstream.
    
    It is just for clean-up to reduce code size and improve readability.
    There is no functional change.
    
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 6128cc0567321c5548e1a3cf1467eb3d682e21f6
Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date:   Mon Apr 7 15:37:06 2014 -0700

    mm/compaction: check pageblock suitability once per pageblock
    
    commit c122b2087ab94192f2b937e47b563a9c4e688ece upstream.
    
    isolation_suitable() and migrate_async_suitable() are used to be sure
    that this pageblock range is fine to be migrated.  They don't need to be
    called on every page.  The current code does well when the range is not
    suitable, but not when it is suitable:
    
    1) It re-checks isolation_suitable() on each page of a pageblock that was
       already established as suitable.
    2) It re-checks migrate_async_suitable() on each page of a pageblock that
       was not entered through the next_pageblock: label, because
       last_pageblock_nr is not otherwise updated.
    
    This patch fixes the situation by 1) calling isolation_suitable() only
    once per pageblock and 2) always updating last_pageblock_nr to the
    pageblock that was just checked.
    
    Additionally, move the PageBuddy() check after the pageblock unit check,
    since the pageblock check is the first thing we should do and it makes
    things simpler.
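
    A stand-alone sketch of the intended pattern (made-up helpers and an
    illustrative pageblock_order of 9; not the compaction code): remember the
    last pageblock checked and consult the suitability check only when the
    scan crosses into a new pageblock.
    
    #include <stdbool.h>
    #include <stdio.h>
    
    #define pageblock_order   9
    #define pageblock_nr(pfn) ((pfn) >> pageblock_order)
    
    static int suitability_checks;
    
    /* Stand-in for isolation_suitable()/migrate_async_suitable(). */
    static bool pageblock_suitable(unsigned long pfn)
    {
            suitability_checks++;
            return (pageblock_nr(pfn) % 2) == 0;    /* every other block, say */
    }
    
    static void scan(unsigned long start_pfn, unsigned long end_pfn)
    {
            unsigned long last_pageblock_nr = (unsigned long)-1;
            bool suitable = false;
    
            for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn++) {
                    if (pageblock_nr(pfn) != last_pageblock_nr) {
                            /* Check once per pageblock, always remember it. */
                            last_pageblock_nr = pageblock_nr(pfn);
                            suitable = pageblock_suitable(pfn);
                    }
                    if (!suitable)
                            continue;
                    /* ... isolate the page at pfn ... */
            }
    }
    
    int main(void)
    {
            scan(0, 4UL << pageblock_order);        /* 4 pageblocks of pages */
            printf("suitability checked %d times (not once per page)\n",
                   suitability_checks);
            return 0;
    }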
    
    [vbabka@suse.cz: rephrase commit description]
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 4fff5ca78029f4df452334ecf013e53bf29079cc
Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date:   Mon Apr 7 15:37:05 2014 -0700

    mm/compaction: change the timing to check to drop the spinlock
    
    commit be1aa03b973c7dcdc576f3503f7a60429825c35d upstream.
    
    It is odd to drop the spinlock when we scan the (SWAP_CLUSTER_MAX - 1)th
    pfn page.  This may result in the situation below while isolating
    migratepages.
    
    1. Try to isolate the 0x0 ~ 0x200 pfn pages.
    2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
       the spinlock.
    3. Then, to complete the isolation, retry to acquire the lock.
    
    I think that it is better to use the SWAP_CLUSTER_MAX'th pfn when checking
    the criteria for dropping the lock.  This does no harm for pfn 0x0,
    because at that point the locked variable would be false.
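
    A minimal stand-alone sketch of the adjusted pattern (a pthread mutex
    standing in for the kernel spinlock, names made up; not the compaction
    code): drop and re-take the lock once per SWAP_CLUSTER_MAX pages, keyed
    on pfn rather than pfn + 1, so the very first iteration is a no-op while
    locked is still false.
    
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    
    #define SWAP_CLUSTER_MAX 32UL
    
    static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;
    
    static void scan(unsigned long start_pfn, unsigned long end_pfn)
    {
            bool locked = false;
            unsigned long pfn;
    
            for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                    /* Give others a chance at the lock once per cluster. */
                    if (!(pfn % SWAP_CLUSTER_MAX) && locked) {
                            pthread_mutex_unlock(&lru_lock);
                            locked = false;
                    }
                    if (!locked) {
                            pthread_mutex_lock(&lru_lock);
                            locked = true;
                    }
                    /* ... isolate the page at pfn under the lock ... */
            }
    
            if (locked)
                    pthread_mutex_unlock(&lru_lock);
            printf("scanned %lu pages\n", end_pfn - start_pfn);
    }
    
    int main(void)
    {
            scan(0x0, 0x200);
            return 0;
    }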
    
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit cc85d67f4fdc4fb89c74ff327d2bbe7803951a0f
Author: Lars Ellenberg <lars.ellenberg@linbit.com>
Date:   Wed Jul 9 21:18:32 2014 +0200

    drbd: fix regression 'out of mem, failed to invoke fence-peer helper'
    
    commit bbc1c5e8ad6dfebf9d13b8a4ccdf66c92913eac9 upstream.
    
    Since Linux kernel 3.13, kthread_run() internally uses
    wait_for_completion_killable().  We may sometimes call kthread_run()
    while we still have a signal pending which we used to kick our threads
    out of potentially blocking network functions, causing kthread_run() to
    mistake it for a new fatal signal and fail.
    
    Fix: flush_signals() before kthread_run().
    
    Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
    Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit e292d9ad60b820e49a6825a501461df7f527b8d8
Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date:   Mon Apr 7 15:37:04 2014 -0700

    mm/compaction: do not call suitable_migration_target() on every page
    
    commit 01ead5340bcf5f3a1cd2452c75516d0ef4d908d7 upstream.
    
    suitable_migration_target() checks whether a pageblock is suitable as a
    migration target.  In isolate_freepages_block(), it is called on every
    page, which is inefficient.  So call it only once per pageblock.
    
    suitable_migration_target() also checks whether the page is high-order or
    not, but its criterion for high-order is the pageblock order.  So calling
    it once per pageblock range is not a problem.
    
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 96b3fde44d498edb0b69f38ee09f3fcc0060d214
Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date:   Mon Apr 7 15:37:03 2014 -0700

    mm/compaction: disallow high-order page for migration target
    
    commit 7d348b9ea64db0a315d777ce7d4b06697f946503 upstream.
    
    The purpose of compaction is to get a high-order page.  Currently, if we
    find a high-order page while searching for a migration target page, we
    break it into order-0 pages and use them as migration targets.  That is
    contrary to the purpose of compaction, so disallow high-order pages from
    being used as migration targets.
    
    Additionally, clean up the logic in suitable_migration_target() to
    simplify the code.  There are no functional changes from this clean-up.
    
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 1190e5c69c4329c8b8d220e8c149906af1e787c8
Author: David Rientjes <rientjes@google.com>
Date:   Thu Apr 3 14:48:00 2014 -0700

    mm, compaction: avoid isolating pinned pages
    
    commit 119d6d59dcc0980dcd581fdadb6b2033b512a473 upstream.
    
    Page migration will fail for memory that is pinned in memory with, for
    example, get_user_pages().  In this case, it is unnecessary to take
    zone->lru_lock or isolate the page and pass it to page migration, which
    will ultimately fail.
    
    This is a racy check; the page can still change from under us, but in
    that case we'll just fail later when attempting to move the page.
    
    This avoids very expensive memory compaction when faulting transparent
    hugepages after pinning a lot of memory with a Mellanox driver.
    
    On a 128GB machine and pinning ~120GB of memory, before this patch we
    see the enormous disparity in the number of page migration failures
    because of the pinning (from /proc/vmstat):
    
    	compact_pages_moved 8450
    	compact_pagemigrate_failed 15614415
    
    0.05% of pages isolated are successfully migrated and explicitly
    triggering memory compaction takes 102 seconds.  After the patch:
    
    	compact_pages_moved 9197
    	compact_pagemigrate_failed 7
    
    99.9% of pages isolated are now successfully migrated in this
    configuration and memory compaction takes less than one second.
    
    Signed-off-by: David Rientjes <rientjes@google.com>
    Acked-by: Hugh Dickins <hughd@google.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit d540b168908615f1382b7eb3dfa12f95fc79bba0
Author: Dan Streetman <ddstreet@ieee.org>
Date:   Wed Jun 4 16:09:59 2014 -0700

    swap: change swap_list_head to plist, add swap_avail_head
    
    commit 18ab4d4ced0817421e6db6940374cc39d28d65da upstream.
    
    Originally get_swap_page() started iterating through the singly-linked
    list of swap_info_structs using swap_list.next or highest_priority_index,
    which both were intended to point to the highest priority active swap
    target that was not full.  The first patch in this series changed the
    singly-linked list to a doubly-linked list, and removed the logic to start
    at the highest priority non-full entry; it starts scanning at the highest
    priority entry each time, even if the entry is full.
    
    Replace the manually ordered swap_list_head with a plist, swap_active_head.
    Add a new plist, swap_avail_head.  The original swap_active_head plist
    contains all active swap_info_structs, as before, while the new
    swap_avail_head plist contains only swap_info_structs that are active and
    available, i.e. not full.  Add a new spinlock, swap_avail_lock, to protect
    the swap_avail_head list.
    
    Mel Gorman suggested using plists since they internally handle ordering
    the list entries based on priority, which is exactly what swap was doing
    manually.  All the ordering code is now removed, and swap_info_struct
    entries are simply added to their corresponding plist and automatically
    ordered correctly.
    
    Using a new plist for available swap_info_structs simplifies and
    optimizes get_swap_page(), which no longer has to iterate over full
    swap_info_structs.  Using a new spinlock for the swap_avail_head plist
    allows each swap_info_struct to add or remove itself from the
    plist when it becomes full or not-full; previously it could not
    do so because the swap_info_struct->lock is held when it changes
    from full<->not-full, and the swap_lock protecting the main
    swap_active_head must be ordered before any swap_info_struct->lock.
    
    Signed-off-by: Dan Streetman <ddstreet@ieee.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Shaohua Li <shli@fusionio.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dan Streetman <ddstreet@ieee.org>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
    Cc: Weijie Yang <weijieut@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Bob Liu <bob.liu@oracle.com>
    Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit ae604916e258d7197cf4ca0249298897f29f0d20
Author: Dan Streetman <ddstreet@ieee.org>
Date:   Wed Jun 4 16:09:57 2014 -0700

    lib/plist: add plist_requeue
    
    commit a75f232ce0fe38bd01301899ecd97ffd0254316a upstream.
    
    Add plist_requeue(), which moves the specified plist_node after all other
    same-priority plist_nodes in the list.  This is essentially an optimized
    plist_del() followed by plist_add().
    
    This is needed by swap, which (with the next patch in this set) uses a
    plist of available swap devices.  When a swap device (either a swap
    partition or swap file) is added to the system with swapon(), the device
    is added to a plist, ordered by the swap device's priority.  When swap
    needs to allocate a page from one of the swap devices, it takes the page
    from the first swap device on the plist, which is the highest priority
    swap device.  The swap device is left in the plist until all its pages are
    used, and then removed from the plist when it becomes full.
    
    However, as described in man 2 swapon, swap must allocate pages from swap
    devices with the same priority in round-robin order; to do this, on each
    swap page allocation, swap uses a page from the first swap device in the
    plist, and then calls plist_requeue() to move that swap device entry to
    after any other same-priority swap devices.  The next swap page allocation
    will again use a page from the first swap device in the plist and requeue
    it, and so on, resulting in round-robin usage of equal-priority swap
    devices.
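
    As a rough user-space model of that round-robin behaviour (an array
    standing in for the plist, with made-up names; not the kernel
    implementation):
    
    #include <stdio.h>
    #include <string.h>
    
    struct swapdev { const char *name; int prio; };
    
    /* Kept sorted by descending priority, as a plist would keep it. */
    static struct swapdev *avail[4];
    static int navail;
    
    /* Move entry 0 after the last entry of equal priority (cf. plist_requeue). */
    static void requeue_first(void)
    {
            struct swapdev *d = avail[0];
            int i = 1;
    
            while (i < navail && avail[i]->prio == d->prio)
                    i++;
            memmove(&avail[0], &avail[1], (i - 1) * sizeof(avail[0]));
            avail[i - 1] = d;
    }
    
    int main(void)
    {
            struct swapdev a = { "sda2", 10 }, b = { "sdb2", 10 }, c = { "file", 5 };
    
            avail[0] = &a; avail[1] = &b; avail[2] = &c; navail = 3;
    
            /* Allocate four "pages": the two equal-priority devices alternate,
             * while the lower-priority one is never touched. */
            for (int n = 0; n < 4; n++) {
                    printf("page %d from %s\n", n, avail[0]->name);
                    requeue_first();
            }
            return 0;
    }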
    
    Also add plist_test_requeue() test function, for use by plist_test() to
    test plist_requeue() function.
    
    Signed-off-by: Dan Streetman <ddstreet@ieee.org>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Shaohua Li <shli@fusionio.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dan Streetman <ddstreet@ieee.org>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
    Cc: Weijie Yang <weijieut@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Bob Liu <bob.liu@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit fd6d61cc8a2c9ceb768423a3979127833b30c12d
Author: Dan Streetman <ddstreet@ieee.org>
Date:   Wed Jun 4 16:09:55 2014 -0700

    lib/plist: add helper functions
    
    commit fd16618e12a05df79a3439d72d5ffdac5d34f3da upstream.
    
    Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
    define and initialize a struct plist_head.
    
    Add plist_for_each_continue() and plist_for_each_entry_continue(),
    equivalent to list_for_each_continue() and list_for_each_entry_continue(),
    to iterate over a plist continuing after the current position.
    
    Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
    and ->next, implemented by list_prev_entry() and list_next_entry(), to
    access the prev/next struct plist_node entry.  These are needed because
    unlike struct list_head, direct access of the prev/next struct plist_node
    isn't possible; the list must be navigated via the contained struct
    list_head.  e.g.  instead of accessing the prev by list_prev_entry(node,
    node_list) it can be accessed by plist_prev(node).
    
    Signed-off-by: Dan Streetman <ddstreet@ieee.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Shaohua Li <shli@fusionio.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dan Streetman <ddstreet@ieee.org>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
    Cc: Weijie Yang <weijieut@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Bob Liu <bob.liu@oracle.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit bcbfe6fdf8576a545fafdfe4611f59cc6b166589
Author: Dan Streetman <ddstreet@ieee.org>
Date:   Wed Jun 4 16:09:53 2014 -0700

    swap: change swap_info singly-linked list to list_head
    
    commit adfab836f4908deb049a5128082719e689eed964 upstream.
    
    The logic controlling the singly-linked list of swap_info_struct entries
    for all active, i.e.  swapon'ed, swap targets is rather complex, because:
    
     - it stores the entries in priority order
     - there is a pointer to the highest priority entry
     - there is a pointer to the highest priority not-full entry
     - there is a highest_priority_index variable set outside the swap_lock
     - swap entries of equal priority should be used equally
    
    This complexity leads to bugs such as https://lkml.org/lkml/2014/2/13/181,
    where different priority swap targets are incorrectly used equally.
    
    That bug probably could be solved with the existing singly-linked lists,
    but I think it would only add more complexity to the already difficult to
    understand get_swap_page() swap_list iteration logic.
    
    The first patch changes from a singly-linked list to a doubly-linked list
    using list_heads; the highest_priority_index and related code are removed
    and get_swap_page() starts each iteration at the highest priority
    swap_info entry, even if it's full.  While this does introduce unnecessary
    list iteration (i.e.  Schlemiel the painter's algorithm) in the case where
    one or more of the highest priority entries are full, the iteration and
    manipulation code is much simpler and behaves correctly re: the above bug;
    and the fourth patch removes the unnecessary iteration.
    
    The second patch adds some minor plist helper functions; nothing new
    really, just functions to match existing regular list functions.  These
    are used by the next two patches.
    
    The third patch adds plist_requeue(), which is used by get_swap_page() in
    the next patch - it performs the requeueing of same-priority entries
    (which moves the entry to the end of its priority in the plist), so that
    all equal-priority swap_info_structs get used equally.
    
    The fourth patch converts the main list into a plist, and adds a new plist
    that contains only swap_info entries that are both active and not full.
    As Mel suggested using plists allows removing all the ordering code from
    swap - plists handle ordering automatically.  The list naming is also
    clarified now that there are two lists, with the original list changed
    from swap_list_head to swap_active_head and the new list named
    swap_avail_head.  A new spinlock is also added for the new list, so
    swap_info entries can be added or removed from the new list immediately as
    they become full or not full.
    
    This patch (of 4):
    
    Replace the singly-linked list tracking active, i.e.  swapon'ed,
    swap_info_struct entries with a doubly-linked list using struct
    list_heads.  Simplify the logic iterating and manipulating the list of
    entries, especially get_swap_page(), by using standard list_head
    functions, and removing the highest priority iteration logic.
    
    The change fixes the bug:
    https://lkml.org/lkml/2014/2/13/181
    in which different priority swap entries after the highest priority entry
    are incorrectly used equally in pairs.  The swap behavior is now as
    advertised, i.e. different priority swap entries are used in order, and
    equal priority swap targets are used concurrently.
    
    Signed-off-by: Dan Streetman <ddstreet@ieee.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Shaohua Li <shli@fusionio.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dan Streetman <ddstreet@ieee.org>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
    Cc: Weijie Yang <weijieut@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Bob Liu <bob.liu@oracle.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit a4c51bde13ee78405827287ef24f80b1b40d45d7
Author: Michal Hocko <mhocko@suse.cz>
Date:   Mon Apr 7 15:37:01 2014 -0700

    mm: exclude memoryless nodes from zone_reclaim
    
    commit 70ef57e6c22c3323dce179b7d0d433c479266612 upstream.
    
    We had a report about strange OOM killer strikes on a PPC machine
    although there was a lot of free swap and tons of anonymous memory
    which could have been swapped out.  In the end it turned out that the
    OOM was a side effect of zone reclaim, which wasn't unmapping and
    swapping out, and so the system was pushed to OOM.  Although this sounds
    like a bug somewhere in the kswapd vs. zone reclaim vs. direct reclaim
    interaction, numactl on the said hardware suggests that zone reclaim
    should not have been enabled in the first place:
    
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
      node 0 size: 0 MB
      node 0 free: 0 MB
      node 2 cpus:
      node 2 size: 7168 MB
      node 2 free: 6019 MB
      node distances:
      node   0   2
      0:  10  40
      2:  40  10
    
    So all the CPUs are associated with Node0, which doesn't have any memory,
    while Node2 contains all the available memory.  The node distances cause
    zone_reclaim_mode to be enabled automatically.
    
    Zone reclaim is intended to keep allocations local, but this doesn't
    make any sense on memoryless nodes.  So let's exclude such nodes in
    init_zone_allows_reclaim(), which evaluates zone reclaim behavior and
    suitable reclaim_nodes.
    
    Signed-off-by: Michal Hocko <mhocko@suse.cz>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
    Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 5c0f0c017ce6bc8c4da65b4d0066358b3e0fbbf8
Author: Andrew Hunter <ahh@google.com>
Date:   Thu Sep 4 14:17:16 2014 -0700

    jiffies: Fix timeval conversion to jiffies
    
    commit d78c9300c51d6ceed9f6d078d4e9366f259de28c upstream.
    
    timeval_to_jiffies tried to round a timeval up to an integral number
    of jiffies, but the logic for doing so was incorrect: intervals
    corresponding to exactly N jiffies would become N+1. This manifested
    itself particularly when repeatedly stopping/starting an itimer:
    
    setitimer(ITIMER_PROF, &val, NULL);
    setitimer(ITIMER_PROF, NULL, &val);
    
    would add a full tick to val, _even if it was exactly representable in
    terms of jiffies_ (say, the result of a previous rounding.)  Doing
    this repeatedly would cause unbounded growth in val.  So fix the math.
    
    Here's what was wrong with the conversion: we essentially computed
    (eliding seconds)
    
    jiffies = usec  * (NSEC_PER_USEC/TICK_NSEC)
    
    by using scaling arithmetic, taking x as the best approximation of
    NSEC_PER_USEC/TICK_NSEC with denominator 2^USEC_JIFFIE_SC (i.e.
    NSEC_PER_USEC/TICK_NSEC ~= x/2^USEC_JIFFIE_SC), and computing:
    
    jiffies = (usec * x) >> USEC_JIFFIE_SC
    
    and rounded this calculation up in the intermediate form (since we
    can't necessarily exactly represent TICK_NSEC in usec.) But the
    scaling arithmetic is a (very slight) *over*approximation of the true
    value; that is, instead of dividing by (1 usec/ 1 jiffie), we
    effectively divided by (1 usec/1 jiffie)-epsilon (rounding
    down). This would normally be fine, but we want to round timeouts up,
    and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
    would be fine if our division was exact, but dividing this by the
    slightly smaller factor was equivalent to adding just _over_ 1 to the
    final result (instead of just _under_ 1, as desired.)
    
    In particular, with HZ=1000, we consistently computed that 10000 usec
    was 11 jiffies; the same was true for any exact multiple of
    TICK_NSEC.
    
    We could possibly still round in the intermediate form, adding
    something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
    convert usec->nsec, round in nanoseconds, and then convert using
    time*spec*_to_jiffies.  This adds one constant multiplication, and is
    not observably slower in microbenchmarks on recent x86 hardware.
    
    Tested: the following program:
    
    #include <stdio.h>
    #include <sys/time.h>
    
    int main() {
      struct itimerval zero = {{0, 0}, {0, 0}};
      /* Initially set to 10 ms. */
      struct itimerval initial = zero;
      initial.it_interval.tv_usec = 10000;
      setitimer(ITIMER_PROF, &initial, NULL);
      /* Save and restore several times. */
      for (size_t i = 0; i < 10; ++i) {
        struct itimerval prev;
        setitimer(ITIMER_PROF, &zero, &prev);
        /* on old kernels, this goes up by TICK_USEC every iteration */
        printf("previous value: %ld %ld %ld %ld\n",
               prev.it_interval.tv_sec, prev.it_interval.tv_usec,
               prev.it_value.tv_sec, prev.it_value.tv_usec);
        setitimer(ITIMER_PROF, &prev, NULL);
      }
        return 0;
    }
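
    For the arithmetic itself, here is a small stand-alone illustration of
    why rounding with a slightly over-approximated reciprocal adds a whole
    jiffy to exact multiples, while rounding in nanoseconds does not.  The
    scale factor of 2^20 is made up for the example and is not the kernel's
    USEC_JIFFIE_SC:
    
    #include <stdio.h>
    
    #define HZ        1000
    #define TICK_NSEC (1000000000ULL / HZ)  /* 1,000,000 ns per jiffy */
    #define SC        20                    /* illustrative scale factor */
    
    int main(void)
    {
            unsigned long long usec = 10000; /* exactly 10 jiffies at HZ=1000 */
    
            /* Old style: multiply by a rounded-up reciprocal, round up the shift. */
            unsigned long long x = ((1ULL << SC) * 1000 + TICK_NSEC - 1) / TICK_NSEC;
            unsigned long long old = (usec * x + (1ULL << SC) - 1) >> SC;
    
            /* New style: convert to nanoseconds, then one exact rounded division. */
            unsigned long long fixed = (usec * 1000 + TICK_NSEC - 1) / TICK_NSEC;
    
            printf("old rounding: %llu jiffies\n", old);    /* 11: one too many */
            printf("new rounding: %llu jiffies\n", fixed);  /* 10: exact */
            return 0;
    }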
    
    
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Richard Cochran <richardcochran@gmail.com>
    Cc: Prarit Bhargava <prarit@redhat.com>
    Reviewed-by: Paul Turner <pjt@google.com>
    Reported-by: Aaron Jacobs <jacobsa@google.com>
    Signed-off-by: Andrew Hunter <ahh@google.com>
    [jstultz: Tweaked to apply to 3.17-rc]
    Signed-off-by: John Stultz <john.stultz@linaro.org>
    [bwh: Backported to 3.16: adjust filename]
    Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit cbb87efb98dfabed6dd73b040f25badeaf319960
Author: Hans Verkuil <hans.verkuil@cisco.com>
Date:   Sat Sep 20 16:16:35 2014 -0300

    media: vb2: fix VBI/poll regression
    
    commit 58d75f4b1ce26324b4d809b18f94819843a98731 upstream.
    
    The recent conversion of saa7134 to vb2 uncovered a poll() bug that
    broke the teletext applications alevt and mtt. These applications
    expect that calling poll() without having called VIDIOC_STREAMON will
    cause poll() to return POLLERR. That did not happen in vb2.
    
    This patch fixes that behavior. It also fixes what should happen when
    poll() is called when STREAMON has been called but no buffers have been
    queued. In that case poll() will also return POLLERR, but only for
    capture queues since output queues will always return POLLOUT
    anyway in that situation.
    
    This brings the vb2 behavior in line with the old videobuf behavior.
    
    Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
    Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
    Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 5f50c44d8a63ee6c4801cdcb372b8048ac77efcf
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Oct 2 19:47:42 2014 +0100

    mm: numa: Do not mark PTEs pte_numa when splitting huge pages
    
    commit abc40bd2eeb77eb7c2effcaf63154aad929a1d5f upstream.
    
    This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
    NUMA type from the pmd to the pte"). If a huge page is being split due to
    a protection change and the tail will be in a PROT_NONE vma, then NUMA
    hinting PTEs are temporarily created in the protected VMA.
    
     VM_RW|VM_PROTNONE
    |-----------------|
          ^
          split here
    
    In the specific case above, it should get fixed up by change_pte_range()
    but there is a window of opportunity for weirdness to happen. Similarly,
    if a huge page is shrunk and split during a protection update but before
    pmd_numa is cleared then a pte_numa can be left behind.
    
    Instead of adding complexity trying to deal with the case, this patch
    will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
    will not be triggered which is marginal in comparison to the complexity
    in dealing with the corner cases during THP split.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Rik van Riel <riel@redhat.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 1da286ebc5a1d23d0b4b88ba0d64fc141ac4c37d
Author: Waiman Long <Waiman.Long@hp.com>
Date:   Wed Aug 6 16:05:36 2014 -0700

    mm, thp: move invariant bug check out of loop in __split_huge_page_map
    
    commit f8303c2582b889351e261ff18c4d8eb197a77db2 upstream.
    
    In __split_huge_page_map(), the check for page_mapcount(page) is
    invariant within the for loop.  Because the macro is implemented using
    atomic_read(), the redundant check cannot be optimized away by the
    compiler, leading to unnecessary reads of the page structure.
    
    This patch moves the invariant bug check out of the loop so that it will
    be done only once.  On a 3.16-rc1 based kernel, a microbenchmark that
    broke up 1000 transparent huge pages using munmap() had an execution
    time of 38,245us with the patch and 38,548us without it.  The
    performance gain is about 1%.
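
    Schematically, the change has this shape (a stand-alone sketch with
    made-up names; assert() and a volatile counter stand in for the kernel's
    BUG_ON() and atomic_read(); this is not __split_huge_page_map()):
    
    #include <assert.h>
    #include <stdio.h>
    
    static volatile int page_mapcount_storage = 1;
    
    /* Stand-in for the atomic_read()-based page_mapcount(): the compiler
     * cannot cache or hoist this read by itself, which is why the check has
     * to be moved out of the loop by hand. */
    static int page_mapcount(void)
    {
            return page_mapcount_storage;
    }
    
    int main(void)
    {
            int mapped = 0;
    
            /* Before: the assert sat inside the loop below, re-reading the
             * counter 512 times.  After: the value is invariant across the
             * loop, so check it exactly once up front. */
            assert(page_mapcount() == 1);
    
            for (int i = 0; i < 512; i++)
                    mapped++;               /* ... set up one pte ... */
    
            printf("mapped %d ptes with a single invariant check\n", mapped);
            return 0;
    }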
    
    Signed-off-by: Waiman Long <Waiman.Long@hp.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Scott J Norton <scott.norton@hp.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit de1fc405fbc586005607e51599da5997463fbefc
Author: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Date:   Tue May 6 12:50:00 2014 -0700

    hugetlb: ensure hugepage access is denied if hugepages are not supported
    
    commit 457c1b27ed56ec472d202731b12417bff023594a upstream.
    
    Currently, I am seeing the following when I `mount -t hugetlbfs /none
    /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`.  I think it's
    related to the fact that hugetlbfs is not correctly setting itself up in
    this state:
    
      Unable to handle kernel paging request for data at address 0x00000031
      Faulting instruction address: 0xc000000000245710
      Oops: Kernel access of bad area, sig: 11 [#1]
      SMP NR_CPUS=2048 NUMA pSeries
      ....
    
    In KVM guests on Power, in a guest not backed by hugepages, we see the
    following:
    
      AnonHugePages:         0 kB
      HugePages_Total:       0
      HugePages_Free:        0
      HugePages_Rsvd:        0
      HugePages_Surp:        0
      Hugepagesize:         64 kB
    
    HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
    are not supported at boot-time, but this is only checked in
    hugetlb_init().  Extract the check to a helper function, and use it in a
    few relevant places.
    
    This does make hugetlbfs not supported (not registered at all) in this
    environment.  I believe this is fine, as there are no valid hugepages
    and that won't change at runtime.
    
    [akpm@linux-foundation.org: use pr_info(), per Mel]
    [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
    Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
    Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 04ceeee9bb02eabbaaf14c0d0dc759caa60de230
Author: Pavel Shilovsky <pshilovsky@samba.org>
Date:   Mon Aug 18 20:49:57 2014 +0400

    CIFS: Fix SMB2 readdir error handling
    
    commit 52755808d4525f4d5b86d112d36ffc7a46f3fb48 upstream.
    
    SMB2 servers indicate the end of a directory search with the
    STATUS_NO_MORE_FILE error code, which is not processed now.
    This causes the generic/257 xfstest to fail. Fix this by treating
    this error code as the end of the search in SMB2_query_directory.
    
    Also, when negotiating the CIFS protocol we tell the server to close
    the search automatically at the end, so there is no need to do
    it ourselves. In the case of the SMB2 protocol, we need to close it
    explicitly - so separate the close directory checks for the different
    protocols.
    
    Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
    Signed-off-by: Steve French <smfrench@gmail.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 3a525e231651bb3c8fe7be16b8b83e78146740aa
Author: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Date:   Thu Oct 2 16:51:18 2014 -0400

    ring-buffer: Fix infinite spin in reading buffer
    
    commit 24607f114fd14f2f37e3e0cb3d47bce96e81e848 upstream.
    
    Commit 651e22f2701b "ring-buffer: Always reset iterator to reader page"
    fixed one bug but in the process caused another one. The reset is to
    update the header page, but that fix also changed the way the cached
    reads were updated. The cache reads are used to test if an iterator
    needs to be updated or not.
    
    A ring buffer iterator, when created, disables writes to the ring buffer
    but does not stop other readers or consuming reads from happening.
    Although all readers are synchronized via a lock, they are only
    synchronized when in the ring buffer functions. Those functions may
    be called by any number of readers. The iterator continues down when
    it's not interrupted by a consuming reader. If a consuming read
    occurs, the iterator starts from the beginning of the buffer.
    
    The way the iterator sees that a consuming read has happened since
    its last read is by checking the reader "cache". The cache holds the
    last counts of the read and the reader page itself.
    
    Commit 651e22f2701b changed what was saved by the cache_read when
    the rb_iter_reset() occurred, making the iterator never match the cache.
    Then if the iterator calls rb_iter_reset(), it will go into an
    infinite loop by checking if the cache doesn't match, doing the reset
    and retrying, just to see that the cache still doesn't match! This
    should never happen, as the reset is supposed to set the cache to the
    current value and there are locks that keep a consuming reader from
    having access to the data.
    
    Fixes: 651e22f2701b "ring-buffer: Always reset iterator to reader page"
    Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 769721dbd6dd08ecae0455b5efdc011fe8006da3
Author: Josh Triplett <josh@joshtriplett.org>
Date:   Fri Oct 3 16:19:24 2014 -0700

    init/Kconfig: Fix HAVE_FUTEX_CMPXCHG to not break up the EXPERT menu
    
    commit 62b4d2041117f35ab2409c9f5c4b8d3dc8e59d0f upstream.
    
    commit 03b8c7b623c80af264c4c8d6111e5c6289933666 ("futex: Allow
    architectures to skip futex_atomic_cmpxchg_inatomic() test") added the
    HAVE_FUTEX_CMPXCHG symbol right below FUTEX.  This placed it right in
    the middle of the options for the EXPERT menu.  However,
    HAVE_FUTEX_CMPXCHG does not depend on EXPERT or FUTEX, so Kconfig stops
    placing items in the EXPERT menu, and displays the remaining several
    EXPERT items (starting with EPOLL) directly in the General Setup menu.
    
    Since both users of HAVE_FUTEX_CMPXCHG only select it "if FUTEX", make
    HAVE_FUTEX_CMPXCHG itself depend on FUTEX.  With this change, the
    subsequent items display as part of the EXPERT menu again; the EMBEDDED
    menu now appears as the next top-level item in the General Setup menu,
    which makes General Setup much shorter and more usable.
    
    Signed-off-by: Josh Triplett <josh@joshtriplett.org>
    Acked-by: Randy Dunlap <rdunlap@infradead.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit f012edf688f867dc35a3501808afc6fa9fba640d
Author: Steve French <smfrench@gmail.com>
Date:   Thu Sep 25 01:26:55 2014 -0500

    Fix problem recognizing symlinks
    
    commit 19e81573fca7b87ced7701e01ba164b968d929bd upstream.
    
    Changeset eb85d94bd introduced a problem where, if a cifs open
    fails during query info of a file, we
    will still try to close the file (this happens with certain types
    of reparse points) even though the file handle is not valid.
    
    In addition, for SMB2/SMB3 we were not mapping the return code returned
    by Windows when trying to open a file which is a reparse point
    (like a Windows NFS symlink).
    
    Signed-off-by: Steve French <smfrench@gmail.com>
    Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 7365af49809ff97ab55d76c746ed49c044ca4b2d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Sep 25 10:13:12 2014 +0100

    drm/i915: Flush the PTEs after updating them before suspend
    
    commit 91e56499304f3d612053a9cf17f350868182c7d8 upstream.
    
    As we use WC updates of the PTE, we are responsible for notifying the
    hardware when to flush its TLBs. Do so after we zap all the PTEs before
    suspend (and the BIOS tries to read our GTT).
    
    Fixes a regression from
    
    commit 828c79087cec61eaf4c76bb32c222fbe35ac3930
    Author: Ben Widawsky <benjamin.widawsky@intel.com>
    Date:   Wed Oct 16 09:21:30 2013 -0700
    
        drm/i915: Disable GGTT PTEs on GEN6+ suspend
    
    that survived and continue to cause harm even after
    
    commit e568af1c626031925465a5caaab7cca1303d55c7
    Author: Daniel Vetter <daniel.vetter@ffwll.ch>
    Date:   Wed Mar 26 20:08:20 2014 +0100
    
        drm/i915: Undo gtt scratch pte unmapping again
    
    v2: Trivial rebase.
    v3: Fixes requires pointer dances.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=82340
    Tested-by: ming.yao@intel.com
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Takashi Iwai <tiwai@suse.de>
    Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
    Cc: Todd Previte <tprevite@gmail.com>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Jani Nikula <jani.nikula@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 6af970fada5c6d714fdc6320dce68439cc2c4423
Author: NeilBrown <neilb@suse.de>
Date:   Thu Oct 2 13:45:00 2014 +1000

    md/raid5: disable 'DISCARD' by default due to safety concerns.
    
    commit 8e0e99ba64c7ba46133a7c8a3e3f7de01f23bd93 upstream.
    
    It has come to my attention (thanks Martin) that 'discard_zeroes_data'
    is only a hint.  Some devices in some cases don't do what it
    says on the label.
    
    The use of DISCARD in RAID5 depends on reads from discarded regions
    being predictably zero.  If a write to a previously discarded region
    performs a read-modify-write cycle it assumes that the parity block
    was consistent with the data blocks.  If all were zero, this would
    be the case.  If some are and some aren't this would not be the case.
    This could lead to data corruption after a device failure when
    data needs to be reconstructed from the parity.
    
    As we cannot trust 'discard_zeroes_data', ignore it by default
    and so disallow DISCARD on all raid4/5/6 arrays.
    
    As many devices are trustworthy, and as there are benefits to using
    DISCARD, add a module parameter to over-ride this caution and cause
    DISCARD to work if discard_zeroes_data is set.
    
    If a site wants to enable DISCARD on some arrays but not on others, it
    should select DISCARD support at the filesystem level, and set the
    raid456 module parameter.
        raid456.devices_handle_discard_safely=Y
    
    As this is a data-safety issue, I believe this patch is suitable for
    -stable.
    DISCARD support for RAID456 was added in 3.7
    
    Cc: Shaohua Li <shli@kernel.org>
    Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
    Cc: Mike Snitzer <snitzer@redhat.com>
    Cc: Heinz Mauelshagen <heinzm@redhat.com>
    Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
    Acked-by: Mike Snitzer <snitzer@redhat.com>
    Fixes: 620125f2bf8ff0c4969b79653b54d7bcc9d40637
    Signed-off-by: NeilBrown <neilb@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit da893cbeda65e2d43663056fa21be8036bdefe2e
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Fri Sep 26 22:19:12 2014 +0200

    cpufreq: integrator: fix integrator_cpufreq_remove return type
    
    commit d62dbf77f7dfaa6fb455b4b9828069a11965929c upstream.
    
    When building this driver as a module, we get a helpful warning
    about the return type:
    
    drivers/cpufreq/integrator-cpufreq.c:232:2: warning: initialization from incompatible pointer type
      .remove = __exit_p(integrator_cpufreq_remove),
    
    If the remove callback returns void, the caller gets an undefined
    value, as it expects an integer to be returned. This patch fixes the
    problem by passing down the return value from cpufreq_unregister_driver().
    
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 5b81f9368aa61e521599a23d576c01bc655f95e7
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Oct 2 19:47:41 2014 +0100

    mm: migrate: Close race between migration completion and mprotect
    
    commit d3cb8bf6081b8b7a2dabb1264fe968fd870fa595 upstream.
    
    A migration entry is marked as write if pte_write was true at the time the
    entry was created. The VMA protections are not double checked when migration
    entries are being removed, as mprotect marks write-migration-entries as
    read. It means that potentially we take a spurious fault to mark PTEs write
    again, but that is straightforward. However, there is a race between write
    migrations being marked read and migrations finishing. This potentially
    allows a PTE to be made writable that should have been read-only. Close this
    race by double checking the VMA permissions using maybe_mkwrite when
    migration completes.
    
    [torvalds@linux-foundation.org: use maybe_mkwrite]
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Rik van Riel <riel@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit fddac5ed7699cbf4828f7493d4995e739d6fe6a5
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu Oct 2 16:17:02 2014 -0700

    perf: fix perf bug in fork()
    
    commit 6c72e3501d0d62fc064d3680e5234f3463ec5a86 upstream.
    
    Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by
    calling perf_event_free_task() when failing sched_fork() we will not yet
    have done the memset() on ->perf_event_ctxp[] and will therefore try to
    'free' the inherited contexts, which are still in use by the parent
    process.  This is bad.
    
    Suggested-by: Oleg Nesterov <oleg@redhat.com>
    Reported-by: Oleg Nesterov <oleg@redhat.com>
    Reported-by: Sylvain 'ythier' Hitier <sylvain.hitier@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 82335226733fdf82ee3f231c08269a17fd62a3fc
Author: Jan Kara <jack@suse.cz>
Date:   Thu Sep 4 14:06:55 2014 +0200

    udf: Avoid infinite loop when processing indirect ICBs
    
    commit c03aa9f6e1f938618e6db2e23afef0574efeeb65 upstream.
    
    We did not implement any bound on the number of indirect ICBs we follow
    when loading an inode. Thus a corrupted medium could cause the kernel to
    go into an infinite loop, possibly causing a stack overflow.
    
    Fix the possible stack overflow by removing recursion from
    __udf_read_inode() and limit number of indirect ICBs we follow to avoid
    infinite loops.
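
    The shape of the fix, as a stand-alone sketch (made-up structures and an
    arbitrary cap; not the udf code): follow indirect entries iteratively
    with an explicit limit instead of recursing without bound.
    
    #include <stdio.h>
    
    #define MAX_INDIRECT 16                 /* arbitrary illustrative cap */
    
    struct icb { struct icb *indirect; int data; };
    
    /* Returns the data of the final ICB, or -1 if the chain is too long. */
    static int read_inode(struct icb *icb)
    {
            for (int hops = 0; hops < MAX_INDIRECT; hops++) {
                    if (!icb->indirect)
                            return icb->data;   /* reached a direct entry */
                    icb = icb->indirect;        /* follow the link iteratively */
            }
            return -1;                          /* too many indirect ICBs: give up */
    }
    
    int main(void)
    {
            struct icb leaf = { NULL, 42 };
            struct icb mid  = { &leaf, 0 };
            struct icb loop = { NULL, 0 };
    
            loop.indirect = &loop;              /* corrupted, self-referencing chain */
    
            printf("good chain:      %d\n", read_inode(&mid));   /* 42 */
            printf("corrupted chain: %d\n", read_inode(&loop));  /* -1, bounded */
            return 0;
    }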
    
    Signed-off-by: Jan Kara <jack@suse.cz>
    Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>