Alexandre Courbot [Wed, 8 Jun 2016 08:32:39 +0000 (17:32 +0900)]
drm/nouveau/gr/gf100: handle secure boot errors
Handle and propagate secure boot errors. Failure to do so results in
Nouveau incorrectly believing init has succeeded and a completely
black display during boot. If we propagate the error, GR init will fail
and the user will at least have a working display.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 8 Jun 2016 08:32:38 +0000 (17:32 +0900)]
drm/nouveau/secboot: fix kerneldoc for secure boot structures
Some members were documented in the wrong structure.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Karol Herbst [Sun, 17 Apr 2016 12:51:23 +0000 (14:51 +0200)]
drm/nouveau/hwmon: add in_min and in_max
it is a little help for hardware monitoring tools
Signed-off-by: Karol Herbst <karolherbst@gmail.de>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Karol Herbst [Fri, 26 Feb 2016 06:49:08 +0000 (07:49 +0100)]
drm/nouveau/volt: save the voltage range we are able to set
We shouldn't set voltages below the min or above the max voltage the gpu is
able to set, so save the range for future lookups.
Signed-off-by: Karol Herbst <karolherbst@gmail.de>
Reviewed-by: Martin Peres <martin.peres@free.fr>
Tested-by: Pierre Moreau <pierre.morrow@free.fr>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 1 Jun 2016 08:39:29 +0000 (17:39 +0900)]
drm/nouveau/clk/gm20b: add glitchless and DFS support
This patch adds support for advanced features supported by the
Noise-Aware PLL of Maxwell. Glitchless switch allows the PL field to be
updated without disabling the PLL first if the SYNC_MODE bit of the CFG
register is set.
More significantly, DFS allows the PLL to monitor the actual input
voltage and to dynamically lower the output frequency accordingly. This
allows the clock to be more tolerant of lower voltages.
These improvements are only supported for Tegra speedos >= 1.
Also add the voltage table that is suitable for GM20B's NAPLL. This
change needs to be done atomically for the right voltages to be used by
the clock driver.
v2. Fix build on non-Tegra platforms
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 1 Jun 2016 08:39:28 +0000 (17:39 +0900)]
drm/nouveau/clk/gk20a: rename constructor
Strip the _ prefix off the gk20a clock constructor.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 1 Jun 2016 08:39:27 +0000 (17:39 +0900)]
drm/nouveau/clk/gk20a: improve MNP programming
Split the MNP programming function into two functions for the cases
where we allow sliding or not, instead of making it take a parameter for
this. This results in less conditionals in the code and makes it easier
to read.
Also make the MNP programming functions take the PLL parameters as
arguments, and move bits of code to more relevant places (previous
programming tended to be just-in-time, which added more conditionnals in
the code).
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 1 Jun 2016 08:39:26 +0000 (17:39 +0900)]
drm/nouveau/clk/gk20a: factorize n_lo computation code
Use a dedicated function instead of always calculating n_lo on the fly.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 1 Jun 2016 08:39:25 +0000 (17:39 +0900)]
drm/nouveau/clk/gk20a: parameterize PLL settings
Make functions manipulating PLL settings take them as an argument,
instead of assuming we want to work on the copy in the gk20a_clk
structure. This makes these functions more flexible, which we will need
in GM20B.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 1 Jun 2016 08:39:24 +0000 (17:39 +0900)]
drm/nouveau/clk/gk20a: add and use MNP programming functions
Add relevant functions to work with the gk20a_pll structure and use them
where they ought to be instead of directly manipulating registers.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 1 Jun 2016 08:39:23 +0000 (17:39 +0900)]
drm/nouveau/clk/gk20a: use nvkm_ functions in slide()
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 1 Jun 2016 08:39:22 +0000 (17:39 +0900)]
drm/nouveau/clk/gk20a: reorganize MNP calculation a bit
Move variables declarations to their actual scope of use, and simplify
code a bit.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 1 Jun 2016 08:39:21 +0000 (17:39 +0900)]
drm/nouveau/clk/gk20a: setup slide once during init
Slide setup needs to be performed only once, during init. Also
use the proper parameters for different clock speeds.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 1 Jun 2016 08:39:20 +0000 (17:39 +0900)]
drm/nouveau/clk/gk20a: properly protect macro argument
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 1 Jun 2016 08:39:19 +0000 (17:39 +0900)]
drm/nouveau/volt/gm20b: add support for vmin parameter
Chips may be characterized for a minimum voltage. Support this extra
parameter and select the appropriate minimum voltage for the detected
GPU speedo.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 1 Jun 2016 08:39:18 +0000 (17:39 +0900)]
drm/nouveau/volt/gk20a: rename constructor
Strip the _ prefix off the gk20a volt constructor.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 1 Jun 2016 08:39:17 +0000 (17:39 +0900)]
drm/nouveau/volt/gk20a: constify and name v_scale
Give a name to this constant so we at least get an idea of what it is
for.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 1 Jun 2016 08:39:16 +0000 (17:39 +0900)]
drm/nouveau/volt/gk20a: make unused public functions static
Nobody else is using these, so make them private.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Alexandre Courbot [Wed, 1 Jun 2016 08:39:15 +0000 (17:39 +0900)]
drm/nouveau/tegra: fetch gpu_speedo_id
The GPU speedo ID is required to select the right clk/volt parameters on
GM20B.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ben Skeggs [Sun, 29 May 2016 22:57:55 +0000 (08:57 +1000)]
drm/nouveau/secboot: use nvkm_mc_enable/disable()
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ben Skeggs [Sun, 29 May 2016 22:56:23 +0000 (08:56 +1000)]
drm/nouveau/secboot: use nvkm_mc_intr_mask/unmask()
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ben Skeggs [Sun, 29 May 2016 22:53:06 +0000 (08:53 +1000)]
drm/nouveau/mc/gk104-: add pmu reset mask
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ben Skeggs [Sun, 29 May 2016 22:50:50 +0000 (08:50 +1000)]
drm/nouveau/mc/gf100-: support for masking interrupts
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ben Skeggs [Sun, 29 May 2016 22:48:21 +0000 (08:48 +1000)]
drm/nouveau/mc/gt215: support for masking interrupts
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ben Skeggs [Sun, 29 May 2016 22:39:27 +0000 (08:39 +1000)]
drm/nouveau/mc: support for temporarily masking interrupts from a specific device
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ben Skeggs [Sun, 29 May 2016 22:27:22 +0000 (08:27 +1000)]
drm/nouveau/mc: s/intr_mask/intr_stat/
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ben Skeggs [Sun, 29 May 2016 22:23:41 +0000 (08:23 +1000)]
drm/nouveau/mc: expose device enable/disable separately, as well as reset
There are cases where subdevs need to perform additonal actions around
the master reset, so we want to expost the operations separately.
This commit also adds a flag to the NV_PMC_ENABLE bitfield definitions
which allow skipping the automatic reset() called from core/subdev.c.
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ben Skeggs [Sun, 29 May 2016 22:17:58 +0000 (08:17 +1000)]
drm/nouveau/mc: take nvkm_device as argument to public functions
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ben Skeggs [Sun, 29 May 2016 23:23:06 +0000 (09:23 +1000)]
drm/nouveau/mc: allow construction of subclassed device
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ben Skeggs [Mon, 30 May 2016 00:36:02 +0000 (10:36 +1000)]
drm/nouveau/top: add function to lookup interrupt mask for a given device
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ben Skeggs [Mon, 30 May 2016 00:32:55 +0000 (10:32 +1000)]
drm/nouveau/top: take nvkm_device as argument to public functions
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Dave Airlie [Mon, 4 Jul 2016 23:43:02 +0000 (09:43 +1000)]
Merge tag 'asoc-hdmi-codec-pdata' of git://git./linux/kernel/git/broonie/sound into drm-next
ASoC: Add private data for HDMI CODEC callbacks
Allow the HDMI CODECs to get private data passed in in callbacks.
[airlied:
Add STI/mediatek patches from Arnd for drivers merged later in drm tree.]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Airlie <airlied@redhat.com>
* tag 'asoc-hdmi-codec-pdata' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound:
ASoC: hdmi-codec: callback function will be called with private data
Dave Airlie [Sat, 2 Jul 2016 06:21:35 +0000 (16:21 +1000)]
Merge branch 'for-next' of git.agner.ch/git/linux-drm-fsl-dcu into drm-next
The patchset contains a new helper in drm_fb_cma_helper.c for suspend/
resume when using cma backed framebuffers.
* 'for-next' of http://git.agner.ch/git/linux-drm-fsl-dcu:
drm/fsl-dcu: disable vblank events on CRTC disable
drm/fsl-dcu: implement suspend/resume using atomic helpers
drm/fsl-dcu: use clk helpers
drm/fsl-dcu: move layer initialization to plane file
drm/fsl-dcu: store layer registers in soc_data
drm/fb_cma_helper: add suspend helper
Dave Airlie [Sat, 2 Jul 2016 05:56:01 +0000 (15:56 +1000)]
Back-merge tag 'v4.7-rc5' into drm-next
Linux 4.7-rc5
The fsl-dcu pull needs -rc3 so go to -rc5 for now.
Dave Airlie [Sat, 2 Jul 2016 05:51:46 +0000 (15:51 +1000)]
Merge branch 'sti-drm-next-2016-06-30' of git.linaro.org/people/benjamin.gaignard/kernel into drm-next
This pull request include 3 minors fix and one feature evolution
around ASoC hdmi codec support.
* 'sti-drm-next-2016-06-30' of http://git.linaro.org/people/benjamin.gaignard/kernel:
drm: sti: Add ASoC generic hdmi codec support.
drm/sti: adjust delay for AWG
drm: sti: fix clocking issues in crtc
drm/sti: Use 64-bit timestamps
Arnaud Pouliquen [Mon, 30 May 2016 13:31:37 +0000 (15:31 +0200)]
drm: sti: Add ASoC generic hdmi codec support.
Add the interface needed by audio hdmi-codec driver.
Signed-off-by: Arnaud Pouliquen <arnaud.pouliquen@st.com>
Kuninori Morimoto [Fri, 24 Jun 2016 02:47:55 +0000 (02:47 +0000)]
ASoC: hdmi-codec: callback function will be called with private data
Current hdmi-codec driver is assuming that it will be registered
from HDMI driver. Because of this assumption, each callback function
has struct device pointer which is parent device (= HDMI).
Then, it can use dev_get_drvdata() to get private data.
OTOH, on some SoC/HDMI case, SoC has VIDEO/SOUND and HDMI IPs.
This case, it needs SoC VIDEO, SoC SOUND and HDMI video, HDMI codec
driver. In DesignWare HDMI IP case, SoC VIDEO (= DRM/KMS) driver tries
to bind DesignWare HDMI video driver, and HDMI codec driver
(= hdmi-codec). This case, above "parent device" of HDMI codec driver
is DRM/KMS driver and its "device" already has private data.
And, from DT and ASoC CPU/Codec/Card binding point of view, HDMI codec
(= hdmi-codec) needs to have "parent device" (= DRM/KMS), otherwise,
it never detect sound card.
Because of these reasons, some driver can't use dev_get_drvdata() to
get private data on hdmi-codec driver. This patch add new void pointer
on hdmi_codec_pdata for private data, and callback function will be
called with it.
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Bich Hemon [Tue, 15 Mar 2016 16:11:14 +0000 (17:11 +0100)]
drm/sti: adjust delay for AWG
Compensate delay introduced by AWG IP during DE generation
Signed-off-by: Bich Hemon <bich.hemon@st.com>
Reviewed-by: Vincent ABRIOU <vincent.abriou@st.com>
Benjamin Gaignard [Thu, 26 May 2016 08:39:20 +0000 (10:39 +0200)]
drm: sti: fix clocking issues in crtc
fix and simplify clock management in crtc to avoid unbalanced
call to clk_prepare_enable and clk_disable_unprepare functions
remove unused functions
Signed-off-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tina Ruchandani [Wed, 13 Apr 2016 09:28:02 +0000 (02:28 -0700)]
drm/sti: Use 64-bit timestamps
'struct timespec' uses a 32-bit field for seconds, which
will overflow in year 2038 and beyond. This patch is part
of a larger attempt to remove instances of timeval, timespec
and time_t, all of which suffer from the y2038 issue, from the
kernel.
Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Linus Torvalds [Mon, 27 Jun 2016 00:52:03 +0000 (17:52 -0700)]
Linux 4.7-rc5
Linus Torvalds [Sun, 26 Jun 2016 17:08:49 +0000 (10:08 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two straightforward fixes.
One is a concurrency issue only affecting SAS connected SATA drives,
but which could hang the storage subsystem if it triggers (because the
outstanding command count on error never goes back to zero) and the
other is a NO_TAG fallout from the switch to hostwide tags which
causes the system to crash on module insertion (we've checked
carefully and only the 53c700 family of drivers is vulnerable to this
issue)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
53c700: fix BUG on untagged commands
scsi: fix race between simultaneous decrements of ->host_failed
Linus Torvalds [Sat, 25 Jun 2016 15:53:38 +0000 (08:53 -0700)]
Merge branch 'for-linus-4.7-part2' of git://git./linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes part 2 from Chris Mason:
"This has one patch from Omar to bring iterate_shared back to btrfs.
We have a tree of work we queue up for directory items and it doesn't
lend itself well to shared access. While we're cleaning it up, Omar
has changed things to use an exclusive lock when there are delayed
items"
* 'for-linus-4.7-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes
Linus Torvalds [Sat, 25 Jun 2016 15:42:31 +0000 (08:42 -0700)]
Merge branch 'for-linus-4.7' of git://git./linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"I have a two part pull this time because one of the patches Dave
Sterba collected needed to be against v4.7-rc2 or higher (we used
rc4). I try to make my for-linus-xx branch testable on top of the
last major so we can hand fixes to people on the list more easily, so
I've split this pull in two.
This first part has some fixes and two performance improvements that
we've been testing for some time.
Josef's two performance fixes are most notable. The transid tracking
patch makes a big improvement on pretty much every workload"
* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: Force stripesize to the value of sectorsize
btrfs: fix disk_i_size update bug when fallocate() fails
Btrfs: fix error handling in map_private_extent_buffer
Btrfs: fix error return code in btrfs_init_test_fs()
Btrfs: don't do nocow check unless we have to
btrfs: fix deadlock in delayed_ref_async_start
Btrfs: track transid for delayed ref flushing
Linus Torvalds [Sat, 25 Jun 2016 13:55:48 +0000 (06:55 -0700)]
Merge tag 'sound-4.7-rc5' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Again pretty calm weeks: we've had only a few trivial / stable
HD-audio fixes in addition to a possible race fix for snd-dummy driver
spotted by syzkaller"
* tag 'sound-4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: dummy: Fix a use-after-free at closing
ALSA: hda / realtek - add two more Thinkpad IDs (5050,5053) for tpt460 fixup
ALSA: hda - Fix the headset mic jack detection on Dell machine
ALSA: hda/tegra: iomem fixups for sparse warnings
ALSA: hdac_regmap - fix the register access for runtime PM
Linus Torvalds [Sat, 25 Jun 2016 13:49:32 +0000 (06:49 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 kprobe fix from Thomas Gleixner:
"A single fix clearing the TF bit when a fault is single stepped"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kprobes/x86: Clear TF bit in fault on single-stepping
Linus Torvalds [Sat, 25 Jun 2016 13:38:42 +0000 (06:38 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
"A couple of scheduler fixes:
- force watchdog reset while processing sysrq-w
- fix a deadlock when enabling trace events in the scheduler
- fixes to the throttled next buddy logic
- fixes for the average accounting (missing serialization and
underflow handling)
- allow kernel threads for fallback to online but not active cpus"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Allow kthreads to fall back to online && !active cpus
sched/fair: Do not announce throttled next buddy in dequeue_task_fair()
sched/fair: Initialize throttle_count for new task-groups lazily
sched/fair: Fix cfs_rq avg tracking underflow
kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w
sched/debug: Fix deadlock when enabling sched events
sched/fair: Fix post_init_entity_util_avg() serialization
Omar Sandoval [Fri, 20 May 2016 20:50:33 +0000 (13:50 -0700)]
Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes
Commit
fe742fd4f90f ("Revert "btrfs: switch to ->iterate_shared()"")
backed out the conversion to ->iterate_shared() for Btrfs because the
delayed inode handling in btrfs_real_readdir() is racy. However, we can
still do readdir in parallel if there are no delayed nodes.
This is a temporary fix which upgrades the shared inode lock to an
exclusive lock only when we have delayed items until we come up with a
more complete solution. While we're here, rename the
btrfs_{get,put}_delayed_items functions to make it very clear that
they're just for readdir.
Tested with xfstests and by doing a parallel kernel build:
while make tinyconfig && make -j4 && git clean dqfx; do
:
done
along with a bunch of parallel finds in another shell:
while true; do
for ((i=0; i<4; i++)); do
find . >/dev/null &
done
wait
done
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Linus Torvalds [Sat, 25 Jun 2016 13:14:44 +0000 (06:14 -0700)]
Merge branch 'locking-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull locking fix from Thomas Gleixner:
"A single fix to address a race in the static key logic"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/static_key: Fix concurrent static_key_slow_inc()
Linus Torvalds [Sat, 25 Jun 2016 13:09:59 +0000 (06:09 -0700)]
Merge branch 'irq-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
"A single fix for the fallout from the conversion of MIPS GIC to irq
domains"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/mips-gic: Fix IRQs in gic_dev_domain
Linus Torvalds [Sat, 25 Jun 2016 13:01:48 +0000 (06:01 -0700)]
Merge tag 'powerpc-4.7-4' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"mm/radix (Aneesh Kumar K.V):
- Update to tlb functions ric argument
- Flush page walk cache when freeing page table
- Update Radix tree size as per ISA 3.0
mm/hash (Aneesh Kumar K.V):
- Use the correct PPP mask when updating HPTE
- Don't add memory coherence if cache inhibited is set
eeh (Gavin Shan):
- Fix invalid cached PE primary bus
bpf/jit (Naveen N. Rao):
- Disable classic BPF JIT on ppc64le
.. and fix faults caused by radix patching of SLB miss handler
(Michael Ellerman)"
* tag 'powerpc-4.7-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/bpf/jit: Disable classic BPF JIT on ppc64le
powerpc: Fix faults caused by radix patching of SLB miss handler
powerpc/eeh: Fix invalid cached PE primary bus
powerpc/mm/radix: Update Radix tree size as per ISA 3.0
powerpc/mm/hash: Don't add memory coherence if cache inhibited is set
powerpc/mm/hash: Use the correct PPP mask when updating HPTE
powerpc/mm/radix: Flush page walk cache when freeing page table
powerpc/mm/radix: Update to tlb functions ric argument
Michael Ellerman [Sat, 25 Jun 2016 11:53:30 +0000 (21:53 +1000)]
Fix build break in fork.c when THREAD_SIZE < PAGE_SIZE
Commit
b235beea9e99 ("Clarify naming of thread info/stack allocators")
breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE:
kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack'
kernel/fork.c:355:8: error: assignment from incompatible pointer type
stack = alloc_thread_stack_node(tsk, node);
^
Fix it by renaming free_stack() to free_thread_stack(), and updating the
return type of alloc_thread_stack_node().
Fixes:
b235beea9e99 ("Clarify naming of thread info/stack allocators")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 25 Jun 2016 02:08:33 +0000 (19:08 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"Two weeks worth of fixes here"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits)
init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
autofs: don't get stuck in a loop if vfs_write() returns an error
mm/page_owner: avoid null pointer dereference
tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"
fs/nilfs2: fix potential underflow in call to crc32_le
oom, suspend: fix oom_reaper vs. oom_killer_disable race
ocfs2: disable BUG assertions in reading blocks
mm, compaction: abort free scanner if split fails
mm: prevent KASAN false positives in kmemleak
mm/hugetlb: clear compound_mapcount when freeing gigantic pages
mm/swap.c: flush lru pvecs on compound page arrival
memcg: css_alloc should return an ERR_PTR value on error
memcg: mem_cgroup_migrate() may be called with irq disabled
hugetlb: fix nr_pmds accounting with shared page tables
Revert "mm: disable fault around on emulated access bit architecture"
Revert "mm: make faultaround produce old ptes"
mailmap: add Boris Brezillon's email
mailmap: add Antoine Tenart's email
mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
mm: mempool: kasan: don't poot mempool objects in quarantine
...
Linus Torvalds [Sat, 25 Jun 2016 01:52:31 +0000 (18:52 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/dledford/rdma
Pull rdma fixes from Doug Ledford:
"This is the second batch of queued up rdma patches for this rc cycle.
There isn't anything really major in here. It's passed 0day,
linux-next, and local testing across a wide variety of hardware.
There are still a few known issues to be tracked down, but this should
amount to the vast majority of the rdma RC fixes.
Round two of 4.7 rc fixes:
- A couple minor fixes to the rdma core
- Multiple minor fixes to hfi1
- Multiple minor fixes to mlx4/mlx4
- A few minor fixes to i40iw"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (31 commits)
IB/srpt: Reduce QP buffer size
i40iw: Enable level-1 PBL for fast memory registration
i40iw: Return correct max_fast_reg_page_list_len
i40iw: Correct status check on i40iw_get_pble
i40iw: Correct CQ arming
IB/rdmavt: Correct qp_priv_alloc() return value test
IB/hfi1: Don't zero out qp->s_ack_queue in rvt_reset_qp
IB/hfi1: Fix deadlock with txreq allocation slow path
IB/mlx4: Prevent cross page boundary allocation
IB/mlx4: Fix memory leak if QP creation failed
IB/mlx4: Verify port number in flow steering create flow
IB/mlx4: Fix error flow when sending mads under SRIOV
IB/mlx4: Fix the SQ size of an RC QP
IB/mlx5: Fix wrong naming of port_rcv_data counter
IB/mlx5: Fix post send fence logic
IB/uverbs: Initialize ib_qp_init_attr with zeros
IB/core: Fix false search of the IB_SA_WELL_KNOWN_GUID
IB/core: Fix RoCE v1 multicast join logic issue
IB/core: Fix no default GIDs when netdevice reregisters
IB/hfi1: Send a pkey change event on driver pkey update
...
Linus Torvalds [Sat, 25 Jun 2016 01:43:58 +0000 (18:43 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jikos/hid
Pull HID fix from Jiri Kosina:
"hiddev ioctl() validation fix from Scott Bauer"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: hiddev: validate num_values for HIDIOCGUSAGES, HIDIOCSUSAGES commands
Linus Torvalds [Sat, 25 Jun 2016 01:36:15 +0000 (18:36 -0700)]
Merge tag 'hwmon-for-linus-v4.7-rc5' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon fix from Guenter Roeck:
"Improve fan type detection for dell-smm to prevent kernel hang"
* tag 'hwmon-for-linus-v4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (dell-smm) Cache fan_type() calls and change fan detection
Linus Torvalds [Sat, 25 Jun 2016 01:29:55 +0000 (18:29 -0700)]
Merge tag 'acpi-4.7-rc5' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"Stable-candidate fix for a deadlock in ACPICA introduced during the
4.5 development cycle by a commit attempting to improve the handling
of AML code that doesn't belong to any namespace objects in a given
definition block (Lv Zheng)"
* tag 'acpi-4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPICA: Namespace: Fix deadlock triggered by MLC support in dynamic table loading
Linus Torvalds [Sat, 25 Jun 2016 01:03:22 +0000 (18:03 -0700)]
Merge tag 'pm-4.7-rc5' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"Fix for a latent cpufreq driver bug uncovered by a recent ACPICA
change and several fixes for the devfreq framework, including one fix
for an issue introduced recently.
Specifics:
- Fix a latent initialization issue in the pcc-cpufreq driver
(incorrect initial value of a structure field) that has been
uncovered by a recent ACPICA commit (Mike Galbraith).
- Add a missing notification in an update_devfreq() error code path
forgotten by a recent devfreq commit (Chanwoo Choi).
- Fix devfreq device frequency initialization (Lukasz Luba).
- Fix an incorrect IS_ERR() check in the devfreq framework discovered
by the Smatch checker (Dan Carpenter).
- Drop two excessive put_device() calls from the devfreq framework
(MyungJoo Ham, Cai Zhiyong).
- Fix a possible memory leak in the devfreq framework and drop an
unnecessary kfree() invocation from it (MyungJoo Ham)"
* tag 'pm-4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / devfreq: Send the DEVFREQ_POSTCHANGE notification when target() is failed
cpufreq: pcc-cpufreq: Fix doorbell.access_width
PM / devfreq: fix initialization of current frequency in last status
PM / devfreq: exynos-nocp: Remove incorrect IS_ERR() check
PM / devfreq: remove double put_device
PM / devfreq: fix double call put_device
PM / devfreq: fix duplicated kfree on devfreq pointer
PM / devfreq: devm_kzalloc to have dev pointer more precisely
Linus Torvalds [Sat, 25 Jun 2016 00:57:37 +0000 (17:57 -0700)]
Merge tag 'for-linus-4.7b-rc4-tag' of git://git./linux/kernel/git/xen/tip
Pull xen bug fixes from David Vrabel:
- fix x86 PV dom0 crash during early boot on some hardware
- fix two pciback bugs affects certain devices
- fix potential overflow when clearing page tables in x86 PV
* tag 'for-linus-4.7b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen-pciback: return proper values during BAR sizing
x86/xen: avoid m2p lookup when setting early page table entries
xen/pciback: Fix conf_space read/write overlap check.
x86/xen: fix upper bound of pmd loop in xen_cleanhighmap()
xen/balloon: Fix declared-but-not-defined warning
Linus Torvalds [Sat, 25 Jun 2016 00:51:14 +0000 (17:51 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Here are a few more arm64 fixes, but things do finally appear to be
slowing down. The main fix is avoiding hibernation in a previously
unanticipated situation where we have CPUs parked in the kernel, but
it's all good stuff.
- Fix icache/dcache sync for anonymous pages under migration
- Correct the ASID limit check
- Fix parallel builds of Image and Image.gz
- Refuse to hibernate when we have CPUs that we can't offline"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: hibernate: Don't hibernate on systems with stuck CPUs
arm64: smp: Add function to determine if cpus are stuck in the kernel
arm64: mm: remove page_mapping check in __sync_icache_dcache
arm64: fix boot image dependencies to not generate invalid images
arm64: update ASID limit
Rasmus Villemoes [Fri, 24 Jun 2016 21:50:30 +0000 (14:50 -0700)]
init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
When I replaced kasprintf("%pf") with a direct call to
sprint_symbol_no_offset I must have broken the initcall blacklisting
feature on the arches where dereference_function_descriptor() is
non-trivial.
Fixes:
c8cdd2be213f (init/main.c: simplify initcall_blacklisted())
Link: http://lkml.kernel.org/r/1466027283-4065-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Vagin [Fri, 24 Jun 2016 21:50:27 +0000 (14:50 -0700)]
autofs: don't get stuck in a loop if vfs_write() returns an error
__vfs_write() returns a negative value in a error case.
Link: http://lkml.kernel.org/r/20160616083108.6278.65815.stgit@pluto.themaw.net
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sudip Mukherjee [Fri, 24 Jun 2016 21:50:24 +0000 (14:50 -0700)]
mm/page_owner: avoid null pointer dereference
We have dereferenced page_ext before checking it. Lets check it first
and then used it.
Fixes:
f86e4271978b ("mm: check the return value of lookup_page_ext for all call sites")
Link: http://lkml.kernel.org/r/1465249059-7883-1-git-send-email-sudipm.mukherjee@gmail.com
Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Colin Ian King [Fri, 24 Jun 2016 21:50:21 +0000 (14:50 -0700)]
tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"
trivial fix to spelling mistake
Link: http://lkml.kernel.org/r/1466672144-831-1-git-send-email-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Torsten Hilbrich [Fri, 24 Jun 2016 21:50:18 +0000 (14:50 -0700)]
fs/nilfs2: fix potential underflow in call to crc32_le
The value `bytes' comes from the filesystem which is about to be
mounted. We cannot trust that the value is always in the range we
expect it to be.
Check its value before using it to calculate the length for the crc32_le
call. It value must be larger (or equal) sumoff + 4.
This fixes a kernel bug when accidentially mounting an image file which
had the nilfs2 magic value 0x3434 at the right offset 0x406 by chance.
The bytes 0x01 0x00 were stored at 0x408 and were interpreted as a
s_bytes value of 1. This caused an underflow when substracting sumoff +
4 (20) in the call to crc32_le.
BUG: unable to handle kernel paging request at
ffff88021e600000
IP: crc32_le+0x36/0x100
...
Call Trace:
nilfs_valid_sb.part.5+0x52/0x60 [nilfs2]
nilfs_load_super_block+0x142/0x300 [nilfs2]
init_nilfs+0x60/0x390 [nilfs2]
nilfs_mount+0x302/0x520 [nilfs2]
mount_fs+0x38/0x160
vfs_kern_mount+0x67/0x110
do_mount+0x269/0xe00
SyS_mount+0x9f/0x100
entry_SYSCALL_64_fastpath+0x16/0x71
Link: http://lkml.kernel.org/r/1466778587-5184-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:50:16 +0000 (14:50 -0700)]
oom, suspend: fix oom_reaper vs. oom_killer_disable race
Tetsuo has reported the following potential oom_killer_disable vs.
oom_reaper race:
(1) freeze_processes() starts freezing user space threads.
(2) Somebody (maybe a kenrel thread) calls out_of_memory().
(3) The OOM killer calls mark_oom_victim() on a user space thread
P1 which is already in __refrigerator().
(4) oom_killer_disable() sets oom_killer_disabled = true.
(5) P1 leaves __refrigerator() and enters do_exit().
(6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
exit_oom_victim(P1).
(7) oom_killer_disable() returns while P1 not yet finished
(8) P1 perform IO/interfere with the freezer.
This situation is unfortunate. We cannot move oom_killer_disable after
all the freezable kernel threads are frozen because the oom victim might
depend on some of those kthreads to make a forward progress to exit so
we could deadlock. It is also far from trivial to teach the oom_reaper
to not call exit_oom_victim() because then we would lose a guarantee of
the OOM killer and oom_killer_disable forward progress because
exit_mm->mmput might block and never call exit_oom_victim.
It seems the easiest way forward is to workaround this race by calling
try_to_freeze_tasks again after oom_killer_disable. This will make sure
that all the tasks are frozen or it bails out.
Fixes:
449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
Link: http://lkml.kernel.org/r/1466597634-16199-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gang He [Fri, 24 Jun 2016 21:50:13 +0000 (14:50 -0700)]
ocfs2: disable BUG assertions in reading blocks
According to some high-load testing, these two BUG assertions were
encountered, this led system panic. Actually, there were some
discussions about removing these two BUG() assertions, it would not
bring any side effect.
Then, I did the the following changes,
1) use the existing macro CATCH_BH_JBD_RACES to wrap BUG() in the
ocfs2_read_blocks_sync function like before.
2) disable the macro CATCH_BH_JBD_RACES in Makefile by default.
Link: http://lkml.kernel.org/r/1466574294-26863-1-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Fri, 24 Jun 2016 21:50:10 +0000 (14:50 -0700)]
mm, compaction: abort free scanner if split fails
If the memory compaction free scanner cannot successfully split a free
page (only possible due to per-zone low watermark), terminate the free
scanner rather than continuing to scan memory needlessly. If the
watermark is insufficient for a free page of order <= cc->order, then
terminate the scanner since all future splits will also likely fail.
This prevents the compaction freeing scanner from scanning all memory on
very large zones (very noticeable for zones > 128GB, for instance) when
all splits will likely fail while holding zone->lock.
compaction_alloc() iterating a 128GB zone has been benchmarked to take
over 400ms on some systems whereas any free page isolated and ready to
be split ends up failing in split_free_page() because of the low
watermark check and thus the iteration continues.
The next time compaction occurs, the freeing scanner will likely start
at the end of the zone again since no success was made previously and we
get the same lengthy iteration until the zone is brought above the low
watermark. All thp page faults can take >400ms in such a state without
this fix.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211820350.97086@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dmitry Vyukov [Fri, 24 Jun 2016 21:50:07 +0000 (14:50 -0700)]
mm: prevent KASAN false positives in kmemleak
When kmemleak dumps contents of leaked objects it reads whole objects
regardless of user-requested size. This upsets KASAN. Disable KASAN
checks around object dump.
Link: http://lkml.kernel.org/r/1466617631-68387-1-git-send-email-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gerald Schaefer [Fri, 24 Jun 2016 21:50:04 +0000 (14:50 -0700)]
mm/hugetlb: clear compound_mapcount when freeing gigantic pages
While working on s390 support for gigantic hugepages I ran into the
following "Bad page state" warning when freeing gigantic pages:
BUG: Bad page state in process bash pfn:580001
page:
000003d116000040 count:0 mapcount:0 mapping:
ffffffff00000000 index:0x0
flags: 0x7fffc0000000000()
page dumped because: non-NULL mapping
This is because page->compound_mapcount, which is part of a union with
page->mapping, is initialized with -1 in prep_compound_gigantic_page(),
and not cleared again during destroy_compound_gigantic_page(). Fix this
by clearing the compound_mapcount in destroy_compound_gigantic_page()
before clearing compound_head.
Interestingly enough, the warning will not show up on x86_64, although
this should not be architecture specific. Apparently there is an
endianness issue, combined with the fact that the union contains both a
64 bit ->mapping pointer and a 32 bit atomic_t ->compound_mapcount as
members. The resulting bogus page->mapping on x86_64 therefore contains
00000000ffffffff instead of
ffffffff00000000 on s390, which will falsely
trigger the PageAnon() check in free_pages_prepare() because
page->mapping & PAGE_MAPPING_ANON is true on little-endian architectures
like x86_64 in this case (the page is not compound anymore,
->compound_head was already cleared before). As a result, page->mapping
will be cleared before doing the checks in free_pages_check().
Not sure if the bogus "PageAnon() returning true" on x86_64 for the
first tail page of a gigantic page (at this stage) has other theoretical
implications, but they would also be fixed with this patch.
Link: http://lkml.kernel.org/r/1466612719-5642-1-git-send-email-gerald.schaefer@de.ibm.com
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lukasz Odzioba [Fri, 24 Jun 2016 21:50:01 +0000 (14:50 -0700)]
mm/swap.c: flush lru pvecs on compound page arrival
Currently we can have compound pages held on per cpu pagevecs, which
leads to a lot of memory unavailable for reclaim when needed. In the
systems with hundreads of processors it can be GBs of memory.
On of the way of reproducing the problem is to not call munmap
explicitly on all mapped regions (i.e. after receiving SIGTERM). After
that some pages (with THP enabled also huge pages) may end up on
lru_add_pvec, example below.
void main() {
#pragma omp parallel
{
size_t size = 55 * 1000 * 1000; // smaller than MEM/CPUS
void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS , -1, 0);
if (p != MAP_FAILED)
memset(p, 0, size);
//munmap(p, size); // uncomment to make the problem go away
}
}
When we run it with THP enabled it will leave significant amount of
memory on lru_add_pvec. This memory will be not reclaimed if we hit
OOM, so when we run above program in a loop:
for i in `seq 100`; do ./a.out; done
many processes (95% in my case) will be killed by OOM.
The primary point of the LRU add cache is to save the zone lru_lock
contention with a hope that more pages will belong to the same zone and
so their addition can be batched. The huge page is already a form of
batched addition (it will add 512 worth of memory in one go) so skipping
the batching seems like a safer option when compared to a potential
excess in the caching which can be quite large and much harder to fix
because lru_add_drain_all is way to expensive and it is not really clear
what would be a good moment to call it.
Similarly we can reproduce the problem on lru_deactivate_pvec by adding:
madvise(p, size, MADV_FREE); after memset.
This patch flushes lru pvecs on compound page arrival making the problem
less severe - after applying it kill rate of above example drops to 0%,
due to reducing maximum amount of memory held on pvec from 28MB (with
THP) to 56kB per CPU.
Suggested-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/1466180198-18854-1-git-send-email-lukasz.odzioba@intel.com
Signed-off-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Ming Li <mingli199x@qq.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Fri, 24 Jun 2016 21:49:58 +0000 (14:49 -0700)]
memcg: css_alloc should return an ERR_PTR value on error
mem_cgroup_css_alloc() was returning NULL on failure while cgroup core
expected it to return an ERR_PTR value leading to the following NULL
deref after a css allocation failure. Fix it by return
ERR_PTR(-ENOMEM) instead. I'll also update cgroup core so that it
can handle NULL returns.
mkdir: page allocation failure: order:6, mode:0x240c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO)
CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
...
Call Trace:
dump_stack+0x68/0xa1
warn_alloc_failed+0xd6/0x130
__alloc_pages_nodemask+0x4c6/0xf20
alloc_pages_current+0x66/0xe0
alloc_kmem_pages+0x14/0x80
kmalloc_order_trace+0x2a/0x1a0
__kmalloc+0x291/0x310
memcg_update_all_caches+0x6c/0x130
mem_cgroup_css_alloc+0x590/0x610
cgroup_apply_control_enable+0x18b/0x370
cgroup_mkdir+0x1de/0x2e0
kernfs_iop_mkdir+0x55/0x80
vfs_mkdir+0xb9/0x150
SyS_mkdir+0x66/0xd0
do_syscall_64+0x53/0x120
entry_SYSCALL64_slow_path+0x25/0x25
...
BUG: unable to handle kernel NULL pointer dereference at
00000000000000d0
IP: init_and_link_css+0x37/0x220
PGD
34b1e067 PUD
3a109067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in:
CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.2-20160422_131301-anatol 04/01/2014
task:
ffff88007cbc5200 ti:
ffff8800666d4000 task.ti:
ffff8800666d4000
RIP: 0010:[<
ffffffff810f2ca7>] [<
ffffffff810f2ca7>] init_and_link_css+0x37/0x220
RSP: 0018:
ffff8800666d7d90 EFLAGS:
00010246
RAX:
0000000000000000 RBX:
0000000000000000 RCX:
0000000000000000
RDX:
ffffffff810f2499 RSI:
0000000000000000 RDI:
0000000000000008
RBP:
ffff8800666d7db8 R08:
0000000000000003 R09:
0000000000000000
R10:
0000000000000001 R11:
0000000000000000 R12:
ffff88005a5fb400
R13:
ffffffff81f0f8a0 R14:
ffff88005a5fb400 R15:
0000000000000010
FS:
00007fc944689700(0000) GS:
ffff88007fc00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00007f3aed0d2b80 CR3:
000000003a1e8000 CR4:
00000000000006f0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Call Trace:
cgroup_apply_control_enable+0x1ac/0x370
cgroup_mkdir+0x1de/0x2e0
kernfs_iop_mkdir+0x55/0x80
vfs_mkdir+0xb9/0x150
SyS_mkdir+0x66/0xd0
do_syscall_64+0x53/0x120
entry_SYSCALL64_slow_path+0x25/0x25
Code: 89 f5 48 89 fb 49 89 d4 48 83 ec 08 8b 05 72 3b d8 00 85 c0 0f 85 60 01 00 00 4c 89 e7 e8 72 f7 ff ff 48 8d 7b 08 48 89 d9 31 c0 <48> c7 83 d0 00 00 00 00 00 00 00 48 83 e7 f8 48 29 f9 81 c1 d8
RIP init_and_link_css+0x37/0x220
RSP <
ffff8800666d7d90>
CR2:
00000000000000d0
---[ end trace
a2d8836ae1e852d1 ]---
Link: http://lkml.kernel.org/r/20160621165740.GJ3262@mtj.duckdns.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Fri, 24 Jun 2016 21:49:54 +0000 (14:49 -0700)]
memcg: mem_cgroup_migrate() may be called with irq disabled
mem_cgroup_migrate() uses local_irq_disable/enable() but can be called
with irq disabled from migrate_page_copy(). This ends up enabling irq
while holding a irq context lock triggering the following lockdep
warning. Fix it by using irq_save/restore instead.
=================================
[ INFO: inconsistent lock state ]
4.7.0-rc1+ #52 Tainted: G W
---------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&(&ctx->completion_lock)->rlock){+.?.-.}, at: [<
000000000038fd96>] aio_migratepage+0x156/0x1e8
{IN-SOFTIRQ-W} state was registered at:
__lock_acquire+0x5b6/0x1930
lock_acquire+0xee/0x270
_raw_spin_lock_irqsave+0x66/0xb0
aio_complete+0x98/0x328
dio_complete+0xe4/0x1e0
blk_update_request+0xd4/0x450
scsi_end_request+0x48/0x1c8
scsi_io_completion+0x272/0x698
blk_done_softirq+0xca/0xe8
__do_softirq+0xc8/0x518
irq_exit+0xee/0x110
do_IRQ+0x6a/0x88
io_int_handler+0x11a/0x25c
__mutex_unlock_slowpath+0x144/0x1d8
__mutex_unlock_slowpath+0x140/0x1d8
kernfs_iop_permission+0x64/0x80
__inode_permission+0x9e/0xf0
link_path_walk+0x6e/0x510
path_lookupat+0xc4/0x1a8
filename_lookup+0x9c/0x160
user_path_at_empty+0x5c/0x70
SyS_readlinkat+0x68/0x140
system_call+0xd6/0x270
irq event stamp: 971410
hardirqs last enabled at (971409): migrate_page_move_mapping+0x3ea/0x588
hardirqs last disabled at (971410): _raw_spin_lock_irqsave+0x3c/0xb0
softirqs last enabled at (970526): __do_softirq+0x460/0x518
softirqs last disabled at (970519): irq_exit+0xee/0x110
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&ctx->completion_lock)->rlock);
<Interrupt>
lock(&(&ctx->completion_lock)->rlock);
*** DEADLOCK ***
3 locks held by kcompactd0/151:
#0: (&(&mapping->private_lock)->rlock){+.+.-.}, at: aio_migratepage+0x42/0x1e8
#1: (&ctx->ring_lock){+.+.+.}, at: aio_migratepage+0x5a/0x1e8
#2: (&(&ctx->completion_lock)->rlock){+.?.-.}, at: aio_migratepage+0x156/0x1e8
stack backtrace:
CPU: 20 PID: 151 Comm: kcompactd0 Tainted: G W 4.7.0-rc1+ #52
Call Trace:
show_trace+0xea/0xf0
show_stack+0x72/0xf0
dump_stack+0x9a/0xd8
print_usage_bug.part.27+0x2d4/0x2e8
mark_lock+0x17e/0x758
mark_held_locks+0xa2/0xd0
trace_hardirqs_on_caller+0x140/0x1c0
mem_cgroup_migrate+0x266/0x370
aio_migratepage+0x16a/0x1e8
move_to_new_page+0xb0/0x260
migrate_pages+0x8f4/0x9f0
compact_zone+0x4dc/0xdc8
kcompactd_do_work+0x1aa/0x358
kcompactd+0xba/0x2c8
kthread+0x10a/0x110
kernel_thread_starter+0x6/0xc
kernel_thread_starter+0x0/0xc
INFO: lockdep is turned off.
Link: http://lkml.kernel.org/r/20160620184158.GO3262@mtj.duckdns.org
Link: http://lkml.kernel.org/g/5767CFE5.7080904@de.ibm.com
Fixes:
74485cf2bc85 ("mm: migrate: consolidate mem_cgroup_migrate() calls")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Fri, 24 Jun 2016 21:49:51 +0000 (14:49 -0700)]
hugetlb: fix nr_pmds accounting with shared page tables
We account HugeTLB's shared page table to all processes who share it.
The accounting happens during huge_pmd_share().
If somebody populates pud entry under us, we should decrease pagetable's
refcount and decrease nr_pmds of the process.
By mistake, I increase nr_pmds again in this case. :-/ It will lead to
"BUG: non-zero nr_pmds on freeing mm: 2" on process' exit.
Let's fix this by increasing nr_pmds only when we're sure that the page
table will be used.
Link: http://lkml.kernel.org/r/20160617122506.GC6534@node.shutemov.name
Fixes:
dc6c9a35b66b ("mm: account pmd page tables to the process")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: zhongjiang <zhongjiang@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Fri, 24 Jun 2016 21:49:48 +0000 (14:49 -0700)]
Revert "mm: disable fault around on emulated access bit architecture"
This reverts commit
d0834a6c2c5b0c76cfb806bd7dba6556d8b4edbb.
After revert of
5c0a85fad949 ("mm: make faultaround produce old ptes")
faultaround doesn't have dependencies on hardware accessed bit, so let's
revert this one too.
Link: http://lkml.kernel.org/r/1465893750-44080-3-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Fri, 24 Jun 2016 21:49:45 +0000 (14:49 -0700)]
Revert "mm: make faultaround produce old ptes"
This reverts commit
5c0a85fad949212b3e059692deecdeed74ae7ec7.
The commit causes ~6% regression in unixbench.
Let's revert it for now and consider other solution for reclaim problem
later.
Link: http://lkml.kernel.org/r/1465893750-44080-2-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Antoine Tenart [Fri, 24 Jun 2016 21:49:42 +0000 (14:49 -0700)]
mailmap: add Boris Brezillon's email
There are different versions of Boris' name and email in the log, and
one typo. Add his emails in mailmap to have all of his contributions
under the same name/email tuple.
Link: http://lkml.kernel.org/r/20160609130323.27706-2-antoine.tenart@free-electrons.com
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Antoine Tenart [Fri, 24 Jun 2016 21:49:39 +0000 (14:49 -0700)]
mailmap: add Antoine Tenart's email
I used "Antoine Ténart" at first but then moved to a name without accent
as this cause some issues from time to time... Add my email in the
mailmap file to have a consistent shortlog output.
Link: http://lkml.kernel.org/r/20160609130323.27706-1-antoine.tenart@free-electrons.com
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Cc: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Fri, 24 Jun 2016 21:49:37 +0000 (14:49 -0700)]
mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
Commit
d0164adc89f6 ("mm, page_alloc: distinguish between being unable
to sleep, unwilling to sleep and avoiding waking kswapd") modified
__GFP_WAIT to explicitly identify the difference between atomic callers
and those that were unwilling to sleep. Later the definition was
removed entirely.
The GFP_RECLAIM_MASK is the set of flags that affect watermark checking
and reclaim behaviour but __GFP_ATOMIC was never added. Without it,
atomic users of the slab allocator strip the __GFP_ATOMIC flag and
cannot access the page allocator atomic reserves. This patch addresses
the problem.
The user-visible impact depends on the workload but potentially atomic
allocations unnecessarily fail without this path.
Link: http://lkml.kernel.org/r/20160610093832.GK2527@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Marcin Wojtas <mw@semihalf.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org> [4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Fri, 24 Jun 2016 21:49:34 +0000 (14:49 -0700)]
mm: mempool: kasan: don't poot mempool objects in quarantine
Currently we may put reserved by mempool elements into quarantine via
kasan_kfree(). This is totally wrong since quarantine may really free
these objects. So when mempool will try to use such element,
use-after-free will happen. Or mempool may decide that it no longer
need that element and double-free it.
So don't put object into quarantine in kasan_kfree(), just poison it.
Rename kasan_kfree() to kasan_poison_kfree() to respect that.
Also, we shouldn't use kasan_slab_alloc()/kasan_krealloc() in
kasan_unpoison_element() because those functions may update allocation
stacktrace. This would be wrong for the most of the remove_element call
sites.
(The only call site where we may want to update alloc stacktrace is
in mempool_alloc(). Kmemleak solves this by calling
kmemleak_update_trace(), so we could make something like that too.
But this is out of scope of this patch).
Fixes:
55834c59098d ("mm: kasan: initial memory quarantine implementation")
Link: http://lkml.kernel.org/r/575977C3.1010905@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jon Mason [Fri, 24 Jun 2016 21:49:31 +0000 (14:49 -0700)]
MAINTAINERS: update Calgary IOMMU
Update the contact info for Muli, clean-up my name, and update the
mailing list to the IOMMU mailing list.
Link: http://lkml.kernel.org/r/1465493059-11840-2-git-send-email-jdmason@kudzu.us
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:49:28 +0000 (14:49 -0700)]
jbd2: get rid of superfluous __GFP_REPEAT
jbd2_alloc is explicit about its allocation preferences wrt. the
allocation size. Sub page allocations go to the slab allocator and
larger are using either the page allocator or vmalloc. This is all good
but the logic is unnecessarily complex.
1) as per Ted, the vmalloc fallback is a left-over:
: jbd2_alloc is only passed in the bh->b_size, which can't be PAGE_SIZE, so
: the code path that calls vmalloc() should never get called. When we
: conveted jbd2_alloc() to suppor sub-page size allocations in commit
:
d2eecb039368, there was an assumption that it could be called with a size
: greater than PAGE_SIZE, but that's certaily not true today.
Moreover vmalloc allocation might even lead to a deadlock because the
callers expect GFP_NOFS context while vmalloc is GFP_KERNEL.
2) __GFP_REPEAT for requests <= PAGE_ALLOC_COSTLY_ORDER is ignored
since the flag was introduced.
Let's simplify the code flow and use the slab allocator for sub-page
requests and the page allocator for others. Even though order > 0 is
not currently used as per above leave that option open.
Link: http://lkml.kernel.org/r/1464599699-30131-18-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:49:25 +0000 (14:49 -0700)]
unicore32: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.
PGALLOC_GFP uses __GFP_REPEAT but it is only used in pte_alloc_one,
pte_alloc_one_kernel which does order-0 request. This means that this
flag has never been actually useful here because it has always been used
only for PAGE_ALLOC_COSTLY requests.
Link: http://lkml.kernel.org/r/1464599699-30131-17-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:49:22 +0000 (14:49 -0700)]
tile: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.
pgtable_alloc_one uses __GFP_REPEAT flag for L2_USER_PGTABLE_ORDER but
the order is either 0 or 3 if L2_KERNEL_PGTABLE_SHIFT for HPAGE_SHIFT.
This means that this flag has never been actually useful here because it
has always been used only for PAGE_ALLOC_COSTLY requests.
Link: http://lkml.kernel.org/r/1464599699-30131-16-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:49:20 +0000 (14:49 -0700)]
sh: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.
PGALLOC_GFP uses __GFP_REPEAT but {pgd,pmd}_alloc allocate from
{pgd,pmd}_cache but both caches are allocating up to PAGE_SIZE objects.
This means that this flag has never been actually useful here because it
has always been used only for PAGE_ALLOC_COSTLY requests.
Link: http://lkml.kernel.org/r/1464599699-30131-15-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:49:17 +0000 (14:49 -0700)]
s390: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.
page_table_alloc then uses the flag for a single page allocation. This
means that this flag has never been actually useful here because it has
always been used only for PAGE_ALLOC_COSTLY requests.
Link: http://lkml.kernel.org/r/1464599699-30131-14-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:49:14 +0000 (14:49 -0700)]
sparc: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.
{pud,pmd}_alloc_one is using __GFP_REPEAT but it always allocates from
pgtable_cache which is initialzed to PAGE_SIZE objects. This means that
this flag has never been actually useful here because it has always been
used only for PAGE_ALLOC_COSTLY requests.
Link: http://lkml.kernel.org/r/1464599699-30131-13-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:49:12 +0000 (14:49 -0700)]
powerpc: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.
{pud,pmd}_alloc_one are allocating from {PGT,PUD}_CACHE initialized in
pgtable_cache_init which doesn't have larger than sizeof(void *) << 12
size and that fits into !costly allocation request size.
PGALLOC_GFP is used only in radix__pgd_alloc which uses either order-0
or order-4 requests. The first one doesn't need the flag while the
second does. Drop __GFP_REPEAT from PGALLOC_GFP and add it for the
order-4 one.
This means that this flag has never been actually useful here because it
has always been used only for PAGE_ALLOC_COSTLY requests.
Link: http://lkml.kernel.org/r/1464599699-30131-12-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:49:09 +0000 (14:49 -0700)]
score: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.
pte_alloc_one{_kernel} allocate PTE_ORDER which is 0. This means that
this flag has never been actually useful here because it has always been
used only for PAGE_ALLOC_COSTLY requests.
Link: http://lkml.kernel.org/r/1464599699-30131-11-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:49:06 +0000 (14:49 -0700)]
parisc: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.
pmd_alloc_one allocate PMD_ORDER which is 1. This means that this flag
has never been actually useful here because it has always been used only
for PAGE_ALLOC_COSTLY requests.
Link: http://lkml.kernel.org/r/1464599699-30131-10-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:49:04 +0000 (14:49 -0700)]
nios2: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.
pte_alloc_one{_kernel} allocate PTE_ORDER which is 0. This means that
this flag has never been actually useful here because it has always been
used only for PAGE_ALLOC_COSTLY requests.
Link: http://lkml.kernel.org/r/1464599699-30131-9-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Ley Foon Tan <lftan@altera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:49:01 +0000 (14:49 -0700)]
mips: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.
pte_alloc_one{_kernel}, pmd_alloc_one allocate PTE_ORDER resp.
PMD_ORDER but both are not larger than 1. This means that this flag has
never been actually useful here because it has always been used only for
PAGE_ALLOC_COSTLY requests.
Link: http://lkml.kernel.org/r/1464599699-30131-8-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: John Crispin <blogic@openwrt.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:48:58 +0000 (14:48 -0700)]
arc: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.
pte_alloc_one_kernel uses __get_order_pte but this is obviously always
zero because BITS_FOR_PTE is not larger than 9 yet the page size is
always larger than 4K. This means that this flag has never been
actually useful here because it has always been used only for
PAGE_ALLOC_COSTLY requests.
Link: http://lkml.kernel.org/r/1464599699-30131-7-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:48:56 +0000 (14:48 -0700)]
arm64: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.
{pte,pmd,pud}_alloc_one{_kernel}, late_pgtable_alloc use PGALLOC_GFP for
__get_free_page (aka order-0).
pgd_alloc is slightly more complex because it allocates from pgd_cache
if PGD_SIZE != PAGE_SIZE and PGD_SIZE depends on the configuration
(CONFIG_ARM64_VA_BITS, PAGE_SHIFT and CONFIG_PGTABLE_LEVELS).
As per
config PGTABLE_LEVELS
int
default 2 if ARM64_16K_PAGES && ARM64_VA_BITS_36
default 2 if ARM64_64K_PAGES && ARM64_VA_BITS_42
default 3 if ARM64_64K_PAGES && ARM64_VA_BITS_48
default 3 if ARM64_4K_PAGES && ARM64_VA_BITS_39
default 3 if ARM64_16K_PAGES && ARM64_VA_BITS_47
default 4 if !ARM64_64K_PAGES && ARM64_VA_BITS_48
we should have the following options
CONFIG_ARM64_VA_BITS:48 CONFIG_PGTABLE_LEVELS:4 PAGE_SIZE:4k size:4096 pages:1
CONFIG_ARM64_VA_BITS:48 CONFIG_PGTABLE_LEVELS:4 PAGE_SIZE:16k size:16 pages:1
CONFIG_ARM64_VA_BITS:48 CONFIG_PGTABLE_LEVELS:3 PAGE_SIZE:64k size:512 pages:1
CONFIG_ARM64_VA_BITS:47 CONFIG_PGTABLE_LEVELS:3 PAGE_SIZE:16k size:16384 pages:1
CONFIG_ARM64_VA_BITS:42 CONFIG_PGTABLE_LEVELS:2 PAGE_SIZE:64k size:65536 pages:1
CONFIG_ARM64_VA_BITS:39 CONFIG_PGTABLE_LEVELS:3 PAGE_SIZE:4k size:4096 pages:1
CONFIG_ARM64_VA_BITS:36 CONFIG_PGTABLE_LEVELS:2 PAGE_SIZE:16k size:16384 pages:1
All of them fit into a single page (aka order-0). This means that this
flag has never been actually useful here because it has always been used
only for PAGE_ALLOC_COSTLY requests.
Link: http://lkml.kernel.org/r/1464599699-30131-6-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:48:53 +0000 (14:48 -0700)]
x86/efi: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.
efi_alloc_page_tables uses __GFP_REPEAT but it allocates an order-0
page. This means that this flag has never been actually useful here
because it has always been used only for PAGE_ALLOC_COSTLY requests.
Link: http://lkml.kernel.org/r/1464599699-30131-4-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:48:50 +0000 (14:48 -0700)]
x86: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.
PGALLOC_GFP uses __GFP_REPEAT but none of the allocation which uses this
flag is for more than order-0. This means that this flag has never been
actually useful here because it has always been used only for
PAGE_ALLOC_COSTLY requests.
Link: http://lkml.kernel.org/r/1464599699-30131-3-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 24 Jun 2016 21:48:47 +0000 (14:48 -0700)]
tree wide: get rid of __GFP_REPEAT for order-0 allocations part I
This is the third version of the patchset previously sent [1]. I have
basically only rebased it on top of 4.7-rc1 tree and dropped "dm: get
rid of superfluous gfp flags" which went through dm tree. I am sending
it now because it is tree wide and chances for conflicts are reduced
considerably when we want to target rc2. I plan to send the next step
and rename the flag and move to a better semantic later during this
release cycle so we will have a new semantic ready for 4.8 merge window
hopefully.
Motivation:
While working on something unrelated I've checked the current usage of
__GFP_REPEAT in the tree. It seems that a majority of the usage is and
always has been bogus because __GFP_REPEAT has always been about costly
high order allocations while we are using it for order-0 or very small
orders very often. It seems that a big pile of them is just a
copy&paste when a code has been adopted from one arch to another.
I think it makes some sense to get rid of them because they are just
making the semantic more unclear. Please note that GFP_REPEAT is
documented as
* __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
* _might_ fail. This depends upon the particular VM implementation.
while !costly requests have basically nofail semantic. So one could
reasonably expect that order-0 request with __GFP_REPEAT will not loop
for ever. This is not implemented right now though.
I would like to move on with __GFP_REPEAT and define a better semantic
for it.
$ git grep __GFP_REPEAT origin/master | wc -l
111
$ git grep __GFP_REPEAT | wc -l
36
So we are down to the third after this patch series. The remaining
places really seem to be relying on __GFP_REPEAT due to large allocation
requests. This still needs some double checking which I will do later
after all the simple ones are sorted out.
I am touching a lot of arch specific code here and I hope I got it right
but as a matter of fact I even didn't compile test for some archs as I
do not have cross compiler for them. Patches should be quite trivial to
review for stupid compile mistakes though. The tricky parts are usually
hidden by macro definitions and thats where I would appreciate help from
arch maintainers.
[1] http://lkml.kernel.org/r/
1461849846-27209-1-git-send-email-mhocko@kernel.org
This patch (of 19):
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations. Yet we
have the full kernel tree with its usage for apparently order-0
allocations. This is really confusing because __GFP_REPEAT is
explicitly documented to allow allocation failures which is a weaker
semantic than the current order-0 has (basically nofail).
Let's simply drop __GFP_REPEAT from those places. This would allow to
identify place which really need allocator to retry harder and formulate
a more specific semantic for what the flag is supposed to do actually.
Link: http://lkml.kernel.org/r/1464599699-30131-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: John Crispin <blogic@openwrt.org>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Anthony Romano [Fri, 24 Jun 2016 21:48:43 +0000 (14:48 -0700)]
tmpfs: don't undo fallocate past its last page
When fallocate is interrupted it will undo a range that extends one byte
past its range of allocated pages. This can corrupt an in-use page by
zeroing out its first byte. Instead, undo using the inclusive byte
range.
Fixes:
1635f6a74152f1d ("tmpfs: undo fallocation on failure")
Link: http://lkml.kernel.org/r/1462713387-16724-1-git-send-email-anthony.romano@coreos.com
Signed-off-by: Anthony Romano <anthony.romano@coreos.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Brandon Philips <brandon@ifup.co>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Fri, 24 Jun 2016 21:48:40 +0000 (14:48 -0700)]
selftests/vm/compaction_test: fix write to restore nr_hugepages
The write at the end of the test to restore nr_hugepages to its previous
value is failing. This is because it is trying to write the number of
bytes in the char array as opposed to the number of bytes in the string.
Link: http://lkml.kernel.org/r/1465331205-3284-1-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Sri Jayaramappa <sjayaram@akamai.com>
Cc: Eric B Munson <emunson@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Fri, 24 Jun 2016 21:48:38 +0000 (14:48 -0700)]
oom_reaper: avoid pointless atomic_inc_not_zero usage.
Since commit
36324a990cf5 ("oom: clear TIF_MEMDIE after oom_reaper
managed to unmap the address space") changed to use find_lock_task_mm()
for finding a mm_struct to reap, it is guaranteed that mm->mm_users > 0
because find_lock_task_mm() returns a task_struct with ->mm != NULL.
Therefore, we can safely use atomic_inc().
Link: http://lkml.kernel.org/r/1465024759-8074-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>