Damien Le Moal [Thu, 6 Jul 2017 11:21:15 +0000 (20:21 +0900)]
block: Fix __blkdev_issue_zeroout loop
The BIO issuing loop in __blkdev_issue_zeroout() is allocating BIOs
with a maximum number of bvec (pages) equal to
min(nr_sects, (sector_t)BIO_MAX_PAGES)
This works since the requested number of bvecs will always be limited
to the absolute maximum number supported (BIO_MAX_PAGES), but this is
ineficient as too many bvec entries may be requested due to the
different units being used in the min() operation (number of sectors vs
number of pages).
To fix this, introduce the helper __blkdev_sectors_to_bio_pages() to
correctly calculate the number of bvecs for zeroout BIOs as the issuing
loop progresses. The calculation is done using consistent units and
makes sure that the number of pages return is at least 1 (for cases
where the number of sectors is less that the number of sectors in
a page).
Also remove a trailing space after the bit shift in the internal loop
min() call.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 5 Jul 2017 04:14:27 +0000 (12:14 +0800)]
mtip32xx: avoid to read HOST_CAP from HW in .queue_rq()
It is observed reading the register from HW takes a bit long,
for example in my box, the following difference of 'perf report
--no-children fio ...' can be seen when running I/O:
1) V4.12 without patch
+ 9.28% fio [mtip32xx] [k] mtip_irq_handler
+ 8.48% fio [mtip32xx] [k] mtip_init_cmd_header
2) V4.12 with the following patch
+ 9.14% fio [mtip32xx] [k] mtip_irq_handler
......
+ 1.14% fio [mtip32xx] [k] mtip_init_cmd_header
IOPS can be increased by ~5% with this patch too.
Fixes:
a4e84aae8139(mtip32xx: use runtime tag to initialize command header)
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
kbuild test robot [Tue, 4 Jul 2017 18:14:10 +0000 (02:14 +0800)]
bio-integrity: fix boolreturn.cocci warnings
block/bio-integrity.c:318:10-11: WARNING: return of 0/1 in function 'bio_integrity_prep' with return type bool
Return statements in functions returning bool should use
true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci
Fixes:
e23947bd76f0 ("bio-integrity: fold bio_integrity_enabled to bio_integrity_prep")
CC: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 3 Jul 2017 22:58:43 +0000 (16:58 -0600)]
bio-integrity: stop abusing bi_end_io
And instead call directly into the integrity code from bio_end_io.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dmitry Monakhov [Thu, 29 Jun 2017 18:31:15 +0000 (11:31 -0700)]
bio-integrity: Restore original iterator on verify stage
Currently ->verify_fn not woks at all because at the moment it is called
bio->bi_iter.bi_size == 0, so we do not iterate integrity bvecs at all.
In order to perform verification we need to know original data vector,
with new bvec rewind API this is trivial.
testcase: https://github.com/dmonakhov/xfstests/commit/
3c6509eaa83b9c17cd0bc95d73fcdd76e1c54a85
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
[hch: adopted for new status values]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dmitry Monakhov [Thu, 29 Jun 2017 18:31:14 +0000 (11:31 -0700)]
bio: add bvec_iter rewind API
Some ->bi_end_io handlers (for example: pi_verify or decrypt handlers)
need to know original data vector, but after bio traverse io-stack it may
be advanced, splited and relocated many times so it is hard to guess
original iterator. Let's add 'bi_done' conter which accounts number
of bytes iterator was advanced during it's evolution. Later end_io handler
may easily restore original iterator by rewinding iterator to
iter->bi_done.
Note: this change makes sizeof (struct bvec_iter) multiple to 8
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
[hch: switched to true/false return]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dmitry Monakhov [Thu, 29 Jun 2017 18:31:13 +0000 (11:31 -0700)]
block: guard bvec iteration logic
Currently if some one try to advance bvec beyond it's size we simply
dump WARN_ONCE and continue to iterate beyond bvec array boundaries.
This simply means that we endup dereferencing/corrupting random memory
region.
Sane reaction would be to propagate error back to calling context
But bvec_iter_advance's calling context is not always good for error
handling. For safity reason let truncate iterator size to zero which
will break external iteration loop which prevent us from unpredictable
memory range corruption. And even it caller ignores an error, it will
corrupt it's own bvecs, not others.
This patch does:
- Return error back to caller with hope that it will react on this
- Truncate iterator size
Code was added long time ago here
4550dd6c, luckily no one hit it
in real life :)
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
[hch: switch to true/false returns instead of errno values]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dmitry Monakhov [Thu, 29 Jun 2017 18:31:12 +0000 (11:31 -0700)]
t10-pi: Move opencoded contants to common header
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dmitry Monakhov [Thu, 29 Jun 2017 18:31:11 +0000 (11:31 -0700)]
bio-integrity: fold bio_integrity_enabled to bio_integrity_prep
Currently all integrity prep hooks are open-coded, and if prepare fails
we ignore it's code and fail bio with EIO. Let's return real error to
upper layer, so later caller may react accordingly.
In fact no one want to use bio_integrity_prep() w/o bio_integrity_enabled,
so it is reasonable to fold it in to one function.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
[hch: merged with the latest block tree,
return bool from bio_integrity_prep]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dmitry Monakhov [Thu, 29 Jun 2017 18:31:10 +0000 (11:31 -0700)]
bio-integrity: fix interface for bio_integrity_trim
bio_integrity_trim inherent it's interface from bio_trim and accept
offset and size, but this API is error prone because data offset
must always be insync with bio's data offset. That is why we have
integrity update hook in bio_advance()
So only meaningful values are: offset == 0, sectors == bio_sectors(bio)
Let's just remove them completely.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dmitry Monakhov [Thu, 29 Jun 2017 18:31:09 +0000 (11:31 -0700)]
bio-integrity: bio_integrity_advance must update integrity seed
SCSI drivers do care about bip_seed so we must update it accordingly.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dmitry Monakhov [Thu, 29 Jun 2017 18:31:08 +0000 (11:31 -0700)]
bio-integrity: bio_trim should truncate integrity vector accordingly
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 3 Jul 2017 12:37:14 +0000 (20:37 +0800)]
blk-mq-sched: fix performance regression of mq-deadline
When mq-deadline is taken, IOPS of sequential read and
seqential write is observed more than 20% drop on sata(scsi-mq)
devices, compared with using 'none' scheduler.
The reason is that the default nr_requests for scheduler is
too big for small queuedepth devices, and latency is increased
much.
Since the principle of taking 256 requests for mq scheduler
is based on 128 queue depth, this patch changes into
double size of min(hw queue_depth, 128).
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Paolo Valente [Mon, 3 Jul 2017 08:00:10 +0000 (10:00 +0200)]
block, bfq: don't change ioprio class for a bfq_queue on a service tree
On each deactivation or re-scheduling (after being served) of a
bfq_queue, BFQ invokes the function __bfq_entity_update_weight_prio(),
to perform pending updates of ioprio, weight and ioprio class for the
bfq_queue. BFQ also invokes this function on I/O-request dispatches,
to raise or lower weights more quickly when needed, thereby improving
latency. However, the entity representing the bfq_queue may be on the
active (sub)tree of a service tree when this happens, and, although
with a very low probability, the bfq_queue may happen to also have a
pending change of its ioprio class. If both conditions hold when
__bfq_entity_update_weight_prio() is invoked, then the entity moves to
a sort of hybrid state: the new service tree for the entity, as
returned by bfq_entity_service_tree(), differs from service tree on
which the entity still is. The functions that handle activations and
deactivations of entities do not cope with such a hybrid state (and
would need to become more complex to cope).
This commit addresses this issue by just making
__bfq_entity_update_weight_prio() not perform also a possible pending
change of ioprio class, when invoked on an I/O-request dispatch for a
bfq_queue. Such a change is thus postponed to when
__bfq_entity_update_weight_prio() is invoked on deactivation or
re-scheduling of the bfq_queue.
Reported-by: Marco Piazza <mpiazza@gmail.com>
Reported-by: Laurentiu Nicola <lnicola@dend.ro>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Marco Piazza <mpiazza@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Mon, 3 Jul 2017 21:45:09 +0000 (14:45 -0700)]
Merge branch 'x86-mm-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
"The main changes in this cycle were:
- Continued work to add support for 5-level paging provided by future
Intel CPUs. In particular we switch the x86 GUP code to the generic
implementation. (Kirill A. Shutemov)
- Continued work to add PCID CPU support to native kernels as well.
In this round most of the focus is on reworking/refreshing the TLB
flush infrastructure for the upcoming PCID changes. (Andy
Lutomirski)"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
x86/mm: Delete a big outdated comment about TLB flushing
x86/mm: Don't reenter flush_tlb_func_common()
x86/KASLR: Fix detection 32/64 bit bootloaders for 5-level paging
x86/ftrace: Exclude functions in head64.c from function-tracing
x86/mmap, ASLR: Do not treat unlimited-stack tasks as legacy mmap
x86/mm: Remove reset_lazy_tlbstate()
x86/ldt: Simplify the LDT switching logic
x86/boot/64: Put __startup_64() into .head.text
x86/mm: Add support for 5-level paging for KASLR
x86/mm: Make kernel_physical_mapping_init() support 5-level paging
x86/mm: Add sync_global_pgds() for configuration with 5-level paging
x86/boot/64: Add support of additional page table level during early boot
x86/boot/64: Rename init_level4_pgt and early_level4_pgt
x86/boot/64: Rewrite startup_64() in C
x86/boot/compressed: Enable 5-level paging during decompression stage
x86/boot/efi: Define __KERNEL32_CS GDT on 64-bit configurations
x86/boot/efi: Fix __KERNEL_CS definition of GDT entry on 64-bit configurations
x86/boot/efi: Cleanup initialization of GDT entries
x86/asm: Fix comment in return_from_SYSCALL_64()
x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation
...
Linus Torvalds [Mon, 3 Jul 2017 21:35:18 +0000 (14:35 -0700)]
Merge branch 'x86-microcode-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 microcode updates from Ingo Molnar:
"The main changes are a fix early microcode application for
resume-from-RAM, plus a 32-bit initrd placement fix - by Borislav
Petkov"
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Make a couple of symbols static
x86/microcode/intel: Save pointer to ucode patch for early AP loading
x86/microcode: Look for the initrd at the correct address on 32-bit
Linus Torvalds [Mon, 3 Jul 2017 21:09:32 +0000 (14:09 -0700)]
Merge branch 'x86-hyperv-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 hyperv updates from Ingo Molnar:
"Avoid boot time TSC calibration on Hyper-V hosts, to improve
calibration robustness. (Vitaly Kuznetsov)"
* 'x86-hyperv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/hyperv: Read TSC frequency from a synthetic MSR
x86/hyperv: Check frequency MSRs presence according to the specification
Linus Torvalds [Mon, 3 Jul 2017 21:07:45 +0000 (14:07 -0700)]
Merge branch 'x86-debug-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 debug update from Ingo Molnar:
"A single fix for an off-by one bug in test_nmi_ipi() that probably
doesn't matter in practice"
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Fix timeout test in test_nmi_ipi()
Linus Torvalds [Mon, 3 Jul 2017 20:42:17 +0000 (13:42 -0700)]
Merge branch 'x86-cleanups-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
"Two small cleanups"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/paravirt: Remove unnecessary return from void function
x86/boot: Add missing strchr() declaration
Linus Torvalds [Mon, 3 Jul 2017 20:40:38 +0000 (13:40 -0700)]
Merge branch 'x86-boot-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:
"The main changes in this cycle were KASLR improvements for rare
environments with special boot options, by Baoquan He. Also misc
smaller changes/cleanups"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/debug: Extend the lower bound of crash kernel low reservations
x86/boot: Remove unused copy_*_gs() functions
x86/KASLR: Use the right memcpy() implementation
Documentation/kernel-parameters.txt: Update 'memmap=' boot option description
x86/KASLR: Handle the memory limit specified by the 'memmap=' and 'mem=' boot options
x86/KASLR: Parse all 'memmap=' boot option entries
Linus Torvalds [Mon, 3 Jul 2017 20:38:28 +0000 (13:38 -0700)]
Merge branch 'x86-asm-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
"A single commit micro-optimizing short user copies on certain Intel
CPUs"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings
Linus Torvalds [Mon, 3 Jul 2017 20:36:23 +0000 (13:36 -0700)]
Merge branch 'x86-apic-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 apic updates from Ingo Molnar:
"Janitorial changes: removal of an unused function plus __init
annotations"
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Make arch_init_msi/htirq_domain __init
x86/apic: Make init_legacy_irqs() __init
x86/ioapic: Remove unused IO_APIC_irq_trigger() function
Linus Torvalds [Mon, 3 Jul 2017 20:33:57 +0000 (13:33 -0700)]
Merge branch 'timers-nohz-for-linus' of git://git./linux/kernel/git/tip/tip
Pull nohz updates from Ingo Molnar:
"The main changes in this cycle relate to fixing another bad (but
sporadic and hard to detect) interaction between the dynticks
scheduler tick and hrtimers, plus related improvements to better
detection and handling of similar problems - by Frédéric Weisbecker"
* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
nohz: Fix spurious warning when hrtimer and clockevent get out of sync
nohz: Fix buggy tick delay on IRQ storms
nohz: Reset next_tick cache even when the timer has no regs
nohz: Fix collision between tick and other hrtimers, again
nohz: Add hrtimer sanity check
Linus Torvalds [Mon, 3 Jul 2017 20:08:04 +0000 (13:08 -0700)]
Merge branch 'sched-core-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Add the SYSTEM_SCHEDULING bootup state to move various scheduler
debug checks earlier into the bootup. This turns silent and
sporadically deadly bugs into nice, deterministic splats. Fix some
of the splats that triggered. (Thomas Gleixner)
- A round of restructuring and refactoring of the load-balancing and
topology code (Peter Zijlstra)
- Another round of consolidating ~20 of incremental scheduler code
history: this time in terms of wait-queue nomenclature. (I didn't
get much feedback on these renaming patches, and we can still
easily change any names I might have misplaced, so if anyone hates
a new name, please holler and I'll fix it.) (Ingo Molnar)
- sched/numa improvements, fixes and updates (Rik van Riel)
- Another round of x86/tsc scheduler clock code improvements, in hope
of making it more robust (Peter Zijlstra)
- Improve NOHZ behavior (Frederic Weisbecker)
- Deadline scheduler improvements and fixes (Luca Abeni, Daniel
Bristot de Oliveira)
- Simplify and optimize the topology setup code (Lauro Ramos
Venancio)
- Debloat and decouple scheduler code some more (Nicolas Pitre)
- Simplify code by making better use of llist primitives (Byungchul
Park)
- ... plus other fixes and improvements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
sched/cputime: Refactor the cputime_adjust() code
sched/debug: Expose the number of RT/DL tasks that can migrate
sched/numa: Hide numa_wake_affine() from UP build
sched/fair: Remove effective_load()
sched/numa: Implement NUMA node level wake_affine()
sched/fair: Simplify wake_affine() for the single socket case
sched/numa: Override part of migrate_degrades_locality() when idle balancing
sched/rt: Move RT related code from sched/core.c to sched/rt.c
sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
sched/fair: Spare idle load balancing on nohz_full CPUs
nohz: Move idle balancer registration to the idle path
sched/loadavg: Generalize "_idle" naming to "_nohz"
sched/core: Drop the unused try_get_task_struct() helper function
sched/fair: WARN() and refuse to set buddy when !se->on_rq
sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
...
Linus Torvalds [Mon, 3 Jul 2017 19:40:46 +0000 (12:40 -0700)]
Merge branch 'perf-core-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
"Most of the changes are for tooling, the main changes in this cycle were:
- Improve Intel-PT hardware tracing support, both on the kernel and
on the tooling side: PTWRITE instruction support, power events for
C-state tracing, etc. (Adrian Hunter)
- Add support to measure SMI cost to the x86 architecture, with
tooling support in 'perf stat' (Kan Liang)
- Support function filtering in 'perf ftrace', plus related
improvements (Namhyung Kim)
- Allow adding and removing fields to the default 'perf script'
columns, using + or - as field prefixes to do so (Andi Kleen)
- Allow resolving the DSO name with 'perf script -F brstack{sym,off},dso'
(Mark Santaniello)
- Add perf tooling unwind support for PowerPC (Paolo Bonzini)
- ... and various other improvements as well"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (84 commits)
perf auxtrace: Add CPU filter support
perf intel-pt: Do not use TSC packets for calculating CPU cycles to TSC
perf intel-pt: Update documentation to include new ptwrite and power events
perf intel-pt: Add example script for power events and PTWRITE
perf intel-pt: Synthesize new power and "ptwrite" events
perf intel-pt: Move code in intel_pt_synth_events() to simplify attr setting
perf intel-pt: Factor out intel_pt_set_event_name()
perf intel-pt: Tidy messages into called function intel_pt_synth_event()
perf intel-pt: Tidy Intel PT evsel lookup into separate function
perf intel-pt: Join needlessly wrapped lines
perf intel-pt: Remove unused instructions_sample_period
perf intel-pt: Factor out common code synthesizing event samples
perf script: Add synthesized Intel PT power and ptwrite events
perf/x86/intel: Constify the 'lbr_desc[]' array and make a function static
perf script: Add 'synth' field for synthesized event payloads
perf auxtrace: Add itrace option to output power events
perf auxtrace: Add itrace option to output ptwrite events
tools include: Add byte-swapping macros to kernel.h
perf script: Add 'synth' event type for synthesized events
x86/insn: perf tools: Add new ptwrite instruction
...
Linus Torvalds [Mon, 3 Jul 2017 19:14:18 +0000 (12:14 -0700)]
Merge branch 'locking-core-for-linus' of git://git./linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- Add CONFIG_REFCOUNT_FULL=y to allow the disabling of the 'full'
(robustness checked) refcount_t implementation with slightly lower
runtime overhead. (Kees Cook)
The lighter weight variant is the default. The two variants use the
same API. Having this variant was a precondition by some
maintainers to merge refcount_t cleanups.
- Add lockdep support for rtmutexes (Peter Zijlstra)
- liblockdep fixes and improvements (Sasha Levin, Ben Hutchings)
- ... misc fixes and improvements"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
locking/refcount: Remove the half-implemented refcount_sub() API
locking/refcount: Create unchecked atomic_t implementation
locking/rtmutex: Don't initialize lockdep when not required
locking/selftest: Add RT-mutex support
locking/selftest: Remove the bad unlock ordering test
rt_mutex: Add lockdep annotations
MAINTAINERS: Claim atomic*_t maintainership
locking/x86: Remove the unused atomic_inc_short() methd
tools/lib/lockdep: Remove private kernel headers
tools/lib/lockdep: Hide liblockdep output from test results
tools/lib/lockdep: Add dummy current_gfp_context()
tools/include: Add IS_ERR_OR_NULL to err.h
tools/lib/lockdep: Add empty __is_[module,kernel]_percpu_address
tools/lib/lockdep: Include err.h
tools/include: Add (mostly) empty include/linux/sched/mm.h
tools/lib/lockdep: Use LDFLAGS
tools/lib/lockdep: Remove double-quotes from soname
tools/lib/lockdep: Fix object file paths used in an out-of-tree build
tools/lib/lockdep: Fix compilation for 4.11
tools/lib/lockdep: Don't mix fd-based and stream IO
...
Linus Torvalds [Mon, 3 Jul 2017 19:12:05 +0000 (12:12 -0700)]
Merge branch 'efi-core-for-linus' of git://git./linux/kernel/git/tip/tip
Pull EFI updates from Ingo Molnar:
"The main changes in this cycle were:
- Rework the EFI capsule loader to allow for workarounds for
non-compliant firmware (Ard Biesheuvel)
- Implement a capsule loader quirk for Quark X102x (Jan Kiszka)
- Enable SMBIOS/DMI support for the ARM architecture (Ard Biesheuvel)
- Add CONFIG_EFI_PGT_DUMP=y support for x86-32 and kexec (Sai
Praneeth)
- Fixes for EFI support for Xen dom0 guests running under x86-64
hosts (Daniel Kiper)"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/xen/efi: Initialize only the EFI struct members used by Xen
efi: Process the MEMATTR table only if EFI_MEMMAP is enabled
efi/arm: Enable DMI/SMBIOS
x86/efi: Extend CONFIG_EFI_PGT_DUMP support to x86_32 and kexec as well
efi/efi_test: Use memdup_user() helper
efi/capsule: Add support for Quark security header
efi/capsule-loader: Use page addresses rather than struct page pointers
efi/capsule-loader: Redirect calls to efi_capsule_setup_info() via weak alias
efi/capsule: Remove NULL test on kmap()
efi/capsule-loader: Use a cached copy of the capsule header
efi/capsule: Adjust return type of efi_capsule_setup_info()
efi/capsule: Clean up pr_err/_info() messages
efi/capsule: Remove pr_debug() on ENOMEM or EFAULT
efi/capsule: Fix return code on failing kmap/vmap
Linus Torvalds [Mon, 3 Jul 2017 18:34:53 +0000 (11:34 -0700)]
Merge branch 'core-rcu-for-linus' of git://git./linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
"The sole purpose of these changes is to shrink and simplify the RCU
code base, which has suffered from creeping bloat over the past couple
of years. The end result is a net removal of ~2700 lines of code:
79 files changed, 1496 insertions(+), 4211 deletions(-)
Plus there's a marked reduction in the Kconfig space complexity as
well, here's the number of matches on 'grep RCU' in the .config:
before after
x86-defconfig 17 15
x86-allmodconfig 33 20"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (86 commits)
rcu: Remove RCU CPU stall warnings from Tiny RCU
rcu: Remove event tracing from Tiny RCU
rcu: Move RCU debug Kconfig options to kernel/rcu
rcu: Move RCU non-debug Kconfig options to kernel/rcu
rcu: Eliminate NOCBs CPU-state Kconfig options
rcu: Remove debugfs tracing
srcu: Remove Classic SRCU
srcu: Fix rcutorture-statistics typo
rcu: Remove SPARSE_RCU_POINTER Kconfig option
rcu: Remove the now-obsolete PROVE_RCU_REPEATEDLY Kconfig option
rcu: Remove typecheck() from RCU locking wrapper functions
rcu: Remove #ifdef moving rcu_end_inkernel_boot from rcupdate.h
rcu: Remove nohz_full full-system-idle state machine
rcu: Remove the RCU_KTHREAD_PRIO Kconfig option
rcu: Remove *_SLOW_* Kconfig options
srcu: Use rnp->lock wrappers to replace explicit memory barriers
rcu: Move rnp->lock wrappers for SRCU use
rcu: Convert rnp->lock wrappers to macros for SRCU use
rcu: Refactor #includes from include/linux/rcupdate.h
bcm47xx: Fix build regression
...
Linus Torvalds [Mon, 3 Jul 2017 18:12:04 +0000 (11:12 -0700)]
Merge branch 'core-objtool-for-linus' of git://git./linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
"This is an extensive rewrite of the objdump tool to track all stack
pointer modifications through the machine instructions of disassembled
functions found in kernel .o files.
This re-design removes the prior dependency on CONFIG_FRAME_POINTERS,
with the goal to prepare the tool to generate kernel debuginfo data in
the future. There's also an increase in checking/tracking robustness
as a side effect as well.
No (intended) changes to existing functionality"
* 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Silence warnings for functions which use IRET
objtool: Implement stack validation 2.0
objtool, x86: Add several functions and files to the objtool whitelist
objtool: Move checking code to check.c
Linus Torvalds [Mon, 3 Jul 2017 18:09:38 +0000 (11:09 -0700)]
Merge tag 'edac_for_4.13' of git://git./linux/kernel/git/bp/bp
Pull EDAC updates from Borislav Petkov:
"Nothing earth-shattering - just the normal development flow of
cleanups, improvements, fixes and such.
Summary:
- i31200_edac: Add Kabylake support (Jason Baron)
- sb_edac: resolve memory controller detection issues on asymmetric
setups with not all DIMM slots being populated (Tony Luck and Qiuxu
Zhuo)
- misc cleanups and fixlets all over"
* tag 'edac_for_4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (22 commits)
EDAC, pnd2: Fix Apollo Lake DIMM detection
EDAC, i5000, i5400: Fix definition of NRECMEMB register
EDAC, pnd2: Make function sbi_send() static
EDAC, pnd2: Return proper error value from apl_rd_reg()
EDAC, altera: Simplify calculation of total memory
EDAC, sb_edac: Avoid creating SOCK memory controller
EDAC, mce_amd: Fix typo in SMCA error description
EDAC, mv64x60: Sanity check edac_op_state before registering
EDAC, thunderx: Fix a warning during l2c debugfs node creation
EDAC, mv64x60: Check driver registration success
EDAC, ie31200: Add Intel Kaby Lake CPU support
EDAC, mv64x60: Replace in_le32()/out_le32() with readl()/writel()
EDAC, mv64x60: Fix pdata->name
EDAC, sb_edac: Bump driver version and do some cleanups
EDAC, sb_edac: Check if ECC enabled when at least one DIMM is present
EDAC, sb_edac: Drop NUM_CHANNELS from 8 back to 4
EDAC, sb_edac: Carve out dimm-populating loop
EDAC, sb_edac: Fix mod_name
EDAC, sb_edac: Assign EDAC memory controller per h/w controller
EDAC, sb_edac: Don't use "Socket#" in the memory controller name
...
Linus Torvalds [Mon, 3 Jul 2017 17:34:51 +0000 (10:34 -0700)]
Merge branch 'for-4.13/block' of git://git.kernel.dk/linux-block
Pull core block/IO updates from Jens Axboe:
"This is the main pull request for the block layer for 4.13. Not a huge
round in terms of features, but there's a lot of churn related to some
core cleanups.
Note this depends on the UUID tree pull request, that Christoph
already sent out.
This pull request contains:
- A series from Christoph, unifying the error/stats codes in the
block layer. We now use blk_status_t everywhere, instead of using
different schemes for different places.
- Also from Christoph, some cleanups around request allocation and IO
scheduler interactions in blk-mq.
- And yet another series from Christoph, cleaning up how we handle
and do bounce buffering in the block layer.
- A blk-mq debugfs series from Bart, further improving on the support
we have for exporting internal information to aid debugging IO
hangs or stalls.
- Also from Bart, a series that cleans up the request initialization
differences across types of devices.
- A series from Goldwyn Rodrigues, allowing the block layer to return
failure if we will block and the user asked for non-blocking.
- Patch from Hannes for supporting setting loop devices block size to
that of the underlying device.
- Two series of patches from Javier, fixing various issues with
lightnvm, particular around pblk.
- A series from me, adding support for write hints. This comes with
NVMe support as well, so applications can help guide data placement
on flash to improve performance, latencies, and write
amplification.
- A series from Ming, improving and hardening blk-mq support for
stopping/starting and quiescing hardware queues.
- Two pull requests for NVMe updates. Nothing major on the feature
side, but lots of cleanups and bug fixes. From the usual crew.
- A series from Neil Brown, greatly improving the bio rescue set
support. Most notably, this kills the bio rescue work queues, if we
don't really need them.
- Lots of other little bug fixes that are all over the place"
* 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
lightnvm: pblk: set line bitmap check under debug
lightnvm: pblk: verify that cache read is still valid
lightnvm: pblk: add initialization check
lightnvm: pblk: remove target using async. I/Os
lightnvm: pblk: use vmalloc for GC data buffer
lightnvm: pblk: use right metadata buffer for recovery
lightnvm: pblk: schedule if data is not ready
lightnvm: pblk: remove unused return variable
lightnvm: pblk: fix double-free on pblk init
lightnvm: pblk: fix bad le64 assignations
nvme: Makefile: remove dead build rule
blk-mq: map all HWQ also in hyperthreaded system
nvmet-rdma: register ib_client to not deadlock in device removal
nvme_fc: fix error recovery on link down.
nvmet_fc: fix crashes on bad opcodes
nvme_fc: Fix crash when nvme controller connection fails.
nvme_fc: replace ioabort msleep loop with completion
nvme_fc: fix double calls to nvme_cleanup_cmd()
nvme-fabrics: verify that a controller returns the correct NQN
nvme: simplify nvme_dev_attrs_are_visible
...
Linus Torvalds [Mon, 3 Jul 2017 16:55:26 +0000 (09:55 -0700)]
Merge tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid
Pull uuid subsystem from Christoph Hellwig:
"This is the new uuid subsystem, in which Amir, Andy and I have started
consolidating our uuid/guid helpers and improving the types used for
them. Note that various other subsystems have pulled in this tree, so
I'd like it to go in early.
UUID/GUID summary:
- introduce the new uuid_t/guid_t types that are going to replace the
somewhat confusing uuid_be/uuid_le types and make the terminology
fit the various specs, as well as the userspace libuuid library.
(me, based on a previous version from Amir)
- consolidated generic uuid/guid helper functions lifted from XFS and
libnvdimm (Amir and me)
- conversions to the new types and helpers (Amir, Andy and me)"
* tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits)
ACPI: hns_dsaf_acpi_dsm_guid can be static
mmc: sdhci-pci: make guid intel_dsm_guid static
uuid: Take const on input of uuid_is_null() and guid_is_null()
thermal: int340x_thermal: fix compile after the UUID API switch
thermal: int340x_thermal: Switch to use new generic UUID API
acpi: always include uuid.h
ACPI: Switch to use generic guid_t in acpi_evaluate_dsm()
ACPI / extlog: Switch to use new generic UUID API
ACPI / bus: Switch to use new generic UUID API
ACPI / APEI: Switch to use new generic UUID API
acpi, nfit: Switch to use new generic UUID API
MAINTAINERS: add uuid entry
tmpfs: generate random sb->s_uuid
scsi_debug: switch to uuid_t
nvme: switch to uuid_t
sysctl: switch to use uuid_t
partitions/ldm: switch to use uuid_t
overlayfs: use uuid_t instead of uuid_be
fs: switch ->s_uuid to uuid_t
ima/policy: switch to use uuid_t
...
Linus Torvalds [Sun, 2 Jul 2017 23:07:02 +0000 (16:07 -0700)]
Linux 4.12
Sylvain 'ythier' Hitier [Sun, 2 Jul 2017 13:21:56 +0000 (15:21 +0200)]
moduleparam: fix doc: hwparam_irq configures an IRQ
Signed-off-by: Sylvain 'ythier' Hitier <sylvain.hitier@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 2 Jul 2017 18:53:44 +0000 (11:53 -0700)]
Merge branch 'upstream' of git://git.linux-mips.org/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
"Here's a final round of fixes for 4.12:
- Fix misordered instructions in assembly code making kenel startup
via UHB unreliable.
- Fix special case of MADDF and MADDF emulation.
- Fix alignment issue in address calculation in pm-cps on 64 bit.
- Fix IRQ tracing & lockdep when rescheduling
- Systems with MAARs require post-DMA cache flushes.
The reordering fix and the MADDF/MSUBF fix have sat in linux-next for
a number of days. The others haven't propagated from my pull tree to
linux-next yet but all have survived manual testing and Imagination's
automated test system and there are no pending bug reports"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: Avoid accidental raw backtrace
MIPS: Perform post-DMA cache flushes on systems with MAARs
MIPS: Fix IRQ tracing & lockdep when rescheduling
MIPS: pm-cps: Drop manual cache-line alignment of ready_count
MIPS: math-emu: Handle zero accumulator case in MADDF and MSUBF separately
MIPS: head: Reorder instructions missing a delay slot
Linus Torvalds [Sun, 2 Jul 2017 17:09:40 +0000 (10:09 -0700)]
Merge branch 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM fix from Russell King:
"One final fix for 4.12 - Doug found a boot failure case triggered by
requesting a non-even MB vmalloc size"
* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 8685/1: ensure memblock-limit is pmd-aligned
Kees Cook [Sat, 1 Jul 2017 18:01:29 +0000 (11:01 -0700)]
locking/refcount: Remove the half-implemented refcount_sub() API
CONFIG_REFCOUNT_FULL=y (correctly) does not provide a refcount_sub(),
which should not be part of proper refcount design patterns.
Remove the erroneous extern and the later !CONFIG_REFCOUNT_FULL
accidental implementation.
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes:
29dee3c03abc ("locking/refcounts: Out-of-line everything")
Link: http://lkml.kernel.org/r/20170701180129.GA17405@beast
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Sat, 1 Jul 2017 16:10:17 +0000 (09:10 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Fixlets for x86:
- Prevent kexec crash when KASLR is enabled, which was caused by an
address calculation bug
- Restore the freeing of PUDs on memory hot remove
- Correct a negated pointer check in the intel uncore performance
monitoring driver
- Plug a memory leak in an error exit path in the RDT code"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel_rdt: Fix memory leak on mount failure
x86/boot/KASLR: Fix kexec crash due to 'virt_addr' calculation bug
x86/boot/KASLR: Add checking for the offset of kernel virtual address randomization
perf/x86/intel/uncore: Fix wrong box pointer check
x86/mm/hotplug: Fix BUG_ON() after hot-remove by not freeing PUD
Linus Torvalds [Sat, 1 Jul 2017 15:46:52 +0000 (08:46 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fix from Thomas Gleixner:
"The last fix for perf for this cycles:
- Prevent a segfault when kernel.kptr_restrict=2 is set by avoiding a
null pointer dereference"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf machine: Fix segfault for kernel.kptr_restrict=2
Linus Torvalds [Sat, 1 Jul 2017 15:39:13 +0000 (08:39 -0700)]
Merge tag 'pinctrl-v4.12-4' of git://git./linux/kernel/git/linusw/linux-pinctrl
Pull pinctrl fix from Linus Walleij:
"Brian noticed that this regression has not got a proper fix for the
entire merge window and consequently we need to revert the offending
commit.
It's part of the RT-mainstream work, the dance goes like this, two
steps forward, one step back.
Summary:
- A last fix for v4.12, an IRQ problem reported early in the merge
window appears not to have been properly fixed, so the offending
commit will be reverted and we will find the proper fix for v4.13.
Hopefully"
* tag 'pinctrl-v4.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
Revert "pinctrl: rockchip: avoid hardirq-unsafe functions in irq_chip"
Linus Torvalds [Sat, 1 Jul 2017 15:24:54 +0000 (08:24 -0700)]
Merge tag 'gpio-v4.12-4' of git://git./linux/kernel/git/linusw/linux-gpio
Pull last minute fixes for GPIO from Linus Walleij:
- Fix another ACPI problem with broken BIOSes.
- Filter out the right GPIO events, making a very user-visible bug go
away.
* tag 'gpio-v4.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio: acpi: Skip _AEI entries without a handler rather then aborting the scan
gpiolib: fix filtering out unwanted events
Ingo Molnar [Sat, 1 Jul 2017 08:39:25 +0000 (10:39 +0200)]
Merge tag 'perf-core-for-mingo-4.13-
20170630' of git://git./linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
Intel PT enhancements:
- Support "ptwrite" instruction, a way to stuff 32 or 64 bit values into
the Intel PT trace (Adrian Hunter)
- Support power events in Intel PT to report changes to C-state (Adrian
Hunter)
- Synthesize Intel PT events as PERF_RECORD_SAMPLE records with a
perf_event_attr.type (PERF_TYPE_SYNTH) just after the range used by the
kernel, i.e. right after what is allocated for PMUs, at INT_MAX + 1U,
attr.config will have the identification for the synthesized event and
the PERF_SAMPLE_RAW payload will have its fields (Adrian Hunter)
Infrastructure changes:
- Remove warning() and error(), using instead pr_warning() and
pr_error(), consolidating error reporting (Arnaldo Carvalho de Melo)
- Add platform dependency to 'perf test 15' (Thomas Richter)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Sat, 1 Jul 2017 00:18:57 +0000 (17:18 -0700)]
Merge tag 'trace-v4.12-rc5' of git://git./linux/kernel/git/rostedt/linux-trace
Pull last-minute tracing fixes from Steven Rostedt:
"Two fixes:
One is for a crash when using the :mod: trace probe command into
stack_trace_filter. This bug was introduced during the last merge
window.
The other was there forever. It's a small bug that makes it impossible
to name a module function for kprobes when the module starts with a
digit"
* tag 'trace-v4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/kprobes: Allow to create probe with a module name starting with a digit
ftrace: Fix regression with module command in stack_trace_filter
Zack Weinberg [Wed, 14 Jun 2017 15:14:28 +0000 (08:14 -0700)]
uapi/linux/a.out.h: don't use deprecated system-specific predefines.
uapi/linux/a.out.h uses a number of predefined macros that are
deprecated because they're in the application namespace
(e.g. '#ifdef linux' instead of '#ifdef __linux__').
This patch either corrects or just removes them if they are not
applicable to Linux.
The primary reason this is worth bothering to fix, considering how
obsolete a.out binary support is, is that the GCC build process
considers this such a severe error that it will copy the header into a
private directory and change the macro names, which causes future
updates to the header to be masked. This header probably doesn't get
updated very often anymore, but it is the _only_ uapi header that gets
this treatment, so IMHO it is worth patching just to drive that number
all the way to zero.
Signed-off-by: Zack Weinberg <zackw@panix.com>
[hch: removed dead conditionals]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jakub Kicinski [Thu, 29 Jun 2017 04:25:48 +0000 (21:25 -0700)]
hashtable: remove repeated phrase from a comment
"in a rcu enabled hashtable" is repeated twice in a comment.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vikas Shivappa [Mon, 26 Jun 2017 18:55:49 +0000 (11:55 -0700)]
x86/intel_rdt: Fix memory leak on mount failure
If mount fails, the kn_info directory is not freed causing memory leak.
Add the missing error handling path.
Fixes:
4e978d06dedb ("x86/intel_rdt: Add "info" files to resctrl file system")
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: vikas.shivappa@intel.com
Cc: andi.kleen@intel.com
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1498503368-20173-3-git-send-email-vikas.shivappa@linux.intel.com
Linus Torvalds [Fri, 30 Jun 2017 17:55:34 +0000 (10:55 -0700)]
Merge tag 'powerpc-4.12-8' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Hopefully the last two powerpc fixes for 4.12.
The CXL one is larger than I'd usually send at rc7, but it fixes new
code this cycle, so better to have it working for the release. It was
actually sent a few weeks back but got blocked in testing behind
another fix that was causing issues.
We are still tracking one crash in v4.12-rc7, but only one person has
reproduced it and the commit identified by bisect doesn't touch any of
the relevant code, so I think it's 50/50 whether that commit is
actually the problem or it's some code layout / toolchain issue.
Two fixes for code we merged this cycle:
- cxl: Fixes for Coherent Accelerator Interface Architecture 2.0
- Avoid miscompilation w/GCC 4.6.3 on 32-bit - don't inline
copy_to/from_user()
Thanks to Al Viro, Larry Finger, Christophe Lombard"
* tag 'powerpc-4.12-8' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/32: Avoid miscompilation w/GCC 4.6.3 - don't inline copy_to/from_user()
cxl: Fixes for Coherent Accelerator Interface Architecture 2.0
Josh Poimboeuf [Fri, 30 Jun 2017 14:09:34 +0000 (09:09 -0500)]
objtool: Silence warnings for functions which use IRET
Previously, objtool ignored functions which have the IRET instruction
in them. That's because it assumed that such functions know what
they're doing with respect to frame pointers.
With the new "objtool 2.0" changes, it stopped ignoring such functions,
and started complaining about them:
arch/x86/kernel/alternative.o: warning: objtool: do_sync_core()+0x1b: unsupported instruction in callable function
arch/x86/kernel/alternative.o: warning: objtool: text_poke()+0x1a8: unsupported instruction in callable function
arch/x86/kernel/ftrace.o: warning: objtool: do_sync_core()+0x16: unsupported instruction in callable function
arch/x86/kernel/cpu/mcheck/mce.o: warning: objtool: machine_check_poll()+0x166: unsupported instruction in callable function
arch/x86/kernel/cpu/mcheck/mce.o: warning: objtool: do_machine_check()+0x147: unsupported instruction in callable function
Silence those warnings for now. They can be re-enabled later, once we
have unwind hints which will allow the code to annotate the IRET usages.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Fixes:
baa41469a7b9 ("objtool: Implement stack validation 2.0")
Link: http://lkml.kernel.org/r/20170630140934.mmwtpockvpupahro@treble
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Fri, 30 Jun 2017 17:37:48 +0000 (10:37 -0700)]
Merge tag 'iommu-fixes-v4.12-rc7' of git://git./linux/kernel/git/joro/iommu
Pull IOMMU fixes from Joerg Roedel:
"Two fixes:
- A fix for AMD IOMMU interrupt remapping code when IRQs are
forwarded directly to KVM guests
- Fixed check in the recently merged code to allow tboot with
Intel VT-d disabled"
* tag 'iommu-fixes-v4.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/amd: Fix interrupt remapping when disable guest_mode
iommu/vt-d: Correctly disable Intel IOMMU force on
Linus Torvalds [Fri, 30 Jun 2017 17:30:26 +0000 (10:30 -0700)]
Merge tag 'sound-4.12' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Two last-minute HD-audio fixes"
* tag 'sound-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda - Fix endless loop of codec configure
ALSA: hda - set input_path bitmap to zero after moving it to new place
Linus Torvalds [Fri, 30 Jun 2017 17:22:59 +0000 (10:22 -0700)]
Merge branch 'overlayfs-linus' of git://git./linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Fix two bugs in copy-up code. One introduced in 4.11 and one in
4.12-rc"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: don't set origin on broken lower hardlink
ovl: copy-up: don't unlock between lookup and link
Javier González [Fri, 30 Jun 2017 15:56:43 +0000 (17:56 +0200)]
lightnvm: pblk: set line bitmap check under debug
Do bitmap checks only when debug mode is enable. The line bitmap used
for mapping to physical addresses is fairly large (~512KB) and it is
expensive to do this checks on the fast path.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 30 Jun 2017 15:56:42 +0000 (17:56 +0200)]
lightnvm: pblk: verify that cache read is still valid
When a read is directed to the cache, we risk that the lba has been
updated during the time we made the L2P table lookup and the time we are
actually reading form the cache. We intentionally not hold the L2P lock
not to block other threads.
While strict ordering is not a guarantee at this level (unless REQ_FLUSH
has been previously issued), we have experience that some databases that
have recently implemented direct I/O support, issue metadata reads very
close to the writes, without issuing a fsync in the middle. An easy way
to support them while they is to make an extra effort and check the L2P
map right before reading the cache.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 30 Jun 2017 15:56:41 +0000 (17:56 +0200)]
lightnvm: pblk: add initialization check
Add a sanity check to the pblk initialization sequence in order to
ensure that enough LUNs have been allocated to store the line metadata.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 30 Jun 2017 15:56:40 +0000 (17:56 +0200)]
lightnvm: pblk: remove target using async. I/Os
When removing a pblk instance, pad the current line using asynchronous
I/O. This reduces the removal time from ~1 minute in the worst case to a
couple of seconds.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 30 Jun 2017 15:56:39 +0000 (17:56 +0200)]
lightnvm: pblk: use vmalloc for GC data buffer
For now, we allocate a per I/O buffer for GC data. Since the potential
size of the buffer is 256KB and GC is not in the fast path, do this
allocation with vmalloc. This puts lets pressure on the memory
allocator at no performance cost.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 30 Jun 2017 15:56:38 +0000 (17:56 +0200)]
lightnvm: pblk: use right metadata buffer for recovery
Fix bad metadata buffer assignations introduced when refactoring the
medatada write path.
Fixes:
dd2a43437337 lightnvm: pblk: sched. metadata on write thread
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 30 Jun 2017 15:56:37 +0000 (17:56 +0200)]
lightnvm: pblk: schedule if data is not ready
When user threads place data into the write buffer, they reserve space
and do the memory copy out of the lock. As a consequence, when the write
thread starts persisting data, there is a chance that it is not copied
yet. In this case, avoid polling, and schedule before retrying.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 30 Jun 2017 15:56:36 +0000 (17:56 +0200)]
lightnvm: pblk: remove unused return variable
Remove unused variable.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 30 Jun 2017 15:56:35 +0000 (17:56 +0200)]
lightnvm: pblk: fix double-free on pblk init
Prevent pblk->lines being double freed in case of an error during pblk
initialization.
Fixes:
dd2a43437337: "lightnvm: pblk: sched. metadata on write thread"
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 30 Jun 2017 15:56:34 +0000 (17:56 +0200)]
lightnvm: pblk: fix bad le64 assignations
Use the right types and conversions on le64 variables. Reported by
sparse.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Adrian Hunter [Fri, 26 May 2017 08:17:38 +0000 (11:17 +0300)]
perf auxtrace: Add CPU filter support
Decoding auxtrace data can take a long time. To avoid decoding
unnecessarily, filter auxtrace data that is collected per-cpu before it is
decoded.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1495786658-18063-38-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Fri, 26 May 2017 08:17:37 +0000 (11:17 +0300)]
perf intel-pt: Do not use TSC packets for calculating CPU cycles to TSC
CBR (core-to-bus ratio) packets provide an indication of CPU frequency. A
more accurate measure can be made by counting the cycles (given by CYC
packets) in between other timing packets (either MTC or TSC). Using TSC
packets has at least 2 issues: 1) timing might have stopped (e.g. mwait) or
2) TSC packets within PSB+ might slip past CYC packets. For now, simply do
not use TSC packets for calculating CPU cycles to TSC. That leaves the case
where 2 MTC packets are used, otherwise falling back to the CBR value.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1495786658-18063-37-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Fri, 26 May 2017 08:17:36 +0000 (11:17 +0300)]
perf intel-pt: Update documentation to include new ptwrite and power events
Update documentation to include new ptwrite and power events.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1495786658-18063-36-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Fri, 26 May 2017 08:17:35 +0000 (11:17 +0300)]
perf intel-pt: Add example script for power events and PTWRITE
Add script intel-pt-events.py that provides an example of how to unpack the
raw data for power events and PTWRITE.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1495786658-18063-35-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Fri, 30 Jun 2017 08:36:45 +0000 (11:36 +0300)]
perf intel-pt: Synthesize new power and "ptwrite" events
Synthesize new power and ptwrite events.
Power events report changes to C-state but I have also added support
for the existing CBR (core-to-bus ratio) packet and included that
when outputting power events.
The PTWRITE packet is associated with the new "ptwrite" instruction,
which is essentially just a way to stuff a 32 or 64 bit value into the
PT trace.
More details can be found in the patches that add documentation and in
the Intel SDM.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1498811805-2335-1-git-send-email-adrian.hunter@intel.com
[ Copy the description of such packet from the patchkit cover message ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Fri, 26 May 2017 08:17:33 +0000 (11:17 +0300)]
perf intel-pt: Move code in intel_pt_synth_events() to simplify attr setting
intel_pt_synth_events() uses the same attr structure to create each event.
Move the code around a bit to simplify that.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1495786658-18063-33-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Fri, 26 May 2017 08:17:32 +0000 (11:17 +0300)]
perf intel-pt: Factor out intel_pt_set_event_name()
Factor out intel_pt_set_event_name() so it can be reused.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1495786658-18063-32-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Fri, 26 May 2017 08:17:31 +0000 (11:17 +0300)]
perf intel-pt: Tidy messages into called function intel_pt_synth_event()
Tidy print messages into called function intel_pt_synth_event().
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1495786658-18063-31-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Fri, 26 May 2017 08:17:30 +0000 (11:17 +0300)]
perf intel-pt: Tidy Intel PT evsel lookup into separate function
Tidy the lookup of the Intel PT selected event (perf_evsel) into a separate
function.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1495786658-18063-30-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Fri, 26 May 2017 08:17:29 +0000 (11:17 +0300)]
perf intel-pt: Join needlessly wrapped lines
Join needlessly wrapped lines.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1495786658-18063-29-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Fri, 26 May 2017 08:17:28 +0000 (11:17 +0300)]
perf intel-pt: Remove unused instructions_sample_period
Remove unused struct intel_pt member instructions_sample_period.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1495786658-18063-28-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Fri, 26 May 2017 08:17:27 +0000 (11:17 +0300)]
perf intel-pt: Factor out common code synthesizing event samples
Factor out common code in functions synthesizing event samples i.e.
intel_pt_synth_branch_sample(), intel_pt_synth_instruction_sample() and
intel_pt_synth_transaction_sample().
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1495786658-18063-27-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Fri, 30 Jun 2017 08:36:42 +0000 (11:36 +0300)]
perf script: Add synthesized Intel PT power and ptwrite events
Add definitions for synthesized Intel PT events for power and ptwrite.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1498811802-2301-1-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Josh Poimboeuf [Wed, 28 Jun 2017 15:11:07 +0000 (10:11 -0500)]
objtool: Implement stack validation 2.0
This is a major rewrite of objtool. Instead of only tracking frame
pointer changes, it now tracks all stack-related operations, including
all register saves/restores.
In addition to making stack validation more robust, this also paves the
way for undwarf generation.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/678bd94c0566c6129bcc376cddb259c4c5633004.1498659915.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Josh Poimboeuf [Wed, 28 Jun 2017 15:11:06 +0000 (10:11 -0500)]
objtool, x86: Add several functions and files to the objtool whitelist
In preparation for an objtool rewrite which will have broader checks,
whitelist functions and files which cause problems because they do
unusual things with the stack.
These whitelists serve as a TODO list for which functions and files
don't yet have undwarf unwinder coverage. Eventually most of the
whitelists can be removed in favor of manual CFI hint annotations or
objtool improvements.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/7f934a5d707a574bda33ea282e9478e627fb1829.1498659915.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Josh Poimboeuf [Wed, 28 Jun 2017 15:11:05 +0000 (10:11 -0500)]
objtool: Move checking code to check.c
In preparation for the new 'objtool undwarf generate' command, which
will rely on 'objtool check', move the checking code from
builtin-check.c to check.c where it can be used by other commands.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Jiri Slaby <jslaby@suse.cz>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/294c5c695fd73c1a5000bbe5960a7c9bec4ee6b4.1498659915.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andy Lutomirski [Thu, 29 Jun 2017 15:53:14 +0000 (08:53 -0700)]
x86/mm: Delete a big outdated comment about TLB flushing
The comment describes the old explicit IPI-based flush logic, which
is long gone.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/55e44997e56086528140c5180f8337dc53fb7ffc.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andy Lutomirski [Thu, 29 Jun 2017 15:53:13 +0000 (08:53 -0700)]
x86/mm: Don't reenter flush_tlb_func_common()
It was historically possible to have two concurrent TLB flushes
targetting the same CPU: one initiated locally and one initiated
remotely. This can now cause an OOPS in leave_mm() at
arch/x86/mm/tlb.c:47:
if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
BUG();
with this call trace:
flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline]
flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317
Without reentrancy, this OOPS is impossible: leave_mm() is only
called if we're not in TLBSTATE_OK, but then we're unexpectedly
in TLBSTATE_OK in leave_mm().
This can be caused by flush_tlb_func_remote() happening between
the two checks and calling leave_mm(), resulting in two consecutive
leave_mm() calls on the same CPU with no intervening switch_mm()
calls.
We never saw this OOPS before because the old leave_mm()
implementation didn't put us back in TLBSTATE_OK, so the assertion
didn't fire.
Nadav noticed the reentrancy issue in a different context, but
neither of us realized that it caused a problem yet.
Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Fixes:
3d28ebceaffa ("x86/mm: Rework lazy TLB to track the actual loaded mm")
Link: http://lkml.kernel.org/r/855acf733268d521c9f2e191faee2dcc23a29729.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Paolo Abeni [Thu, 29 Jun 2017 13:55:58 +0000 (15:55 +0200)]
x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings
According to the Intel datasheet, the REP MOVSB instruction
exposes a pretty heavy setup cost (50 ticks), which hurts
short string copy operations.
This change tries to avoid this cost by calling the explicit
loop available in the unrolled code for strings shorter
than 64 bytes.
The 64 bytes cutoff value is arbitrary from the code logic
point of view - it has been selected based on measurements,
as the largest value that still ensures a measurable gain.
Micro benchmarks of the __copy_from_user() function with
lengths in the [0-63] range show this performance gain
(shorter the string, larger the gain):
- in the [55%-4%] range on Intel Xeon(R) CPU E5-2690 v4
- in the [72%-9%] range on Intel Core i7-4810MQ
Other tested CPUs - namely Intel Atom S1260 and AMD Opteron
8216 - show no difference, because they do not expose the
ERMS feature bit.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/4533a1d101fd460f80e21329a34928fad521c1d4.1498744345.git.pabeni@redhat.com
[ Clarified the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Gustavo A. R. Silva [Thu, 29 Jun 2017 18:41:28 +0000 (13:41 -0500)]
sched/cputime: Refactor the cputime_adjust() code
Address a Coverity false positive, which is caused by overly
convoluted code:
Value assigned to variable 'utime' at line 619:utime = rtime;
is overwritten at line 642:utime = rtime - stime; before it
can be used. This makes such variable assignment useless.
Remove this variable assignment and refactor the code related.
Addresses-Coverity-ID:
1371643
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Cc: Frans Klaver <fransklaver@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/20170629184128.GA5271@embeddedgus
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Bristot de Oliveira [Mon, 26 Jun 2017 15:07:14 +0000 (17:07 +0200)]
sched/debug: Expose the number of RT/DL tasks that can migrate
Add the value of the rt_rq.rt_nr_migratory and dl_rq.dl_nr_migratory
to the sched_debug output, for instance:
rt_rq[0]:
.rt_nr_running : 2
.rt_nr_migratory : 1 <--- Like this
.rt_throttled : 0
.rt_time : 828.645877
.rt_runtime : 1000.000000
This is useful to debug problems related to the RT/DL schedulers.
This also fixes the format of some variables, that were unsigned, rather
than signed.
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Link: http://lkml.kernel.org/r/7896f71cada54ee7dd8507bb666063a2e051c3d4.1498482127.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Colin Ian King [Thu, 29 Jun 2017 09:14:06 +0000 (10:14 +0100)]
perf/x86/intel: Constify the 'lbr_desc[]' array and make a function static
A few minor clean-ups: constify the lbr_desc[] array and make
local function lbr_from_signext_quirk_rd() static to fix a sparse warning:
"symbol 'lbr_from_signext_quirk_rd' was not declared. Should it be static?"
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-janitors@vger.kernel.org
Link: http://lkml.kernel.org/r/20170629091406.9870-1-colin.king@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kirill A. Shutemov [Wed, 28 Jun 2017 12:17:30 +0000 (15:17 +0300)]
x86/KASLR: Fix detection 32/64 bit bootloaders for 5-level paging
KASLR uses hack to detect whether we booted via startup_32() or
startup_64(): it checks what is loaded into cr3 and compares it to
_pgtables. _pgtables is the array of page tables where early code
allocates page table from.
KASLR expects cr3 to point to _pgtables if we booted via startup_32(), but
that's not true if we booted with 5-level paging enabled. In this case top
level page table is allocated separately and only the first p4d page table
is allocated from the array.
Let's modify the check to cover both 4- and 5-level paging cases.
The patch also renames 'level4p' to 'top_level_pgt' as it now can hold
page table for 4th or 5th level, depending on configuration.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170628121730.43079-1-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Baoquan He [Tue, 27 Jun 2017 12:39:06 +0000 (20:39 +0800)]
x86/boot/KASLR: Fix kexec crash due to 'virt_addr' calculation bug
Kernel text KASLR is separated into physical address and virtual
address randomization. And for virtual address randomization, we
only randomiza to get an offset between 16M and KERNEL_IMAGE_SIZE.
So the initial value of 'virt_addr' should be LOAD_PHYSICAL_ADDR,
but not the original kernel loading address 'output'.
The bug will cause kernel boot failure if kernel is loaded at a different
position than the address, 16M, which is decided at compiled time.
Kexec/kdump is such practical case.
To fix it, just assign LOAD_PHYSICAL_ADDR to virt_addr as initial
value.
Tested-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes:
8391c73 ("x86/KASLR: Randomize virtual address separately")
Link: http://lkml.kernel.org/r/1498567146-11990-3-git-send-email-bhe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Baoquan He [Tue, 27 Jun 2017 12:39:05 +0000 (20:39 +0800)]
x86/boot/KASLR: Add checking for the offset of kernel virtual address randomization
For kernel text KASLR, the virtual address is confined to area of 1G,
[0xffffffff80000000, 0xffffffffc0000000). For the implemenataion of
virtual address randomization, we only randomize to get an offset
between 16M and 1G, then add this offset to the starting address,
0xffffffff80000000. Here 16M is the offset which is decided at linking
stage. So the amount of the local variable 'virt_addr' which respresents
the offset plus the kernel output size can not exceed KERNEL_IMAGE_SIZE.
Add a debug check for the offset. If out of bounds, print error
message and hang there.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1498567146-11990-2-git-send-email-bhe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Sabrina Dubroca [Thu, 22 Jun 2017 09:24:42 +0000 (11:24 +0200)]
tracing/kprobes: Allow to create probe with a module name starting with a digit
Always try to parse an address, since kstrtoul() will safely fail when
given a symbol as input. If that fails (which will be the case for a
symbol), try to parse a symbol instead.
This allows creating a probe such as:
p:probe/vlan_gro_receive 8021q:vlan_gro_receive+0
Which is necessary for this command to work:
perf probe -m 8021q -a vlan_gro_receive
Link: http://lkml.kernel.org/r/fd72d666f45b114e2c5b9cf7e27b91de1ec966f1.1498122881.git.sd@queasysnail.net
Cc: stable@vger.kernel.org
Fixes:
413d37d1e ("tracing: Add kprobe-based event tracer")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
James Hogan [Thu, 29 Jun 2017 14:05:04 +0000 (15:05 +0100)]
MIPS: Avoid accidental raw backtrace
Since commit
81a76d7119f6 ("MIPS: Avoid using unwind_stack() with
usermode") show_backtrace() invokes the raw backtracer when
cp0_status & ST0_KSU indicates user mode to fix issues on EVA kernels
where user and kernel address spaces overlap.
However this is used by show_stack() which creates its own pt_regs on
the stack and leaves cp0_status uninitialised in most of the code paths.
This results in the non deterministic use of the raw back tracer
depending on the previous stack content.
show_stack() deals exclusively with kernel mode stacks anyway, so
explicitly initialise regs.cp0_status to KSU_KERNEL (i.e. 0) to ensure
we get a useful backtrace.
Fixes:
81a76d7119f6 ("MIPS: Avoid using unwind_stack() with usermode")
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: <stable@vger.kernel.org> # 3.15+
Patchwork: https://patchwork.linux-mips.org/patch/16656/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Paul Burton [Tue, 13 Jun 2017 17:01:08 +0000 (10:01 -0700)]
MIPS: Perform post-DMA cache flushes on systems with MAARs
Recent CPUs from Imagination Technologies such as the I6400 or P6600 are
able to speculatively fetch data from memory into caches. This means
that if used in a system with non-coherent DMA they require that caches
be invalidated after a device performs DMA, and before the CPU reads the
DMA'd data, in order to ensure that stale values weren't speculatively
prefetched.
Such CPUs also introduced Memory Accessibility Attribute Registers
(MAARs) in order to control the regions in which they are allowed to
speculate. Thus we can use the presence of MAARs as a good indication
that the CPU requires the above cache maintenance. Use the presence of
MAARs to determine the result of cpu_needs_post_dma_flush() in the
default case, in order to handle these recent CPUs correctly.
Note that the return type of cpu_needs_post_dma_flush() is changed to
bool, such that it's clearer what's happening when cpu_has_maar is cast
to bool for the return value. If this patch were backported to a
pre-v4.7 kernel then MIPS_CPU_MAAR was 1ull<<34, so when cast to an int
we would incorrectly return 0. It so happens that MIPS_CPU_MAAR is
currently 1ull<<30, so when truncated to an int gives a non-zero value
anyway, but even so the implicit conversion from long long int to bool
makes it clearer to understand what will happen than the implicit
conversion from long long int to int would. The bool return type also
fits this usage better semantically, so seems like an all-round win.
Thanks to Ed for spotting the issue for pre-v4.7 kernels & suggesting
the return type change.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Reviewed-by: Bryan O'Donoghue <pure.logic@nexus-software.ie>
Tested-by: Bryan O'Donoghue <pure.logic@nexus-software.ie>
Cc: Ed Blake <ed.blake@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/16363/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Paul Burton [Fri, 3 Mar 2017 23:26:05 +0000 (15:26 -0800)]
MIPS: Fix IRQ tracing & lockdep when rescheduling
When the scheduler sets TIF_NEED_RESCHED & we call into the scheduler
from arch/mips/kernel/entry.S we disable interrupts. This is true
regardless of whether we reach work_resched from syscall_exit_work,
resume_userspace or by looping after calling schedule(). Although we
disable interrupts in these paths we don't call trace_hardirqs_off()
before calling into C code which may acquire locks, and we therefore
leave lockdep with an inconsistent view of whether interrupts are
disabled or not when CONFIG_PROVE_LOCKING & CONFIG_DEBUG_LOCKDEP are
both enabled.
Without tracing this interrupt state lockdep will print warnings such
as the following once a task returns from a syscall via
syscall_exit_partial with TIF_NEED_RESCHED set:
[ 49.927678] ------------[ cut here ]------------
[ 49.934445] WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:3687 check_flags.part.41+0x1dc/0x1e8
[ 49.946031] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
[ 49.946355] CPU: 0 PID: 1 Comm: init Not tainted
4.10.0-00439-gc9fd5d362289-dirty #197
[ 49.963505] Stack :
0000000000000000 ffffffff81bb5d6a 0000000000000006 ffffffff801ce9c4
[ 49.974431]
0000000000000000 0000000000000000 0000000000000000 000000000000004a
[ 49.985300]
ffffffff80b7e487 ffffffff80a24498 a8000000ff160000 ffffffff80ede8b8
[ 49.996194]
0000000000000001 0000000000000000 0000000000000000 0000000077c8030c
[ 50.007063]
000000007fd8a510 ffffffff801cd45c 0000000000000000 a8000000ff127c88
[ 50.017945]
0000000000000000 ffffffff801cf928 0000000000000001 ffffffff80a24498
[ 50.028827]
0000000000000000 0000000000000001 0000000000000000 0000000000000000
[ 50.039688]
0000000000000000 a8000000ff127bd0 0000000000000000 ffffffff805509bc
[ 50.050575]
00000000140084e0 0000000000000000 0000000000000000 0000000000040a00
[ 50.061448]
0000000000000000 ffffffff8010e1b0 0000000000000000 ffffffff805509bc
[ 50.072327] ...
[ 50.076087] Call Trace:
[ 50.079869] [<
ffffffff8010e1b0>] show_stack+0x80/0xa8
[ 50.086577] [<
ffffffff805509bc>] dump_stack+0x10c/0x190
[ 50.093498] [<
ffffffff8015dde0>] __warn+0xf0/0x108
[ 50.099889] [<
ffffffff8015de34>] warn_slowpath_fmt+0x3c/0x48
[ 50.107241] [<
ffffffff801c15b4>] check_flags.part.41+0x1dc/0x1e8
[ 50.114961] [<
ffffffff801c239c>] lock_is_held_type+0x8c/0xb0
[ 50.122291] [<
ffffffff809461b8>] __schedule+0x8c0/0x10f8
[ 50.129221] [<
ffffffff80946a60>] schedule+0x30/0x98
[ 50.135659] [<
ffffffff80106278>] work_resched+0x8/0x34
[ 50.142397] ---[ end trace
0cb4f6ef5b99fe21 ]---
[ 50.148405] possible reason: unannotated irqs-off.
[ 50.154600] irq event stamp: 400463
[ 50.159566] hardirqs last enabled at (400463): [<
ffffffff8094edc8>] _raw_spin_unlock_irqrestore+0x40/0xa8
[ 50.171981] hardirqs last disabled at (400462): [<
ffffffff8094eb98>] _raw_spin_lock_irqsave+0x30/0xb0
[ 50.183897] softirqs last enabled at (400450): [<
ffffffff8016580c>] __do_softirq+0x4ac/0x6a8
[ 50.195015] softirqs last disabled at (400425): [<
ffffffff80165e78>] irq_exit+0x110/0x128
Fix this by using the TRACE_IRQS_OFF macro to call trace_hardirqs_off()
when CONFIG_TRACE_IRQFLAGS is enabled. This is done before invoking
schedule() following the work_resched label because:
1) Interrupts are disabled regardless of the path we take to reach
work_resched() & schedule().
2) Performing the tracing here avoids the need to do it in paths which
disable interrupts but don't call out to C code before hitting a
path which uses the RESTORE_SOME macro that will call
trace_hardirqs_on() or trace_hardirqs_off() as appropriate.
We call trace_hardirqs_on() using the TRACE_IRQS_ON macro before calling
syscall_trace_leave() for similar reasons, ensuring that lockdep has a
consistent view of state after we re-enable interrupts.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Fixes:
1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: linux-mips@linux-mips.org
Cc: stable <stable@vger.kernel.org>
Patchwork: https://patchwork.linux-mips.org/patch/15385/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Paul Burton [Thu, 2 Mar 2017 22:02:40 +0000 (14:02 -0800)]
MIPS: pm-cps: Drop manual cache-line alignment of ready_count
We allocate memory for a ready_count variable per-CPU, which is accessed
via a cached non-coherent TLB mapping to perform synchronisation between
threads within the core using LL/SC instructions. In order to ensure
that the variable is contained within its own data cache line we
allocate 2 lines worth of memory & align the resulting pointer to a line
boundary. This is however unnecessary, since kmalloc is guaranteed to
return memory which is at least cache-line aligned (see
ARCH_DMA_MINALIGN). Stop the redundant manual alignment.
Besides cleaning up the code & avoiding needless work, this has the side
effect of avoiding an arithmetic error found by Bryan on 64 bit systems
due to the 32 bit size of the former dlinesz. This led the ready_count
variable to have its upper 32b cleared erroneously for MIPS64 kernels,
causing problems when ready_count was later used on MIPS64 via cpuidle.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Fixes:
3179d37ee1ed ("MIPS: pm-cps: add PM state entry code for CPS systems")
Reported-by: Bryan O'Donoghue <bryan.odonoghue@imgtec.com>
Reviewed-by: Bryan O'Donoghue <bryan.odonoghue@imgtec.com>
Tested-by: Bryan O'Donoghue <bryan.odonoghue@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: stable <stable@vger.kernel.org> # v3.16+
Patchwork: https://patchwork.linux-mips.org/patch/15383/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Doug Berger [Thu, 29 Jun 2017 17:41:36 +0000 (18:41 +0100)]
ARM: 8685/1: ensure memblock-limit is pmd-aligned
The pmd containing memblock_limit is cleared by prepare_page_table()
which creates the opportunity for early_alloc() to allocate unmapped
memory if memblock_limit is not pmd aligned causing a boot-time hang.
Commit
965278dcb8ab ("ARM: 8356/1: mm: handle non-pmd-aligned end of RAM")
attempted to resolve this problem, but there is a path through the
adjust_lowmem_bounds() routine where if all memory regions start and
end on pmd-aligned addresses the memblock_limit will be set to
arm_lowmem_limit.
Since arm_lowmem_limit can be affected by the vmalloc early parameter,
the value of arm_lowmem_limit may not be pmd-aligned. This commit
corrects this oversight such that memblock_limit is always rounded
down to pmd-alignment.
Fixes:
965278dcb8ab ("ARM: 8356/1: mm: handle non-pmd-aligned end of RAM")
Signed-off-by: Doug Berger <opendmb@gmail.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Linus Torvalds [Thu, 29 Jun 2017 21:30:07 +0000 (14:30 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Need to access netdev->num_rx_queues behind an accessor in netvsc
driver otherwise the build breaks with some configs, from Arnd
Bergmann.
2) Add dummy xfrm_dev_event() so that build doesn't fail when
CONFIG_XFRM_OFFLOAD is not set. From Hangbin Liu.
3) Don't OOPS when pfkey_msg2xfrm_state() signals an erros, from Dan
Carpenter.
4) Fix MCDI command size for filter operations in sfc driver, from
Martin Habets.
5) Fix UFO segmenting so that we don't calculate incorrect checksums,
from Michal Kubecek.
6) When ipv6 datagram connects fail, reset destination address and
port. From Wei Wang.
7) TCP disconnect must reset the cached receive DST, from WANG Cong.
8) Fix sign extension bug on 32-bit in dev_get_stats(), from Eric
Dumazet.
9) fman driver has to depend on HAS_DMA, from Madalin Bucur.
10) Fix bpf pointer leak with xadd in verifier, from Daniel Borkmann.
11) Fix negative page counts with GFO, from Michal Kubecek.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
sfc: fix attempt to translate invalid filter ID
net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish()
bpf: prevent leaking pointer via xadd on unpriviledged
arcnet: com20020-pci: add missing pdev setup in netdev structure
arcnet: com20020-pci: fix dev_id calculation
arcnet: com20020: remove needless base_addr assignment
Trivial fix to spelling mistake in arc_printk message
arcnet: change irq handler to lock irqsave
rocker: move dereference before free
mlxsw: spectrum_router: Fix NULL pointer dereference
net: sched: Fix one possible panic when no destroy callback
virtio-net: serialize tx routine during reset
net: usb: asix88179_178a: Add support for the Belkin B2B128
fsl/fman: add dependency on HAS_DMA
net: prevent sign extension in dev_get_stats()
tcp: reset sk_rx_dst in tcp_disconnect()
net: ipv6: reset daddr and dport in sk if connect() fails
bnx2x: Don't log mc removal needlessly
bnxt_en: Fix netpoll handling.
bnxt_en: Add missing logic to handle TPA end error conditions.
...
Linus Torvalds [Thu, 29 Jun 2017 21:23:02 +0000 (14:23 -0700)]
Merge tag 'for-4.12/dm-fixes-5' of git://git./linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- dm thinp fix for crash that will occur when metadata device failure
races with discard passdown to the underlying data device.
- dm raid fix to not access the superblock's >= 1.9.0 'sectors' member
unconditionally.
* tag 'for-4.12/dm-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm thin: do not queue freed thin mapping for next stage processing
dm raid: fix oops on upgrading to extended superblock format
Linus Torvalds [Thu, 29 Jun 2017 21:10:37 +0000 (14:10 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Two fixes that should go into this release.
One is an nvme regression fix from Keith, fixing a missing queue
freeze if the controller is being reset. This causes the reset to
hang.
The other is a fix for a leak of the bio protection info, if smaller
sized O_DIRECT is used. This fix should be more involved as we have
other problematic paths in the kernel, but given as this isn't a
regression in this series, we'll tackle those for 4.13"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: provide bio_uninit() free freeing integrity/task associations
nvme/pci: Fix stuck nvme reset
Kirill A. Shutemov [Tue, 27 Jun 2017 11:59:48 +0000 (14:59 +0300)]
x86/ftrace: Exclude functions in head64.c from function-tracing
A recent commit moved most logic of early boot up from startup_64() written
in assembly to __startup_64() written in C.
Fengguang reported breakage due to the change. It was tracked down to
CONFIG_FUNCTION_TRACER being enabled.
Tracing this function is not possible because it's invoked from the
earliest boot stage before the relocation fixups have been done. It is the
function doing the relocation.
Exclude it from being built with tracer stubs.
Fixes:
c88d71508e36 ("x86/boot/64: Rewrite startup_64() in C")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: lkp@01.org
Link: http://lkml.kernel.org/r/20170627115948.17938-1-kirill.shutemov@linux.intel.com
Edward Cree [Thu, 29 Jun 2017 15:50:06 +0000 (16:50 +0100)]
sfc: fix attempt to translate invalid filter ID
When filter insertion fails with no rollback, we were trying to convert
EFX_EF10_FILTER_ID_INVALID to an id to store in 'ids' (which is either
vlan->uc or vlan->mc). This would WARN_ON_ONCE and then record a bogus
filter ID of 0x1fff, neither of which is a good thing.
Fixes:
0ccb998bf46d ("sfc: fix filter_id misinterpretation in edge case")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubeček [Thu, 29 Jun 2017 09:13:36 +0000 (11:13 +0200)]
net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish()
Recently I started seeing warnings about pages with refcount -1. The
problem was traced to packets being reused after their head was merged into
a GRO packet by skb_gro_receive(). While bisecting the issue pointed to
commit
c21b48cc1bbf ("net: adjust skb->truesize in ___pskb_trim()") and
I have never seen it on a kernel with it reverted, I believe the real
problem appeared earlier when the option to merge head frag in GRO was
implemented.
Handling NAPI_GRO_FREE_STOLEN_HEAD state was only added to GRO_MERGED_FREE
branch of napi_skb_finish() so that if the driver uses napi_gro_frags()
and head is merged (which in my case happens after the skb_condense()
call added by the commit mentioned above), the skb is reused including the
head that has been merged. As a result, we release the page reference
twice and eventually end up with negative page refcount.
To fix the problem, handle NAPI_GRO_FREE_STOLEN_HEAD in napi_frags_finish()
the same way it's done in napi_skb_finish().
Fixes:
d7e8883cfcf4 ("net: make GRO aware of skb->head_frag")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Thu, 29 Jun 2017 01:04:59 +0000 (03:04 +0200)]
bpf: prevent leaking pointer via xadd on unpriviledged
Leaking kernel addresses on unpriviledged is generally disallowed,
for example, verifier rejects the following:
0: (b7) r0 = 0
1: (18) r2 = 0xffff897e82304400
3: (7b) *(u64 *)(r1 +48) = r2
R2 leaks addr into ctx
Doing pointer arithmetic on them is also forbidden, so that they
don't turn into unknown value and then get leaked out. However,
there's xadd as a special case, where we don't check the src reg
for being a pointer register, e.g. the following will pass:
0: (b7) r0 = 0
1: (7b) *(u64 *)(r1 +48) = r0
2: (18) r2 = 0xffff897e82304400 ; map
4: (db) lock *(u64 *)(r1 +48) += r2
5: (95) exit
We could store the pointer into skb->cb, loose the type context,
and then read it out from there again to leak it eventually out
of a map value. Or more easily in a different variant, too:
0: (bf) r6 = r1
1: (7a) *(u64 *)(r10 -8) = 0
2: (bf) r2 = r10
3: (07) r2 += -8
4: (18) r1 = 0x0
6: (85) call bpf_map_lookup_elem#1
7: (15) if r0 == 0x0 goto pc+3
R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R6=ctx R10=fp
8: (b7) r3 = 0
9: (7b) *(u64 *)(r0 +0) = r3
10: (db) lock *(u64 *)(r0 +0) += r6
11: (b7) r0 = 0
12: (95) exit
from 7 to 11: R0=inv,min_value=0,max_value=0 R6=ctx R10=fp
11: (b7) r0 = 0
12: (95) exit
Prevent this by checking xadd src reg for pointer types. Also
add a couple of test cases related to this.
Fixes:
1be7f75d1668 ("bpf: enable non-root eBPF programs")
Fixes:
17a5267067f3 ("bpf: verifier (add verifier core)")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kan Liang [Thu, 29 Jun 2017 19:09:26 +0000 (12:09 -0700)]
perf/x86/intel/uncore: Fix wrong box pointer check
Should not init a NULL box. It will cause system crash.
The issue looks like caused by a typo.
This was not noticed because there is no NULL box. Also, for most
boxes, they are enabled by default. The init code is not critical.
Fixes:
fff4b87e594a ("perf/x86/intel/uncore: Make package handling more robust")
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170629190926.2456-1-kan.liang@intel.com