Nick Desaulniers [Sat, 26 Sep 2020 04:19:18 +0000 (21:19 -0700)]
lib/string.c: implement stpcpy
commit
1e1b6d63d6340764e00356873e5794225a2a03ea upstream.
LLVM implemented a recent "libcall optimization" that lowers calls to
`sprintf(dest, "%s", str)` where the return value is used to
`stpcpy(dest, str) - dest`.
This generally avoids the machinery involved in parsing format strings.
`stpcpy` is just like `strcpy` except it returns the pointer to the new
tail of `dest`. This optimization was introduced into clang-12.
Implement this so that we don't observe linkage failures due to missing
symbol definitions for `stpcpy`.
Similar to last year's fire drill with: commit
5f074f3e192f
("lib/string.c: implement a basic bcmp")
The kernel is somewhere between a "freestanding" environment (no full
libc) and "hosted" environment (many symbols from libc exist with the
same type, function signature, and semantics).
As Peter Anvin notes, there's not really a great way to inform the
compiler that you're targeting a freestanding environment but would like
to opt-in to some libcall optimizations (see pr/47280 below), rather
than opt-out.
Arvind notes, -fno-builtin-* behaves slightly differently between GCC
and Clang, and Clang is missing many __builtin_* definitions, which I
consider a bug in Clang and am working on fixing.
Masahiro summarizes the subtle distinction between compilers justly:
To prevent transformation from foo() into bar(), there are two ways in
Clang to do that; -fno-builtin-foo, and -fno-builtin-bar. There is
only one in GCC; -fno-buitin-foo.
(Any difference in that behavior in Clang is likely a bug from a missing
__builtin_* definition.)
Masahiro also notes:
We want to disable optimization from foo() to bar(),
but we may still benefit from the optimization from
foo() into something else. If GCC implements the same transform, we
would run into a problem because it is not -fno-builtin-bar, but
-fno-builtin-foo that disables that optimization.
In this regard, -fno-builtin-foo would be more future-proof than
-fno-built-bar, but -fno-builtin-foo is still potentially overkill. We
may want to prevent calls from foo() being optimized into calls to
bar(), but we still may want other optimization on calls to foo().
It seems that compilers today don't quite provide the fine grain control
over which libcall optimizations pseudo-freestanding environments would
prefer.
Finally, Kees notes that this interface is unsafe, so we should not
encourage its use. As such, I've removed the declaration from any
header, but it still needs to be exported to avoid linkage errors in
modules.
Reported-by: Sami Tolvanen <samitolvanen@google.com>
Suggested-by: Andy Lavr <andy.lavr@gmail.com>
Suggested-by: Arvind Sankar <nivedita@alum.mit.edu>
Suggested-by: Joe Perches <joe@perches.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Suggested-by: Masahiro Yamada <masahiroy@kernel.org>
Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200914161643.938408-1-ndesaulniers@google.com
Link: https://bugs.llvm.org/show_bug.cgi?id=47162
Link: https://bugs.llvm.org/show_bug.cgi?id=47280
Link: https://github.com/ClangBuiltLinux/linux/issues/1126
Link: https://man7.org/linux/man-pages/man3/stpcpy.3.html
Link: https://pubs.opengroup.org/onlinepubs/9699919799/functions/stpcpy.html
Link: https://reviews.llvm.org/D85963
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Change-Id: I01f7436f21e95dced193036fef38f95f684797d1
Stricted [Thu, 14 Apr 2022 15:57:15 +0000 (11:57 -0400)]
defconfig: exynos9610: Re-add dropped Wi-Fi AP options lost
in R rebase
* Not sure why Motorola removed these but there's 0 hope of
Wi-Fi AP working without them.
* Note: Motorola stock doesn't have these, oddly.
Change-Id: I545153e75b55bd94fe44f9f318746bed1a2cbb38
Stricted [Fri, 21 Jan 2022 17:35:35 +0000 (17:35 +0000)]
drivers: regulator: fix leading whitespace warning
Change-Id: I2d914aee0464a4edd531164908e5c0e611437658
Tim Zimmermann [Thu, 14 Jan 2021 19:44:19 +0000 (20:44 +0100)]
drivers: net: wireless: scsc: don't delete or deactivate AP
interface in slsi_del_station
* This function is meant for cleaning up going from STA to AP mode
* On latest 18.1 builds deleting/deactivating AP interface causes
hotspot not to work, hostapd starts up and system says hotspot
is on, but due to a kernel error (vif type isn't set to AP anymore)
SSID isn't actually broadcasted and clients can't find nor connect
to the hotspot
* So let's remove this as removing it doesn't seem to have any
negative consequences but fixes hotspot, also disable an
ugly warning caused by this which isn't actually a problem
Change-Id: I5a656fd38697b9be09ecdc5f344bc343458aeaaf
Al Viro [Wed, 2 Sep 2020 15:30:48 +0000 (11:30 -0400)]
fix regression in "epoll: Keep a reference on files added to the check list"
[ Upstream commit
77f4689de17c0887775bb77896f4cc11a39bf848 ]
epoll_loop_check_proc() can run into a file already committed to destruction;
we can't grab a reference on those and don't need to add them to the set for
reverse path check anyway.
Mot-CRs-fixed: (CR)
CVE-Fixed: CVE-2020-0466
Bug:
147802478
Change-Id: I24d3b8c878f5ef49f9ff5922d2364b59844fd8b8
Tested-by: Marc Zyngier <maz@kernel.org>
Fixes:
a9ed4a6560b8 ("epoll: Keep a reference on files added to the check list")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jignesh Patel <jignesh@motorola.com>
Reviewed-on: https://gerrit.mot.com/
1796974
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Stricted [Sat, 22 May 2021 18:31:15 +0000 (20:31 +0200)]
ARM64: configs: kane/troika: disable {ex,v}fat
* we have sdfat instead
Change-Id: I9e265238f32dace3cd3658c642769c12d5617250
Paul Keith [Wed, 28 Mar 2018 17:52:29 +0000 (19:52 +0200)]
fs: sdfat: Add MODULE_ALIAS_FS for supported filesystems
* This is the proper thing to do for filesystem drivers
Change-Id: I109b201d85e324cc0a72c3fcd09df4a3e1703042
Signed-off-by: Paul Keith <javelinanddart@gmail.com>
Paul Keith [Fri, 2 Mar 2018 04:10:27 +0000 (05:10 +0100)]
fs: sdfat: Add config option to register sdFAT for VFAT
Change-Id: I72ba7a14b56175535884390e8601960b5d8ed1cf
Signed-off-by: Paul Keith <javelinanddart@gmail.com>
Paul Keith [Fri, 2 Mar 2018 03:51:53 +0000 (04:51 +0100)]
fs: sdfat: Add config option to register sdFAT for exFAT
Change-Id: Id57abf0a4bd0b433fecc622eecb383cd4ea29d17
Signed-off-by: Paul Keith <javelinanddart@gmail.com>
Bruno Martins [Sun, 16 May 2021 14:15:40 +0000 (15:15 +0100)]
fs: Import sdFAT version 2.4.5
From Samsung package version T725XXU2DUD1.
Change-Id: I4e8c3317e03edc44d6ac4248231067da8c4f3873
Rob Herring [Tue, 27 Feb 2018 23:49:57 +0000 (17:49 -0600)]
kbuild: add dtc as dependency on .dtb files
If dtc is rebuilt, we should rebuild .dtb files with the new dtc.
Change-Id: I313ee55506628c0922161541adf293735cb4edae
Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Rob Herring <robh@kernel.org>
Bruno Martins [Mon, 30 Nov 2020 12:30:21 +0000 (12:30 +0000)]
scripts: Silence all dtc warnings seen with Samsung DTS
It is insane to fix them all, so just ignore.
If needed, can be re-enabled by building with 'W=1'.
Change-Id: Ic56bc08b255342fc19f5872463552be13375cceb
Jesse Chan [Sat, 21 Apr 2018 07:08:51 +0000 (00:08 -0700)]
power: supply: s2mu00x: export {CURRENT/VOLTAGE}_MAX to sysfs
Change-Id: I54c775bb80c2151bdc69ea9fb53a48a34327bbef
Brian Norris [Thu, 15 Nov 2018 02:11:18 +0000 (18:11 -0800)]
scripts/setlocalversion: Improve -dirty check with git-status --no-optional-locks
[ Upstream commit
ff64dd4857303dd5550faed9fd598ac90f0f2238 ]
git-diff-index does not refresh the index for you, so using it for a
"-dirty" check can give misleading results. Commit
6147b1cf19651
("scripts/setlocalversion: git: Make -dirty check more robust") tried to
fix this by switching to git-status, but it overlooked the fact that
git-status also writes to the .git directory of the source tree, which
is definitely not kosher for an out-of-tree (O=) build. That is getting
reverted.
Fortunately, git-status now supports avoiding writing to the index via
the --no-optional-locks flag, as of git 2.14. It still calculates an
up-to-date index, but it avoids writing it out to the .git directory.
So, let's retry the solution from commit
6147b1cf19651 using this new
flag first, and if it fails, we assume this is an older version of git
and just use the old git-diff-index method.
It's hairy to get the 'grep -vq' (inverted matching) correct by stashing
the output of git-status (you have to be careful about the difference
betwen "empty stdin" and "blank line on stdin"), so just pipe the output
directly to grep and use a regex that's good enough for both the
git-status and git-diff-index version.
Change-Id: Ieb6e2ff2db99c081b17332136db860260d165385
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Suggested-by: Alexander Kapshuk <alexander.kapshuk@gmail.com>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Tested-by: Genki Sky <sky@genki.is>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Stricted [Fri, 21 Jan 2022 17:24:54 +0000 (17:24 +0000)]
kane/troika: disable CONFIG_SCSC_LOG_COLLECTION
* causes random kernel crashes
Change-Id: I5348c4dfba02b1fabb65e124b32f850b8138c197
Stricted [Fri, 14 Jan 2022 07:57:47 +0000 (07:57 +0000)]
wireless: scsc: fix CONFIG_SCSC_LOG_COLLECTION ifdefs
Change-Id: I280d122ffa74a201027bcab61208bf6f5a568d89
Stricted [Mon, 21 Sep 2020 21:09:47 +0000 (21:09 +0000)]
media: radio: s610: fix indentation warning
Khusika Dhamar Gusti [Fri, 13 Sep 2019 16:05:32 +0000 (23:05 +0700)]
input: touchscreen: nt36xxx: fix snprintf warning
Fixes the following clang warning:
../drivers/input/touchscreen/nt36xxx/nt36xxx_mp_ctrlram.c:1293:3: error: 'snprintf' size argument is too large; destination buffer has size 32, $
snprintf(mpcriteria, PAGE_SIZE, "novatek-mp-criteria-%04X", ts->nvt_pid);
^
1 error generated.
Change-Id: Ieb3331a226cb6e183aaac2f22e7e8dccf27af7fc
Signed-off-by: Khusika Dhamar Gusti <mail@khusika.com>
Stricted [Wed, 9 Sep 2020 10:14:47 +0000 (10:14 +0000)]
exynos9609: do not mount oem partition
Change-Id: I98ea0ff45d6c3bd300439d93687c98c6deeab2e0
Daniel Micay [Tue, 16 May 2017 21:51:48 +0000 (17:51 -0400)]
add toggle for disabling newly added USB devices
Based on the public grsecurity patches.
Change-Id: I2cbea91b351cda7d098f4e1aa73dff1acbd23cce
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Stricted [Fri, 21 Jan 2022 17:23:12 +0000 (17:23 +0000)]
kane/troika: disable CONFIG_SCSC_WLAN_KEY_MGMT_OFFLOAD
* our wpa_supplicant has no support for this
David Ng [Sun, 21 Dec 2014 20:53:22 +0000 (12:53 -0800)]
arm64: Add 32-bit sigcontext definition to uapi signcontext.h
The arm64 uapi sigcontext.h can be included by 32-bit userspace
modules. Since arm and arm64 sigcontext definition are not
compatible, add arm sigcontext definition to arm64 sigcontext.h.
Change-Id: I94109b094f6c8376fdaeb2822d7b26d18ddfb2bc
Signed-off-by: David Ng <dave@codeaurora.org>
Stricted [Wed, 28 Aug 2019 15:26:26 +0000 (15:26 +0000)]
gud: fix mobicore initialization
* backported from s9
Change-Id: I48476e899495490ded64a9e173e3daa3c4cdafa0
LuK1337 [Tue, 11 Dec 2018 08:50:01 +0000 (09:50 +0100)]
Android: Add empty Android.mk file
* This prevents inclusion of drivers/staging/greybus/tools/Android.mk
which will conflict in case we have more than 1 kernel tree in AOSP
source dir.
Change-Id: I335bca7b6d6463b1ffc673ab5367603347516e13
Sami Tolvanen [Mon, 1 Oct 2018 23:21:22 +0000 (16:21 -0700)]
ANDROID: arm64: kbuild: only specify code model with LTO for
modules
This fixes CONFIG_LTO_CLANG for LLVM >= r334571 where LLVMgold actually
started respecting the code model flag.
Bug:
116819139
Bug:
119418330
Test: builds with LLVM r334571, boots on a device, all modules load
Change-Id: I1af9d24ed1b789a40ac1e85bd00e176ff8a68aa5
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Miguel de Dios <migueldedios@google.com>
Sami Tolvanen [Thu, 4 Oct 2018 15:56:16 +0000 (08:56 -0700)]
ANDROID: modpost: add an exception for CFI stubs
When CONFIG_CFI_CLANG is enabled, LLVM renames all address taken
functions by appending a .cfi postfix to their names, and creates
function stubs with the original names. The compiler always injects
these stubs to the text section, even if the function itself is
placed into init or exit sections, which creates modpost warnings.
This commit adds a modpost exception for CFI stubs to prevent the
warnings.
Bug:
117237524
Change-Id: Ieb8bf20d0c3ad7b7295c535f598370220598cdb0
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Michael Benedict [Sat, 29 Feb 2020 16:41:42 +0000 (03:41 +1100)]
drivers: soc: cal-if: create new macro to handle NULL return
some of the func have NULL's which will return the size of the pointer instead which is ((void*) 0), because of that we need to make a new macro to set it to 0
Signed-off-by: Michael Benedict <michaelbt@live.com>
Change-Id: I3ab61542330c7b13cd165dfb48d453cff005884d
Stricted [Sun, 3 May 2020 04:33:51 +0000 (04:33 +0000)]
fimc-is2: fix sizeof pointer use
Change-Id: Ie53e42ab4527647e4435449b738a6b869508160f
Stricted [Sat, 25 Apr 2020 23:53:50 +0000 (23:53 +0000)]
sound: soc: codecs: cs35l41: fix variable declaration
Change-Id: Ief5c96b73b59670c71025b879776a50b9b435d38
Stricted [Sat, 25 Apr 2020 23:52:51 +0000 (23:52 +0000)]
fimc-is2: fix variable declarations
Change-Id: I99a5c03664346d74fc4fe93278121c52cbe3289c
Stricted [Fri, 24 Apr 2020 17:50:57 +0000 (17:50 +0000)]
remove libdss from Makefile
* what is this even doing?
Stricted [Fri, 21 Jan 2022 17:21:42 +0000 (17:21 +0000)]
kane/troika: enable last_kmsg
Stricted [Thu, 23 Apr 2020 16:55:05 +0000 (16:55 +0000)]
add kane defconfig
Stricted [Thu, 23 Apr 2020 16:50:09 +0000 (16:50 +0000)]
add troika defconfig
Stricted [Mon, 20 Apr 2020 13:09:23 +0000 (13:09 +0000)]
implement samsung last_kmsg
Peter Zijlstra [Wed, 4 Mar 2020 10:28:31 +0000 (11:28 +0100)]
[RAMEN9610-22013]futex: Fix inode life-time issue
commit
8019ad13ef7f64be44d4f892af9c840179009254 upstream.
As reported by Jann, ihold() does not in fact guarantee inode
persistence. And instead of making it so, replace the usage of inode
pointers with a per boot, machine wide, unique inode identifier.
This sequence number is global, but shared (file backed) futexes are
rare enough that this should not become a performance issue.
Change-Id: I71ce183a546dc536fddd1c1f982fa96a577e5a2d
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Will McVicker [Sat, 5 Dec 2020 00:48:48 +0000 (00:48 +0000)]
[RAMEN9610-22013]HID: make arrays usage and value to be the same
commit
ed9be64eefe26d7d8b0b5b9fa3ffdf425d87a01f upstream.
The HID subsystem allows an "HID report field" to have a different
number of "values" and "usages" when it is allocated. When a field
struct is created, the size of the usage array is guaranteed to be at
least as large as the values array, but it may be larger. This leads to
a potential out-of-bounds write in
__hidinput_change_resolution_multipliers() and an out-of-bounds read in
hidinput_count_leds().
To fix this, let's make sure that both the usage and value arrays are
the same size.
Change-Id: I96857aa2e17b88d900a50aa16a016f4aca6b1fa5
Cc: stable@vger.kernel.org
Signed-off-by: Will McVicker <willmcvicker@google.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jann Horn [Thu, 3 Dec 2020 01:25:04 +0000 (02:25 +0100)]
[RAMEN9610-22009]tty: Fix ->pgrp locking in tiocspgrp()
commit
54ffccbf053b5b6ca4f6e45094b942fab92a25fc upstream.
tiocspgrp() takes two tty_struct pointers: One to the tty that userspace
passed to ioctl() (`tty`) and one to the TTY being changed (`real_tty`).
These pointers are different when ioctl() is called with a master fd.
To properly lock real_tty->pgrp, we must take real_tty->ctrl_lock.
This bug makes it possible for racing ioctl(TIOCSPGRP, ...) calls on
both sides of a PTY pair to corrupt the refcount of `struct pid`,
leading to use-after-free errors.
Change-Id: I000d97f52d398cdf171fb8fb53d9b91456e9dc4c
Fixes:
47f86834bbd4 ("redo locking of tty->pgrp")
CC: stable@kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Thu, 15 Oct 2020 18:42:00 +0000 (11:42 -0700)]
[RAMEN9610-21992]icmp: randomize the global rate limiter
[ Upstream commit
b38e7819cae946e2edf869e604af1e65a5d241c5 ]
Keyu Man reported that the ICMP rate limiter could be used
by attackers to get useful signal. Details will be provided
in an upcoming academic publication.
Our solution is to add some noise, so that the attackers
no longer can get help from the predictable token bucket limiter.
Fixes:
4cdf507d5452 ("icmp: add a global rate limitation")
Change-Id: Ib1a0ee70e55123a17d88b79a83664a3b2ac70034
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Keyu Man <kman001@ucr.edu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Andrea Arcangeli [Wed, 27 May 2020 23:06:24 +0000 (19:06 -0400)]
[RAMEN9610-21992]mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()
commit
c444eb564fb16645c172d550359cb3d75fe8a040 upstream.
Write protect anon page faults require an accurate mapcount to decide
if to break the COW or not. This is implemented in the THP path with
reuse_swap_page() ->
page_trans_huge_map_swapcount()/page_trans_huge_mapcount().
If the COW triggers while the other processes sharing the page are
under a huge pmd split, to do an accurate reading, we must ensure the
mapcount isn't computed while it's being transferred from the head
page to the tail pages.
reuse_swap_cache() already runs serialized by the page lock, so it's
enough to add the page lock around __split_huge_pmd_locked too, in
order to add the missing serialization.
Note: the commit in "Fixes" is just to facilitate the backporting,
because the code before such commit didn't try to do an accurate THP
mapcount calculation and it instead used the page_count() to decide if
to COW or not. Both the page_count and the pin_count are THP-wide
refcounts, so they're inaccurate if used in
reuse_swap_page(). Reverting such commit (besides the unrelated fix to
the local anon_vma assignment) would have also opened the window for
memory corruption side effects to certain workloads as documented in
such commit header.
Change-Id: I702663cf38fd55f76fa2772e4b1cc8b1d0a45a0a
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Jann Horn <jannh@google.com>
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes:
6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages during WP faults")
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jason Yan [Tue, 16 Jun 2020 12:16:55 +0000 (20:16 +0800)]
[RAMEN9610-21992]block: Fix use-after-free in blkdev_get()
[ Upstream commit
2d3a8e2deddea6c89961c422ec0c5b851e648c14 ]
In blkdev_get() we call __blkdev_get() to do some internal jobs and if
there is some errors in __blkdev_get(), the bdput() is called which
means we have released the refcount of the bdev (actually the refcount of
the bdev inode). This means we cannot access bdev after that point. But
acctually bdev is still accessed in blkdev_get() after calling
__blkdev_get(). This results in use-after-free if the refcount is the
last one we released in __blkdev_get(). Let's take a look at the
following scenerio:
CPU0 CPU1 CPU2
blkdev_open blkdev_open Remove disk
bd_acquire
blkdev_get
__blkdev_get del_gendisk
bdev_unhash_inode
bd_acquire bdev_get_gendisk
bd_forget failed because of unhashed
bdput
bdput (the last one)
bdev_evict_inode
access bdev => use after free
[ 459.350216] BUG: KASAN: use-after-free in __lock_acquire+0x24c1/0x31b0
[ 459.351190] Read of size 8 at addr
ffff88806c815a80 by task syz-executor.0/20132
[ 459.352347]
[ 459.352594] CPU: 0 PID: 20132 Comm: syz-executor.0 Not tainted 4.19.90 #2
[ 459.353628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 459.354947] Call Trace:
[ 459.355337] dump_stack+0x111/0x19e
[ 459.355879] ? __lock_acquire+0x24c1/0x31b0
[ 459.356523] print_address_description+0x60/0x223
[ 459.357248] ? __lock_acquire+0x24c1/0x31b0
[ 459.357887] kasan_report.cold+0xae/0x2d8
[ 459.358503] __lock_acquire+0x24c1/0x31b0
[ 459.359120] ? _raw_spin_unlock_irq+0x24/0x40
[ 459.359784] ? lockdep_hardirqs_on+0x37b/0x580
[ 459.360465] ? _raw_spin_unlock_irq+0x24/0x40
[ 459.361123] ? finish_task_switch+0x125/0x600
[ 459.361812] ? finish_task_switch+0xee/0x600
[ 459.362471] ? mark_held_locks+0xf0/0xf0
[ 459.363108] ? __schedule+0x96f/0x21d0
[ 459.363716] lock_acquire+0x111/0x320
[ 459.364285] ? blkdev_get+0xce/0xbe0
[ 459.364846] ? blkdev_get+0xce/0xbe0
[ 459.365390] __mutex_lock+0xf9/0x12a0
[ 459.365948] ? blkdev_get+0xce/0xbe0
[ 459.366493] ? bdev_evict_inode+0x1f0/0x1f0
[ 459.367130] ? blkdev_get+0xce/0xbe0
[ 459.367678] ? destroy_inode+0xbc/0x110
[ 459.368261] ? mutex_trylock+0x1a0/0x1a0
[ 459.368867] ? __blkdev_get+0x3e6/0x1280
[ 459.369463] ? bdev_disk_changed+0x1d0/0x1d0
[ 459.370114] ? blkdev_get+0xce/0xbe0
[ 459.370656] blkdev_get+0xce/0xbe0
[ 459.371178] ? find_held_lock+0x2c/0x110
[ 459.371774] ? __blkdev_get+0x1280/0x1280
[ 459.372383] ? lock_downgrade+0x680/0x680
[ 459.373002] ? lock_acquire+0x111/0x320
[ 459.373587] ? bd_acquire+0x21/0x2c0
[ 459.374134] ? do_raw_spin_unlock+0x4f/0x250
[ 459.374780] blkdev_open+0x202/0x290
[ 459.375325] do_dentry_open+0x49e/0x1050
[ 459.375924] ? blkdev_get_by_dev+0x70/0x70
[ 459.376543] ? __x64_sys_fchdir+0x1f0/0x1f0
[ 459.377192] ? inode_permission+0xbe/0x3a0
[ 459.377818] path_openat+0x148c/0x3f50
[ 459.378392] ? kmem_cache_alloc+0xd5/0x280
[ 459.379016] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 459.379802] ? path_lookupat.isra.0+0x900/0x900
[ 459.380489] ? __lock_is_held+0xad/0x140
[ 459.381093] do_filp_open+0x1a1/0x280
[ 459.381654] ? may_open_dev+0xf0/0xf0
[ 459.382214] ? find_held_lock+0x2c/0x110
[ 459.382816] ? lock_downgrade+0x680/0x680
[ 459.383425] ? __lock_is_held+0xad/0x140
[ 459.384024] ? do_raw_spin_unlock+0x4f/0x250
[ 459.384668] ? _raw_spin_unlock+0x1f/0x30
[ 459.385280] ? __alloc_fd+0x448/0x560
[ 459.385841] do_sys_open+0x3c3/0x500
[ 459.386386] ? filp_open+0x70/0x70
[ 459.386911] ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 459.387610] ? trace_hardirqs_off_caller+0x55/0x1c0
[ 459.388342] ? do_syscall_64+0x1a/0x520
[ 459.388930] do_syscall_64+0xc3/0x520
[ 459.389490] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 459.390248] RIP: 0033:0x416211
[ 459.390720] Code: 75 14 b8 02 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83
04 19 00 00 c3 48 83 ec 08 e8 0a fa ff ff 48 89 04 24 b8 02 00 00 00 0f
05 <48> 8b 3c 24 48 89 c2 e8 53 fa ff ff 48 89 d0 48 83 c4 08 48 3d
01
[ 459.393483] RSP: 002b:
00007fe45dfe9a60 EFLAGS:
00000293 ORIG_RAX:
0000000000000002
[ 459.394610] RAX:
ffffffffffffffda RBX:
00007fe45dfea6d4 RCX:
0000000000416211
[ 459.395678] RDX:
00007fe45dfe9b0a RSI:
0000000000000002 RDI:
00007fe45dfe9b00
[ 459.396758] RBP:
000000000076bf20 R08:
0000000000000000 R09:
000000000000000a
[ 459.397930] R10:
0000000000000075 R11:
0000000000000293 R12:
00000000ffffffff
[ 459.399022] R13:
0000000000000bd9 R14:
00000000004cdb80 R15:
000000000076bf2c
[ 459.400168]
[ 459.400430] Allocated by task 20132:
[ 459.401038] kasan_kmalloc+0xbf/0xe0
[ 459.401652] kmem_cache_alloc+0xd5/0x280
[ 459.402330] bdev_alloc_inode+0x18/0x40
[ 459.402970] alloc_inode+0x5f/0x180
[ 459.403510] iget5_locked+0x57/0xd0
[ 459.404095] bdget+0x94/0x4e0
[ 459.404607] bd_acquire+0xfa/0x2c0
[ 459.405113] blkdev_open+0x110/0x290
[ 459.405702] do_dentry_open+0x49e/0x1050
[ 459.406340] path_openat+0x148c/0x3f50
[ 459.406926] do_filp_open+0x1a1/0x280
[ 459.407471] do_sys_open+0x3c3/0x500
[ 459.408010] do_syscall_64+0xc3/0x520
[ 459.408572] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 459.409415]
[ 459.409679] Freed by task 1262:
[ 459.410212] __kasan_slab_free+0x129/0x170
[ 459.410919] kmem_cache_free+0xb2/0x2a0
[ 459.411564] rcu_process_callbacks+0xbb2/0x2320
[ 459.412318] __do_softirq+0x225/0x8ac
Fix this by delaying bdput() to the end of blkdev_get() which means we
have finished accessing bdev.
Fixes:
77ea887e433a ("implement in-kernel gendisk events handling")
Change-Id: I9ec1b9ea3520f3e1374b95cee48ac8cc38476117
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Kalesh Singh [Mon, 11 Jan 2021 06:26:18 +0000 (01:26 -0500)]
[RAMEN9610-21968]ANDROID: xt_qtaguid: Remove tag_entry from process list on untag
A sock_tag_entry can only be part of one process's
pqd_entry->sock_tag_list. RetagGing the socket only updates
sock_tag_entry->tag, and does not add the tag entry to the current
process's pqd_entry list, nor update sock_tag_entry->pid.
So the sock_tag_entry is only ever present in the
pqd_entry list of the process that initially tagged the socket.
A sock_tag_entry can also get created and not be added to any process's
pqd_entry list. This happens if the process that initially tags the
socket has not opened /dev/xt_qtaguid.
ctrl_cmd_untag() supports untagGing from a context other than the
process that initially tagged the socket. Currently, the sock_tag_entry is
only removed from its containing pqd_entry->sock_tag_list if the
process that does the untagGing has opened /dev/xt_qtaguid. However, the
tag entry should always be deleted from its pqd entry list (if present).
Bug:
176919394
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I5b6f0c36c0ebefd98cc6873a4057104c7d885ccc
(cherry picked from commit
c2ab93b45b5cdc426868fb8793ada2cac20568ef)
Nick Desaulniers [Fri, 3 Mar 2017 23:40:12 +0000 (15:40 -0800)]
tracing: do not leak kernel addresses
This likely breaks tracing tools like trace-cmd. It logs in the same
format but now addresses are all 0x0.
Bug:
34277115
MOT-CRs-fixed: (CR)
Change-Id: Ifb0d4d2a184bf0d95726de05b1acee0287a375d9
Reviewed-on: https://gerrit.mot.com/
1887836
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
sunyue5 [Wed, 30 Dec 2020 09:54:19 +0000 (17:54 +0800)]
arm/config: support for hid-nintendo driver
CTS will test this device so we should enable it in the
kernel config
CTS cases:
run cts -m CtsHardwareTestCases -t
android.hardware.input.cts.tests.NintendoSwitchProTest#testAllKeys
android.hardware.input.cts.tests.NintendoSwitchProTest#testAllMotions
Change-Id: I4c0f9dcb468fbfa629ae82b984ff55dc170a1acc
Signed-off-by: sunyue5 <sunyue5@motorola.com>
Reviewed-on: https://gerrit.mot.com/
1839437
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
sunyue5 [Fri, 11 Dec 2020 01:25:53 +0000 (09:25 +0800)]
arm64/defconfig: Disable CONFIG_SCSC_WLAN_WIFI_SHARING
SCSC_WLAN_WIFI_SHARING is to enable APSTA function which means
the device can start wifi hotspot with wifi station enabled.
In this case, wlan1 will be for the AP mode while wlan0 will be
for the STA mode.
Disable this function since we have disabled it in the frameworks
Change-Id: Icd0d10badec991b3235735c0e53de3214368ba87
Signed-off-by: sunyue5 <sunyue5@motorola.com>
Reviewed-on: https://gerrit.mot.com/
1825115
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
lulei1 [Fri, 27 Nov 2020 06:01:27 +0000 (14:01 +0800)]
enable blkio for exynos9610
IO cgroup should be enabled to restrict background IO R/W.
Change-Id: I4736eb38d208a23afb183c27a5d90296e79e5f39
Reviewed-on: https://gerrit.mot.com/
1812665
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Guolin Wang <wanggl3@mt.com>
Submit-Approved: Jira Key
zhaoxp3 [Fri, 13 Nov 2020 07:25:04 +0000 (15:25 +0800)]
arm64/defconfig: sync Q defconfig
1. sync kane defconfig change
2. pick wifi change
RAMEN9610-21775]wlbt: Enable Wifi Configurations in Defconfig.
Enable Wifi Configurations in Defconfi
Change-Id: Iea3d6c430f6b3a039c7124b5fa2a331c15bad27c
Signed-off-by: zhaoxp3 <zhaoxp3@motorola.com>
Jann Horn [Sat, 30 Mar 2019 02:12:32 +0000 (03:12 +0100)]
UPSTREAM: signal: don't silently convert SI_USER signals to non-current pidfd
The current sys_pidfd_send_signal() silently turns signals with explicit
SI_USER context that are sent to non-current tasks into signals with
kernel-generated siGinfo.
This is unlike do_rt_sigqueueinfo(), which returns -EPERM in this case.
If a user actually wants to send a signal with kernel-provided siGinfo,
they can do that with pidfd_send_signal(pidfd, sig, NULL, 0); so allowing
this case is unnecessary.
Instead of silently replacing the siGinfo, just bail out with an error;
this is consistent with other interfaces and avoids special-casing behavior
based on security checks.
Fixes:
3eb39f47934f ("signal: add pidfd_send_signal() syscall")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit
556a888a14afe27164191955618990fb3ccc9aad)
Mot-CRs-fixed: (CR)
Bug:
135608568
Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL
Change-Id: I493af671b82c43bff1425ee24550d2fb9aa6d961
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/
1505848
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/
1796156
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Joel Fernandes (Google) [Tue, 30 Apr 2019 16:21:53 +0000 (12:21 -0400)]
UPSTREAM: pidfd: add polling support
This patch adds polling support to pidfd.
Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, since the wrong PID is now possibly checked for existence.
Using the polling support, LMK will be able to get notified when a process
exists in race-free and fast way, and allows the LMK to do other things
(such as by polling on other fds) while awaiting the process being killed
to die.
For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.
We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
it in task_struct would not be option in this case because the task can
be reaped and destroyed before the poll returns.
2. By including the struct pid for the waitqueue means that during
de_thread(), the new thread group leader automatically gets the new
waitqueue/pid even though its task_struct is different.
Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@android.com
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Co-developed-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit
b53b0b9d9a613c418057f6cb921c2f40a6f78c24)
Mot-CRs-fixed: (CR)
Bug:
135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I02f259d2875bec46b198d580edfbb067f077084e
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/
1505855
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/
1796163
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Christian Brauner [Wed, 17 Apr 2019 20:50:25 +0000 (22:50 +0200)]
BACKPORT: signal: support CLONE_PIDFD with pidfd_send_signal
Let pidfd_send_signal() use pidfds retrieved via CLONE_PIDFD. With this
patch pidfd_send_signal() becomes independent of procfs. This fullfils
the request made when we merged the pidfd_send_signal() patchset. The
pidfd_send_signal() syscall is now always available allowing for it to
be used by users without procfs mounted or even users without procfs
support compiled into the kernel.
Signed-off-by: Christian Brauner <christian@brauner.io>
Co-developed-by: Jann Horn <jannh@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit
2151ad1b067275730de1b38c7257478cae47d29e)
Conflicts:
kernel/sys_ni.c
(1. Replaced COND_SYSCALL with cond_syscall.)
Mot-CRs-fixed: (CR)
Bug:
135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I621fe6547397e0e68c560d7da60ef7715deb290c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/
1505852
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/
1796160
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Christian Brauner [Fri, 10 May 2019 09:53:46 +0000 (11:53 +0200)]
UPSTREAM: fork: do not release lock that wasn't taken
Avoid calling cgroup_threadgroup_change_end() without having called
cgroup_threadgroup_change_beGin() first.
During process creation we need to check whether the cgroup we are in
allows us to fork. To perform this check the cgroup needs to guard itself
against threadgroup changes and takes a lock.
Prior to CLONE_PIDFD the cleanup target "bad_fork_free_pid" would also need
to call cgroup_threadgroup_change_end() because said lock had already been
taken.
However, this is not the case anymore with the addition of CLONE_PIDFD. We
are now allocating a pidfd before we check whether the cgroup we're in can
fork and thus prior to taking the lock. So when copy_process() fails at the
right step it would release a lock we haven't taken.
This bug is not even very subtle to be honest. It's just not very clear
from the naming of cgroup_threadgroup_change_{beGin,end}() that a lock is
taken.
Here's the relevant splat:
entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7fec849
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90
90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:
00000000ffed5a8c EFLAGS:
00000246 ORIG_RAX:
0000000000000078
RAX:
ffffffffffffffda RBX:
0000000000003ffc RCX:
0000000000000000
RDX:
00000000200005c0 RSI:
0000000000000000 RDI:
0000000000000000
RBP:
0000000000000012 R08:
0000000000000000 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000000 R12:
0000000000000000
R13:
0000000000000000 R14:
0000000000000000 R15:
0000000000000000
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(depth <= 0)
WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052 __lock_release
kernel/locking/lockdep.c:4052 [inline]
WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052
lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 7744 Comm: syz-executor007 Not tainted 5.1.0+ #4
Hardware name: Google Google Compute EnGine/Google Compute EnGine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
panic+0x2cb/0x65c kernel/panic.c:214
__warn.cold+0x20/0x45 kernel/panic.c:566
report_bug+0x263/0x2b0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:179 [inline]
fixup_bug arch/x86/kernel/traps.c:174 [inline]
do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:972
RIP: 0010:__lock_release kernel/locking/lockdep.c:4052 [inline]
RIP: 0010:lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321
Code: 0f 85 a0 03 00 00 8b 35 77 66 08 08 85 f6 75 23 48 c7 c6 a0 55 6b 87
48 c7 c7 40 25 6b 87 4c 89 85 70 ff ff ff e8 b7 a9 eb ff <0f> 0b 4c 8b 85
70 ff ff ff 4c 89 ea 4c 89 e6 4c 89 c7 e8 52 63 ff
RSP: 0018:
ffff888094117b48 EFLAGS:
00010086
RAX:
0000000000000000 RBX:
1ffff11012822f6f RCX:
0000000000000000
RDX:
0000000000000000 RSI:
ffffffff815af236 RDI:
ffffed1012822f5b
RBP:
ffff888094117c00 R08:
ffff888092bfc400 R09:
fffffbfff113301d
R10:
fffffbfff113301c R11:
ffffffff889980e3 R12:
ffffffff8a451df8
R13:
ffffffff8142e71f R14:
ffffffff8a44cc80 R15:
ffff888094117bd8
percpu_up_read.constprop.0+0xcb/0x110 include/linux/percpu-rwsem.h:92
cgroup_threadgroup_change_end include/linux/cgroup-defs.h:712 [inline]
copy_process.part.0+0x47ff/0x6710 kernel/fork.c:2222
copy_process kernel/fork.c:1772 [inline]
_do_fork+0x25d/0xfd0 kernel/fork.c:2338
__do_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:240 [inline]
__se_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:236 [inline]
__ia32_compat_sys_x86_clone+0xbc/0x140 arch/x86/ia32/sys_ia32.c:236
do_syscall_32_irqs_on arch/x86/entry/common.c:334 [inline]
do_fast_syscall_32+0x281/0xd54 arch/x86/entry/common.c:405
entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7fec849
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90
90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:
00000000ffed5a8c EFLAGS:
00000246 ORIG_RAX:
0000000000000078
RAX:
ffffffffffffffda RBX:
0000000000003ffc RCX:
0000000000000000
RDX:
00000000200005c0 RSI:
0000000000000000 RDI:
0000000000000000
RBP:
0000000000000012 R08:
0000000000000000 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000000 R12:
0000000000000000
R13:
0000000000000000 R14:
0000000000000000 R15:
0000000000000000
Kernel Offset: disabled
Rebooting in 86400 seconds..
Reported-and-tested-by: syzbot+3286e58549edc479faae@syzkaller.appspotmail.com
Fixes:
b3e583825266 ("clone: add CLONE_PIDFD")
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit
c3b7112df86b769927a60a6d7175988ca3d60f09)
Mot-CRs-fixed: (CR)
Bug:
135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: Ib9ecb1e5c0c6e2d062b89c25109ec571570eb497
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/
1505853
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/
1796161
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Christian Brauner [Wed, 27 Mar 2019 12:04:15 +0000 (13:04 +0100)]
BACKPORT: clone: add CLONE_PIDFD
This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call. Linus oriGinally suggested to implement this as a
new flag to clone() instead of making it a separate system call. As
spotted by Linus, there is exactly one bit for clone() left.
CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api. They serve as a simple opaque handle on pids. Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending). Thus, a
pidfd cannot just refer to a tgid, but also a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege. This does not imply it
cannot ever be that way but for now this is not the case.
A pidfd comes with additional information in fdinfo if the kernel supports
procfs. The fdinfo file contains the pid of the process in the callers
pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".
As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
parent_tidptr argument of clone. This has the advantage that we can
give back the associated pid and the pidfd at the same time.
To remove worries about missing metadata access this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>. The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Co-developed-by: Jann Horn <jannh@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit
b3e5838252665ee4cfa76b82bdf1198dca81e5be)
Conflicts:
kernel/fork.c
(1. Replaced proc_pid_ns() with its direct implementation.)
Mot-CRs-fixed: (CR)
Bug:
135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I3c804a92faea686e5bf7f99df893fe3a5d87ddf7
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/
1505851
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/
1796159
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
David Howells [Mon, 5 Nov 2018 17:40:31 +0000 (17:40 +0000)]
UPSTREAM: Make anon_inodes unconditional
Make the anon_inodes facility unconditional so that it can be used by core
VFS code.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit
dadd2299ab61fc2b55b95b7b3a8f674cdd3b69c9)
Mot-CRs-fixed: (CR)
Bug:
135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I2f97bda4f360d8d05bbb603de839717b3d8067ae
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/
1505850
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/
1796158
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Christian Brauner [Fri, 24 May 2019 10:43:51 +0000 (12:43 +0200)]
BACKPORT: pid: add pidfd_open()
This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:
int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
With the introduction of pidfds through CLONE_PIDFD it is possible to
created pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller can currently not create a pollable pidfd. This is a problem for
Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this api. With pidfd_open() making it possible to retrieve
pidfds for PID-based processes we enable them to adopt this api.
In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
(cherry picked from commit
32fcb426ec001cb6d5a4a195091a8486ea77e2df)
Conflicts:
kernel/pid.c
(1. Replaced PIDTYPE_TGID with PIDTYPE_PID and thread_group_leader() check in pidfd_open() call)
Mot-CRs-fixed: (CR)
Bug:
135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I52a93a73722d7f7754dae05f63b94b4ca4a71a75
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/
1505856
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/
1796165
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Suren Baghdasaryan [Wed, 17 Jul 2019 17:21:00 +0000 (13:21 -0400)]
UPSTREAM: pidfd: fix a poll race when setting exit_state
There is a race between reading task->exit_state in pidfd_poll and
writing it after do_notify_parent calls do_notify_pidfd. Expected
sequence of events is:
CPU 0 CPU 1
------------------------------------------------
exit_notify
do_notify_parent
do_notify_pidfd
tsk->exit_state = EXIT_DEAD
pidfd_poll
if (tsk->exit_state)
However nothing prevents the following sequence:
CPU 0 CPU 1
------------------------------------------------
exit_notify
do_notify_parent
do_notify_pidfd
pidfd_poll
if (tsk->exit_state)
tsk->exit_state = EXIT_DEAD
This causes a polling task to wait forever, since poll blocks because
exit_state is 0 and the waiting task is not notified again. A stress
test continuously doing pidfd poll and process exits uncovered this bug.
To fix it, we make sure that the task's exit_state is always set before
calling do_notify_pidfd.
Fixes:
b53b0b9d9a6 ("pidfd: add polling support")
Cc: kernel-team@android.com
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20190717172100.261204-1-joel@joelfernandes.org
[christian@brauner.io: adapt commit message and drop unneeded changes from wait_task_zombie]
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit
b191d6491be67cef2b3fa83015561caca1394ab9)
Mot-CRs-fixed: (CR)
Bug:
135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I043e54c9b69f25de88f6f19ae167920af8532de2
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/
1505858
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/
1796167
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
huangzq2 [Fri, 6 Nov 2020 11:10:37 +0000 (19:10 +0800)]
Enable PSI
Change-Id: I02a9d4b95606250a6727479ed1a1be01dd208e50
Signed-off-by: huangzq2 <huangzq2@motorola.com>
Reviewed-on: https://gerrit.mot.com/
1796168
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Christian Brauner [Tue, 4 Jun 2019 13:18:43 +0000 (15:18 +0200)]
UPSTREAM: signal: improve comments
Improve the comments for pidfd_send_signal().
First, the comment still referred to a file descriptor for a process as a
"task file descriptor" which stems from way back at the beGinning of the
discussion. Replace this with "pidfd" for consistency.
Second, the wording for the explanation of the arguments to the syscall
was a bit inconsistent, e.g. some used the past tense some used present
tense. Make the wording more consistent.
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit
c732327f04a3818f35fa97d07b1d64d31b691d78)
Mot-CRs-fixed: (CR)
Bug:
135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I06c6bdd1dddaeb8ac75a78dd21f9cdd0dc139a4c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/
1505854
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/
1796162
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Christian Brauner [Fri, 24 May 2019 10:44:59 +0000 (12:44 +0200)]
BACKPORT: arch: wire-up pidfd_open()
This wires up the pidfd_open() syscall into all arches at once.
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-m68k@lists.linux-m68k.org
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-arch@vger.kernel.org
Cc: x86@kernel.org
(cherry picked from commit
7615d9e1780e26e0178c93c55b73309a5dc093d7)
Conflicts:
arch/alpha/kernel/syscalls/syscall.tbl
arch/arm/tools/syscall.tbl
arch/ia64/kernel/syscalls/syscall.tbl
arch/m68k/kernel/syscalls/syscall.tbl
arch/microblaze/kernel/syscalls/syscall.tbl
arch/mips/kernel/syscalls/syscall_n32.tbl
arch/mips/kernel/syscalls/syscall_n64.tbl
arch/mips/kernel/syscalls/syscall_o32.tbl
arch/parisc/kernel/syscalls/syscall.tbl
arch/powerpc/kernel/syscalls/syscall.tbl
arch/s390/kernel/syscalls/syscall.tbl
arch/sh/kernel/syscalls/syscall.tbl
arch/sparc/kernel/syscalls/syscall.tbl
arch/xtensa/kernel/syscalls/syscall.tbl
arch/x86/entry/syscalls/syscall_32.tbl
arch/x86/entry/syscalls/syscall_64.tbl
(1. Skipped syscall.tbl modifications for missing architectures.
2. Removed __ia32_sys_pidfd_open in arch/x86/entry/syscalls/syscall_32.tbl.
3. Replaced __x64_sys_pidfd_open with sys_pidfd_open in arch/x86/entry/syscalls/syscall_64.tbl.)
Mot-CRs-fixed: (CR)
Bug:
135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I294aa33dea5ed2662e077340281d7aa0452f7471
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/
1505857
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/
1796166
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Christian Brauner [Thu, 18 Apr 2019 10:18:39 +0000 (12:18 +0200)]
UPSTREAM: signal: use fdget() since we don't allow O_PATH
As stated in the oriGinal commit for pidfd_send_signal() we don't allow
to signal processes through O_PATH file descriptors since it is
semantically equivalent to a write on the pidfd.
We already correctly error out right now and return EBADF if an O_PATH
fd is passed. This is because we use file->f_op to detect whether a
pidfd is passed and O_PATH fds have their file->f_op set to empty_fops
in do_dentry_open() and thus fail the test.
Thus, there is no regression. It's just semantically correct to use
fdget() and return an error right from there instead of taking a
reference and returning an error later.
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jann@thejh.net>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit
738a7832d21e3d911fcddab98ce260b79010b461)
Mot-CRs-fixed: (CR)
Bug:
135608568
Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL
Change-Id: IMontanaeaadf9da371fb2d9caae4df49627760de7229
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/
1505849
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/
1796157
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.c
Christian Brauner [Sun, 18 Nov 2018 23:51:56 +0000 (00:51 +0100)]
BACKPORT: signal: add pidfd_send_signal() syscall
The kill() syscall operates on process identifiers (pid). After a process
has exited its pid can be reused by another process. If a caller sends a
signal to a reused pid it will end up signaling the wrong process. This
issue has often Surfaced and there has been a push to address this problem [1].
This patch uses file descriptors (fd) from proc/<pid> as stable handles on
struct pid. Even if a pid is recycled the handle will not change. The fd
can be used to send signals to the process it refers to.
Thus, the new syscall pidfd_send_signal() is introduced to solve this
problem. Instead of pids it operates on process fds (pidfd).
/* prototype and argument /*
long pidfd_send_signal(int pidfd, int sig, siGinfo_t *info, unsigned int flags);
/* syscall number 424 */
The syscall number was chosen to be 424 to align with Arnd's rework in his
y2038 to minimize merge conflicts (cf. [25]).
In addition to the pidfd and signal argument it takes an additional
siGinfo_t and flags argument. If the siGinfo_t argument is NULL then
pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
The flags argument is added to allow for future extensions of this syscall.
It currently needs to be passed as 0. Failing to do so will cause EINVAL.
/* pidfd_send_signal() replaces multiple pid-based syscalls */
The pidfd_send_signal() syscall currently takes on the job of
rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a
positive pid is passed to kill(2). It will however be possible to also
replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.
/* sending signals to threads (tid) and process groups (pgid) */
Specifically, the pidfd_send_signal() syscall does currently not operate on
process groups or threads. This is left for future extensions.
In order to extend the syscall to allow sending signal to threads and
process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
PIDFD_TYPE_TID) should be added. This implies that the flags argument will
determine what is signaled and not the file descriptor itself. Put in other
words, grouping in this api is a property of the flags argument not a
property of the file descriptor (cf. [13]). Clarification for this has been
requested by Eric (cf. [19]).
When appropriate extensions through the flags argument are added then
pidfd_send_signal() can additionally replace the part of kill(2) which
operates on process groups as well as the tgkill(2) and
rt_tgsigqueueinfo(2) syscalls.
How such an extension could be implemented has been very roughly sketched
in [14], [15], and [16]. However, this should not be taken as a commitment
to a particular implementation. There might be better ways to do it.
Right now this is intentionally left out to keep this patchset as simple as
possible (cf. [4]).
/* naming */
The syscall had various names throughout iterations of this patchset:
- procfd_signal()
- procfd_send_signal()
- taskfd_send_signal()
In the last round of reviews it was pointed out that given that if the
flags argument decides the scope of the signal instead of different types
of fds it might make sense to either settle for "procfd_" or "pidfd_" as
prefix. The community was willing to accept either (cf. [17] and [18]).
Given that one developer expressed strong preference for the "pidfd_"
prefix (cf. [13]) and with other developers less opinionated about the name
we should settle for "pidfd_" to avoid further bikeshedding.
The "_send_signal" suffix was chosen to reflect the fact that the syscall
takes on the job of multiple syscalls. It is therefore intentional that the
name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
fomer because it might imply that pidfd_send_signal() is a replacement for
kill(2), and not the latter because it is a hassle to remember the correct
spelling - especially for non-native speakers - and because it is not
descriptive enough of what the syscall actually does. The name
"pidfd_send_signal" makes it very clear that its job is to send signals.
/* zombies */
Zombies can be signaled just as any other process. No special error will be
reported since a zombie state is an unreliable state (cf. [3]). However,
this can be added as an extension through the @flags argument if the need
ever arises.
/* cross-namespace signals */
The patch currently enforces that the signaler and signalee either are in
the same pid namespace or that the signaler's pid namespace is an ancestor
of the signalee's pid namespace. This is done for the sake of simplicity
and because it is unclear to what values certain members of struct
siGinfo_t would need to be set to (cf. [5], [6]).
/* compat syscalls */
It became clear that we would like to avoid adding compat syscalls
(cf. [7]). The compat syscall handling is now done in kernel/signal.c
itself by adding __copy_siGinfo_from_user_generic() which lets us avoid
compat syscalls (cf. [8]). It should be noted that the addition of
__copy_siGinfo_from_user_any() is caused by a bug in the oriGinal
implementation of rt_sigqueueinfo(2) (cf. 12).
With upcoming rework for syscall handling things might improve
significantly (cf. [11]) and __copy_siGinfo_from_user_any() will not gain
any additional callers.
/* testing */
This patch was tested on x64 and x86.
/* userspace usage */
An asciinema recording for the basic functionality can be found under [9].
With this patch a process can be killed via:
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
static inline int do_pidfd_send_signal(int pidfd, int sig, siGinfo_t *info,
unsigned int flags)
{
#ifdef __NR_pidfd_send_signal
return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
#else
return -ENOSYS;
#endif
}
int main(int argc, char *argv[])
{
int fd, ret, saved_errno, sig;
if (argc < 3)
exit(EXIT_FAILURE);
fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
if (fd < 0) {
printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
exit(EXIT_FAILURE);
}
sig = atoi(argv[2]);
printf("Sending signal %d to process %s\n", sig, argv[1]);
ret = do_pidfd_send_signal(fd, sig, NULL, 0);
saved_errno = errno;
close(fd);
errno = saved_errno;
if (ret < 0) {
printf("%s - Failed to send signal %d to process %s\n",
strerror(errno), sig, argv[1]);
exit(EXIT_FAILURE);
}
exit(EXIT_SUCCESS);
}
/* Q&A
* Given that it seems the same questions get asked again by people who are
* late to the party it makes sense to add a Q&A section to the commit
* message so it's hopefully easier to avoid duplicate threads.
*
* For the sake of progress please consider these arguments settled unless
* there is a new point that desperately needs to be addressed. Please make
* sure to check the links to the threads in this commit message whether
* this has not already been covered.
*/
Q-01: (Florian Weimer [20], Andrew Morton [21])
What happens when the target process has exited?
A-01: Sending the signal will fail with ESRCH (cf. [22]).
Q-02: (Andrew Morton [21])
Is the task_struct pinned by the fd?
A-02: No. A reference to struct pid is kept. struct pid - as far as I
understand - was created exactly for the reason to not require to
pin struct task_struct (cf. [22]).
Q-03: (Andrew Morton [21])
Does the entire procfs directory remain visible? Just one entry
within it?
A-03: The same thing that happens right now when you hold a file descriptor
to /proc/<pid> open (cf. [22]).
Q-04: (Andrew Morton [21])
Does the pid remain reserved?
A-04: No. This patchset guarantees a stable handle not that pids are not
recycled (cf. [22]).
Q-05: (Andrew Morton [21])
Do attempts to signal that fd return errors?
A-05: See {Q,A}-01.
Q-06: (Andrew Morton [22])
Is there a cleaner way of obtaining the fd? Another syscall perhaps.
A-06: Userspace can already trivially retrieve file descriptors from procfs
so this is something that we will need to support anyway. Hence,
there's no immediate need to add another syscalls just to make
pidfd_send_signal() not dependent on the presence of procfs. However,
adding a syscalls to get such file descriptors is planned for a
future patchset (cf. [22]).
Q-07: (Andrew Morton [21] and others)
This fd-for-a-process sounds like a handy thing and people may well
think up other uses for it in the future, probably unrelated to
signals. Are the code and the interface designed to permit such
future applications?
A-07: Yes (cf. [22]).
Q-08: (Andrew Morton [21] and others)
Now I think about it, why a new syscall? This thing is looking
rather like an ioctl?
A-08: This has been extensively discussed. It was agreed that a syscall is
preferred for a variety or reasons. Here are just a few taken from
prior threads. Syscalls are safer than ioctl()s especially when
signaling to fds. Processes are a core kernel concept so a syscall
seems more appropriate. The layout of the syscall with its four
arguments would require the addition of a custom struct for the
ioctl() thereby causing at least the same amount or even more
complexity for userspace than a simple syscall. The new syscall will
replace multiple other pid-based syscalls (see description above).
The file-descriptors-for-processes concept introduced with this
syscall will be extended with other syscalls in the future. See also
[22], [23] and various other threads already linked in here.
Q-09: (Florian Weimer [24])
What happens if you use the new interface with an O_PATH descriptor?
A-09:
pidfds opened as O_PATH fds cannot be used to send signals to a
process (cf. [2]). Signaling processes through pidfds is the
equivalent of writing to a file. Thus, this is not an operation that
operates "purely at the file descriptor level" as required by the
open(2) manpage. See also [4].
/* References */
[1]: https://lore.kernel.org/lkml/
20181029221037.87724-1-dancol@google.com/
[2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
[3]: https://lore.kernel.org/lkml/
20181204132604.aspfupwjgjx6fhva@brauner.io/
[4]: https://lore.kernel.org/lkml/
20181203180224.fkvw4kajtbvru2ku@brauner.io/
[5]: https://lore.kernel.org/lkml/
20181121213946.GA10795@mail.hallyn.com/
[6]: https://lore.kernel.org/lkml/
20181120103111.etlqp7zop34v6nv4@brauner.io/
[7]: https://lore.kernel.org/lkml/
36323361-90BD-41AF-AB5B-
EE0D7BA02C21@amacapital.net/
[8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
[9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
[11]: https://lore.kernel.org/lkml/
F53D6D38-3521-4C20-9034-
5AF447DF62FF@amacapital.net/
[12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
[13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
[14]: https://lore.kernel.org/lkml/
20181206231742.xxi4ghn24z4h2qki@brauner.io/
[15]: https://lore.kernel.org/lkml/
20181207003124.GA11160@mail.hallyn.com/
[16]: https://lore.kernel.org/lkml/
20181207015423.4miorx43l3qhppfz@brauner.io/
[17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
[18]: https://lore.kernel.org/lkml/
20181206222746.GB9224@mail.hallyn.com/
[19]: https://lore.kernel.org/lkml/
20181208054059.19813-1-christian@brauner.io/
[20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[21]: https://lore.kernel.org/lkml/
20181228152012.
dbf0508c2508138efc5f2bbe@linux-foundation.org/
[22]: https://lore.kernel.org/lkml/
20181228233725.722tdfgijxcssg76@brauner.io/
[23]: https://lwn.net/Articles/773459/
[24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Tycho Andersen <tycho@tycho.ws>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Aleksa Sarai <cyphar@cyphar.com>
(cherry picked from commit
3eb39f47934f9d5a3027fe00d906a45fe3a15fad)
Conflicts:
arch/x86/entry/syscalls/syscall_32.tbl - trivial manual merge
arch/x86/entry/syscalls/syscall_64.tbl - trivial manual merge
include/linux/proc_fs.h - trivial manual merge
include/linux/syscalls.h - trivial manual merge
include/uapi/asm-generic/unistd.h - trivial manual merge
kernel/signal.c - struct kernel_siGinfo does not exist in 4.14
kernel/sys_ni.c - cond_syscall is used instead of COND_SYSCALL
arch/x86/entry/syscalls/syscall_32.tbl
arch/x86/entry/syscalls/syscall_64.tbl
(1. manual merges because of 4.14 differences
2. change prepare_kill_siGinfo() to use struct siGinfo instead of
kernel_siGinfo
3. use copy_from_user() instead of copy_siGinfo_from_user() in copy_siGinfo_from_user_any()
4. replaced COND_SYSCALL with cond_syscall
5. Removed __ia32_sys_pidfd_send_signal in arch/x86/entry/syscalls/syscall_32.tbl.
6. Replaced __x64_sys_pidfd_send_signal with sys_pidfd_send_signal in arch/x86/entry/syscalls/syscall_64.tbl.)
Mot-CRs-fixed: (CR)
Bug:
135608568
Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL
Change-Id: I34da11c63ac8cafb0353d9af24c820cef519ec27
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/
1505847
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/
1796155
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Eric Dumazet [Wed, 27 Mar 2019 19:40:33 +0000 (12:40 -0700)]
inet: switch IP ID generator to siphash
[ Upstream commit
df453700e8d81b1bdafdf684365ee2b9431fb702 ]
According to Amit Klein and Benny Pinkas, IP ID generation is too weak
and might be used by attackers.
Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix())
having 64bit key and Jenkins hash is risky.
It is time to switch to siphash and its 128bit keys.
Mot-CRs-fixed: (CR)
CVE-Fixed: CVE-2019-18282
Bug:
148588557
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jignesh Patel <jignesh@motorola.com>
Change-Id: I9593781a735940aaedf8e6b38fef02b48169bd12
Reviewed-on: https://gerrit.mot.com/
1572721
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Todd Kjos [Thu, 23 Jan 2020 05:14:53 +0000 (10:44 +0530)]
ANDROID: fix binder change in merge of 4.9.188
The 4.9.188 merge was missing the change to the
binder driver associated with the linux-4.9.y
commit
16903f1a5ba7 ("coredump: fix race condition
between mmget_not_zero()/get_task_mm() and core dumping").
It was left out because the android-4.9 binder
driver has been significantly refactored compared
to linux-4.9.y.
This patch applies the missing change from that
patch to the binder driver.
Mot-CRs-fixed: (CR)
CVE-Fixed: CVE-2019-11599
BUG:
131964235
Change-Id: I1402cf3c28f1336da9d942abeb322f71a9b8138b
Signed-off-by: Pachipulusu Bhanu Prakash <bhprakas@motorola.com>
Reviewed-on: https://gerrit.mot.com/
1473937
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Dingwei Luo [Tue, 17 Dec 2019 02:55:38 +0000 (20:55 -0600)]
Revert "(CR) arm64: dts: Keep VCCQ power when S2R mode for Sandisk UFS"
The changed will be effect the APM current, the current will increase about 6mA.
This reverts commit
7eb50b69c6197393012ccc02b97c8ad54c8f6ba3.
Change-Id: I27586d466caa39a9632136b87734fc9447090092
Reviewed-on: https://gerrit.mot.com/
1473008
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
luodw1 [Thu, 12 Dec 2019 05:51:06 +0000 (13:51 +0800)]
arm64: dts: Keep VCCQ power when S2R mode for Sandisk UFS
Change-Id: Ib55c4b86a0608ec3e436dd2b8ae36cb1fd44287e
Reviewed-on: https://gerrit.mot.com/
1470812
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
a17671 [Mon, 18 Nov 2019 07:56:16 +0000 (15:56 +0800)]
usb:Restore linked_func list in none-secure mode
In none-secure mode, there is still
A low chance of gadget NULL case,
Restore the binded functions list back to
Linked_func list, so next time unbind functions
Could be handled correctly without memory corruption
This is a Samsung platform only issue
Change-Id: Ie46fc52d3eaa6ef60c1a4f6bb83a56229Montana854d
Signed-off-by: a17671 <a17671@motorola.com>
Reviewed-on: https://gerrit.mot.com/
1456923
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
wangwang [Thu, 14 Nov 2019 09:05:25 +0000 (17:05 +0800)]
Revert "(CR) psi:kernel:enable PSI configuration"
This reverts commit
9ea0893bec0beb7328429c222097427d33981714.
Conflicts:
arch/arm64/configs/ext_config/moto-erd9610.config
Change-Id: I8e2c88a7b2c932fa877416564d5cbf294afe0d5a
Reviewed-on: https://gerrit.mot.com/
1455315
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
zhaoxp3 [Wed, 13 Nov 2019 09:49:56 +0000 (17:49 +0800)]
arm64/defconfig: define CONFIG_LOCALVERSION to user
add user defconfig
Change-Id: Ib8113f551270eee70178e9638dabeb8083e5b675
Signed-off-by: zhaoxp3 <zhaoxp3@motorola.com>
Reviewed-on: https://gerrit.mot.com/
1453937
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Submit-Approved: Jira Key
huangzq2 [Tue, 12 Nov 2019 07:28:15 +0000 (15:28 +0800)]
Enable process reclaim
Change-Id: Icda8271812c13fa2e4677e42ec38f8a52dd50721
Signed-off-by: huangzq2 <huangzq2@motorola.com>
Reviewed-on: https://gerrit.mot.com/
1453732
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Martin Liu [Mon, 6 May 2019 16:57:20 +0000 (00:57 +0800)]
mm: mm_event: remove get/put_online_cpus call
remove get/put_online_cpus call since it could cause
deadlock in cpu hotplug path. This might cause race
but should be rare and we should be able to correct
that with the next dump.
=======================================================
Task name: Binder:897_2 pid: 3255 cpu: 0 start: 0xffffffd39e2f5700
state: 0x2 exit_state: 0x0 stack base: 0xffffff80241f0000 Prio: 116
Stack:
[<
ffffff9048f3163c>] __switch_to.cfi+0x138
[<
ffffff904a947808>] __schedule+0xb7c
[<
ffffff904a94dbdc>] rwsem_down_read_failed.cfi+0x270
[<
ffffff904900a0a4>] __percpu_down_read.cfi+0x164
[<
ffffff90491bc4e8>] record_stat+0x6c0
[<
ffffff90491bbdfc>] mm_event_end.cfi+0x14c
[<
ffffff904916c280>] try_to_free_pages.cfi+0xaf4
[<
ffffff904914a598>] __alloc_pages_nodemask.cfi+0x9c8
[<
ffffff9049aa6350>] zcomp_cpu_up_prepare.cfi+0x88
[<
ffffff9048f66da8>] cpuhp_invoke_callback+0x378
[<
ffffff9048f664e0>] _cpu_up+0x1bc
[<
ffffff9048f6aadc>] enable_nonboot_cpus.cfi+0x208
[<
ffffff90490135f4>] suspend_devices_and_enter.cfi+0xc20
[<
ffffff9049012844>] pm_suspend.cfi+0xb30
[<
ffffff9049010568>] state_store.cfi+0x94
[<
ffffff904a9339dc>] kobj_attr_store.cfi+0x34
[<
ffffff90492ed9ec>] sysfs_kf_write.cfi+0x64
[<
ffffff90492eb51c>] kernfs_fop_write.cfi+0x1a4
[<
ffffff90491fbf34>] __vfs_write.cfi+0x50
[<
ffffff90491fbd58>] vfs_write.cfi+0xcc
[<
ffffff90491fefd4>] SyS_write.cfi+0xa4
[<
ffffff9048e84080>] el0_svc_naked+0x34
Test: manual suspend/resume test
Mot-CRs-fixed: (CR)
Bug:
132011965
Change-Id: I112ca0d25e825bb4e0e8979d9b4f1d8e6090147f
Signed-off-by: Martin Liu <liumartin@google.com>
(cherry picked from commit
5f419093ab253702847a6b3a8417e47c2acfb652)
Reviewed-on: https://gerrit.mot.com/
1453731
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Minchan Kim [Mon, 28 Jan 2019 11:00:33 +0000 (20:00 +0900)]
mm: mm_event: fix compact_scan
It fixes double counting of COMPACTFREE_SCANNED.
Mot-CRs-fixed: (CR)
Bug:
80168800
Change-Id: I38ef432ecf44ba94988f5a4ec9c69bcb5d20fdce
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/
1453730
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Minchan Kim [Thu, 17 Jan 2019 02:14:14 +0000 (11:14 +0900)]
mm: synchronize period update interval
Wei pointed out period update is racy so it could make partial
update, which could lose a ton of trace potentially.
To close period_ms race between updating and reading, use rwlock
to reduce contention.
To close vmstat_period_ms between updating and reading,
use vmstat_lock.
This patch has small refactoring, too.
Mot-CRs-fixed: (CR)
Bug:
80168800
Change-Id: I7f84cff758b533b7881f47889c7662b743bc3c12
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/
1453729
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Minchan Kim [Tue, 15 Jan 2019 04:54:07 +0000 (13:54 +0900)]
mm: mm_event supports vmstat
Vmstat is significantly important to investigate MM problem.
We have solved many problmes with it via asking users to get
vmstat data periodically from the device, which manual way is
painful once we release the device or on hard reproducible
scenario.
This patch adds periodic vmstat dump into mm_event. It works
only if there are some events in compaction or reclaim. Thus,
unless there is memory pressure, it doesn't gather any vmstat
data. Default interval between each dump is 1000ms.
Admin can tweak it via
echo 2000 > /sys/kernel/debug/mm_event/vmstat_period_ms
Mot-CRs-fixed: (CR)
Bug:
80168800
Change-Id: I4c0e7237d7764c4ea79da00952e5de34ccbe4187
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/
1453728
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Minchan Kim [Fri, 9 Jan 2015 13:06:55 +0000 (18:36 +0530)]
mm: per-process reclaim
These day, there are many platforms available in the embedded market
and they are smarter than kernel which has very limited information
about working set so they want to involve memory management more heavily
like android's lowmemory killer and ashmem or recent many lowmemory
notifier.
One of the simple imaGine scenario about userspace's intelligence is that
platform can manage tasks as forground and background so it would be
better to reclaim background's task pages for end-user's *responsibility*
although it has frequent referenced pages.
This patch adds new knob "reclaim under proc/<pid>/" so task manager
can reclaim any target process anytime, anywhere. It could give another
method to platform for using memory efficiently.
It can avoid process killing for getting free memory, which was really
terrible experience because I lost my best score of game I had ever
after I switch the phone call while I enjoyed the game.
Reclaim file-backed pages only.
echo file > /proc/PID/reclaim
Reclaim anonymous pages only.
echo anon > /proc/PID/reclaim
Reclaim all pages
echo all > /proc/PID/reclaim
Mot-CRs-fixed: (CR)
Bug:
122047783
Change-Id: I2f629f7a43289af114df27044b1d2af4a6e785bc
Signed-off-by: Tim Murray <timmurray@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-on: https://gerrit.mot.com/
1453727
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Joel Fernandes [Sat, 5 May 2018 21:58:08 +0000 (14:58 -0700)]
mm: emit tracepoint when rss watermark is hit
Useful to track how rss is chanGing per tgid. Required for the
memory visibility work being done for Android.
OriGinal patch by Tim Murray:
https://partner-android-review.googlesource.com/c/kernel/private/msm-google/+/
1081280
Changes from oriGinal patch:
- don't bloat mm_struct
- add some noise reduction to rss tracking
Mot-CRs-fixed: (CR)
Change-Id: Ief904334235ff4380244e5803d7853579e70d202
Signed-off-by: Joel Fernandes <joelaf@google.com>
Reviewed-on: https://gerrit.mot.com/
1453726
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
SLTApproved: Slta Waiver
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Minchan Kim [Sun, 12 Aug 2018 23:15:32 +0000 (08:15 +0900)]
mm: mm_event: add read io stat
Read IO's latency as well as filemap fault could affect system
performance so this patch keeps track it on.
Mot-CRs-fixed: (CR)
Bug:
80168800
Change-Id: I761b7110339cf1e5ef24530ad32aedd784d00d07
Signed-off-by: Minchan Kim <minchan@google.com>
Signed-off-by: Cho KyongHo <pullip.cho@samsung.com>
Reviewed-on: https://gerrit.mot.com/
1453725
Tested-by: Jira Key
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Minchan Kim [Mon, 6 Aug 2018 06:12:44 +0000 (15:12 +0900)]
mm: mm_event: add special kernel allocation stat
Record the count of special page allocation on the process context.
This patch aims for accounting of special page allocation which
consumed a lot by android system.
At this moment, ION system heap is good candidate(it could cover
other kernel allocation in future).
With that, we could keep tracking burst kernel allocation owner
so that it would be useful to find places caused by lmk, reclaim,
compaction latency.
Mot-CRs-fixed: (CR)
Bug:
80168800
Change-Id: I5942fd940d98baa2eb814f66b076cb37ecd3b4aa
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/
1453724
Tested-by: Jira Key
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Minchan Kim [Mon, 6 Aug 2018 06:07:57 +0000 (15:07 +0900)]
mm: mm_event: add swapin stat
Many embedded devices use zram as swap. Compared to storage swap
(e.g. UFS), swapin from zram(ie., decompression) is extremly fast
so it might be not major fault but minor. So this patch provides
swapin latency tracking to distinguish them from storage major
fault.
Mot-CRs-fixed: (CR)
Bug:
80168800
Change-Id: I1c32430e32a051916ede5219bd5f40a9002652bc
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/
1453723
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Minchan Kim [Wed, 27 Jun 2018 13:04:37 +0000 (22:04 +0900)]
mm: mm_event: add compaction stat
This patch adds compaction mm_event stat so that we could keep track
latency of compaction as well as count of the event.
Under heavy memory fragmentation, high-order page allocation(e.g.
fork, ION memory allocation) triggers compaction, which is
another major part of latency. Let's track it down, too.
Mot-CRs-fixed: (CR)
Bug:
80168800
Change-Id: Ia3da9324f123ba2542863eafaf72024b5351785b
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/
1453722
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Minchan Kim [Wed, 27 Jun 2018 13:04:07 +0000 (22:04 +0900)]
mm: mm_event: add reclaim stat
This patch adds page reclaim mm_event stat so that we could
keep tracking [avg|max]_latency for the handling the event
as well as count of the event.
Direct reclaim latency is usually a most popular latency source
caused by memory pressure so we need to track it down to hunt
down application's jank problem.
Mot-CRs-fixed: (CR)
Bug:
80168800
Change-Id: I215c3972f76389404da7c4806a776bf753daac01
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/
1453721
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Minchan Kim [Wed, 27 Jun 2018 12:54:47 +0000 (21:54 +0900)]
mm: mm_event: add page fault stat
This patch add major and minor fault mm_event stat so that we could
keep tracking [avg|max]_latency for the handling the event
as well as count of the event.
With major fault, we could see how long the IO is delayed. It's very
tightly coupled with application's latency.
With major+minor fault, we could see how many of pages are allocated
for the process in the period. It would help to see memory spike.
Mot-CRs-fixed: (CR)
Bug:
80168800
Change-Id: I8a4434493e3ec291227961939a24c3d57a18fd5b
Signed-off-by: Minchan Kim <minchan@google.com>
Signed-off-by: Cho KyongHo <pullip.cho@samsung.com>
Reviewed-on: https://gerrit.mot.com/
1453720
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Minchan Kim [Mon, 6 Aug 2018 06:02:06 +0000 (15:02 +0900)]
mm: mm_event: make capture period configurable
This patch makes per-process mm event capture inteval configurable.
Default is 500ms but admin can change it by below knob.
/sys/kernel/debug/mm_event/period_ms
The unit is millisecond.
Mot-CRs-fixed: (CR)
Bug:
80168800
Change-Id: I3b2de3dd5c4a519a2e5e20f1ef0d5f9a4c7afc8a
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/
1453719
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
Minchan Kim [Mon, 6 Aug 2018 06:00:19 +0000 (15:00 +0900)]
mm: introduce per-process mm event tracking feature
Linux supports /proc/meminfo and /proc/vmstat stats as memory health metric.
Android uses them too. If user see something goes wrong(e.g., sluggish, jank)
on their system, they can capture and report system state to developers
for debugGing.
It shows memory stat at the moment the bug is captured. However, itâs
not enough to investigate application's jank problem caused by memory
shortage. Because
1. It just shows event count which doesnât quantify the latency of the
application well. Jank could happen by various reasons and one of simple
scenario is frame drop for a second. App should draw the frame every 16ms
interval. Just number of stats(e.g., allocstall or pgmajfault) couldn't
represnt how many of time the app spends for handling the event.
2. At bugreport, dump with vmstat and meminfo is never helpful because it's
too late to capture the moment when the problem happens.
When the user catch up the problem and try to capture the system state,
the problem has already gone.
3. Although we could capture MM stat at the moment bug happens, it couldn't
be helpful because MM stats are usually very flucuate so we need historical
data rather than one-time snapshot to see MM trend.
To solve above problems, this patch introduces per-process, light-weight,
mm event stat. Basically, it tracks minor/major faults, reclaim and compaction
latency of each process as well as event count and record the data into global
buffer.
To compromise memory overhead, it doesn't record every MM event of the process
to the buffer but just drain accumuated stats every 0.5sec interval to buffer.
If there isn't any event, it just skips the recording.
For latency data, it keeps average/max latency of each event in that period
With that, we could keep useful information with small buffer so that
we couldn't miss precious information any longer although the capture time
is rather late. This patch introduces basic facility of MM event stat.
After all patches in this patchset are applied, outout format is as follows,
dumpstate can use it for VM debugGing in future.
<...>-1665 [001] d... 217.575173: mm_event_record: min_flt count=203 avg_lat=3 max_lat=58
<...>-1665 [001] d... 217.575183: mm_event_record: maj_flt count=1 avg_lat=1994 max_lat=1994
<...>-1665 [001] d... 217.575184: mm_event_record: kern_alloc count=227 avg_lat=0 max_lat=0
<...>-626 [000] d... 217.578096: mm_event_record: kern_alloc count=4 avg_lat=0 max_lat=0
<...>-6547 [000] .... 217.581913: mm_event_record: min_flt count=7 avg_lat=7 max_lat=20
<...>-6547 [000] .... 217.581955: mm_event_record: kern_alloc count=4 avg_lat=0 max_lat=0
This feature uses event trace for output buffer so that we could use all of
general benefit of event trace(e.g., buffer size management, filtering and
so on). To prevent overflow of the ring buffer by other random event race,
highly suggest that create separate instance of tracing
on /sys/kernel/debug/tracing/instances/
I had a concern of adding overhead. Actually, major|compaction/reclaim
are already heavy cost so it should be not a concern. Rather than,
minor fault and kern alloc would be severe so I tested a micro benchmark
to measure minor page fault overhead.
Test scenario is create 40 threads and each of them does minor
page fault for 25M range(ranges are not overwrapped).
I didn't see any noticible regression.
Base:
fault/wsec avg: 758489.8288
minor faults=
13123118, major faults=0 ctx switch=139234
User System Wall fault/wsec
39.55s 41.73s 17.49s 749995.768
minor faults=
13123135, major faults=0 ctx switch=139627
User System Wall fault/wsec
34.59s 41.61s 16.95s 773906.976
minor faults=
13123061, major faults=0 ctx switch=139254
User System Wall fault/wsec
39.03s 41.55s 16.97s 772966.334
minor faults=
13123131, major faults=0 ctx switch=139970
User System Wall fault/wsec
36.71s 42.12s 17.04s 769941.019
minor faults=
13123027, major faults=0 ctx switch=138524
User System Wall fault/wsec
42.08s 42.24s 18.08s 725639.047
Base + MM event + event trace enable:
fault/wsec avg: 759626.1488
minor faults=
13123488, major faults=0 ctx switch=140303
User System Wall fault/wsec
37.66s 42.21s 17.48s 750414.257
minor faults=
13123066, major faults=0 ctx switch=138119
User System Wall fault/wsec
36.77s 42.14s 17.49s 750010.107
minor faults=
13123505, major faults=0 ctx switch=140021
User System Wall fault/wsec
38.51s 42.50s 17.54s 748022.219
minor faults=
13123431, major faults=0 ctx switch=138517
User System Wall fault/wsec
36.74s 41.49s 17.03s 770255.610
minor faults=
13122955, major faults=0 ctx switch=137174
User System Wall fault/wsec
40.68s 40.97s 16.83s 779428.551
Mot-CRs-fixed: (CR)
Bug:
80168800
Change-Id: I4e69c994f47402766481c58ab5ec2071180964b8
Signed-off-by: Minchan Kim <minchan@google.com>
Signed-off-by: Cho KyongHo <pullip.cho@samsung.com>
Reviewed-on: https://gerrit.mot.com/
1453718
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
wangwang [Wed, 13 Nov 2019 06:26:15 +0000 (14:26 +0800)]
psi:kernel:enable PSI configuration
support Google PSI memory management in lmkd
Change-Id: I437daa54c55c4caa9d8a67ba5bf7ac529d61da87
Reviewed-on: https://gerrit.mot.com/
1453674
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key
wangwang [Wed, 13 Nov 2019 06:04:33 +0000 (14:04 +0800)]
psi:kernel:oom reaper porting into samsung platform
reaper can help to reclaim the memory in time, the knob will be set to true
when init parses the init.rc conf file.
Change-Id: I59f1173c0e46202904da6eeacb2fecc32c53232c
Johannes Weiner [Wed, 30 Jan 2019 23:41:54 +0000 (10:41 +1100)]
BACKPORT: kernel: cgroup: add poll file operation
Cgroup has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes. To allow polling for
custom events, add a .poll callback that can override the default.
This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.
Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
(cherry picked from commit:
dc50537bdd1a0804fa2cbc990565ee9a944e66fa)
Conflicts:
include/linux/cgroup-defs.h
kernel/cgroup.c
(1. replaced __poll_t with unsigned int)
Bug:
127712811
Test: lmkd in PSI mode
Change-Id: I21aff1d9d31e3d4b45e257aa4d299405a2ce6de3
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suren Baghdasaryan [Tue, 4 Dec 2018 01:36:42 +0000 (17:36 -0800)]
FROMLIST: psi: introduce psi monitor
Psi monitor aims to provide a low-latency short-term pressure
detection mechanism configurable by users. It allows users to
monitor psi metrics growth and trigger events whenever a metric
raises above user-defined threshold within user-defined time window.
Time window and threshold are both expressed in usecs. Multiple psi
resources with different thresholds and window sizes can be monitored
concurrently.
Psi monitors activate when system enters stall state for the monitored
psi metric and deactivate upon exit from the stall state. While system
is in the stall state psi signal growth is monitored at a rate of 10 times
per tracking window. Min window size is 500ms, therefore the min monitoring
interval is 50ms. Max window size is 10s with monitoring interval of 1s.
When activated psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when psi
signal is bouncing.
Notifications to the users are rate-limited to one per tracking window.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/
1052418/)
Conflicts:
include/linux/psi.h
kernel/cgroup/cgroup.c
kernel/sched/psi.c
(1. replaced __poll_t with unsigned int
2. replaced EPOLLERR/EPOLLPRI with POLLERR/POLLPRI (values are the same)
3. include <linux/cgroup-defs.h> in include/linux/psi.h
4. include <uapi/linux/sched/types.h> in kernel/sched/psi.c)
Bug:
127712811
Bug:
129157727
Test: lmkd in PSI mode
Change-Id: I1688f047e98e1f109627dad72a33d2f70e575268
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suren Baghdasaryan [Sun, 17 Feb 2019 23:07:38 +0000 (15:07 -0800)]
FROMLIST: refactor header includes to allow kthread.h inclusion in psi_types.h
kthread.h can't be included in psi_types.h because it creates a circular
inclusion with kthread.h eventually including psi_types.h and complaining
on kthread structures not being defined because they are defined further
in the kthread.h. Resolve this by removing psi_types.h inclusion from the
headers included from kthread.h.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/
1052417/)
Conflicts:
include/linux/kthread.h
kernel/kthread.c
(1. <linux/cgroup.h> include is already missing in kthread.h
2. <linux/cgroup.h> is already included in kthread.c)
Bug:
127712811
Bug:
129157727
Test: lmkd in PSI mode
Change-Id: I07c1f4fddf0c43b3095f505e062d9d179d041544
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suren Baghdasaryan [Wed, 6 Mar 2019 18:25:50 +0000 (10:25 -0800)]
FROMLIST: psi: track changed states
Introduce changed_states parameter into collect_percpu_times to track
the states changed since the last update.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/
1052420/)
Bug:
127712811
Bug:
129157727
Test: lmkd in PSI mode
Change-Id: Idb2f7d73013bff16bb101b62a2609917a5353bf9
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suren Baghdasaryan [Wed, 6 Mar 2019 17:52:23 +0000 (09:52 -0800)]
FROMLIST: psi: split update_stats into parts
Split update_stats into collect_percpu_times and update_averages for
collect_percpu_times to be reused later inside psi monitor.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/
1052419/)
Bug:
127712811
Bug:
129157727
Test: lmkd in PSI mode
Change-Id: Ic5dca1924a3f8997b49b5d16289f53bcc43b88fa
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suren Baghdasaryan [Wed, 6 Mar 2019 17:21:03 +0000 (09:21 -0800)]
FROMLIST: psi: rename psi fields in preparation for psi trigger addition
Renaming psi_group structure member fields used for calculating psi totals
and averages for clear distinction between them and trigger-related fields
that will be added next.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/
1052416/)
Bug:
127712811
Bug:
129157727
Test: lmkd in PSI mode
Change-Id: I7aaadfc558950b54b02a051d63e508e8fe233b49
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suren Baghdasaryan [Wed, 30 Jan 2019 23:41:54 +0000 (10:41 +1100)]
FROMLIST: psi: make psi_enable static
psi_enable is not used outside of psi.c, make it static.
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/
1052415/)
Bug:
127712811
Bug:
129157727
Test: lmkd in PSI mode
Change-Id: I3c422d6c0c4299095c6ba05cfe942a2b00705f29
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suren Baghdasaryan [Wed, 30 Jan 2019 23:41:54 +0000 (10:41 +1100)]
FROMLIST: psi: introduce state_mask to represent stalled psi states
The psi monitoring patches will need to determine the same states as
record_times(). To avoid calculating them twice, maintain a state mask
that can be consulted cheaply. Do this in a separate patch to keep the
churn in the main feature patch at a minimum.
This adds 4-byte state_mask member into psi_group_cpu struct which results
in its first cacheline-aligned part becoming 52 bytes long. Add explicit
values to enumeration element counters that affect psi_group_cpu struct
size.
Link: http://lkml.kernel.org/r/20190124211518.244221-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/
1052414/)
Bug:
127712811
Bug:
129157727
Test: lmkd in PSI mode
Change-Id: I7807b687e2a5d78aed44c5e33be1621aa11451cb
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Johannes Weiner [Wed, 30 Jan 2019 23:41:54 +0000 (10:41 +1100)]
BACKPORT: fs: kernfs: add poll file operation
Patch series "psi: pressure stall monitors", v3.
Android is adopting psi to detect and remedy memory pressure that results
in stuttering and decreased responsiveness on mobile devices.
Psi gives us the stall information, but because we're dealing with
latencies in the millisecond range, periodically reading the pressure
files to detect stalls in a timely fashion is not feasible. Psi also
doesn't aggregate its averages at a high enough frequency right now.
This patch series extends the psi interface such that users can configure
sensitive latency thresholds and use poll() and friends to be notified
when these are breached.
As high-frequency aggregation is costly, it implements an aggregation
method that is optimized for fast, short-interval averaGing, and makes the
aggregation frequency adaptive, such that high-frequency updates only
happen while monitored stall events are actively occurring.
With these patches applied, Android can monitor for, and ward off,
mounting memory shortages before they cause problems for the user. For
example, using memory stall monitors in userspace low memory killer daemon
(lmkd) we can detect mounting pressure and kill less important processes
before device becomes visibly sluggish. In our memory stress testing psi
memory monitors produce roughly 10x less false positives compared to
vmpressure signals. Having ability to specify multiple triggers for the
same psi metric allows other parts of Android framework to monitor memory
state of the device and act accordingly.
The new interface is straightforward. The user opens one of the pressure
files for writing and writes a trigger description into the file
descriptor that defines the stall state - some or full, and the maximum
stall time over a given window of time. E.g.:
/* Signal when stall time exceeds 100ms of a 1s window */
char trigger[] = "full 100000
1000000";
fd = open("/proc/pressure/memory");
write(fd, trigger, sizeof(trigger));
while (poll() >= 0) {
...
}
close(fd);
When the monitored stall state is entered, psi adapts its aggregation
frequency according to what the configured time window requires in order
to emit event signals in a timely fashion. Once the stalling subsides,
aggregation reverts back to normal.
The trigger is associated with the open file descriptor. To stop
monitoring, the user only needs to close the file descriptor and the
trigger is discarded.
Patches 1-4 prepare the psi code for polling support. Patch 5 implements
the adaptive polling logic, the pressure growth detection optimized for
short intervals, and hooks up write() and poll() on the pressure files.
The patches were developed in collaboration with Johannes Weiner.
This patch (of 5):
Kernfs has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes. To allow polling for
custom events, add a .poll callback that can override the default.
This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.
Link: http://lkml.kernel.org/r/20190124211518.244221-2-surenb@google.com
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
(cherry picked from commit:
147e1a97c4a0bdd43f55a582a9416bb9092563a9)
Conflicts:
fs/kernfs/file.c
include/linux/kernfs.h
1. replaced __poll_t with unsigned int.
2. replaced EPOLLERR/EPOLLPRI with POLLERR/POLLPRI (values are the same)
Bug:
127712811
Test: lmkd in PSI mode
Change-Id: Ic2bed334d05aec62f4e695f263893c3057921c55
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Johannes Weiner [Thu, 21 Feb 2019 06:19:59 +0000 (22:19 -0800)]
UPSTREAM: psi: avoid divide-by-zero crash inside virtual machines
We've been seeing hard-to-trigger psi crashes when running inside VM
instances:
divide error: 0000 [#1] SMP PTI
Modules linked in: [...]
CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Workqueue: events psi_clock
RIP: 0010:psi_update_stats+0x270/0x490
RSP: 0018:
ffffc90001117e10 EFLAGS:
00010246
RAX:
0000000000000000 RBX:
0000000000000000 RCX:
ffff8800a35a13f8
RDX:
0000000000000000 RSI:
ffff8800a35a1340 RDI:
0000000000000000
RBP:
0000000000000658 R08:
ffff8800a35a1470 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000000 R12:
0000000000000000
R13:
0000000000000000 R14:
0000000000000000 R15:
00000000000f8502
FS:
0000000000000000(0000) GS:
ffff88023fc00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00007fbe370fa000 CR3:
00000000b1e3a000 CR4:
00000000000006f0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Call Trace:
psi_clock+0x12/0x50
process_one_work+0x1e0/0x390
worker_thread+0x2b/0x3c0
? rescuer_thread+0x330/0x330
kthread+0x113/0x130
? kthread_create_worker_on_cpu+0x40/0x40
? SyS_exit_group+0x10/0x10
ret_from_fork+0x35/0x40
Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48
The Code-line points to `period` being 0 inside update_stats(), and we
divide by that when calculating that period's pressure percentage.
The elapsed period should never be 0. The reason this can happen is due
to an off-by-one in the idle time / missing period calculation combined
with a coarse sched_clock() in the virtual machine.
The target time for aggregation is advanced into the future on a fixed
grid to prevent clock drift. So when an aggregation runs after some idle
period, we can not just set it to "now + psi_period", but have to
calculate the downtime and advance the target time relative to itself.
However, if the aggregator was disabled exactly one psi_period (ns), we
drop one idle period in the calculation due to a > when we should do >=.
In that case, next_update will be advanced from 'now - psi_period' to
'now' when it should be moved to 'now + psi_period'. The run finishes
with last_update == next_update == sched_clock().
With hardware clocks, this exact nanosecond match isn't likely in the
first place; but if it does happen, the clock will still have moved on and
the period non-zero by the time the worker runs. A pointlessly short
period, but besides the extra work, no harm no foul. However, a slow
sched_clock() like we have on VMs might not have advanced either by the
time the worker runs again. And when we calculate the elapsed period, the
result, our pressure divisor, will be 0. Ouch.
Fix this by correctly handling the situation when the elapsed time between
aggregation runs is precisely two periods, and advance the expiration
timestamp correctly to period into the future.
Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Ćukasz Siudut <lsiudut@fb.com
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit
4e37504d1c49eec6434d0cc97278d2b51c9e8763)
Bug:
127712811
Test: lmkd in PSI mode
Change-Id: I40917c84354f9f32259c6703f00b6b1d21f45f02
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Johannes Weiner [Fri, 1 Feb 2019 22:21:15 +0000 (14:21 -0800)]
UPSTREAM: psi: clarify the Kconfig text for the default-disable option
The current help text caused some confusion in online forums about
whether or not to default-enable or default-disable psi in vendor
kernels. This is because it doesn't communicate the reason for why we
made this setting configurable in the first place: that the overhead is
non-zero in an artificial scheduler stress test.
Since this isn't representative of real workloads, and the effect was
not measurable in scheduler-heavy real world applications such as the
webservers and memcache installations at Facebook, it's fair to point
out that this is a pretty cautious option to select.
Link: http://lkml.kernel.org/r/20190129233617.16767-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit
7b2489d37e1e355228f7c55724f77580e1dec22a)
Bug:
127712811
Test: lmkd in PSI mode
Change-Id: I5d0cb901562fd74c82d9d211544745b802776d8a
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Johannes Weiner [Fri, 1 Feb 2019 22:20:42 +0000 (14:20 -0800)]
UPSTREAM: psi: fix aggregation idle shut-off
psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating. However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.
DebugGing this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again. This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)
Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep. To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.
What if the worker is also executing other items before or after?
Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself. If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.
If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity. But that
should not be a problem. The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation. If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.
Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes:
eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit
1b69ac6b40ebd85eed73e4dbccde2a36961ab990)
Bug:
127712811
Test: lmkd in PSI mode
Change-Id: I2877fec3d381b1006b8bd1261895fdfd68bd21db
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Baruch Siach [Fri, 14 Dec 2018 22:17:03 +0000 (14:17 -0800)]
UPSTREAM: psi: fix reference to kernel commandline enable
The kernel commandline parameter named in CONFIG_PSI_DEFAULT_DISABLED
help text contradicts the documentation in kernel-parameters.txt, and
the code. Fix that.
Link: http://lkml.kernel.org/r/20181203213416.GA12627@cmpxchg.org
Fixes:
e0c274472d ("psi: make disabling/enabling easier for vendor kernels")
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit
428a1cb4baeb9e5c7feda93af7372ba6d2491558)
Bug:
127712811
Test: lmkd in PSI mode
Change-Id: I592b66d6542f4fa7c2b6eb9f60a5dd43bcfbabf3
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Johannes Weiner [Fri, 30 Nov 2018 22:09:58 +0000 (14:09 -0800)]
UPSTREAM: psi: make disabling/enabling easier for vendor kernels
Mel Gorman reports a hackbench regression with psi that would prohibit
shipping the suse kernel with it default-enabled, but he'd still like
users to be able to opt in at little to no cost to others.
With the current combination of CONFIG_PSI and the psi_disabled bool set
from the commandline, this is a challenge. Do the following things to
make it easier:
1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros
to enable CONFIG_PSI in their kernel but leave the feature disabled
unless a user requests it at boot-time.
To avoid double negatives, rename psi_disabled= to psi=.
2. Make psi_disabled a static branch to eliminate any branch costs
when the feature is disabled.
In terms of numbers before and after this patch, Mel says:
: The following is a comparision using CONFIG_PSI=n as a baseline against
: your patch and a vanilla kernel
:
: 4.20.0-rc4 4.20.0-rc4 4.20.0-rc4
: kconfigdisable-v1r1 vanilla psidisable-v1r1
: Amean 1 1.3100 ( 0.00%) 1.3923 ( -6.28%) 1.3427 ( -2.49%)
: Amean 3 3.8860 ( 0.00%) 4.1230 * -6.10%* 3.8860 ( -0.00%)
: Amean 5 6.8847 ( 0.00%) 8.0390 * -16.77%* 6.7727 ( 1.63%)
: Amean 7 9.9310 ( 0.00%) 10.8367 * -9.12%* 9.9910 ( -0.60%)
: Amean 12 16.6577 ( 0.00%) 18.2363 * -9.48%* 17.1083 ( -2.71%)
: Amean 18 26.5133 ( 0.00%) 27.8833 * -5.17%* 25.7663 ( 2.82%)
: Amean 24 34.3003 ( 0.00%) 34.6830 ( -1.12%) 32.0450 ( 6.58%)
: Amean 30 40.0063 ( 0.00%) 40.5800 ( -1.43%) 41.5087 ( -3.76%)
: Amean 32 40.1407 ( 0.00%) 41.2273 ( -2.71%) 39.9417 ( 0.50%)
:
: It's showing that the vanilla kernel takes a hit (as the bisection
: indicated it would) and that disabling PSI by default is reasonably
: close in terms of performance for this particular workload on this
: particular machine so;
Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit
e0c274472d5d27f277af722e017525e0b33784cd)
Bug:
127712811
Test: lmkd in PSI mode
Change-Id: I6cb666fa351e8901df82e4d6931bfec0c5ce230d
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Olof Johansson [Fri, 16 Nov 2018 23:08:00 +0000 (15:08 -0800)]
UPSTREAM: kernel/sched/psi.c: simplify cgroup_move_task()
The existing code triggered an invalid warning about 'rq' possibly being
used uninitialized. Instead of doing the silly warning suppression by
initializa it to NULL, refactor the code to bail out early instead.
Warning was:
kernel/sched/psi.c: In function `cgroup_move_task':
kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]
Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net
Fixes:
2ce7135adc9ad ("psi: cgroup support")
Signed-off-by: Olof Johansson <olof@lixom.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit
8fcb2312d1e3300e81aa871aad00d4c038cfc184)
Bug:
127712811
Test: lmkd in PSI mode
Change-Id: Id989da224a726082e0cfa5d5d9460bf63d448a93
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Johannes Weiner [Fri, 26 Oct 2018 22:06:31 +0000 (15:06 -0700)]
BACKPORT: psi: cgroup support
On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system, but
also that of individual jobs, so that we can ensure individual job health,
fairness between jobs, or prioritize some jobs over others.
This patch implements pressure stall tracking for cgroups. In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for only
the tasks inside the cgroup.
Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit
2ce7135adc9ad081aa3c49744144376ac74fea60)
Conflicts:
Documentation/cgroup-v2.txt
include/linux/psi.h
kernel/cgroup/cgroup.c
(1. manual merge from Documentation/admin-guide/cgroup-v2.rst
2. include <linux/cgroup-defs.h> into include/linux/psi.h
3. manual merge in css_free_work_fn to allow psi support only for cgroup v2
4. manual merge in cgroup_create to allow psi support only for cgroup v2)
Bug:
127712811
Test: lmkd in PSI mode
Change-Id: I163e6657aaa60aa5aab9372616a3bce2a65e90ec
Signed-off-by: Suren Baghdasaryan <surenb@google.com>