GitHub/LineageOS/android_kernel_motorola_exynos9610.git
2 years agodefconfig: exynos9610: Re-add dropped Wi-Fi AP options lost lineage-18.1
Stricted [Thu, 14 Apr 2022 15:57:15 +0000 (11:57 -0400)]
defconfig: exynos9610: Re-add dropped Wi-Fi AP options lost
 in R rebase

* Not sure why Motorola removed these but there's 0 hope of
  Wi-Fi AP working without them.
* Note: Motorola stock doesn't have these, oddly.

Change-Id: I545153e75b55bd94fe44f9f318746bed1a2cbb38

2 years agodrivers: regulator: fix leading whitespace warning
Stricted [Fri, 21 Jan 2022 17:35:35 +0000 (17:35 +0000)]
drivers: regulator: fix leading whitespace warning

Change-Id: I2d914aee0464a4edd531164908e5c0e611437658

2 years agodrivers: net: wireless: scsc: don't delete or deactivate AP
Tim Zimmermann [Thu, 14 Jan 2021 19:44:19 +0000 (20:44 +0100)]
drivers: net: wireless: scsc: don't delete or deactivate AP
 interface in slsi_del_station

* This function is meant for cleaning up going from STA to AP mode
* On latest 18.1 builds deleting/deactivating AP interface causes
  hotspot not to work, hostapd starts up and system says hotspot
  is on, but due to a kernel error (vif type isn't set to AP anymore)
  SSID isn't actually broadcasted and clients can't find nor connect
  to the hotspot
* So let's remove this as removing it doesn't seem to have any
  negative consequences but fixes hotspot, also disable an
  ugly warning caused by this which isn't actually a problem

Change-Id: I5a656fd38697b9be09ecdc5f344bc343458aeaaf

2 years agofix regression in "epoll: Keep a reference on files added to the check list"
Al Viro [Wed, 2 Sep 2020 15:30:48 +0000 (11:30 -0400)]
fix regression in "epoll: Keep a reference on files added to the check list"

[ Upstream commit 77f4689de17c0887775bb77896f4cc11a39bf848 ]

epoll_loop_check_proc() can run into a file already committed to destruction;
we can't grab a reference on those and don't need to add them to the set for
reverse path check anyway.

Mot-CRs-fixed: (CR)
CVE-Fixed: CVE-2020-0466
Bug: 147802478

Change-Id: I24d3b8c878f5ef49f9ff5922d2364b59844fd8b8
Tested-by: Marc Zyngier <maz@kernel.org>
Fixes: a9ed4a6560b8 ("epoll: Keep a reference on files added to the check list")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jignesh Patel <jignesh@motorola.com>
Reviewed-on: https://gerrit.mot.com/1796974
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

2 years agoARM64: configs: kane/troika: disable {ex,v}fat
Stricted [Sat, 22 May 2021 18:31:15 +0000 (20:31 +0200)]
ARM64: configs: kane/troika: disable {ex,v}fat

 * we have sdfat instead

Change-Id: I9e265238f32dace3cd3658c642769c12d5617250

2 years agofs: sdfat: Add MODULE_ALIAS_FS for supported filesystems
Paul Keith [Wed, 28 Mar 2018 17:52:29 +0000 (19:52 +0200)]
fs: sdfat: Add MODULE_ALIAS_FS for supported filesystems

* This is the proper thing to do for filesystem drivers

Change-Id: I109b201d85e324cc0a72c3fcd09df4a3e1703042
Signed-off-by: Paul Keith <javelinanddart@gmail.com>
2 years agofs: sdfat: Add config option to register sdFAT for VFAT
Paul Keith [Fri, 2 Mar 2018 04:10:27 +0000 (05:10 +0100)]
fs: sdfat: Add config option to register sdFAT for VFAT

Change-Id: I72ba7a14b56175535884390e8601960b5d8ed1cf
Signed-off-by: Paul Keith <javelinanddart@gmail.com>
2 years agofs: sdfat: Add config option to register sdFAT for exFAT
Paul Keith [Fri, 2 Mar 2018 03:51:53 +0000 (04:51 +0100)]
fs: sdfat: Add config option to register sdFAT for exFAT

Change-Id: Id57abf0a4bd0b433fecc622eecb383cd4ea29d17
Signed-off-by: Paul Keith <javelinanddart@gmail.com>
2 years agofs: Import sdFAT version 2.4.5
Bruno Martins [Sun, 16 May 2021 14:15:40 +0000 (15:15 +0100)]
fs: Import sdFAT version 2.4.5

From Samsung package version T725XXU2DUD1.

Change-Id: I4e8c3317e03edc44d6ac4248231067da8c4f3873

2 years agokbuild: add dtc as dependency on .dtb files
Rob Herring [Tue, 27 Feb 2018 23:49:57 +0000 (17:49 -0600)]
kbuild: add dtc as dependency on .dtb files

If dtc is rebuilt, we should rebuild .dtb files with the new dtc.

Change-Id: I313ee55506628c0922161541adf293735cb4edae
Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Rob Herring <robh@kernel.org>
2 years agoscripts: Silence all dtc warnings seen with Samsung DTS
Bruno Martins [Mon, 30 Nov 2020 12:30:21 +0000 (12:30 +0000)]
scripts: Silence all dtc warnings seen with Samsung DTS

It is insane to fix them all, so just ignore.

If needed, can be re-enabled by building with 'W=1'.

Change-Id: Ic56bc08b255342fc19f5872463552be13375cceb

2 years agopower: supply: s2mu00x: export {CURRENT/VOLTAGE}_MAX to sysfs
Jesse Chan [Sat, 21 Apr 2018 07:08:51 +0000 (00:08 -0700)]
power: supply: s2mu00x: export {CURRENT/VOLTAGE}_MAX to sysfs

Change-Id: I54c775bb80c2151bdc69ea9fb53a48a34327bbef

2 years agoscripts/setlocalversion: Improve -dirty check with git-status --no-optional-locks
Brian Norris [Thu, 15 Nov 2018 02:11:18 +0000 (18:11 -0800)]
scripts/setlocalversion: Improve -dirty check with git-status --no-optional-locks

[ Upstream commit ff64dd4857303dd5550faed9fd598ac90f0f2238 ]

git-diff-index does not refresh the index for you, so using it for a
"-dirty" check can give misleading results. Commit 6147b1cf19651
("scripts/setlocalversion: git: Make -dirty check more robust") tried to
fix this by switching to git-status, but it overlooked the fact that
git-status also writes to the .git directory of the source tree, which
is definitely not kosher for an out-of-tree (O=) build. That is getting
reverted.

Fortunately, git-status now supports avoiding writing to the index via
the --no-optional-locks flag, as of git 2.14. It still calculates an
up-to-date index, but it avoids writing it out to the .git directory.

So, let's retry the solution from commit 6147b1cf19651 using this new
flag first, and if it fails, we assume this is an older version of git
and just use the old git-diff-index method.

It's hairy to get the 'grep -vq' (inverted matching) correct by stashing
the output of git-status (you have to be careful about the difference
betwen "empty stdin" and "blank line on stdin"), so just pipe the output
directly to grep and use a regex that's good enough for both the
git-status and git-diff-index version.

Change-Id: Ieb6e2ff2db99c081b17332136db860260d165385
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Suggested-by: Alexander Kapshuk <alexander.kapshuk@gmail.com>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Tested-by: Genki Sky <sky@genki.is>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2 years agokane/troika: disable CONFIG_SCSC_LOG_COLLECTION
Stricted [Fri, 21 Jan 2022 17:24:54 +0000 (17:24 +0000)]
kane/troika: disable CONFIG_SCSC_LOG_COLLECTION

 * causes random kernel crashes

Change-Id: I5348c4dfba02b1fabb65e124b32f850b8138c197

2 years agowireless: scsc: fix CONFIG_SCSC_LOG_COLLECTION ifdefs
Stricted [Fri, 14 Jan 2022 07:57:47 +0000 (07:57 +0000)]
wireless: scsc: fix CONFIG_SCSC_LOG_COLLECTION ifdefs

Change-Id: I280d122ffa74a201027bcab61208bf6f5a568d89

2 years agomedia: radio: s610: fix indentation warning
Stricted [Mon, 21 Sep 2020 21:09:47 +0000 (21:09 +0000)]
media: radio: s610: fix indentation warning

2 years agoinput: touchscreen: nt36xxx: fix snprintf warning
Khusika Dhamar Gusti [Fri, 13 Sep 2019 16:05:32 +0000 (23:05 +0700)]
input: touchscreen: nt36xxx: fix snprintf warning

Fixes the following clang warning:

../drivers/input/touchscreen/nt36xxx/nt36xxx_mp_ctrlram.c:1293:3: error: 'snprintf' size argument is too large; destination buffer has size 32, $
                snprintf(mpcriteria, PAGE_SIZE, "novatek-mp-criteria-%04X", ts->nvt_pid);
                ^
1 error generated.

Change-Id: Ieb3331a226cb6e183aaac2f22e7e8dccf27af7fc
Signed-off-by: Khusika Dhamar Gusti <mail@khusika.com>
2 years agoexynos9609: do not mount oem partition
Stricted [Wed, 9 Sep 2020 10:14:47 +0000 (10:14 +0000)]
exynos9609: do not mount oem partition

Change-Id: I98ea0ff45d6c3bd300439d93687c98c6deeab2e0

2 years agoadd toggle for disabling newly added USB devices
Daniel Micay [Tue, 16 May 2017 21:51:48 +0000 (17:51 -0400)]
add toggle for disabling newly added USB devices

Based on the public grsecurity patches.

Change-Id: I2cbea91b351cda7d098f4e1aa73dff1acbd23cce
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
2 years agokane/troika: disable CONFIG_SCSC_WLAN_KEY_MGMT_OFFLOAD
Stricted [Fri, 21 Jan 2022 17:23:12 +0000 (17:23 +0000)]
kane/troika: disable CONFIG_SCSC_WLAN_KEY_MGMT_OFFLOAD

 * our wpa_supplicant has no support for this

2 years agoarm64: Add 32-bit sigcontext definition to uapi signcontext.h
David Ng [Sun, 21 Dec 2014 20:53:22 +0000 (12:53 -0800)]
arm64: Add 32-bit sigcontext definition to uapi signcontext.h

The arm64 uapi sigcontext.h can be included by 32-bit userspace
modules.  Since arm and arm64 sigcontext definition are not
compatible, add arm sigcontext definition to arm64 sigcontext.h.

Change-Id: I94109b094f6c8376fdaeb2822d7b26d18ddfb2bc
Signed-off-by: David Ng <dave@codeaurora.org>
2 years agogud: fix mobicore initialization
Stricted [Wed, 28 Aug 2019 15:26:26 +0000 (15:26 +0000)]
gud: fix mobicore initialization

* backported from s9

Change-Id: I48476e899495490ded64a9e173e3daa3c4cdafa0

2 years agoAndroid: Add empty Android.mk file
LuK1337 [Tue, 11 Dec 2018 08:50:01 +0000 (09:50 +0100)]
Android: Add empty Android.mk file

* This prevents inclusion of drivers/staging/greybus/tools/Android.mk
  which will conflict in case we have more than 1 kernel tree in AOSP
  source dir.

Change-Id: I335bca7b6d6463b1ffc673ab5367603347516e13

2 years agoANDROID: arm64: kbuild: only specify code model with LTO for
Sami Tolvanen [Mon, 1 Oct 2018 23:21:22 +0000 (16:21 -0700)]
ANDROID: arm64: kbuild: only specify code model with LTO for
 modules

This fixes CONFIG_LTO_CLANG for LLVM >= r334571 where LLVMgold actually
started respecting the code model flag.

Bug: 116819139
Bug: 119418330
Test: builds with LLVM r334571, boots on a device, all modules load
Change-Id: I1af9d24ed1b789a40ac1e85bd00e176ff8a68aa5
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Miguel de Dios <migueldedios@google.com>
2 years agoANDROID: modpost: add an exception for CFI stubs
Sami Tolvanen [Thu, 4 Oct 2018 15:56:16 +0000 (08:56 -0700)]
ANDROID: modpost: add an exception for CFI stubs

When CONFIG_CFI_CLANG is enabled, LLVM renames all address taken
functions by appending a .cfi postfix to their names, and creates
function stubs with the original names. The compiler always injects
these stubs to the text section, even if the function itself is
placed into init or exit sections, which creates modpost warnings.
This commit adds a modpost exception for CFI stubs to prevent the
warnings.

Bug: 117237524
Change-Id: Ieb8bf20d0c3ad7b7295c535f598370220598cdb0
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2 years agodrivers: soc: cal-if: create new macro to handle NULL return
Michael Benedict [Sat, 29 Feb 2020 16:41:42 +0000 (03:41 +1100)]
drivers: soc: cal-if: create new macro to handle NULL return

some of the func have NULL's which will return the size of the pointer instead which is ((void*) 0), because of that we need to make a new macro to set it to 0

Signed-off-by: Michael Benedict <michaelbt@live.com>
Change-Id: I3ab61542330c7b13cd165dfb48d453cff005884d

2 years agofimc-is2: fix sizeof pointer use
Stricted [Sun, 3 May 2020 04:33:51 +0000 (04:33 +0000)]
fimc-is2: fix sizeof pointer use

Change-Id: Ie53e42ab4527647e4435449b738a6b869508160f

2 years agosound: soc: codecs: cs35l41: fix variable declaration
Stricted [Sat, 25 Apr 2020 23:53:50 +0000 (23:53 +0000)]
sound: soc: codecs: cs35l41: fix variable declaration

Change-Id: Ief5c96b73b59670c71025b879776a50b9b435d38

2 years agofimc-is2: fix variable declarations
Stricted [Sat, 25 Apr 2020 23:52:51 +0000 (23:52 +0000)]
fimc-is2: fix variable declarations

Change-Id: I99a5c03664346d74fc4fe93278121c52cbe3289c

2 years agoremove libdss from Makefile
Stricted [Fri, 24 Apr 2020 17:50:57 +0000 (17:50 +0000)]
remove libdss from Makefile

 * what is this even doing?

2 years agokane/troika: enable last_kmsg
Stricted [Fri, 21 Jan 2022 17:21:42 +0000 (17:21 +0000)]
kane/troika: enable last_kmsg

2 years agoadd kane defconfig
Stricted [Thu, 23 Apr 2020 16:55:05 +0000 (16:55 +0000)]
add kane defconfig

2 years agoadd troika defconfig
Stricted [Thu, 23 Apr 2020 16:50:09 +0000 (16:50 +0000)]
add troika defconfig

2 years agoimplement samsung last_kmsg
Stricted [Mon, 20 Apr 2020 13:09:23 +0000 (13:09 +0000)]
implement samsung last_kmsg

2 years ago[RAMEN9610-22013]futex: Fix inode life-time issue
Peter Zijlstra [Wed, 4 Mar 2020 10:28:31 +0000 (11:28 +0100)]
[RAMEN9610-22013]futex: Fix inode life-time issue

commit 8019ad13ef7f64be44d4f892af9c840179009254 upstream.

As reported by Jann, ihold() does not in fact guarantee inode
persistence. And instead of making it so, replace the usage of inode
pointers with a per boot, machine wide, unique inode identifier.

This sequence number is global, but shared (file backed) futexes are
rare enough that this should not become a performance issue.

Change-Id: I71ce183a546dc536fddd1c1f982fa96a577e5a2d
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 years ago[RAMEN9610-22013]HID: make arrays usage and value to be the same
Will McVicker [Sat, 5 Dec 2020 00:48:48 +0000 (00:48 +0000)]
[RAMEN9610-22013]HID: make arrays usage and value to be the same

commit ed9be64eefe26d7d8b0b5b9fa3ffdf425d87a01f upstream.

The HID subsystem allows an "HID report field" to have a different
number of "values" and "usages" when it is allocated. When a field
struct is created, the size of the usage array is guaranteed to be at
least as large as the values array, but it may be larger. This leads to
a potential out-of-bounds write in
__hidinput_change_resolution_multipliers() and an out-of-bounds read in
hidinput_count_leds().

To fix this, let's make sure that both the usage and value arrays are
the same size.

Change-Id: I96857aa2e17b88d900a50aa16a016f4aca6b1fa5
Cc: stable@vger.kernel.org
Signed-off-by: Will McVicker <willmcvicker@google.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 years ago[RAMEN9610-22009]tty: Fix ->pgrp locking in tiocspgrp()
Jann Horn [Thu, 3 Dec 2020 01:25:04 +0000 (02:25 +0100)]
[RAMEN9610-22009]tty: Fix ->pgrp locking in tiocspgrp()

commit 54ffccbf053b5b6ca4f6e45094b942fab92a25fc upstream.

tiocspgrp() takes two tty_struct pointers: One to the tty that userspace
passed to ioctl() (`tty`) and one to the TTY being changed (`real_tty`).
These pointers are different when ioctl() is called with a master fd.

To properly lock real_tty->pgrp, we must take real_tty->ctrl_lock.

This bug makes it possible for racing ioctl(TIOCSPGRP, ...) calls on
both sides of a PTY pair to corrupt the refcount of `struct pid`,
leading to use-after-free errors.

Change-Id: I000d97f52d398cdf171fb8fb53d9b91456e9dc4c
Fixes: 47f86834bbd4 ("redo locking of tty->pgrp")
CC: stable@kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 years ago[RAMEN9610-21992]icmp: randomize the global rate limiter
Eric Dumazet [Thu, 15 Oct 2020 18:42:00 +0000 (11:42 -0700)]
[RAMEN9610-21992]icmp: randomize the global rate limiter

[ Upstream commit b38e7819cae946e2edf869e604af1e65a5d241c5 ]

Keyu Man reported that the ICMP rate limiter could be used
by attackers to get useful signal. Details will be provided
in an upcoming academic publication.

Our solution is to add some noise, so that the attackers
no longer can get help from the predictable token bucket limiter.

Fixes: 4cdf507d5452 ("icmp: add a global rate limitation")
Change-Id: Ib1a0ee70e55123a17d88b79a83664a3b2ac70034
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Keyu Man <kman001@ucr.edu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 years ago[RAMEN9610-21992]mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()
Andrea Arcangeli [Wed, 27 May 2020 23:06:24 +0000 (19:06 -0400)]
[RAMEN9610-21992]mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()

commit c444eb564fb16645c172d550359cb3d75fe8a040 upstream.

Write protect anon page faults require an accurate mapcount to decide
if to break the COW or not. This is implemented in the THP path with
reuse_swap_page() ->
page_trans_huge_map_swapcount()/page_trans_huge_mapcount().

If the COW triggers while the other processes sharing the page are
under a huge pmd split, to do an accurate reading, we must ensure the
mapcount isn't computed while it's being transferred from the head
page to the tail pages.

reuse_swap_cache() already runs serialized by the page lock, so it's
enough to add the page lock around __split_huge_pmd_locked too, in
order to add the missing serialization.

Note: the commit in "Fixes" is just to facilitate the backporting,
because the code before such commit didn't try to do an accurate THP
mapcount calculation and it instead used the page_count() to decide if
to COW or not. Both the page_count and the pin_count are THP-wide
refcounts, so they're inaccurate if used in
reuse_swap_page(). Reverting such commit (besides the unrelated fix to
the local anon_vma assignment) would have also opened the window for
memory corruption side effects to certain workloads as documented in
such commit header.

Change-Id: I702663cf38fd55f76fa2772e4b1cc8b1d0a45a0a
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Jann Horn <jannh@google.com>
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages during WP faults")
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 years ago[RAMEN9610-21992]block: Fix use-after-free in blkdev_get()
Jason Yan [Tue, 16 Jun 2020 12:16:55 +0000 (20:16 +0800)]
[RAMEN9610-21992]block: Fix use-after-free in blkdev_get()

[ Upstream commit 2d3a8e2deddea6c89961c422ec0c5b851e648c14 ]

In blkdev_get() we call __blkdev_get() to do some internal jobs and if
there is some errors in __blkdev_get(), the bdput() is called which
means we have released the refcount of the bdev (actually the refcount of
the bdev inode). This means we cannot access bdev after that point. But
acctually bdev is still accessed in blkdev_get() after calling
__blkdev_get(). This results in use-after-free if the refcount is the
last one we released in __blkdev_get(). Let's take a look at the
following scenerio:

  CPU0            CPU1                    CPU2
blkdev_open     blkdev_open           Remove disk
                  bd_acquire
  blkdev_get
    __blkdev_get      del_gendisk
bdev_unhash_inode
  bd_acquire          bdev_get_gendisk
    bd_forget           failed because of unhashed
  bdput
              bdput (the last one)
        bdev_evict_inode

       access bdev => use after free

[  459.350216] BUG: KASAN: use-after-free in __lock_acquire+0x24c1/0x31b0
[  459.351190] Read of size 8 at addr ffff88806c815a80 by task syz-executor.0/20132
[  459.352347]
[  459.352594] CPU: 0 PID: 20132 Comm: syz-executor.0 Not tainted 4.19.90 #2
[  459.353628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  459.354947] Call Trace:
[  459.355337]  dump_stack+0x111/0x19e
[  459.355879]  ? __lock_acquire+0x24c1/0x31b0
[  459.356523]  print_address_description+0x60/0x223
[  459.357248]  ? __lock_acquire+0x24c1/0x31b0
[  459.357887]  kasan_report.cold+0xae/0x2d8
[  459.358503]  __lock_acquire+0x24c1/0x31b0
[  459.359120]  ? _raw_spin_unlock_irq+0x24/0x40
[  459.359784]  ? lockdep_hardirqs_on+0x37b/0x580
[  459.360465]  ? _raw_spin_unlock_irq+0x24/0x40
[  459.361123]  ? finish_task_switch+0x125/0x600
[  459.361812]  ? finish_task_switch+0xee/0x600
[  459.362471]  ? mark_held_locks+0xf0/0xf0
[  459.363108]  ? __schedule+0x96f/0x21d0
[  459.363716]  lock_acquire+0x111/0x320
[  459.364285]  ? blkdev_get+0xce/0xbe0
[  459.364846]  ? blkdev_get+0xce/0xbe0
[  459.365390]  __mutex_lock+0xf9/0x12a0
[  459.365948]  ? blkdev_get+0xce/0xbe0
[  459.366493]  ? bdev_evict_inode+0x1f0/0x1f0
[  459.367130]  ? blkdev_get+0xce/0xbe0
[  459.367678]  ? destroy_inode+0xbc/0x110
[  459.368261]  ? mutex_trylock+0x1a0/0x1a0
[  459.368867]  ? __blkdev_get+0x3e6/0x1280
[  459.369463]  ? bdev_disk_changed+0x1d0/0x1d0
[  459.370114]  ? blkdev_get+0xce/0xbe0
[  459.370656]  blkdev_get+0xce/0xbe0
[  459.371178]  ? find_held_lock+0x2c/0x110
[  459.371774]  ? __blkdev_get+0x1280/0x1280
[  459.372383]  ? lock_downgrade+0x680/0x680
[  459.373002]  ? lock_acquire+0x111/0x320
[  459.373587]  ? bd_acquire+0x21/0x2c0
[  459.374134]  ? do_raw_spin_unlock+0x4f/0x250
[  459.374780]  blkdev_open+0x202/0x290
[  459.375325]  do_dentry_open+0x49e/0x1050
[  459.375924]  ? blkdev_get_by_dev+0x70/0x70
[  459.376543]  ? __x64_sys_fchdir+0x1f0/0x1f0
[  459.377192]  ? inode_permission+0xbe/0x3a0
[  459.377818]  path_openat+0x148c/0x3f50
[  459.378392]  ? kmem_cache_alloc+0xd5/0x280
[  459.379016]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  459.379802]  ? path_lookupat.isra.0+0x900/0x900
[  459.380489]  ? __lock_is_held+0xad/0x140
[  459.381093]  do_filp_open+0x1a1/0x280
[  459.381654]  ? may_open_dev+0xf0/0xf0
[  459.382214]  ? find_held_lock+0x2c/0x110
[  459.382816]  ? lock_downgrade+0x680/0x680
[  459.383425]  ? __lock_is_held+0xad/0x140
[  459.384024]  ? do_raw_spin_unlock+0x4f/0x250
[  459.384668]  ? _raw_spin_unlock+0x1f/0x30
[  459.385280]  ? __alloc_fd+0x448/0x560
[  459.385841]  do_sys_open+0x3c3/0x500
[  459.386386]  ? filp_open+0x70/0x70
[  459.386911]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[  459.387610]  ? trace_hardirqs_off_caller+0x55/0x1c0
[  459.388342]  ? do_syscall_64+0x1a/0x520
[  459.388930]  do_syscall_64+0xc3/0x520
[  459.389490]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  459.390248] RIP: 0033:0x416211
[  459.390720] Code: 75 14 b8 02 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83
04 19 00 00 c3 48 83 ec 08 e8 0a fa ff ff 48 89 04 24 b8 02 00 00 00 0f
   05 <48> 8b 3c 24 48 89 c2 e8 53 fa ff ff 48 89 d0 48 83 c4 08 48 3d
      01
[  459.393483] RSP: 002b:00007fe45dfe9a60 EFLAGS: 00000293 ORIG_RAX: 0000000000000002
[  459.394610] RAX: ffffffffffffffda RBX: 00007fe45dfea6d4 RCX: 0000000000416211
[  459.395678] RDX: 00007fe45dfe9b0a RSI: 0000000000000002 RDI: 00007fe45dfe9b00
[  459.396758] RBP: 000000000076bf20 R08: 0000000000000000 R09: 000000000000000a
[  459.397930] R10: 0000000000000075 R11: 0000000000000293 R12: 00000000ffffffff
[  459.399022] R13: 0000000000000bd9 R14: 00000000004cdb80 R15: 000000000076bf2c
[  459.400168]
[  459.400430] Allocated by task 20132:
[  459.401038]  kasan_kmalloc+0xbf/0xe0
[  459.401652]  kmem_cache_alloc+0xd5/0x280
[  459.402330]  bdev_alloc_inode+0x18/0x40
[  459.402970]  alloc_inode+0x5f/0x180
[  459.403510]  iget5_locked+0x57/0xd0
[  459.404095]  bdget+0x94/0x4e0
[  459.404607]  bd_acquire+0xfa/0x2c0
[  459.405113]  blkdev_open+0x110/0x290
[  459.405702]  do_dentry_open+0x49e/0x1050
[  459.406340]  path_openat+0x148c/0x3f50
[  459.406926]  do_filp_open+0x1a1/0x280
[  459.407471]  do_sys_open+0x3c3/0x500
[  459.408010]  do_syscall_64+0xc3/0x520
[  459.408572]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  459.409415]
[  459.409679] Freed by task 1262:
[  459.410212]  __kasan_slab_free+0x129/0x170
[  459.410919]  kmem_cache_free+0xb2/0x2a0
[  459.411564]  rcu_process_callbacks+0xbb2/0x2320
[  459.412318]  __do_softirq+0x225/0x8ac

Fix this by delaying bdput() to the end of blkdev_get() which means we
have finished accessing bdev.

Fixes: 77ea887e433a ("implement in-kernel gendisk events handling")
Change-Id: I9ec1b9ea3520f3e1374b95cee48ac8cc38476117
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
3 years ago[RAMEN9610-21968]ANDROID: xt_qtaguid: Remove tag_entry from process list on untag
Kalesh Singh [Mon, 11 Jan 2021 06:26:18 +0000 (01:26 -0500)]
[RAMEN9610-21968]ANDROID: xt_qtaguid: Remove tag_entry from process list on untag

A sock_tag_entry can only be part of one process's
pqd_entry->sock_tag_list. RetagGing the socket only updates
sock_tag_entry->tag, and does not add the tag entry to the current
process's pqd_entry list, nor update sock_tag_entry->pid.
So the sock_tag_entry is only ever present in the
pqd_entry list of the process that initially tagged the socket.

A sock_tag_entry can also get created and not be added to any process's
pqd_entry list. This happens if the process that initially tags the
socket has not opened /dev/xt_qtaguid.

ctrl_cmd_untag() supports untagGing from a context other than the
process that initially tagged the socket. Currently, the sock_tag_entry is
only removed from its containing pqd_entry->sock_tag_list if the
process that does the untagGing has opened /dev/xt_qtaguid. However, the
tag entry should always be deleted from its pqd entry list (if present).

Bug: 176919394
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I5b6f0c36c0ebefd98cc6873a4057104c7d885ccc
(cherry picked from commit c2ab93b45b5cdc426868fb8793ada2cac20568ef)

3 years agotracing: do not leak kernel addresses
Nick Desaulniers [Fri, 3 Mar 2017 23:40:12 +0000 (15:40 -0800)]
tracing: do not leak kernel addresses

This likely breaks tracing tools like trace-cmd.  It logs in the same
format but now addresses are all 0x0.

Bug: 34277115
MOT-CRs-fixed: (CR)

Change-Id: Ifb0d4d2a184bf0d95726de05b1acee0287a375d9
Reviewed-on: https://gerrit.mot.com/1887836
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agoarm/config: support for hid-nintendo driver
sunyue5 [Wed, 30 Dec 2020 09:54:19 +0000 (17:54 +0800)]
arm/config: support for hid-nintendo driver

CTS will test this device so we should enable it in the
kernel config
CTS cases:
run cts -m CtsHardwareTestCases -t
android.hardware.input.cts.tests.NintendoSwitchProTest#testAllKeys
android.hardware.input.cts.tests.NintendoSwitchProTest#testAllMotions

Change-Id: I4c0f9dcb468fbfa629ae82b984ff55dc170a1acc
Signed-off-by: sunyue5 <sunyue5@motorola.com>
Reviewed-on: https://gerrit.mot.com/1839437
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agoarm64/defconfig: Disable CONFIG_SCSC_WLAN_WIFI_SHARING
sunyue5 [Fri, 11 Dec 2020 01:25:53 +0000 (09:25 +0800)]
arm64/defconfig: Disable CONFIG_SCSC_WLAN_WIFI_SHARING

SCSC_WLAN_WIFI_SHARING is to enable APSTA function which means
the device can start wifi hotspot with wifi station enabled.
In this case, wlan1 will be for the AP mode while wlan0 will be
for the STA mode.
Disable this function since we have disabled it in the frameworks

Change-Id: Icd0d10badec991b3235735c0e53de3214368ba87
Signed-off-by: sunyue5 <sunyue5@motorola.com>
Reviewed-on: https://gerrit.mot.com/1825115
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agoenable blkio for exynos9610
lulei1 [Fri, 27 Nov 2020 06:01:27 +0000 (14:01 +0800)]
enable blkio for exynos9610

IO cgroup should be enabled to restrict background IO R/W.

Change-Id: I4736eb38d208a23afb183c27a5d90296e79e5f39
Reviewed-on: https://gerrit.mot.com/1812665
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Guolin Wang <wanggl3@mt.com>
Submit-Approved: Jira Key

3 years agoarm64/defconfig: sync Q defconfig
zhaoxp3 [Fri, 13 Nov 2020 07:25:04 +0000 (15:25 +0800)]
arm64/defconfig: sync Q defconfig

1. sync kane defconfig change
2. pick wifi change
RAMEN9610-21775]wlbt: Enable Wifi Configurations in Defconfig.
Enable Wifi Configurations in Defconfi

Change-Id: Iea3d6c430f6b3a039c7124b5fa2a331c15bad27c
Signed-off-by: zhaoxp3 <zhaoxp3@motorola.com>
3 years agoUPSTREAM: signal: don't silently convert SI_USER signals to non-current pidfd
Jann Horn [Sat, 30 Mar 2019 02:12:32 +0000 (03:12 +0100)]
UPSTREAM: signal: don't silently convert SI_USER signals to non-current pidfd

The current sys_pidfd_send_signal() silently turns signals with explicit
SI_USER context that are sent to non-current tasks into signals with
kernel-generated siGinfo.
This is unlike do_rt_sigqueueinfo(), which returns -EPERM in this case.
If a user actually wants to send a signal with kernel-provided siGinfo,
they can do that with pidfd_send_signal(pidfd, sig, NULL, 0); so allowing
this case is unnecessary.

Instead of silently replacing the siGinfo, just bail out with an error;
this is consistent with other interfaces and avoids special-casing behavior
based on security checks.

Fixes: 3eb39f47934f ("signal: add pidfd_send_signal() syscall")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit 556a888a14afe27164191955618990fb3ccc9aad)

Mot-CRs-fixed: (CR)

Bug: 135608568
Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL
Change-Id: I493af671b82c43bff1425ee24550d2fb9aa6d961
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/1505848
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/1796156
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
3 years agoUPSTREAM: pidfd: add polling support
Joel Fernandes (Google) [Tue, 30 Apr 2019 16:21:53 +0000 (12:21 -0400)]
UPSTREAM: pidfd: add polling support

This patch adds polling support to pidfd.

Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, since the wrong PID is now possibly checked for existence.
Using the polling support, LMK will be able to get notified when a process
exists in race-free and fast way, and allows the LMK to do other things
(such as by polling on other fds) while awaiting the process being killed
to die.

For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.

We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
   it in task_struct would not be option in this case because the task can
   be reaped and destroyed before the poll returns.

2. By including the struct pid for the waitqueue means that during
   de_thread(), the new thread group leader automatically gets the new
   waitqueue/pid even though its task_struct is different.

Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@android.com
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Co-developed-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24)

Mot-CRs-fixed: (CR)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I02f259d2875bec46b198d580edfbb067f077084e
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/1505855
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/1796163
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
3 years agoBACKPORT: signal: support CLONE_PIDFD with pidfd_send_signal
Christian Brauner [Wed, 17 Apr 2019 20:50:25 +0000 (22:50 +0200)]
BACKPORT: signal: support CLONE_PIDFD with pidfd_send_signal

Let pidfd_send_signal() use pidfds retrieved via CLONE_PIDFD.  With this
patch pidfd_send_signal() becomes independent of procfs.  This fullfils
the request made when we merged the pidfd_send_signal() patchset.  The
pidfd_send_signal() syscall is now always available allowing for it to
be used by users without procfs mounted or even users without procfs
support compiled into the kernel.

Signed-off-by: Christian Brauner <christian@brauner.io>
Co-developed-by: Jann Horn <jannh@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 2151ad1b067275730de1b38c7257478cae47d29e)

Conflicts:
        kernel/sys_ni.c

(1. Replaced COND_SYSCALL with cond_syscall.)

Mot-CRs-fixed: (CR)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I621fe6547397e0e68c560d7da60ef7715deb290c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/1505852
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/1796160
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
3 years agoUPSTREAM: fork: do not release lock that wasn't taken
Christian Brauner [Fri, 10 May 2019 09:53:46 +0000 (11:53 +0200)]
UPSTREAM: fork: do not release lock that wasn't taken

Avoid calling cgroup_threadgroup_change_end() without having called
cgroup_threadgroup_change_beGin() first.

During process creation we need to check whether the cgroup we are in
allows us to fork. To perform this check the cgroup needs to guard itself
against threadgroup changes and takes a lock.
Prior to CLONE_PIDFD the cleanup target "bad_fork_free_pid" would also need
to call cgroup_threadgroup_change_end() because said lock had already been
taken.
However, this is not the case anymore with the addition of CLONE_PIDFD. We
are now allocating a pidfd before we check whether the cgroup we're in can
fork and thus prior to taking the lock. So when copy_process() fails at the
right step it would release a lock we haven't taken.
This bug is not even very subtle to be honest. It's just not very clear
from the naming of cgroup_threadgroup_change_{beGin,end}() that a lock is
taken.

Here's the relevant splat:

entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7fec849
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90
90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078
RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000
RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(depth <= 0)
WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052 __lock_release
kernel/locking/lockdep.c:4052 [inline]
WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052
lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 7744 Comm: syz-executor007 Not tainted 5.1.0+ #4
Hardware name: Google Google Compute EnGine/Google Compute EnGine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  panic+0x2cb/0x65c kernel/panic.c:214
  __warn.cold+0x20/0x45 kernel/panic.c:566
  report_bug+0x263/0x2b0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  fixup_bug arch/x86/kernel/traps.c:174 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
  do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:972
RIP: 0010:__lock_release kernel/locking/lockdep.c:4052 [inline]
RIP: 0010:lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321
Code: 0f 85 a0 03 00 00 8b 35 77 66 08 08 85 f6 75 23 48 c7 c6 a0 55 6b 87
48 c7 c7 40 25 6b 87 4c 89 85 70 ff ff ff e8 b7 a9 eb ff <0f> 0b 4c 8b 85
70 ff ff ff 4c 89 ea 4c 89 e6 4c 89 c7 e8 52 63 ff
RSP: 0018:ffff888094117b48 EFLAGS: 00010086
RAX: 0000000000000000 RBX: 1ffff11012822f6f RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff815af236 RDI: ffffed1012822f5b
RBP: ffff888094117c00 R08: ffff888092bfc400 R09: fffffbfff113301d
R10: fffffbfff113301c R11: ffffffff889980e3 R12: ffffffff8a451df8
R13: ffffffff8142e71f R14: ffffffff8a44cc80 R15: ffff888094117bd8
  percpu_up_read.constprop.0+0xcb/0x110 include/linux/percpu-rwsem.h:92
  cgroup_threadgroup_change_end include/linux/cgroup-defs.h:712 [inline]
  copy_process.part.0+0x47ff/0x6710 kernel/fork.c:2222
  copy_process kernel/fork.c:1772 [inline]
  _do_fork+0x25d/0xfd0 kernel/fork.c:2338
  __do_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:240 [inline]
  __se_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:236 [inline]
  __ia32_compat_sys_x86_clone+0xbc/0x140 arch/x86/ia32/sys_ia32.c:236
  do_syscall_32_irqs_on arch/x86/entry/common.c:334 [inline]
  do_fast_syscall_32+0x281/0xd54 arch/x86/entry/common.c:405
  entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7fec849
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90
90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078
RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000
RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Kernel Offset: disabled
Rebooting in 86400 seconds..

Reported-and-tested-by: syzbot+3286e58549edc479faae@syzkaller.appspotmail.com
Fixes: b3e583825266 ("clone: add CLONE_PIDFD")
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit c3b7112df86b769927a60a6d7175988ca3d60f09)

Mot-CRs-fixed: (CR)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: Ib9ecb1e5c0c6e2d062b89c25109ec571570eb497
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/1505853
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/1796161
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
3 years agoBACKPORT: clone: add CLONE_PIDFD
Christian Brauner [Wed, 27 Mar 2019 12:04:15 +0000 (13:04 +0100)]
BACKPORT: clone: add CLONE_PIDFD

This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call.  Linus oriGinally suggested to implement this as a
new flag to clone() instead of making it a separate system call.  As
spotted by Linus, there is exactly one bit for clone() left.

CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api.  They serve as a simple opaque handle on pids.  Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending).  Thus, a
pidfd cannot just refer to a tgid, but also a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege.  This does not imply it
cannot ever be that way but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel supports
procfs.  The fdinfo file contains the pid of the process in the callers
pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".

As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
parent_tidptr argument of clone.  This has the advantage that we can
give back the associated pid and the pidfd at the same time.

To remove worries about missing metadata access this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>.  The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Co-developed-by: Jann Horn <jannh@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit b3e5838252665ee4cfa76b82bdf1198dca81e5be)

Conflicts:
        kernel/fork.c

(1. Replaced proc_pid_ns() with its direct implementation.)

Mot-CRs-fixed: (CR)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I3c804a92faea686e5bf7f99df893fe3a5d87ddf7
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/1505851
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/1796159
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
3 years agoUPSTREAM: Make anon_inodes unconditional
David Howells [Mon, 5 Nov 2018 17:40:31 +0000 (17:40 +0000)]
UPSTREAM: Make anon_inodes unconditional

Make the anon_inodes facility unconditional so that it can be used by core
VFS code.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit dadd2299ab61fc2b55b95b7b3a8f674cdd3b69c9)

Mot-CRs-fixed: (CR)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I2f97bda4f360d8d05bbb603de839717b3d8067ae
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/1505850
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/1796158
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
3 years agoBACKPORT: pid: add pidfd_open()
Christian Brauner [Fri, 24 May 2019 10:43:51 +0000 (12:43 +0200)]
BACKPORT: pid: add pidfd_open()

This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:

int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);

With the introduction of pidfds through CLONE_PIDFD it is possible to
created pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller can currently not create a pollable pidfd. This is a problem for
Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this api. With pidfd_open() making it possible to retrieve
pidfds for PID-based processes we enable them to adopt this api.

In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.

Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
(cherry picked from commit 32fcb426ec001cb6d5a4a195091a8486ea77e2df)

Conflicts:
        kernel/pid.c

(1. Replaced PIDTYPE_TGID with PIDTYPE_PID and thread_group_leader() check in pidfd_open() call)

Mot-CRs-fixed: (CR)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I52a93a73722d7f7754dae05f63b94b4ca4a71a75
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/1505856
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/1796165
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
3 years agoUPSTREAM: pidfd: fix a poll race when setting exit_state
Suren Baghdasaryan [Wed, 17 Jul 2019 17:21:00 +0000 (13:21 -0400)]
UPSTREAM: pidfd: fix a poll race when setting exit_state

There is a race between reading task->exit_state in pidfd_poll and
writing it after do_notify_parent calls do_notify_pidfd. Expected
sequence of events is:

CPU 0                            CPU 1
------------------------------------------------
exit_notify
  do_notify_parent
    do_notify_pidfd
  tsk->exit_state = EXIT_DEAD
                                  pidfd_poll
                                     if (tsk->exit_state)

However nothing prevents the following sequence:

CPU 0                            CPU 1
------------------------------------------------
exit_notify
  do_notify_parent
    do_notify_pidfd
                                   pidfd_poll
                                      if (tsk->exit_state)
  tsk->exit_state = EXIT_DEAD

This causes a polling task to wait forever, since poll blocks because
exit_state is 0 and the waiting task is not notified again. A stress
test continuously doing pidfd poll and process exits uncovered this bug.

To fix it, we make sure that the task's exit_state is always set before
calling do_notify_pidfd.

Fixes: b53b0b9d9a6 ("pidfd: add polling support")
Cc: kernel-team@android.com
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20190717172100.261204-1-joel@joelfernandes.org
[christian@brauner.io: adapt commit message and drop unneeded changes from wait_task_zombie]
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit b191d6491be67cef2b3fa83015561caca1394ab9)

Mot-CRs-fixed: (CR)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I043e54c9b69f25de88f6f19ae167920af8532de2
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/1505858
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/1796167
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
3 years agoEnable PSI
huangzq2 [Fri, 6 Nov 2020 11:10:37 +0000 (19:10 +0800)]
Enable PSI

Change-Id: I02a9d4b95606250a6727479ed1a1be01dd208e50
Signed-off-by: huangzq2 <huangzq2@motorola.com>
Reviewed-on: https://gerrit.mot.com/1796168
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agoUPSTREAM: signal: improve comments
Christian Brauner [Tue, 4 Jun 2019 13:18:43 +0000 (15:18 +0200)]
UPSTREAM: signal: improve comments

Improve the comments for pidfd_send_signal().
First, the comment still referred to a file descriptor for a process as a
"task file descriptor" which stems from way back at the beGinning of the
discussion. Replace this with "pidfd" for consistency.
Second, the wording for the explanation of the arguments to the syscall
was a bit inconsistent, e.g. some used the past tense some used present
tense. Make the wording more consistent.

Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit c732327f04a3818f35fa97d07b1d64d31b691d78)

Mot-CRs-fixed: (CR)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I06c6bdd1dddaeb8ac75a78dd21f9cdd0dc139a4c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/1505854
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/1796162
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
3 years agoBACKPORT: arch: wire-up pidfd_open()
Christian Brauner [Fri, 24 May 2019 10:44:59 +0000 (12:44 +0200)]
BACKPORT: arch: wire-up pidfd_open()

This wires up the pidfd_open() syscall into all arches at once.

Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-m68k@lists.linux-m68k.org
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-arch@vger.kernel.org
Cc: x86@kernel.org
(cherry picked from commit 7615d9e1780e26e0178c93c55b73309a5dc093d7)

Conflicts:

        arch/alpha/kernel/syscalls/syscall.tbl
        arch/arm/tools/syscall.tbl
        arch/ia64/kernel/syscalls/syscall.tbl
        arch/m68k/kernel/syscalls/syscall.tbl
        arch/microblaze/kernel/syscalls/syscall.tbl
        arch/mips/kernel/syscalls/syscall_n32.tbl
        arch/mips/kernel/syscalls/syscall_n64.tbl
        arch/mips/kernel/syscalls/syscall_o32.tbl
        arch/parisc/kernel/syscalls/syscall.tbl
        arch/powerpc/kernel/syscalls/syscall.tbl
        arch/s390/kernel/syscalls/syscall.tbl
        arch/sh/kernel/syscalls/syscall.tbl
        arch/sparc/kernel/syscalls/syscall.tbl
        arch/xtensa/kernel/syscalls/syscall.tbl
        arch/x86/entry/syscalls/syscall_32.tbl
        arch/x86/entry/syscalls/syscall_64.tbl

(1. Skipped syscall.tbl modifications for missing architectures.
 2. Removed __ia32_sys_pidfd_open in arch/x86/entry/syscalls/syscall_32.tbl.
 3. Replaced __x64_sys_pidfd_open with sys_pidfd_open in arch/x86/entry/syscalls/syscall_64.tbl.)

Mot-CRs-fixed: (CR)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I294aa33dea5ed2662e077340281d7aa0452f7471
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/1505857
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/1796166
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
3 years agoUPSTREAM: signal: use fdget() since we don't allow O_PATH
Christian Brauner [Thu, 18 Apr 2019 10:18:39 +0000 (12:18 +0200)]
UPSTREAM: signal: use fdget() since we don't allow O_PATH

As stated in the oriGinal commit for pidfd_send_signal() we don't allow
to signal processes through O_PATH file descriptors since it is
semantically equivalent to a write on the pidfd.

We already correctly error out right now and return EBADF if an O_PATH
fd is passed.  This is because we use file->f_op to detect whether a
pidfd is passed and O_PATH fds have their file->f_op set to empty_fops
in do_dentry_open() and thus fail the test.

Thus, there is no regression.  It's just semantically correct to use
fdget() and return an error right from there instead of taking a
reference and returning an error later.

Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jann@thejh.net>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 738a7832d21e3d911fcddab98ce260b79010b461)

Mot-CRs-fixed: (CR)

Bug: 135608568
Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL
Change-Id: IMontanaeaadf9da371fb2d9caae4df49627760de7229
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/1505849
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/1796157
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.c
3 years agoBACKPORT: signal: add pidfd_send_signal() syscall
Christian Brauner [Sun, 18 Nov 2018 23:51:56 +0000 (00:51 +0100)]
BACKPORT: signal: add pidfd_send_signal() syscall

The kill() syscall operates on process identifiers (pid). After a process
has exited its pid can be reused by another process. If a caller sends a
signal to a reused pid it will end up signaling the wrong process. This
issue has often Surfaced and there has been a push to address this problem [1].

This patch uses file descriptors (fd) from proc/<pid> as stable handles on
struct pid. Even if a pid is recycled the handle will not change. The fd
can be used to send signals to the process it refers to.
Thus, the new syscall pidfd_send_signal() is introduced to solve this
problem. Instead of pids it operates on process fds (pidfd).

/* prototype and argument /*
long pidfd_send_signal(int pidfd, int sig, siGinfo_t *info, unsigned int flags);

/* syscall number 424 */
The syscall number was chosen to be 424 to align with Arnd's rework in his
y2038 to minimize merge conflicts (cf. [25]).

In addition to the pidfd and signal argument it takes an additional
siGinfo_t and flags argument. If the siGinfo_t argument is NULL then
pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
The flags argument is added to allow for future extensions of this syscall.
It currently needs to be passed as 0. Failing to do so will cause EINVAL.

/* pidfd_send_signal() replaces multiple pid-based syscalls */
The pidfd_send_signal() syscall currently takes on the job of
rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a
positive pid is passed to kill(2). It will however be possible to also
replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.

/* sending signals to threads (tid) and process groups (pgid) */
Specifically, the pidfd_send_signal() syscall does currently not operate on
process groups or threads. This is left for future extensions.
In order to extend the syscall to allow sending signal to threads and
process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
PIDFD_TYPE_TID) should be added. This implies that the flags argument will
determine what is signaled and not the file descriptor itself. Put in other
words, grouping in this api is a property of the flags argument not a
property of the file descriptor (cf. [13]). Clarification for this has been
requested by Eric (cf. [19]).
When appropriate extensions through the flags argument are added then
pidfd_send_signal() can additionally replace the part of kill(2) which
operates on process groups as well as the tgkill(2) and
rt_tgsigqueueinfo(2) syscalls.
How such an extension could be implemented has been very roughly sketched
in [14], [15], and [16]. However, this should not be taken as a commitment
to a particular implementation. There might be better ways to do it.
Right now this is intentionally left out to keep this patchset as simple as
possible (cf. [4]).

/* naming */
The syscall had various names throughout iterations of this patchset:
- procfd_signal()
- procfd_send_signal()
- taskfd_send_signal()
In the last round of reviews it was pointed out that given that if the
flags argument decides the scope of the signal instead of different types
of fds it might make sense to either settle for "procfd_" or "pidfd_" as
prefix. The community was willing to accept either (cf. [17] and [18]).
Given that one developer expressed strong preference for the "pidfd_"
prefix (cf. [13]) and with other developers less opinionated about the name
we should settle for "pidfd_" to avoid further bikeshedding.

The  "_send_signal" suffix was chosen to reflect the fact that the syscall
takes on the job of multiple syscalls. It is therefore intentional that the
name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
fomer because it might imply that pidfd_send_signal() is a replacement for
kill(2), and not the latter because it is a hassle to remember the correct
spelling - especially for non-native speakers - and because it is not
descriptive enough of what the syscall actually does. The name
"pidfd_send_signal" makes it very clear that its job is to send signals.

/* zombies */
Zombies can be signaled just as any other process. No special error will be
reported since a zombie state is an unreliable state (cf. [3]). However,
this can be added as an extension through the @flags argument if the need
ever arises.

/* cross-namespace signals */
The patch currently enforces that the signaler and signalee either are in
the same pid namespace or that the signaler's pid namespace is an ancestor
of the signalee's pid namespace. This is done for the sake of simplicity
and because it is unclear to what values certain members of struct
siGinfo_t would need to be set to (cf. [5], [6]).

/* compat syscalls */
It became clear that we would like to avoid adding compat syscalls
(cf. [7]).  The compat syscall handling is now done in kernel/signal.c
itself by adding __copy_siGinfo_from_user_generic() which lets us avoid
compat syscalls (cf. [8]). It should be noted that the addition of
__copy_siGinfo_from_user_any() is caused by a bug in the oriGinal
implementation of rt_sigqueueinfo(2) (cf. 12).
With upcoming rework for syscall handling things might improve
significantly (cf. [11]) and __copy_siGinfo_from_user_any() will not gain
any additional callers.

/* testing */
This patch was tested on x64 and x86.

/* userspace usage */
An asciinema recording for the basic functionality can be found under [9].
With this patch a process can be killed via:

 #define _GNU_SOURCE
 #include <errno.h>
 #include <fcntl.h>
 #include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/stat.h>
 #include <sys/syscall.h>
 #include <sys/types.h>
 #include <unistd.h>

 static inline int do_pidfd_send_signal(int pidfd, int sig, siGinfo_t *info,
                                         unsigned int flags)
 {
 #ifdef __NR_pidfd_send_signal
         return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
 #else
         return -ENOSYS;
 #endif
 }

 int main(int argc, char *argv[])
 {
         int fd, ret, saved_errno, sig;

         if (argc < 3)
                 exit(EXIT_FAILURE);

         fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
         if (fd < 0) {
                 printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
                 exit(EXIT_FAILURE);
         }

         sig = atoi(argv[2]);

         printf("Sending signal %d to process %s\n", sig, argv[1]);
         ret = do_pidfd_send_signal(fd, sig, NULL, 0);

         saved_errno = errno;
         close(fd);
         errno = saved_errno;

         if (ret < 0) {
                 printf("%s - Failed to send signal %d to process %s\n",
                        strerror(errno), sig, argv[1]);
                 exit(EXIT_FAILURE);
         }

         exit(EXIT_SUCCESS);
 }

/* Q&A
 * Given that it seems the same questions get asked again by people who are
 * late to the party it makes sense to add a Q&A section to the commit
 * message so it's hopefully easier to avoid duplicate threads.
 *
 * For the sake of progress please consider these arguments settled unless
 * there is a new point that desperately needs to be addressed. Please make
 * sure to check the links to the threads in this commit message whether
 * this has not already been covered.
 */
Q-01: (Florian Weimer [20], Andrew Morton [21])
      What happens when the target process has exited?
A-01: Sending the signal will fail with ESRCH (cf. [22]).

Q-02:  (Andrew Morton [21])
       Is the task_struct pinned by the fd?
A-02:  No. A reference to struct pid is kept. struct pid - as far as I
       understand - was created exactly for the reason to not require to
       pin struct task_struct (cf. [22]).

Q-03: (Andrew Morton [21])
      Does the entire procfs directory remain visible? Just one entry
      within it?
A-03: The same thing that happens right now when you hold a file descriptor
      to /proc/<pid> open (cf. [22]).

Q-04: (Andrew Morton [21])
      Does the pid remain reserved?
A-04: No. This patchset guarantees a stable handle not that pids are not
      recycled (cf. [22]).

Q-05: (Andrew Morton [21])
      Do attempts to signal that fd return errors?
A-05: See {Q,A}-01.

Q-06: (Andrew Morton [22])
      Is there a cleaner way of obtaining the fd? Another syscall perhaps.
A-06: Userspace can already trivially retrieve file descriptors from procfs
      so this is something that we will need to support anyway. Hence,
      there's no immediate need to add another syscalls just to make
      pidfd_send_signal() not dependent on the presence of procfs. However,
      adding a syscalls to get such file descriptors is planned for a
      future patchset (cf. [22]).

Q-07: (Andrew Morton [21] and others)
      This fd-for-a-process sounds like a handy thing and people may well
      think up other uses for it in the future, probably unrelated to
      signals. Are the code and the interface designed to permit such
      future applications?
A-07: Yes (cf. [22]).

Q-08: (Andrew Morton [21] and others)
      Now I think about it, why a new syscall? This thing is looking
      rather like an ioctl?
A-08: This has been extensively discussed. It was agreed that a syscall is
      preferred for a variety or reasons. Here are just a few taken from
      prior threads. Syscalls are safer than ioctl()s especially when
      signaling to fds. Processes are a core kernel concept so a syscall
      seems more appropriate. The layout of the syscall with its four
      arguments would require the addition of a custom struct for the
      ioctl() thereby causing at least the same amount or even more
      complexity for userspace than a simple syscall. The new syscall will
      replace multiple other pid-based syscalls (see description above).
      The file-descriptors-for-processes concept introduced with this
      syscall will be extended with other syscalls in the future. See also
      [22], [23] and various other threads already linked in here.

Q-09: (Florian Weimer [24])
      What happens if you use the new interface with an O_PATH descriptor?
A-09:
      pidfds opened as O_PATH fds cannot be used to send signals to a
      process (cf. [2]). Signaling processes through pidfds is the
      equivalent of writing to a file. Thus, this is not an operation that
      operates "purely at the file descriptor level" as required by the
      open(2) manpage. See also [4].

/* References */
[1]:  https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
[2]:  https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
[3]:  https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
[4]:  https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
[5]:  https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
[6]:  https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
[7]:  https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
[8]:  https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
[9]:  https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
[11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
[12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
[13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
[14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
[15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
[16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
[17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
[18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
[19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
[20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
[22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
[23]: https://lwn.net/Articles/773459/
[24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Tycho Andersen <tycho@tycho.ws>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Aleksa Sarai <cyphar@cyphar.com>
(cherry picked from commit 3eb39f47934f9d5a3027fe00d906a45fe3a15fad)

Conflicts:
        arch/x86/entry/syscalls/syscall_32.tbl - trivial manual merge
        arch/x86/entry/syscalls/syscall_64.tbl - trivial manual merge
        include/linux/proc_fs.h - trivial manual merge
        include/linux/syscalls.h - trivial manual merge
        include/uapi/asm-generic/unistd.h - trivial manual merge
        kernel/signal.c - struct kernel_siGinfo does not exist in 4.14
        kernel/sys_ni.c - cond_syscall is used instead of COND_SYSCALL
        arch/x86/entry/syscalls/syscall_32.tbl
        arch/x86/entry/syscalls/syscall_64.tbl

(1. manual merges because of 4.14 differences
 2. change prepare_kill_siGinfo() to use struct siGinfo instead of
kernel_siGinfo
 3. use copy_from_user() instead of copy_siGinfo_from_user() in copy_siGinfo_from_user_any()
 4. replaced COND_SYSCALL with cond_syscall
 5. Removed __ia32_sys_pidfd_send_signal in arch/x86/entry/syscalls/syscall_32.tbl.
 6. Replaced __x64_sys_pidfd_send_signal with sys_pidfd_send_signal in arch/x86/entry/syscalls/syscall_64.tbl.)

Mot-CRs-fixed: (CR)

Bug: 135608568
Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL
Change-Id: I34da11c63ac8cafb0353d9af24c820cef519ec27
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-on: https://gerrit.mot.com/1505847
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Wang Wang <wangwang1@mt.com>
Reviewed-by: Yonghui Jia <jiayh2@motorola.com>
Submit-Approved: Jira Key
Reviewed-on: https://gerrit.mot.com/1796155
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
3 years agoinet: switch IP ID generator to siphash
Eric Dumazet [Wed, 27 Mar 2019 19:40:33 +0000 (12:40 -0700)]
inet: switch IP ID generator to siphash

[ Upstream commit df453700e8d81b1bdafdf684365ee2b9431fb702 ]

According to Amit Klein and Benny Pinkas, IP ID generation is too weak
and might be used by attackers.

Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix())
having 64bit key and Jenkins hash is risky.

It is time to switch to siphash and its 128bit keys.

Mot-CRs-fixed: (CR)
CVE-Fixed: CVE-2019-18282
Bug: 148588557

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jignesh Patel <jignesh@motorola.com>
Change-Id: I9593781a735940aaedf8e6b38fef02b48169bd12
Reviewed-on: https://gerrit.mot.com/1572721
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agoANDROID: fix binder change in merge of 4.9.188
Todd Kjos [Thu, 23 Jan 2020 05:14:53 +0000 (10:44 +0530)]
ANDROID: fix binder change in merge of 4.9.188

The 4.9.188 merge was missing the change to the
binder driver associated with the linux-4.9.y
commit 16903f1a5ba7 ("coredump: fix race condition
between mmget_not_zero()/get_task_mm() and core dumping").
It was left out because the android-4.9 binder
driver has been significantly refactored compared
to linux-4.9.y.

This patch applies the missing change from that
patch to the binder driver.

Mot-CRs-fixed: (CR)
CVE-Fixed: CVE-2019-11599
BUG: 131964235

Change-Id: I1402cf3c28f1336da9d942abeb322f71a9b8138b
Signed-off-by: Pachipulusu Bhanu Prakash <bhprakas@motorola.com>
Reviewed-on: https://gerrit.mot.com/1473937
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agoRevert "(CR) arm64: dts: Keep VCCQ power when S2R mode for Sandisk UFS"
Dingwei Luo [Tue, 17 Dec 2019 02:55:38 +0000 (20:55 -0600)]
Revert "(CR) arm64: dts: Keep VCCQ power when S2R mode for Sandisk UFS"

The changed will be effect the APM current, the current will increase about 6mA.

This reverts commit 7eb50b69c6197393012ccc02b97c8ad54c8f6ba3.

Change-Id: I27586d466caa39a9632136b87734fc9447090092
Reviewed-on: https://gerrit.mot.com/1473008
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agoarm64: dts: Keep VCCQ power when S2R mode for Sandisk UFS
luodw1 [Thu, 12 Dec 2019 05:51:06 +0000 (13:51 +0800)]
arm64: dts: Keep VCCQ power when S2R mode for Sandisk UFS

Change-Id: Ib55c4b86a0608ec3e436dd2b8ae36cb1fd44287e
Reviewed-on: https://gerrit.mot.com/1470812
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agousb:Restore linked_func list in none-secure mode
a17671 [Mon, 18 Nov 2019 07:56:16 +0000 (15:56 +0800)]
usb:Restore linked_func list in none-secure mode

In none-secure mode, there is still
A low chance of gadget NULL case,
Restore the binded functions list back to
Linked_func list, so next time unbind functions
Could be handled correctly without memory corruption
This is a Samsung platform only issue

Change-Id: Ie46fc52d3eaa6ef60c1a4f6bb83a56229Montana854d
Signed-off-by: a17671 <a17671@motorola.com>
Reviewed-on: https://gerrit.mot.com/1456923
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agoRevert "(CR) psi:kernel:enable PSI configuration"
wangwang [Thu, 14 Nov 2019 09:05:25 +0000 (17:05 +0800)]
Revert "(CR) psi:kernel:enable PSI configuration"

This reverts commit 9ea0893bec0beb7328429c222097427d33981714.

Conflicts:
arch/arm64/configs/ext_config/moto-erd9610.config

Change-Id: I8e2c88a7b2c932fa877416564d5cbf294afe0d5a
Reviewed-on: https://gerrit.mot.com/1455315
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agoarm64/defconfig: define CONFIG_LOCALVERSION to user
zhaoxp3 [Wed, 13 Nov 2019 09:49:56 +0000 (17:49 +0800)]
arm64/defconfig: define CONFIG_LOCALVERSION to user

add user defconfig
Change-Id: Ib8113f551270eee70178e9638dabeb8083e5b675
Signed-off-by: zhaoxp3 <zhaoxp3@motorola.com>
Reviewed-on: https://gerrit.mot.com/1453937
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Submit-Approved: Jira Key

3 years agoEnable process reclaim
huangzq2 [Tue, 12 Nov 2019 07:28:15 +0000 (15:28 +0800)]
Enable process reclaim

Change-Id: Icda8271812c13fa2e4677e42ec38f8a52dd50721
Signed-off-by: huangzq2 <huangzq2@motorola.com>
Reviewed-on: https://gerrit.mot.com/1453732
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agomm: mm_event: remove get/put_online_cpus call
Martin Liu [Mon, 6 May 2019 16:57:20 +0000 (00:57 +0800)]
mm: mm_event: remove get/put_online_cpus call

remove get/put_online_cpus call since it could cause
deadlock in cpu hotplug path. This might cause race
but should be rare and we should be able to correct
that with the next dump.

=======================================================
    Task name: Binder:897_2 pid: 3255 cpu: 0 start: 0xffffffd39e2f5700
    state: 0x2 exit_state: 0x0 stack base: 0xffffff80241f0000 Prio: 116
    Stack:
    [<ffffff9048f3163c>] __switch_to.cfi+0x138
    [<ffffff904a947808>] __schedule+0xb7c
    [<ffffff904a94dbdc>] rwsem_down_read_failed.cfi+0x270
    [<ffffff904900a0a4>] __percpu_down_read.cfi+0x164
    [<ffffff90491bc4e8>] record_stat+0x6c0
    [<ffffff90491bbdfc>] mm_event_end.cfi+0x14c
    [<ffffff904916c280>] try_to_free_pages.cfi+0xaf4
    [<ffffff904914a598>] __alloc_pages_nodemask.cfi+0x9c8
    [<ffffff9049aa6350>] zcomp_cpu_up_prepare.cfi+0x88
    [<ffffff9048f66da8>] cpuhp_invoke_callback+0x378
    [<ffffff9048f664e0>] _cpu_up+0x1bc
    [<ffffff9048f6aadc>] enable_nonboot_cpus.cfi+0x208
    [<ffffff90490135f4>] suspend_devices_and_enter.cfi+0xc20
    [<ffffff9049012844>] pm_suspend.cfi+0xb30
    [<ffffff9049010568>] state_store.cfi+0x94
    [<ffffff904a9339dc>] kobj_attr_store.cfi+0x34
    [<ffffff90492ed9ec>] sysfs_kf_write.cfi+0x64
    [<ffffff90492eb51c>] kernfs_fop_write.cfi+0x1a4
    [<ffffff90491fbf34>] __vfs_write.cfi+0x50
    [<ffffff90491fbd58>] vfs_write.cfi+0xcc
    [<ffffff90491fefd4>] SyS_write.cfi+0xa4
    [<ffffff9048e84080>] el0_svc_naked+0x34

Test: manual suspend/resume test

Mot-CRs-fixed: (CR)

Bug: 132011965
Change-Id: I112ca0d25e825bb4e0e8979d9b4f1d8e6090147f
Signed-off-by: Martin Liu <liumartin@google.com>
(cherry picked from commit 5f419093ab253702847a6b3a8417e47c2acfb652)
Reviewed-on: https://gerrit.mot.com/1453731
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agomm: mm_event: fix compact_scan
Minchan Kim [Mon, 28 Jan 2019 11:00:33 +0000 (20:00 +0900)]
mm: mm_event: fix compact_scan

It fixes double counting of COMPACTFREE_SCANNED.

Mot-CRs-fixed: (CR)

Bug: 80168800
Change-Id: I38ef432ecf44ba94988f5a4ec9c69bcb5d20fdce
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/1453730
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agomm: synchronize period update interval
Minchan Kim [Thu, 17 Jan 2019 02:14:14 +0000 (11:14 +0900)]
mm: synchronize period update interval

Wei pointed out period update is racy so it could make partial
update, which could lose a ton of trace potentially.

To close period_ms race between updating and reading, use rwlock
to reduce contention.
To close vmstat_period_ms between updating and reading,
use vmstat_lock.

This patch has small refactoring, too.

Mot-CRs-fixed: (CR)

Bug: 80168800
Change-Id: I7f84cff758b533b7881f47889c7662b743bc3c12
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/1453729
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agomm: mm_event supports vmstat
Minchan Kim [Tue, 15 Jan 2019 04:54:07 +0000 (13:54 +0900)]
mm: mm_event supports vmstat

Vmstat is significantly important to investigate MM problem.
We have solved many problmes with it via asking users to get
vmstat data periodically from the device, which manual way is
painful once we release the device or on hard reproducible
scenario.

This patch adds periodic vmstat dump into mm_event. It works
only if there are some events in compaction or reclaim. Thus,
unless there is memory pressure, it doesn't gather any vmstat
data. Default interval between each dump is 1000ms.
Admin can tweak it via

echo 2000 > /sys/kernel/debug/mm_event/vmstat_period_ms

Mot-CRs-fixed: (CR)

Bug: 80168800
Change-Id: I4c0e7237d7764c4ea79da00952e5de34ccbe4187
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/1453728
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agomm: per-process reclaim
Minchan Kim [Fri, 9 Jan 2015 13:06:55 +0000 (18:36 +0530)]
mm: per-process reclaim

These day, there are many platforms available in the embedded market
and they are smarter than kernel which has very limited information
about working set so they want to involve memory management more heavily
like android's lowmemory killer and ashmem or recent many lowmemory
notifier.

One of the simple imaGine scenario about userspace's intelligence is that
platform can manage tasks as forground and background so it would be
better to reclaim background's task pages for end-user's *responsibility*
although it has frequent referenced pages.

This patch adds new knob "reclaim under proc/<pid>/" so task manager
can reclaim any target process anytime, anywhere. It could give another
method to platform for using memory efficiently.

It can avoid process killing for getting free memory, which was really
terrible experience because I lost my best score of game I had ever
after I switch the phone call while I enjoyed the game.

Reclaim file-backed pages only.
echo file > /proc/PID/reclaim
Reclaim anonymous pages only.
echo anon > /proc/PID/reclaim
Reclaim all pages
echo all > /proc/PID/reclaim

Mot-CRs-fixed: (CR)

Bug: 122047783
Change-Id: I2f629f7a43289af114df27044b1d2af4a6e785bc
Signed-off-by: Tim Murray <timmurray@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-on: https://gerrit.mot.com/1453727
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agomm: emit tracepoint when rss watermark is hit
Joel Fernandes [Sat, 5 May 2018 21:58:08 +0000 (14:58 -0700)]
mm: emit tracepoint when rss watermark is hit

Useful to track how rss is chanGing per tgid. Required for the
memory visibility work being done for Android.

OriGinal patch by Tim Murray:
https://partner-android-review.googlesource.com/c/kernel/private/msm-google/+/1081280

Changes from oriGinal patch:
- don't bloat mm_struct
- add some noise reduction to rss tracking

Mot-CRs-fixed: (CR)

Change-Id: Ief904334235ff4380244e5803d7853579e70d202
Signed-off-by: Joel Fernandes <joelaf@google.com>
Reviewed-on: https://gerrit.mot.com/1453726
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
SLTApproved: Slta Waiver
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agomm: mm_event: add read io stat
Minchan Kim [Sun, 12 Aug 2018 23:15:32 +0000 (08:15 +0900)]
mm: mm_event: add read io stat

Read IO's latency as well as filemap fault could affect system
performance so this patch keeps track it on.

Mot-CRs-fixed: (CR)

Bug: 80168800
Change-Id: I761b7110339cf1e5ef24530ad32aedd784d00d07
Signed-off-by: Minchan Kim <minchan@google.com>
Signed-off-by: Cho KyongHo <pullip.cho@samsung.com>
Reviewed-on: https://gerrit.mot.com/1453725
Tested-by: Jira Key
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agomm: mm_event: add special kernel allocation stat
Minchan Kim [Mon, 6 Aug 2018 06:12:44 +0000 (15:12 +0900)]
mm: mm_event: add special kernel allocation stat

Record the count of special page allocation on the process context.

This patch aims for accounting of special page allocation which
consumed a lot by android system.
At this moment, ION system heap is good candidate(it could cover
other kernel allocation in future).
With that, we could keep tracking burst kernel allocation owner
so that it would be useful to find places caused by lmk, reclaim,
compaction latency.

Mot-CRs-fixed: (CR)

Bug: 80168800
Change-Id: I5942fd940d98baa2eb814f66b076cb37ecd3b4aa
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/1453724
Tested-by: Jira Key
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agomm: mm_event: add swapin stat
Minchan Kim [Mon, 6 Aug 2018 06:07:57 +0000 (15:07 +0900)]
mm: mm_event: add swapin stat

Many embedded devices use zram as swap. Compared to storage swap
(e.g. UFS), swapin from zram(ie., decompression) is extremly fast
so it might be not major fault but minor. So this patch provides
swapin latency tracking to distinguish them from storage major
fault.

Mot-CRs-fixed: (CR)

Bug: 80168800
Change-Id: I1c32430e32a051916ede5219bd5f40a9002652bc
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/1453723
Tested-by: Jira Key
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agomm: mm_event: add compaction stat
Minchan Kim [Wed, 27 Jun 2018 13:04:37 +0000 (22:04 +0900)]
mm: mm_event: add compaction stat

This patch adds compaction mm_event stat so that we could keep track
latency of compaction as well as count of the event.

Under heavy memory fragmentation, high-order page allocation(e.g.
fork, ION memory allocation) triggers compaction, which is
another major part of latency. Let's track it down, too.

Mot-CRs-fixed: (CR)

Bug: 80168800
Change-Id: Ia3da9324f123ba2542863eafaf72024b5351785b
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/1453722
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agomm: mm_event: add reclaim stat
Minchan Kim [Wed, 27 Jun 2018 13:04:07 +0000 (22:04 +0900)]
mm: mm_event: add reclaim stat

This patch adds page reclaim mm_event stat so that we could
keep tracking [avg|max]_latency for the handling the event
as well as count of the event.

Direct reclaim latency is usually a most popular latency source
caused by memory pressure so we need to track it down to hunt
down application's jank problem.

Mot-CRs-fixed: (CR)

Bug: 80168800
Change-Id: I215c3972f76389404da7c4806a776bf753daac01
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/1453721
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agomm: mm_event: add page fault stat
Minchan Kim [Wed, 27 Jun 2018 12:54:47 +0000 (21:54 +0900)]
mm: mm_event: add page fault stat

This patch add major and minor fault mm_event stat so that we could
keep tracking [avg|max]_latency for the handling the event
as well as count of the event.

With major fault, we could see how long the IO is delayed. It's very
tightly coupled with application's latency.

With major+minor fault, we could see how many of pages are allocated
for the process in the period. It would help to see memory spike.

Mot-CRs-fixed: (CR)

Bug: 80168800
Change-Id: I8a4434493e3ec291227961939a24c3d57a18fd5b
Signed-off-by: Minchan Kim <minchan@google.com>
Signed-off-by: Cho KyongHo <pullip.cho@samsung.com>
Reviewed-on: https://gerrit.mot.com/1453720
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agomm: mm_event: make capture period configurable
Minchan Kim [Mon, 6 Aug 2018 06:02:06 +0000 (15:02 +0900)]
mm: mm_event: make capture period configurable

This patch makes per-process mm event capture inteval configurable.
Default is 500ms but admin can change it by below knob.

/sys/kernel/debug/mm_event/period_ms

The unit is millisecond.

Mot-CRs-fixed: (CR)

Bug: 80168800
Change-Id: I3b2de3dd5c4a519a2e5e20f1ef0d5f9a4c7afc8a
Signed-off-by: Minchan Kim <minchan@google.com>
Reviewed-on: https://gerrit.mot.com/1453719
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agomm: introduce per-process mm event tracking feature
Minchan Kim [Mon, 6 Aug 2018 06:00:19 +0000 (15:00 +0900)]
mm: introduce per-process mm event tracking feature

Linux supports /proc/meminfo and /proc/vmstat stats as memory health metric.
Android uses them too. If user see something goes wrong(e.g., sluggish, jank)
on their system, they can capture and report system state to developers
for debugGing.

It shows memory stat at the moment the bug is captured. However, it’s
not enough to investigate application's jank problem caused by memory
shortage. Because

1. It just shows event count which doesn’t quantify the latency of the
application well. Jank could happen by various reasons and one of simple
scenario is frame drop for a second. App should draw the frame every 16ms
interval. Just number of stats(e.g., allocstall or pgmajfault) couldn't
represnt how many of time the app spends for handling the event.

2. At bugreport, dump with vmstat and meminfo is never helpful because it's
too late to capture the moment when the problem happens.
When the user catch up the problem and try to capture the system state,
the problem has already gone.

3. Although we could capture MM stat at the moment bug happens, it couldn't
be helpful because MM stats are usually very flucuate so we need historical
data rather than one-time snapshot to see MM trend.

To solve above problems, this patch introduces per-process, light-weight,
mm event stat. Basically, it tracks minor/major faults, reclaim and compaction
latency of each process as well as event count and record the data into global
buffer.
To compromise memory overhead, it doesn't record every MM event of the process
to the buffer but just drain accumuated stats every 0.5sec interval to buffer.
If there isn't any event, it just skips the recording.
For latency data, it keeps average/max latency of each event in that period

With that, we could keep useful information with small buffer so that
we couldn't miss precious information any longer although the capture time
is rather late. This patch introduces basic facility of MM event stat.

After all patches in this patchset are applied, outout format is as follows,
dumpstate can use it for VM debugGing in future.

<...>-1665  [001] d...   217.575173: mm_event_record: min_flt count=203 avg_lat=3 max_lat=58
<...>-1665  [001] d...   217.575183: mm_event_record: maj_flt count=1 avg_lat=1994 max_lat=1994
<...>-1665  [001] d...   217.575184: mm_event_record: kern_alloc count=227 avg_lat=0 max_lat=0
<...>-626   [000] d...   217.578096: mm_event_record: kern_alloc count=4 avg_lat=0 max_lat=0
<...>-6547  [000] ....   217.581913: mm_event_record: min_flt count=7 avg_lat=7 max_lat=20
<...>-6547  [000] ....   217.581955: mm_event_record: kern_alloc count=4 avg_lat=0 max_lat=0

This feature uses event trace for output buffer so that we could use all of
general benefit of event trace(e.g., buffer size management, filtering and
so on). To prevent overflow of the ring buffer by other random event race,
highly suggest that create separate instance of tracing
on /sys/kernel/debug/tracing/instances/

I had a concern of adding overhead. Actually, major|compaction/reclaim
are already heavy cost so it should be not a concern. Rather than,
minor fault and kern alloc would be severe so I tested a micro benchmark
to measure minor page fault overhead.

Test scenario is create 40 threads and each of them does minor
page fault for 25M range(ranges are not overwrapped).
I didn't see any noticible regression.

Base:
fault/wsec avg: 758489.8288

minor faults=13123118, major faults=0 ctx switch=139234
    User   System     Wall        fault/wsec
  39.55s   41.73s   17.49s        749995.768
minor faults=13123135, major faults=0 ctx switch=139627
    User   System     Wall        fault/wsec
  34.59s   41.61s   16.95s        773906.976
minor faults=13123061, major faults=0 ctx switch=139254
    User   System     Wall        fault/wsec
  39.03s   41.55s   16.97s        772966.334
minor faults=13123131, major faults=0 ctx switch=139970
    User   System     Wall        fault/wsec
  36.71s   42.12s   17.04s        769941.019
minor faults=13123027, major faults=0 ctx switch=138524
    User   System     Wall        fault/wsec
  42.08s   42.24s   18.08s        725639.047

Base + MM event + event trace enable:
fault/wsec avg: 759626.1488

minor faults=13123488, major faults=0 ctx switch=140303
    User   System     Wall        fault/wsec
  37.66s   42.21s   17.48s        750414.257
minor faults=13123066, major faults=0 ctx switch=138119
    User   System     Wall        fault/wsec
  36.77s   42.14s   17.49s        750010.107
minor faults=13123505, major faults=0 ctx switch=140021
    User   System     Wall        fault/wsec
  38.51s   42.50s   17.54s        748022.219
minor faults=13123431, major faults=0 ctx switch=138517
    User   System     Wall        fault/wsec
  36.74s   41.49s   17.03s        770255.610
minor faults=13122955, major faults=0 ctx switch=137174
    User   System     Wall        fault/wsec
  40.68s   40.97s   16.83s        779428.551

Mot-CRs-fixed: (CR)

Bug: 80168800
Change-Id: I4e69c994f47402766481c58ab5ec2071180964b8
Signed-off-by: Minchan Kim <minchan@google.com>
Signed-off-by: Cho KyongHo <pullip.cho@samsung.com>
Reviewed-on: https://gerrit.mot.com/1453718
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agopsi:kernel:enable PSI configuration
wangwang [Wed, 13 Nov 2019 06:26:15 +0000 (14:26 +0800)]
psi:kernel:enable PSI configuration

support Google PSI memory management in lmkd

Change-Id: I437daa54c55c4caa9d8a67ba5bf7ac529d61da87
Reviewed-on: https://gerrit.mot.com/1453674
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Submit-Approved: Jira Key

3 years agopsi:kernel:oom reaper porting into samsung platform
wangwang [Wed, 13 Nov 2019 06:04:33 +0000 (14:04 +0800)]
psi:kernel:oom reaper porting into samsung platform

reaper can help to reclaim the memory in time, the knob will be set to true
when init parses the init.rc conf file.

Change-Id: I59f1173c0e46202904da6eeacb2fecc32c53232c

3 years agoBACKPORT: kernel: cgroup: add poll file operation
Johannes Weiner [Wed, 30 Jan 2019 23:41:54 +0000 (10:41 +1100)]
BACKPORT: kernel: cgroup: add poll file operation

Cgroup has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes.  To allow polling for
custom events, add a .poll callback that can override the default.

This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.

Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
(cherry picked from commit: dc50537bdd1a0804fa2cbc990565ee9a944e66fa)

Conflicts:
        include/linux/cgroup-defs.h
        kernel/cgroup.c

(1. replaced __poll_t with unsigned int)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I21aff1d9d31e3d4b45e257aa4d299405a2ce6de3
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoFROMLIST: psi: introduce psi monitor
Suren Baghdasaryan [Tue, 4 Dec 2018 01:36:42 +0000 (17:36 -0800)]
FROMLIST: psi: introduce psi monitor

Psi monitor aims to provide a low-latency short-term pressure
detection mechanism configurable by users. It allows users to
monitor psi metrics growth and trigger events whenever a metric
raises above user-defined threshold within user-defined time window.

Time window and threshold are both expressed in usecs. Multiple psi
resources with different thresholds and window sizes can be monitored
concurrently.

Psi monitors activate when system enters stall state for the monitored
psi metric and deactivate upon exit from the stall state. While system
is in the stall state psi signal growth is monitored at a rate of 10 times
per tracking window. Min window size is 500ms, therefore the min monitoring
interval is 50ms. Max window size is 10s with monitoring interval of 1s.

When activated psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when psi
signal is bouncing.

Notifications to the users are rate-limited to one per tracking window.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052418/)

Conflicts:
        include/linux/psi.h
        kernel/cgroup/cgroup.c
        kernel/sched/psi.c

(1. replaced __poll_t with unsigned int
2. replaced EPOLLERR/EPOLLPRI with POLLERR/POLLPRI (values are the same)
3. include <linux/cgroup-defs.h> in include/linux/psi.h
4. include <uapi/linux/sched/types.h> in kernel/sched/psi.c)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I1688f047e98e1f109627dad72a33d2f70e575268
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoFROMLIST: refactor header includes to allow kthread.h inclusion in psi_types.h
Suren Baghdasaryan [Sun, 17 Feb 2019 23:07:38 +0000 (15:07 -0800)]
FROMLIST: refactor header includes to allow kthread.h inclusion in psi_types.h

kthread.h can't be included in psi_types.h because it creates a circular
inclusion with kthread.h eventually including psi_types.h and complaining
on kthread structures not being defined because they are defined further
in the kthread.h. Resolve this by removing psi_types.h inclusion from the
headers included from kthread.h.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052417/)

Conflicts:
        include/linux/kthread.h
        kernel/kthread.c

(1. <linux/cgroup.h> include is already missing in kthread.h
2. <linux/cgroup.h> is already included in kthread.c)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I07c1f4fddf0c43b3095f505e062d9d179d041544
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoFROMLIST: psi: track changed states
Suren Baghdasaryan [Wed, 6 Mar 2019 18:25:50 +0000 (10:25 -0800)]
FROMLIST: psi: track changed states

Introduce changed_states parameter into collect_percpu_times to track
the states changed since the last update.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052420/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: Idb2f7d73013bff16bb101b62a2609917a5353bf9
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoFROMLIST: psi: split update_stats into parts
Suren Baghdasaryan [Wed, 6 Mar 2019 17:52:23 +0000 (09:52 -0800)]
FROMLIST: psi: split update_stats into parts

Split update_stats into collect_percpu_times and update_averages for
collect_percpu_times to be reused later inside psi monitor.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052419/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: Ic5dca1924a3f8997b49b5d16289f53bcc43b88fa
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoFROMLIST: psi: rename psi fields in preparation for psi trigger addition
Suren Baghdasaryan [Wed, 6 Mar 2019 17:21:03 +0000 (09:21 -0800)]
FROMLIST: psi: rename psi fields in preparation for psi trigger addition

Renaming psi_group structure member fields used for calculating psi totals
and averages for clear distinction between them and trigger-related fields
that will be added next.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052416/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I7aaadfc558950b54b02a051d63e508e8fe233b49
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoFROMLIST: psi: make psi_enable static
Suren Baghdasaryan [Wed, 30 Jan 2019 23:41:54 +0000 (10:41 +1100)]
FROMLIST: psi: make psi_enable static

psi_enable is not used outside of psi.c, make it static.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052415/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I3c422d6c0c4299095c6ba05cfe942a2b00705f29
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoFROMLIST: psi: introduce state_mask to represent stalled psi states
Suren Baghdasaryan [Wed, 30 Jan 2019 23:41:54 +0000 (10:41 +1100)]
FROMLIST: psi: introduce state_mask to represent stalled psi states

The psi monitoring patches will need to determine the same states as
record_times().  To avoid calculating them twice, maintain a state mask
that can be consulted cheaply.  Do this in a separate patch to keep the
churn in the main feature patch at a minimum.

This adds 4-byte state_mask member into psi_group_cpu struct which results
in its first cacheline-aligned part becoming 52 bytes long.  Add explicit
values to enumeration element counters that affect psi_group_cpu struct
size.

Link: http://lkml.kernel.org/r/20190124211518.244221-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052414/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I7807b687e2a5d78aed44c5e33be1621aa11451cb
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoBACKPORT: fs: kernfs: add poll file operation
Johannes Weiner [Wed, 30 Jan 2019 23:41:54 +0000 (10:41 +1100)]
BACKPORT: fs: kernfs: add poll file operation

Patch series "psi: pressure stall monitors", v3.

Android is adopting psi to detect and remedy memory pressure that results
in stuttering and decreased responsiveness on mobile devices.

Psi gives us the stall information, but because we're dealing with
latencies in the millisecond range, periodically reading the pressure
files to detect stalls in a timely fashion is not feasible.  Psi also
doesn't aggregate its averages at a high enough frequency right now.

This patch series extends the psi interface such that users can configure
sensitive latency thresholds and use poll() and friends to be notified
when these are breached.

As high-frequency aggregation is costly, it implements an aggregation
method that is optimized for fast, short-interval averaGing, and makes the
aggregation frequency adaptive, such that high-frequency updates only
happen while monitored stall events are actively occurring.

With these patches applied, Android can monitor for, and ward off,
mounting memory shortages before they cause problems for the user.  For
example, using memory stall monitors in userspace low memory killer daemon
(lmkd) we can detect mounting pressure and kill less important processes
before device becomes visibly sluggish.  In our memory stress testing psi
memory monitors produce roughly 10x less false positives compared to
vmpressure signals.  Having ability to specify multiple triggers for the
same psi metric allows other parts of Android framework to monitor memory
state of the device and act accordingly.

The new interface is straightforward.  The user opens one of the pressure
files for writing and writes a trigger description into the file
descriptor that defines the stall state - some or full, and the maximum
stall time over a given window of time.  E.g.:

        /* Signal when stall time exceeds 100ms of a 1s window */
        char trigger[] = "full 100000 1000000";
        fd = open("/proc/pressure/memory");
        write(fd, trigger, sizeof(trigger));
        while (poll() >= 0) {
                ...
        }
        close(fd);

When the monitored stall state is entered, psi adapts its aggregation
frequency according to what the configured time window requires in order
to emit event signals in a timely fashion.  Once the stalling subsides,
aggregation reverts back to normal.

The trigger is associated with the open file descriptor.  To stop
monitoring, the user only needs to close the file descriptor and the
trigger is discarded.

Patches 1-4 prepare the psi code for polling support.  Patch 5 implements
the adaptive polling logic, the pressure growth detection optimized for
short intervals, and hooks up write() and poll() on the pressure files.

The patches were developed in collaboration with Johannes Weiner.

This patch (of 5):

Kernfs has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes.  To allow polling for
custom events, add a .poll callback that can override the default.

This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.

Link: http://lkml.kernel.org/r/20190124211518.244221-2-surenb@google.com
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
(cherry picked from commit: 147e1a97c4a0bdd43f55a582a9416bb9092563a9)

Conflicts:
        fs/kernfs/file.c
        include/linux/kernfs.h

1. replaced __poll_t with unsigned int.
2. replaced EPOLLERR/EPOLLPRI with POLLERR/POLLPRI (values are the same)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Ic2bed334d05aec62f4e695f263893c3057921c55
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoUPSTREAM: psi: avoid divide-by-zero crash inside virtual machines
Johannes Weiner [Thu, 21 Feb 2019 06:19:59 +0000 (22:19 -0800)]
UPSTREAM: psi: avoid divide-by-zero crash inside virtual machines

We've been seeing hard-to-trigger psi crashes when running inside VM
instances:

    divide error: 0000 [#1] SMP PTI
    Modules linked in: [...]
    CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: events psi_clock
    RIP: 0010:psi_update_stats+0x270/0x490
    RSP: 0018:ffffc90001117e10 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8
    RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000
    RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502
    FS:  0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     psi_clock+0x12/0x50
     process_one_work+0x1e0/0x390
     worker_thread+0x2b/0x3c0
     ? rescuer_thread+0x330/0x330
     kthread+0x113/0x130
     ? kthread_create_worker_on_cpu+0x40/0x40
     ? SyS_exit_group+0x10/0x10
     ret_from_fork+0x35/0x40
    Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48

The Code-line points to `period` being 0 inside update_stats(), and we
divide by that when calculating that period's pressure percentage.

The elapsed period should never be 0.  The reason this can happen is due
to an off-by-one in the idle time / missing period calculation combined
with a coarse sched_clock() in the virtual machine.

The target time for aggregation is advanced into the future on a fixed
grid to prevent clock drift.  So when an aggregation runs after some idle
period, we can not just set it to "now + psi_period", but have to
calculate the downtime and advance the target time relative to itself.

However, if the aggregator was disabled exactly one psi_period (ns), we
drop one idle period in the calculation due to a > when we should do >=.
In that case, next_update will be advanced from 'now - psi_period' to
'now' when it should be moved to 'now + psi_period'.  The run finishes
with last_update == next_update == sched_clock().

With hardware clocks, this exact nanosecond match isn't likely in the
first place; but if it does happen, the clock will still have moved on and
the period non-zero by the time the worker runs.  A pointlessly short
period, but besides the extra work, no harm no foul.  However, a slow
sched_clock() like we have on VMs might not have advanced either by the
time the worker runs again.  And when we calculate the elapsed period, the
result, our pressure divisor, will be 0.  Ouch.

Fix this by correctly handling the situation when the elapsed time between
aggregation runs is precisely two periods, and advance the expiration
timestamp correctly to period into the future.

Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Ɓukasz Siudut <lsiudut@fb.com
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 4e37504d1c49eec6434d0cc97278d2b51c9e8763)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I40917c84354f9f32259c6703f00b6b1d21f45f02
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoUPSTREAM: psi: clarify the Kconfig text for the default-disable option
Johannes Weiner [Fri, 1 Feb 2019 22:21:15 +0000 (14:21 -0800)]
UPSTREAM: psi: clarify the Kconfig text for the default-disable option

The current help text caused some confusion in online forums about
whether or not to default-enable or default-disable psi in vendor
kernels.  This is because it doesn't communicate the reason for why we
made this setting configurable in the first place: that the overhead is
non-zero in an artificial scheduler stress test.

Since this isn't representative of real workloads, and the effect was
not measurable in scheduler-heavy real world applications such as the
webservers and memcache installations at Facebook, it's fair to point
out that this is a pretty cautious option to select.

Link: http://lkml.kernel.org/r/20190129233617.16767-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 7b2489d37e1e355228f7c55724f77580e1dec22a)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I5d0cb901562fd74c82d9d211544745b802776d8a
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoUPSTREAM: psi: fix aggregation idle shut-off
Johannes Weiner [Fri, 1 Feb 2019 22:20:42 +0000 (14:20 -0800)]
UPSTREAM: psi: fix aggregation idle shut-off

psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating.  However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.

DebugGing this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again.  This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)

Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep.  To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.

What if the worker is also executing other items before or after?

Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself.  If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.

If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity.  But that
should not be a problem.  The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation.  If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.

Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1b69ac6b40ebd85eed73e4dbccde2a36961ab990)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I2877fec3d381b1006b8bd1261895fdfd68bd21db
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoUPSTREAM: psi: fix reference to kernel commandline enable
Baruch Siach [Fri, 14 Dec 2018 22:17:03 +0000 (14:17 -0800)]
UPSTREAM: psi: fix reference to kernel commandline enable

The kernel commandline parameter named in CONFIG_PSI_DEFAULT_DISABLED
help text contradicts the documentation in kernel-parameters.txt, and
the code.  Fix that.

Link: http://lkml.kernel.org/r/20181203213416.GA12627@cmpxchg.org
Fixes: e0c274472d ("psi: make disabling/enabling easier for vendor kernels")
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 428a1cb4baeb9e5c7feda93af7372ba6d2491558)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I592b66d6542f4fa7c2b6eb9f60a5dd43bcfbabf3
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoUPSTREAM: psi: make disabling/enabling easier for vendor kernels
Johannes Weiner [Fri, 30 Nov 2018 22:09:58 +0000 (14:09 -0800)]
UPSTREAM: psi: make disabling/enabling easier for vendor kernels

Mel Gorman reports a hackbench regression with psi that would prohibit
shipping the suse kernel with it default-enabled, but he'd still like
users to be able to opt in at little to no cost to others.

With the current combination of CONFIG_PSI and the psi_disabled bool set
from the commandline, this is a challenge.  Do the following things to
make it easier:

1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros
   to enable CONFIG_PSI in their kernel but leave the feature disabled
   unless a user requests it at boot-time.

   To avoid double negatives, rename psi_disabled= to psi=.

2. Make psi_disabled a static branch to eliminate any branch costs
   when the feature is disabled.

In terms of numbers before and after this patch, Mel says:

: The following is a comparision using CONFIG_PSI=n as a baseline against
: your patch and a vanilla kernel
:
:                          4.20.0-rc4             4.20.0-rc4             4.20.0-rc4
:                 kconfigdisable-v1r1                vanilla        psidisable-v1r1
: Amean     1       1.3100 (   0.00%)      1.3923 (  -6.28%)      1.3427 (  -2.49%)
: Amean     3       3.8860 (   0.00%)      4.1230 *  -6.10%*      3.8860 (  -0.00%)
: Amean     5       6.8847 (   0.00%)      8.0390 * -16.77%*      6.7727 (   1.63%)
: Amean     7       9.9310 (   0.00%)     10.8367 *  -9.12%*      9.9910 (  -0.60%)
: Amean     12     16.6577 (   0.00%)     18.2363 *  -9.48%*     17.1083 (  -2.71%)
: Amean     18     26.5133 (   0.00%)     27.8833 *  -5.17%*     25.7663 (   2.82%)
: Amean     24     34.3003 (   0.00%)     34.6830 (  -1.12%)     32.0450 (   6.58%)
: Amean     30     40.0063 (   0.00%)     40.5800 (  -1.43%)     41.5087 (  -3.76%)
: Amean     32     40.1407 (   0.00%)     41.2273 (  -2.71%)     39.9417 (   0.50%)
:
: It's showing that the vanilla kernel takes a hit (as the bisection
: indicated it would) and that disabling PSI by default is reasonably
: close in terms of performance for this particular workload on this
: particular machine so;

Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e0c274472d5d27f277af722e017525e0b33784cd)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I6cb666fa351e8901df82e4d6931bfec0c5ce230d
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoUPSTREAM: kernel/sched/psi.c: simplify cgroup_move_task()
Olof Johansson [Fri, 16 Nov 2018 23:08:00 +0000 (15:08 -0800)]
UPSTREAM: kernel/sched/psi.c: simplify cgroup_move_task()

The existing code triggered an invalid warning about 'rq' possibly being
used uninitialized.  Instead of doing the silly warning suppression by
initializa it to NULL, refactor the code to bail out early instead.

Warning was:

  kernel/sched/psi.c: In function `cgroup_move_task':
  kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]

Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net
Fixes: 2ce7135adc9ad ("psi: cgroup support")
Signed-off-by: Olof Johansson <olof@lixom.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8fcb2312d1e3300e81aa871aad00d4c038cfc184)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id989da224a726082e0cfa5d5d9460bf63d448a93
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoBACKPORT: psi: cgroup support
Johannes Weiner [Fri, 26 Oct 2018 22:06:31 +0000 (15:06 -0700)]
BACKPORT: psi: cgroup support

On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system, but
also that of individual jobs, so that we can ensure individual job health,
fairness between jobs, or prioritize some jobs over others.

This patch implements pressure stall tracking for cgroups.  In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for only
the tasks inside the cgroup.

Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 2ce7135adc9ad081aa3c49744144376ac74fea60)

Conflicts:
        Documentation/cgroup-v2.txt
        include/linux/psi.h
        kernel/cgroup/cgroup.c

(1. manual merge from Documentation/admin-guide/cgroup-v2.rst
2. include <linux/cgroup-defs.h> into include/linux/psi.h
3. manual merge in css_free_work_fn to allow psi support only for cgroup v2
4. manual merge in cgroup_create to allow psi support only for cgroup v2)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I163e6657aaa60aa5aab9372616a3bce2a65e90ec
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
3 years agoUPSTREAM: psi: pressure stall information for CPU, memory, and IO
Johannes Weiner [Fri, 26 Oct 2018 22:06:27 +0000 (15:06 -0700)]
UPSTREAM: psi: pressure stall information for CPU, memory, and IO

When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills.  In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively.  Stall states are aggregate versions of the per-task delay
accounting delays:

       cpu: some tasks are runnable but not executing on a CPU
       memory: tasks are reclaiming, or waiting for swapin or thrashing cache
       io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit.  They can also indicate when the system
is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states.  Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime.  A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).

[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit eb414681d5a07d28d2ff90dc05f69ec6b232ebd2)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id00d23c977169b0c4636d92016fc1fee0274be05
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Conflicts:
kernel/sched/Makefile