From: Minchan Kim
Date: Mon, 6 Aug 2018 06:00:19 +0000 (+0900)
Subject: mm: introduce per-process mm event tracking feature
X-Git-Tag: MMI-QSAS30.62-33-3~49
X-Git-Url: https://git.stricted.de/?a=commitdiff_plain;h=d3c43f5260bb8cbb54ed914d43da19f2d19e30ea;p=GitHub%2FMotorolaMobilityLLC%2Fkernel-slsi.git

mm: introduce per-process mm event tracking feature

Linux exposes /proc/meminfo and /proc/vmstat as memory health metrics,
and Android uses them too. If users see something go wrong (e.g.,
sluggishness, jank) on their system, they can capture the system state
and report it to developers for debugging. The report shows the memory
stats at the moment the bug report is taken.

However, that is not enough to investigate an application's jank caused
by memory shortage, because:

1. It only shows event counts, which do not quantify the latency the
   application experienced. Jank can happen for various reasons; one
   simple scenario is a frame drop. An app should draw a frame every
   16ms, and a bare event count (e.g., allocstall or pgmajfault) cannot
   represent how much time the app spent handling those events.

2. At bug report time, dumping vmstat and meminfo is rarely helpful
   because it is too late to capture the moment the problem happened.
   By the time the user notices the problem and captures the system
   state, the problem is already gone.

3. Even if we could capture MM stats at the exact moment the bug
   happens, a one-time snapshot would be of limited use because MM
   stats usually fluctuate heavily; we need historical data to see the
   MM trend.

To solve the above problems, this patch introduces a per-process,
lightweight MM event stat. It tracks the minor/major fault, reclaim and
compaction latency of each process, as well as the event counts, and
records the data into a global buffer. To limit the memory overhead, it
does not record every MM event of a process into the buffer; instead,
it drains the accumulated stats into the buffer every 0.5 sec. If there
was no event in that interval, the recording is skipped. For latency
data, it keeps the average and maximum latency of each event type in
that interval.

With that, we can keep useful information in a small buffer and no
longer miss precious information, even if the capture time is rather
late.

This patch introduces the basic facility of the MM event stat. After
all patches in this patchset are applied, the output format is as
follows, and dumpstate can use it for VM debugging in the future:

   <...>-1665  [001] d...   217.575173: mm_event_record: min_flt count=203 avg_lat=3 max_lat=58
   <...>-1665  [001] d...   217.575183: mm_event_record: maj_flt count=1 avg_lat=1994 max_lat=1994
   <...>-1665  [001] d...   217.575184: mm_event_record: kern_alloc count=227 avg_lat=0 max_lat=0
    <...>-626  [000] d...   217.578096: mm_event_record: kern_alloc count=4 avg_lat=0 max_lat=0
   <...>-6547  [000] ....   217.581913: mm_event_record: min_flt count=7 avg_lat=7 max_lat=20
   <...>-6547  [000] ....   217.581955: mm_event_record: kern_alloc count=4 avg_lat=0 max_lat=0

This feature uses event tracing for the output buffer, so we get all of
the general benefits of event tracing (e.g., buffer size management,
filtering and so on). To prevent the ring buffer from being overflowed
by unrelated events, it is highly suggested to create a separate
tracing instance under /sys/kernel/debug/tracing/instances/.

I was concerned about the added overhead. The major fault, compaction
and reclaim paths are already expensive, so the cost there is not a
concern. The minor fault and kernel allocation paths, however, are hot,
so I ran a microbenchmark to measure the minor page fault overhead.
The test creates 40 threads, each of which takes minor page faults over
its own 25M range (the ranges do not overlap).
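(Illustration only; the benchmark source is not part of this commit.
The thread count and the per-thread range size come from the
description above; the program structure and the names below are
assumed.) The harness boils down to something like:

#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define NR_THREADS 40
#define RANGE_SIZE (25UL << 20)     /* 25M per thread, ranges not shared */

static void *fault_range(void *arg)
{
    long page_size = sysconf(_SC_PAGESIZE);
    /* Private anonymous mapping: the first touch of each page is a minor fault. */
    char *buf = mmap(NULL, RANGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    (void)arg;
    if (buf == MAP_FAILED)
        return NULL;

    for (unsigned long off = 0; off < RANGE_SIZE; off += page_size)
        buf[off] = 1;               /* touch every page once */

    munmap(buf, RANGE_SIZE);
    return NULL;
}

int main(void)
{
    pthread_t threads[NR_THREADS];
    int i;

    for (i = 0; i < NR_THREADS; i++)
        pthread_create(&threads[i], NULL, fault_range, NULL);
    for (i = 0; i < NR_THREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}

Each thread touches every page of a private anonymous mapping once, so
every touch is a minor fault; the fault counts and timings (fault/wsec)
below were collected externally, e.g., via getrusage() and wall-clock
time.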
I did not see any noticeable regression.

Base: fault/wsec avg: 758489.8288

  minor faults=13123118, major faults=0 ctx switch=139234
  User      System    Wall      fault/wsec
  39.55s    41.73s    17.49s    749995.768

  minor faults=13123135, major faults=0 ctx switch=139627
  User      System    Wall      fault/wsec
  34.59s    41.61s    16.95s    773906.976

  minor faults=13123061, major faults=0 ctx switch=139254
  User      System    Wall      fault/wsec
  39.03s    41.55s    16.97s    772966.334

  minor faults=13123131, major faults=0 ctx switch=139970
  User      System    Wall      fault/wsec
  36.71s    42.12s    17.04s    769941.019

  minor faults=13123027, major faults=0 ctx switch=138524
  User      System    Wall      fault/wsec
  42.08s    42.24s    18.08s    725639.047

Base + MM event + event trace enabled: fault/wsec avg: 759626.1488

  minor faults=13123488, major faults=0 ctx switch=140303
  User      System    Wall      fault/wsec
  37.66s    42.21s    17.48s    750414.257

  minor faults=13123066, major faults=0 ctx switch=138119
  User      System    Wall      fault/wsec
  36.77s    42.14s    17.49s    750010.107

  minor faults=13123505, major faults=0 ctx switch=140021
  User      System    Wall      fault/wsec
  38.51s    42.50s    17.54s    748022.219

  minor faults=13123431, major faults=0 ctx switch=138517
  User      System    Wall      fault/wsec
  36.74s    41.49s    17.03s    770255.610

  minor faults=13122955, major faults=0 ctx switch=137174
  User      System    Wall      fault/wsec
  40.68s    40.97s    16.83s    779428.551

Mot-CRs-fixed: (CR)
Bug: 80168800
Change-Id: I4e69c994f47402766481c58ab5ec2071180964b8
Signed-off-by: Minchan Kim
Signed-off-by: Cho KyongHo
Reviewed-on: https://gerrit.mot.com/1453718
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Xiangpo Zhao
Submit-Approved: Jira Key
---

diff --git a/include/linux/mm_event.h b/include/linux/mm_event.h
new file mode 100644
index 000000000000..5a9069ea0fa7
--- /dev/null
+++ b/include/linux/mm_event.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MM_EVENT_H
+#define _LINUX_MM_EVENT_H
+
+#include <linux/types.h>
+#include <linux/ktime.h>
+
+enum mm_event_type {
+	MM_MIN_FAULT = 0,
+	MM_MAJ_FAULT,
+	MM_COMPACTION,
+	MM_RECLAIM,
+	MM_TYPE_NUM,
+};
+
+struct mm_event_task {
+	unsigned int count;
+	unsigned int max_lat;
+	u64 accm_lat;
+} __attribute__ ((packed));
+
+struct task_struct;
+
+#ifdef CONFIG_MM_EVENT_STAT
+void mm_event_task_init(struct task_struct *tsk);
+void mm_event_start(ktime_t *time);
+void mm_event_end(enum mm_event_type event, ktime_t start);
+#else
+static inline void mm_event_task_init(struct task_struct *tsk) {}
+static inline void mm_event_start(ktime_t *time) {}
+static inline void mm_event_end(enum mm_event_type event, ktime_t start) {}
+#endif /* CONFIG_MM_EVENT_STAT */
+#endif /* _LINUX_MM_EVENT_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e99ac4d74444..afcd95fd05e8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -8,6 +8,7 @@
  */
 
 #include <uapi/linux/sched.h>
+#include <linux/mm_event.h>
 
 #include <asm/current.h>
 
@@ -985,6 +986,10 @@ struct task_struct {
 	struct rt_mutex_waiter		*pi_blocked_on;
 #endif
 
+#ifdef CONFIG_MM_EVENT_STAT
+	struct mm_event_task	mm_event[MM_TYPE_NUM];
+	unsigned long		next_period;
+#endif
 #ifdef CONFIG_DEBUG_MUTEXES
 	/* Mutex deadlock detection: */
 	struct mutex_waiter		*blocked_on;
diff --git a/include/trace/events/mm_event.h b/include/trace/events/mm_event.h
new file mode 100644
index 000000000000..f43dee51f46f
--- /dev/null
+++ b/include/trace/events/mm_event.h
@@ -0,0 +1,49 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mm_event
+
+#if !defined(_TRACE_MM_EVENT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MM_EVENT_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+#include <linux/mm_event.h>
+
+struct mm_event_task;
+
+#define show_mm_event_type(type)			\
+	__print_symbolic(type,				\
+		{ MM_MIN_FAULT, "min_flt" },		\
+		{ MM_MAJ_FAULT, "maj_flt" },		\
+		{ MM_COMPACTION, "compaction" },	\
+		{ MM_RECLAIM, "reclaim" })
+
+TRACE_EVENT(mm_event_record,
+
+	TP_PROTO(enum mm_event_type type, struct mm_event_task *record),
+
+	TP_ARGS(type, record),
+
+	TP_STRUCT__entry(
+		__field(enum mm_event_type, type)
+		__field(unsigned int, count)
+		__field(unsigned int, avg_lat)
+		__field(unsigned int, max_lat)
+	),
+
+	TP_fast_assign(
+		__entry->type = type;
+		__entry->count = record->count;
+		__entry->avg_lat = record->accm_lat / record->count;
+		__entry->max_lat = record->max_lat;
+	),
+
+	TP_printk("%s count=%d avg_lat=%u max_lat=%u",
+		show_mm_event_type(__entry->type),
+		__entry->count, __entry->avg_lat,
+		__entry->max_lat)
+);
+
+#endif /* _TRACE_MM_EVENT_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/fork.c b/kernel/fork.c
index 5a34c13ca34b..5d81e13fe127 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1227,6 +1227,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	tsk->min_flt = tsk->maj_flt = 0;
 	tsk->nvcsw = tsk->nivcsw = 0;
+	mm_event_task_init(tsk);
 #ifdef CONFIG_DETECT_HUNG_TASK
 	tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw;
 #endif
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 656af28886ea..62034754105d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -621,6 +621,22 @@ config ZSMALLOC_STAT
 	  information to userspace via debugfs. If unsure, say N.
 
+config MM_EVENT_STAT
+	bool "Track per-process MM events"
+	depends on MMU
+	help
+	  This option enables per-process MM event stats (e.g., fault, reclaim,
+	  compaction and so on) accumulated over an interval (0.5 sec by default).
+	  Admins can read the stats from the trace file via debugfs (e.g.,
+	  /sys/kernel/debug/tracing/trace).
+
+	  It records the max/average memory allocation latency for the interval
+	  as well as the event count, so admins can see what happened on the VM
+	  side (how often each event happened and how much time processes spent
+	  handling it). Very large values indicate the system was in trouble.
+
+	  The system can dump the trace into a bugreport when the user allows it.
+
 config GENERIC_EARLY_IOREMAP
 	bool
diff --git a/mm/Makefile b/mm/Makefile
index 3c72e1ba6348..1430c0b6cf69 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -28,6 +28,9 @@ mmu-$(CONFIG_MMU)	:= gup.o highmem.o memory.o mincore.o \
 			   rmap.o vmalloc.o
 
+ifdef CONFIG_MM_EVENT_STAT
+mmu-$(CONFIG_MMU)	+= mm_event.o
+endif
 ifdef CONFIG_CROSS_MEMORY_ATTACH
 mmu-$(CONFIG_MMU)	+= process_vm_access.o
 endif
diff --git a/mm/mm_event.c b/mm/mm_event.c
new file mode 100644
index 000000000000..f2ba7b9004ae
--- /dev/null
+++ b/mm/mm_event.c
@@ -0,0 +1,62 @@
+#include <linux/mm.h>
+#include <linux/mm_event.h>
+#include <linux/sched.h>
+#include <linux/vmalloc.h>
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/mm_event.h>
+
+void mm_event_task_init(struct task_struct *tsk)
+{
+	memset(tsk->mm_event, 0, sizeof(tsk->mm_event));
+	tsk->next_period = 0;
+}
+
+static void record_stat(void)
+{
+	if (time_is_before_eq_jiffies(current->next_period)) {
+		int i;
+
+		for (i = 0; i < MM_TYPE_NUM; i++) {
+			if (current->mm_event[i].count == 0)
+				continue;
+
+			trace_mm_event_record(i, &current->mm_event[i]);
+			memset(&current->mm_event[i], 0,
+					sizeof(struct mm_event_task));
+		}
+		current->next_period = jiffies + (HZ >> 1);
+	}
+}
+
+void mm_event_start(ktime_t *time)
+{
+	*time = ktime_get();
+}
+
+void mm_event_end(enum mm_event_type event, ktime_t start)
+{
+	s64 elapsed = ktime_us_delta(ktime_get(), start);
+
+	current->mm_event[event].count++;
+	current->mm_event[event].accm_lat += elapsed;
+	if (elapsed > current->mm_event[event].max_lat)
+		current->mm_event[event].max_lat = elapsed;
+	record_stat();
+}
+
+static struct dentry *mm_event_root;
+
+static int __init mm_event_init(void)
+{
+	mm_event_root = debugfs_create_dir("mm_event", NULL);
+	if (!mm_event_root) {
+		pr_warn("debugfs dir creation failed\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+subsys_initcall(mm_event_init);
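
Note (illustration only, not part of this commit): this patch adds the
facility without instrumenting any call site; later patches in the
series hook the MM paths. A caller is expected to bracket a
memory-management operation with mm_event_start()/mm_event_end()
roughly as follows; the wrapper and its hook point below are assumed
for illustration:

#include <linux/ktime.h>
#include <linux/mm_event.h>

/* Placeholder for the real work being measured. */
static void do_some_reclaim_work(void)
{
}

static void reclaim_with_mm_event(void)
{
	ktime_t start;

	mm_event_start(&start);			/* timestamp the beginning */
	do_some_reclaim_work();			/* the operation being measured */
	mm_event_end(MM_RECLAIM, start);	/* accumulate count and latency */
}

mm_event_end() adds the elapsed microseconds to
current->mm_event[MM_RECLAIM], and record_stat() flushes the
accumulated count/avg/max to the mm_event_record tracepoint at most
once every 0.5 sec (HZ >> 1).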