Johannes Weiner [Wed, 6 Aug 2014 23:05:44 +0000 (16:05 -0700)]
mm: memcontrol: rearrange charging fast path
The charging path currently starts out with OOM condition checks when
OOM is the rarest possible case.
Rearrange this code to run OOM/task dying checks only after trying the
percpu charge and the res_counter charge and bail out before entering
reclaim. Attempting a charge does not hurt an (oom-)killed task as much
as every charge attempt having to check OOM conditions. Also, only
check __GFP_NOFAIL when the charge would actually fail.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 6 Aug 2014 23:05:42 +0000 (16:05 -0700)]
mm: memcontrol: fold mem_cgroup_do_charge()
These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages. This drastically simplifies the code
and reduces charging and uncharging overhead. The most expensive part
of charging and uncharging is the page_cgroup bit spinlock, which is
removed entirely after this series.
Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup
(i.e. executing in the root memcg). Before:
15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
13.31% cat [kernel.kallsyms] [k] memset
11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
2.38% cat [kernel.kallsyms] [k] put_page
2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
After:
15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
13.48% cat [kernel.kallsyms] [k] memset
11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
2.46% cat [kernel.kallsyms] [k] put_page
2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
1.30% cat [kernel.kallsyms] [k] kfree
As you can see, the memcg footprint has shrunk quite a bit.
text data bss dec hex filename
37970 9892 400 48262 bc86 mm/memcontrol.o.old
35239 9892 400 45531 b1db mm/memcontrol.o
This patch (of 13):
This function was split out because mem_cgroup_try_charge() got too big.
But having essentially one sequence of operations arbitrarily split in
half is not good for reworking the code. Fold it back in.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 6 Aug 2014 23:05:40 +0000 (16:05 -0700)]
mm: page-flags: clean up the page flag test, set, clear macros
- PAGEFLAG_FALSE only defines TEST, make it define SET and CLEAR as
well, analogous to PAGEFLAG.
- Define TESTSETFLAG_FALSE, analogous to TESTSETFLAG.
- Define TESTSCFLAG_FALSE, analogous to TESTSCFLAG
- Make PG_mlocked accessors the same on both MMU and !MMU setups
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Waiman Long [Wed, 6 Aug 2014 23:05:38 +0000 (16:05 -0700)]
mm, thp: replace smp_mb after atomic_add by smp_mb__after_atomic
In some architectures like x86, atomic_add() is a full memory barrier.
In that case, an additional smp_mb() is just a waste of time. This
patch replaces that smp_mb() by smp_mb__after_atomic() which will avoid
the redundant memory barrier in some architectures.
With a 3.16-rc1 based kernel, this patch reduced the execution time of
breaking 1000 transparent huge pages from 38,245us to 30,964us. A
reduction of 19% which is quite sizeable. It also reduces the %cpu time
of the __split_huge_page_refcount function in the perf profile from
2.18% to 1.15%.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Scott J Norton <scott.norton@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Waiman Long [Wed, 6 Aug 2014 23:05:36 +0000 (16:05 -0700)]
mm, thp: move invariant bug check out of loop in __split_huge_page_map
In __split_huge_page_map(), the check for page_mapcount(page) is
invariant within the for loop. Because of the fact that the macro is
implemented using atomic_read(), the redundant check cannot be optimized
away by the compiler leading to unnecessary read to the page structure.
This patch moves the invariant bug check out of the loop so that it will
be done only once. On a 3.16-rc1 based kernel, the execution time of a
microbenchmark that broke up 1000 transparent huge pages using munmap()
had an execution time of 38,245us and 38,548us with and without the
patch respectively. The performance gain is about 1%.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Scott J Norton <scott.norton@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:05:34 +0000 (16:05 -0700)]
mm, CMA: clean-up log message
We don't need explicit 'CMA:' prefix, since we already define prefix
'cma:' in pr_fmt. So remove it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:05:32 +0000 (16:05 -0700)]
mm, CMA: change cma_declare_contiguous() to obey coding convention
Conventionally, we put output param to the end of param list and put the
'base' ahead of 'size', but cma_declare_contiguous() doesn't look like
that, so change it.
Additionally, move down cma_areas reference code to the position where
it is really needed.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:05:30 +0000 (16:05 -0700)]
mm, CMA: clean-up CMA allocation error path
We can remove one call sites for clear_cma_bitmap() if we first call it
before checking error number.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:05:28 +0000 (16:05 -0700)]
PPC, KVM, CMA: use general CMA reserved area management framework
Now, we have general CMA reserved area management framework, so use it
for future maintainabilty. There is no functional change.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:05:25 +0000 (16:05 -0700)]
CMA: generalize CMA reserved area management functionality
Currently, there are two users on CMA functionality, one is the DMA
subsystem and the other is the KVM on powerpc. They have their own code
to manage CMA reserved area even if they looks really similar. From my
guess, it is caused by some needs on bitmap management. KVM side wants
to maintain bitmap not for 1 page, but for more size. Eventually it use
bitmap where one bit represents 64 pages.
When I implement CMA related patches, I should change those two places
to apply my change and it seem to be painful to me. I want to change
this situation and reduce future code management overhead through this
patch.
This change could also help developer who want to use CMA in their new
feature development, since they can use CMA easily without copying &
pasting this reserved area management code.
In previous patches, we have prepared some features to generalize CMA
reserved area management and now it's time to do it. This patch moves
core functions to mm/cma.c and change DMA APIs to use these functions.
There is no functional change in DMA APIs.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:05:23 +0000 (16:05 -0700)]
DMA, CMA: support arbitrary bitmap granularity
PPC KVM's CMA area management requires arbitrary bitmap granularity,
since they want to reserve very large memory and manage this region with
bitmap that one bit for several pages to reduce management overheads.
So support arbitrary bitmap granularity for following generalization.
[akpm@linux-foundation.org: s/1/1UL/]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:05:21 +0000 (16:05 -0700)]
DMA, CMA: support alignment constraint on CMA region
PPC KVM's CMA area management needs alignment constraint on CMA region.
So support it to prepare generalization of CMA area management
functionality.
Additionally, add some comments which tell us why alignment constraint
is needed on CMA region.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:05:19 +0000 (16:05 -0700)]
DMA, CMA: separate core CMA management codes from DMA APIs
To prepare future generalization work on CMA area management code, we
need to separate core CMA management codes from DMA APIs. We will
extend these core functions to cover requirements of PPC KVM's CMA area
management functionality in following patches. This separation helps us
not to touch DMA APIs while extending core functions.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:05:17 +0000 (16:05 -0700)]
mm/internal.h: use nth_page
Use nth_page instead of pfn_to_page(page_to_pfn
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Nazarewicz [Wed, 6 Aug 2014 23:05:15 +0000 (16:05 -0700)]
mm: page_alloc: simplify drain_zone_pages by using min()
Instead of open-coding getting minimal value of two, just use min macro.
That is why it is there for. While changing the function also change
type of batch local variable to match type of per_cpu_pages::batch
(which is int).
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tang Chen [Wed, 6 Aug 2014 23:05:13 +0000 (16:05 -0700)]
mem-hotplug: introduce MMOP_OFFLINE to replace the hard coding -1
In store_mem_state(), we have:
...
334 else if (!strncmp(buf, "offline", min_t(int, count, 7)))
335 online_type = -1;
...
355 case -1:
356 ret = device_offline(&mem->dev);
357 break;
...
Here, "offline" is hard coded as -1.
This patch does the following renaming:
ONLINE_KEEP -> MMOP_ONLINE_KEEP
ONLINE_KERNEL -> MMOP_ONLINE_KERNEL
ONLINE_MOVABLE -> MMOP_ONLINE_MOVABLE
and introduces MMOP_OFFLINE = -1 to avoid hard coding.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Hu Tao <hutao@cn.fujitsu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tang Chen [Wed, 6 Aug 2014 23:05:11 +0000 (16:05 -0700)]
mem-hotplug: avoid illegal state prefixed with legal state when changing state of memory_block
We use the following command to online a memory_block:
echo online|online_kernel|online_movable > /sys/devices/system/memory/memoryXXX/state
But, if we do the following:
echo online_fhsjkghfkd > /sys/devices/system/memory/memoryXXX/state
the block will also be onlined.
This is because the following code in store_mem_state() does not compare
the whole string, but only the prefix of the string.
store_mem_state()
{
......
328 if (!strncmp(buf, "online_kernel", min_t(int, count, 13)))
Here, only compare the first 13 letters of the string. If we give "online_kernelXXXXXX",
it will be recognized as online_kernel, which is incorrect.
329 online_type = ONLINE_KERNEL;
330 else if (!strncmp(buf, "online_movable", min_t(int, count, 14)))
We have the same problem here,
331 online_type = ONLINE_MOVABLE;
332 else if (!strncmp(buf, "online", min_t(int, count, 6)))
here,
(Here is more problematic. If we give online_movalbe, which is a typo
of online_movable, it will be recognized as online without noticing the
author.)
333 online_type = ONLINE_KEEP;
334 else if (!strncmp(buf, "offline", min_t(int, count, 7)))
and here.
335 online_type = -1;
336 else {
337 ret = -EINVAL;
338 goto err;
339 }
......
}
This patch fixes this problem by using sysfs_streq() to compare the
whole string.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reported-by: Hu Tao <hutao@cn.fujitsu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Wed, 6 Aug 2014 23:05:08 +0000 (16:05 -0700)]
mm/memory.c: use entry = ACCESS_ONCE(*pte) in handle_pte_fault()
Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or
orig_pte upon which all subsequent decisions and pte_same() tests will
be made.
I have no evidence that its lack is responsible for the mm/filemap.c:202
BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by
trinity, and I am not optimistic that it will fix it. But I have found
no other explanation, and ACCESS_ONCE() here will surely not hurt.
If gcc does re-access the pte before passing it down, then that would be
disastrous for correct page fault handling, and certainly could explain
the page_mapped() BUGs seen (concurrent fault causing page to be mapped
in a second time on top of itself: mapcount 2 for a single pte).
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:05:06 +0000 (16:05 -0700)]
vmalloc: use rcu list iterator to reduce vmap_area_lock contention
Richard Yao reported a month ago that his system have a trouble with
vmap_area_lock contention during performance analysis by /proc/meminfo.
Andrew asked why his analysis checks /proc/meminfo stressfully, but he
didn't answer it.
https://lkml.org/lkml/2014/4/10/416
Although I'm not sure that this is right usage or not, there is a
solution reducing vmap_area_lock contention with no side-effect. That
is just to use rcu list iterator in get_vmalloc_info().
rcu can be used in this function because all RCU protocol is already
respected by writers, since Nick Piggin commit
db64fe02258f1 ("mm:
rewrite vmap layer") back in linux-2.6.28
Specifically :
insertions use list_add_rcu(),
deletions use list_del_rcu() and kfree_rcu().
Note the rb tree is not used from rcu reader (it would not be safe),
only the vmap_area_list has full RCU protection.
Note that __purge_vmap_area_lazy() already uses this rcu protection.
rcu_read_lock();
list_for_each_entry_rcu(va, &vmap_area_list, list) {
if (va->flags & VM_LAZY_FREE) {
if (va->va_start < *start)
*start = va->va_start;
if (va->va_end > *end)
*end = va->va_end;
nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
list_add_tail(&va->purge_list, &valist);
va->flags |= VM_LAZY_FREEING;
va->flags &= ~VM_LAZY_FREE;
}
}
rcu_read_unlock();
Peter:
: While rcu list traversal over the vmap_area_list is safe, this may
: arrive at different results than the spinlocked version. The rcu list
: traversal version will not be a 'snapshot' of a single, valid instant
: of the entire vmap_area_list, but rather a potential amalgam of
: different list states.
Joonsoo:
: Yes, you are right, but I don't think that we should be strict here.
: Meminfo is already not a 'snapshot' at specific time. While we try to get
: certain stats, the other stats can change. And, although we may arrive at
: different results than the spinlocked version, the difference would not be
: large and would not make serious side-effect.
[edumazet@google.com: add more commit description]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Richard Yao <ryao@gentoo.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:05:03 +0000 (16:05 -0700)]
include/linux/memblock.h: add __init to memblock_set_bottom_up()
memblock_set_bottom_up() is only called by __init
cmdline_parse_movable_node() and __init numa_init().
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Wed, 6 Aug 2014 23:05:01 +0000 (16:05 -0700)]
mm/page_alloc.c: unexport alloc_pages_exact_nid()
It is only called by mm/page_cgroup.c whcih cannot be modular.
Reported-by: David Rientjes <rientjes@google.com>
Cc: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:04:59 +0000 (16:04 -0700)]
mm/page_alloc.c: add __meminit to alloc_pages_exact_nid()
alloc_pages_exact_nid() is only called by __meminit alloc_page_cgroup()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:04:57 +0000 (16:04 -0700)]
mm/memory_hotplug.c: add __meminit to grow_zone_span/grow_pgdat_span
grow_zone_span and grow_pgdat_span are only called by
__meminit __add_zone
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Toshi Kani <toshi.kani@hp.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:04:55 +0000 (16:04 -0700)]
mm/readahead.c: remove unused file_ra_state from count_history_pages
count_history_pages does only call page_cache_prev_hole in rcu_lock
context using address_space mapping. There's no need to have
file_ra_state here.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 6 Aug 2014 23:04:53 +0000 (16:04 -0700)]
slab: convert last use of __FUNCTION__ to __func__
Just about all of these have been converted to __func__, so convert the
last use.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gu Zheng [Wed, 6 Aug 2014 23:04:51 +0000 (16:04 -0700)]
slab: fix the alias count (via sysfs) of slab cache
We mark some slab caches (e.g. kmem_cache_node) as unmergeable by
setting refcount to -1, and their alias should be 0, not refcount-1, so
correct it here.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Carpenter [Wed, 6 Aug 2014 23:04:48 +0000 (16:04 -0700)]
mm, slub: fix some indenting in cmpxchg_double_slab()
The return statement goes with the cmpxchg_double() condition so it needs
to be indented another tab.
Also these days the fashion is to line function parameters up, and it
looks nicer that way because then the "freelist_new" is not at the same
indent level as the "return 1;".
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wang Sheng-Hui [Wed, 6 Aug 2014 23:04:46 +0000 (16:04 -0700)]
mm/slab.c: fix comments
Current struct kmem_cache has no 'lock' field, and slab page is managed by
struct kmem_cache_node, which has 'list_lock' field.
Clean up the related comment.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Wed, 6 Aug 2014 23:04:44 +0000 (16:04 -0700)]
mm: move slab related stuff from util.c to slab_common.c
Functions krealloc(), __krealloc(), kzfree() belongs to slab API, so
should be placed in slab_common.c
Also move slab allocator's tracepoints defenitions to slab_common.c No
functional changes here.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wei Yang [Wed, 6 Aug 2014 23:04:42 +0000 (16:04 -0700)]
slub: avoid duplicate creation on the first object
When a kmem_cache is created with ctor, each object in the kmem_cache
will be initialized before ready to use. While in slub implementation,
the first object will be initialized twice.
This patch reduces the duplication of initialization of the first
object.
Fix commit
7656c72b ("SLUB: add macros for scanning objects in a slab").
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:04:40 +0000 (16:04 -0700)]
slab: change int to size_t for representing allocation size
It is better to represent allocation size in size_t rather than int. So
change it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:04:38 +0000 (16:04 -0700)]
slab: remove BAD_ALIEN_MAGIC
BAD_ALIEN_MAGIC value isn't used anymore. So remove it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:04:35 +0000 (16:04 -0700)]
slab: remove a useless lockdep annotation
Now, there is no code to hold two lock simultaneously, since we don't
call slab_destroy() with holding any lock. So, lockdep annotation is
useless now. Remove it.
v2: don't remove BAD_ALIEN_MAGIC in this patch. It will be removed
in the following patch.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:04:33 +0000 (16:04 -0700)]
slab: destroy a slab without holding any alien cache lock
I haven't heard that this alien cache lock is contended, but to reduce
chance of contention would be better generally. And with this change,
we can simplify complex lockdep annotation in slab code. In the
following patch, it will be implemented.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:04:31 +0000 (16:04 -0700)]
slab: use the lock on alien_cache, instead of the lock on array_cache
Now, we have separate alien_cache structure, so it'd be better to hold
the lock on alien_cache while manipulating alien_cache. After that, we
don't need the lock on array_cache, so remove it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:04:29 +0000 (16:04 -0700)]
slab: introduce alien_cache
Currently, we use array_cache for alien_cache. Although they are mostly
similar, there is one difference, that is, need for spinlock. We don't
need spinlock for array_cache itself, but to use array_cache for
alien_cache, array_cache structure should have spinlock. This is
needless overhead, so removing it would be better. This patch prepare
it by introducing alien_cache and using it. In the following patch, we
remove spinlock in array_cache.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:04:27 +0000 (16:04 -0700)]
slab: factor out initialization of array cache
Factor out initialization of array cache to use it in following patch.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:04:25 +0000 (16:04 -0700)]
slab: defer slab_destroy in free_block()
In free_block(), if freeing object makes new free slab and number of
free_objects exceeds free_limit, we start to destroy this new free slab
with holding the kmem_cache node lock. Holding the lock is useless and,
generally, holding a lock as least as possible is good thing. I never
measure performance effect of this, but we'd be better not to hold the
lock as much as possible.
Commented by Christoph:
This is also good because kmem_cache_free is no longer called while
holding the node lock. So we avoid one case of recursion.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:04:22 +0000 (16:04 -0700)]
slab: move up code to get kmem_cache_node in free_block()
node isn't changed, so we don't need to retreive this structure
everytime we move the object. Maybe compiler do this optimization, but
making it explicitly is better.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 6 Aug 2014 23:04:20 +0000 (16:04 -0700)]
slab: add unlikely macro to help compiler
This patchset does some cleanup and tries to remove lockdep annotation.
Patches 1~2 are just for really really minor improvement.
Patches 3~9 are for clean-up and removing lockdep annotation.
There are two cases that lockdep annotation is needed in SLAB.
1) holding two node locks
2) holding two array cache(alien cache) locks
I looked at the code and found that we can avoid these cases without any
negative effect.
1) occurs if freeing object makes new free slab and we decide to
destroy it. Although we don't need to hold the lock during destroying
a slab, current code do that. Destroying a slab without holding the
lock would help the reduction of the lock contention. To do it, I
change the implementation that new free slab is destroyed after
releasing the lock.
2) occurs on similar situation. When we free object from non-local
node, we put this object to alien cache with holding the alien cache
lock. If alien cache is full, we try to flush alien cache to proper
node cache, and, in this time, new free slab could be made. Destroying
it would be started and we will free metadata object which comes from
another node. In this case, we need another node's alien cache lock to
free object. This forces us to hold two array cache locks and then we
need lockdep annotation although they are always different locks and
deadlock cannot be possible. To prevent this situation, I use same way
as 1).
In this way, we can avoid 1) and 2) cases, and then, can remove lockdep
annotation. As short stat noted, this makes SLAB code much simpler.
This patch (of 9):
slab_should_failslab() is called on every allocation, so to optimize it
is reasonable. We normally don't allocate from kmem_cache. It is just
used when new kmem_cache is created, so it's very rare case. Therefore,
add unlikely macro to help compiler optimization.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Wed, 6 Aug 2014 23:04:18 +0000 (16:04 -0700)]
mm: slub: SLUB_DEBUG=n: use the same alloc/free hooks as for SLUB_DEBUG=y
There are two versions of alloc/free hooks now - one for
CONFIG_SLUB_DEBUG=y and another one for CONFIG_SLUB_DEBUG=n.
I see no reason why calls to other debugging subsystems (LOCKDEP,
DEBUG_ATOMIC_SLEEP, KMEMCHECK and FAILSLAB) are hidden under SLUB_DEBUG.
All this features should work regardless of SLUB_DEBUG config, as all of
them already have own Kconfig options.
This also fixes failslab for CONFIG_SLUB_DEBUG=n configuration. It
simply has not worked before because should_failslab() call was in a
hook hidden under "#ifdef CONFIG_SLUB_DEBUG #else".
Note: There is one concealed change in allocation path for SLUB_DEBUG=n
and all other debugging features disabled. The might_sleep_if() call
can generate some code even if DEBUG_ATOMIC_SLEEP=n. For
PREEMPT_VOLUNTARY=y might_sleep() inserts _cond_resched() call, but I
think it should be ok.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 6 Aug 2014 23:04:16 +0000 (16:04 -0700)]
mm, slub: mark resiliency_test as init text
resiliency_test() is only called for bootstrap, so it may be moved to
init.text and freed after boot.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Wed, 6 Aug 2014 23:04:14 +0000 (16:04 -0700)]
mm: slab.h: wrap the whole file with guarding macro
Guarding section:
#ifndef MM_SLAB_H
#define MM_SLAB_H
...
#endif
currently doesn't cover the whole mm/slab.h. It seems like it was
done unintentionally.
Wrap the whole file by moving closing #endif to the end of it.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christoph Lameter [Wed, 6 Aug 2014 23:04:11 +0000 (16:04 -0700)]
slab: use get_node() and kmem_cache_node() functions
Use the two functions to simplify the code avoiding numerous explicit
checks coded checking for a certain node to be online.
Get rid of various repeated calculations of kmem_cache_node structures.
[akpm@linux-foundation.org: fix build]
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christoph Lameter [Wed, 6 Aug 2014 23:04:09 +0000 (16:04 -0700)]
slub: use new node functions
Make use of the new node functions in mm/slab.h to reduce code size and
simplify.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christoph Lameter [Wed, 6 Aug 2014 23:04:07 +0000 (16:04 -0700)]
slab common: add functions for kmem_cache_node access
The patchset provides two new functions in mm/slab.h and modifies SLAB
and SLUB to use these. The kmem_cache_node structure is shared between
both allocators and the use of common accessors will allow us to move
more code into slab_common.c in the future.
This patch (of 3):
These functions allow to eliminate repeatedly used code in both SLAB and
SLUB and also allow for the insertion of debugging code that may be
needed in the development process.
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:04:05 +0000 (16:04 -0700)]
mm/slab.c: add __init to init_lock_keys
init_lock_keys is only called by __init kmem_cache_init_late
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:04:03 +0000 (16:04 -0700)]
kernel/watchdog.c: convert printk/pr_warning to pr_foo()
Replace some obsolete functions.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:04:01 +0000 (16:04 -0700)]
fs/ocfs2/slot_map.c: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tariq Saeed [Wed, 6 Aug 2014 23:03:58 +0000 (16:03 -0700)]
ocfs2: race between umount and unfinished remastering during recovery
Orabug:
19074140
When umount is issued during recovery on the new master that has not
finished remastering locks, it triggers BUG() in
dlm_send_mig_lockres_msg(). Here is the situation:
1) node A has a lock on resource X mastered by node B.
2) node B dies -> node A sets recovering flag for res X
3) Node C becomes the new master for resources owned by the
dead node and is remastering locks of the dead node but
has not finished the remastering process yet.
4) umount is issued on node C.
5) During processing of umount, ignoring unfished recovery,
node C attempts to migrate resource X to node A.
6) node A finds res X in DLM_LOCK_RES_RECOVERING state, considers
it a logic error and sends back -EFAULT.
7) node C asserts BUG() upon seeing EFAULT resp from node B.
Fix is to delay migrating res X till remastering is finished at which
point recovering flag will be cleared on both A and C.
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xue jiufei [Wed, 6 Aug 2014 23:03:56 +0000 (16:03 -0700)]
ocfs2: remove conversion of total_backoff in dlm_join_domain()
The unit of total_backoff is msecs not jiffies, so no need to do the
conversion. Otherwise, the join timeout is not 90 sec.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yingtai Xie [Wed, 6 Aug 2014 23:03:54 +0000 (16:03 -0700)]
ocfs2: correctly check the return value of ocfs2_search_extent_list
ocfs2_search_extent_list may return -1, so we should check the return
value in ocfs2_split_and_insert, otherwise it may cause array index out of
bound.
And ocfs2_search_extent_list can only return value less than
el->l_next_free_rec, so check if it is equal or larger than
le16_to_cpu(el->l_next_free_rec) is meaningless.
Signed-off-by: Yingtai Xie <xieyingtai@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:03:52 +0000 (16:03 -0700)]
fs/squashfs/super.c: logging cleanup
- Convert printk to pr_foo()
- Add pr_fmt for future logging entries
- Coalesce formats
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:03:50 +0000 (16:03 -0700)]
fs/squashfs/file_direct.c: replace count*size kmalloc by kmalloc_array
kmalloc_array() manages count*sizeof overflow.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pranith Kumar [Wed, 6 Aug 2014 23:03:48 +0000 (16:03 -0700)]
sh: fix build error by adding generic ioport_{map/unmap}()
Fix build error as reported by Geert Uytterhoeven here:
http://kisskb.ellerman.id.au/kisskb/buildresult/
11607865/
The error happens when CONFIG_HAS_IOPORT_MAP=n because of which there
are missing definitions of ioport_map/unmap(). Fix this build error by
adding these prototypes.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: <stable@vger.kernel.org> [3.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daniel Palmer [Wed, 6 Aug 2014 23:03:45 +0000 (16:03 -0700)]
sh: SH7724 clock fixes
Fix the device name for the CMT.
Add clocks called usb0 and usb1 so that r8a66597_hcd works again on the
ecovec24 board
Signed-off-by: Daniel Palmer <danieruru@gmail.com>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:03:43 +0000 (16:03 -0700)]
arch/sh/kernel/time.c: use PTR_ERR_OR_ZERO
Replace IS_ERR/PTR_ERR.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:03:41 +0000 (16:03 -0700)]
arch/sh/mm/asids-debugfs.c: use PTR_ERR_OR_ZERO
Replace IS_ERR/PTR_ERR.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Richard Weinberger [Wed, 6 Aug 2014 23:03:39 +0000 (16:03 -0700)]
sh: remove CPU_SUBTYPE_SH7764
The symbol is an orphan, get rid of it.
Submitted by Richard a few months ago as "[PATCH 21/28] Remove
CPU_SUBTYPE_SH7764".
[pebolle@tiscali.nl: re-added dropped ||]
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chen Gang [Wed, 6 Aug 2014 23:03:37 +0000 (16:03 -0700)]
score, ptrace: remove unused macros
'COUNTER' and other same kind macros are too common to use, and easy to
get conflict with other modules.
At present, they are not used, so it is OK to simply remove them. And
the related warning (allmodconfig with score):
CC [M] drivers/md/raid1.o
In file included from drivers/md/raid1.c:42:0:
drivers/md/bitmap.h:93:0: warning: "COUNTER" redefined
#define COUNTER(x) (((bitmap_counter_t) x) & COUNTER_MAX)
^
In file included from ./arch/score/include/asm/ptrace.h:4:0,
from include/linux/sched.h:31,
from include/linux/blkdev.h:4,
from drivers/md/raid1.c:36:
./arch/score/include/uapi/asm/ptrace.h:13:0: note: this is the location of the previous definition
#define COUNTER 38
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:03:35 +0000 (16:03 -0700)]
ntfs: kernel-doc warning fixes
cached_page and lru_pvec were removed from ntfs_attr_extend_initialized
in commit
2ec93b0bf35f ("ntfs: clean up ntfs_attr_extend_initialized")
lru_pvec has been removed from __ntfs_grab_cache_pages in commit
4c99000ac47c ("ntfs: use add_to_page_cache_lru()")
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:03:33 +0000 (16:03 -0700)]
fs/logfs/readwrite.c: kernel-doc warning fixes
s/-/:/ and fix variable names.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Wed, 6 Aug 2014 23:03:31 +0000 (16:03 -0700)]
./Makefile: explain stack-protector-strong CONFIG logic
This adds a hopefully helpful comment above the (seemingly weird) compiler
flag selection logic.
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 6 Aug 2014 23:03:28 +0000 (16:03 -0700)]
fanotify: fix double free of pending permission events
Commit
85816794240b ("fanotify: Fix use after free for permission
events") introduced a double free issue for permission events which are
pending in group's notification queue while group is being destroyed.
These events are freed from fanotify_handle_event() but they are not
removed from groups notification queue and thus they get freed again
from fsnotify_flush_notify().
Fix the problem by removing permission events from notification queue
before freeing them if we skip processing access response. Also expand
comments in fanotify_release() to explain group shutdown in detail.
Fixes:
85816794240b9659e66e4d9b0df7c6e814e5f603
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Douglas Leeder <douglas.leeder@sophos.com>
Tested-by: Douglas Leeder <douglas.leeder@sophos.com>
Reported-by: Heinrich Schuchard <xypron.glpk@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 6 Aug 2014 23:03:26 +0000 (16:03 -0700)]
fsnotify: rename event handling functions
Rename fsnotify_add_notify_event() to fsnotify_add_event() since the
"notify" part is duplicit. Rename fsnotify_remove_notify_event() and
fsnotify_peek_notify_event() to fsnotify_remove_first_event() and
fsnotify_peek_first_event() respectively since "notify" part is duplicit
and they really look at the first event in the queue.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:03:24 +0000 (16:03 -0700)]
fs/fscache: make ctl_table static
fscache_sysctls and fscache_sysctls_root are only used in main.c
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: David Howells <dhowells@redhat.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Wed, 6 Aug 2014 23:03:22 +0000 (16:03 -0700)]
kernel/auditfilter.c: replace count*size kmalloc by kcalloc
kcalloc manages count*sizeof overflow.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 6 Aug 2014 16:42:33 +0000 (09:42 -0700)]
Merge git://git./linux/kernel/git/davem/ide
Pull IDE cleanup from David Miller:
"Just one minor cleanup"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide:
ide: use module_platform_driver()
Linus Torvalds [Wed, 6 Aug 2014 16:41:23 +0000 (09:41 -0700)]
Merge git://git./linux/kernel/git/davem/sparc-next
Pull sparc updates from David Miller:
1) Add sparc RAM output to /proc/iomem, from Bob Picco.
2) Allow seeks on /dev/mdesc, from Khalid Aziz.
3) Cleanup sparc64 I/O accessors, from Sam Ravnborg.
4) If update_mmu_cache{,_pmd}() is called with an not-valid mapping, do
not insert it into the TLB miss hash tables otherwise we'll
livelock. Based upon work by Christopher Alexander Tobias Schulze.
5) Fix BREAK detection in sunsab driver when no actual characters are
pending, from Christopher Alexander Tobias Schulze.
6) Because we have modules --> openfirmware --> vmalloc ordering of
virtual memory, the lazy VMAP TLB flusher can cons up an invocation
of flush_tlb_kernel_range() that covers the openfirmware address
range. Unfortunately this will flush out the firmware's locked TLB
mapping which causes all kinds of trouble. Just split up the flush
request if this happens, but in the long term the lazy VMAP flusher
should probably be made a little bit smarter.
Based upon work by Christopher Alexander Tobias Schulze.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next:
sparc64: Fix up merge thinko.
sparc: Add "install" target
arch/sparc/math-emu/math_32.c: drop stray break operator
sparc64: ldc_connect() should not return EINVAL when handshake is in progress.
sparc64: Guard against flushing openfirmware mappings.
sunsab: Fix detection of BREAK on sunsab serial console
bbc-i2c: Fix BBC I2C envctrl on SunBlade 2000
sparc64: Do not insert non-valid PTEs into the TSB hash table.
sparc64: avoid code duplication in io_64.h
sparc64: reorder functions in io_64.h
sparc64: drop unused SLOW_DOWN_IO definitions
sparc64: remove macro indirection in io_64.h
sparc64: update IO access functions in PeeCeeI
sparcspkr: use sbus_*() primitives for IO
sparc: Add support for seek and shorter read to /dev/mdesc
sparc: use %s for unaligned panic
drivers/sbus/char: Micro-optimization in display7seg.c
display7seg: Introduce the use of the managed version of kzalloc
sparc64 - add mem to iomem resource
Linus Torvalds [Wed, 6 Aug 2014 16:38:14 +0000 (09:38 -0700)]
Merge git://git./linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
"Highlights:
1) Steady transitioning of the BPF instructure to a generic spot so
all kernel subsystems can make use of it, from Alexei Starovoitov.
2) SFC driver supports busy polling, from Alexandre Rames.
3) Take advantage of hash table in UDP multicast delivery, from David
Held.
4) Lighten locking, in particular by getting rid of the LRU lists, in
inet frag handling. From Florian Westphal.
5) Add support for various RFC6458 control messages in SCTP, from
Geir Ola Vaagland.
6) Allow to filter bridge forwarding database dumps by device, from
Jamal Hadi Salim.
7) virtio-net also now supports busy polling, from Jason Wang.
8) Some low level optimization tweaks in pktgen from Jesper Dangaard
Brouer.
9) Add support for ipv6 address generation modes, so that userland
can have some input into the process. From Jiri Pirko.
10) Consolidate common TCP connection request code in ipv4 and ipv6,
from Octavian Purdila.
11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.
12) Generic resizable RCU hash table, with intial users in netlink and
nftables. From Thomas Graf.
13) Maintain a name assignment type so that userspace can see where a
network device name came from (enumerated by kernel, assigned
explicitly by userspace, etc.) From Tom Gundersen.
14) Automatic flow label generation on transmit in ipv6, from Tom
Herbert.
15) New packet timestamping facilities from Willem de Bruijn, meant to
assist in measuring latencies going into/out-of the packet
scheduler, latency from TCP data transmission to ACK, etc"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
cxgb4 : Disable recursive mailbox commands when enabling vi
net: reduce USB network driver config options.
tg3: Modify tg3_tso_bug() to handle multiple TX rings
amd-xgbe: Perform phy connect/disconnect at dev open/stop
amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
net: sun4i-emac: fix memory leak on bad packet
sctp: fix possible seqlock seadlock in sctp_packet_transmit()
Revert "net: phy: Set the driver when registering an MDIO bus device"
cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
team: Simplify return path of team_newlink
bridge: Update outdated comment on promiscuous mode
net-timestamp: ACK timestamp for bytestreams
net-timestamp: TCP timestamping
net-timestamp: SCHED timestamp on entering packet scheduler
net-timestamp: add key to disambiguate concurrent datagrams
net-timestamp: move timestamp flags out of sk_flags
net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
cxgb4i : Move stray CPL definitions to cxgb4 driver
tcp: reduce spurious retransmits due to transient SACK reneging
qlcnic: Initialize dcbnl_ops before register_netdev
...
Linus Torvalds [Wed, 6 Aug 2014 15:16:24 +0000 (08:16 -0700)]
Merge tag 'random_for_linus' of git://git./linux/kernel/git/tytso/random
Pull randomness updates from Ted Ts'o:
"Cleanups and bug fixes to /dev/random, add a new getrandom(2) system
call, which is a superset of OpenBSD's getentropy(2) call, for use
with userspace crypto libraries such as LibreSSL.
Also add the ability to have a kernel thread to pull entropy from
hardware rng devices into /dev/random"
* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
hwrng: Pass entropy to add_hwgenerator_randomness() in bits, not bytes
random: limit the contribution of the hw rng to at most half
random: introduce getrandom(2) system call
hw_random: fix sparse warning (NULL vs 0 for pointer)
random: use registers from interrupted code for CPU's w/o a cycle counter
hwrng: add per-device entropy derating
hwrng: create filler thread
random: add_hwgenerator_randomness() for feeding entropy from devices
random: use an improved fast_mix() function
random: clean up interrupt entropy accounting for archs w/o cycle counters
random: only update the last_pulled time if we actually transferred entropy
random: remove unneeded hash of a portion of the entropy pool
random: always update the entropy pool under the spinlock
Linus Torvalds [Wed, 6 Aug 2014 15:06:39 +0000 (08:06 -0700)]
Merge branch 'next' of git://git./linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
"In this release:
- PKCS#7 parser for the key management subsystem from David Howells
- appoint Kees Cook as seccomp maintainer
- bugfixes and general maintenance across the subsystem"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (94 commits)
X.509: Need to export x509_request_asymmetric_key()
netlabel: shorter names for the NetLabel catmap funcs/structs
netlabel: fix the catmap walking functions
netlabel: fix the horribly broken catmap functions
netlabel: fix a problem when setting bits below the previously lowest bit
PKCS#7: X.509 certificate issuer and subject are mandatory fields in the ASN.1
tpm: simplify code by using %*phN specifier
tpm: Provide a generic means to override the chip returned timeouts
tpm: missing tpm_chip_put in tpm_get_random()
tpm: Properly clean sysfs entries in error path
tpm: Add missing tpm_do_selftest to ST33 I2C driver
PKCS#7: Use x509_request_asymmetric_key()
Revert "selinux: fix the default socket labeling in sock_graft()"
X.509: x509_request_asymmetric_keys() doesn't need string length arguments
PKCS#7: fix sparse non static symbol warning
KEYS: revert encrypted key change
ima: add support for measuring and appraising firmware
firmware_class: perform new LSM checks
security: introduce kernel_fw_from_file hook
PKCS#7: Missing inclusion of linux/err.h
...
Christoph Jaeger [Wed, 9 Apr 2014 07:28:01 +0000 (09:28 +0200)]
ide: use module_platform_driver()
Eliminate boilerplate code by using module_platform_driver().
Signed-off-by: Christoph Jaeger <christophjaeger@linux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 6 Aug 2014 02:09:19 +0000 (19:09 -0700)]
sparc64: Fix up merge thinko.
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 6 Aug 2014 01:57:18 +0000 (18:57 -0700)]
Merge git://git./linux/kernel/git/davem/sparc
Conflicts:
arch/sparc/mm/init_64.c
Conflict was simple non-overlapping additions.
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 6 Aug 2014 01:46:26 +0000 (18:46 -0700)]
Merge git://git./linux/kernel/git/davem/net
Conflicts:
drivers/net/Makefile
net/ipv6/sysctl_net_ipv6.c
Two ipv6_table_template[] additions overlap, so the index
of the ipv6_table[x] assignments needed to be adjusted.
In the drivers/net/Makefile case, we've gotten rid of the
garbage whereby we had to list every single USB networking
driver in the top-level Makefile, there is just one
"USB_NETWORKING" that guards everything.
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 6 Aug 2014 00:46:42 +0000 (17:46 -0700)]
Merge branch 'timers-core-for-linus' of git://git./linux/kernel/git/tip/tip
Pull timer and time updates from Thomas Gleixner:
"A rather large update of timers, timekeeping & co
- Core timekeeping code is year-2038 safe now for 32bit machines.
Now we just need to fix all in kernel users and the gazillion of
user space interfaces which rely on timespec/timeval :)
- Better cache layout for the timekeeping internal data structures.
- Proper nanosecond based interfaces for in kernel users.
- Tree wide cleanup of code which wants nanoseconds but does hoops
and loops to convert back and forth from timespecs. Some of it
definitely belongs into the ugly code museum.
- Consolidation of the timekeeping interface zoo.
- A fast NMI safe accessor to clock monotonic for tracing. This is a
long standing request to support correlated user/kernel space
traces. With proper NTP frequency correction it's also suitable
for correlation of traces accross separate machines.
- Checkpoint/restart support for timerfd.
- A few NOHZ[_FULL] improvements in the [hr]timer code.
- Code move from kernel to kernel/time of all time* related code.
- New clocksource/event drivers from the ARM universe. I'm really
impressed that despite an architected timer in the newer chips SoC
manufacturers insist on inventing new and differently broken SoC
specific timers.
[ Ed. "Impressed"? I don't think that word means what you think it means ]
- Another round of code move from arch to drivers. Looks like most
of the legacy mess in ARM regarding timers is sorted out except for
a few obnoxious strongholds.
- The usual updates and fixlets all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
timekeeping: Fixup typo in update_vsyscall_old definition
clocksource: document some basic timekeeping concepts
timekeeping: Use cached ntp_tick_length when accumulating error
timekeeping: Rework frequency adjustments to work better w/ nohz
timekeeping: Minor fixup for timespec64->timespec assignment
ftrace: Provide trace clocks monotonic
timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC
seqcount: Add raw_write_seqcount_latch()
seqcount: Provide raw_read_seqcount()
timekeeping: Use tk_read_base as argument for timekeeping_get_ns()
timekeeping: Create struct tk_read_base and use it in struct timekeeper
timekeeping: Restructure the timekeeper some more
clocksource: Get rid of cycle_last
clocksource: Move cycle_last validation to core code
clocksource: Make delta calculation a function
wireless: ath9k: Get rid of timespec conversions
drm: vmwgfx: Use nsec based interfaces
drm: i915: Use nsec based interfaces
timekeeping: Provide ktime_get_raw()
hangcheck-timer: Use ktime_get_ns()
...
Linus Torvalds [Wed, 6 Aug 2014 00:38:45 +0000 (17:38 -0700)]
Merge branch 'irq-core-for-linus' of git://git./linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Nothing spectacular from the irq department this time:
- overhaul of the crossbar chip driver
- overhaul of the spear shirq chip driver
- support for the atmel-aic chip
- code move from arch to drivers
- the usual tiny fixlets
- two reverts worth to mention which undo the too simple attempt of
supporting wakeup interrupts on shared interrupt lines"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
Revert "irq: Warn when shared interrupts do not match on NO_SUSPEND"
Revert "PM / sleep / irq: Do not suspend wakeup interrupts"
irq: Warn when shared interrupts do not match on NO_SUSPEND
irqchip: atmel-aic: Define irq fixups for atmel SoCs
irqchip: atmel-aic: Implement RTC irq fixup
irqchip: atmel-aic: Add irq fixup infrastructure
irqchip: atmel-aic: Add atmel AIC/AIC5 drivers
irqchip: atmel-aic: Move binding doc to interrupt-controller directory
genirq: generic chip: Export irq_map_generic_chip function
PM / sleep / irq: Do not suspend wakeup interrupts
irqchip: or1k-pic: Migrate from arch/openrisc/
irqchip: crossbar: Allow for quirky hardware with direct hardwiring of GIC
documentation: dt: omap: crossbar: Add description for interrupt consumer
irqchip: crossbar: Introduce centralized check for crossbar write
irqchip: crossbar: Introduce ti, max-crossbar-sources to identify valid crossbar mapping
irqchip: crossbar: Add kerneldoc for crossbar_domain_unmap callback
irqchip: crossbar: Set cb pointer to null in case of error
irqchip: crossbar: Change the goto naming
irqchip: crossbar: Return proper error value
irqchip: crossbar: Fix kerneldoc warning
...
Thomas Gleixner [Tue, 5 Aug 2014 20:57:19 +0000 (22:57 +0200)]
x86: MCE: Add raw_lock conversion again
Commit
ea431643d6c3 ("x86/mce: Fix CMCI preemption bugs") breaks RT by
the completely unrelated conversion of the cmci_discover_lock to a
regular (non raw) spinlock. This lock was annotated in commit
59d958d2c7de ("locking, x86: mce: Annotate cmci_discover_lock as raw")
with a proper explanation why.
The argument for converting the lock back to a regular spinlock was:
- it does percpu ops without disabling preemption. Preemption is not
disabled due to the mistaken use of a raw spinlock.
Which is complete nonsense. The raw_spinlock is disabling preemption in
the same way as a regular spinlock. In mainline spinlock maps to
raw_spinlock, in RT spinlock becomes a "sleeping" lock.
raw_spinlock has on RT exactly the same semantics as in mainline. And
because this lock is taken in non preemptible context it must be raw on
RT.
Undo the locking brainfart.
Reported-by: Clark Williams <williams@redhat.com>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Anish Bhatt [Tue, 5 Aug 2014 23:05:23 +0000 (16:05 -0700)]
cxgb4 : Disable recursive mailbox commands when enabling vi
Enabling a Virtual Interface can result in an interrupt during the processing
of the VI Enable command and, in some paths, result in an attempt to issue
another command in the interrupt context, eventually crashing the system. Thus,
we disable interrupts during the course of the VI Enable command and ensure
enable doesn't sleep.
Signed-off-by: Anish Bhatt <anish@chelsio.com>
Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Francois Romieu [Tue, 5 Aug 2014 21:10:52 +0000 (23:10 +0200)]
net: reduce USB network driver config options.
USB network drivers are already handled in drivers/net/usb/Kconfig.
Let's save the maintenance burden of dependencies in drivers/net/Makefile.
The newly introduced USB_NET_DRIVERS umbrella config option defaults
to 'y' so as to minimize the changes of behavior.
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prashant Sreedharan [Tue, 5 Aug 2014 23:02:02 +0000 (16:02 -0700)]
tg3: Modify tg3_tso_bug() to handle multiple TX rings
tg3_tso_bug() was originally designed to handle only HW TX ring 0, Commit
d3f6f3a1d818410c17445bce4f4caab52eb102f1 ("tg3: Prevent page allocation failure
during TSO workaround") changed the driver logic to use tg3_tso_bug() for all
HW TX rings that are enabled. This patch fixes the regression by modifying
tg3_tso_bug() to handle multiple HW TX rings.
Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 5 Aug 2014 23:47:07 +0000 (16:47 -0700)]
Merge branch 'amd-xgbe'
Tom Lendacky says:
====================
amd-xgbe: AMD XGBE driver update 2014-08-05
The following series of patches includes fixes/updates to the driver.
- Use dma_set_mask_and_coherent to set the DMA mask
- Move the phy connect/disconnect logic to allow for module unloading
Changes in V2:
- Check the return value of the dma_set_mask_and_coherent call
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 5 Aug 2014 18:30:44 +0000 (13:30 -0500)]
amd-xgbe: Perform phy connect/disconnect at dev open/stop
A change added to the mdiobus/phy api added a module_get/module_put
during phy connect/disconnect processing. Currently, the driver
performs a phy connect during module probe and a phy disconnect during
module remove. With the addition of the module_get during phy connect
the amd-xgbe module use count is incremented and can no longer be
unloaded.
Move the phy connect/disconnect from the driver probe/remove functions
to the net_device_ops ndo_open/ndo_stop functions. This allows the
module use count to be decremented when the device(s) are brought down
and allows the module to be unloaded.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 5 Aug 2014 18:30:38 +0000 (13:30 -0500)]
amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
Use the dma_set_mask_and_coherent function to set the DMA mask rather
than setting the DMA mask fields directly. This was originally done
to work around a bug in the arm64 DMA support when RAM started above
the 4GB boundary which has since been fixed.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marc Zyngier [Tue, 5 Aug 2014 15:44:39 +0000 (16:44 +0100)]
net: sun4i-emac: fix memory leak on bad packet
Upon reception of a new frame, the emac driver checks for a number
of error conditions, and flag the packet as "bad" if any of these
are present. It then allocates a skb unconditionally, but only uses
it if the packet is "good". On the error path, the skb is just forgotten,
and the system leaks memory.
The piece of junk I have on my desk seems to encounter such error
frequently enough so that the box goes OOM after a couple of days,
which makes me grumpy.
Fix this by moving the allocation on the "good_packet" path (and
convert it to netdev_alloc_skb while we're at it).
Tested on a random Allwinner A20 board.
Cc: Stefan Roese <sr@denx.de>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Cc: <stable@vger.kernel.org> # 3.11+
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 5 Aug 2014 14:49:52 +0000 (16:49 +0200)]
sctp: fix possible seqlock seadlock in sctp_packet_transmit()
Dave reported following splat, caused by improper use of
IP_INC_STATS_BH() in process context.
BUG: using __this_cpu_add() in preemptible [
00000000] code: trinity-c117/14551
caller is __this_cpu_preempt_check+0x13/0x20
CPU: 3 PID: 14551 Comm: trinity-c117 Not tainted 3.16.0+ #33
ffffffff9ec898f0 0000000047ea7e23 ffff88022d32f7f0 ffffffff9e7ee207
0000000000000003 ffff88022d32f818 ffffffff9e397eaa ffff88023ee70b40
ffff88022d32f970 ffff8801c026d580 ffff88022d32f828 ffffffff9e397ee3
Call Trace:
[<
ffffffff9e7ee207>] dump_stack+0x4e/0x7a
[<
ffffffff9e397eaa>] check_preemption_disabled+0xfa/0x100
[<
ffffffff9e397ee3>] __this_cpu_preempt_check+0x13/0x20
[<
ffffffffc0839872>] sctp_packet_transmit+0x692/0x710 [sctp]
[<
ffffffffc082a7f2>] sctp_outq_flush+0x2a2/0xc30 [sctp]
[<
ffffffff9e0d985c>] ? mark_held_locks+0x7c/0xb0
[<
ffffffff9e7f8c6d>] ? _raw_spin_unlock_irqrestore+0x5d/0x80
[<
ffffffffc082b99a>] sctp_outq_uncork+0x1a/0x20 [sctp]
[<
ffffffffc081e112>] sctp_cmd_interpreter.isra.23+0x1142/0x13f0 [sctp]
[<
ffffffffc081c86b>] sctp_do_sm+0xdb/0x330 [sctp]
[<
ffffffff9e0b8f1b>] ? preempt_count_sub+0xab/0x100
[<
ffffffffc083b350>] ? sctp_cname+0x70/0x70 [sctp]
[<
ffffffffc08389ca>] sctp_primitive_ASSOCIATE+0x3a/0x50 [sctp]
[<
ffffffffc083358f>] sctp_sendmsg+0x88f/0xe30 [sctp]
[<
ffffffff9e0d673a>] ? lock_release_holdtime.part.28+0x9a/0x160
[<
ffffffff9e0d62ce>] ? put_lock_stats.isra.27+0xe/0x30
[<
ffffffff9e73b624>] inet_sendmsg+0x104/0x220
[<
ffffffff9e73b525>] ? inet_sendmsg+0x5/0x220
[<
ffffffff9e68ac4e>] sock_sendmsg+0x9e/0xe0
[<
ffffffff9e1c0c09>] ? might_fault+0xb9/0xc0
[<
ffffffff9e1c0bae>] ? might_fault+0x5e/0xc0
[<
ffffffff9e68b234>] SYSC_sendto+0x124/0x1c0
[<
ffffffff9e0136b0>] ? syscall_trace_enter+0x250/0x330
[<
ffffffff9e68c3ce>] SyS_sendto+0xe/0x10
[<
ffffffff9e7f9be4>] tracesys+0xdd/0xe2
This is a followup of commits
f1d8cba61c3c4b ("inet: fix possible
seqlock deadlocks") and
7f88c6b23afbd315 ("ipv6: fix possible seqlock
deadlock in ip6_finish_output2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fabio Estevam [Tue, 5 Aug 2014 11:13:42 +0000 (08:13 -0300)]
Revert "net: phy: Set the driver when registering an MDIO bus device"
Commit
a71e3c37960ce5f9 ("net: phy: Set the driver when registering an MDIO bus
device") caused the following regression on the fec driver:
root@imx6qsabresd:~# echo mem > /sys/power/state
PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.003 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.002 seconds) done.
Unable to handle kernel NULL pointer dereference at virtual address
0000002c
pgd =
bcd14000
[
0000002c] *pgd=
4d9e0831, *pte=
00000000, *ppte=
00000000
Internal error: Oops: 17 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 617 Comm: sh Not tainted 3.16.0 #17
task:
bc0c4e00 ti:
bceb6000 task.ti:
bceb6000
PC is at fec_suspend+0x10/0x70
LR is at dpm_run_callback.isra.7+0x34/0x6c
pc : [<
803f8a98>] lr : [<
80361f44>] psr:
600f0013
sp :
bceb7d70 ip :
bceb7d88 fp :
bceb7d84
r10:
8091523c r9 :
00000000 r8 :
bd88f478
r7 :
803f8a88 r6 :
81165988 r5 :
00000000 r4 :
00000000
r3 :
00000000 r2 :
00000000 r1 :
bd88f478 r0 :
bd88f478
Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control:
10c5387d Table:
4cd1404a DAC:
00000015
Process sh (pid: 617, stack limit = 0xbceb6240)
Stack: (0xbceb7d70 to 0xbceb8000)
....
The problem with the original commit is explained by Russell King:
"It has the effect (as can be seen from the oops) of attaching the MDIO bus
device (itself is a bus-less device) to the platform driver, which means
that if the platform driver supports power management, it will be called
to power manage the MDIO bus device.
Moreover, drivers do not expect to be called for power management
operations for devices which they haven't probed, and certainly not for
devices which aren't part of the same bus that the driver is registered
against."
This reverts commit
a71e3c37960ce5f9c6a519bc1215e3ba9fa83e75.
Cc: <stable@vger.kernel.org> #3.16
Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Tue, 5 Aug 2014 10:11:28 +0000 (15:41 +0530)]
cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
Need to turn off SGE RX/TX Callback Timers & interrupt in cxgb4vf PCI Shutdown
routine in order to prevent crashes during reboot/poweroff when traffic is
running.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 5 Aug 2014 23:39:47 +0000 (16:39 -0700)]
Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
pull request: batman-adv 2014-08-05
this is a pull request intended for net-next/linux-3.17 (yeah..it's really
late).
Patches 1, 2 and 4 are really minor changes:
- kmalloc_array is substituted to kmalloc when possible (as suggested by
checkpatch);
- net_ratelimited() is now used properly and the "suppressed" message is not
printed anymore if not needed;
- the internal version number has been increased to reflect our current version.
Patch 3 instead is introducing a change in the metric computation function
by changing the penalty applied at each mesh hop from 15/255 (~6%) to
30/255 (~11%). This change is introduced by Simon Wunderlich after having
observed a performance improvement in several networks when using the new value.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Toshiaki Makita [Tue, 5 Aug 2014 06:58:50 +0000 (15:58 +0900)]
team: Simplify return path of team_newlink
The variable "err" is not necessary.
Return register_netdevice() directly.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Toshiaki Makita [Tue, 5 Aug 2014 06:57:15 +0000 (15:57 +0900)]
bridge: Update outdated comment on promiscuous mode
Now bridge ports can be non-promiscuous, vlan_vid_add() is no longer an
unnecessary operation.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Tue, 5 Aug 2014 23:36:30 +0000 (16:36 -0700)]
Merge branch 'v4l_for_linus' of git://git./linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:
- removal of sn9c102. This device driver was replaced a long time ago
by gspca
- solo6x10 and go7007 webcam drivers moved from staging into
mainstream. They were waiting for an API to allow setting the image
detection matrix
- SDR drivers moved from staging into mainstream: sdr-msi3101 (renamed
as msi2500) and rtl2832
- added SDR driver for airspy
- added demux driver: si2165
- rework at several RC subsystem, making the code for RC-5 SZ variant
to be added at the standard RC5 decoder
- added decoder for the XMP IR protocol
- tuner driver moved from staging into mainstream: msi3101 (renamed as
msi001)
- added documentation for some additional SDR pixfmt
- some device tree bindings documented
- added support for exynos3250 at s5p-jpeg
- remove the obsolete, unmaintained and broken mx1_camera driver
- added support for remote controllers at au0828 driver
- added a RC driver: sunxi-cir
- several driver fixes, enhancements and cleanups.
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (455 commits)
[media] cx23885: fix UNSET/TUNER_ABSENT confusion
[media] coda: fix build error by making reset control optional
[media] radio-miropcm20: fix sparse NULL pointer warning
[media] MAINTAINERS: Update go7007 pattern
[media] MAINTAINERS: Update solo6x10 patterns
[media] media: atmel-isi: add primary DT support
[media] media: atmel-isi: convert the pdata from pointer to structure
[media] media: atmel-isi: add v4l2 async probe support
[media] rcar_vin: add devicetree support
[media] media: pxa_camera device-tree support
[media] media: mt9m111: add device-tree suppport
[media] soc_camera: add support for dt binding soc_camera drivers
[media] media: soc_camera: pxa_camera documentation device-tree support
[media] media: mt9m111: add device-tree documentation
[media] s5p-mfc: remove unnecessary calling to function video_devdata()
[media] s5p-jpeg: add chroma subsampling adjustment for Exynos3250
[media] s5p-jpeg: Prevent erroneous downscaling for Exynos3250 SoC
[media] s5p-jpeg: Assure proper crop rectangle initialization
[media] s5p-jpeg: fix g_selection op
[media] s5p-jpeg: Adjust jpeg_bound_align_image to Exynos3250 needs
...
David S. Miller [Tue, 5 Aug 2014 23:36:01 +0000 (16:36 -0700)]
Merge branch 'net-timestamp-next'
Willem de Bruijn says:
====================
net-timestamp: new tx tstamps and tcp
Extend socket tx timestamping:
- allow multiple types of software timestamps aside from send (1)
- add software timestamp on enter packet scheduling (4)
- add software timestamp for TCP (5)
- add software timestamp for TCP on ACK (6)
The sk_flags option space is nearly exhausted. Also move the
many timestamp options to a new sk->sk_tstamps (2).
To disambiguate data when tstamps may arrive out of order,
optionally return a sequential ID assigned at send (3).
Extend Linux tx timestamping to monitoring of latency
incurred within the kernel stack and to protocols embedded in TCP.
Complex kernel setups may have multiple layers of queueing, including
multiple instances of packet scheduling, and many classes per layer.
Many applications embed discrete payloads into TCP bytestreams for
reliability, flow control, etcetera. Detecting application tail
latency in such scenarios relies on identifying the exact queue
responsible if on the host, or the network latency if otherwise.
Changelog:
v4->v5
- define SCM_TSTAMP_SND == 0, for legacy behavior
- add TCP tstamps without changing the generated byte stream
- modify GSO and ACK to find offset: slightly more complex
than previous invariant that it is the last byte
- consistent naming of packet scheduling
- rename SCM_TSTAMP_ENQ to SCM_TSTAMP_SCHED
- add unique key in ee_data
- add id field in ee_info to disambiguate tstamps
- optional, only on new flag SOF_TIMESTAMPING_OPT_ID
- for bytestream, in bytes
v3->v4
- (v3 review comment) removed skb->mark packet identification (*A)
- (v3 review comment) fixed indentation
- tcp: fixed poll() to return POLLERR on non-zero queue
- rebased to work without syststamp
- comments: removed all traces of MSG_TSTAMP_.. (*B)
v2->v3
- extend the SO_TIMESTAMPING API, instead of defining a new one.
- add protocol independent support to correlate tstamps with data,
based on returning skb->mark.
- removed no-payload optimization and documentation (for now):
I have a follow-on patch that reintroduces MSG_TSTAMP along with a
new socket option SOF_TIMESTAMPING_OPT_ONFLAG. This is equivalent
to sequence setsockopt(<enable>); send(..); setsockopt(<disable>),
but avoids the need to define a MSG_TSTAMP_<TYPE> for each type.
I will leave these three patches as follow-on, as this patchset is
large enough as is.
v1->v2
- expand timestamping (existing and new) to SOCK_RAW and ping sockets
- rename sock_errqueue_timestamping to scm_timestamping
- change timestamp data format: do not add fields to scm_timestamping.
Doing so could break legacy applications. Instead, communicate
through an existing, but unused, field in the error message.
- rename SOF_.._OPT_TX_NO_PAYLOAD to shorter SOF_.._OPT_TSONLY
- move msg_tstamp test app out of patchset and to github
git://github.com/wdebruij/kerneltools.git
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Tue, 5 Aug 2014 02:11:50 +0000 (22:11 -0400)]
net-timestamp: ACK timestamp for bytestreams
Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte
in the send() call is acknowledged. It implements the feature for TCP.
The timestamp is generated when the TCP socket cumulative ACK is moved
beyond the tracked seqno for the first time. The feature ignores SACK
and FACK, because those acknowledge the specific byte, but not
necessarily the entire contents of the buffer up to that byte.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Tue, 5 Aug 2014 02:11:49 +0000 (22:11 -0400)]
net-timestamp: TCP timestamping
TCP timestamping extends SO_TIMESTAMPING to bytestreams.
Bytestreams do not have a 1:1 relationship between send() buffers and
network packets. The feature interprets a send call on a bytestream as
a request for a timestamp for the last byte in that send() buffer.
The choice corresponds to a request for a timestamp when all bytes in
the buffer have been sent. That assumption depends on in-order kernel
transmission. This is the common case. That said, it is possible to
construct a traffic shaping tree that would result in reordering.
The guarantee is strong, then, but not ironclad.
This implementation supports send and sendpages (splice). GSO replaces
one large packet with multiple smaller packets. This patch also copies
the option into the correct smaller packet.
This patch does not yet support timestamping on data in an initial TCP
Fast Open SYN, because that takes a very different data path.
If ID generation in ee_data is enabled, bytestream timestamps return a
byte offset, instead of the packet counter for datagrams.
The implementation supports a single timestamp per packet. It silenty
replaces requests for previous timestamps. To avoid missing tstamps,
flush the tcp queue by disabling Nagle, cork and autocork. Missing
tstamps can be detected by offset when the ee_data ID is enabled.
Implementation details:
- On GSO, the timestamping code can be included in the main loop. I
moved it into its own loop to reduce the impact on the common case
to a single branch.
- To avoid leaking the absolute seqno to userspace, the offset
returned in ee_data must always be relative. It is an offset between
an skb and sk field. The first is always set (also for GSO & ACK).
The second must also never be uninitialized. Only allow the ID
option on sockets in the ESTABLISHED state, for which the seqno
is available. Never reset it to zero (instead, move it to the
current seqno when reenabling the option).
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Tue, 5 Aug 2014 02:11:48 +0000 (22:11 -0400)]
net-timestamp: SCHED timestamp on entering packet scheduler
Kernel transmit latency is often incurred in the packet scheduler.
Introduce a new timestamp on transmission just before entering the
scheduler. When data travels through multiple devices (bonding,
tunneling, ...) each device will export an individual timestamp.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Tue, 5 Aug 2014 02:11:47 +0000 (22:11 -0400)]
net-timestamp: add key to disambiguate concurrent datagrams
Datagrams timestamped on transmission can coexist in the kernel stack
and be reordered in packet scheduling. When reading looped datagrams
from the socket error queue it is not always possible to unique
correlate looped data with original send() call (for application
level retransmits). Even if possible, it may be expensive and complex,
requiring packet inspection.
Introduce a data-independent ID mechanism to associate timestamps with
send calls. Pass an ID alongside the timestamp in field ee_data of
sock_extended_err.
The ID is a simple 32 bit unsigned int that is associated with the
socket and incremented on each send() call for which software tx
timestamp generation is enabled.
The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
avoid changing ee_data for existing applications that expect it 0.
The counter is reset each time the flag is reenabled. Reenabling
does not change the ID of already submitted data. It is possible
to receive out of order IDs if the timestamp stream is not quiesced
first.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Tue, 5 Aug 2014 02:11:46 +0000 (22:11 -0400)]
net-timestamp: move timestamp flags out of sk_flags
sk_flags is reaching its limit. New timestamping options will not fit.
Move all of them into a new field sk->sk_tsflags.
Added benefit is that this removes boilerplate code to convert between
SOF_TIMESTAMPING_.. and SOCK_TIMESTAMPING_.. in getsockopt/setsockopt.
SOCK_TIMESTAMPING_RX_SOFTWARE is also used to toggle the receive
timestamp logic (netstamp_needed). That can be simplified and this
last key removed, but will leave that for a separate patch.
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
The u16 in sock can be moved into a 16-bit hole below sk_gso_max_segs,
though that scatters tstamp fields throughout the struct.
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Tue, 5 Aug 2014 02:11:45 +0000 (22:11 -0400)]
net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
Applications that request kernel tx timestamps with SO_TIMESTAMPING
read timestamps as recvmsg() ancillary data. The response is defined
implicitly as timespec[3].
1) define struct scm_timestamping explicitly and
2) add support for new tstamp types. On tx, scm_timestamping always
accompanies a sock_extended_err. Define previously unused field
ee_info to signal the type of ts[0]. Introduce SCM_TSTAMP_SND to
define the existing behavior.
The reception path is not modified. On rx, no struct similar to
sock_extended_err is passed along with SCM_TIMESTAMPING.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>