Amerigo Wang [Fri, 14 Dec 2012 22:09:50 +0000 (22:09 +0000)]
bridge: update selinux perm table for RTM_NEWMDB and RTM_DELMDB
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neil Horman [Fri, 14 Dec 2012 15:22:01 +0000 (15:22 +0000)]
sctp: Change defaults on cookie hmac selection
Recently I posted commit 3c68198e75 which made selection of the cookie hmac
algorithm selectable. This is all well and good, but Linus noted that it
changes the default config:
http://marc.info/?l=linux-netdev&m=135536629004808&w=2
I've modified the sctp Kconfig file to reflect the recommended way of making
this choice, using the thermal driver example specified, and brought the
defaults back into line with the way they were prior to my original patch.
Also, on Linus' suggestion, re-adding ability to select default 'none' hmac
algorithm, so we don't needlessly bloat the kernel by forcing a non-none
default. This also led me to note that we won't honor the default none
condition properly because of how sctp_net_init is encoded. Fix that up as
well.
Tested by myself (albeit fairly quickly). All configuration combinations seem
to work soundly.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: linux-sctp@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Williams [Fri, 14 Dec 2012 13:10:50 +0000 (13:10 +0000)]
i2400m: add Intel 6150 device IDs
Add device IDs for WiMAX function of Intel 6150 cards.
Signed-off-by: Dan Williams <dcbw@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Marc Kleine-Budde [Fri, 14 Dec 2012 12:25:12 +0000 (12:25 +0000)]
can: sja1000: fix compilation on x86
Since commit 04df251 (can: sja1000: Make sja1000_of_platform selectable and
compilable on SPARC), the driver can be activated on non-powerpc platforms
like x86 or sparc. Without this patch the driver fails to compile on
platforms that don't define NO_IRQ, like x86.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Lindgren [Thu, 13 Dec 2012 11:36:41 +0000 (11:36 +0000)]
cpts: Fix build error caused by include of plat/clock.h
Commit 87c0e764 (cpts: introduce time stamping code and a PTP hardware clock)
mistakenly included plat/clock.h that should not be included by drivers
even if it exists.
Otherwise we get the following error with at least omap2plus_defconfig:
drivers/net/ethernet/ti/cpts.c:30:24: error: plat/clock.h: No such file or directory
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christoph Paasch [Fri, 14 Dec 2012 04:07:58 +0000 (04:07 +0000)]
inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and dccp_v4/6_request_recv_sock
If in either of the above functions inet_csk_route_child_sock() or
__inet_inherit_port() fails, the newsk will not be freed:
unreferenced object 0xffff88022e8a92c0 (size 1592):
  comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
  hex dump (first 32 bytes):
    0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00 ................
    02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
  backtrace:
    [<ffffffff8153d190>] kmemleak_alloc+0x21/0x3e
    [<ffffffff810ab3e7>] kmem_cache_alloc+0xb5/0xc5
    [<ffffffff8149b65b>] sk_prot_alloc.isra.53+0x2b/0xcd
    [<ffffffff8149b784>] sk_clone_lock+0x16/0x21e
    [<ffffffff814d711a>] inet_csk_clone_lock+0x10/0x7b
    [<ffffffff814ebbc3>] tcp_create_openreq_child+0x21/0x481
    [<ffffffff814e8fa5>] tcp_v4_syn_recv_sock+0x3a/0x23b
    [<ffffffff814ec5ba>] tcp_check_req+0x29f/0x416
    [<ffffffff814e8e10>] tcp_v4_do_rcv+0x161/0x2bc
    [<ffffffff814eb917>] tcp_v4_rcv+0x6c9/0x701
    [<ffffffff814cea9f>] ip_local_deliver_finish+0x70/0xc4
    [<ffffffff814cec20>] ip_local_deliver+0x4e/0x7f
    [<ffffffff814ce9f8>] ip_rcv_finish+0x1fc/0x233
    [<ffffffff814cee68>] ip_rcv+0x217/0x267
    [<ffffffff814a7bbe>] __netif_receive_skb+0x49e/0x553
    [<ffffffff814a7cc3>] netif_receive_skb+0x50/0x82
This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus
a single sock_put() is not enough to free the memory. Additionally, things
like xfrm, memcg, cookie_values,... may have been initialized.
We have to free them properly.
This is fixed by forcing a call to tcp_done(), ending up in
inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
because it ends up doing all the cleanup on xfrm, memcg, cookie_values,...
Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
force it to enter inet_csk_destroy_sock. To avoid the warning in
inet_csk_destroy_sock, inet_num has to be set to 0.
As inet_csk_destroy_sock decrements orphan_count, we first have to
increase it.
Calling tcp_done() allows us to remove the calls to
tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().
A similar approach is taken for dccp by calling dccp_done().
This bug has been in the kernel since commit 093d282321 (tproxy: fix hash
locking issue when using port redirection in __inet_inherit_port()), i.e.
since version 2.6.37.
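The reference counting problem can be illustrated with a small userspace
model. This is only a sketch: struct toy_sock, toy_clone(), toy_put() and
toy_done() are hypothetical stand-ins invented for illustration, not the
kernel's sk_clone_lock(), sock_put() or tcp_done().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cloned socket; not the kernel's struct sock. */
struct toy_sock {
	int refcnt;
	bool dead;	/* models the SOCK_DEAD flag */
	bool released;	/* set when the count hits zero; caller frees memory */
};

/* Like the clone described above: the new object starts with two references. */
static struct toy_sock *toy_clone(void)
{
	struct toy_sock *sk = calloc(1, sizeof(*sk));

	if (sk)
		sk->refcnt = 2;
	return sk;
}

/* Drop one reference; mark the object released when the count reaches zero. */
static void toy_put(struct toy_sock *sk)
{
	assert(sk->refcnt > 0);
	if (--sk->refcnt == 0)
		sk->released = true;
}

/* Models the forced close: mark the socket dead, then drop the final reference. */
static void toy_done(struct toy_sock *sk)
{
	sk->dead = true;
	toy_put(sk);
}
```

In this model a single toy_put() after toy_clone() leaves refcnt at 1, so the
object is never released; only the additional toy_done() brings it to zero.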
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Duan Jiong [Fri, 14 Dec 2012 02:59:59 +0000 (02:59 +0000)]
ipv6: Change skb->data before using icmpv6_notify() to propagate redirect
In function ndisc_redirect_rcv(), skb->data points to the transport
header, but function icmpv6_notify() needs skb->data to point to the
inner IP packet. So before using icmpv6_notify() to propagate the redirect,
change skb->data to point to the inner IP packet that triggered the sending
of the Redirect, and introduce struct rd_msg to make this easy.
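The pointer adjustment can be sketched in userspace as follows. struct toy_buf
and toy_pull() are hypothetical names, not the kernel's sk_buff API. The two
sizes follow the Redirect message and Redirected Header option layouts in
RFC 4861, and the sketch assumes the Redirected Header is the first option.

```c
#include <assert.h>
#include <stddef.h>

/* Fixed part of a Redirect: 8 bytes of ICMPv6 header plus two 16-byte addresses. */
#define TOY_RD_FIXED_LEN	40
/* Redirected Header option: type, length and 6 reserved bytes before the packet. */
#define TOY_RD_OPT_HDR_LEN	8

/* Hypothetical stand-in for a packet buffer; not the kernel's struct sk_buff. */
struct toy_buf {
	const unsigned char *data;	/* current parse position */
	size_t len;			/* bytes remaining from data */
};

/* Advance data by n bytes; returns the new position, or NULL if n is too large. */
static const unsigned char *toy_pull(struct toy_buf *b, size_t n)
{
	if (n > b->len)
		return NULL;
	b->data += n;
	b->len -= n;
	return b->data;
}
```

Pulling TOY_RD_FIXED_LEN + TOY_RD_OPT_HDR_LEN bytes from a buffer that starts
at the ICMPv6 header leaves data at the first byte of the inner IP packet.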
Signed-off-by: Duan Jiong <djduanjiong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Konstantin Khlebnikov [Fri, 14 Dec 2012 01:03:03 +0000 (01:03 +0000)]
mac802154: fix destruction ordering for ieee802154 devices
mutex_destroy() must be called before wpan_phy_free(), because the latter puts
the last reference and frees the memory. Caught as overwritten poison in
kmalloc-2048.
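The ordering constraint can be modelled in userspace. All names below
(toy_mutex, toy_phy and the helpers) are hypothetical and only illustrate the
idea that destroying an embedded lock writes into the object, so it has to
happen before the last reference is dropped.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MUTEX_MAGIC 0x4d555458u

/* Hypothetical lock; destroying it writes into the memory that holds it. */
struct toy_mutex {
	unsigned int magic;
};

/* Hypothetical refcounted device object with an embedded lock. */
struct toy_phy {
	struct toy_mutex lock;
	int refs;
};

static struct toy_phy *toy_phy_alloc(void)
{
	struct toy_phy *phy = calloc(1, sizeof(*phy));

	if (phy) {
		phy->lock.magic = TOY_MUTEX_MAGIC;
		phy->refs = 1;
	}
	return phy;
}

/* Invalidate the lock. This is a store into *m, so m must still be allocated. */
static void toy_mutex_destroy(struct toy_mutex *m)
{
	assert(m->magic == TOY_MUTEX_MAGIC);
	m->magic = 0;
}

/* Drop one reference; returns 1 if it was the last and the memory was freed. */
static int toy_phy_put(struct toy_phy *phy)
{
	assert(phy->refs > 0);
	if (--phy->refs > 0)
		return 0;
	free(phy);
	return 1;
}

/* Teardown in the safe order: destroy the lock, then drop the last reference. */
static int toy_phy_teardown(struct toy_phy *phy)
{
	toy_mutex_destroy(&phy->lock);	/* phy is still valid here */
	return toy_phy_put(phy);	/* phy may be freed from here on */
}
```

Swapping the two calls in toy_phy_teardown() would make toy_mutex_destroy()
store into freed memory, which is the kind of write a poisoning allocator
reports.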
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: linux-zigbee-devel@lists.sourceforge.net
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Konstantin Khlebnikov [Fri, 14 Dec 2012 01:02:55 +0000 (01:02 +0000)]
bonding: do not cancel works in bond_uninit()
Bonding initializes these works in bond_open() and cancels them in
bond_close(), thus in bond_uninit() they are already canceled but may not have
been initialized yet.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Nikolay Aleksandrov <nikolay@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Konstantin Khlebnikov [Fri, 14 Dec 2012 01:02:51 +0000 (01:02 +0000)]
stmmac: fix platform driver unregistering
This patch fixes unregistering of the platform device drivers and adds proper
error handling on module loading.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Konstantin Khlebnikov [Fri, 14 Dec 2012 01:02:36 +0000 (01:02 +0000)]
mISDN: fix race in timer canceling on module unloading
Using timer_pending() without additional synchronization is racy;
del_timer_sync() must be used here to wait for an in-flight handler.
Bug caught with help from "debug-objects" during random insmod/rmmod.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev <netdev@vger.kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Thu, 13 Dec 2012 23:53:30 +0000 (23:53 +0000)]
tuntap: fix ambiguous multiqueue API
The current multiqueue API is ambiguous, which may prevent both users and LSMs
from doing things correctly:
- Both TUNSETIFF and TUNSETQUEUE could be used to create the queues of a tuntap
device.
- TUNSETQUEUE was used to disable and enable a specific queue of the
device. But since the state of the tuntap device was completely removed from
the queue, it could be used to attach to another device (there's no such
requirement currently, and it would need a new kind of LSM policy).
- TUNSETQUEUE could be used to attach to a persistent device without any
queues. This kind of attaching bypasses the necessary checking during TUNSETIFF
and may lead to unexpected results.
So this patch tries to make a cleaner and simpler API by:
- Only allowing TUNSETIFF to create queues.
- TUNSETQUEUE can only be used to disable and enable the queues of a device,
and the state of the tuntap device is not detached from a queue when it
is disabled, so TUNSETQUEUE can only be used after TUNSETIFF and with the
same device.
This is done by introducing a list which keeps track of all queues which are
disabled. A queue is moved between this list and the tfiles[] array when it
is enabled/disabled. A pointer to the tun_struct is also introduced to track
the device a queue belongs to while it is disabled.
After the change, the isolation between management and application can be done
through: TUNSETIFF is only called by management software and TUNSETQUEUE is
only called by the application. For LSM/SELinux, what is left is to do proper
checks during tun_set_queue() if needed.
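The bookkeeping described above can be sketched as a toy data structure. The
names below (toy_tun, toy_queue and the toy_* helpers) are hypothetical and
use plain arrays; they are not the driver's actual types or list handling.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_QUEUES 8

struct toy_tun;

struct toy_queue {
	struct toy_tun *tun;	/* owning device, kept while the queue is disabled */
	int enabled;
};

struct toy_tun {
	struct toy_queue *tfiles[TOY_MAX_QUEUES];	/* enabled queues */
	int numqueues;
	struct toy_queue *disabled[TOY_MAX_QUEUES];	/* disabled queues */
	int numdisabled;
};

/* Remove q from arr[0..*n) by moving the last entry into its slot. */
static int toy_remove(struct toy_queue **arr, int *n, struct toy_queue *q)
{
	for (int i = 0; i < *n; i++) {
		if (arr[i] == q) {
			arr[i] = arr[--*n];
			return 0;
		}
	}
	return -1;
}

/* Models queue creation (the TUNSETIFF role): attach a fresh queue to tun. */
static int toy_create(struct toy_tun *tun, struct toy_queue *q)
{
	if (q->tun || tun->numqueues + tun->numdisabled >= TOY_MAX_QUEUES)
		return -1;
	q->tun = tun;
	q->enabled = 1;
	tun->tfiles[tun->numqueues++] = q;
	return 0;
}

/* Models disabling: q moves to the disabled list but keeps its q->tun pointer. */
static int toy_disable(struct toy_queue *q)
{
	struct toy_tun *tun = q->tun;

	if (!tun || !q->enabled || toy_remove(tun->tfiles, &tun->numqueues, q))
		return -1;
	tun->disabled[tun->numdisabled++] = q;
	q->enabled = 0;
	return 0;
}

/* Models enabling: only a queue already owned by a device can come back. */
static int toy_enable(struct toy_queue *q)
{
	struct toy_tun *tun = q->tun;

	if (!tun || q->enabled || toy_remove(tun->disabled, &tun->numdisabled, q))
		return -1;
	tun->tfiles[tun->numqueues++] = q;
	q->enabled = 1;
	return 0;
}
```

Because toy_enable() takes the device from q->tun, a queue that was never
created through toy_create() cannot be enabled, and a disabled queue can only
return to the device it came from.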
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ang Way Chuang [Thu, 13 Dec 2012 23:08:39 +0000 (23:08 +0000)]
bridge: remove temporary variable for MLDv2 maximum response code computation
As suggested by Stephen Hemminger, this removes the temporary variable
introduced in commit eca2a43bb0d2c6ebd528be6acb30a88435abe307
("bridge: fix icmpv6 endian bug and other sparse warnings").
Signed-off-by: Ang Way Chuang <wcang@sfc.wide.ad.jp>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 14 Dec 2012 15:20:43 +0000 (07:20 -0800)]
Revert "sched: Update_cfs_shares at period edge"
This reverts commit f269ae0469fc882332bdfb5db15d3c1315fe2a10.
It turns out it causes a very noticeable interactivity regression with
CONFIG_SCHED_AUTOGROUP (test-case: "make -j32" of the kernel in a
terminal window, while scrolling in a browser - the autogrouping means
that the two end up in separate cgroups, and the browser should be
smooth as silk despite the high load).
Says Paul Turner:
"It seems that the update-throttling on the wake-side is reducing the
interactive tasks' ability to preempt. While I suspect the right
longer term answer here is force these updates only in the
cross-cgroup case; this is less trivial. For this release I believe
the right answer is either going to be a revert or restore the updates
on the enqueue-side."
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Bisected-by: Mike Galbraith <efault@gmx.de>
Acked-by: Paul Turner <pjt@google.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 14 Dec 2012 03:26:04 +0000 (19:26 -0800)]
Merge tag 'for-v3.8-merged' of git://git.infradead.org/battery-2.6
Pull battery subsystem updates from Anton Vorontsov:
"Highlights:
- Two new drivers from Pali Rohár and N900 hackers: rx51_battery and
bq2415x_charger. The drivers are a part of a solution to replace
the proprietary Nokia BME stack
- Power supply core now registers devices with a thermal cooling
subsystem, so we can now automatically throttle charging. Thanks
to Ramakrishna Pallala!
- Device tree support for ab8500 and max8925_power drivers
- Random fixups and enhancements for a bunch of drivers."
* tag 'for-v3.8-merged' of git://git.infradead.org/battery-2.6: (22 commits)
max8925_power: Add support for device-tree initialization
ab8500: Add devicetree support for chargalg
ab8500: Add devicetree support for charger
ab8500: Add devicetree support for btemp
ab8500: Add devicetree support for fuelgauge
twl4030_charger: Change TWL4030_MODULE_* ids to TWL_MODULE_*
jz4740-battery: Use devm_request_and_ioremap
jz4740-battery: Use devm_kzalloc
bq27x00_battery: Fixup nominal available capacity reporting
bq2415x_charger: Fix style issues
bq2415x_charger: Add Kconfig/Makefile entries
power_supply: Add bq2415x charger driver
power_supply: Add new Nokia RX-51 (N900) power supply battery driver
max17042_battery: Fix missing verify_model_lock() return value check
ds2782_battery: Fix signedness bug in ds278x_read_reg16()
lp8788-charger: Fix ADC channel names
lp8788-charger: Fix wrong ADC conversion
lp8788-charger: Use consumer device name on setting IIO channels
power_supply: Register power supply for thermal cooling device
power_supply: Add support for CHARGE_CONTROL_* attributes
...
Linus Torvalds [Fri, 14 Dec 2012 03:22:22 +0000 (19:22 -0800)]
Merge branch 'v4l_for_linus' of git://git./linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:
- Missing MAINTAINERS entries were added for several drivers
- Adds V4L2 support for DMABUF handling, allowing zero-copy buffer
sharing between V4L2 devices and GPU
- Got rid of all warnings when compiling with W=1 on x86
- Add a new driver for Exynos hardware (s3c-camif)
- Several bug fixes, cleanups and driver improvements
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (243 commits)
[media] omap3isp: Replace cpu_is_omap3630() with ISP revision check
[media] omap3isp: Prepare/unprepare clocks before/after enable/disable
[media] omap3isp: preview: Add support for 8-bit formats at the sink pad
[media] omap3isp: Replace printk with dev_*
[media] omap3isp: Find source pad from external entity
[media] omap3isp: Configure CSI-2 phy based on platform data
[media] omap3isp: Add PHY routing configuration
[media] omap3isp: Add CSI configuration registers from control block to ISP resources
[media] omap3isp: Remove unneeded module memory address definitions
[media] omap3isp: Use monotonic timestamps for statistics buffers
[media] uvcvideo: Fix control value clamping for unsigned integer controls
[media] uvcvideo: Mark first output terminal as default video node
[media] uvcvideo: Add VIDIOC_[GS]_PRIORITY support
[media] uvcvideo: Return -ENOTTY for unsupported ioctls
[media] uvcvideo: Set device_caps in VIDIOC_QUERYCAP
[media] uvcvideo: Don't fail when an unsupported format is requested
[media] uvcvideo: Return -EACCES when trying to access a read/write-only control
[media] uvcvideo: Set error_idx properly for extended controls API failures
[media] rtl28xxu: add NOXON DAB/DAB+ USB dongle rev 2
[media] fc2580: write some registers conditionally
...
Linus Torvalds [Fri, 14 Dec 2012 03:20:31 +0000 (19:20 -0800)]
Merge tag 'scsi-misc' of git://git./linux/kernel/git/jejb/scsi
Pull first round of SCSI updates from James Bottomley:
"This patch set includes two large new drivers: mpt3sas (for the next
gen fusion SAS hardware) and csiostor a FCoE offload driver for the
Chelsio converged network cards (this includes some net changes which
I've OK'd with DaveM).
The rest of the patch is driver updates (qla2xxx, lpfc, hptiop,
be2iscsi) plus a few assorted updates and bug fixes.
We also have a Power Management rework in the Upper Layer Drivers
preparatory to doing ACPI zero power optical devices, but the actual
enabler is still being worked on.
Signed-off-by: James Bottomley <JBottomley@Parallels.com>"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (72 commits)
[SCSI] mpt3sas: add new driver supporting 12GB SAS
[SCSI] scsi_transport_sas: add 12GB definitions for mpt3sas
[SCSI] miscdevice: Adding support for MPT3SAS_MINOR(222)
[SCSI] csiostor: remove unneeded memset()
[SCSI] csiostor: Fix sparse warnings.
[SCSI] qla2xxx: Display that driver is operating in legacy interrupt mode.
[SCSI] qla2xxx: Dont clear drv active on iospace config failure.
[SCSI] qla2xxx: Fix typo in qla2xxx driver.
[SCSI] qla2xxx: Update ql2xextended_error_logging parameter description with new option.
[SCSI] qla2xxx: Parameterize the link speed of hba rather than fcport.
[SCSI] qla2xxx: Add 16Gb/s case to get port speed capability.
[SCSI] qla2xxx: Move marking fcport online ahead of setting iiDMA speed.
[SCSI] qla2xxx: Add acquiring of risc semaphore before doing ISP reset.
[SCSI] qla2xxx: Ignore driver ack bit if corresponding presence bit is not set.
[SCSI] qla2xxx: Fix typo in qla83xx_fw_dump function.
[SCSI] qla2xxx: Add Gen3 PCIe speed 8GT/s to the log message.
[SCSI] qla2xxx: Use correct Request-Q-Out register during bidirectional request processing
[SCSI] qla2xxx: Move noisy Start scsi failed messages to verbose logging level.
[SCSI] qla2xxx: Fix coccinelle warnings in qla2x00_relogin.
[SCSI] qla2xxx: No fcport FC-4 type assignment in GA_NXT response.
...
Linus Torvalds [Fri, 14 Dec 2012 03:19:09 +0000 (19:19 -0800)]
Merge tag 'rdma-for-linus' of git://git./linux/kernel/git/roland/infiniband
Pull infiniband update from Roland Dreier:
"First batch of InfiniBand/RDMA changes for the 3.8 merge window:
- A good chunk of Bart Van Assche's SRP fixes
- UAPI disintegration from David Howells
- mlx4 support for "64-byte CQE" hardware feature from Or Gerlitz
- Other miscellaneous fixes"
Fix up trivial conflict in mellanox/mlx4 driver.
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (33 commits)
RDMA/nes: Fix for crash when registering zero length MR for CQ
RDMA/nes: Fix for terminate timer crash
RDMA/nes: Fix for BUG_ON due to adding already-pending timer
IB/srp: Allow SRP disconnect through sysfs
srp_transport: Document sysfs attributes
srp_transport: Simplify attribute initialization code
srp_transport: Fix attribute registration
IB/srp: Document sysfs attributes
IB/srp: send disconnect request without waiting for CM timewait exit
IB/srp: destroy and recreate QP and CQs when reconnecting
IB/srp: Eliminate state SRP_TARGET_DEAD
IB/srp: Introduce the helper function srp_remove_target()
IB/srp: Suppress superfluous error messages
IB/srp: Process all error completions
IB/srp: Introduce srp_handle_qp_err()
IB/srp: Simplify SCSI error handling
IB/srp: Keep processing commands during host removal
IB/srp: Eliminate state SRP_TARGET_CONNECTING
IB/srp: Increase block layer timeout
RDMA/cm: Change return value from find_gid_port()
...
Linus Torvalds [Fri, 14 Dec 2012 03:15:11 +0000 (19:15 -0800)]
Merge tag 'spi-for-linus' of git://git.secretlab.ca/git/linux-2.6
Pull SPI updates from Grant Likely:
"Primarily SPI device driver bug fixes, one removal of an old driver,
and some new tegra support. There is some core code change too, but
all in all pretty small stuff.
The new features to note are:
- Common code for describing GPIO CS lines in the device tree
- Remove the SPI_BUFSIZ limitation on spi_write_then_read()
- core spi ensures bits_per_word is set correctly
- SPARC can now use SPI"
* tag 'spi-for-linus' of git://git.secretlab.ca/git/linux-2.6: (36 commits)
spi/sparc: Allow of_register_spi_devices for sparc
spi: Remove HOTPLUG section attributes
spi: Add support for specifying 3-wire mode via device tree
spi: Fix comparison of different integer types
spi/orion: Add SPI_CHPA and SPI_CPOL support to kirkwood driver.
spi/sh: Add SH Mobile series as dependency to MSIOF controller
spi/sh-msiof: Remove unneeded clock name
spi: Remove SPI_BUFSIZ restriction on spi_write_then_read()
spi/stmp: remove obsolete driver
spi/clps711x: New SPI master driver
spi: omap2-mcspi: remove duplicate inclusion of linux/err.h
spi: omap2-mcspi: Fix the redifine warning
spi/sh-hspi: add CS manual control support
of_spi: add generic binding support to specify cs gpio
spi: omap2-mcspi: remove duplicated include from spi-omap2-mcspi.c
spi/bitbang: (cosmetic) simplify list manipulation
spi/bitbang: avoid needless loop flow manipulations
spi/omap: fix D0/D1 direction confusion
spi: tegra: add spi driver for sflash controller
spi: Dont call master->setup if not populated
...
Linus Torvalds [Fri, 14 Dec 2012 03:13:37 +0000 (19:13 -0800)]
Merge branch 'autofs' (patches from Ian Kent)
Merge emailed autofs cleanup/fix patches from Ian Kent
* autofs:
autofs4 - use simple_empty() for empty directory check
autofs4 - dont clear DCACHE_NEED_AUTOMOUNT on rootless mount
Ian Kent [Fri, 14 Dec 2012 02:23:29 +0000 (10:23 +0800)]
autofs4 - use simple_empty() for empty directory check
For direct (and offset) mounts, if an automounted mount is manually
umounted the trigger mount dentry can appear non-empty causing it to
not trigger mounts. This can also happen if there is a file handle
leak in a user space automounting application.
This happens because, when an ioctl control file handle is opened
on the mount, a cursor dentry is created which causes list_empty()
to see the dentry as non-empty. Since there is a case where listing
the directory of these dentrys is needed, the use of dcache_dir_*()
functions for .open() and .release() is needed.
Consequently simple_empty() must be used instead of list_empty()
when checking for an empty directory.
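The difference between the two checks can be modelled with a toy directory.
The types and helpers below are hypothetical, and the sketch assumes the
cursor behaves as a child entry with no real object behind it, so a check
that only counts real entries ignores it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical directory child; not the kernel's struct dentry. */
struct toy_child {
	bool positive;	/* true for a real entry, false for a cursor placeholder */
};

struct toy_dir {
	const struct toy_child *children;
	size_t nchildren;
};

/* Models a bare list check: any child at all, cursor included, counts. */
static bool toy_list_empty(const struct toy_dir *d)
{
	return d->nchildren == 0;
}

/* Models a check that walks the children and only counts real entries. */
static bool toy_simple_empty(const struct toy_dir *d)
{
	for (size_t i = 0; i < d->nchildren; i++)
		if (d->children[i].positive)
			return false;
	return true;
}
```

With a single cursor child, toy_list_empty() reports the directory as
non-empty while toy_simple_empty() still reports it as empty.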
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ian Kent [Fri, 14 Dec 2012 02:23:23 +0000 (10:23 +0800)]
autofs4 - dont clear DCACHE_NEED_AUTOMOUNT on rootless mount
The DCACHE_NEED_AUTOMOUNT flag is cleared on mount and set on expire
for autofs rootless multi-mount dentrys to prevent unnecessary calls
to ->d_automount().
Since DCACHE_MANAGE_TRANSIT is always set on autofs dentrys, ->d_manage()
is always called, so the check can be done in ->d_manage() without the
need to change the flag. This still avoids unnecessary calls to
->d_automount(), adds negligible overhead and eliminates a seriously
ugly check in the expire code.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 14 Dec 2012 02:03:21 +0000 (18:03 -0800)]
Merge tag 'ktest-v3.8' of git://git./linux/kernel/git/rostedt/linux-ktest
Pull ktest update from Steven Rostedt:
"fixes and updated for new boot loaders"
* tag 'ktest-v3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
ktest: Test if target machine is up before install
ktest: Fix breakage from change of oldnoconfig to olddefconfig
ktest: Add native support for syslinux boot loader
ktest: Sync before reboot
ktest: Add support for grub2
Linus Torvalds [Thu, 13 Dec 2012 23:31:08 +0000 (15:31 -0800)]
Merge tag 'kvm-3.8-1' of git://git./virt/kvm/kvm
Pull KVM updates from Marcelo Tosatti:
"Considerable KVM/PPC work, x86 kvmclock vsyscall support,
IA32_TSC_ADJUST MSR emulation, amongst others."
Fix up trivial conflict in kernel/sched/core.c due to cross-cpu
migration notifier added next to rq migration call-back.
* tag 'kvm-3.8-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (156 commits)
KVM: emulator: fix real mode segment checks in address linearization
VMX: remove unneeded enable_unrestricted_guest check
KVM: VMX: fix DPL during entry to protected mode
x86/kexec: crash_vmclear_local_vmcss needs __rcu
kvm: Fix irqfd resampler list walk
KVM: VMX: provide the vmclear function and a bitmap to support VMCLEAR in kdump
x86/kexec: VMCLEAR VMCSs loaded on all cpus if necessary
KVM: MMU: optimize for set_spte
KVM: PPC: booke: Get/set guest EPCR register using ONE_REG interface
KVM: PPC: bookehv: Add EPCR support in mtspr/mfspr emulation
KVM: PPC: bookehv: Add guest computation mode for irq delivery
KVM: PPC: Make EPCR a valid field for booke64 and bookehv
KVM: PPC: booke: Extend MAS2 EPN mask for 64-bit
KVM: PPC: e500: Mask MAS2 EPN high 32-bits in 32/64 tlbwe emulation
KVM: PPC: Mask ea's high 32-bits in 32/64 instr emulation
KVM: PPC: e500: Add emulation helper for getting instruction ea
KVM: PPC: bookehv64: Add support for interrupt handling
KVM: PPC: bookehv: Remove GET_VCPU macro from exception handler
KVM: PPC: booke: Fix get_tb() compile error on 64-bit
KVM: PPC: e500: Silence bogus GCC warning in tlb code
...
Linus Torvalds [Thu, 13 Dec 2012 22:29:16 +0000 (14:29 -0800)]
Merge tag 'stable/for-linus-3.8-rc0-tag' of git://git./linux/kernel/git/konrad/xen
Pull Xen updates from Konrad Rzeszutek Wilk:
- Add necessary infrastructure to make balloon driver work under ARM.
- Add /dev/xen/privcmd interfaces to work with ARM and PVH.
- Improve Xen PCIBack wild-card parsing.
- Add Xen ACPI PAD (Processor Aggregator) support - so can offline/
online sockets depending on the power consumption.
- PVHVM + kexec = use an E820_RESV region for the shared region so we
don't overwrite said region during kexec reboot.
- Cleanups, compile fixes.
Fix up some trivial conflicts due to the balloon driver now working on
ARM, and there were changes next to the previous work-arounds that are
now gone.
* tag 'stable/for-linus-3.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/PVonHVM: fix compile warning in init_hvm_pv_info
xen: arm: implement remap interfaces needed for privcmd mappings.
xen: correctly use xen_pfn_t in remap_domain_mfn_range.
xen: arm: enable balloon driver
xen: balloon: allow PVMMU interfaces to be compiled out
xen: privcmd: support autotranslated physmap guests.
xen: add pages parameter to xen_remap_domain_mfn_range
xen/acpi: Move the xen_running_on_version_or_later function.
xen/xenbus: Remove duplicate inclusion of asm/xen/hypervisor.h
xen/acpi: Fix compile error by missing decleration for xen_domain.
xen/acpi: revert pad config check in xen_check_mwait
xen/acpi: ACPI PAD driver
xen-pciback: reject out of range inputs
xen-pciback: simplify and tighten parsing of device IDs
xen PVonHVM: use E820_Reserved area for shared_info
Linus Torvalds [Thu, 13 Dec 2012 22:20:19 +0000 (14:20 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/s390/linux
Pull s390 update from Martin Schwidefsky:
"Add support to generate code for the latest machine zEC12, MOD and XOR
instruction support for the BPF jit compiler, the dasd safe offline
feature and the big one: the s390 architecture gets PCI support!!
Right before the world ends on the 21st ;-)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (41 commits)
s390/qdio: rename the misleading PCI flag of qdio devices
s390/pci: remove obsolete email addresses
s390/pci: speed up __iowrite64_copy by using pci store block insn
s390/pci: enable NEED_DMA_MAP_STATE
s390/pci: no msleep in potential IRQ context
s390/pci: fix potential NULL pointer dereference in dma_free_seg_table()
s390/pci: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
s390/bpf,jit: add support for XOR instruction
s390/bpf,jit: add support MOD instruction
s390/cio: fix pgid reserved check
vga: compile fix, disable vga for s390
s390/pci: add PCI Kconfig options
s390/pci: s390 specific PCI sysfs attributes
s390/pci: PCI hotplug support via SCLP
s390/pci: CHSC PCI support for error and availability events
s390/pci: DMA support
s390/pci: PCI adapter interrupts for MSI/MSI-X
s390/bitops: find leftmost bit instruction support
s390/pci: CLP interface
s390/pci: base support
...
Linus Torvalds [Thu, 13 Dec 2012 21:23:33 +0000 (13:23 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven.
Fix up trivial conflict (m68k switched to generic version of
uapi/asm/socket.h, net tree updated the old one) as per Geert.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k/sun3: Fix instruction faults
m68k/sun3: Get interrupts working again
m68k: move to a single instance of free_initmem()
m68k: merge MMU and non-MMU versions of mm/init.c
m68k: switch to using the asm-generic termios.h
m68k: switch to using the asm-generic termbits.h
m68k: switch to using the asm-generic sockios.h
m68k: switch to using the asm-generic socket.h
m68k: switch to using the asm-generic shmbuf.h
m68k: switch to using the asm-generic sembuf.h
m68k: switch to using the asm-generic msgbuf.h
m68k: switch to using the asm-generic auxvec.h
m68k: switch to using the asm-generic shmparam.h
m68k: switch to using the asm-generic spinlock.h
m68k: switch to using the asm-generic hw_irq.h
arch/m68k: remove CONFIG_EXPERIMENTAL
Linus Torvalds [Thu, 13 Dec 2012 21:21:19 +0000 (13:21 -0800)]
Merge git://git./linux/kernel/git/davem/sparc-next
Pull tiny sparc update from David Miller:
"Not much going on this release cycle in sparc land, just a Kconfig
tweak."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next:
of_i2c: sparc: Allow OF_I2C for sparc
Linus Torvalds [Thu, 13 Dec 2012 21:20:02 +0000 (13:20 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
"A pile of fixes in response to yesterday's big merge. The SCTP HMAC
thing hasn't been addressed yet, I'll take care of that myself if Neil
and Vlad don't show signs of life by tomorrow.
1) Use after free of SKB in tuntap code. Fix by Eric Dumazet,
reported by Dave Jones.
2) NFC LLCP code emits annoying kernel log message, triggerable by
the user. From Dave Jones.
3) Fix several endianness bugs noticed by sparse in the bridging
code, from Stephen Hemminger.
4) Ipv6 NDISC code doesn't take padding into account properly, fix
from YOSHIFUJI Hideaki.
5) Add missing docs to ethtool_flow_ext struct, from Yan Burman."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
bridge: fix icmpv6 endian bug and other sparse warnings
net: ethool: Document struct ethtool_flow_ext
ndisc: Fix padding error in link-layer address option.
tuntap: dont use skb after netif_rx_ni(skb)
nfc: remove noisy message from llcp_sock_sendmsg
Linus Torvalds [Thu, 13 Dec 2012 21:11:15 +0000 (13:11 -0800)]
Merge branch 'akpm' (Andrew's patch-bomb)
Merge misc VM changes from Andrew Morton:
"The rest of most-of-MM. The other MM bits await a slab merge.
This patch includes the addition of a huge zero_page. Not a
performance boost but it can save large amounts of physical memory in
some situations.
Also a bunch of Fujitsu engineers are working on memory hotplug.
Which, as it turns out, was badly broken. About half of their patches
are included here; the remainder are 3.8 material."
However, this merge disables CONFIG_MOVABLE_NODE, which was totally
broken. We don't add new features with "default y", nor do we add
Kconfig questions that are incomprehensible to most people without any
help text. Does the feature even make sense without compaction or
memory hotplug?
* akpm: (54 commits)
mm/bootmem.c: remove unused wrapper function reserve_bootmem_generic()
mm/memory.c: remove unused code from do_wp_page()
asm-generic, mm: pgtable: consolidate zero page helpers
mm/hugetlb.c: fix warning on freeing hwpoisoned hugepage
hwpoison, hugetlbfs: fix RSS-counter warning
hwpoison, hugetlbfs: fix "bad pmd" warning in unmapping hwpoisoned hugepage
mm: protect against concurrent vma expansion
memcg: do not check for mm in __mem_cgroup_count_vm_event
tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)
mm: provide more accurate estimation of pages occupied by memmap
fs/buffer.c: remove redundant initialization in alloc_page_buffers()
fs/buffer.c: do not inline exported function
writeback: fix a typo in comment
mm: introduce new field "managed_pages" to struct zone
mm, oom: remove statically defined arch functions of same name
mm, oom: remove redundant sleep in pagefault oom handler
mm, oom: cleanup pagefault oom handler
memory_hotplug: allow online/offline memory to result movable node
numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
mm, memcg: avoid unnecessary function call when memcg is disabled
...
Linus Torvalds [Thu, 13 Dec 2012 20:14:47 +0000 (12:14 -0800)]
Merge tag 'for-3.8' of git://git./linux/kernel/git/helgaas/pci
Pull PCI update from Bjorn Helgaas:
"Host bridge hotplug:
- Untangle _PRT from struct pci_bus (Bjorn Helgaas)
- Request _OSC control before scanning root bus (Taku Izumi)
- Assign resources when adding host bridge (Yinghai Lu)
- Remove root bus when removing host bridge (Yinghai Lu)
- Remove _PRT during hot remove (Yinghai Lu)
SRIOV
- Add sysfs knobs to control numVFs (Don Dutile)
Power management
- Notify devices when power resource turned on (Huang Ying)
Bug fixes
- Work around broken _SEG on HP xw9300 (Bjorn Helgaas)
- Keep runtime PM enabled for unbound PCI devices (Huang Ying)
- Fix Optimus dual-GPU runtime D3 suspend issue (Dave Airlie)
- Fix xen frontend shutdown issue (David Vrabel)
- Work around PLX PCI 9050 BAR alignment erratum (Ian Abbott)
Miscellaneous
- Add GPL license for drivers/pci/ioapic (Andrew Cooks)
- Add standard PCI-X, PCIe ASPM register #defines (Bjorn Helgaas)
- NumaChip remote PCI support (Daniel Blueman)
- Fix PCIe Link Capabilities Supported Link Speed definition (Jingoo
Han)
- Convert dev_printk() to dev_info(), etc (Joe Perches)
- Add support for non PCI BAR ROM data (Matthew Garrett)
- Add x86 support for host bridge translation offset (Mike Yoknis)
- Report success only when every driver supports AER (Vijay
Pandarathil)"
Fix up trivial conflicts.
* tag 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (48 commits)
PCI: Use phys_addr_t for physical ROM address
x86/PCI: Add NumaChip remote PCI support
ath9k: Use standard #defines for PCIe Capability ASPM fields
iwlwifi: Use standard #defines for PCIe Capability ASPM fields
iwlwifi: collapse wrapper for pcie_capability_read_word()
iwlegacy: Use standard #defines for PCIe Capability ASPM fields
iwlegacy: collapse wrapper for pcie_capability_read_word()
cxgb3: Use standard #defines for PCIe Capability ASPM fields
PCI: Add standard PCIe Capability Link ASPM field names
PCI/portdrv: Use PCI Express Capability accessors
PCI: Use standard PCIe Capability Link register field names
x86: Use PCI setup data
PCI: Add support for non-BAR ROMs
PCI: Add pcibios_add_device
EFI: Stash ROMs if they're not in the PCI BAR
PCI: Add and use standard PCI-X Capability register names
PCI/PM: Keep runtime PM enabled for unbound PCI devices
xen-pcifront: Handle backend CLOSED without CLOSING
PCI: SRIOV control and status via sysfs (documentation)
PCI/AER: Report success only when every device has AER-aware driver
...
Linus Torvalds [Thu, 13 Dec 2012 20:04:35 +0000 (12:04 -0800)]
Merge tag 'regulator-3.8' of git://git./linux/kernel/git/broonie/regulator
Pull regulator updates from Mark Brown:
"A fairly quiet release again, a couple of relatively small new
features and a bunch of driver specific work including yet more code
elimination and fixes from Axel Lin.
- Addition of linear_min_sel for offsetting linear selectors in the
helpers.
- Support for continuous voltage ranges for regulators with extremely
high resolution.
- Drivers for AS3711, DA9055, MAX9873, TPS51632, TPS80031 and ARM
vexpress."
Fix up trivial conflict (due to typo fix) in palmas-regulator.c
* tag 'regulator-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (80 commits)
regulator: core: Fix logic to determinate if regulator can change voltage
regulator: s5m8767: Fix to work even if no DVS gpio present
regulator: s5m8767: Fix to read the first DVS register.
regulator: s5m8767: Fix to work when platform registers less regulators
regulator: gpio-regulator: gpio_set_value should use cansleep
regulator: gpio-regulator: Fix logical error in for() loop
regulator: anatop: Use regulator_[get|set]_voltage_sel_regmap
regulator: anatop: Use linear_min_sel with linear mapping
regulator: max1586: Implement get_voltage_sel callback
regulator: lp8788-buck: Kill _gpio_request function
regulator: tps80031: Convert tps80031_ldo_ops to linear_min_sel and list_voltage_linear
regulator: lp8788-ldo: Remove val array in lp8788_config_ldo_enable_mode
regulator: gpio-regulator: Add ifdef CONFIG_OF guard for regulator_gpio_of_match
regulator: palmas: Convert palmas_ops_smps to regulator_[get|set]_voltage_sel_regmap
regulator: palmas: Return raw register values as the selectors in [get|set]_voltage_sel
regulators: add regulator_can_change_voltage() function
regulator: tps51632: Ensure [base|max]_voltage_uV pdata settings are valid
regulator: wm831x-dcdc: Add MODULE_ALIAS for wm831x-boostp
regulator: wm831x-dcdc: Ensure selected voltage falls within requested range
regulator: tps51632: Use linear_min_sel and regulator_[map|list]_voltage_linear
...
Linus Torvalds [Thu, 13 Dec 2012 20:00:48 +0000 (12:00 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jikos/hid
Pull HID subsystem updates from Jiri Kosina:
1) Support for HID over I2C bus has been added by Benjamin Tissoires.
ACPI device discovery is still in the works.
2) Support for Win8 Multitouch protocol is being added, most work done
by Benjamin Tissoires as well
3) EIO/ERESTARTSYS is fixed in hiddev/hidraw, fixes by Andrew Duggan
and Jiri Kosina
4) ION iCade driver added by Bastien Nocera
5) Support for a couple new Roccat devices has been added by Stefan
Achatz
6) HID sensor hubs are now auto-detected instead of having to list all
the VID/PID combinations in the blacklist array
7) other random fixes and support for new device IDs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (65 commits)
HID: i2c-hid: add mutex protecting open/close race
Revert "HID: sensors: add to special driver list"
HID: sensors: autodetect USB HID sensor hubs
HID: hidp: fallback to input session properly if hid is blacklisted
HID: i2c-hid: fix ret_count check
HID: i2c-hid: fix i2c_hid_get_raw_report count mismatches
HID: i2c-hid: remove extra .irq field in struct i2c_hid
HID: i2c-hid: reorder allocation/free of buffers
HID: i2c-hid: fix memory corruption due to missing hid declaration
HID: i2c-hid: remove superfluous include
HID: i2c-hid: remove unneeded test in i2c_hid_remove
HID: i2c-hid: i2c_hid_get_report may fail
HID: i2c-hid: also call i2c_hid_free_buffers in i2c_hid_remove
HID: i2c-hid: fix error messages
HID: i2c-hid: fix return paths
HID: i2c-hid: remove unused static declarations
HID: i2c-hid: fix i2c_hid_dbg macro
HID: i2c-hid: fix checkpatch.pl warning
HID: i2c-hid: enhance Kconfig
HID: i2c-hid: change I2C name
...
Linus Torvalds [Thu, 13 Dec 2012 20:00:02 +0000 (12:00 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jikos/trivial
Pull trivial branch from Jiri Kosina:
"Usual stuff -- comment/printk typo fixes, documentation updates, dead
code elimination."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
HOWTO: fix double words typo
x86 mtrr: fix comment typo in mtrr_bp_init
propagate name change to comments in kernel source
doc: Update the name of profiling based on sysfs
treewide: Fix typos in various drivers
treewide: Fix typos in various Kconfig
wireless: mwifiex: Fix typo in wireless/mwifiex driver
messages: i2o: Fix typo in messages/i2o
scripts/kernel-doc: check that non-void fcts describe their return value
Kernel-doc: Convention: Use a "Return" section to describe return values
radeon: Fix typo and copy/paste error in comments
doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
various: Fix spelling of "asynchronous" in comments.
Fix misspellings of "whether" in comments.
eisa: Fix spelling of "asynchronous".
various: Fix spelling of "registered" in comments.
doc: fix quite a few typos within Documentation
target: iscsi: fix comment typos in target/iscsi drivers
treewide: fix typo of "suport" in various comments and Kconfig
treewide: fix typo of "suppport" in various comments
...
Linus Torvalds [Thu, 13 Dec 2012 19:59:27 +0000 (11:59 -0800)]
Merge tag 'firewire-updates' of git://git./linux/kernel/git/ieee1394/linux1394
Pull IEEE 1394 (FireWire) subsystem updates from Stefan Richter:
- IPv4-over-1394: fixes for broadcast and multicast
- SBP-2: allow thin-provisioning related commands
- trivia
* tag 'firewire-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
firewire: net: remove unused variable in fwnet_receive_broadcast()
firewire: net: Fix handling of fragmented multicast/broadcast packets.
firewire: sbp2: allow WRITE SAME and REPORT SUPPORTED OPERATION CODES
tools/firewire: nosy-dump: check for allocation failure
Linus Torvalds [Thu, 13 Dec 2012 19:51:23 +0000 (11:51 -0800)]
Merge tag 'sound-3.8' of git://git./linux/kernel/git/tiwai/sound
Pull sound updates from Takashi Iwai:
"This update contains a fairly wide range of changes all over in sound
subdirectory, mainly because of UAPI header moves by David and __dev*
annotation removals by Bill. Other highlights are:
- Introduced the support for wallclock timestamps in ALSA PCM core
- Add the poll loop implementation for HD-audio jack detection
- Yet more VGA-switcheroo fixes for HD-audio
- New VIA HD-audio codec support
- More fixes on resource management in USB audio and MIDI drivers
- More quirks for USB-audio ASUS Xonar U3, Reloop Play, Focusrite,
Roland VG-99, etc
- Add support for FastTrack C400 usb-audio
- Clean ups in many drivers regarding firmware loading
- Add PSC724 Ultimate Edge support to ice1712
- A few hdspm driver updates
- New Stanton SCS.1d/1m FireWire driver
- Standardisation of the logging in ASoC codes
- DT and dmaengine support for ASoC Atmel
- Support for Wolfson ADSP cores
- New drivers for Freescale/iVeia P1022 and Maxim MAX98090
- Lots of other ASoC driver fixes and developments"
Fix up trivial conflicts. And go out on a limb and assume the dts file
'status' field of one of the conflicting things was supposed to be
"disabled", not "disable" like in pretty much all other cases.
* tag 'sound-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (341 commits)
ALSA: hda - Move runtime PM check to runtime_idle callback
ALSA: hda - Add stereo-dmic fixup for Acer Aspire One 522
ALSA: hda - Avoid doubly suspend after vga switcheroo
ALSA: usb-audio: Enable S/PDIF on the ASUS Xonar U3
ALSA: hda - Check validity of CORB/RIRB WP reads
ALSA: hda - use usleep_range in link reset and change timeout check
ALSA: HDA: VIA: Add support for codec VT1808.
ALSA: HDA: VIA Add support for codec VT1705CF.
ASoC: codecs: remove __dev* attributes
ASoC: utils: remove __dev* attributes
ASoC: ux500: remove __dev* attributes
ASoC: txx9: remove __dev* attributes
ASoC: tegra: remove __dev* attributes
ASoC: spear: remove __dev* attributes
ASoC: sh: remove __dev* attributes
ASoC: s6000: remove __dev* attributes
ASoC: OMAP: remove __dev* attributes
ASoC: nuc900: remove __dev* attributes
ASoC: mxs: remove __dev* attributes
ASoC: kirkwood: remove __dev* attributes
...
Linus Torvalds [Thu, 13 Dec 2012 19:00:00 +0000 (11:00 -0800)]
Merge tag 'boards2' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC board updates, take 2 from Olof Johansson:
"This branch contains board updates for shmobile that had dependencies
on earlier branches past the first driver branch, and thus are merged
separately.
Most of these are to enable audio and USB on shmobile. They contain a
dependent ASoC branch that has been coordinated with Mark Brown."
* tag 'boards2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: shmobile: mackerel: Add FLCTL IRQ resource
ARM: shmobile: use FSI driver's audio clock on ap4evb
ARM: shmobile: use FSI driver's audio clock on mackerel
ARM: shmobile: use FSI driver's audio clock on armadillo800eva
ARM: shmobile: mackerel: enable DMAEngine on USB Host
ARM: shmobile: marzen: add USB OHCI driver support
ARM: shmobile: marzen: add USB EHCI driver support
ARM: shmobile: marzen: add USB phy support
ASoC: fsi: add master clock control functions
ASoC: fsi: care fsi_hw_start/stop() return value
ASoC: fsi: fsi_set_master_clk() was called from fsi_hw_xxx() only
ASoC: fsi: use devm_request_irq()
ASoC: fsi: fixup channels_min/max
Linus Torvalds [Thu, 13 Dec 2012 18:59:11 +0000 (10:59 -0800)]
Merge tag 'drivers' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC driver specific changes from Olof Johansson:
"A collection of mostly SoC-specific driver updates:
- a handful of pincontrol and setup changes
- new drivers for hwmon and reset controller for vexpress
- timing support updates for OMAP (gpmc and other interfaces)
- plus a collection of smaller cleanups"
* tag 'drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (21 commits)
ARM: ux500: fix pin warning
ARM: OMAP2+: tusb6010: generic timing calculation
ARM: OMAP2+: smc91x: generic timing calculation
ARM: OMAP2+: onenand: generic timing calculation
ARM: OMAP2+: gpmc: generic timing calculation
ARM: OMAP2+: gpmc: handle additional timings
ARM: OMAP2+: nand: remove redundant rounding
gpio: samsung: use pr_* instead of printk
ARM: ux500: fixup magnetometer pins
ARM: ux500: add STM pin configuration
ARM: ux500: 8500: add pinctrl support for uart1 and uart2
ARM: ux500: cosmetic fixups for uart0
gpio: samsung: Fix input mode setting function for GPIO int
ARM: SAMSUNG: Insert bitmap_gpio_int member in samsung_gpio_chip
ARM: ux500: 8500: define SDI sleep states
ARM: vexpress: Reset driver
ARM: ux500: 8500: update SKE keypad pinctrl table
hwmon: Versatile Express hwmon driver
ARM: ux500: delete duplicate macro
ARM: ux500: 8500: add IDLE pin configuration for SPI
...
Linus Torvalds [Thu, 13 Dec 2012 18:58:20 +0000 (10:58 -0800)]
Merge tag 'pm-merge' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC power management and clock changes from Olof Johansson:
"This branch contains a largeish set of updates of power management and
clock setup. The bulk of it is for OMAP/AM33xx platforms, but also a
few around hotplug/suspend/resume on Exynos.
It includes a split-up of some of the OMAP clock data into separate
files which adds to the diffstat, but gross delta is fairly reasonable."
* tag 'pm-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (60 commits)
ARM: OMAP: Move plat-omap/dma-omap.h to include/linux/omap-dma.h
ASoC: OMAP: mcbsp fixes for enabling ARM multiplatform support
watchdog: OMAP: fixup for ARM multiplatform support
ARM: EXYNOS: Add flush_cache_all in suspend finisher
ARM: EXYNOS: Remove scu_enable from cpuidle
ARM: EXYNOS: Fix soft reboot hang after suspend/resume
ARM: EXYNOS: Add support for rtc wakeup
ARM: EXYNOS: fix the hotplug for Cortex-A15
ARM: OMAP2+: omap_device: Correct resource handling for DT boot
ARM: OMAP2+: hwmod: Add possibility to count hwmod resources based on type
ARM: OMAP2+: hwmod: Add support for per hwmod/module context lost count
ARM: OMAP2+: PRM: initialize some PRM functions early
ARM: OMAP2+: voltage: fixup oscillator handling when CONFIG_PM=n
ARM: OMAP4: USB: power down MUSB PHY during boot
ARM: OMAP2+: clock: Cleanup !CONFIG_COMMON_CLK parts
ARM: OMAP2xxx: clock: drop obsolete clock data
ARM: OMAP2: clock: Cleanup !CONFIG_COMMON_CLK parts
ARM: OMAP3+: DPLL: drop !CONFIG_COMMON_CLK sections
ARM: AM33xx: clock: drop obsolete clock data
ARM: OMAP3xxx: clk: drop obsolete clock data
...
Linus Torvalds [Thu, 13 Dec 2012 18:57:16 +0000 (10:57 -0800)]
Merge tag 'multiplatform' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC multiplatform conversion patches from Olof Johansson:
"Here are more patches in the progression towards multiplatform, sparse
irq conversions in particular.
Tegra has a handful of cleanups and general groundwork, but is not
quite there yet on full enablement.
Platforms that are enabled through this branch are VT8500 and Zynq.
Note that i.MX was converted in one of the earlier cleanup branches as
well (before we started a separate topic for multiplatform). And both
new platforms for this merge window, sunxi and bcm, were merged with
multiplatform support enabled."
Fix up conflicts mostly as per Olof.
* tag 'multiplatform' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (29 commits)
ARM: zynq: Remove all unused mach headers
ARM: zynq: add support for ARCH_MULTIPLATFORM
ARM: zynq: make use of debug_ll_io_init()
ARM: zynq: remove TTC early mapping
ARM: tegra: move debug-macro.S to include/debug
ARM: tegra: don't include iomap.h from debug-macro.S
ARM: tegra: decouple uncompress.h and debug-macro.S
ARM: tegra: simplify DEBUG_LL UART selection options
ARM: tegra: select SPARSE_IRQ
ARM: tegra: enhance timer.c to get IO address from device tree
ARM: tegra: enhance timer.c to get IRQ info from device tree
ARM: timer: fix checkpatch warnings
ARM: tegra: add TWD to device tree
ARM: tegra: define DT bindings for and instantiate RTC
ARM: tegra: define DT bindings for and instantiate timer
clocksource/mtu-nomadik: use apb_pclk
clk: ux500: Register mtu apb_pclocks
ARM: plat-nomadik: convert platforms to SPARSE_IRQ
mfd/db8500-prcmu: use the irq_domain_add_simple()
mfd/ab8500-core: use irq_domain_add_simple()
...
Linus Torvalds [Thu, 13 Dec 2012 18:39:26 +0000 (10:39 -0800)]
Merge tag 'dt' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC device tree conversions and enablement from Olof Johansson:
"Continued device tree conversion and enablement across a number of
platforms; Kirkwood, tegra, i.MX, Exynos, zynq and a couple of other
smaller series as well.
ux500 has seen continued conversion for platforms. Several platforms
have seen pinctrl-via-devicetree conversions for simpler
multiplatform. Tegra is adding data for new devices/drivers, and
Exynos has a bunch of new bindings and devices added as well.
So, pretty much the same progression in the right direction as the
last few releases."
Fix up conflicts as per Olof.
* tag 'dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (185 commits)
ARM: ux500: Rename dbx500 cpufreq code to be more generic
ARM: dts: add missing ux500 device trees
ARM: ux500: Stop registering the PCM driver from platform code
ARM: ux500: Move board specific GPIO info out to subordinate DTS files
ARM: ux500: Disable the MMCI gpio-regulator by default
ARM: Kirkwood: remove kirkwood_ehci_init() from new boards
ARM: Kirkwood: Add support LED of OpenBlocks A6
ARM: Kirkwood: Convert to EHCI via DT for OpenBlocks A6
ARM: kirkwood: Add NAND partiton map for OpenBlocks A6
ARM: kirkwood: Add support second I2C bus and RTC on OpenBlocks A6
ARM: kirkwood: Add support DT of second I2C bus
ARM: kirkwood: Convert mplcec4 board to pinctrl
ARM: Kirkwood: Convert km_kirkwood to pinctrl
ARM: Kirkwood: support 98DX412x kirkwoods with pinctrl
ARM: Kirkwood: Convert IX2-200 to pinctrl.
ARM: Kirkwood: Convert lsxl boards to pinctrl.
ARM: Kirkwood: Convert ib62x0 to pinctrl.
ARM: Kirkwood: Convert GoFlex Net to pinctrl.
ARM: Kirkwood: Convert dreamplug to pinctrl.
ARM: Kirkwood: Convert dockstar to pinctrl.
...
stephen hemminger [Thu, 13 Dec 2012 06:51:28 +0000 (06:51 +0000)]
bridge: fix icmpv6 endian bug and other sparse warnings
Fix the warnings reported by sparse on recent bridge multicast
changes. Mostly just rcu annotation issues but in this case
sparse found a real bug! The ICMPv6 mld2 query mrc
value is in network byte order.
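The class of bug described above can be sketched in userspace C. This is only an illustration, not the bridge code: the struct and function names (demo_mld2_query, demo_mrc_raw, demo_mrc_host) are hypothetical, and the only assumption taken from the commit text is that the 16-bit mrc field arrives in network byte order.

```c
#include <arpa/inet.h>	/* ntohs(), htons() */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the query header; not the kernel struct. */
struct demo_mld2_query {
	uint16_t mrc_be;	/* maximum response code as received (big-endian) */
};

/* Buggy pattern: the wire value is used directly, which gives the
 * wrong number on little-endian hosts. */
unsigned int demo_mrc_raw(const struct demo_mld2_query *q)
{
	assert(q != NULL);
	return q->mrc_be;
}

/* Fixed pattern: convert from network to host byte order first. */
unsigned int demo_mrc_host(const struct demo_mld2_query *q)
{
	assert(q != NULL);
	return ntohs(q->mrc_be);
}
```

With a field filled in via htons(1000), demo_mrc_host() returns 1000 on any host, while demo_mrc_raw() only does so on big-endian machines. Annotating such fields with a distinct big-endian type is what lets sparse flag the raw use.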
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yan Burman [Thu, 13 Dec 2012 05:20:59 +0000 (05:20 +0000)]
net: ethool: Document struct ethtool_flow_ext
Add documentation for struct ethtool_flow_ext especially in regard
to what flags are needed for which fields.
Signed-off-by: Yan Burman <yanb@mellanox.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YOSHIFUJI Hideaki / 吉藤英明 [Thu, 13 Dec 2012 04:29:36 +0000 (04:29 +0000)]
ndisc: Fix padding error in link-layer address option.
If a natural number n exists where 2 + data_len <= 8n < 2 + data_len + pad,
post padding is not initialized correctly.
(Un)fortunately, the only type that requires pad is Infiniband,
whose pad is 2 and data_len is 20, and this logical error has not
become obvious, but it is better to fix.
Note that ndisc_opt_addr_space() handles the situation described
above correctly.
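The condition above can be checked with a few lines of C. This is a sketch of the arithmetic only, not of the ndisc code; demo_pad_bug_triggers() is a hypothetical helper, and the "2" is taken to be the option's type and length bytes.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return true if some multiple of 8 lies in the half-open interval
 * [2 + data_len, 2 + data_len + pad), which is the condition the
 * commit message gives for the padding being set up incorrectly.
 */
bool demo_pad_bug_triggers(unsigned int data_len, unsigned int pad)
{
	unsigned int lo = 2 + data_len;		/* inclusive lower bound */
	unsigned int hi = lo + pad;		/* exclusive upper bound */
	unsigned int next8 = (lo + 7) / 8 * 8;	/* smallest multiple of 8 >= lo */

	return next8 < hi;
}
```

For Infiniband (data_len 20, pad 2) the interval is [22, 24), which contains no multiple of 8, so demo_pad_bug_triggers(20, 2) is false. A made-up case such as data_len 20 with pad 3 gives [22, 25), which contains 24, so the condition would hold.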
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 12 Dec 2012 19:22:57 +0000 (19:22 +0000)]
tuntap: dont use skb after netif_rx_ni(skb)
On Wed, 2012-12-12 at 23:16 -0500, Dave Jones wrote:
> Since todays net merge, I see this when I start openvpn..
>
> general protection fault: 0000 [#1] PREEMPT SMP
> Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs iTCO_wdt iTCO_vendor_support snd_emu10k1 snd_util_mem snd_ac97_codec coretemp ac97_bus microcode snd_hwdep snd_seq pcspkr snd_pcm snd_page_alloc snd_timer lpc_ich i2c_i801 snd_rawmidi mfd_core snd_seq_device snd e1000e soundcore emu10k1_gp gameport i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate firewire_ohci sata_sil firewire_core crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy
> CPU 0
> Pid: 1381, comm: openvpn Not tainted 3.7.0+ #14 /D975XBX
> RIP: 0010:[<ffffffff815b54a4>] [<ffffffff815b54a4>] skb_flow_dissect+0x314/0x3e0
> RSP: 0018:ffff88007d0d9c48 EFLAGS: 00010206
> RAX: 000000000000055d RBX: 6b6b6b6b6b6b6b4b RCX: 1471030a0180040a
> RDX: 0000000000000005 RSI: 00000000ffffffe0 RDI: ffff8800ba83fa80
> RBP: ffff88007d0d9cb8 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000101 R12: ffff8800ba83fa80
> R13: 0000000000000008 R14: ffff88007d0d9cc8 R15: ffff8800ba83fa80
> FS: 00007f6637104800(0000) GS:ffff8800bf600000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f563f5b01c4 CR3: 000000007d140000 CR4: 00000000000007f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process openvpn (pid: 1381, threadinfo ffff88007d0d8000, task ffff8800a540cd60)
> Stack:
> ffff8800ba83fa80 0000000000000296 0000000000000000 0000000000000000
> ffff88007d0d9cc8 ffffffff815bcff4 ffff88007d0d9ce8 ffffffff815b1831
> ffff88007d0d9ca8 00000000703f6364 ffff8800ba83fa80 0000000000000000
> Call Trace:
> [<ffffffff815bcff4>] ? netif_rx+0x114/0x4c0
> [<ffffffff815b1831>] ? skb_copy_datagram_from_iovec+0x61/0x290
> [<ffffffff815b672a>] __skb_get_rxhash+0x1a/0xd0
> [<ffffffffa03b9538>] tun_get_user+0x418/0x810 [tun]
> [<ffffffff8135f468>] ? delay_tsc+0x98/0xf0
> [<ffffffff8109605c>] ? __rcu_read_unlock+0x5c/0xa0
> [<ffffffffa03b9a41>] tun_chr_aio_write+0x81/0xb0 [tun]
> [<ffffffff81145011>] ? __buffer_unlock_commit+0x41/0x50
> [<ffffffff811db917>] do_sync_write+0xa7/0xe0
> [<ffffffff811dc01f>] vfs_write+0xaf/0x190
> [<ffffffff811dc375>] sys_write+0x55/0xa0
> [<ffffffff81705540>] tracesys+0xdd/0xe2
> Code: 41 8b 44 24 68 41 2b 44 24 6c 01 de 29 f0 83 f8 03 0f 8e a0 00 00 00 48 63 de 49 03 9c 24 e0 00 00 00 48 85 db 0f 84 72 fe ff ff <8b> 03 41 89 46 08 b8 01 00 00 00 e9 43 fd ff ff 0f 1f 40 00 48
> RIP [<ffffffff815b54a4>] skb_flow_dissect+0x314/0x3e0
> RSP <ffff88007d0d9c48>
> ---[ end trace 6d42c834c72c002e ]---
>
>
> Faulting instruction is
>
> 0: 8b 03 mov (%rbx),%eax
>
> rbx is slab poison (-20) so this looks like a use-after-free here...
>
> flow->ports = *ports;
> 314: 8b 03 mov (%rbx),%eax
> 316: 41 89 46 08 mov %eax,0x8(%r14)
>
> in the inlined skb_header_pointer in skb_flow_dissect
>
> Dave
>
commit 96442e4242 (tuntap: choose the txq based on rxq) added
a use after free.
Cache rxhash in a temp variable before calling netif_rx_ni()
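The fix pattern named above (copy the value out before handing the buffer off) can be sketched in userspace C. This is an analogy under stated assumptions, not the tun driver code: demo_pkt stands in for the skb, and demo_consume() stands in for netif_rx_ni() in the sense that it takes ownership and may free the object. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an skb; only the field of interest. */
struct demo_pkt {
	uint32_t rxhash;
};

/* Stand-in for the hand-off call: takes ownership and may free pkt. */
void demo_consume(struct demo_pkt *pkt)
{
	free(pkt);
}

/* Safe ordering: read what is needed, then give the packet away. */
uint32_t demo_receive(struct demo_pkt *pkt)
{
	uint32_t rxhash;

	assert(pkt != NULL);
	rxhash = pkt->rxhash;	/* cache before ownership is given up */
	demo_consume(pkt);	/* pkt must not be touched after this */
	return rxhash;		/* use the cached copy, never pkt->rxhash */
}
```

Reading pkt->rxhash after demo_consume() would be the use after free; with slab poisoning enabled in the kernel, that is the kind of access that shows up as the 0x6b6b... pattern in the register dump quoted above.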
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Wang <jasowang@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Jones [Wed, 12 Dec 2012 18:11:34 +0000 (18:11 +0000)]
nfc: remove noisy message from llcp_sock_sendmsg
This is easily triggerable when fuzz-testing as an unprivileged user.
We could rate-limit it, but given we don't print similar messages
for other protocols, I just removed it.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 13 Dec 2012 02:07:07 +0000 (18:07 -0800)]
Merge git://git./linux/kernel/git/davem/net-next
Pull networking changes from David Miller:
1) Allow to dump, monitor, and change the bridge multicast database
using netlink. From Cong Wang.
2) RFC 5961 TCP blind data injection attack mitigation, from Eric
Dumazet.
3) Networking user namespace support from Eric W. Biederman.
4) tuntap/virtio-net multiqueue support by Jason Wang.
5) Support for checksum offload of encapsulated packets (basically,
tunneled traffic can still be checksummed by HW). From Joseph
Gasparakis.
6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
Daniel Borkmann.
7) Bridge port parameters over netlink and BPDU blocking support
from Stephen Hemminger.
8) Improve data access patterns during inet socket demux by rearranging
socket layout, from Eric Dumazet.
9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
Jon Maloy.
10) Update TCP socket hash sizing to be more in line with current day
realities. The existing heuristics were chosen a decade ago.
From Eric Dumazet.
11) Fix races, queue bloat, and excessive wakeups in ATM and
associated drivers, from Krzysztof Mazur and David Woodhouse.
12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
in VXLAN driver, from David Stevens.
13) Add "oops_only" mode to netconsole, from Amerigo Wang.
14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
allow DCB netlink to work on namespaces other than the initial
namespace. From John Fastabend.
15) Support PTP in the Tigon3 driver, from Matt Carlson.
16) tun/vhost zero copy fixes and improvements, plus turn it on
by default, from Michael S. Tsirkin.
17) Support per-association statistics in SCTP, from Michele
Baldessari.
And many, many, driver updates, cleanups, and improvements. Too
numerous to mention individually.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
net/mlx4_en: Add support for destination MAC in steering rules
net/mlx4_en: Use generic etherdevice.h functions.
net: ethtool: Add destination MAC address to flow steering API
bridge: add support of adding and deleting mdb entries
bridge: notify mdb changes via netlink
ndisc: Unexport ndisc_{build,send}_skb().
uapi: add missing netconf.h to export list
pkt_sched: avoid requeues if possible
solos-pci: fix double-free of TX skb in DMA mode
bnx2: Fix accidental reversions.
bna: Driver Version Updated to 3.1.2.1
bna: Firmware update
bna: Add RX State
bna: Rx Page Based Allocation
bna: TX Intr Coalescing Fix
bna: Tx and Rx Optimizations
bna: Code Cleanup and Enhancements
ath9k: check pdata variable before dereferencing it
ath5k: RX timestamp is reported at end of frame
ath9k_htc: RX timestamp is reported at end of frame
...
Linus Torvalds [Thu, 13 Dec 2012 01:50:34 +0000 (17:50 -0800)]
Merge tag 'for-linus-20121212' of git://git./linux/kernel/git/dhowells/linux-mn10300
Pull MN10300 changes from David Howells:
"miscellaneous MN10300 arch patches. I've based it on top of Al Viro's
signal tree - so these patches should be pulled after that."
* tag 'for-linus-20121212' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-mn10300:
MN10300: Use asm-generic/pci_iomap.h
MN10300: Get rid of unused variable from ASB2305 PCI code
MN10300: ASB2305 PCI code needs linux/irq.h
mn10300/mm/fault.c: Port OOM changes to do_page_fault
MN10300: Handle cacheable PCI regions in pci_iomap()
MN10300: fix debug polling in ttySM driver
MN10300: ttySM: clean up unnecessary casting
MN10300: fix SMP synchronization between txdma and serial driver
MN10300: fix serial port vdma irq setup for SMP
MN10300: cleanup IRQ affinity setting
MN10300: ttySM: Use memory barriers correctly in circular buffer logic
Lin Feng [Wed, 12 Dec 2012 21:52:39 +0000 (13:52 -0800)]
mm/bootmem.c: remove unused wrapper function reserve_bootmem_generic()
reserve_bootmem_generic() has no callers, so remove it.
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dominik Dingel [Wed, 12 Dec 2012 21:52:37 +0000 (13:52 -0800)]
mm/memory.c: remove unused code from do_wp_page()
page_mkwrite is initialized with zero and only set once; from that point
there is no way to get to the oom or oom_free_new labels.
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:52:36 +0000 (13:52 -0800)]
asm-generic, mm: pgtable: consolidate zero page helpers
We have two different implementations of is_zero_pfn() and my_zero_pfn()
helpers: for architectures with and without zero page coloring.
Let's consolidate them in <asm-generic/pgtable.h>.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Wed, 12 Dec 2012 21:52:33 +0000 (13:52 -0800)]
mm/hugetlb.c: fix warning on freeing hwpoisoned hugepage
Fix the warning from __list_del_entry() which is triggered when a process
tries to do free_huge_page() for a hwpoisoned hugepage.
free_huge_page() can be called for hwpoisoned hugepage from
unpoison_memory(). This function gets refcount once and clears
PageHWPoison, and then puts refcount twice to return the hugepage back to
free pool. The second put_page() finally reaches free_huge_page().
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Wed, 12 Dec 2012 21:52:30 +0000 (13:52 -0800)]
hwpoison, hugetlbfs: fix RSS-counter warning
Memory error handling on hugepages can break an RSS counter, which emits a
message like "Bad rss-counter state mm:ffff88040abecac0 idx:1 val:-1".
This is because PageAnon returns true for hugepage (this behavior is
necessary for reverse mapping to work on hugetlbfs).
[akpm@linux-foundation.org: clean up code layout]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Wed, 12 Dec 2012 21:52:28 +0000 (13:52 -0800)]
hwpoison, hugetlbfs: fix "bad pmd" warning in unmapping hwpoisoned hugepage
When a process which used a hwpoisoned hugepage tries to exit() or
munmap(), the kernel can print out a "bad pmd" message because the page table
walker in free_pgtables() encounters a 'hwpoisoned entry' on pmd.
This is because currently we fail to clear the hwpoisoned entry in
__unmap_hugepage_range(), so this patch simply does it.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel Lespinasse [Wed, 12 Dec 2012 21:52:25 +0000 (13:52 -0800)]
mm: protect against concurrent vma expansion
expand_stack() runs with a shared mmap_sem lock. Because of this, there
could be multiple concurrent stack expansions in the same mm, which may
cause problems in the vma gap update code.
I propose to solve this by taking the mm->page_table_lock around such vma
expansions, in order to avoid the concurrency issue. We only have to
worry about concurrent expand_stack() calls here, since we hold a shared
mmap_sem lock and all vma modifications other than expand_stack() are done
under an exclusive mmap_sem lock.
I previously tried to achieve the same effect by making sure all growable
vmas in a given mm would share the same anon_vma, which we already lock
here. However this turned out to be difficult - all of the schemes I
tried for refcounting the growable anon_vma and clearing turned out ugly.
So, I'm now proposing only the minimal fix.
The overhead of taking the page table lock during stack expansion is
expected to be small: glibc doesn't use expandable stacks for the threads
it creates, so having multiple growable stacks is actually uncommon and we
don't expect the page table lock to get bounced between threads.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 12 Dec 2012 21:52:23 +0000 (13:52 -0800)]
memcg: do not check for mm in __mem_cgroup_count_vm_event
The mm given to __mem_cgroup_count_vm_event() cannot be NULL because the
function is either called from the page fault path or vma->vm_mm is used.
So the check can be dropped.
The check was introduced by commit 456f998ec817 ("memcg: add the
pagefault count into memcg stats") because the originally proposed patch
used current->mm for shmem but this has been changed to vma->vm_mm later
on without the check being removed (thanks to Hugh for this
recollection).
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ying Han <yinghan@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Wed, 12 Dec 2012 21:52:21 +0000 (13:52 -0800)]
tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)
Revert 3.5's commit f21f8062201f ("tmpfs: revert SEEK_DATA and
SEEK_HOLE") to reinstate 4fb5ef089b28 ("tmpfs: support SEEK_DATA and
SEEK_HOLE"), with the intervening additional arg to
generic_file_llseek_size().
In 3.8, ext4 is expected to join btrfs, ocfs2 and xfs with proper
SEEK_DATA and SEEK_HOLE support; and a good case has now been made for
it on tmpfs, so let's join the party.
It's quite easy for tmpfs to scan the radix_tree to support llseek's new
SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are
still on my mind (in particular, the !PageUptodate-ness of pages
fallocated but still unwritten).
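From userspace the reinstated support is reached through plain lseek(). A
minimal sketch, assuming a libc that exposes SEEK_DATA and SEEK_HOLE (the
fallback defines use the Linux values); the helper names are hypothetical
and are not part of this patch:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, used only if the libc headers did not provide them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/* Offset of the first data byte at or after 'from', or -1 on error
 * (errno is ENXIO when nothing but a hole remains up to end of file). */
static off_t next_data(int fd, off_t from)
{
	return lseek(fd, from, SEEK_DATA);
}

/* Offset of the first hole at or after 'from'. End of file always
 * counts as a hole, so a fully written file reports its size. */
static off_t next_hole(int fd, off_t from)
{
	return lseek(fd, from, SEEK_HOLE);
}

/* Create an anonymous temporary file holding 'len' bytes of data. */
static int make_dense_file(size_t len)
{
	FILE *f = tmpfile();

	assert(f != NULL);
	for (size_t i = 0; i < len; i++)
		fputc('x', f);
	fflush(f);
	return fileno(f);
}
```

A caller would typically alternate next_data() and next_hole() to walk the
extents of a sparse file and copy only the data ranges.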
[akpm@linux-foundation.org: fix warning with CONFIG_TMPFS=n]
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jaegeuk Hanse <jaegeuk.hanse@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <wenqing.lz@taobao.com>
Cc: Jeff liu <jeff.liu@oracle.com>
Cc: Paul Eggert <eggert@cs.ucla.edu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Josef Bacik <josef@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiang Liu [Wed, 12 Dec 2012 21:52:19 +0000 (13:52 -0800)]
mm: provide more accurate estimation of pages occupied by memmap
If SPARSEMEM is enabled, it won't build page structures for non-existing
pages (holes) within a zone, so provide a more accurate estimation of
pages occupied by memmap if there are bigger holes within the zone.
And pages for highmem zones' memmap will be allocated from lowmem, so
charge nr_kernel_pages for that.
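The estimation can be sketched in userspace as below. This is only an
illustration: the page size, the size of a page structure, and the
threshold for what counts as a "bigger hole" are assumptions of the sketch,
and memmap_pages() is a hypothetical name:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration only; both depend on the
 * architecture and on the kernel configuration. */
#define SKETCH_PAGE_SIZE	4096UL
#define SKETCH_STRUCT_PAGE	64UL

/*
 * Estimate how many pages the memmap (the array of page structures)
 * of a zone occupies. With sparsemem and a zone containing big holes,
 * only present pages get page structures, so count those instead of
 * the whole span. The "holes larger than 1/16 of the present pages"
 * threshold is an assumption of this sketch.
 */
static unsigned long memmap_pages(unsigned long spanned,
				  unsigned long present, bool sparsemem)
{
	unsigned long pages = spanned;

	assert(present <= spanned);
	if (sparsemem && spanned > present + (present >> 4))
		pages = present;

	/* Round the byte size up to whole pages. */
	return (pages * SKETCH_STRUCT_PAGE + SKETCH_PAGE_SIZE - 1) /
	       SKETCH_PAGE_SIZE;
}
```

For a zone spanning 2^20 pages of which only 2^19 are present, the sketch
charges half as many memmap pages with sparsemem as without it.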
[akpm@linux-foundation.org: mark calc_memmap_size __paging_init]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Tested-by: Jianguo Wu <wujianguo@huawei.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yan Hong [Wed, 12 Dec 2012 21:52:16 +0000 (13:52 -0800)]
fs/buffer.c: remove redundant initialization in alloc_page_buffers()
buffer_head comes from kmem_cache_zalloc(), no need to zero its fields.
Signed-off-by: Yan Hong <clouds.yan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yan Hong [Wed, 12 Dec 2012 21:52:15 +0000 (13:52 -0800)]
fs/buffer.c: do not inline exported function
It makes no sense to inline an exported function.
Signed-off-by: Yan Hong <clouds.yan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yan Hong [Wed, 12 Dec 2012 21:52:14 +0000 (13:52 -0800)]
writeback: fix a typo in comment
Signed-off-by: Yan Hong <clouds.yan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiang Liu [Wed, 12 Dec 2012 21:52:12 +0000 (13:52 -0800)]
mm: introduce new field "managed_pages" to struct zone
Currently a zone's present_pages is calculated as below, which is
inaccurate and may cause trouble for memory hotplug.
spanned_pages - absent_pages - memmap_pages - dma_reserve.
While fixing bugs caused by inaccurate zone->present_pages, we found that
zone->present_pages has been abused. The field zone->present_pages may
have different meanings in different contexts:
1) pages existing in a zone.
2) pages managed by the buddy system.
For more discussions about the issue, please refer to:
http://lkml.org/lkml/2012/11/5/866
https://patchwork.kernel.org/patch/1346751/
This patchset introduces a new field named "managed_pages" to
struct zone, which counts "pages managed by the buddy system", and reverts
zone->present_pages to count "physical pages existing in a zone", which
also keeps it consistent with pgdat->node_present_pages.
We will set an initial value for zone->managed_pages in function
free_area_init_core() and will adjust it later if the initial value is
inaccurate.
For DMA/normal zones, the initial value is set to:
(spanned_pages - absent_pages - memmap_pages - dma_reserve)
Later zone->managed_pages will be adjusted to the accurate value when the
bootmem allocator frees all free pages to the buddy system in function
free_all_bootmem_node() and free_all_bootmem().
The bootmem allocator doesn't touch highmem pages, so highmem zones'
managed_pages is set to the accurate value "spanned_pages - absent_pages"
in function free_area_init_core() and won't be updated anymore.
This patch also adds a new field "managed_pages" to /proc/zoneinfo
and sysrq showmem.
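The initial values described above amount to the following arithmetic. A
minimal sketch, assuming present_pages is read as "spanned_pages -
absent_pages"; struct zone_sketch and both helpers are hypothetical names,
not the kernel's data structures:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inputs for one zone, all counted in pages. */
struct zone_sketch {
	unsigned long spanned_pages;
	unsigned long absent_pages;
	unsigned long memmap_pages;
	unsigned long dma_reserve;
	bool highmem;
};

/* Physical pages existing in the zone. */
static unsigned long present_pages(const struct zone_sketch *z)
{
	assert(z->absent_pages <= z->spanned_pages);
	return z->spanned_pages - z->absent_pages;
}

/*
 * Initial estimate of the pages handed to the buddy system.
 * Highmem zones are not touched by the bootmem allocator, so their
 * value is already final; for DMA/normal zones the memmap and the
 * DMA reserve are subtracted and the result is adjusted later.
 */
static unsigned long initial_managed_pages(const struct zone_sketch *z)
{
	unsigned long present = present_pages(z);
	unsigned long overhead = z->memmap_pages + z->dma_reserve;

	if (z->highmem)
		return present;
	assert(overhead <= present);
	return present - overhead;
}
```

The difference between the two counters is exactly what the old single
field could not express: pages that exist but are never given to the buddy
system.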
[akpm@linux-foundation.org: small comment tweaks]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 12 Dec 2012 21:52:10 +0000 (13:52 -0800)]
mm, oom: remove statically defined arch functions of same name
out_of_memory() is a globally defined function to call the oom killer.
x86, sh, and powerpc all use a function of the same name within file scope
in their respective fault.c unnecessarily. Inline the functions into the
pagefault handlers to clean the code up.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 12 Dec 2012 21:52:07 +0000 (13:52 -0800)]
mm, oom: remove redundant sleep in pagefault oom handler
out_of_memory() will already cause current to schedule if it has not been
killed, so doing it again in pagefault_out_of_memory() is redundant.
Remove it.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 12 Dec 2012 21:52:06 +0000 (13:52 -0800)]
mm, oom: cleanup pagefault oom handler
To lock the entire system from parallel oom killing, it's possible to pass
in a zonelist with all zones rather than using for_each_populated_zone()
for the iteration. This obsoletes try_set_system_oom() and
clear_system_oom() so that they can be removed.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:52:04 +0000 (13:52 -0800)]
memory_hotplug: allow online/offline memory to result movable node
Memory management can now handle movable nodes, or nodes which don't have
any normal memory, so we can dynamically configure and add a movable node by:
onlining ZONE_MOVABLE memory on a previously offline node
offlining the last normal memory, which results in a node without normal memory
Movable nodes are very important for power saving, hardware partitioning and
high-availability systems (hardware fault management).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:52:00 +0000 (13:52 -0800)]
numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
We need a node which only contains movable memory. This feature is very
important for node hotplug. If a node has normal/highmem, the memory may
be used by the kernel and can't be offlined. If the node only contains
movable memory, we can offline the memory and the node.
Everything is now prepared, so we can actually introduce N_MEMORY.
Add CONFIG_MOVABLE_NODE so that it can be used for movable-dedicated nodes.
[akpm@linux-foundation.org: fix Kconfig text]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 12 Dec 2012 21:51:57 +0000 (13:51 -0800)]
mm, memcg: avoid unnecessary function call when memcg is disabled
While profiling numa/core v16 with cgroup_disable=memory on the command
line, I noticed mem_cgroup_count_vm_event() still showed up as high as
0.60% in perftop.
This occurs because the function is called extremely often even when memcg
is disabled.
To fix this, inline the check for mem_cgroup_disabled() so we avoid the
unnecessary function call if memcg is disabled.
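The pattern is an inline wrapper that tests the cheap condition and only
then calls the out-of-line function. A userspace sketch with hypothetical
names (the flag stands in for mem_cgroup_disabled(), and the counter exists
only to make the effect observable):

```c
#include <assert.h>
#include <stdbool.h>

/* Stands in for the state that mem_cgroup_disabled() reports. */
static bool memcg_disabled_sketch;

/* Counts how often the out-of-line slow path was entered. */
static unsigned long slow_path_calls;

/* Out-of-line part; in a kernel build this would live in a .c file. */
static void count_vm_event_slow(int event)
{
	assert(event >= 0);
	slow_path_calls++;
}

/* Inline wrapper: when the feature is disabled the caller pays for
 * one test and branch instead of a function call. */
static inline void count_vm_event_sketch(int event)
{
	if (memcg_disabled_sketch)
		return;
	count_vm_event_slow(event);
}
```

The saving only shows up on hot paths, which is why the profile mentioned
above is the motivation for the change.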
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Wed, 12 Dec 2012 21:51:56 +0000 (13:51 -0800)]
mm: add a reminder comment for __GFP_BITS_SHIFT
Cc: Glauber Costa <glommer@parallels.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 12 Dec 2012 21:51:53 +0000 (13:51 -0800)]
mm: WARN_ON_ONCE if f_op->mmap() change vma's start address
While reviewing the source code, I found a comment which mentions that
vma's start address can be changed after f_op->mmap(). I didn't verify
that this is really possible, because there are so many f_op->mmap()
implementations. But if some mmap() does change vma's start
address, it is a possible error situation, because we have already prepared
prev vma, rb_link and rb_parent, and these are related to the original address.
So add WARN_ON_ONCE to find out whether this situation really happens.
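The check can be pictured as follows. This is a userspace sketch only:
warn_once(), struct vma_sketch and the two hooks are hypothetical stand-ins,
not the kernel's WARN_ON_ONCE() or a real ->mmap() implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static unsigned int warnings_emitted;

/* Report the condition the first time it is seen for this 'warned'
 * flag and stay quiet afterwards; returns the condition itself. */
static bool warn_once(bool cond, bool *warned, const char *what)
{
	if (cond && !*warned) {
		*warned = true;
		warnings_emitted++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

struct vma_sketch {
	unsigned long vm_start;
};

typedef int (*mmap_hook)(struct vma_sketch *vma);

/* A hook that leaves the start address alone. */
static int well_behaved_hook(struct vma_sketch *vma)
{
	(void)vma;
	return 0;
}

/* A hook that moves the start address behind the caller's back. */
static int moving_hook(struct vma_sketch *vma)
{
	vma->vm_start += 4096;
	return 0;
}

static int call_mmap_checked(struct vma_sketch *vma, mmap_hook hook)
{
	static bool warned;
	unsigned long addr;
	int err;

	assert(vma != NULL && hook != NULL);
	addr = vma->vm_start;
	err = hook(vma);
	/* Lookups done before the call were made for 'addr'. */
	warn_once(addr != vma->vm_start, &warned,
		  "mmap hook changed vm_start");
	return err;
}
```

Warning once keeps the log readable if a misbehaving hook is called in a
loop, while still leaving a trace that the situation occurred.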
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Greg Thelen [Wed, 12 Dec 2012 21:51:52 +0000 (13:51 -0800)]
res_counter: delete res_counter_write()
Since commit 628f42355389 ("memcg: limit change shrink usage") both
res_counter_write() and write_strategy_fn have been unused. This patch
deletes them both.
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:49 +0000 (13:51 -0800)]
hotplug: update nodemasks management
Update nodemasks management for N_MEMORY.
[lliubbo@gmail.com: fix build]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:46 +0000 (13:51 -0800)]
page_alloc: use N_MEMORY instead N_HIGH_MEMORY change the node_states initialization
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.
The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
Since we introduced N_MEMORY, we update the initialization of node_states.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:43 +0000 (13:51 -0800)]
vmscan: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.
The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:40 +0000 (13:51 -0800)]
init: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.
The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:39 +0000 (13:51 -0800)]
kthread: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.
The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:37 +0000 (13:51 -0800)]
vmstat: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.
The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:36 +0000 (13:51 -0800)]
hugetlb: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.
The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:33 +0000 (13:51 -0800)]
mempolicy: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.
The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:30 +0000 (13:51 -0800)]
mm,migrate: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.
The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:28 +0000 (13:51 -0800)]
oom: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.
The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:27 +0000 (13:51 -0800)]
memcontrol: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.
The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:25 +0000 (13:51 -0800)]
procfs: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.
The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:24 +0000 (13:51 -0800)]
cpuset: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.
The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:21 +0000 (13:51 -0800)]
mm: node_states: introduce N_MEMORY
We have N_NORMAL_MEMORY, which stands for the nodes that have normal memory
with zone_type <= ZONE_NORMAL.
And we have N_HIGH_MEMORY, which stands for the nodes that have normal or
high memory.
But we don't have any word to stand for the nodes that have *any* memory.
And we have N_CPU but no N_MEMORY.
Current code reuses N_HIGH_MEMORY for this purpose because any node
which has memory must currently have high memory or normal memory.
A) But this reuse is bad for *readability*, because the name
N_HIGH_MEMORY just stands for high or normal memory:
A.example 1)
mem_cgroup_nr_lru_pages():
for_each_node_state(nid, N_HIGH_MEMORY)
The reader will be confused (why does this function only count high or
normal memory nodes? does it count ZONE_MOVABLE's lru pages?)
until someone else tells them that N_HIGH_MEMORY is reused to stand for
nodes that have any memory.
A.cont) If we introduce N_MEMORY, we can reduce this confusion
AND make the code clearer:
A.example 2) mm/page_cgroup.c uses N_HIGH_MEMORY twice:
The first use is in page_cgroup_init(void):
for_each_node_state(nid, N_HIGH_MEMORY) {
It means that if the node has memory, we will allocate a page_cgroup map
for the node. We should use N_MEMORY here instead to gain clarity.
The second use is in alloc_page_cgroup():
if (node_state(nid, N_HIGH_MEMORY))
addr = vzalloc_node(size, nid);
It tests whether the node has high or normal memory that can be allocated
by the kernel. We should keep N_HIGH_MEMORY here, and it will be better
if the "any memory" semantic of N_HIGH_MEMORY is removed.
B) This reuse becomes outdated once we introduce MOVABLE-dedicated nodes.
A MOVABLE-dedicated node should appear in neither
node_states[N_HIGH_MEMORY] nor node_states[N_NORMAL_MEMORY],
because a MOVABLE-dedicated node has no high or normal memory.
On x86_64, N_HIGH_MEMORY=N_NORMAL_MEMORY, so if a MOVABLE-dedicated node
is in node_states[N_HIGH_MEMORY], it is also in
node_states[N_NORMAL_MEMORY], which breaks SLUB.
SLUB uses
for_each_node_state(nid, N_NORMAL_MEMORY)
and creates a kmem_cache_node for the MOVABLE-dedicated node, which causes
problems.
In a word, we need N_MEMORY. We just introduce it as an alias of
N_HIGH_MEMORY and fix all improper usages of N_HIGH_MEMORY in later
patches.
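The distinction argued for above can be modelled with one bitmask per node
state. This is a toy model with hypothetical names; it shows the intended
end state in which a MOVABLE-dedicated node is visible only through
N_MEMORY, whereas this patch itself only adds N_MEMORY as an alias:

```c
#include <assert.h>

enum node_state_sketch {
	N_NORMAL_MEMORY,	/* nodes with memory in zones <= ZONE_NORMAL */
	N_HIGH_MEMORY,		/* nodes with normal or high memory */
	N_MEMORY,		/* nodes with any memory at all */
	NR_STATES
};

/* Which kinds of zones a node's memory lives in (toy encoding). */
enum zone_bits {
	HAS_NORMAL  = 1,
	HAS_HIGHMEM = 2,
	HAS_MOVABLE = 4
};

#define MAX_NODES 8

/* One bit per node id in each state mask. */
static unsigned int node_states[NR_STATES];

static void node_add_memory(int nid, unsigned int zones)
{
	assert(nid >= 0 && nid < MAX_NODES);
	if (zones & HAS_NORMAL)
		node_states[N_NORMAL_MEMORY] |= 1u << nid;
	if (zones & (HAS_NORMAL | HAS_HIGHMEM))
		node_states[N_HIGH_MEMORY] |= 1u << nid;
	if (zones)
		node_states[N_MEMORY] |= 1u << nid;
}

static int node_state(int nid, enum node_state_sketch state)
{
	return (node_states[state] >> nid) & 1;
}

/* Counterpart of walking for_each_node_state() and counting. */
static int count_nodes(enum node_state_sketch state)
{
	int n = 0;

	for (int nid = 0; nid < MAX_NODES; nid++)
		n += node_state(nid, state);
	return n;
}
```

Code that must visit every node with memory iterates over N_MEMORY, while
code that needs kernel-allocatable memory keeps using the narrower states.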
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Marek Szyprowski [Wed, 12 Dec 2012 21:51:19 +0000 (13:51 -0800)]
mm: use migrate_prep() instead of migrate_prep_local()
__alloc_contig_migrate_range() should use all possible ways to get all the
pages migrated from the given memory range, so pruning per-cpu lru lists
for all CPUs is required, regardless of the cost of such an operation. Otherwise
some pages which got stuck at per-cpu lru list might get missed by
migration procedure causing the contiguous allocation to fail.
Reported-by: SeongHwan Yoon <sunghwan.yun@samsung.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thierry Reding [Wed, 12 Dec 2012 21:51:17 +0000 (13:51 -0800)]
mm: compaction: Fix compiler warning
compact_capture_page() is only used if compaction is enabled so it should
be moved into the corresponding #ifdef.
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:51:14 +0000 (13:51 -0800)]
thp: avoid race on multiple parallel page faults to the same page
pmd value is stable only with mm->page_table_lock taken. After taking
the lock we need to check that nobody modified the pmd before changing it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Bob Liu <lliubbo@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:51:12 +0000 (13:51 -0800)]
thp: introduce sysfs knob to disable huge zero page
By default kernel tries to use huge zero page on read page fault. It's
possible to disable huge zero page by writing 0 or enable it back by
writing 1:
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:51:09 +0000 (13:51 -0800)]
thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events
hzp_alloc is incremented every time a huge zero page is successfully
allocated. It includes allocations which were dropped due to a
race with another allocation. Note, it doesn't count every mapping
of the huge zero page, only its allocation.
hzp_alloc_failed is incremented if kernel fails to allocate huge zero
page and falls back to using small pages.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:51:06 +0000 (13:51 -0800)]
thp: implement refcounting for huge zero page
H. Peter Anvin doesn't like huge zero page which sticks in memory forever
after the first allocation. Here's an implementation of lockless refcounting
for the huge zero page.
We have two basic primitives: {get,put}_huge_zero_page(). They
manipulate reference counter.
If counter is 0, get_huge_zero_page() allocates a new huge page and takes
two references: one for caller and one for shrinker. We free the page
only in shrinker callback if counter is 1 (only shrinker has the
reference).
put_huge_zero_page() only decrements the counter. The counter is never zero
in put_huge_zero_page() since the shrinker holds one reference.
Freeing huge zero page in shrinker callback helps to avoid frequent
allocate-free.
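The counting rules above can be modelled in userspace with C11 atomics.
This is a sketch under stated assumptions, not the kernel code: calloc()
stands in for the huge page allocation, the names are hypothetical, and a
thread that loses the allocation race simply retries:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Stand-in size; a real huge page is far larger. */
#define ZERO_PAGE_SKETCH_SIZE 4096

static atomic_int zero_refcount;
static _Atomic(void *) zero_page;

/* Increment the counter unless it is zero; nonzero on success. */
static int inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return 1;
	}
	return 0;
}

static void *get_zero_page(void)
{
	for (;;) {
		void *expected = NULL;
		void *page;

		if (inc_not_zero(&zero_refcount))
			return atomic_load(&zero_page);

		page = calloc(1, ZERO_PAGE_SKETCH_SIZE);
		if (!page)
			return NULL;
		if (!atomic_compare_exchange_strong(&zero_page, &expected,
						    page)) {
			free(page);	/* lost the race, try again */
			continue;
		}
		/* One reference for the caller, one for the shrinker. */
		atomic_store(&zero_refcount, 2);
		return page;
	}
}

static void put_zero_page(void)
{
	int old = atomic_fetch_sub(&zero_refcount, 1);

	/* The shrinker's reference keeps the counter above zero. */
	assert(old > 1);
	(void)old;
}

/* Shrinker callback: free only if nobody but the shrinker holds a
 * reference. Returns 1 if the page was freed. */
static int shrink_zero_page(void)
{
	int expected = 1;

	if (!atomic_compare_exchange_strong(&zero_refcount, &expected, 0))
		return 0;
	free(atomic_exchange(&zero_page, NULL));
	return 1;
}
```

Because only the shrinker performs the 1 -> 0 transition, users that drop
and retake references in quick succession never trigger a free and
reallocation.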
Refcounting has a cost. On a 4-socket machine I observe ~1% slowdown on
parallel (40 processes) read page faulting compared to lazy huge page
allocation. I think it's pretty reasonable for a synthetic benchmark.
[lliubbo@gmail.com: fix mismerge]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:51:05 +0000 (13:51 -0800)]
thp: lazy huge zero page allocation
Instead of allocating huge zero page on hugepage_init() we can postpone it
until first huge zero page map. It saves memory if THP is not in use.
cmpxchg() is used to avoid race on huge_zero_pfn initialization.
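The lazy initialization pattern can be sketched in user space as follows.
This is an illustration only: get_huge_zero_page_lazy() is a hypothetical
name, a pointer is used instead of a pfn, and C11
atomic_compare_exchange_strong() stands in for the kernel's cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define HZP_BYTES (2UL * 1024 * 1024)

static _Atomic(char *) huge_zero_page;	/* NULL until first use */

/* Return the shared zero-filled buffer, allocating it on the first call. */
char *get_huge_zero_page_lazy(void)
{
	char *page = atomic_load(&huge_zero_page);

	if (page)
		return page;

	char *fresh = calloc(1, HZP_BYTES);
	if (!fresh)
		return NULL;

	char *expected = NULL;
	if (atomic_compare_exchange_strong(&huge_zero_page, &expected, fresh))
		return fresh;

	/* Another thread installed its page first; drop ours, use theirs. */
	free(fresh);
	return expected;
}
```

Nothing is allocated until the first caller asks for the page, and two
racing callers end up with the same page because only one compare-exchange
can succeed.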
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:51:02 +0000 (13:51 -0800)]
thp: setup huge zero page on non-write page fault
All code paths seem to be covered. Now we can map the huge zero page on
read page fault.
We set it up in do_huge_pmd_anonymous_page() if the area around the fault
address is suitable for THP and we've got a read page fault.
If we fail to set up the huge zero page (ENOMEM) we fall back to
handle_pte_fault() as we normally do in THP.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:51:00 +0000 (13:51 -0800)]
thp: implement splitting pmd for huge zero page
We can't split the huge zero page itself (and it's a bug if we try), but we
can split the pmd which points to it.
On splitting the pmd we create a table with all ptes set to the normal zero
page.
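A user-space model of the resulting table, as a sketch only: struct
pte_table and split_huge_zero_pmd_model() are hypothetical, and plain
pointers stand in for page table entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SMALL_PAGE_BYTES 4096UL
#define HUGE_PAGE_BYTES  (2UL * 1024 * 1024)
#define PTES_PER_TABLE   (HUGE_PAGE_BYTES / SMALL_PAGE_BYTES)	/* 512 */

static const char small_zero_page[SMALL_PAGE_BYTES];	/* shared 4k zero page */

struct pte_table {
	const char *pte[PTES_PER_TABLE];
};

/* Replace one "huge zero pmd" by a table of ptes, all on the zero page. */
struct pte_table *split_huge_zero_pmd_model(void)
{
	struct pte_table *table = malloc(sizeof(*table));

	if (!table)
		return NULL;
	for (size_t i = 0; i < PTES_PER_TABLE; i++)
		table->pte[i] = small_zero_page;
	return table;
}
```

The huge zero page itself is untouched; only the mapping changes from one
2M entry to 512 entries that all reference the same 4k zero page.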
[akpm@linux-foundation.org: fix build error]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:50:59 +0000 (13:50 -0800)]
thp: change split_huge_page_pmd() interface
Pass vma instead of mm and add address parameter.
In most cases we already have vma on the stack. We provide
split_huge_page_pmd_mm() for the few cases when we have mm, but not vma.
This change is preparation for the huge zero pmd splitting implementation.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:50:57 +0000 (13:50 -0800)]
thp: change_huge_pmd(): make sure we don't try to make a page writable
mprotect core never tries to make page writable using change_huge_pmd().
Let's add an assert that the assumption is true. It's important to be
sure we will not make huge zero page writable.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:50:54 +0000 (13:50 -0800)]
thp: do_huge_pmd_wp_page(): handle huge zero page
On write access to the huge zero page we allocate a new huge page and clear it.
If ENOMEM, graceful fallback: we create a new pmd table and set the pte around
the fault address to a newly allocated normal (4k) page. All other ptes in the
pmd are set to the normal zero page.
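The ENOMEM fallback can be modelled in user space like this. It is a sketch
under assumptions: struct fallback_table and wp_fallback_model() are
hypothetical, calloc() stands in for allocating a normal page, and pointers
stand in for ptes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SMALL_PAGE_BYTES 4096UL
#define HUGE_PAGE_BYTES  (2UL * 1024 * 1024)
#define NR_PTES          (HUGE_PAGE_BYTES / SMALL_PAGE_BYTES)

static const char shared_zero[SMALL_PAGE_BYTES];	/* normal zero page */

struct fallback_table {
	const char *pte[NR_PTES];	/* what each 4k slot maps */
	char *private_page;		/* the one writable 4k page */
	size_t private_index;		/* slot containing the fault address */
};

/* Build the fallback table for a write fault at the given offset. */
int wp_fallback_model(struct fallback_table *t, size_t fault_offset)
{
	if (fault_offset >= HUGE_PAGE_BYTES)
		return -1;

	char *page = calloc(1, SMALL_PAGE_BYTES);
	if (!page)
		return -1;

	for (size_t i = 0; i < NR_PTES; i++)
		t->pte[i] = shared_zero;

	t->private_index = fault_offset / SMALL_PAGE_BYTES;
	t->private_page = page;
	t->pte[t->private_index] = page;
	return 0;
}
```

Only the slot covering the fault address consumes new memory; the other 511
slots keep sharing the zero page.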
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:50:51 +0000 (13:50 -0800)]
thp: copy_huge_pmd(): copy huge zero page
It's easy to copy huge zero page. Just set destination pmd to huge zero
page.
It's safe to copy huge zero page since we have none yet :-p
[rientjes@google.com: fix comment]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:50:50 +0000 (13:50 -0800)]
thp: zap_huge_pmd(): zap huge zero pmd
We don't have a mapped page to zap in huge zero page case. Let's just clear
pmd and remove it from tlb.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:50:47 +0000 (13:50 -0800)]
thp: huge zero page: basic preparation
During testing I noticed a big (up to 2.5 times) memory consumption overhead
on some workloads (e.g. ft.A from NPB) if THP is enabled.
The main reason for that big difference is the lack of a zero page in the THP
case. We have to allocate a real page on read page fault.
A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB (1024 * 1024)

int main(int argc, char **argv)
{
	char *p;
	int i;

	/* 200M buffer aligned to the 2M huge page size */
	if (posix_memalign((void **)&p, 2 * MB, 200 * MB))
		return 1;
	/* read-fault every 4k page; nothing is ever written */
	for (i = 0; i < 200 * MB; i += 4096)
		assert(p[i] == 0);
	pause();
	return 0;
}
With thp-never RSS is about 400k, but with thp-always it's 200M. After
the patchset thp-always RSS is 400k too.
Design overview.
Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
zeros. The way how we allocate it changes in the patchset:
- [01/10] simplest way: hzp allocated on boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp;
We set it up in do_huge_pmd_anonymous_page() if the area around the fault
address is suitable for THP and we've got a read page fault. If we fail to
set up hzp (ENOMEM) we fall back to handle_pte_fault() as we normally do in THP.
On wp fault to hzp we allocate real memory for the huge page and clear it.
If ENOMEM, graceful fallback: we create a new pmd table and set the pte
around the fault address to a newly allocated normal (4k) page. All other ptes
in the pmd are set to the normal zero page.
We cannot split hzp (and it's a bug if we try), but we can split the pmd
which points to it. On splitting the pmd we create a table with all ptes
set to the normal zero page.
===
At hpa's request I've tried an alternative approach for hzp implementation
(see the Virtual huge zero page patchset): a pmd table with all entries set to
the zero page. This way should be more cache friendly, but it increases TLB
pressure.
The problem with virtual huge zero page: it requires per-arch enabling.
We need a way to mark that a pmd table has all ptes set to the zero page.
Some numbers to compare two implementations (on 4s Westmere-EX):
Microbenchmark1
==============
test:
	posix_memalign((void **)&p, 2 * MB, 8 * GB);
	for (i = 0; i < 100; i++) {
		assert(memcmp(p, p + 4*GB, 4*GB) == 0);
		asm volatile ("": : :"memory");
	}
hzp:
Performance counter stats for './test_memcmp' (5 runs):
32356.272845 task-clock # 0.998 CPUs utilized ( +- 0.13% )
40 context-switches # 0.001 K/sec ( +- 0.94% )
0 CPU-migrations # 0.000 K/sec
4,218 page-faults # 0.130 K/sec ( +- 0.00% )
76,712,481,765 cycles # 2.371 GHz ( +- 0.13% ) [83.31%]
36,279,577,636 stalled-cycles-frontend # 47.29% frontend cycles idle ( +- 0.28% ) [83.35%]
1,684,049,110 stalled-cycles-backend # 2.20% backend cycles idle ( +- 2.96% ) [66.67%]
134,355,715,816 instructions # 1.75 insns per cycle
# 0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
13,526,169,702 branches # 418.039 M/sec ( +- 0.10% ) [83.31%]
1,058,230 branch-misses # 0.01% of all branches ( +- 0.91% ) [83.36%]
32.413866442 seconds time elapsed ( +- 0.13% )
vhzp:
Performance counter stats for './test_memcmp' (5 runs):
30327.183829 task-clock # 0.998 CPUs utilized ( +- 0.13% )
38 context-switches # 0.001 K/sec ( +- 1.53% )
0 CPU-migrations # 0.000 K/sec
4,218 page-faults # 0.139 K/sec ( +- 0.01% )
71,964,773,660 cycles # 2.373 GHz ( +- 0.13% ) [83.35%]
31,191,284,231 stalled-cycles-frontend # 43.34% frontend cycles idle ( +- 0.40% ) [83.32%]
773,484,474 stalled-cycles-backend # 1.07% backend cycles idle ( +- 6.61% ) [66.67%]
134,982,215,437 instructions # 1.88 insns per cycle
# 0.23 stalled cycles per insn ( +- 0.11% ) [83.32%]
13,509,150,683 branches # 445.447 M/sec ( +- 0.11% ) [83.34%]
1,017,667 branch-misses # 0.01% of all branches ( +- 1.07% ) [83.32%]
30.381324695 seconds time elapsed ( +- 0.13% )
Microbenchmark2
==============
test:
	posix_memalign((void **)&p, 2 * MB, 8 * GB);
	for (i = 0; i < 1000; i++) {
		char *_p = p;
		while (_p < p+4*GB) {
			assert(*_p == *(_p+4*GB));
			_p += 4096;
			asm volatile ("": : :"memory");
		}
	}
hzp:
Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
3505.727639 task-clock # 0.998 CPUs utilized ( +- 0.26% )
9 context-switches # 0.003 K/sec ( +- 4.97% )
4,384 page-faults # 0.001 M/sec ( +- 0.00% )
8,318,482,466 cycles # 2.373 GHz ( +- 0.26% ) [33.31%]
5,134,318,786 stalled-cycles-frontend # 61.72% frontend cycles idle ( +- 0.42% ) [33.32%]
2,193,266,208 stalled-cycles-backend # 26.37% backend cycles idle ( +- 5.51% ) [33.33%]
9,494,670,537 instructions # 1.14 insns per cycle
# 0.54 stalled cycles per insn ( +- 0.13% ) [41.68%]
2,108,522,738 branches # 601.451 M/sec ( +- 0.09% ) [41.68%]
158,746 branch-misses # 0.01% of all branches ( +- 1.60% ) [41.71%]
3,168,102,115 L1-dcache-loads
# 903.693 M/sec ( +- 0.11% ) [41.70%]
1,048,710,998 L1-dcache-misses
# 33.10% of all L1-dcache hits ( +- 0.11% ) [41.72%]
1,047,699,685 LLC-load
# 298.854 M/sec ( +- 0.03% ) [33.38%]
2,287 LLC-misses
# 0.00% of all LL-cache hits ( +- 8.27% ) [33.37%]
3,166,187,367 dTLB-loads
# 903.147 M/sec ( +- 0.02% ) [33.35%]
4,266,538 dTLB-misses
# 0.13% of all dTLB cache hits ( +- 0.03% ) [33.33%]
3.513339813 seconds time elapsed ( +- 0.26% )
vhzp:
Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
27313.891128 task-clock # 0.998 CPUs utilized ( +- 0.24% )
62 context-switches # 0.002 K/sec ( +- 0.61% )
4,384 page-faults # 0.160 K/sec ( +- 0.01% )
64,747,374,606 cycles # 2.370 GHz ( +- 0.24% ) [33.33%]
61,341,580,278 stalled-cycles-frontend # 94.74% frontend cycles idle ( +- 0.26% ) [33.33%]
56,702,237,511 stalled-cycles-backend # 87.57% backend cycles idle ( +- 0.07% ) [33.33%]
10,033,724,846 instructions # 0.15 insns per cycle
# 6.11 stalled cycles per insn ( +- 0.09% ) [41.65%]
2,190,424,932 branches # 80.195 M/sec ( +- 0.12% ) [41.66%]
1,028,630 branch-misses # 0.05% of all branches ( +- 1.50% ) [41.66%]
3,302,006,540 L1-dcache-loads
# 120.891 M/sec ( +- 0.11% ) [41.68%]
271,374,358 L1-dcache-misses
# 8.22% of all L1-dcache hits ( +- 0.04% ) [41.66%]
20,385,476 LLC-load
# 0.746 M/sec ( +- 1.64% ) [33.34%]
76,754 LLC-misses
# 0.38% of all LL-cache hits ( +- 2.35% ) [33.34%]
3,309,927,290 dTLB-loads
# 121.181 M/sec ( +- 0.03% ) [33.34%]
2,098,967,427 dTLB-misses
# 63.41% of all dTLB cache hits ( +- 0.03% ) [33.34%]
27.364448741 seconds time elapsed ( +- 0.24% )
===
I personally prefer the implementation presented in this patchset. It doesn't
touch arch-specific code.
This patch:
Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
zeros.
For now let's allocate the page on hugepage_init(). We'll switch to lazy
allocation later.
We are not going to map the huge zero page until we can handle it properly
on all code paths.
is_huge_zero_{pfn,pmd}() functions will be used by following patches to
check whether the pfn/pmd is huge zero page.
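A toy user-space model of such checks might look like this. Everything in it
is an assumption made for illustration: model_pmd_t, MODEL_PAGE_SHIFT and the
model_* functions are hypothetical, and the pmd is treated as a pfn shifted
left by the page shift with flag bits in the low bits.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12

typedef struct { unsigned long val; } model_pmd_t;

static unsigned long model_huge_zero_pfn;	/* 0 = not allocated yet */

/* True if pfn is the page frame number of the huge zero page. */
bool model_is_huge_zero_pfn(unsigned long pfn)
{
	return model_huge_zero_pfn != 0 && pfn == model_huge_zero_pfn;
}

/* True if the pmd points at the huge zero page; flag bits are ignored. */
bool model_is_huge_zero_pmd(model_pmd_t pmd)
{
	return model_is_huge_zero_pfn(pmd.val >> MODEL_PAGE_SHIFT);
}
```

Callers on the fault, copy, zap and split paths can then branch on these
predicates before touching a pmd that may map the shared page.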
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>