Linus Torvalds [Thu, 26 May 2016 21:10:32 +0000 (14:10 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
"This changeset has a few main parts:
- Ilya has finished a huge refactoring effort to sync up the
client-side logic in libceph with the user-space client code, which
has evolved significantly over the last couple years, with lots of
additional behaviors (e.g., how requests are handled when cluster
is full and transitions from full to non-full).
This structure of the code is more closely aligned with userspace
now such that it will be much easier to maintain going forward when
behavior changes take place. There are some locking improvements
bundled in as well.
- Zheng adds multi-filesystem support (multiple namespaces within the
same Ceph cluster)
- Zheng has changed the readdir offsets and directory enumeration so
that dentry offsets are hash-based and therefore stable across
directory fragmentation events on the MDS.
- Zheng has a smorgasbord of bug fixes across fs/ceph"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
ceph: fix wake_up_session_cb()
ceph: don't use truncate_pagecache() to invalidate read cache
ceph: SetPageError() for writeback pages if writepages fails
ceph: handle interrupted ceph_writepage()
ceph: make ceph_update_writeable_page() uninterruptible
libceph: make ceph_osdc_wait_request() uninterruptible
ceph: handle -EAGAIN returned by ceph_update_writeable_page()
ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM
ceph: block non-fatal signals for fault/page_mkwrite
ceph: make logical calculation functions return bool
ceph: tolerate bad i_size for symlink inode
ceph: improve fragtree change detection
ceph: keep leaf frag when updating fragtree
ceph: fix dir_auth check in ceph_fill_dirfrag()
ceph: don't assume frag tree splits in mds reply are sorted
ceph: fix inode reference leak
ceph: using hash value to compose dentry offset
ceph: don't forbid marking directory complete after forward seek
ceph: record 'offset' for each entry of readdir result
ceph: define 'end/complete' in readdir reply as bit flags
...
Linus Torvalds [Thu, 26 May 2016 17:33:33 +0000 (10:33 -0700)]
Merge tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"Highlights include:
Features:
- Add support for the NFS v4.2 COPY operation
- Add support for NFS/RDMA over IPv6
Bugfixes and cleanups:
- Avoid race that crashes nfs_init_commit()
- Fix oops in callback path
- Fix LOCK/OPEN race when unlinking an open file
- Choose correct stateids when using delegations in setattr, read and
write
- Don't send empty SETATTR after OPEN_CREATE
- xprtrdma: Prevent server from writing a reply into memory client
has released
- xprtrdma: Support using Read list and Reply chunk in one RPC call"
* tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (61 commits)
pnfs: pnfs_update_layout needs to consider if strict iomode checking is on
nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled
nfs/flexfiles: Helper function to detect FF_FLAGS_NO_READ_IO
nfs: avoid race that crashes nfs_init_commit
NFS: checking for NULL instead of IS_ERR() in nfs_commit_file()
pnfs: make pnfs_layout_process more robust
pnfs: rework LAYOUTGET retry handling
pnfs: lift retry logic from send_layoutget to pnfs_update_layout
pnfs: fix bad error handling in send_layoutget
flexfiles: add kerneldoc header to nfs4_ff_layout_prepare_ds
flexfiles: remove pointless setting of NFS_LAYOUT_RETURN_REQUESTED
pnfs: only tear down lsegs that precede seqid in LAYOUTRETURN args
pnfs: keep track of the return sequence number in pnfs_layout_hdr
pnfs: record sequence in pnfs_layout_segment when it's created
pnfs: don't merge new ff lsegs with ones that have LAYOUTRETURN bit set
pNFS/flexfiles: When initing reads or writes, we might have to retry connecting to DSes
pNFS/flexfiles: When checking for available DSes, conditionally check for MDS io
pNFS/flexfile: Fix erroneous fall back to read/write through the MDS
NFS: Reclaim writes via writepage are opportunistic
NFSv4: Use the right stateid for delegations in setattr, read and write
...
Linus Torvalds [Thu, 26 May 2016 17:13:40 +0000 (10:13 -0700)]
Merge tag 'xfs-for-linus-4.7-rc1' of git://git./linux/kernel/git/dgc/linux-xfs
Pull xfs updates from Dave Chinner:
"A pretty average collection of fixes, cleanups and improvements in
this request.
Summary:
- fixes for mount line parsing, sparse warnings, read-only compat
feature remount behaviour
- allow fast path symlink lookups for inline symlinks.
- attribute listing cleanups
- writeback goes direct to bios rather than indirecting through
bufferheads
- transaction allocation cleanup
- optimised kmem_realloc
- added configurable error handling for metadata write errors,
changed default error handling behaviour from "retry forever" to
"retry until unmount then fail"
- fixed several inode cluster writeback lookup vs reclaim race
conditions
- fixed inode cluster writeback checking wrong inode after lookup
- fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
- cleaned up inode reclaim tagging"
* tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
xfs: fix warning in xfs_finish_page_writeback for non-debug builds
xfs: move reclaim tagging functions
xfs: simplify inode reclaim tagging interfaces
xfs: rename variables in xfs_iflush_cluster for clarity
xfs: xfs_iflush_cluster has range issues
xfs: mark reclaimed inodes invalid earlier
xfs: xfs_inode_free() isn't RCU safe
xfs: optimise xfs_iext_destroy
xfs: skip stale inodes in xfs_iflush_cluster
xfs: fix inode validity check in xfs_iflush_cluster
xfs: xfs_iflush_cluster fails to abort on error
xfs: remove xfs_fs_evict_inode()
xfs: add "fail at unmount" error handling configuration
xfs: add configuration handlers for specific errors
xfs: add configuration of error failure speed
xfs: introduce table-based init for error behaviors
xfs: add configurable error support to metadata buffers
xfs: introduce metadata IO error class
xfs: configurable error behavior via sysfs
xfs: buffer ->bi_end_io function requires irq-safe lock
...
Linus Torvalds [Thu, 26 May 2016 16:48:23 +0000 (09:48 -0700)]
Merge branch 'hwmon-for-linus' of git://git./linux/kernel/git/jdelvare/staging
Pull hwmon fixlets from Jean Delvare.
* 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
Documentation/hwmon: Update links in max34440
hwmon: (emc2103) Fix typo in MODULE_PARM_DESC
Linus Torvalds [Thu, 26 May 2016 16:36:10 +0000 (09:36 -0700)]
Merge tag 'mmc-v4.7-rc1' of git://git.linaro.org/people/ulf.hansson/mmc
Pull MMC fixes from Ulf Hansson:
"Here are some mmc fixes intended for v4.7 rc1. They are based on a
commit earlier in the merge window and have been tested in linux-next
for a while.
MMC core:
- Prevent re-tuning while serving requests for RPMB partitions
- Extend timeout for long read time quirk to support more eMMCs
MMC host:
- sdhci-acpi: Ensure connected devices are powered when probing
- sdhci-pci|acpi: Remove unreliable MMC_CAP_BUS_WIDTH_TEST for Intel HWs
- dw_mmc: Correct the assigning of max_blk_size
- dw_mmc-rockchip: Allow RPMB partitions to be created
- dw_mmc-rockchip: Set the drive phase properly"
* tag 'mmc-v4.7-rc1' of git://git.linaro.org/people/ulf.hansson/mmc:
mmc: sdhci-acpi: Remove MMC_CAP_BUS_WIDTH_TEST for Intel controllers
mmc: sdhci-pci: Remove MMC_CAP_BUS_WIDTH_TEST for Intel controllers
mmc: longer timeout for long read time quirk
mmc: dw_mmc: rockchip: Set the drive phase properly
mmc: dw_mmc: fix the wrong max_blk_size
mmc: dw_mmc-rockchip: add MMC_CAP_CMD23 capabilities
mmc: sdhci-acpi: Ensure connected devices are powered when probing
ACPI / PM: Export acpi_device_fix_up_power()
mmc: block: Pause re-tuning while switched to the RPMB partition
mmc: block: Always switch back to main area after RPMB access
mmc: core: Add a facility to "pause" re-tuning
Linus Torvalds [Thu, 26 May 2016 16:23:43 +0000 (09:23 -0700)]
Merge branch 'next' of git://git./linux/kernel/git/rzhang/linux
Pull thermal management updates from Zhang Rui:
- Introduce generic ADC thermal driver, based on OF thermal (Laxman
Dewangan)
- Introduce new thermal driver for Tango chips (Marc Gonzalez)
- Rockchip driver support for RK3399, RK3366, and some fixes (Caesar
Wang, Elaine Zhang and Shawn Lin)
- Add CPU power cooling model to Mediatek thermal driver (Dawei Chien)
- Wider usage of dev_thermal_zone_of_sensor_register (Eduardo Valentin)
- TI thermal driver gained a new maintainer (Keerthy).
- Enabled powerclamp driver by checking CPU feature and package cstate
counter instead of CPU whitelist (Jacob Pan)
- Various fixes on thermal governor, OF thermal, Tegra, and RCAR
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (50 commits)
thermal: tango: initialize TEMPSI_CFG
thermal: rockchip: use the usleep_range instead of udelay
thermal: rockchip: add the notes for better reading
thermal: rockchip: Support RK3366 SoCs in the thermal driver
thermal: rockchip: handle the power sequence for tsadc controller
thermal: rockchip: update the tsadc table for rk3399
thermal: rockchip: fixes the code_to_temp for tsadc driver
thermal: rockchip: disable thermal->clk in err case
thermal: tegra: add Tegra132 specific SOC_THERM driver
thermal: fix ptr_ret.cocci warnings
thermal: mediatek: Add cpu dynamic power cooling model.
thermal: generic-adc: Add ADC based thermal sensor driver
thermal: generic-adc: Add DT binding for ADC based thermal sensor
thermal: tegra: fix static checker warning
thermal: tegra: mark PM functions __maybe_unused
thermal: add temperature sensor support for tango SoC
thermal: hisilicon: fix IRQ imbalance enabling
thermal: hisilicon: support to use any sensor
thermal: rcar: Remove binding docs for r8a7794
thermal: tegra: add PM support
...
Linus Torvalds [Thu, 26 May 2016 16:15:19 +0000 (09:15 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jmorris/linux-security
Pull Yama locking fix from James Morris:
"Fix for the Yama LSM"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
Yama: fix double-spinlock and user access in atomic context
Tom Haynes [Wed, 25 May 2016 14:31:14 +0000 (07:31 -0700)]
pnfs: pnfs_update_layout needs to consider if strict iomode checking is on
As flexfiles has FF_FLAGS_NO_READ_IO, there is a need to generically
support enforcing that a IOMODE_RW segment will not allow READ I/O.
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Tom Haynes [Wed, 25 May 2016 14:31:13 +0000 (07:31 -0700)]
nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Glenn Dayton [Thu, 26 May 2016 09:06:53 +0000 (11:06 +0200)]
Documentation/hwmon: Update links in max34440
It appears the website for maxim-ic.com changed to
maximintegrated.com.
Signed-off-by: Glenn Dayton <glenn.dayton24@gmail.com>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Dan Carpenter [Thu, 26 May 2016 09:06:53 +0000 (11:06 +0200)]
hwmon: (emc2103) Fix typo in MODULE_PARM_DESC
"apd" was intended here instead of "init".
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Linus Torvalds [Thu, 26 May 2016 00:37:33 +0000 (17:37 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes: EFI, entry code, pkeys and MPX fixes, TASK_SIZE cleanups
and a tsc frequency table fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Switch from TASK_SIZE to TASK_SIZE_MAX in the page fault code
x86/fsgsbase/64: Use TASK_SIZE_MAX for FSBASE/GSBASE upper limits
x86/mm/mpx: Work around MPX erratum SKD046
x86/entry/64: Fix stack return address retrieval in thunk
x86/efi: Fix 7-parameter efi_call()s
x86/cpufeature, x86/mm/pkeys: Fix broken compile-time disabling of pkeys
x86/tsc: Add missing Cherrytrail frequency to the table
Linus Torvalds [Thu, 26 May 2016 00:11:43 +0000 (17:11 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Two fixes: one for a lost wakeup, the other to fix the compiler
optimizing out preempt operations on ARM64 (and possibly other non-x86
architectures)"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix remote wakeups
sched/preempt: Fix preempt_count manipulations
Linus Torvalds [Thu, 26 May 2016 00:05:40 +0000 (17:05 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
"Mostly tooling and PMU driver fixes, but also a number of late updates
such as the reworking of the call-chain size limiting logic to make
call-graph recording more robust, plus tooling side changes for the
new 'backwards ring-buffer' extension to the perf ring-buffer"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
perf record: Read from backward ring buffer
perf record: Rename variable to make code clear
perf record: Prevent reading invalid data in record__mmap_read
perf evlist: Add API to pause/resume
perf trace: Use the ptr->name beautifier as default for "filename" args
perf trace: Use the fd->name beautifier as default for "fd" args
perf report: Add srcline_from/to branch sort keys
perf evsel: Record fd into perf_mmap
perf evsel: Add overwrite attribute and check write_backward
perf tools: Set buildid dir under symfs when --symfs is provided
perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
perf annotate: Sort list of recognised instructions
perf annotate: Fix identification of ARM blt and bls instructions
perf tools: Fix usage of max_stack sysctl
perf callchain: Stop validating callchains by the max_stack sysctl
perf trace: Fix exit_group() formatting
perf top: Use machine->kptr_restrict_warned
perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
perf machine: Do not bail out if not managing to read ref reloc symbol
perf/x86/intel/p4: Trival indentation fix, remove space
...
Jann Horn [Sun, 22 May 2016 04:01:34 +0000 (06:01 +0200)]
Yama: fix double-spinlock and user access in atomic context
Commit
8a56038c2aef ("Yama: consolidate error reporting") causes lockups
when someone hits a Yama denial. Call chain:
process_vm_readv -> process_vm_rw -> process_vm_rw_core -> mm_access
-> ptrace_may_access
task_lock(...) is taken
__ptrace_may_access -> security_ptrace_access_check
-> yama_ptrace_access_check -> report_access -> kstrdup_quotable_cmdline
-> get_cmdline -> access_process_vm -> get_task_mm
task_lock(...) is taken again
task_lock(p) just calls spin_lock(&p->alloc_lock), so at this point,
spin_lock() is called on a lock that is already held by the current
process.
Also: Since the alloc_lock is a spinlock, sleeping inside
security_ptrace_access_check hooks is probably not allowed at all? So it's
not even possible to print the cmdline from in there because that might
involve paging in userspace memory.
It would be tempting to rewrite ptrace_may_access() to drop the alloc_lock
before calling the LSM, but even then, ptrace_may_access() itself might be
called from various contexts in which you're not allowed to sleep; for
example, as far as I understand, to be able to hold a reference to another
task, usually an RCU read lock will be taken (see e.g. kcmp() and
get_robust_list()), so that also prohibits sleeping. (And using e.g. FUSE,
a user can cause pagefault handling to take arbitrary amounts of time -
see https://bugs.chromium.org/p/project-zero/issues/detail?id=808.)
Therefore, AFAIK, in order to print the name of a process below
security_ptrace_access_check(), you'd have to either grab a reference to
the mm_struct and defer the access violation reporting or just use the
"comm" value that's stored in kernelspace and accessible without big
complications. (Or you could try to use some kind of atomic remote VM
access that fails if the memory isn't paged in, similar to
copy_from_user_inatomic(), and if necessary fall back to comm, but
that'd be kind of ugly because the comm/cmdline choice would look
pretty random to the user.)
Fix it by deferring reporting of the access violation until current
exits kernelspace the next time.
v2: Don't oops on PTRACE_TRACEME, call report_access under
task_lock(current). Also fix nonsensical comment. And don't use
GPF_ATOMIC for memory allocation with no locks held.
This patch is tested both for ptrace attach and ptrace traceme.
Fixes:
8a56038c2aef ("Yama: consolidate error reporting")
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Linus Torvalds [Wed, 25 May 2016 23:52:19 +0000 (16:52 -0700)]
Merge branch 'core-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull objtool build fix from Ingo Molnar:
"An libtool fix for older libelf versions"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Allow building with older libelf
Yan, Zheng [Thu, 19 May 2016 11:15:19 +0000 (19:15 +0800)]
ceph: fix wake_up_session_cb()
We should reset i_requested_max_size before waking the waiters.
(zero i_requested_max_size make waiter re-request the max size)
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Wed, 18 May 2016 12:58:26 +0000 (20:58 +0800)]
ceph: don't use truncate_pagecache() to invalidate read cache
truncate_pagecache() drops dirty pages, it's dangerous to use it
to invalidate read cache. Besides, we shouldn't start invalidating
read cache while there are buffer writers. Because buffer writers
may add dirty pages later.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Fri, 13 May 2016 09:54:17 +0000 (17:54 +0800)]
ceph: SetPageError() for writeback pages if writepages fails
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Fri, 13 May 2016 09:29:51 +0000 (17:29 +0800)]
ceph: handle interrupted ceph_writepage()
writepage() can be interrupted when it's called by direct memory
reclaimer (the direct memory relaimer is killed). To avoid lossing
data, we redirty the page.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Fri, 13 May 2016 03:30:24 +0000 (11:30 +0800)]
ceph: make ceph_update_writeable_page() uninterruptible
ceph_update_writeable_page() is used by ceph_write_begin(). It beaks
atomicity of write operation if it's interruptible.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Fri, 13 May 2016 03:04:33 +0000 (11:04 +0800)]
libceph: make ceph_osdc_wait_request() uninterruptible
Ceph_osdc_wait_request() is used when cephfs issues sync IO. In most
cases, the sync IO should be uninterruptible. The fix is use killale
wait function in ceph_osdc_wait_request().
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Tue, 10 May 2016 11:09:06 +0000 (19:09 +0800)]
ceph: handle -EAGAIN returned by ceph_update_writeable_page()
when ceph_update_writeable_page() return -EAGAIN, caller should
lock the page and call ceph_update_writeable_page() again.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Tue, 10 May 2016 10:59:13 +0000 (18:59 +0800)]
ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Tue, 10 May 2016 10:40:28 +0000 (18:40 +0800)]
ceph: block non-fatal signals for fault/page_mkwrite
Fault and page_mkwrite are supposed to be uninterruptable. But they
call ceph functions that are interruptible. So they should block
signals before calling functions that are interruptible
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Zhang Zhuoyu [Fri, 25 Mar 2016 09:18:39 +0000 (05:18 -0400)]
ceph: make logical calculation functions return bool
This patch makes serverl logical caculation functions return bool to
improve readability due to these particular functions only using 0/1
as their return value.
No functional change.
Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
Yan, Zheng [Thu, 5 May 2016 08:40:17 +0000 (16:40 +0800)]
ceph: tolerate bad i_size for symlink inode
A mds bug can cause symlink's size to be truncated to zero.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Wed, 4 May 2016 03:40:30 +0000 (11:40 +0800)]
ceph: improve fragtree change detection
check if number of splits in i_fragtree is equal to number of splits
in mds reply
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Wed, 4 May 2016 03:05:10 +0000 (11:05 +0800)]
ceph: keep leaf frag when updating fragtree
Nodes in i_fragtree are sorted according to ceph_compare_frag().
It means frag node in i_fragtree always follow its direct parent
node. To check if a leaf node is valid, we just need to check if
it's child of previous split node.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Tue, 3 May 2016 14:33:20 +0000 (22:33 +0800)]
ceph: fix dir_auth check in ceph_fill_dirfrag()
-1 is CDIR_AUTH_PARENT, it means dir's auth mds is the same as
inode's auth mds
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Tue, 3 May 2016 12:55:50 +0000 (20:55 +0800)]
ceph: don't assume frag tree splits in mds reply are sorted
The algorithm that updates i_fragtree relies on that the frag tree
splits in mds reply are of the same order of i_fragtree. This is not
true because current MDS encodes frag tree splits in ascending order
of (unsigned)frag_t. But nodes in i_fragtree are sorted according to
ceph_frag_compare().
The fix is sort the frag tree splits first, then updates i_fragtree.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Fri, 29 Apr 2016 15:40:23 +0000 (23:40 +0800)]
ceph: fix inode reference leak
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Fri, 29 Apr 2016 03:27:30 +0000 (11:27 +0800)]
ceph: using hash value to compose dentry offset
If MDS sorts dentries in dirfrag in hash order, we use hash value to
compose dentry offset. dentry offset is:
(0xff << 52) | ((24 bits hash) << 28) |
(the nth entry hash hash collision)
This offset is stable across directory fragmentation. This alos means
there is no need to reset readdir offset if directory get fragmented
in the middle of readdir.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Thu, 28 Apr 2016 14:56:44 +0000 (22:56 +0800)]
ceph: don't forbid marking directory complete after forward seek
Forward seek within same frag does not update fi->last_name, it will
not affect contents of later readdir reply. So there is no need to
forbid marking directory complete
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Thu, 28 Apr 2016 07:17:40 +0000 (15:17 +0800)]
ceph: record 'offset' for each entry of readdir result
This is preparation for using hash value as dentry 'offset'
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Wed, 27 Apr 2016 09:48:30 +0000 (17:48 +0800)]
ceph: define 'end/complete' in readdir reply as bit flags
Set a flag in readdir request, which indicates that client interprets
'end/complete' as bit flags. So that mds can reply additional flags in
readdir reply.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Thu, 28 Apr 2016 01:37:39 +0000 (09:37 +0800)]
ceph: define struct for dir entry in readdir reply
This avoids defining multiple arrays for entries in readdir reply
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Wed, 27 Apr 2016 09:32:34 +0000 (17:32 +0800)]
ceph: simplify 'offset in frag'
don't distinguish leftmost frag from other frags. always use 2 as
first entry's offset.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Fri, 29 Apr 2016 07:58:32 +0000 (15:58 +0800)]
ceph: remove unnecessary checks in __dcache_readdir
we never add snapdir and the hidden .ceph dir into readdir cache
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Thu, 28 Apr 2016 09:43:35 +0000 (17:43 +0800)]
ceph: search cache postion for dcache readdir
use binary search to find cache index that corresponds to readdir
postion.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Thu, 21 Apr 2016 04:11:54 +0000 (12:11 +0800)]
ceph: use CEPH_MDS_OP_RMXATTR request to remove xattr
Setxattr with NULL value and XATTR_REPLACE flag should be equivalent
to removexattr. But current MDS does not support deleting vxattrs through
MDS_OP_SETXATTR request. The workaround is sending MDS_OP_RMXATTR request
if setxattr actually removs xattr.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Thu, 21 Apr 2016 03:09:55 +0000 (11:09 +0800)]
ceph: report mount root in session metadata
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Mon, 18 Apr 2016 08:51:37 +0000 (16:51 +0800)]
ceph: don't show symlink target in debugfs/mdsc
symlink target is useless for debug and can be very long. It's annoying
to show it in debugfs/mdsc.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Fri, 15 Apr 2016 05:56:12 +0000 (13:56 +0800)]
ceph: don't call truncate_pagecache in ceph_writepages_start
truncate_pagecache() may decrease inode's reference. This can cause
deadlock if inode's last reference is dropped and iput_final() wants
to evict the inode. (evict() calls inode_wait_for_writeback(), which
waits for ceph_writepages_start() to return).
The fix is use work thead to truncate dirty pages. Also add 'forced
umount' check to ceph_update_writeable_page(), which prevents new
pages getting dirty.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Fri, 8 Apr 2016 07:27:16 +0000 (15:27 +0800)]
ceph: renew caps for read/write if mds session got killed.
When mds session gets killed, read/write operation may hang.
Client waits for Frw caps, but mds does not know what caps client
wants. To recover this, client sends an open request to mds. The
request will tell mds what caps client wants.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Thu, 31 Mar 2016 07:53:01 +0000 (15:53 +0800)]
ceph: CEPH_FEATURE_MDSENC support
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Wed, 30 Mar 2016 09:18:34 +0000 (17:18 +0800)]
ceph: multiple filesystem support
To access non-default filesystem, we just need to subscribe to
mdsmap.<MDS_NAMESPACE_ID> and add a new mount option for mds
namespace id.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: switch to a new libceph API]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Wed, 25 May 2016 22:05:01 +0000 (00:05 +0200)]
libceph: support for subscribing to "mdsmap.<id>" maps
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:28 +0000 (16:07 +0200)]
libceph: replace ceph_monc_request_next_osdmap()
... with a wrapper around maybe_request_map() - no need for two
osdmap-specific functions.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:27 +0000 (16:07 +0200)]
libceph: take osdc->lock in osdmap_show() and dump flags in hex
There is now about a dozen CEPH_OSDMAP_* flags. This is a debugging
interface, so just dump in hex instead of spelling each flag out.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:27 +0000 (16:07 +0200)]
libceph: pool deletion detection
This adds the "map check" infrastructure for sending osdmap version
checks on CALC_TARGET_POOL_DNE and completing in-flight requests with
-ENOENT if the target pool doesn't exist or has just been deleted.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:27 +0000 (16:07 +0200)]
libceph: async MON client generic requests
For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION
messages asynchronously and get a callback on completion. Refactor MON
client to allow firing off generic requests asynchronously and add an
async variant of ceph_monc_get_version(). ceph_monc_do_statfs() is
switched over and remains sync.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:27 +0000 (16:07 +0200)]
libceph: support for checking on status of watch
Implement ceph_osdc_watch_check() to be able to check on status of
watch. Note that the time it takes for a watch/notify event to get
delivered through the notify_wq is taken into account.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:27 +0000 (16:07 +0200)]
libceph: support for sending notifies
Implement ceph_osdc_notify() for sending notifies.
Due to the fact that the current messenger can't do read-in into
pagelists (it can only do write-out from them), I had to go with a page
vector for a NOTIFY_COMPLETE payload, for now.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Wed, 25 May 2016 23:15:02 +0000 (01:15 +0200)]
libceph, rbd: ceph_osd_linger_request, watch/notify v2
This adds support and switches rbd to a new, more reliable version of
watch/notify protocol. As with the OSD client update, this is mostly
about getting the right structures linked into the right places so that
reconnects are properly sent when needed. watch/notify v2 also
requires sending regular pings to the OSDs - send_linger_ping().
A major change from the old watch/notify implementation is the
introduction of ceph_osd_linger_request - linger requests no longer
piggy back on ceph_osd_request. ceph_osd_event has been merged into
ceph_osd_linger_request.
All the details are now hidden within libceph, the interface consists
of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
the lifetime management simple.
ceph_osdc_notify_ack() accepts an optional data payload, which is
relayed back to the notifier.
Portions of this patch are loosely based on work by Douglas Fuller
<dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:26 +0000 (16:07 +0200)]
rbd: rbd_dev_header_unwatch_sync() variant
Introduce __rbd_dev_header_unwatch_sync(), which doesn't flush notify
callbacks. This is for the new rados_watcherrcb_t, which would be
called from a notify callback.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:26 +0000 (16:07 +0200)]
libceph: wait_request_timeout()
The unwatch timeout is currently implemented in rbd. With
watch/unwatch code moving into libceph, we are going to need
a ceph_osdc_wait_request() variant with a timeout.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:26 +0000 (16:07 +0200)]
libceph: request_init() and request_release_checks()
These are going to be used by request_reinit() code.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:26 +0000 (16:07 +0200)]
libceph: a major OSD client update
This is a major sync up, up to ~Jewel. The highlights are:
- per-session request trees (vs a global per-client tree)
- per-session locking (vs a global per-client rwlock)
- homeless OSD session
- no ad-hoc global per-client lists
- support for pool quotas
- foundation for watch/notify v2 support
- foundation for map check (pool deletion detection) support
The switchover is incomplete: lingering requests can be setup and
teared down but aren't ever reestablished. This functionality is
restored with the introduction of the new lingering infrastructure
(ceph_osd_linger_request, linger_work, etc) in a later commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:26 +0000 (16:07 +0200)]
libceph: protect osdc->osd_lru list with a spinlock
OSD client is getting moved from the big per-client lock to a set of
per-session locks. The big rwlock would only be held for read most of
the time, so a global osdc->osd_lru needs additional protection.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:25 +0000 (16:07 +0200)]
libceph: allocate ceph_osd with GFP_NOFAIL
create_osd() is called way too deep in the stack to be able to error
out in a sane way; a failing create_osd() just messes everything up.
The current req_notarget list solution is broken - the list is never
traversed as it's not entirely clear when to do it, I guess.
If we were to start traversing it at regular intervals and retrying
each request, we wouldn't be far off from what __GFP_NOFAIL is doing,
so allocate OSD sessions with __GFP_NOFAIL, at least until we come up
with a better fix.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:25 +0000 (16:07 +0200)]
libceph: osd_init() and osd_cleanup()
These are going to be used by homeless OSD sessions code.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:25 +0000 (16:07 +0200)]
libceph: handle_one_map()
Separate osdmap handling from decoding and iterating over a bag of maps
in a fresh MOSDMap message. This sets up the scene for the updated OSD
client.
Of particular importance here is the addition of pi->was_full, which
can be used to answer "did this pool go full -> not-full in this map?".
This is the key bit for supporting pool quotas.
We won't be able to downgrade map_sem for much longer, so drop
downgrade_write().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Linus Torvalds [Wed, 25 May 2016 22:59:09 +0000 (15:59 -0700)]
Merge branch 'work.iov_iter' of git://git./linux/kernel/git/viro/vfs
Pull vfs iov_iter regression fix from Al Viro:
"Fix for braino in 'fold checks into iterate_and_advance()'"
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
do "fold checks into iterate_and_advance()" right
Linus Torvalds [Wed, 25 May 2016 22:54:35 +0000 (15:54 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs
Pull vfs xattr regression fixes from Al Viro.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
make xattr_resolve_handlers() safe to use with NULL ->s_xattr
xattr: Fail with -EINVAL for NULL attribute names
Linus Torvalds [Wed, 25 May 2016 22:38:56 +0000 (15:38 -0700)]
Merge tag 'acpi-4.7-rc1-more' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"Additional ACPI update for v4.7-rc1
Just one fix for incorrect async_synchronize_cookie() usage in the
ACPI battery driver (Chris Wilson)"
* tag 'acpi-4.7-rc1-more' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / battery: Correctly serialise with the pending async probe
Ilya Dryomov [Thu, 28 Apr 2016 14:07:25 +0000 (16:07 +0200)]
libceph: allocate dummy osdmap in ceph_osdc_init()
This leads to a simpler osdmap handling code, particularly when dealing
with pi->was_full, which is introduced in a later commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:24 +0000 (16:07 +0200)]
libceph: schedule tick from ceph_osdc_init()
Both homeless OSD sessions and watch/notify v2, introduced in later
commits, require periodic ticks which don't depend on ->num_requests.
Schedule the initial tick from ceph_osdc_init() and reschedule from
handle_timeout() unconditionally.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:24 +0000 (16:07 +0200)]
libceph: move schedule_delayed_work() in ceph_osdc_init()
ceph_osdc_stop() isn't called if ceph_osdc_init() fails, so we end up
with handle_osds_timeout() running on invalid memory if any one of the
allocations fails. Call schedule_delayed_work() after everything is
setup, just before returning.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:24 +0000 (16:07 +0200)]
libceph: redo callbacks and factor out MOSDOpReply decoding
If you specify ACK | ONDISK and set ->r_unsafe_callback, both
->r_callback and ->r_unsafe_callback(true) are called on ack. This is
very confusing. Redo this so that only one of them is called:
->r_unsafe_callback(true), on ack
->r_unsafe_callback(false), on commit
or
->r_callback, on ack|commit
Decode everything in decode_MOSDOpReply() to reduce clutter.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:24 +0000 (16:07 +0200)]
libceph: drop msg argument from ceph_osdc_callback_t
finish_read(), its only user, uses it to get to hdr.data_len, which is
what ->r_result is set to on success. This gains us the ability to
safely call callbacks from contexts other than reply, e.g. map check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Wed, 25 May 2016 22:29:52 +0000 (00:29 +0200)]
libceph: switch to calc_target(), part 2
The crux of this is getting rid of ceph_osdc_build_request(), so that
MOSDOp can be encoded not before but after calc_target() calculates the
actual target. Encoding now happens within ceph_osdc_start_request().
Also nuked is the accompanying bunch of pointers into the encoded
buffer that was used to update fields on each send - instead, the
entire front is re-encoded. If we want to support target->name_len !=
base->name_len in the future, there is no other way, because oid is
surrounded by other fields in the encoded buffer.
Encoding OSD ops and adding data items to the request message were
mixed together in osd_req_encode_op(). While we want to re-encode OSD
ops, we don't want to add duplicate data items to the message when
resending, so all call to ceph_osdc_msg_data_add() are factored out
into a new setup_request_data().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:23 +0000 (16:07 +0200)]
libceph: switch to calc_target(), part 1
Replace __calc_request_pg() and most of __map_request() with
calc_target() and start using req->r_t.
ceph_osdc_build_request() however still encodes base_oid, because it's
called before calc_target() is and target_oid is empty at that point in
time; a printf in osdc_show() also shows base_oid. This is fixed in
"libceph: switch to calc_target(), part 2".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:23 +0000 (16:07 +0200)]
libceph: introduce ceph_osd_request_target, calc_target()
Introduce ceph_osd_request_target, containing all mapping-related
fields of ceph_osd_request and calc_target() for calculating mappings
and populating it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:23 +0000 (16:07 +0200)]
libceph: pi->min_size, pi->last_force_request_resend
Add and decode pi->min_size and pi->last_force_request_resend. These
are going to be used by calc_target().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:23 +0000 (16:07 +0200)]
libceph: make pgid_cmp() global
calc_target() code is going to need to know how to compare PGs. Take
lhs and rhs pgid by const * while at it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:23 +0000 (16:07 +0200)]
libceph: rename ceph_calc_pg_primary()
Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
emphasise that it returns acting primary.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:22 +0000 (16:07 +0200)]
libceph: ceph_osds, ceph_pg_to_up_acting_osds()
Knowning just acting set isn't enough, we need to be able to record up
set as well to detect interval changes. This means returning (up[],
up_len, up_primary, acting[], acting_len, acting_primary) and passing
it around. Introduce and switch to ceph_osds to help with that.
Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
both up and acting sets from it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:22 +0000 (16:07 +0200)]
libceph: rename ceph_oloc_oid_to_pg()
Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg(). Emphasise
that returned is raw PG and return -ENOENT instead of -EIO if the pool
doesn't exist.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:22 +0000 (16:07 +0200)]
libceph: fix ceph_eversion encoding
eversion_t is version+epoch in userspace and is encoded in that order.
ceph_eversion is defined as epoch+version in rados.h, yet we memcpy it
in __send_request(). Reoder ceph_eversion fields.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:22 +0000 (16:07 +0200)]
libceph: DEFINE_RB_FUNCS macro
Given
struct foo {
u64 id;
struct rb_node bar_node;
};
generate insert_bar(), erase_bar() and lookup_bar() functions with
DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)
The key is assumed to be an integer (u64, int, etc), compared with
< and >. nodefld has to be initialized with RB_CLEAR_NODE().
Start using it for MDS, MON and OSD requests and OSD sessions.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:22 +0000 (16:07 +0200)]
libceph: open-code remove_{all,old}_osds()
They are called only once, from ceph_osdc_stop() and
handle_osds_timeout() respectively.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 28 Apr 2016 14:07:21 +0000 (16:07 +0200)]
libceph: nuke unused fields and functions
Either unused or useless:
osdmap->mkfs_epoch
osd->o_marked_for_keepalive
monc->num_generic_requests
osdc->map_waiters
osdc->last_requested_map
osdc->timeout_tid
osd_req_op_cls_response_data()
osdmap_apply_incremental() @msgr arg
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Fri, 29 Apr 2016 18:01:25 +0000 (20:01 +0200)]
rbd: use header_oid instead of header_name
Switch to ceph_object_id and use ceph_oid_aprintf() instead of a bare
const char *. This reduces noise in rbd_dev_header_name().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Fri, 29 Apr 2016 17:54:20 +0000 (19:54 +0200)]
libceph: variable-sized ceph_object_id
Currently ceph_object_id can hold object names of up to 100
(CEPH_MAX_OID_NAME_LEN) characters. This is enough for all use cases,
expect one - long rbd image names:
- a format 1 header is named "<imgname>.rbd"
- an object that points to a format 2 header is named "rbd_id.<imgname>"
We operate on these potentially long-named objects during rbd map, and,
for format 1 images, during header refresh. (A format 2 header name is
a small system-generated string.)
Lift this 100 character limit by making ceph_object_id be able to point
to an externally-allocated string. Apart from being able to work with
almost arbitrarily-long named objects, this allows us to reduce the
size of ceph_object_id from >100 bytes to 64 bytes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Wed, 27 Apr 2016 16:32:56 +0000 (18:32 +0200)]
libceph: change how osd_op_reply message size is calculated
For a message pool message, preallocate a page, just like we do for
osd_op. For a normal message, take ceph_object_id into account and
don't bother subtracting CEPH_OSD_SLAB_OPS ceph_osd_ops.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Wed, 27 Apr 2016 12:15:51 +0000 (14:15 +0200)]
libceph: move message allocation out of ceph_osdc_alloc_request()
The size of ->r_request and ->r_reply messages depends on the size of
the object name (ceph_object_id), while the size of ceph_osd_request is
fixed. Move message allocation into a separate function that would
have to be called after ceph_object_id and ceph_object_locator (which
is also going to become variable in size with RADOS namespaces) have
been filled in:
req = ceph_osdc_alloc_request(...);
<fill in req->r_base_oid>
<fill in req->r_base_oloc>
ceph_osdc_alloc_messages(req);
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Tue, 26 Apr 2016 13:39:47 +0000 (15:39 +0200)]
libceph: grab snapc in ceph_osdc_alloc_request()
ceph_osdc_build_request() is going away. Grab snapc and initialize
->r_snapid in ceph_osdc_alloc_request().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Tue, 26 Apr 2016 13:05:29 +0000 (15:05 +0200)]
libceph: make ceph_osdc_put_request() accept NULL
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 May 2016 11:18:57 +0000 (13:18 +0200)]
rbd: get/put img_request in rbd_img_request_submit()
By the time we get to checking for_each_obj_request_safe(img_request)
terminating condition, all obj_requests may be complete and img_request
ref, that rbd_img_request_submit() takes away from its caller, may be
put. Moving the next_obj_request cursor is then a use-after-free on
img_request.
It's totally benign, as the value that's read is never used, but
I think it's still worth fixing.
Cc: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Linus Torvalds [Wed, 25 May 2016 22:29:21 +0000 (15:29 -0700)]
Merge tag 'pm-4.7-rc1-more' of git://git./linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These are two stable-candidate fixes (PM core, cpuidle) and a bunch of
cpufreq cleanups.
Specifics:
- Stable-candidate cpuidle fix to make it check the right variable
when deciding whether or not to enable interrupts on the local CPU
so as to avoid enabling iterrupts too early in some cases if the
system has both coupled and per-core idle states (Daniel Lezcano).
- Stable-candidate PM core fix to make it handle failures at the
"late suspend" stage of device suspend consistently for all devices
regardless of whether or not async suspend/resume is enabled for
them (Rafael Wysocki).
- Cleanups in the cpufreq core, the schedutil governor and the
intel_pstate driver (Rafael Wysocki, Pankaj Gupta, Viresh Kumar)"
* tag 'pm-4.7-rc1-more' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / sleep: Handle failures in device_suspend_late() consistently
cpufreq: schedutil: Improve prints messages with pr_fmt
cpuidle: Fix cpuidle_state_is_coupled() argument in cpuidle_enter()
cpufreq: simplified goto out in cpufreq_register_driver()
cpufreq: governor: CPUFREQ_GOV_STOP never fails
cpufreq: governor: CPUFREQ_GOV_POLICY_EXIT never fails
intel_pstate: Simplify conditional in intel_pstate_set_policy()
Al Viro [Wed, 25 May 2016 21:36:19 +0000 (17:36 -0400)]
do "fold checks into iterate_and_advance()" right
the only case when we should skip the iterate_and_advance() guts
is when nothing's left in the iterator, _not_ just when requested
amount is 0. Said guts will do nothing in the latter case anyway;
the problem we tried to deal with in the aforementioned commit is
that when there's nothing left *and* the amount requested is 0,
we might end up deferencing one iovec too many; the value we fetch
from there is discarded in that case, but theoretically it might
oops if the iovec array ends exactly at the end of page with the
next page not mapped.
Bailing out on zero size requested had an unexpected side effect -
zero-length segment in the beginning of iovec array ended up
throwing do_loop_readv_writev() into infinite spin; we do not
advance past the empty segment at all. Reproducer is trivial:
echo '#include <sys/uio.h>' >a.c
echo 'main() {char c; struct iovec v[] = {{&c,0},{&c,1}}; readv(0,v,2);}' >>a.c
cc a.c && ./a.out </proc/uptime
which should end up with the process not hanging. Probably ought to
go into LTP or xfstests...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 25 May 2016 21:34:41 +0000 (17:34 -0400)]
make xattr_resolve_handlers() safe to use with NULL ->s_xattr
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Andreas Gruenbacher [Mon, 9 May 2016 11:28:49 +0000 (13:28 +0200)]
xattr: Fail with -EINVAL for NULL attribute names
Commit
98e9cb57 improved the xattr name checks in xattr_resolve_name but
didn't update the NULL attribute name check appropriately, so NULL
attribute names lead to NULL pointer dereferences. Turn that into
-EINVAL results instead.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
fs/xattr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rafael J. Wysocki [Wed, 25 May 2016 20:11:28 +0000 (22:11 +0200)]
Merge branch 'acpi-battery'
* acpi-battery:
ACPI / battery: Correctly serialise with the pending async probe
Rafael J. Wysocki [Wed, 25 May 2016 19:54:45 +0000 (21:54 +0200)]
Merge branches 'pm-cpufreq', 'pm-cpuidle' and 'pm-core'
* pm-cpufreq:
cpufreq: schedutil: Improve prints messages with pr_fmt
cpufreq: simplified goto out in cpufreq_register_driver()
cpufreq: governor: CPUFREQ_GOV_STOP never fails
cpufreq: governor: CPUFREQ_GOV_POLICY_EXIT never fails
intel_pstate: Simplify conditional in intel_pstate_set_policy()
* pm-cpuidle:
cpuidle: Fix cpuidle_state_is_coupled() argument in cpuidle_enter()
* pm-core:
PM / sleep: Handle failures in device_suspend_late() consistently
Linus Torvalds [Wed, 25 May 2016 17:40:15 +0000 (10:40 -0700)]
Merge tag 'pwm/for-4.7-rc1' of git://git./linux/kernel/git/thierry.reding/linux-pwm
Pull pwm updates from Thierry Reding:
"This set of changes introduces an atomic API to the PWM subsystem.
This is influenced by the DRM atomic API that was introduced a while
back, though it is obviously a lot simpler. The fundamental idea
remains the same, though: drivers provide a single callback to
implement the atomic configuration of a PWM channel.
As a side-effect the PWM subsystem gains the ability for initial state
retrieval, so that the logical state mirrors that of the hardware.
Many use-cases don't care about this, but for others it is essential.
These new features require changes in all users, which these patches
take care of. The core is transitioned to use the atomic callback if
available and provides a fallback mechanism for other drivers.
Changes to transition users and drivers to the atomic API are
postponed to v4.8"
* tag 'pwm/for-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm: (30 commits)
pwm: Add information about polarity, duty cycle and period to debugfs
pwm: Switch to the atomic API
pwm: Update documentation
pwm: Add core infrastructure to allow atomic updates
pwm: Add hardware readout infrastructure
pwm: Move the enabled/disabled info into pwm_state
pwm: Introduce the pwm_state concept
pwm: Keep PWM state in sync with hardware state
ARM: Explicitly apply PWM config extracted from pwm_args
drm: i915: Explicitly apply PWM config extracted from pwm_args
input: misc: pwm-beeper: Explicitly apply PWM config extracted from pwm_args
input: misc: max8997: Explicitly apply PWM config extracted from pwm_args
backlight: lm3630a: explicitly apply PWM config extracted from pwm_args
backlight: lp855x: Explicitly apply PWM config extracted from pwm_args
backlight: lp8788: Explicitly apply PWM config extracted from pwm_args
backlight: pwm_bl: Use pwm_get_args() where appropriate
fbdev: ssd1307fb: Use pwm_get_args() where appropriate
regulator: pwm: Use pwm_get_args() where appropriate
leds: pwm: Use pwm_get_args() where appropriate
input: misc: max77693: Use pwm_get_args() where appropriate
...
Tom Haynes [Wed, 25 May 2016 14:31:12 +0000 (07:31 -0700)]
nfs/flexfiles: Helper function to detect FF_FLAGS_NO_READ_IO
The mds can inform the client not to use the IOMODE_RW layout
segment for doing READs. I.e., it is basically a
IOMODE_WRITE layout segment.
It would do this to not interfere with the WRITEs.
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Linus Torvalds [Wed, 25 May 2016 17:19:17 +0000 (10:19 -0700)]
Merge git://www.linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
- add support for Fintek F81865 Super-IO chip
- add support for watchdogs (RWDT and SWDT) found on RCar Gen3 based
SoCs from Renesas
- octeon: Handle the FROZEN hot plug notifier actions
- f71808e_wdt fixes and cleanups
- some small improvements in code and documentation
* git://www.linux-watchdog.org/linux-watchdog:
MAINTAINERS: Add file patterns for watchdog device tree bindings
Documentation: Add ebc-c384_wdt watchdog-parameters.txt entry
watchdog: shwdt: Use setup_timer()
watchdog: cpwd: Use setup_timer()
arm64: defconfig: enable Renesas Watchdog Timer
watchdog: renesas-wdt: add driver
watchdog: remove error message when unable to allocate watchdog device
watchdog: f71808e_wdt: Fix WDTMOUT_STS register read
watchdog: f71808e_wdt: Fix typo
watchdog: f71808e_wdt: Add F81865 support
watchdog: sp5100_tco: properly check for new register layouts
watchdog: core: Fix circular locking dependency
watchdog: core: fix trivial typo in a comment
watchdog: hpwdt: Adjust documentation to match latest kernel module parameters.
watchdog: imx2_wdt: add external reset support via dt prop
watchdog: octeon: Handle the FROZEN hot plug notifier actions.
watchdog: qcom: Report reboot reason
Weston Andros Adamson [Wed, 25 May 2016 14:07:23 +0000 (10:07 -0400)]
nfs: avoid race that crashes nfs_init_commit
Since the patch "NFS: Allow multiple commit requests in flight per file"
we can run multiple simultaneous commits on the same inode. This
introduced a race over collecting pages to commit that made it possible
to call nfs_init_commit() with an empty list - which causes crashes like
the one below.
The fix is to catch this race and avoid calling nfs_init_commit and
initiate_commit when there is no work to do.
Here is the crash:
[600522.076832] BUG: unable to handle kernel NULL pointer dereference at
0000000000000040
[600522.078475] IP: [<
ffffffffa0479e72>] nfs_init_commit+0x22/0x130 [nfs]
[600522.078745] PGD
4272b1067 PUD
4272cb067 PMD 0
[600522.078972] Oops: 0000 [#1] SMP
[600522.079204] Modules linked in: nfsv3 nfs_layout_flexfiles rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache dcdbas ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw vmw_vsock_vmci_transport vsock bonding ipmi_devintf ipmi_msghandler coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev vmw_balloon parport_pc parport acpi_cpufreq vmw_vmci i2c_piix4 shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c vmwgfx drm_kms_helper ttm drm crc32c_intel serio_raw vmxnet3
[600522.081380] vmw_pvscsi ata_generic pata_acpi
[600522.081809] CPU: 3 PID: 15667 Comm: /usr/bin/python Not tainted 4.1.9-100.pd.88.el7.x86_64 #1
[600522.082281] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[600522.082814] task:
ffff8800bbbfa780 ti:
ffff88042ae84000 task.ti:
ffff88042ae84000
[600522.083378] RIP: 0010:[<
ffffffffa0479e72>] [<
ffffffffa0479e72>] nfs_init_commit+0x22/0x130 [nfs]
[600522.083973] RSP: 0018:
ffff88042ae87438 EFLAGS:
00010246
[600522.084571] RAX:
0000000000000000 RBX:
ffff880003485e40 RCX:
ffff88042ae87588
[600522.085188] RDX:
0000000000000000 RSI:
ffff88042ae874b0 RDI:
ffff880003485e40
[600522.085756] RBP:
ffff88042ae87448 R08:
ffff880003486010 R09:
ffff88042ae874b0
[600522.086332] R10:
0000000000000000 R11:
0000000000000005 R12:
ffff88042ae872d0
[600522.086905] R13:
ffff88042ae874b0 R14:
ffff880003485e40 R15:
ffff88042704c840
[600522.087484] FS:
00007f4728ff2740(0000) GS:
ffff88043fd80000(0000) knlGS:
0000000000000000
[600522.088070] CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
[600522.088663] CR2:
0000000000000040 CR3:
000000042b6aa000 CR4:
00000000001406e0
[600522.089327] Stack:
[600522.089926]
0000000000000001 ffff88042ae87588 ffff88042ae874f8 ffffffffa04f09fa
[600522.090549]
0000000000017840 0000000000017840 ffff88042ae87588 ffff8803258d9930
[600522.091169]
ffff88042ae87578 ffffffffa0563d80 0000000000000000 ffff88042704c840
[600522.091789] Call Trace:
[600522.092420] [<
ffffffffa04f09fa>] pnfs_generic_commit_pagelist+0x1da/0x320 [nfsv4]
[600522.093052] [<
ffffffffa0563d80>] ? ff_layout_commit_prepare_v3+0x30/0x30 [nfs_layout_flexfiles]
[600522.093696] [<
ffffffffa0562645>] ff_layout_commit_pagelist+0x15/0x20 [nfs_layout_flexfiles]
[600522.094359] [<
ffffffffa047bc78>] nfs_generic_commit_list+0xe8/0x120 [nfs]
[600522.095032] [<
ffffffffa047bd6a>] nfs_commit_inode+0xba/0x110 [nfs]
[600522.095719] [<
ffffffffa046ac54>] nfs_release_page+0x44/0xd0 [nfs]
[600522.096410] [<
ffffffff811a8122>] try_to_release_page+0x32/0x50
[600522.097109] [<
ffffffff811bd4f1>] shrink_page_list+0x961/0xb30
[600522.097812] [<
ffffffff811bdced>] shrink_inactive_list+0x1cd/0x550
[600522.098530] [<
ffffffff811bea65>] shrink_lruvec+0x635/0x840
[600522.099250] [<
ffffffff811bed60>] shrink_zone+0xf0/0x2f0
[600522.099974] [<
ffffffff811bf312>] do_try_to_free_pages+0x192/0x470
[600522.100709] [<
ffffffff811bf6ca>] try_to_free_pages+0xda/0x170
[600522.101464] [<
ffffffff811b2198>] __alloc_pages_nodemask+0x588/0x970
[600522.102235] [<
ffffffff811fbbd5>] alloc_pages_vma+0xb5/0x230
[600522.103000] [<
ffffffff813a1589>] ? cpumask_any_but+0x39/0x50
[600522.103774] [<
ffffffff811d6115>] wp_page_copy.isra.55+0x95/0x490
[600522.104558] [<
ffffffff810e3438>] ? __wake_up+0x48/0x60
[600522.105357] [<
ffffffff811d7d3b>] do_wp_page+0xab/0x4f0
[600522.106137] [<
ffffffff810a1bbb>] ? release_task+0x36b/0x470
[600522.106902] [<
ffffffff8126dbd7>] ? eventfd_ctx_read+0x67/0x1c0
[600522.107659] [<
ffffffff811da2a8>] handle_mm_fault+0xc78/0x1900
[600522.108431] [<
ffffffff81067ef1>] __do_page_fault+0x181/0x420
[600522.109173] [<
ffffffff811446a6>] ? __audit_syscall_exit+0x1e6/0x280
[600522.109893] [<
ffffffff810681c0>] do_page_fault+0x30/0x80
[600522.110594] [<
ffffffff81024f36>] ? syscall_trace_leave+0xc6/0x120
[600522.111288] [<
ffffffff81790a58>] page_fault+0x28/0x30
[600522.111947] Code: 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 4c 8d 87 d0 01 00 00 48 89 e5 53 48 89 fb 48 83 ec 08 4c 8b 0e 49 8b 41 18 4c 39 ce <48> 8b 40 40 4c 8b 50 30 74 24 48 8b 87 d0 01 00 00 48 8b 7e 08
[600522.113343] RIP [<
ffffffffa0479e72>] nfs_init_commit+0x22/0x130 [nfs]
[600522.114003] RSP <
ffff88042ae87438>
[600522.114636] CR2:
0000000000000040
Fixes:
af7cf057 (NFS: Allow multiple commit requests in flight per file)
CC: stable@vger.kernel.org
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>