Alex Elder [Tue, 3 Jul 2012 21:01:18 +0000 (16:01 -0500)]
rbd: use rbd_dev consistently
Most variables that represent a struct rbd_device are named
"rbd_dev", but in some cases "dev" is used instead. Change all the
"dev" references so they use "rbd_dev" consistently, to make it
clear from the name that we're working with an RBD device (as
opposed to, for example, a struct device). Similarly, change the
name of the "dev" field in struct rbd_notify_info to be "rbd_dev".
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Tue, 10 Jul 2012 02:04:24 +0000 (21:04 -0500)]
rbd: dynamically allocate snapshot name
There is no need to impose a small limit the length of the snapshot
name recorded for an rbd image in a struct rbd_dev. Remove the
limitation by allocating space for the snapshot name dynamically.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Tue, 10 Jul 2012 02:04:23 +0000 (21:04 -0500)]
rbd: dynamically allocate image name
There is no need to impose a small limit the length of the rbd image
name recorded in a struct rbd_dev. Remove the limitation by
allocating space for the image name dynamically.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Tue, 10 Jul 2012 02:04:23 +0000 (21:04 -0500)]
rbd: dynamically allocate image header name
There is no need to impose a small limit the length of the header
name recorded for an rbd image in a struct rbd_dev. Remove the
limitation by allocating space for the header name dynamically.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Tue, 10 Jul 2012 02:04:24 +0000 (21:04 -0500)]
rbd: dynamically allocate object prefix
There is no need to impose a small limit the length of the object
prefix recorded for an rbd image in a struct rbd_image_header.
Remove the limitation by allocating space for the object prefix
dynamically.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Thu, 12 Jul 2012 15:46:35 +0000 (10:46 -0500)]
rbd: dynamically allocate pool name
There is no need to impose a small limit the length of the pool name
recorded for an rbd image in a struct rbd_device. Remove the
limitation by allocating space for the pool name ynamically.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Thu, 12 Jul 2012 15:46:35 +0000 (10:46 -0500)]
rbd: create pool_id device attribute
Add an entry under /sys/bus/rbd/devices/<N>/ named "pool_id" that
provides the id for the pool the rbd image is assocatied with. This
is in addition to the pool name already provided.
Rename the "poolid" field in struct rbd_device to be "pool_id".
Update the documentation to reflect the addition of this new entry.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Wed, 11 Jul 2012 01:30:09 +0000 (20:30 -0500)]
rbd: rename rbd_dev->block_name
Each rbd image has a name that forms the basis of all data objects
backing the device. Old (format 1) images refer to this name as the
"block name," while new (format 2) images use the term "object
prefix" for this.
Change the field name in the in-core rbd image header structure to
reflect the more modern usage. We intentionally keep the the name
"block_name" in the on-disk definition for format 1 image headers.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Tue, 10 Jul 2012 02:04:23 +0000 (21:04 -0500)]
rbd: define dup_token()
Define a new function dup_token(), to be used during argument
parsing for making dynamically-allocated copies of tokens being
parsed.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Wed, 11 Jul 2012 13:24:45 +0000 (08:24 -0500)]
libceph: define ceph_extract_encoded_string()
This adds a new utility routine which will return a dynamically-
allocated buffer containing a string that has been decoded from ceph
over-the-wire format. It also returns the length of the string
if the address of a size variable is supplied to receive it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Tue, 3 Jul 2012 21:01:19 +0000 (16:01 -0500)]
rbd: drop a useless local variable
In rbd_req_sync_notify_ack(), a local variable was needlessly being
used to hold a null pointer. Just pass NULL instead.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Tue, 3 Jul 2012 21:01:18 +0000 (16:01 -0500)]
libceph: fix off-by-one bug in ceph_encode_filepath()
There is a BUG_ON() call that doesn't account for the single byte
structure version at the start of an encoded filepath in
ceph_encode_filepath(). Fix that.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Sage Weil [Thu, 7 Jun 2012 20:43:35 +0000 (13:43 -0700)]
ceph: clean up useless d_parent checks
d_parent is never NULL, and IS_ROOT() is the proper way to check for a
(non-self-referential) parent.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Guanjun He [Mon, 9 Jul 2012 02:50:33 +0000 (19:50 -0700)]
libceph: prevent the race of incoming work during teardown
Add an atomic variable 'stopping' as flag in struct ceph_messenger,
set this flag to 1 in function ceph_destroy_client(), and add the condition code
in function ceph_data_ready() to test the flag value, if true(1), just return.
Signed-off-by: Guanjun He <gjhe@suse.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 10 Jul 2012 18:53:34 +0000 (11:53 -0700)]
libceph: fix messenger retry
In ancient times, the messenger could both initiate and accept connections.
An artifact if that was data structures to store/process an incoming
ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
Sadly, the negotiation code was referencing those structures and ignoring
important information (like the peer's connect_seq) from the correct ones.
Among other things, this fixes tight reconnect loops where the server sends
RETRY_SESSION and we (the client) retries with the same connect_seq as last
time. This bug pretty easily triggered by injecting socket failures on the
MDS and running some fs workload like workunits/direct_io/test_sync_io.
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 9 Jul 2012 21:31:41 +0000 (14:31 -0700)]
libceph: initialize rb, list nodes in ceph_osd_request
These don't strictly need to be initialized based on how they are used, but
it is good practice to do so.
Reported-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 9 Jul 2012 21:22:34 +0000 (14:22 -0700)]
libceph: initialize msgpool message types
Initialize the type field for messages in a msgpool. The caller was doing
this for osd ops, but not for the reply messages.
Reported-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 27 Jun 2012 19:31:02 +0000 (12:31 -0700)]
libceph: allow sock transition from CONNECTING to CLOSED
It is possible to close a socket that is in the OPENING state. For
example, it can happen if ceph_con_close() is called on the con before
the TCP connection is established. con_work() will come around and shut
down the socket.
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 27 Jun 2012 19:24:34 +0000 (12:24 -0700)]
libceph: initialize mon_client con only once
Do not re-initialize the con on every connection attempt. When we
ceph_con_close, there may still be work queued on the socket (e.g., to
close it), and re-initializing will clobber the work_struct state.
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 27 Jun 2012 19:24:08 +0000 (12:24 -0700)]
libceph: set peer name on con_open, not init
The peer name may change on each open attempt, even when the connection is
reused.
Signed-off-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 21 Jun 2012 19:49:23 +0000 (12:49 -0700)]
libceph: drop declaration of ceph_con_get()
For some reason the declaration of ceph_con_get() and
ceph_con_put() did not get deleted in this commit:
d59315ca libceph: drop ceph_con_get/put helpers and nref member
Clean that up.
Signed-off-by: Alex Elder <elder@inktank.com>
Alex Elder [Thu, 21 Jun 2012 02:53:53 +0000 (21:53 -0500)]
libceph: add some fine ASCII art
Sage liked the state diagram I put in my commit description so
I'm putting it in with the code.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Mon, 11 Jun 2012 19:57:13 +0000 (14:57 -0500)]
libceph: small changes to messenger.c
This patch gathers a few small changes in "net/ceph/messenger.c":
out_msg_pos_next()
- small logic change that mostly affects indentation
write_partial_msg_pages().
- use a local variable trail_off to represent the offset into
a message of the trail portion of the data (if present)
- once we are in the trail portion we will always be there, so we
don't always need to check against our data position
- avoid computing len twice after we've reached the trail
- get rid of the variable tmpcrc, which is not needed
- trail_off and trail_len never change so mark them const
- update some comments
read_partial_message_bio()
- bio_iovec_idx() will never return an error, so don't bother
checking for it
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 24 May 2012 16:55:03 +0000 (11:55 -0500)]
libceph: distinguish two phases of connect sequence
Currently a ceph connection enters a "CONNECTING" state when it
begins the process of (re-)connecting with its peer. Once the two
ends have successfully exchanged their banner and addresses, an
additional NEGOTIATING bit is set in the ceph connection's state to
indicate the connection information exhange has begun. The
CONNECTING bit/state continues to be set during this phase.
Rather than have the CONNECTING state continue while the NEGOTIATING
bit is set, interpret these two phases as distinct states. In other
words, when NEGOTIATING is set, clear CONNECTING. That way only
one of them will be active at a time.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 31 May 2012 16:37:29 +0000 (11:37 -0500)]
libceph: separate banner and connect writes
There are two phases in the process of linking together the two ends
of a ceph connection. The first involves exchanging a banner and
IP addresses, and if that is successful a second phase exchanges
some detail about each side's connection capabilities.
When initiating a connection, the client side now queues to send
its information for both phases of this process at the same time.
This is probably a bit more efficient, but it is slightly messier
from a layering perspective in the code.
So rearrange things so that the client doesn't send the connection
information until it has received and processed the response in the
initial banner phase (in process_banner()).
Move the code (in the (con->sock == NULL) case in try_write()) that
prepares for writing the connection information, delaying doing that
until the banner exchange has completed. Move the code that begins
the transition to this second "NEGOTIATING" phase out of
process_banner() and into its caller, so preparing to write the
connection information and preparing to read the response are
adjacent to each other.
Finally, preparing to write the connection information now requires
the output kvec to be reset in all cases, so move that into the
prepare_write_connect() and delete it from all callers.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 23 May 2012 19:35:23 +0000 (14:35 -0500)]
libceph: define and use an explicit CONNECTED state
There is no state explicitly defined when a ceph connection is fully
operational. So define one.
It's set when the connection sequence completes successfully, and is
cleared when the connection gets closed.
Be a little more careful when examining the old state when a socket
disconnect event is reported.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 23 May 2012 19:35:23 +0000 (14:35 -0500)]
libceph: clear NEGOTIATING when done
A connection state's NEGOTIATING bit gets set while in CONNECTING
state after we have successfully exchanged a ceph banner and IP
addresses with the connection's peer (the server). But that bit
is not cleared again--at least not until another connection attempt
is initiated.
Instead, clear it as soon as the connection is fully established.
Also, clear it when a socket connection gets prematurely closed
in the midst of establishing a ceph connection (in case we had
reached the point where it was set).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 21 Jun 2012 02:53:53 +0000 (21:53 -0500)]
libceph: clear CONNECTING in ceph_con_close()
A connection that is closed will no longer be connecting. So
clear the CONNECTING state bit in ceph_con_close(). Similarly,
if the socket has been closed we no longer are in connecting
state (a new connect sequence will need to be initiated).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 21 Jun 2012 02:53:53 +0000 (21:53 -0500)]
libceph: don't touch con state in con_close_socket()
In con_close_socket(), a connection's SOCK_CLOSED flag gets set and
then cleared while its shutdown method is called and its reference
gets dropped.
Previously, that flag got set only if it had not already been set,
so setting it in con_close_socket() might have prevented additional
processing being done on a socket being shut down. We no longer set
SOCK_CLOSED in the socket event routine conditionally, so setting
that bit here no longer provides whatever benefit it might have
provided before.
A race condition could still leave the SOCK_CLOSED bit set even
after we've issued the call to con_close_socket(), so we still clear
that bit after shutting the socket down. Add a comment explaining
the reason for this.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 21 Jun 2012 02:53:53 +0000 (21:53 -0500)]
libceph: just set SOCK_CLOSED when state changes
When a TCP_CLOSE or TCP_CLOSE_WAIT event occurs, the SOCK_CLOSED
connection flag bit is set, and if it had not been previously set
queue_con() is called to ensure con_work() will get a chance to
handle the changed state.
con_work() atomically checks--and if set, clears--the SOCK_CLOSED
bit if it was set. This means that even if the bit were set
repeatedly, the related processing in con_work() only gets called
once per transition of the bit from 0 to 1.
What's important then is that we ensure con_work() gets called *at
least* once when a socket close event occurs, not that it gets
called *exactly* once.
The work queue mechanism already takes care of queueing work
only if it is not already queued, so there's no need for us
to call queue_con() conditionally.
So this patch just makes it so the SOCK_CLOSED flag gets set
unconditionally in ceph_sock_state_change().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 21 Jun 2012 02:53:53 +0000 (21:53 -0500)]
libceph: don't change socket state on sock event
Currently the socket state change event handler records an error
message on a connection to distinguish a close while connecting from
a close while a connection was already established.
Changing connection information during handling of a socket event is
not very clean, so instead move this assignment inside con_work(),
where it can be done during normal connection-level processing (and
under protection of the connection mutex as well).
Move the handling of a socket closed event up to the top of the
processing loop in con_work(); there's no point in handling backoff
etc. if we have a newly-closed socket to take care of.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 21 Jun 2012 02:53:53 +0000 (21:53 -0500)]
libceph: SOCK_CLOSED is a flag, not a state
The following commit changed it so SOCK_CLOSED bit was stored in
a connection's new "flags" field rather than its "state" field.
libceph: start separating connection flags from state
commit
928443cd
That bit is used in con_close_socket() to protect against setting an
error message more than once in the socket event handler function.
Unfortunately, the field being operated on in that function was not
updated to be "flags" as it should have been. This fixes that
error.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Mon, 11 Jun 2012 19:57:13 +0000 (14:57 -0500)]
libceph: don't use bio_iter as a flag
Recently a bug was fixed in which the bio_iter field in a ceph
message was not being properly re-initialized when a message got
re-transmitted:
commit
43643528cce60ca184fe8197efa8e8da7c89a037
Author: Yan, Zheng <zheng.z.yan@intel.com>
rbd: Clear ceph_msg->bio_iter for retransmitted message
We are now only initializing the bio_iter field when we are about to
start to write message data (in prepare_write_message_data()),
rather than every time we are attempting to write any portion of the
message data (in write_partial_msg_pages()). This means we no
longer need to use the msg->bio_iter field as a flag.
So just don't do that any more. Trust prepare_write_message_data()
to ensure msg->bio_iter is properly initialized, every time we are
about to begin writing (or re-writing) a message's bio data.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Mon, 11 Jun 2012 19:57:13 +0000 (14:57 -0500)]
libceph: move init of bio_iter
If a message has a non-null bio pointer, its bio_iter field is
initialized in write_partial_msg_pages() if this has not been done
already. This is really a one-time setup operation for sending a
message's (bio) data, so move that initialization code into
prepare_write_message_data() which serves that purpose.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Mon, 11 Jun 2012 19:57:13 +0000 (14:57 -0500)]
libceph: move init_bio_*() functions up
Move init_bio_iter() and iter_bio_next() up in their source file so
the'll be defined before they're needed.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Mon, 11 Jun 2012 19:57:13 +0000 (14:57 -0500)]
libceph: don't mark footer complete before it is
This is a nit, but prepare_write_message() sets the FOOTER_COMPLETE
flag before the CRC for the data portion (recorded in the footer)
has been completely computed. Hold off setting the complete flag
until we've decided it's ready to send.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Mon, 11 Jun 2012 19:57:13 +0000 (14:57 -0500)]
libceph: encapsulate advancing msg page
In write_partial_msg_pages(), once all the data from a page has been
sent we advance to the next one. Put the code that takes care of
this into its own function.
While modifying write_partial_msg_pages(), make its local variable
"in_trail" be Boolean, and use the local variable "msg" (which is
just the connection's current out_msg pointer) consistently.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Mon, 11 Jun 2012 19:57:13 +0000 (14:57 -0500)]
libceph: encapsulate out message data setup
Move the code that prepares to write the data portion of a message
into its own function.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 21 Jun 2012 19:49:23 +0000 (12:49 -0700)]
libceph: drop ceph_con_get/put helpers and nref member
These are no longer used. Every ceph_connection instance is embedded in
another structure, and refcounts manipulated via the get/put ops.
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 21 Jun 2012 19:47:08 +0000 (12:47 -0700)]
libceph: use con get/put methods
The ceph_con_get/put() helpers manipulate the embedded con ref
count, which isn't used now that ceph_connections are embedded in
other structures.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Dan Carpenter [Tue, 19 Jun 2012 13:52:33 +0000 (08:52 -0500)]
libceph: fix NULL dereference in reset_connection()
We dereference "con->in_msg" on the line after it was set to NULL.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Sage Weil [Fri, 15 Jun 2012 19:32:04 +0000 (12:32 -0700)]
Merge tag 'v3.5-rc1'
Linux 3.5-rc1
Conflicts:
net/ceph/messenger.c
Sage Weil [Mon, 11 Jun 2012 03:43:56 +0000 (20:43 -0700)]
libceph: flush msgr queue during mon_client shutdown
We need to flush the msgr workqueue during mon_client shutdown to
ensure that any work affecting our embedded ceph_connection is
finished so that we can be safely destroyed.
Previously, we were flushing the work queue after osd_client
shutdown and before mon_client shutdown to ensure that any osd
connection refs to authorizers are flushed. Remove the redundant
flush, and document in the comment that the mon_client flush is
needed to cover that case as well.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Sage Weil [Sat, 9 Jun 2012 21:19:21 +0000 (14:19 -0700)]
libceph: transition socket state prior to actual connect
Once we call ->connect(), we are racing against the actual
connection, and a subsequent transition from CONNECTING ->
CONNECTED. Set the state to CONNECTING before that, under the
protection of the mutex, to avoid the race.
This was introduced in
928443cd9644e7cfd46f687dbeffda2d1a357ff9,
with the original socket state code.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Xi Wang [Thu, 7 Jun 2012 00:35:55 +0000 (19:35 -0500)]
libceph: fix overflow in osdmap_apply_incremental()
On 32-bit systems, a large `pglen' would overflow `pglen*sizeof(u32)'
and bypass the check ceph_decode_need(p, end, pglen*sizeof(u32), bad).
It would also overflow the subsequent kmalloc() size, leading to
out-of-bounds write.
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Xi Wang [Thu, 7 Jun 2012 00:35:55 +0000 (19:35 -0500)]
libceph: fix overflow in osdmap_decode()
On 32-bit systems, a large `n' would overflow `n * sizeof(u32)' and bypass
the check ceph_decode_need(p, end, n * sizeof(u32), bad). It would also
overflow the subsequent kmalloc() size, leading to out-of-bounds write.
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Xi Wang [Thu, 7 Jun 2012 00:35:55 +0000 (19:35 -0500)]
libceph: fix overflow in __decode_pool_names()
`len' is read from network and thus needs validation. Otherwise a
large `len' would cause out-of-bounds access via the memcpy() call.
In addition, len = 0xffffffff would overflow the kmalloc() size,
leading to out-of-bounds write.
This patch adds a check of `len' via ceph_decode_need(). Also use
kstrndup rather than kmalloc/memcpy.
[elder@inktank.com: added -ENOMEM return for null kstrndup() result]
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Yan, Zheng [Thu, 7 Jun 2012 00:35:55 +0000 (19:35 -0500)]
rbd: Clear ceph_msg->bio_iter for retransmitted message
The bug can cause NULL pointer dereference in write_partial_msg_pages
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Alex Elder [Fri, 1 Jun 2012 19:56:43 +0000 (14:56 -0500)]
libceph: make ceph_con_revoke_message() a msg op
ceph_con_revoke_message() is passed both a message and a ceph
connection. A ceph_msg allocated for incoming messages on a
connection always has a pointer to that connection, so there's no
need to provide the connection when revoking such a message.
Note that the existing logic does not preclude the message supplied
being a null/bogus message pointer. The only user of this interface
is the OSD client, and the only value an osd client passes is a
request's r_reply field. That is always non-null (except briefly in
an error path in ceph_osdc_alloc_request(), and that drops the
only reference so the request won't ever have a reply to revoke).
So we can safely assume the passed-in message is non-null, but add a
BUG_ON() to make it very obvious we are imposing this restriction.
Rename the function ceph_msg_revoke_incoming() to reflect that it is
really an operation on an incoming message.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Fri, 1 Jun 2012 19:56:43 +0000 (14:56 -0500)]
libceph: make ceph_con_revoke() a msg operation
ceph_con_revoke() is passed both a message and a ceph connection.
Now that any message associated with a connection holds a pointer
to that connection, there's no need to provide the connection when
revoking a message.
This has the added benefit of precluding the possibility of the
providing the wrong connection pointer. If the message's connection
pointer is null, it is not being tracked by any connection, so
revoking it is a no-op. This is supported as a convenience for
upper layers, so they can revoke a message that is not actually
"in flight."
Rename the function ceph_msg_revoke() to reflect that it is really
an operation on a message, not a connection.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Mon, 4 Jun 2012 19:43:33 +0000 (14:43 -0500)]
libceph: have messages take a connection reference
There are essentially two types of ceph messages: incoming and
outgoing. Outgoing messages are always allocated via ceph_msg_new(),
and at the time of their allocation they are not associated with any
particular connection. Incoming messages are always allocated via
ceph_con_in_msg_alloc(), and they are initially associated with the
connection from which incoming data will be placed into the message.
When an outgoing message gets sent, it becomes associated with a
connection and remains that way until the message is successfully
sent. The association of an incoming message goes away at the point
it is sent to an upper layer via a con->ops->dispatch method.
This patch implements reference counting for all ceph messages, such
that every message holds a reference (and a pointer) to a connection
if and only if it is associated with that connection (as described
above).
For background, here is an explanation of the ceph message
lifecycle, emphasizing when an association exists between a message
and a connection.
Outgoing Messages
An outgoing message is "owned" by its allocator, from the time it is
allocated in ceph_msg_new() up to the point it gets queued for
sending in ceph_con_send(). Prior to that point the message's
msg->con pointer is null; at the point it is queued for sending its
message pointer is assigned to refer to the connection. At that
time the message is inserted into a connection's out_queue list.
When a message on the out_queue list has been sent to the socket
layer to be put on the wire, it is transferred out of that list and
into the connection's out_sent list. At that point it is still owned
by the connection, and will remain so until an acknowledgement is
received from the recipient that indicates the message was
successfully transferred. When such an acknowledgement is received
(in process_ack()), the message is removed from its list (in
ceph_msg_remove()), at which point it is no longer associated with
the connection.
So basically, any time a message is on one of a connection's lists,
it is associated with that connection. Reference counting outgoing
messages can thus be done at the points a message is added to the
out_queue (in ceph_con_send()) and the point it is removed from
either its two lists (in ceph_msg_remove())--at which point its
connection pointer becomes null.
Incoming Messages
When an incoming message on a connection is getting read (in
read_partial_message()) and there is no message in con->in_msg,
a new one is allocated using ceph_con_in_msg_alloc(). At that
point the message is associated with the connection. Once that
message has been completely and successfully read, it is passed to
upper layer code using the connection's con->ops->dispatch method.
At that point the association between the message and the connection
no longer exists.
Reference counting of connections for incoming messages can be done
by taking a reference to the connection when the message gets
allocated, and releasing that reference when it gets handed off
using the dispatch method.
We should never fail to get a connection reference for a
message--the since the caller should already hold one.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Fri, 1 Jun 2012 19:56:43 +0000 (14:56 -0500)]
libceph: have messages point to their connection
When a ceph message is queued for sending it is placed on a list of
pending messages (ceph_connection->out_queue). When they are
actually sent over the wire, they are moved from that list to
another (ceph_connection->out_sent). When acknowledgement for the
message is received, it is removed from the sent messages list.
During that entire time the message is "in the possession" of a
single ceph connection. Keep track of that connection in the
message. This will be used in the next patch (and is a helpful
bit of information for debugging anyway).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Mon, 4 Jun 2012 19:43:32 +0000 (14:43 -0500)]
libceph: tweak ceph_alloc_msg()
The function ceph_alloc_msg() is only used to allocate a message
that will be assigned to a connection's in_msg pointer. Rename the
function so this implied usage is more clear.
In addition, make that assignment inside the function (again, since
that's precisely what it's intended to be used for). This allows us
to return what is now provided via the passed-in address of a "skip"
variable. The return type is now Boolean to be explicit that there
are only two possible outcomes.
Make sure the result of an ->alloc_msg method call always sets the
value of *skip properly.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Sun, 27 May 2012 04:26:43 +0000 (23:26 -0500)]
libceph: fully initialize connection in con_init()
Move the initialization of a ceph connection's private pointer,
operations vector pointer, and peer name information into
ceph_con_init(). Rearrange the arguments so the connection pointer
is first. Hide the byte-swapping of the peer entity number inside
ceph_con_init()
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Sun, 27 May 2012 04:26:43 +0000 (23:26 -0500)]
libceph: init monitor connection when opening
Hold off initializing a monitor client's connection until just
before it gets opened for use.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Fri, 1 Jun 2012 03:27:50 +0000 (20:27 -0700)]
libceph: drop connection refcounting for mon_client
All references to the embedded ceph_connection come from the msgr
workqueue, which is drained prior to mon_client destruction. That
means we can ignore con refcounting entirely.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Alex Elder <elder@inktank.com>
Alex Elder [Sun, 27 May 2012 04:26:43 +0000 (23:26 -0500)]
libceph: embed ceph connection structure in mon_client
A monitor client has a pointer to a ceph connection structure in it.
This is the only one of the three ceph client types that do it this
way; the OSD and MDS clients embed the connection into their main
structures. There is always exactly one ceph connection for a
monitor client, so there is no need to allocate it separate from the
monitor client structure.
So switch the ceph_mon_client structure to embed its
ceph_connection structure.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Fri, 1 Jun 2012 03:22:18 +0000 (20:22 -0700)]
libceph: use con get/put ops from osd_client
There were a few direct calls to ceph_con_{get,put}() instead of the con
ops from osd_client.c. This is a bug since those ops aren't defined to
be ceph_con_get/put.
This breaks refcounting on the ceph_osd structs that contain the
ceph_connections, and could lead to all manner of strangeness.
The purpose of the ->get and ->put methods in a ceph connection are
to allow the connection to indicate it has a reference to something
external to the messaging system, *not* to indicate something
external has a reference to the connection.
[elder@inktank.com: added that last sentence]
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Alex Elder <elder@inktank.com>
Alex Elder [Mon, 4 Jun 2012 19:43:32 +0000 (14:43 -0500)]
libceph: osd_client: don't drop reply reference too early
In ceph_osdc_release_request(), a reference to the r_reply message
is dropped. But just after that, that same message is revoked if it
was in use to receive an incoming reply. Reorder these so we are
sure we hold a reference until we're actually done with the message.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Dan Carpenter [Wed, 6 Jun 2012 14:15:33 +0000 (09:15 -0500)]
rbd: endian bug in rbd_req_cb()
Sparse complains about this because:
drivers/block/rbd.c:996:20: warning: cast to restricted __le32
drivers/block/rbd.c:996:20: warning: cast from restricted __le16
These are set in osd_req_encode_op() and they are le16.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Yan, Zheng [Wed, 6 Jun 2012 14:15:33 +0000 (09:15 -0500)]
rbd: Fix ceph_snap_context size calculation
ceph_snap_context->snaps is an u64 array
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Linus Torvalds [Sun, 3 Jun 2012 01:29:26 +0000 (18:29 -0700)]
Linux 3.5-rc1
Linus Torvalds [Sun, 3 Jun 2012 00:39:40 +0000 (17:39 -0700)]
Merge tag 'dm-3.5-changes-1' of git://git./linux/kernel/git/agk/linux-dm
Pull device-mapper updates from Alasdair G Kergon:
"Improve multipath's retrying mechanism in some defined circumstances
and provide a simple reserve/release mechanism for userspace tools to
access thin provisioning metadata while the pool is in use."
* tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
dm thin: provide userspace access to pool metadata
dm thin: use slab mempools
dm mpath: allow ioctls to trigger pg init
dm mpath: delay retry of bypassed pg
dm mpath: reduce size of struct multipath
Joe Thornber [Sat, 2 Jun 2012 23:30:01 +0000 (00:30 +0100)]
dm thin: provide userspace access to pool metadata
This patch implements two new messages that can be sent to the thin
pool target allowing it to take a snapshot of the _metadata_. This,
read-only snapshot can be accessed by userland, concurrently with the
live target.
Only one metadata snapshot can be held at a time. The pool's status
line will give the block location for the current msnap.
Since version 0.1.5 of the userland thin provisioning tools, the
thin_dump program displays the msnap as follows:
thin_dump -m <msnap root> <metadata dev>
Available here: https://github.com/jthornber/thin-provisioning-tools
Now that userland can access the metadata we can do various things
that have traditionally been kernel side tasks:
i) Incremental backups.
By using metadata snapshots we can work out what blocks have
changed over time. Combined with data snapshots we can ensure
the data doesn't change while we back it up.
A short proof of concept script can be found here:
https://github.com/jthornber/thinp-test-suite/blob/master/incremental_backup_example.rb
ii) Migration of thin devices from one pool to another.
iii) Merging snapshots back into an external origin.
iv) Asyncronous replication.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Sat, 2 Jun 2012 23:30:00 +0000 (00:30 +0100)]
dm thin: use slab mempools
Use dedicated caches prefixed with a "dm_" name rather than relying on
kmalloc mempools backed by generic slab caches so the memory usage of
thin provisioning (and any leaks) can be accounted for independently.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Sat, 2 Jun 2012 23:29:58 +0000 (00:29 +0100)]
dm mpath: allow ioctls to trigger pg init
After the failure of a group of paths, any alternative paths that
need initialising do not become available until further I/O is sent to
the device. Until this has happened, ioctls return -EAGAIN.
With this patch, new paths are made available in response to an ioctl
too. The processing of the ioctl gets delayed until this has happened.
Instead of returning an error, we submit a work item to kmultipathd
(that will potentially activate the new path) and retry in ten
milliseconds.
Note that the patch doesn't retry an ioctl if the ioctl itself fails due
to a path failure. Such retries should be handled intelligently by the
code that generated the ioctl in the first place, noting that some SCSI
commands should not be retried because they are not idempotent (XOR write
commands). For commands that could be retried, there is a danger that
if the device rejected the SCSI command, the path could be errorneously
marked as failed, and the request would be retried on another path which
might fail too. It can be determined if the failure happens on the
device or on the SCSI controller, but there is no guarantee that all
SCSI drivers set these flags correctly.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Christie [Sat, 2 Jun 2012 23:29:45 +0000 (00:29 +0100)]
dm mpath: delay retry of bypassed pg
If I/O needs retrying and only bypassed priority groups are available,
set the pg_init_delay_retry flag to wait before retrying.
If, for example, the reason for the bypass is that the controller is
getting reset or there is a firmware upgrade happening, retrying right
away would cause a flood of log messages and retries for what could be a
few seconds or even several minutes.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Sat, 2 Jun 2012 23:29:43 +0000 (00:29 +0100)]
dm mpath: reduce size of struct multipath
Move multipath structure's 'lock' and 'queue_size' members to eliminate
two 4-byte holes. Also use a bit within a single unsigned int for each
existing flag (saves 8-bytes). This allows future flags to be added
without each consuming an unsigned int.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Linus Torvalds [Sat, 2 Jun 2012 23:22:51 +0000 (16:22 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking updates from David Miller:
1) Make syn floods consume significantly less resources by
a) Not pre-COW'ing routing metrics for SYN/ACKs
b) Mirroring the device queue mapping of the SYN for the SYN/ACK
reply.
Both from Eric Dumazet.
2) Fix calculation errors in Byte Queue Limiting, from Hiroaki SHIMODA.
3) Validate the length requested when building a paged SKB for a
socket, so we don't overrun the page vector accidently. From Jason
Wang.
4) When netlabel is disabled, we abort all IP option processing when we
see a CIPSO option. This isn't the right thing to do, we should
simply skip over it and continue processing the remaining options
(if any). Fix from Paul Moore.
5) SRIOV fixes for the mellanox driver from Jack orgenstein and Marcel
Apfelbaum.
6) 8139cp enables the receiver before the ring address is properly
programmed, which potentially lets the device crap over random
memory. Fix from Jason Wang.
7) e1000/e1000e fixes for i217 RST handling, and an improper buffer
address reference in jumbo RX frame processing from Bruce Allan and
Sebastian Andrzej Siewior, respectively.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
fec_mpc52xx: fix timestamp filtering
mcs7830: Implement link state detection
e1000e: fix Rapid Start Technology support for i217
e1000: look into the page instead of skb->data for e1000_tbi_adjust_stats()
r8169: call netif_napi_del at errpaths and at driver unload
tcp: reflect SYN queue_mapping into SYNACK packets
tcp: do not create inetpeer on SYNACK message
8139cp/8139too: terminate the eeprom access with the right opmode
8139cp: set ring address before enabling receiver
cipso: handle CIPSO options correctly when NetLabel is disabled
net: sock: validate data_len before allocating skb in sock_alloc_send_pskb()
bql: Avoid possible inconsistent calculation.
bql: Avoid unneeded limit decrement.
bql: Fix POSDIFF() to integer overflow aware.
net/mlx4_core: Fix obscure mlx4_cmd_box parameter in QUERY_DEV_CAP
net/mlx4_core: Check port out-of-range before using in mlx4_slave_cap
net/mlx4_core: Fixes for VF / Guest startup flow
net/mlx4_en: Fix improper use of "port" parameter in mlx4_en_event
net/mlx4_core: Fix number of EQs used in ICM initialisation
net/mlx4_core: Fix the slave_id out-of-range test in mlx4_eq_int
Linus Torvalds [Sat, 2 Jun 2012 23:17:03 +0000 (16:17 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull straggler x86 fixes from Peter Anvin:
"Three groups of patches:
- EFI boot stub documentation and the ability to print error messages;
- Removal for PTRACE_ARCH_PRCTL for x32 (obsolete interface which
should never have been ported, and the port is broken and
potentially dangerous.)
- ftrace stack corruption fixes. I'm not super-happy about the
technical implementation, but it is probably the least invasive in
the short term. In the future I would like a single method for
nesting the debug stack, however."
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, x32, ptrace: Remove PTRACE_ARCH_PRCTL for x32
x86, efi: Add EFI boot stub documentation
x86, efi; Add EFI boot stub console support
x86, efi: Only close open files in error path
ftrace/x86: Do not change stacks in DEBUG when calling lockdep
x86: Allow nesting of the debug stack IDT setting
x86: Reset the debug_stack update counter
ftrace: Use breakpoint method to update ftrace caller
ftrace: Synchronize variable setting with breakpoints
Linus Torvalds [Sat, 2 Jun 2012 22:21:43 +0000 (15:21 -0700)]
tty: Revert the tty locking series, it needs more work
This reverts the tty layer change to use per-tty locking, because it's
not correct yet, and fixing it will require some more deep surgery.
The main revert is
d29f3ef39be4 ("tty_lock: Localise the lock"), but
there are several smaller commits that built upon it, they also get
reverted here. The list of reverted commits is:
fde86d310886 - tty: add lockdep annotations
8f6576ad476b - tty: fix ldisc lock inversion trace
d3ca8b64b97e - pty: Fix lock inversion
b1d679afd766 - tty: drop the pty lock during hangup
abcefe5fc357 - tty/amiserial: Add missing argument for tty_unlock()
fd11b42e3598 - cris: fix missing tty arg in wait_event_interruptible_tty call
d29f3ef39be4 - tty_lock: Localise the lock
The revert had a trivial conflict in the 68360serial.c staging driver
that got removed in the meantime.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stephan Gatzka [Sat, 2 Jun 2012 03:04:06 +0000 (03:04 +0000)]
fec_mpc52xx: fix timestamp filtering
skb_defer_rx_timestamp was called with a freshly allocated skb but must
be called with rskb instead.
Signed-off-by: Stephan Gatzka <stephan@gatzka.org>
Cc: stable <stable@vger.kernel.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ondrej Zary [Fri, 1 Jun 2012 10:29:08 +0000 (10:29 +0000)]
mcs7830: Implement link state detection
Add .status callback that detects link state changes.
Tested with MCS7832CV-AA chip (9710:7830, identified as rev.C by the driver).
Fixes https://bugzilla.kernel.org/show_bug.cgi?id=28532
Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 2 Jun 2012 16:03:54 +0000 (09:03 -0700)]
Merge 'for-linus' branches from git://git./linux/kernel/git/viro/{vfs,signal}
Pull vfs fix and a fix from the signal changes for frv from Al Viro.
The __kernel_nlink_t for powerpc got scrogged because 64-bit powerpc
actually depended on the default "unsigned long", while 32-bit powerpc
had an explicit override to "unsigned short". Al didn't notice, and
made both of them be the unsigned short.
The frv signal fix is fallout from simplifying the do_notify_resume()
code, and leaving an extra parenthesis.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
powerpc: Fix size of st_nlink on 64bit
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
frv: Remove bogus closing parenthesis
Anton Blanchard [Sat, 2 Jun 2012 11:34:52 +0000 (21:34 +1000)]
powerpc: Fix size of st_nlink on 64bit
commit
e57f93cc53b7 (powerpc: get rid of nlink_t uses, switch to
explicitly-sized type) changed the size of st_nlink on ppc64 from
a long to a short, resulting in boot failures.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Geert Uytterhoeven [Sat, 2 Jun 2012 11:08:39 +0000 (13:08 +0200)]
frv: Remove bogus closing parenthesis
Introduced by commit
6fd84c0831ec78d98736b76dc5e9b849f1dbfc9e
("TIF_RESTORE_SIGMASK can be set only when TIF_SIGPENDING is set")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Bruce Allan [Thu, 10 May 2012 02:51:17 +0000 (02:51 +0000)]
e1000e: fix Rapid Start Technology support for i217
The definition of I217_PROXY_CTRL must use the BM_PHY_REG() macro instead
of the PHY_REG() macro for PHY page 800 register 70 since it is for a PHY
register greater than the maximum allowed by the latter macro, and fix a
typo setting the I217_MEMPWR register in e1000_suspend_workarounds_ich8lan.
Also for clarity, rename a few defines as bit definitions instead of masks.
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sebastian Andrzej Siewior [Tue, 15 May 2012 09:18:55 +0000 (09:18 +0000)]
e1000: look into the page instead of skb->data for e1000_tbi_adjust_stats()
This is another fixup where the data is not transfered into buffer
addressed by skb->data but into a page.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Linus Torvalds [Sat, 2 Jun 2012 02:56:23 +0000 (19:56 -0700)]
Merge branch 'akpm' (Fixups for Andrew's patchbomb)
Merge fixups for the mac NLS tables from Andrew.
* emailed from Andrew Morton, and one cleanup by me:
nls: fix (and rename) mac NLS table files and config options
fs/nls/Makefile: remove bogus CONFIG_ assignments
Linus Torvalds [Sat, 2 Jun 2012 02:51:22 +0000 (19:51 -0700)]
nls: fix (and rename) mac NLS table files and config options
The config options in the Kconfig file (with _CODEPAGE_ in the name)
didn't match the config option name in the Makefile (no _CODEPAGE_).
And both of them were of the hard-to-read MACXYZZY variety, which made
them hard to parse for normal humans: MACROMAN easily reads as "macro
man", not as "Mac Roman".
So rename the options to be consistent, and be NLS_MAC_xyzzy. Rename
the files to be mac-xyzzy.c too, and drop the "nls" part entirely (it's
already in the directory name).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Fri, 1 Jun 2012 21:57:41 +0000 (14:57 -0700)]
fs/nls/Makefile: remove bogus CONFIG_ assignments
These were debug things which snuck through.
Reported-by: Yinghai Lu <yinghai@kernel.org>
Cc: Vladimir Serbinenko <phcoder@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 1 Jun 2012 23:57:51 +0000 (16:57 -0700)]
Merge tag 'fbdev-updates-for-3.5' of git://github.com/schandinat/linux-2.6
Pull fbdev updates from Florian Tobias Schandinat:
- driver for AUO-K1900 and AUO-K1901 epaper controller
- large updates for OMAP (e.g. decouple HDMI audio and video)
- some updates for Exynos and SH Mobile
- various other small fixes and cleanups
* tag 'fbdev-updates-for-3.5' of git://github.com/schandinat/linux-2.6: (130 commits)
video: bfin_adv7393fb: Fix cleanup code
video: exynos_dp: reduce delay time when configuring video setting
video: exynos_dp: move sw reset prioir to enabling sw defined function
video: exynos_dp: use devm_ functions
fb: handle NULL pointers in framebuffer release
OMAPDSS: HDMI: OMAP4: Update IRQ flags for the HPD IRQ request
OMAPDSS: Apply VENC timings even if panel is disabled
OMAPDSS: VENC/DISPC: Delay dividing Y resolution for managers connected to VENC
OMAPDSS: DISPC: Support rotation through TILER
OMAPDSS: VRFB: remove compiler warnings when CONFIG_BUG=n
OMAPFB: remove compiler warnings when CONFIG_BUG=n
OMAPDSS: remove compiler warnings when CONFIG_BUG=n
OMAPDSS: DISPC: fix usage of dispc_ovl_set_accu_uv
OMAPDSS: use DSI_FIFO_BUG workaround only for manual update displays
OMAPDSS: DSI: Support command mode interleaving during video mode blanking periods
OMAPDSS: DISPC: Update Accumulator configuration for chroma plane
drivers/video: fsl-diu-fb: don't initialize the THRESHOLDS registers
video: exynos mipi dsi: support reverse panel type
video: exynos mipi dsi: Properly interpret the interrupt source flags
video: exynos mipi dsi: Avoid races in probe()
...
Linus Torvalds [Fri, 1 Jun 2012 23:55:42 +0000 (16:55 -0700)]
Merge tag 'for-linus-3.5-
20120601' of git://git.infradead.org/linux-mtd
Pull mtd update from David Woodhouse:
- More robust parsing especially of xattr data in JFFS2
- Updates to mxc_nand and gpmi drivers to support new boards and device tree
- Improve consistency of information about ECC strength in NAND devices
- Clean up partition handling of plat_nand
- Support NAND drivers without dedicated access to OOB area
- BCH hardware ECC support for OMAP
- Other fixes and cleanups, and a few new device IDs
Fixed trivial conflict in drivers/mtd/nand/gpmi-nand/gpmi-nand.c due to
added include files next to each other.
* tag 'for-linus-3.5-
20120601' of git://git.infradead.org/linux-mtd: (75 commits)
mtd: mxc_nand: move ecc strengh setup before nand_scan_tail
mtd: block2mtd: fix recursive call of mtd_writev
mtd: gpmi-nand: define ecc.strength
mtd: of_parts: fix breakage in Kconfig
mtd: nand: fix scan_read_raw_oob
mtd: docg3 fix in-middle of blocks reads
mtd: cfi_cmdset_0002: Slight cleanup of fixup messages
mtd: add fixup for S29NS512P NOR flash.
jffs2: allow to complete xattr integrity check on first GC scan
jffs2: allow to discriminate between recoverable and non-recoverable errors
mtd: nand: omap: add support for hardware BCH ecc
ARM: OMAP3: gpmc: add BCH ecc api and modes
mtd: nand: check the return code of 'read_oob/read_oob_raw'
mtd: nand: remove 'sndcmd' parameter of 'read_oob/read_oob_raw'
mtd: m25p80: Add support for Winbond W25Q80BW
jffs2: get rid of jffs2_sync_super
jffs2: remove unnecessary GC pass on sync
jffs2: remove unnecessary GC pass on umount
jffs2: remove lock_super
mtd: gpmi: add gpmi support for mx6q
...
Linus Torvalds [Fri, 1 Jun 2012 23:51:37 +0000 (16:51 -0700)]
Merge branch 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86
Pull x86 platform driver updates from Matthew Garrett:
"Some significant improvements for the Sony driver on newer machines,
but other than that mostly just minor fixes and a patch to remove the
broken rfkill code from the Dell driver."
* 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86: (35 commits)
apple-gmux: Fix up the suspend/resume patch
dell-laptop: Remove rfkill code
toshiba_acpi: Fix mis-merge
dell-laptop: Add touchpad led support for Dell V3450
acer-wmi: add 3 laptops to video backlight vendor mode quirk table
sony-laptop: add touchpad enable/disable function
sony-laptop: add missing Fn key combos for 0x100 handlers
sony-laptop: add support for more WWAN modems
sony-laptop: new keyboard backlight handle
sony-laptop: add high speed battery charging function
sony-laptop: support automatic resume on lid open
sony-laptop: adjust error handling in finding SNC handles
sony-laptop: add thermal profiles support
sony-laptop: support battery care functions
sony-laptop: additional debug statements
sony-laptop: improve SNC initialization and acpi notify callback code
sony-laptop: use kstrtoul to parse sysfs values
sony-laptop: generalise ACPI calls into SNC functions
sony-laptop: fix return path when no ACPI buffer is allocated
sony-laptop: use soft rfkill status stored in hw
...
Linus Torvalds [Fri, 1 Jun 2012 23:50:23 +0000 (16:50 -0700)]
Merge branch 'slab/for-linus' of git://git./linux/kernel/git/penberg/linux
Pull slab updates from Pekka Enberg:
"Mainly a bunch of SLUB fixes from Joonsoo Kim"
* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
slub: use __SetPageSlab function to set PG_slab flag
slub: fix a memory leak in get_partial_node()
slub: remove unused argument of init_kmem_cache_node()
slub: fix a possible memory leak
Documentations: Fix slabinfo.c directory in vm/slub.txt
slub: fix incorrect return type of get_any_partial()
H. Peter Anvin [Fri, 1 Jun 2012 22:55:31 +0000 (15:55 -0700)]
Merge remote-tracking branch 'rostedt/tip/perf/urgent-2' into x86-urgent-for-linus
Linus Torvalds [Fri, 1 Jun 2012 22:46:46 +0000 (15:46 -0700)]
Merge branch 'ux500/hickup' of git://git./linux/kernel/git/arm/arm-soc
Pull arm fixes for ux500 mismerge mishap from Arnd Bergmann:
"The device tree conversion for arm/ux500 in 3.5 turns out to be
incomplete because of a mismerge done by Linus Walleij that I failed
to notice early enough and that Lee Jones as the original author of
those patches did not manage to fix during the -next cycle. While we
originally to get a much larger set of ux500 device tree enablement
patches merged, this did not happen in time.
After some discussion at Linaro Connect conference this week, Lee has
been able to do damage control and provide a series to put the broken
platform back into usable shape for both DT and non-DT based booting.
This series has not been part of linux-next and is based on top of the
current state of the upstream kernel rather than an -rc, but this is
the best we could manage given the earlier breakage."
* 'ux500/hickup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: ux500: Enable probing of pinctrl through Device Tree
ARM: ux500: Add support for ab8500 regulators into the Device Tree
ARM: ux500: Provide regulator support for SMSC911x via Device Tree
ARM: ux500: Allow PRCMU regulator to be probed during a DT enabled boot
ARM: ux500: Apply db8500-prcmu regulator information to db8500 Device Tree
ARM: ux500: Only initialise STE's UIBs on boards which support them
ARM: ux500: Disable platform setup of the ab8500 when DT is enabled
ARM: ux500: Use correct format for dynamic IRQ assignment
ARM: ux500: Re-enable SMSC911x platform code registration during non-DT boots
ARM: ux500: PRCMU related configuration and layout corrections for Device Tree
ARM: ux500: Remove DB8500 PRCMU platform registration when DT is enabled
ARM: ux500: Disable SMSC911x platform code registration when DT is enabled
ARM: ux500: New DT:ed u8500_init_devices for one-by-one device enablement
ARM: ux500: New DT:ed snowball_platform_devs for one-by-one device enablement
pinctrl-nomadik: Allow Device Tree driver probing
Devendra Naga [Thu, 31 May 2012 01:51:20 +0000 (01:51 +0000)]
r8169: call netif_napi_del at errpaths and at driver unload
when register_netdev fails, the init'ed NAPIs by netif_napi_add must be
deleted with netif_napi_del, and also when driver unloads, it should
delete the NAPI before unregistering netdevice using unregister_netdev.
Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 1 Jun 2012 22:40:29 +0000 (15:40 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"A bunch of fixes:
- vmware memory corruption
- ttm spinlock balance
- cirrus/mgag200 work in the presence of efifb
and finally Alex and Jerome managed to track down a magic set of bits
that on certain rv740 and evergreen cards allow the correct use of the
complete set of render backends, this makes the cards operate
correctly in a number of scenarios we had issues in before, it also
manages to boost speed on benchmarks my large amounts on these
specific gpus."
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/edid: Make the header fixup threshold tunable
drm/radeon: fix regression in UMS CS ioctl
drm/vmwgfx: Fix nasty write past alloced memory area
drm/ttm: Fix spinlock imbalance
drm/radeon: fixup tiling group size and backendmap on r6xx-r9xx (v4)
drm/radeon: fix HD6790, HD6570 backend programming
drm/radeon: properly program gart on rv740, juniper, cypress, barts, hemlock
drm/radeon: fix bank information in tiling config
drm/mgag200: kick off conflicting framebuffers earlier.
drm/cirrus: kick out conflicting framebuffers earlier
cirrus: avoid crash if driver fails to load
Linus Torvalds [Fri, 1 Jun 2012 22:39:26 +0000 (15:39 -0700)]
Merge tag 'sound-3.5' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Just a few trivial driver-specific fixes."
* tag 'sound-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hdspm - Work around broken DDS value on PCI RME MADI
ALSA: usb-audio: fix rate_list memory leak
ASoC: fsi: bugfix: ensure dma is terminated
ASoC: fsi: bugfix: correct dma area
ASoC: fsi: bugfix: enable master clock control on DMA stream
ASoC: imx-ssi: Use clk_prepare_enable/clk_disable_unprepare
H.J. Lu [Tue, 22 May 2012 03:29:45 +0000 (20:29 -0700)]
x86, x32, ptrace: Remove PTRACE_ARCH_PRCTL for x32
When I added x32 ptrace to 3.4 kernel, I also include PTRACE_ARCH_PRCTL
support for x32 GDB For ARCH_GET_FS/GS, it takes a pointer to int64. But
at user level, ARCH_GET_FS/GS takes a pointer to int32. So I have to add
x32 ptrace to glibc to handle it with a temporary int64 passed to kernel and
copy it back to GDB as int32. Roland suggested that PTRACE_ARCH_PRCTL
is obsolete and x32 GDB should use fs_base and gs_base fields of
user_regs_struct instead.
Accordingly, remove PTRACE_ARCH_PRCTL completely from the x32 code to
avoid possible memory overrun when pointer to int32 is passed to
kernel.
Link: http://lkml.kernel.org/r/CAMe9rOpDzHfS7NH7m1vmD9QRw8SSj4Sc%2BaNOgcWm_WJME2eRsQ@mail.gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@vger.kernel.org> v3.4
Sascha Hauer [Fri, 25 May 2012 14:22:42 +0000 (16:22 +0200)]
mtd: mxc_nand: move ecc strengh setup before nand_scan_tail
Since commit
6a918bade9dab40aaef80559bd1169c69e8d69cb, the mxc_nand driver
fails with:
Driver must set ecc.strength when using hardware ECC
This is because nand_scan_tail checks for correct ecc strength
settings, so we must set them up before nand_scan_tail.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Cc: stable@vger.kernel.org [3.4+]
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Gabor Juhos [Wed, 23 May 2012 22:17:23 +0000 (00:17 +0200)]
mtd: block2mtd: fix recursive call of mtd_writev
The 'mtd_writev' interface calls the function assigned
to the '_write' field of a given mtd device if that is
not NULL. The block2mtd driver sets the '_writev' field
to the 'mtd_writev' function itself and thus causes a
endless loop.
This is caused by
1dbebd32562b3c2caeca35960e5cb00bfcc12900
(mtd: harmonize mtd_writev usage).
Remove the assignment from the block2mtd driver to fix the
issue.
Signed-off-by: Gabor Juhos <juhosg@openwrt.org>
Cc: stable@kernel.org [3.3+]
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Marek Vasut [Mon, 21 May 2012 20:59:27 +0000 (22:59 +0200)]
mtd: gpmi-nand: define ecc.strength
Fix an issue which was introduced by the recent addition of ecc.strength.
The ecc.strength wasn't set in gpmi-nand, resulting in the following crash:
[ 2.550000] kernel BUG at drivers/mtd/nand/nand_base.c:3347!
...
[ 2.550000] [<
c020841c>] (nand_scan_tail+0x328/0x650) from [<
c02f68e0>] (gpmi_nand_probe+0x43c/0x5a4)
[ 2.550000] [<
c02f68e0>] (gpmi_nand_probe+0x43c/0x5a4) from [<
c01f6618>] (platform_drv_probe+0x14/0x18)
[ 2.550000] [<
c01f6618>] (platform_drv_probe+0x14/0x18) from [<
c01f55b0>] (driver_probe_device+0x74/0x1fc)
[ 2.550000] [<
c01f55b0>] (driver_probe_device+0x74/0x1fc) from [<
c01f57cc>] (__driver_attach+0x94/0x98)
[ 2.550000] [<
c01f57cc>] (__driver_attach+0x94/0x98) from [<
c01f3d40>] (bus_for_each_dev+0x50/0x80)
[ 2.550000] [<
c01f3d40>] (bus_for_each_dev+0x50/0x80) from [<
c01f4e18>] (bus_add_driver+0x188/0x25c)
[ 2.550000] [<
c01f4e18>] (bus_add_driver+0x188/0x25c) from [<
c01f5a70>] (driver_register+0x78/0x138)
[ 2.550000] [<
c01f5a70>] (driver_register+0x78/0x138) from [<
c043dc7c>] (gpmi_nand_init+0xc/0x30)
[ 2.550000] [<
c043dc7c>] (gpmi_nand_init+0xc/0x30) from [<
c0008824>] (do_one_initcall+0x108/0x17c)
[ 2.550000] [<
c0008824>] (do_one_initcall+0x108/0x17c) from [<
c042a8b8>] (kernel_init+0xfc/0x1bc)
[ 2.550000] [<
c042a8b8>] (kernel_init+0xfc/0x1bc) from [<
c000fab4>] (kernel_thread_exit+0x0/0x8)
Signed-off-by: Marek Vasut <marex@denx.de>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Matthew Garrett [Fri, 1 Jun 2012 19:18:52 +0000 (15:18 -0400)]
apple-gmux: Fix up the suspend/resume patch
I incorporated the wrong version of the suspend/resume patch for gmux,
and so lost David Woodhouse's fix to leave the backlight level unchanged
over suspend/resume. This fixes it up to v2.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Frank Svendsboe [Thu, 17 May 2012 20:43:09 +0000 (22:43 +0200)]
mtd: of_parts: fix breakage in Kconfig
MTD_OF_PARTS and the default setting is not working due to using 'Y'
instead of 'y', introduced in commit
d6137badeff1ef64b4e0092ec249ebdeaeb3ff37. This made our board, and
possibly other boards using DTS defined partitions and not having
CONFIG_MTD_OF_PARTS=y defined in the defconfig, fail to mount root.
Signed-off-by: Frank Svendsboe <frank.svendsboe@gmail.com>
Cc: stable@kernel.org [3.2+]
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Linus Torvalds [Fri, 1 Jun 2012 18:53:44 +0000 (11:53 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/signal
Pull third pile of signal handling patches from Al Viro:
"This time it's mostly helpers and conversions to them; there's a lot
of stuff remaining in the tree, but that'll either go in -rc2
(isolated bug fixes, ideally via arch maintainers' trees) or will sit
there until the next cycle."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
x86: get rid of calling do_notify_resume() when returning to kernel mode
blackfin: check __get_user() return value
whack-a-mole with TIF_FREEZE
FRV: Optimise the system call exit path in entry.S [ver #2]
FRV: Shrink TIF_WORK_MASK [ver #2]
FRV: Prevent syscall exit tracing and notify_resume at end of kernel exceptions
new helper: signal_delivered()
powerpc: get rid of restore_sigmask()
most of set_current_blocked() callers want SIGKILL/SIGSTOP removed from set
set_restore_sigmask() is never called without SIGPENDING (and never should be)
TIF_RESTORE_SIGMASK can be set only when TIF_SIGPENDING is set
don't call try_to_freeze() from do_signal()
pull clearing RESTORE_SIGMASK into block_sigmask()
sh64: failure to build sigframe != signal without handler
openrisc: tracehook_signal_handler() is supposed to be called on success
new helper: sigmask_to_save()
new helper: restore_saved_sigmask()
new helpers: {clear,test,test_and_clear}_restore_sigmask()
HAVE_RESTORE_SIGMASK is defined on all architectures now
Eric Dumazet [Fri, 1 Jun 2012 01:47:50 +0000 (01:47 +0000)]
tcp: reflect SYN queue_mapping into SYNACK packets
While testing how linux behaves on SYNFLOOD attack on multiqueue device
(ixgbe), I found that SYNACK messages were dropped at Qdisc level
because we send them all on a single queue.
Obvious choice is to reflect incoming SYN packet @queue_mapping to
SYNACK packet.
Under stress, my machine could only send 25.000 SYNACK per second (for
200.000 incoming SYN per second). NIC : ixgbe with 16 rx/tx queues.
After patch, not a single SYNACK is dropped.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 31 May 2012 21:00:26 +0000 (21:00 +0000)]
tcp: do not create inetpeer on SYNACK message
Another problem on SYNFLOOD/DDOS attack is the inetpeer cache getting
larger and larger, using lots of memory and cpu time.
tcp_v4_send_synack()
->inet_csk_route_req()
->ip_route_output_flow()
->rt_set_nexthop()
->rt_init_metrics()
->inet_getpeer( create = true)
This is a side effect of commit
a4daad6b09230 (net: Pre-COW metrics for
TCP) added in 2.6.39
Possible solution :
Instruct inet_csk_route_req() to remove FLOWI_FLAG_PRECOW_METRICS
Before patch :
# grep peer /proc/slabinfo
inet_peer_cache
4175430 4175430 192 42 2 : tunables 0 0 0 : slabdata 99415 99415 0
Samples: 41K of event 'cycles', Event count (approx.):
30716565122
+ 20,24% ksoftirqd/0 [kernel.kallsyms] [k] inet_getpeer
+ 8,19% ksoftirqd/0 [kernel.kallsyms] [k] peer_avl_rebalance.isra.1
+ 4,81% ksoftirqd/0 [kernel.kallsyms] [k] sha_transform
+ 3,64% ksoftirqd/0 [kernel.kallsyms] [k] fib_table_lookup
+ 2,36% ksoftirqd/0 [ixgbe] [k] ixgbe_poll
+ 2,16% ksoftirqd/0 [kernel.kallsyms] [k] __ip_route_output_key
+ 2,11% ksoftirqd/0 [kernel.kallsyms] [k] kernel_map_pages
+ 2,11% ksoftirqd/0 [kernel.kallsyms] [k] ip_route_input_common
+ 2,01% ksoftirqd/0 [kernel.kallsyms] [k] __inet_lookup_established
+ 1,83% ksoftirqd/0 [kernel.kallsyms] [k] md5_transform
+ 1,75% ksoftirqd/0 [kernel.kallsyms] [k] check_leaf.isra.9
+ 1,49% ksoftirqd/0 [kernel.kallsyms] [k] ipt_do_table
+ 1,46% ksoftirqd/0 [kernel.kallsyms] [k] hrtimer_interrupt
+ 1,45% ksoftirqd/0 [kernel.kallsyms] [k] kmem_cache_alloc
+ 1,29% ksoftirqd/0 [kernel.kallsyms] [k] inet_csk_search_req
+ 1,29% ksoftirqd/0 [kernel.kallsyms] [k] __netif_receive_skb
+ 1,16% ksoftirqd/0 [kernel.kallsyms] [k] copy_user_generic_string
+ 1,15% ksoftirqd/0 [kernel.kallsyms] [k] kmem_cache_free
+ 1,02% ksoftirqd/0 [kernel.kallsyms] [k] tcp_make_synack
+ 0,93% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock_bh
+ 0,87% ksoftirqd/0 [kernel.kallsyms] [k] __call_rcu
+ 0,84% ksoftirqd/0 [kernel.kallsyms] [k] rt_garbage_collect
+ 0,84% ksoftirqd/0 [kernel.kallsyms] [k] fib_rules_lookup
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Thu, 31 May 2012 18:19:48 +0000 (18:19 +0000)]
8139cp/8139too: terminate the eeprom access with the right opmode
Currently, we terminate the eeprom access through clearing the CS by:
RTL_W8 (Cfg9346, ~EE_CS); or writeb (~EE_CS, ee_addr);
This would left the eeprom into "Config. Register Write Enable:"
state which is not expcted as the highest two bits were set to
0x11 ( expected is the "Normal" mode (0x00)). Solving this by write
0x0 instead of ~EE_CS when terminating the eeprom access.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>