Kirill Korotaev [Thu, 23 Jun 2005 07:09:51 +0000 (00:09 -0700)]
[PATCH] Software suspend and recalc sigpending bug fix
This patch fixes recalc_sigpending() to work correctly with tasks which are
being freezed.
The problem is that freeze_processes() sets PF_FREEZE and TIF_SIGPENDING
flags on tasks, but recalc_sigpending() called from e.g.
sys_rt_sigtimedwait or any other kernel place will clear TIF_SIGPENDING due
to no pending signals queued and the tasks won't be freezed until it
recieves a real signal or freezed_processes() fail due to timeout.
Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Signed-Off-By: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Kirill Korotaev [Thu, 23 Jun 2005 07:09:50 +0000 (00:09 -0700)]
[PATCH] Fix of bogus file max limit messages
This patch fixes incorrect and bogus kernel messages that file-max limit
reached when the allocation fails
Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Signed-Off-By: Denis Lunev <den@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Christoph Hellwig [Thu, 23 Jun 2005 07:09:49 +0000 (00:09 -0700)]
[PATCH] add some comments to lookup_create()
In a duplicate of lookup_create in the af_unix code Al commented what's
going on nicely, so let's bring that over to lookup_create before the copy
is going away (I'll send a patch soon)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Alexey Dobriyan [Thu, 23 Jun 2005 07:09:47 +0000 (00:09 -0700)]
[PATCH] Document the fact that linux-arm-kernel is subscribers-only.
"Non-members are not allowed to post messages to this list. Blame the
original poster for cross-posting to subscriber-only mailing lists. "
Signed-off-by: Alexey Dobriyan <adobriyan@mail.ru>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andreas Dilger [Thu, 23 Jun 2005 07:09:45 +0000 (00:09 -0700)]
[PATCH] Support for dx directories in ext3_get_parent (NFSD)
Henrik Grubbstrom noted:
The 2.6.10 ext3_get_parent attempts to use ext3_find_entry to look up the
entry "..", which fails for dx directories since ".." is not present in the
directory hash table. The patch below solves this by looking up the dotdot
entry in the dx_root block.
Typical symptoms of the above bug are intermittent claims by nfsd that
files or directories are missing on exported ext3 filesystems.
cf https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=
3D150759 and
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=
3D144556
ext3_get_parent() is IMHO the wrong place to fix this bug as it introduces
a lot of internals from htree into that function. Instead, I think this
should be fixed in ext3_find_entry() as in the below patch. This has the
added advantage that it works for any callers of ext3_find_entry() and not
just ext3_lookup_parent().
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Henrik Grubbstrom <grubba@grubba.org>
Cc: <ext2-devel@lists.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Alan Cox [Thu, 23 Jun 2005 07:09:43 +0000 (00:09 -0700)]
[PATCH] setuid core dump
Add a new `suid_dumpable' sysctl:
This value can be used to query and set the core dump mode for setuid
or otherwise protected/tainted binaries. The modes are
0 - (default) - traditional behaviour. Any process which has changed
privilege levels or is execute only will not be dumped
1 - (debug) - all processes dump core when possible. The core dump is
owned by the current user and no security is applied. This is intended
for system debugging situations only. Ptrace is unchecked.
2 - (suidsafe) - any binary which normally would not be dumped is dumped
readable by root only. This allows the end user to remove such a dump but
not access it directly. For security reasons core dumps in this mode will
not overwrite one another or other files. This mode is appropriate when
adminstrators are attempting to debug problems in a normal environment.
(akpm:
> > +EXPORT_SYMBOL(suid_dumpable);
>
> EXPORT_SYMBOL_GPL?
No problem to me.
> > if (current->euid == current->uid && current->egid == current->gid)
> > current->mm->dumpable = 1;
>
> Should this be SUID_DUMP_USER?
Actually the feedback I had from last time was that the SUID_ defines
should go because its clearer to follow the numbers. They can go
everywhere (and there are lots of places where dumpable is tested/used
as a bool in untouched code)
> Maybe this should be renamed to `dump_policy' or something. Doing that
> would help us catch any code which isn't using the #defines, too.
Fair comment. The patch was designed to be easy to maintain for Red Hat
rather than for merging. Changing that field would create a gigantic
diff because it is used all over the place.
)
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Prasanna S Panchamukhi [Thu, 23 Jun 2005 07:09:41 +0000 (00:09 -0700)]
[PATCH] jprobes: allow a jprobe to coexist with muliple kprobes
Presently either multiple kprobes or only one jprobe could be inserted.
This patch removes the above limitation and allows one jprobe and multiple
kprobes to coexist at the same address. However multiple jprobes cannot
coexist with multiple kprobes. Currently I am working on the prototype to
allow multiple jprobes coexist with multiple kprobes.
Signed-off-by: Ananth N Mavinakayanhalli <amavin@redhat.com>
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Anil S Keshavamurthy [Thu, 23 Jun 2005 07:09:40 +0000 (00:09 -0700)]
[PATCH] Kprobes/ia64: temporary disarming of reentrant probe
This patch includes IA64 architecture specific changes(ported form i386) to
support temporary disarming on reentrancy of probes.
In case of reentrancy we single step without calling user handler.
Signed-of-by: Anil S Keshavamurth <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Prasanna S Panchamukhi [Thu, 23 Jun 2005 07:09:39 +0000 (00:09 -0700)]
[PATCH] kprobes: Temporary disarming of reentrant probe for sparc64
This patch includes sparc64 architecture specific changes to support temporary
disarming on reentrancy of probes.
Signed-of-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Prasanna S Panchamukhi [Thu, 23 Jun 2005 07:09:38 +0000 (00:09 -0700)]
[PATCH] kprobes: Temporary disarming of reentrant probe for ppc64
This patch includes ppc64 architecture specific changes to support temporary
disarming on reentrancy of probes.
Signed-of-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Prasanna S Panchamukhi [Thu, 23 Jun 2005 07:09:37 +0000 (00:09 -0700)]
[PATCH] kprobes: Temporary disarming of reentrant probe for x86_64
This patch includes x86_64 architecture specific changes to support temporary
disarming on reentrancy of probes.
Signed-of-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Prasanna S Panchamukhi [Thu, 23 Jun 2005 07:09:37 +0000 (00:09 -0700)]
[PATCH] kprobes: Temporary disarming of reentrant probe for i386
This patch includes i386 architecture specific changes to support temporary
disarming on reentrancy of probes.
Signed-of-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Prasanna S Panchamukhi [Thu, 23 Jun 2005 07:09:36 +0000 (00:09 -0700)]
[PATCH] kprobes: Temporary disarming of reentrant probe
In situations where a kprobes handler calls a routine which has a probe on it,
then kprobes_handler() disarms the new probe forever. This patch removes the
above limitation by temporarily disarming the new probe. When the another
probe hits while handling the old probe, the kprobes_handler() saves previous
kprobes state and handles the new probe without calling the new kprobes
registered handlers. kprobe_post_handler() restores back the previous kprobes
state and the normal execution continues.
However on x86_64 architecture, re-rentrancy is provided only through
pre_handler(). If a routine having probe is referenced through
post_handler(), then the probes on that routine are disarmed forever, since
the exception stack is gets changed after the processor single steps the
instruction of the new probe.
This patch includes generic changes to support temporary disarming on
reentrancy of probes.
Signed-of-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Keshavamurthy Anil S [Thu, 23 Jun 2005 07:09:35 +0000 (00:09 -0700)]
[PATCH] Kprobes/IA64: check jprobe break before handling
Once the jprobe instrumented function returns, it executes a jprobe_break
which is a break instruction with __IA64_JPROBE_BREAK value. The current
patch checks for this break value, before assuming that jprobe instrumented
function just completed.
The previous code was not checking for this value and that was a bug.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Anil S Keshavamurthy [Thu, 23 Jun 2005 07:09:34 +0000 (00:09 -0700)]
[PATCH] Kprobes IA64: safe register kprobe
The current kprobes does not yet handle register kprobes on some of the
following kind of instruction which needs to be emulated in a special way.
1) mov r1=ip
2) chk -- Speculation check instruction
This patch attempts to fail register_kprobes() when user tries to insert
kprobes on the above kind of instruction.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Anil S Keshavamurthy [Thu, 23 Jun 2005 07:09:33 +0000 (00:09 -0700)]
[PATCH] Kprobes IA64: cmp ctype unc support
The current Kprobes when patching the original instruction with the break
instruction tries to retain the original qualifying predicate(qp), however
for cmp.crel.ctype where ctype == unc, which is a special instruction
always needs to be executed irrespective of qp. Hence, if the instruction
we are patching is of this type, then we should not copy the original qp to
the break instruction, this is because we always want the break fault to
happen so that we can emulate the instruction.
This patch is based on the feedback given by David Mosberger
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Anil S Keshavamurthy [Thu, 23 Jun 2005 07:09:32 +0000 (00:09 -0700)]
[PATCH] Kprobes IA64: arch_prepare_kprobes() cleanup
arch_prepare_kprobes() was doing lots of functionality
in just one single function. This patch
attempts to clean up arch_prepare_kprobes() by moving
specific sub task to the following (new)functions
1)valid_kprobe_addr() -->> validate the given kprobe address
2)get_kprobe_inst(slot..)->> Retrives the instruction for a given slot from the bundle
3)prepare_break_inst() -->> Prepares break instruction within the bundle
3a)update_kprobe_inst_flag()-->>Updates the internal flags, required
for proper emulation of the instruction at later
point in time.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rusty Lynch [Thu, 23 Jun 2005 07:09:31 +0000 (00:09 -0700)]
[PATCH] Kprobes ia64 qp fix
Fix a bug where a kprobe still fires when the instruction is predicated
off. So given the p6=0, and we have an instruction like:
(p6) move loc1=0
we should not be triggering the kprobe. This is handled by carrying over
the qp section of the original instruction into the break instruction.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Rusty Lynch <Rusty.lynch@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rusty Lynch [Thu, 23 Jun 2005 07:09:30 +0000 (00:09 -0700)]
[PATCH] Kprobes ia64 cleanup
A cleanup of the ia64 kprobes implementation such that all of the bundle
manipulation logic is concentrated in arch_prepare_kprobe().
With the current design for kprobes, the arch specific code only has a
chance to return failure inside the arch_prepare_kprobe() function.
This patch moves all of the work that was happening in arch_copy_kprobe()
and most of the work that was happening in arch_arm_kprobe() into
arch_prepare_kprobe(). By doing this we can add further robustness checks
in arch_arm_kprobe() and refuse to insert kprobes that will cause problems.
Signed-off-by: Rusty Lynch <Rusty.lynch@intel.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Anil S Keshavamurthy [Thu, 23 Jun 2005 07:09:29 +0000 (00:09 -0700)]
[PATCH] Kprobes/IA64: support kprobe on branch/call instructions
This patch is required to support kprobe on branch/call instructions.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Anil S Keshavamurthy [Thu, 23 Jun 2005 07:09:28 +0000 (00:09 -0700)]
[PATCH] Kprobes/IA64: architecture specific JProbes support
This patch adds IA64 architecture specific JProbes support on top of Kprobes
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Rusty Lynch <Rusty.lynch@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Anil S Keshavamurthy [Thu, 23 Jun 2005 07:09:28 +0000 (00:09 -0700)]
[PATCH] Kprobes/IA64: arch specific handling
This is an IA64 arch specific handling of Kprobes
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Rusty Lynch <Rusty.lynch@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Anil S Keshavamurthy [Thu, 23 Jun 2005 07:09:27 +0000 (00:09 -0700)]
[PATCH] Kprobes/IA64: kdebug die notification mechanism
As many of you know that kprobes exist in the main line kernel for various
architecture including i386, x86_64, ppc64 and sparc64. Attached patches
following this mail are a port of Kprobes and Jprobes for IA64.
I have tesed this patches for kprobes and Jprobes and this seems to work fine.
I have tested this patch by inserting kprobes on various slots and various
templates including various types of branch instructions.
I have also tested this patch using the tool
http://marc.theaimsgroup.com/?l=linux-kernel&m=
111657358022586&w=2 and the
kprobes for IA64 works great.
Here is list of TODO things and pathes for the same will appear soon.
1) Support kprobes on "mov r1=ip" type of instruction
2) Support Kprobes and Jprobes to exist on the same address
3) Support Return probes
3) Architecture independent cleanup of kprobes
This patch adds the kdebug die notification mechanism needed by Kprobes.
For break instruction on Branch type slot, imm21 is ignored and value
zero is placed in IIM register, hence we need to handle kprobes
for switch case zero.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Rusty Lynch <Rusty.lynch@intel.com>
From: Rusty Lynch <rusty.lynch@intel.com>
At the point in traps.c where we recieve a break with a zero value, we can
not say if the break was a result of a kprobe or some other debug facility.
This simple patch changes the informational string to a more correct "break
0" value, and applies to the 2.6.12-rc2-mm2 tree with all the kprobes
patches that were just recently included for the next mm cut.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hien Nguyen [Thu, 23 Jun 2005 07:09:26 +0000 (00:09 -0700)]
[PATCH] kprobes: moves lock-unlock to non-arch kprobe_flush_task
This patch moves the lock/unlock of the arch specific kprobe_flush_task()
to the non-arch specific kprobe_flusk_task().
Signed-off-by: Hien Nguyen <hien@us.ibm.com>
Acked-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rusty Lynch [Thu, 23 Jun 2005 07:09:25 +0000 (00:09 -0700)]
[PATCH] Move kprobe [dis]arming into arch specific code
The architecture independent code of the current kprobes implementation is
arming and disarming kprobes at registration time. The problem is that the
code is assuming that arming and disarming is a just done by a simple write
of some magic value to an address. This is problematic for ia64 where our
instructions look more like structures, and we can not insert break points
by just doing something like:
*p->addr = BREAKPOINT_INSTRUCTION;
The following patch to 2.6.12-rc4-mm2 adds two new architecture dependent
functions:
* void arch_arm_kprobe(struct kprobe *p)
* void arch_disarm_kprobe(struct kprobe *p)
and then adds the new functions for each of the architectures that already
implement kprobes (spar64/ppc64/i386/x86_64).
I thought arch_[dis]arm_kprobe was the most descriptive of what was really
happening, but each of the architectures already had a disarm_kprobe()
function that was really a "disarm and do some other clean-up items as
needed when you stumble across a recursive kprobe." So... I took the
liberty of changing the code that was calling disarm_kprobe() to call
arch_disarm_kprobe(), and then do the cleanup in the block of code dealing
with the recursive kprobe case.
So far this patch as been tested on i386, x86_64, and ppc64, but still
needs to be tested in sparc64.
Signed-off-by: Rusty Lynch <rusty.lynch@intel.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rusty Lynch [Thu, 23 Jun 2005 07:09:23 +0000 (00:09 -0700)]
[PATCH] x86_64 specific function return probes
The following patch adds the x86_64 architecture specific implementation
for function return probes.
Function return probes is a mechanism built on top of kprobes that allows
a caller to register a handler to be called when a given function exits.
For example, to instrument the return path of sys_mkdir:
static int sys_mkdir_exit(struct kretprobe_instance *i, struct pt_regs *regs)
{
printk("sys_mkdir exited\n");
return 0;
}
static struct kretprobe return_probe = {
.handler = sys_mkdir_exit,
};
<inside setup function>
return_probe.kp.addr = (kprobe_opcode_t *) kallsyms_lookup_name("sys_mkdir");
if (register_kretprobe(&return_probe)) {
printk(KERN_DEBUG "Unable to register return probe!\n");
/* do error path */
}
<inside cleanup function>
unregister_kretprobe(&return_probe);
The way this works is that:
* At system initialization time, kernel/kprobes.c installs a kprobe
on a function called kretprobe_trampoline() that is implemented in
the arch/x86_64/kernel/kprobes.c (More on this later)
* When a return probe is registered using register_kretprobe(),
kernel/kprobes.c will install a kprobe on the first instruction of the
targeted function with the pre handler set to arch_prepare_kretprobe()
which is implemented in arch/x86_64/kernel/kprobes.c.
* arch_prepare_kretprobe() will prepare a kretprobe instance that stores:
- nodes for hanging this instance in an empty or free list
- a pointer to the return probe
- the original return address
- a pointer to the stack address
With all this stowed away, arch_prepare_kretprobe() then sets the return
address for the targeted function to a special trampoline function called
kretprobe_trampoline() implemented in arch/x86_64/kernel/kprobes.c
* The kprobe completes as normal, with control passing back to the target
function that executes as normal, and eventually returns to our trampoline
function.
* Since a kprobe was installed on kretprobe_trampoline() during system
initialization, control passes back to kprobes via the architecture
specific function trampoline_probe_handler() which will lookup the
instance in an hlist maintained by kernel/kprobes.c, and then call
the handler function.
* When trampoline_probe_handler() is done, the kprobes infrastructure
single steps the original instruction (in this case just a top), and
then calls trampoline_post_handler(). trampoline_post_handler() then
looks up the instance again, puts the instance back on the free list,
and then makes a long jump back to the original return instruction.
So to recap, to instrument the exit path of a function this implementation
will cause four interruptions:
- A breakpoint at the very beginning of the function allowing us to
switch out the return address
- A single step interruption to execute the original instruction that
we replaced with the break instruction (normal kprobe flow)
- A breakpoint in the trampoline function where our instrumented function
returned to
- A single step interruption to execute the original instruction that
we replaced with the break instruction (normal kprobe flow)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hien Nguyen [Thu, 23 Jun 2005 07:09:19 +0000 (00:09 -0700)]
[PATCH] kprobes: function-return probes
This patch adds function-return probes to kprobes for the i386
architecture. This enables you to establish a handler to be run when a
function returns.
1. API
Two new functions are added to kprobes:
int register_kretprobe(struct kretprobe *rp);
void unregister_kretprobe(struct kretprobe *rp);
2. Registration and unregistration
2.1 Register
To register a function-return probe, the user populates the following
fields in a kretprobe object and calls register_kretprobe() with the
kretprobe address as an argument:
kp.addr - the function's address
handler - this function is run after the ret instruction executes, but
before control returns to the return address in the caller.
maxactive - The maximum number of instances of the probed function that
can be active concurrently. For example, if the function is non-
recursive and is called with a spinlock or mutex held, maxactive = 1
should be enough. If the function is non-recursive and can never
relinquish the CPU (e.g., via a semaphore or preemption), NR_CPUS should
be enough. maxactive is used to determine how many kretprobe_instance
objects to allocate for this particular probed function. If maxactive <=
0, it is set to a default value (if CONFIG_PREEMPT maxactive=max(10, 2 *
NR_CPUS) else maxactive=NR_CPUS)
For example:
struct kretprobe rp;
rp.kp.addr = /* entrypoint address */
rp.handler = /*return probe handler */
rp.maxactive = /* e.g., 1 or NR_CPUS or 0, see the above explanation */
register_kretprobe(&rp);
The following field may also be of interest:
nmissed - Initialized to zero when the function-return probe is
registered, and incremented every time the probed function is entered but
there is no kretprobe_instance object available for establishing the
function-return probe (i.e., because maxactive was set too low).
2.2 Unregister
To unregiter a function-return probe, the user calls
unregister_kretprobe() with the same kretprobe object as registered
previously. If a probed function is running when the return probe is
unregistered, the function will return as expected, but the handler won't
be run.
3. Limitations
3.1 This patch supports only the i386 architecture, but patches for
x86_64 and ppc64 are anticipated soon.
3.2 Return probes operates by replacing the return address in the stack
(or in a known register, such as the lr register for ppc). This may
cause __builtin_return_address(0), when invoked from the return-probed
function, to return the address of the return-probes trampoline.
3.3 This implementation uses the "Multiprobes at an address" feature in
2.6.12-rc3-mm3.
3.4 Due to a limitation in multi-probes, you cannot currently establish
a return probe and a jprobe on the same function. A patch to remove
this limitation is being tested.
This feature is required by SystemTap (http://sourceware.org/systemtap),
and reflects ideas contributed by several SystemTap developers, including
Will Cohen and Ananth Mavinakayanahalli.
Signed-off-by: Hien Nguyen <hien@us.ibm.com>
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@laposte.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Christoph Hellwig [Thu, 23 Jun 2005 07:09:16 +0000 (00:09 -0700)]
[PATCH] quota: sanitize dentry handling in vfs_quota_on_mount
Use lookup_one_len instead of opencoding a simplified lookup using
lookup_hash with a fake hash.
Also there's no need anymore for the d_invalidate as we have a completely
valid dentry now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Christoph Hellwig [Thu, 23 Jun 2005 07:09:16 +0000 (00:09 -0700)]
[PATCH] quota: consolidate code surrounding vfs_quota_on_mount
Move some code duplicated in both callers into vfs_quota_on_mount
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Alexander Nyberg [Thu, 23 Jun 2005 07:09:13 +0000 (00:09 -0700)]
[PATCH] avoid resursive oopses
Prevent recursive faults in do_exit() by leaving the task alone and wait
for reboot. This may allow a more graceful shutdown and possibly save the
original oops.
Signed-off-by: Alexander Nyberg <alexn@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Christoph Hellwig [Thu, 23 Jun 2005 07:09:12 +0000 (00:09 -0700)]
[PATCH] remove duplicate get_dentry functions in various places
Various filesystem drivers have grown a get_dentry() function that's a
duplicate of lookup_one_len, except that it doesn't take a maximum length
argument and doesn't check for \0 or / in the passed in filename.
Switch all these places to use lookup_one_len.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Greg KH <greg@kroah.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Neil Horman [Thu, 23 Jun 2005 07:09:11 +0000 (00:09 -0700)]
[PATCH] add check to /proc/devices read routines
Patch to add check to get_chrdev_list and get_blkdev_list to prevent reads
of /proc/devices from spilling over the provided page if more than 4096
bytes of string data are generated from all the registered character and
block devices in a system
Signed-off-by: Neil Horman <nhorman@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pekka Enberg [Thu, 23 Jun 2005 07:09:09 +0000 (00:09 -0700)]
[PATCH] remove redundant vm_flags clearing from madvise.c
This patch removes redundant VM_ClearReadHint from mm/madvice.c which was
left there by Prasanna's patch.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jesper Juhl [Thu, 23 Jun 2005 07:09:09 +0000 (00:09 -0700)]
[PATCH] preempt_count is int - remove cast and don't assign to unsigned type
In kernel/sched.c the return value from preempt_count() is cast to an int.
That made sense when preempt_count was defined as different types on is not
needed and should go away. The patch removes the cast.
In kernel/timer.c the return value from preempt_count() is assigned to a
variable of type u32 and then that unsigned value is later compared to
preempt_count(). Since preempt_count() returns an int, an int is what
should be used to store its return value. Storing the result in an
unsigned 32bit integer made a tiny bit of sense back when preempt_count was
different types on different archs, but no more - let's not play signed vs
unsigned comparison games when we don't have to. The patch modifies the
code to use an int to hold the value. While I was around that bit of code
I also made two changes to a nearby (related) printk() - I modified it to
specify the loglevel explicitly and also broke the line into a few pieces
to avoid it being longer than 80 chars and clarified the text a bit.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jesper Juhl [Thu, 23 Jun 2005 07:09:07 +0000 (00:09 -0700)]
[PATCH] streamline preempt_count type across archs
The preempt_count member of struct thread_info is currently either defined
as int, unsigned int or __s32 depending on arch. This patch makes the type
of preempt_count an int on all archs.
Having preempt_count be an unsigned type prevents the catching of
preempt_count < 0 bugs, and using int on some archs and __s32 on others is
not exactely "neat" - much nicer when it's just int all over.
A previous version of this patch was already ACK'ed by Robert Love, and the
only change in this version of the patch compared to the one he ACK'ed is
that this one also makes sure the preempt_count member is consistently
commented.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Thu, 23 Jun 2005 07:09:06 +0000 (00:09 -0700)]
[PATCH] optimise loop driver a bit
Looks like locking can be optimised quite a lot. Increase lock widths
slightly so lo_lock is taken fewer times per request. Also it was quite
trivial to cover lo_pending with that lock, and remove the atomic
requirement. This also makes memory ordering explicitly correct, which is
nice (not that I particularly saw any mem ordering bugs).
Test was reading 4 250MB files in parallel on ext2-on-tmpfs filesystem (1K
block size, 4K page size). System is 2 socket Xeon with HT (4 thread).
intel:/home/npiggin# umount /dev/loop0 ; mount /dev/loop0 /mnt/loop ; /usr/bin/time ./mtloop.sh
Before:
0.24user 5.51system 0:02.84elapsed 202%CPU (0avgtext+0avgdata 0maxresident)k
0.19user 5.52system 0:02.88elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
0.19user 5.57system 0:02.89elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
0.22user 5.51system 0:02.90elapsed 197%CPU (0avgtext+0avgdata 0maxresident)k
0.19user 5.44system 0:02.91elapsed 193%CPU (0avgtext+0avgdata 0maxresident)k
After:
0.07user 2.34system 0:01.68elapsed 143%CPU (0avgtext+0avgdata 0maxresident)k
0.06user 2.37system 0:01.68elapsed 144%CPU (0avgtext+0avgdata 0maxresident)k
0.06user 2.39system 0:01.68elapsed 145%CPU (0avgtext+0avgdata 0maxresident)k
0.06user 2.36system 0:01.68elapsed 144%CPU (0avgtext+0avgdata 0maxresident)k
0.06user 2.42system 0:01.68elapsed 147%CPU (0avgtext+0avgdata 0maxresident)k
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Greg Edwards [Thu, 23 Jun 2005 07:09:05 +0000 (00:09 -0700)]
[PATCH] CON_CONSDEV bit not set correctly on last console
According to include/linux/console.h, CON_CONSDEV flag should be set on
the last console specified on the boot command line:
86 #define CON_PRINTBUFFER (1)
87 #define CON_CONSDEV (2) /* Last on the command line */
88 #define CON_ENABLED (4)
89 #define CON_BOOT (8)
This does not currently happen if there is more than one console specified
on the boot commandline. Instead, it gets set on the first console on the
command line. This can cause problems for things like kdb that look for
the CON_CONSDEV flag to see if the console is valid.
Additionaly, it doesn't look like CON_CONSDEV is reassigned to the next
preferred console at unregister time if the console being unregistered
currently has that bit set.
Example (from sn2 ia64):
elilo vmlinuz root=<dev> console=ttyS0 console=ttySG0
in this case, the flags on ttySG console struct will be 0x4 (should be
0x6).
Attached patch against bk fixes both issues for the cases I looked at. It
uses selected_console (which gets incremented for each console specified on
the command line) as the indicator of which console to set CON_CONSDEV on.
When adding the console to the list, if the previous one had CON_CONSDEV
set, it masks it out. Tested on ia64 and x86.
The problem with the current behavior is it breaks overriding the default from
the boot line. In the ia64 case, there may be a global append line defining
console=a in elilo.conf. Then you want to boot your kernel, and want to
override the default by passing console=b on the boot line. elilo constructs
the kernel cmdline by starting with the value of the global append line, then
tacks on whatever else you specify, which puts console=b last.
Signed-off-by: Greg Edwards <edwardsg@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Robert Love [Thu, 23 Jun 2005 07:09:04 +0000 (00:09 -0700)]
[PATCH] kstrdup: convert a few existing implementations
Convert a bunch of strdup() implementations and their callers to the new
kstrdup(). A few remain, for example see sound/core, and there are tons of
open coded strdup()'s around. Sigh. But this is a start.
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Paulo Marques [Thu, 23 Jun 2005 07:09:02 +0000 (00:09 -0700)]
[PATCH] create a kstrdup library function
This patch creates a new kstrdup library function and changes the "local"
implementations in several places to use this function.
Most of the changes come from the sound and net subsystems. The sound part
had already been acknowledged by Takashi Iwai and the net part by David S.
Miller.
I left UML alone for now because I would need more time to read the code
carefully before making changes there.
Signed-off-by: Paulo Marques <pmarques@grupopie.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Alexander Viro [Thu, 23 Jun 2005 07:09:01 +0000 (00:09 -0700)]
[PATCH] fix for prune_icache()/forced final iput() races
Based on analysis and a patch from Russ Weight <rweight@us.ibm.com>
There is a race condition that can occur if an inode is allocated and then
released (using iput) during the ->fill_super functions. The race
condition is between kswapd and mount.
For most filesystems this can only happen in an error path when kswapd is
running concurrently. For isofs, however, the error can occur in a more
common code path (which is how the bug was found).
The logic here is "we want final iput() to free inode *now* instead of
letting it sit in cache if fs is going down or had not quite come up". The
problem is with kswapd seeing such inodes in the middle of being killed and
happily taking over.
The clean solution would be to tell kswapd to leave those inodes alone and
let our final iput deal with them. I.e. add a new flag
(I_FORCED_FREEING), set it before write_inode_now() there and make
prune_icache() leave those alone.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Oleg Nesterov [Thu, 23 Jun 2005 07:09:00 +0000 (00:09 -0700)]
[PATCH] posix-timers: use try_to_del_timer_sync()
sys_timer_settime/sys_timer_delete needs to delete k_itimer->real.timer
synchronously while holding ->it_lock, which is also locked in
posix_timer_fn.
This patch removes timer_active/set_timer_inactive which plays with
timer_list's internals in favour of using try_to_del_timer_sync(), which
was introduced in the previous patch.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Oleg Nesterov [Thu, 23 Jun 2005 07:08:59 +0000 (00:08 -0700)]
[PATCH] timers: introduce try_to_del_timer_sync()
This patch splits del_timer_sync() into 2 functions. The new one,
try_to_del_timer_sync(), returns -1 when it hits executing timer.
It can be used in interrupt context, or when the caller hold locks which
can prevent completion of the timer's handler.
NOTE. Currently it can't be used in interrupt context in UP case, because
->running_timer is used only with CONFIG_SMP.
Should the need arise, it is possible to kill #ifdef CONFIG_SMP in
set_running_timer(), it is cheap.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Oleg Nesterov [Thu, 23 Jun 2005 07:08:56 +0000 (00:08 -0700)]
[PATCH] timers fixes/improvements
This patch tries to solve following problems:
1. del_timer_sync() is racy. The timer can be fired again after
del_timer_sync have checked all cpus and before it will recheck
timer_pending().
2. It has scalability problems. All cpus are scanned to determine
if the timer is running on that cpu.
With this patch del_timer_sync is O(1) and no slower than plain
del_timer(pending_timer), unless it has to actually wait for
completion of the currently running timer.
The only restriction is that the recurring timer should not use
add_timer_on().
3. The timers are not serialized wrt to itself.
If CPU_0 does mod_timer(jiffies+1) while the timer is currently
running on CPU 1, it is quite possible that local interrupt on
CPU_0 will start that timer before it finished on CPU_1.
4. The timers locking is suboptimal. __mod_timer() takes 3 locks
at once and still requires wmb() in del_timer/run_timers.
The new implementation takes 2 locks sequentially and does not
need memory barriers.
Currently ->base != NULL means that the timer is pending. In that case
->base.lock is used to lock the timer. __mod_timer also takes timer->lock
because ->base can be == NULL.
This patch uses timer->entry.next != NULL as indication that the timer is
pending. So it does __list_del(), entry->next = NULL instead of list_del()
when the timer is deleted.
The ->base field is used for hashed locking only, it is initialized
in init_timer() which sets ->base = per_cpu(tvec_bases). When the
tvec_bases.lock is locked, it means that all timers which are tied
to this base via timer->base are locked, and the base itself is locked
too.
So __run_timers/migrate_timers can safely modify all timers which could
be found on ->tvX lists (pending timers).
When the timer's base is locked, and the timer removed from ->entry list
(which means that _run_timers/migrate_timers can't see this timer), it is
possible to set timer->base = NULL and drop the lock: the timer remains
locked.
This patch adds lock_timer_base() helper, which waits for ->base != NULL,
locks the ->base, and checks it is still the same.
__mod_timer() schedules the timer on the local CPU and changes it's base.
However, it does not lock both old and new bases at once. It locks the
timer via lock_timer_base(), deletes the timer, sets ->base = NULL, and
unlocks old base. Then __mod_timer() locks new_base, sets ->base = new_base,
and adds this timer. This simplifies the code, because AB-BA deadlock is not
possible. __mod_timer() also ensures that the timer's base is not changed
while the timer's handler is running on the old base.
__run_timers(), del_timer() do not change ->base anymore, they only clear
pending flag.
So del_timer_sync() can test timer->base->running_timer == timer to detect
whether it is running or not.
We don't need timer_list->lock anymore, this patch kills it.
We also don't need barriers. del_timer() and __run_timers() used smp_wmb()
before clearing timer's pending flag. It was needed because __mod_timer()
did not lock old_base if the timer is not pending, so __mod_timer()->list_add()
could race with del_timer()->list_del(). With this patch these functions are
serialized through base->lock.
One problem. TIMER_INITIALIZER can't use per_cpu(tvec_bases). So this patch
adds global
struct timer_base_s {
spinlock_t lock;
struct timer_list *running_timer;
} __init_timer_base;
which is used by TIMER_INITIALIZER. The corresponding fields in tvec_t_base_s
struct are replaced by struct timer_base_s t_base.
It is indeed ugly. But this can't have scalability problems. The global
__init_timer_base.lock is used only when __mod_timer() is called for the first
time AND the timer was compile time initialized. After that the timer migrates
to the local CPU.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Renaud Lienhart <renaud.lienhart@free.fr>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Thu, 23 Jun 2005 07:08:54 +0000 (00:08 -0700)]
[PATCH] blk: unplug later
get_request_wait needn't unplug the device immediately.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Thu, 23 Jun 2005 07:08:53 +0000 (00:08 -0700)]
[PATCH] blk: branch hints
Sprinkle around a few branch hints in the block layer.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Thu, 23 Jun 2005 07:08:52 +0000 (00:08 -0700)]
[PATCH] blk: no memory barrier
This memory barrier is not needed because the waitqueue will only get waiters
on it in the following situations:
rq->count has exceeded the threshold - however all manipulations of ->count
are performed under the runqueue lock, and so we will correctly pick up any
waiter.
Memory allocation for the request fails. In this case, there is no additional
help provided by the memory barrier. We are guaranteed to eventually wake up
waiters because the request allocation mempool guarantees that if the mem
allocation for a request fails, there must be some requests in flight. They
will wake up waiters when they are retired.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Tejun Heo [Thu, 23 Jun 2005 07:08:51 +0000 (00:08 -0700)]
[PATCH] blk: cleanup generic tag support error messages
Add KERN_ERR and __FUNCTION__ to generic tag error messages, and add a comment
in blk_queue_end_tag() which explains the silent failure path.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Tejun Heo [Thu, 23 Jun 2005 07:08:50 +0000 (00:08 -0700)]
[PATCH] blk: remove BLK_TAGS_{PER_LONG|MASK}
Replace BLK_TAGS_PER_LONG with BITS_PER_LONG and remove unused BLK_TAGS_MASK.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Tejun Heo [Thu, 23 Jun 2005 07:08:49 +0000 (00:08 -0700)]
[PATCH] blk: remove blk_queue_tag->real_max_depth optimization
blk_queue_tag->real_max_depth was used to optimize out unnecessary
allocations/frees on tag resize. However, the whole thing was very broken -
tag_map was never allocated to real_max_depth resulting in access beyond the
end of the map, bits in [max_depth..real_max_depth] were set when initializing
a map and copied when resizing resulting in pre-occupied tags.
As the gain of the optimization is very small, well, almost nill, remove the
whole thing.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Tejun Heo [Thu, 23 Jun 2005 07:08:48 +0000 (00:08 -0700)]
[PATCH] blk: use find_first_zero_bit() in blk_queue_start_tag()
blk_queue_start_tag() hand-coded searching for the first zero bit in the tag
map. Replace it with find_first_zero_bit(). With this patch,
blk_queue_star_tag() doesn't need to fill remains of tag map with 1, thus
allowing it to work properly with the next remove_real_max_depth patch.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Domen Puncer [Thu, 23 Jun 2005 07:08:47 +0000 (00:08 -0700)]
[PATCH] ptrace_h8300: condition bugfix
Assignment doesn't make much sense here as condition would always be true.
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Vincent Hanquez [Thu, 23 Jun 2005 07:08:46 +0000 (00:08 -0700)]
[PATCH] xen: x86_64: use more usermode macro
Make use of the user_mode macro where it's possible. This is useful for Xen
because it will need only to redefine only the macro to a hypervisor call.
Signed-off-by: Vincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Vincent Hanquez [Thu, 23 Jun 2005 07:08:46 +0000 (00:08 -0700)]
[PATCH] xen: x86_64: Add macro for debugreg
Add 2 macros to set and get debugreg on x86_64. This is useful for Xen
because it will need only to redefine each macro to a hypervisor call.
Signed-off-by: Vincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Vincent Hanquez [Thu, 23 Jun 2005 07:08:45 +0000 (00:08 -0700)]
[PATCH] xen: x86: Use more usermode macro
Use the user_mode macro where it's possible.
Signed-off-by: Vincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Vincent Hanquez [Thu, 23 Jun 2005 07:08:44 +0000 (00:08 -0700)]
[PATCH] xen: x86: Rename usermode macro
Rename user_mode to user_mode_vm and add a user_mode macro similar to the
x86-64 one.
This is useful for Xen because the linux xen kernel does not runs on the same
priviledge that a vanilla linux kernel, and with this we just need to redefine
user_mode().
Signed-off-by: Vincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Vincent Hanquez [Thu, 23 Jun 2005 07:08:43 +0000 (00:08 -0700)]
[PATCH] xen: x86: Use new macro for debugreg
Make use of the 2 new macro set_debugreg and get_debugreg.
Signed-off-by: Vincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Vincent Hanquez [Thu, 23 Jun 2005 07:08:42 +0000 (00:08 -0700)]
[PATCH] xen: x86: add macro for debugreg
Add 2 macros to set and get debugreg on x86. This is useful for Xen because
it will need only to redefine each macro to a hypervisor call.
Signed-off-by: Vincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Natalie Protasevich [Thu, 23 Jun 2005 07:08:41 +0000 (00:08 -0700)]
[PATCH] x86_64: avoid wasting IRQs
I suggest to change the way IRQs are handed out to PCI devices.
Currently, each I/O APIC pin gets associated with an IRQ, no matter if the
pin is used or not. It is expected that each pin can potentually be
engaged by a device inserted into the corresponding PCI slot. However,
this imposes severe limitation on systems that have designs that employ
many I/O APICs, only utilizing couple lines of each, such as P64H2 chipset.
It is used in ES7000, and currently, there is no way to boot the system
with more that 9 I/O APICs.
The simple change below allows to boot a system with say 64 (or more) I/O
APICs, each providing 1 slot, which otherwise impossible because of the IRQ
gaps created for unused lines on each I/O APIC. It does not resolve the
problem with number of devices that exceeds number of possible IRQs, but
eases up a tension for IRQs on any large system with potentually large
number of devices.
I only implemented this for the ACPI boot, since if the system is this big
and using newer chipsets it is probably (better be!) an ACPI based system
:). The change is completely "mechanical" and does not alter any internal
structures or interrupt model/implementation. The patch works for both
i386 and x86_64 archs. It works with MSIs just fine, and should not
intervene with implementations like shared vectors, when they get worked
out and incorporated.
To illustrate, below is the interrupt distribution for 2-cell ES7000 with
20 I/O APICs, and an Ethernet card in the last slot, which should be eth1
and which was not configured because its IRQ exceeded allowable number (it
actially turned out huge - 480!):
zorro-tb2:~ # cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 65716 30012 30007 30002 30009 30010 30010 30010 IO-APIC-edge timer
4: 373 0 725 280 0 0 0 0 IO-APIC-edge serial
8: 0 0 0 0 0 0 0 0 IO-APIC-edge rtc
9: 0 0 0 0 0 0 0 0 IO-APIC-level acpi
14: 39 3 0 0 0 0 0 0 IO-APIC-edge ide0
16: 108 13 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb1
18: 0 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb3
19: 15 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb2
23: 3 0 0 0 0 0 0 0 IO-APIC-level ehci_hcd:usb4
96: 4240 397 18 0 0 0 0 0 IO-APIC-level aic7xxx
97: 15 0 0 0 0 0 0 0 IO-APIC-level aic7xxx
192: 847 0 0 0 0 0 0 0 IO-APIC-level eth0
NMI: 0 0 0 0 0 0 0 0
LOC: 273423 274528 272829 274228 274092 273761 273827 273694
ERR: 7
MIS: 0
Even though the system doesn't have that many devices, some don't get
enabled only because of IRQ numbering model.
This is the IRQ picture after the patch was applied:
zorro-tb2:~ # cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 44169 10004 10004 10001 10004 10003 10004 6135 IO-APIC-edge timer
4: 345 0 0 0 0 244 0 0 IO-APIC-edge serial
8: 0 0 0 0 0 0 0 0 IO-APIC-edge rtc
9: 0 0 0 0 0 0 0 0 IO-APIC-level acpi
14: 39 0 3 0 0 0 0 0 IO-APIC-edge ide0
17: 4425 0 9 0 0 0 0 0 IO-APIC-level aic7xxx
18: 15 0 0 0 0 0 0 0 IO-APIC-level aic7xxx, uhci_hcd:usb3
21: 231 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb1
22: 26 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb2
23: 3 0 0 0 0 0 0 0 IO-APIC-level ehci_hcd:usb4
24: 348 0 0 0 0 0 0 0 IO-APIC-level eth0
25: 6 192 0 0 0 0 0 0 IO-APIC-level eth1
NMI: 0 0 0 0 0 0 0 0
LOC: 107981 107636 108899 108698 108489 108326 108331 108254
ERR: 7
MIS: 0
Not only we see the card in the last I/O APIC, but we are not even close to
using up available IRQs, since we didn't waste any.
Signed-off-by: Natalie Protasevich <Natalie.Protasevich@unisys.com>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jan Beulich [Thu, 23 Jun 2005 07:08:38 +0000 (00:08 -0700)]
[PATCH] eliminate duplicate rdpmc definition
Eliminate duplicate definition of rdpmc in x86-64's mtrr.h.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Roland McGrath [Thu, 23 Jun 2005 07:08:37 +0000 (00:08 -0700)]
[PATCH] x86_64: never block forced SIGSEGV
This is the x86_64 version of the signal fix I just posted for i386.
This problem was first noticed on PPC and has already been fixed there.
But the exact same issue applies to other platforms in the same way. The
signal blocking for sa_mask and the handled signal takes place after the
handler setup. When the stack is bogus, the handler setup forces a
SIGSEGV. But then this will be blocked, and returning to user mode will
fault again and iterate. This patch fixes the problem by checking whether
signal handler setup failed, and not doing the signal-blocking if so. This
copies what was done in the ppc code. I think all architectures' signal
handler setup code follows this pattern and needs the change.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
john stultz [Thu, 23 Jun 2005 07:08:36 +0000 (00:08 -0700)]
[PATCH] x86_64: fix hpet for systems that don't support legacy replacement
Currently the x86-64 HPET code assumes the entire HPET implementation from
the spec is present. This breaks on boxes that do not implement the
optional legacy timer replacement functionality portion of the spec.
This patch fixes this issue, allowing x86-64 systems that cannot use the
HPET for the timer interrupt and RTC to still use the HPET as a time
source. I've tested this patch on a system systems without HPET, with HPET
but without legacy timer replacement, as well as HPET with legacy timer
replacement.
This version adds a minor check to cap the HPET counter value in
gettimeoffset_hpet to avoid possible time inconsistencies. Please ignore
the A2 version I sent to you earlier.
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Alexander Nyberg [Thu, 23 Jun 2005 07:08:35 +0000 (00:08 -0700)]
[PATCH] x86_64: i8259.c iso99 structure initialization
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andrew Morton [Thu, 23 Jun 2005 07:08:35 +0000 (00:08 -0700)]
[PATCH] mtrr size-and-base debugging
Consolidate the mtrr sanity checking, add a dump_stack().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andrew Morton [Thu, 23 Jun 2005 07:08:34 +0000 (00:08 -0700)]
[PATCH] x86: cpu_khz type fix
x86_64's cpu_khz is unsigned int and there is no reason why x86 needs to use
unsigned long.
So make cpu_khz unsigned int on x86 as well.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Alexey Dobriyan [Thu, 23 Jun 2005 07:08:33 +0000 (00:08 -0700)]
[PATCH] Remove i386_ksyms.c, almost.
* EXPORT_SYMBOL's moved to other files
* #include <linux/config.h>, <linux/module.h> where needed
* #include's in i386_ksyms.c cleaned up
* After copy-paste, redundant due to Makefiles rules preprocessor directives
removed:
#ifdef CONFIG_FOO
EXPORT_SYMBOL(foo);
#endif
obj-$(CONFIG_FOO) += foo.o
* Tiny reformat to fit in 80 columns
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Alexey Dobriyan [Thu, 23 Jun 2005 07:08:32 +0000 (00:08 -0700)]
[PATCH] x86: #include asm/uaccess.h in asm/checksum.h
csum_and_copy_to_user is static inline and uses VERIFY_WRITE. Patch allows
to remove asm/uaccess.h from i386_ksyms.c without dependency surprises.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Aleksey Gorelov [Thu, 23 Jun 2005 07:08:29 +0000 (00:08 -0700)]
[PATCH] VIA
82C586B IRQ routing fix
According to the VIA
82C586B datasheet (still available from
http://gkernel.sourceforge.net/specs/via/586b.pdf.bz2) this chip need a
special PIRQ mapping.
Signed-off-by: Karsten Keil <kkeil@suse.de>
Signed-off-by: Aleksey Gorelov <aleksey_gorelov@phoenix.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Natalie Protasevich [Thu, 23 Jun 2005 07:08:29 +0000 (00:08 -0700)]
[PATCH] x86: avoid wasting IRQs for PCI devices
I have submitted the patch for x86_64, this is submission for i386.
The patch changes the way IRQs are handed out to PCI devices. Currently,
each I/O APIC pin gets associated with an IRQ, no matter if the pin is used
or not. This imposes severe limitation on systems that have designs that
employ many I/O APICs, only utilizing couple lines of each, such as P64H2
chipset. It is used in ES7000, and currently, there is no way to boot the
system with more that 9 I/O APICs.
The simple change below allows to boot a system with say 64 (or more) I/O
APICs, each providing 1 slot, which otherwise impossible because of the IRQ
gaps created for unused lines on each I/O APIC. It does not resolve the
problem with number of devices that exceeds number of possible IRQs, but
eases up a tension for IRQs on any large system with potentually large
number of devices.
Signed-off-by: Natalie Protasevich <Natalie.Protasevich@unisys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Christoph Lameter [Thu, 23 Jun 2005 07:08:27 +0000 (00:08 -0700)]
[PATCH] ia64: Selectable Timer Interrupt Frequency
It allows a selectable timer interrupt frequency of 100, 250 and 1000 HZ.
Reducing the timer frequency may have important performance benefits on
large systems.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Christoph Lameter [Thu, 23 Jun 2005 07:08:25 +0000 (00:08 -0700)]
[PATCH] i386: Selectable Frequency of the Timer Interrupt
Make the timer frequency selectable. The timer interrupt may cause bus
and memory contention in large NUMA systems since the interrupt occurs
on each processor HZ times per second.
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jan Beulich [Thu, 23 Jun 2005 07:08:24 +0000 (00:08 -0700)]
[PATCH] allow early printk to use more than 25 lines
Allow early printk code to take advantage of the full size of the screen, not
just the first 25 lines.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jan Beulich [Thu, 23 Jun 2005 07:08:23 +0000 (00:08 -0700)]
[PATCH] adjust i386 watchdog tick calculation
Get the i386 watchdog tick calculation into a state where it can also be used
on CPUs with frequencies beyond 4GHz, and it consolidates the calculation into
a single place (for potential furture adjustments).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Natalie Protasevich [Thu, 23 Jun 2005 07:08:22 +0000 (00:08 -0700)]
[PATCH] Do not enforce unique IO_APIC_ID check for xAPIC systems (i386)
This patch is per Andi's request to remove NO_IOAPIC_CHECK from genapic and
use heuristics to prevent unique I/O APIC ID check for systems that don't
need it. The patch disables unique I/O APIC ID check for Xeon-based and
other platforms that don't use serial APIC bus for interrupt delivery.
Andi stated that AMD systems don't need unique IO_APIC_IDs either.
Signed-off-by: Natalie Protasevich <Natalie.Protasevich@unisys.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Roland McGrath [Thu, 23 Jun 2005 07:08:21 +0000 (00:08 -0700)]
[PATCH] i386: never block forced SIGSEGV
This problem was first noticed on PPC and has already been fixed there.
But the exact same issue applies to other platforms in the same way. The
signal blocking for sa_mask and the handled signal takes place after the
handler setup. When the stack is bogus, the handler setup forces a
SIGSEGV. But then this will be blocked, and returning to user mode will
fault again and iterate. This patch fixes the problem by checking whether
signal handler setup failed, and not doing the signal-blocking if so. This
copies what was done in the ppc code. I think all architectures' signal
handler setup code follows this pattern and needs the change.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Christoph Lameter [Thu, 23 Jun 2005 07:08:19 +0000 (00:08 -0700)]
[PATCH] NUMA aware block device control structure allocation
Patch to allocate the control structures for for ide devices on the node of
the device itself (for NUMA systems). The patch depends on the Slab API
change patch by Manfred and me (in mm) and the pcidev_to_node patch that I
posted today.
Does some realignment too.
Signed-off-by: Justin M. Forbes <jmforbes@linuxtx.org>
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Pravin Shelar <pravin@calsoftinc.com>
Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Christoph Lameter [Thu, 23 Jun 2005 07:08:18 +0000 (00:08 -0700)]
[PATCH] x86/x86_64: pcibus_to_node
Define pcibus_to_node to be able to figure out which NUMA node contains a
given PCI device. This defines pcibus_to_node(bus) in
include/linux/topology.h and adjusts the macros for i386 and x86_64 that
already provided a way to determine the cpumask of a pci device.
x86_64 was changed to not build an array of cpumasks anymore. Instead an
array of nodes is build which can be used to generate the cpumask via
node_to_cpumask.
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Christoph Lameter [Thu, 23 Jun 2005 07:08:17 +0000 (00:08 -0700)]
[PATCH] ppc64: pcibus_to_node fix
asm-generic/topology.h must also be included if CONFIG_NUMA is set in order to
provide the fall back pcibus_to_node function.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hirokazu Takata [Thu, 23 Jun 2005 07:08:14 +0000 (00:08 -0700)]
[PATCH] m32r: build fix for asm-m32r/topology.h
Use asm-generic/topology.h to fix yet another pcibus_to_node() build error.
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Venkatesh Pallipadi [Thu, 23 Jun 2005 07:08:13 +0000 (00:08 -0700)]
[PATCH] Platform SMIs and their interferance with tsc based delay calibration
Issue:
Current tsc based delay_calibration can result in significant errors in
loops_per_jiffy count when the platform events like SMIs
(System Management Interrupts that are non-maskable) are present. This could
lead to potential kernel panic(). This issue is becoming more visible with 2.6
kernel (as default HZ is 1000) and on platforms with higher SMI handling
latencies. During the boot time, SMIs are mostly used by BIOS (for things
like legacy keyboard emulation).
Description:
The psuedocode for current delay calibration with tsc based delay looks like
(0) Estimate a value for loops_per_jiffy
(1) While (loops_per_jiffy estimate is accurate enough)
(2) wait for jiffy transition (jiffy1)
(3) Note down current tsc (tsc1)
(4) loop until tsc becomes tsc1 + loops_per_jiffy
(5) check whether jiffy changed since jiffy1 or not and refine
loops_per_jiffy estimate
Consider the following cases
Case 1:
If SMIs happen between (2) and (3) above, we can end up with a
loops_per_jiffy value that is too low. This results in shorted delays and
kernel can panic () during boot (Mostly at IOAPIC timer initialization
timer_irq_works() as we don't have enough timer interrupts in a specified
interval).
Case 2:
If SMIs happen between (3) and (4) above, then we can end up with a
loops_per_jiffy value that is too high. And with current i386 code, too
high lpj value (greater than 17M) can result in a overflow in
delay.c:__const_udelay() again resulting in shorter delay and panic().
Solution:
The patch below makes the calibration routine aware of asynchronous events
like SMIs. We increase the delay calibration time and also identify any
significant errors (greater than 12.5%) in the calibration and notify it to
user.
Patch below changes both i386 and x86-64 architectures to use this
new and improved calibrate_delay_direct() routine.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ian Campbell [Thu, 23 Jun 2005 07:08:10 +0000 (00:08 -0700)]
[PATCH] use ${CROSS_COMPILE}installkernel in arch/*/boot/install.sh
The attached patch causes the various arch specific install.sh scripts to
look for ${CROSS_COMPILE}installkernel rather than just installkernel (in
both /sbin/ and ~/bin/ where the script already did this). This allows you
to have e.g. arm-linux-installkernel as a handy way to install on your
cross target. It also prevents the script picking up on the host
/sbin/installkernel which causes the script to fall through and do the
install itself (which is what I actually use myself, with $INSTALL_PATH
set).
I don't believe it causes back-compatibility problems since calling the
host installkernel was never likely to work or be what you wanted when
cross compiling anyway. If $CROSS_COMPILE isn't set then nothing changes.
I only use ARM and i386 myself but I figured it couldn't hurt to do the
whole lot. I've cc'd those who I hope are the arch maintainers for files
that I've touched.
Signed-off-by: Ian Campbell <icampbell@arcom.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
H. Peter Anvin [Thu, 23 Jun 2005 07:08:09 +0000 (00:08 -0700)]
[PATCH] biarch compiler support for i386
This allows the i386 architecture to be built on a system with a biarch
compiler that defaults to x86-64, merely by specifying ARCH=i386.
As previously discussed, this uses the equivalent logic to the ppc port.
Signed-Off-By: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Martin J. Bligh [Thu, 23 Jun 2005 07:08:08 +0000 (00:08 -0700)]
[PATCH] add page_state info to show_mem
This helps a lot when debugging out of memory stuff - useful especially to
see if all the memory is sucked into slab, etc.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Matt Tolentino [Thu, 23 Jun 2005 07:08:07 +0000 (00:08 -0700)]
[PATCH] add x86-64 specific support for sparsemem
This patch adds in the necessary support for sparsemem such that x86-64
kernels may use sparsemem as an alternative to discontigmem for NUMA
kernels. Note that this does no preclude one from continuing to build NUMA
kernels using discontigmem, but merely allows the option to build NUMA
kernels with sparsemem.
Interestingly, the use of sparsemem in lieu of discontigmem in NUMA kernels
results in reduced text size for otherwise equivalent kernels as shown in
the example builds below:
text data bss dec hex filename
2371036 765884
1237108 4374028 42be0c vmlinux.discontig
2366549 776484
1302772 4445805 43d66d vmlinux.sparse
Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Matt Tolentino [Thu, 23 Jun 2005 07:08:06 +0000 (00:08 -0700)]
[PATCH] reorganize x86-64 NUMA and DISCONTIGMEM config options
In order to use the alternative sparsemem implmentation for NUMA kernels,
we need to reorganize the config options. This patch effectively abstracts
out the CONFIG_DISCONTIGMEM options to CONFIG_NUMA in most cases. Thus,
the discontigmem implementation may be employed as always, but the
sparsemem implementation may be used alternatively.
Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Matt Tolentino [Thu, 23 Jun 2005 07:08:05 +0000 (00:08 -0700)]
[PATCH] add x86-64 Kconfig options for sparsemem
Add the requisite arch specific Kconfig options to enable the use of the
sparsemem implementation for NUMA kernels on x86-64.
Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Matt Tolentino [Thu, 23 Jun 2005 07:08:03 +0000 (00:08 -0700)]
[PATCH] remove direct ref to contig_page_data for x86-64
This patch pulls out all remaining direct references to contig_page_data
from arch/x86-64, thus saving an ifdef in one case.
Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andy Whitcroft [Thu, 23 Jun 2005 07:08:03 +0000 (00:08 -0700)]
[PATCH] ppc64: sparsemem memory model
Provide the architecture specific implementation for SPARSEMEM for PPC64
systems.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> (in part)
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andy Whitcroft [Thu, 23 Jun 2005 07:08:02 +0000 (00:08 -0700)]
[PATCH] ppc64: add memory present
Provide hooks for PPC64 to allow memory models to be informed of installed
memory areas. This allows SPARSEMEM to instantiate mem_map for the populated
areas.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andy Whitcroft [Thu, 23 Jun 2005 07:08:01 +0000 (00:08 -0700)]
[PATCH] ppc64: add early_pfn_to_nid
Provide an implementation of early_pfn_to_nid for PPC64. This is used by
memory models to determine the node from which to take allocations before the
memory allocators are fully initialised.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andy Whitcroft [Thu, 23 Jun 2005 07:08:00 +0000 (00:08 -0700)]
[PATCH] sparsemem hotplug base
Make sparse's initalization be accessible at runtime. This allows sparse
mappings to be created after boot in a hotplug situation.
This patch is separated from the previous one just to give an indication how
much of the sparse infrastructure is *just* for hotplug memory.
The section_mem_map doesn't really store a pointer. It stores something that
is convenient to do some math against to get a pointer. It isn't valid to
just do *section_mem_map, so I don't think it should be stored as a pointer.
There are a couple of things I'd like to store about a section. First of all,
the fact that it is !NULL does not mean that it is present. There could be
such a combination where section_mem_map *is* NULL, but the math gets you
properly to a real mem_map. So, I don't think that check is safe.
Since we're storing 32-bit-aligned structures, we have a few bits in the
bottom of the pointer to play with. Use one bit to encode whether there's
really a mem_map there, and the other one to tell whether there's a valid
section there. We need to distinguish between the two because sometimes
there's a gap between when a section is discovered to be present and when we
can get the mem_map for it.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andy Whitcroft [Thu, 23 Jun 2005 07:07:59 +0000 (00:07 -0700)]
[PATCH] sparsemem swiss cheese numa layouts
The part of the sparsemem patch which modifies memmap_init_zone() has recently
become a problem. It changes behavior so that there is a call to
pfn_to_page() for each individual page inside of a node's range:
node_start_pfn through node_end_pfn. It used to simply do this once, at the
beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside
of a node made it necessary to change.
Mike Kravetz recently wrote a patch which made the NUMA code accept some new
kinds of layouts. The system's memory was laid out like this, with node 0's
memory in two pieces: one before and one after node 1's memory:
Node 0: +++++ +++++
Node 1: +++++
Previous behavior before Mike's patch was to assign nodes like this:
Node 0: 00000 XXXXX
Node 1: 11111
Where the 'X' areas were simply thrown away. The new behavior was to make the
pg_data_t span node 0 across all of its areas, including areas that are really
node 1's: Node 0:
000000000000000 Node 1: 11111
This wastes a little bit of mem_map space, but ends up being OK, and more
fully utilizes the system's memory. memmap_init_zone() initializes all of the
"struct page"s for node 0, even for the "hole", but those never get used,
because there is no pfn_to_page() that resolves to those pages. However, only
calling pfn_to_page() once, memmap_init_zone() always uses the pages that were
allocated for node0->node_mem_map because:
struct page *start = pfn_to_page(start_pfn);
// effectively start = &node->node_mem_map[0]
for (page = start; page < (start + size); page++) {
init_page_here();...
page++;
}
Slow, and wasteful, but generally harmless.
But, modify that to call pfn_to_page() for each loop iteration (like sparsemem
does):
for (pfn = start_pfn; pfn < < (start_pfn + size); pfn++++) {
page = pfn_to_page(pfn);
}
And you end up trying to initialize node 1's pages too early, along with bogus
data from node 0. This patch checks for those weird layouts and declines to
touch the pages, making the more frequent pfn_to_page() calls OK to do.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andy Whitcroft [Thu, 23 Jun 2005 07:07:57 +0000 (00:07 -0700)]
[PATCH] sparsemem memory model for i386
Provide the architecture specific implementation for SPARSEMEM for i386 SMP
and NUMA systems.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andy Whitcroft [Thu, 23 Jun 2005 07:07:54 +0000 (00:07 -0700)]
[PATCH] sparsemem memory model
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.
A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA. When producing this patch, it became apparent in that NUMA
and DISCONTIG are often confused.
Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous. It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.
Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
to be chopped up.
In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags. Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations. However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions. Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.
One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags. It might provide
speed increases on certain platforms and will be stored there if there is
room. But, if out of room, an alternate (theoretically slower) mechanism is
used.
This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andy Whitcroft [Thu, 23 Jun 2005 07:07:53 +0000 (00:07 -0700)]
[PATCH] generify memory present
Allow architectures to indicate that they will be providing hooks to indice
installed memory areas, memory_present(). Provide prototypes for the i386
implementation.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andy Whitcroft [Thu, 23 Jun 2005 07:07:52 +0000 (00:07 -0700)]
[PATCH] generify early_pfn_to_nid
Provide a default implementation for early_pfn_to_nid returning node 0. Allow
architectures to override this with their own implementation out of
asm/mmzone.h.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mike Kravetz [Thu, 23 Jun 2005 07:07:51 +0000 (00:07 -0700)]
[PATCH] ppc64: Kconfig memory models
This patch changes some of the default behavior in the ppc64 Kconfig file
that was recently changed/added to 2.6.12-rc2-mm1 by Dave Hansen in
preparation for SPARSEMEM. Patch allows the display of both FLAT and
DISCONTIG models on pseries. As before, default is DISCONTIG for SMP and
PSERIES and FLAT for others.
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dave Hansen [Thu, 23 Jun 2005 07:07:50 +0000 (00:07 -0700)]
[PATCH] mm/Kconfig: give DISCONTIG more help text
This gives DISCONTIGMEM a bit more help text to explain what it does, not just
when to choose it.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dave Hansen [Thu, 23 Jun 2005 07:07:49 +0000 (00:07 -0700)]
[PATCH] mm/Kconfig: hide "Memory Model" selection menu
I got some feedback from users who think that the new "Memory Model" menu is a
little invasive. This patch will hide that menu, except when
CONFIG_EXPERIMENTAL is enabled *or* when an individual architecture wants it.
An individual arch may want to enable it because they've removed their
arch-specific DISCONTIG prompt in favor of the mm/Kconfig one.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dave Hansen [Thu, 23 Jun 2005 07:07:48 +0000 (00:07 -0700)]
[PATCH] mm/Kconfig: kill unused ARCH_FLATMEM_DISABLE
This used to be used to disable FLATMEM selection, but I decided to change it
to be done generically when DISCONTIG is enabled. The option is unused, so
this kills it.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dave Hansen [Thu, 23 Jun 2005 07:07:47 +0000 (00:07 -0700)]
[PATCH] sparsemem: fix minor "defaults" issue in mm/Kconfig
The following patch applies on top of 2.6.12-rc2-mm1. It fixes a minor
user interaction issue, and an early reference to SPARSEMEM.
This "choice" menu would always default to FLATMEM, as it was listed first.
Move it to the end so that the other defaults have a chance first.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>