Commit | Line | Data |
---|---|---|
1da177e4 LT |
1 | CPU frequency and voltage scaling code in the Linux(TM) kernel |
2 | ||
3 | ||
4 | L i n u x C P U F r e q | |
5 | ||
6 | C P U F r e q G o v e r n o r s | |
7 | ||
8 | - information for users and developers - | |
9 | ||
10 | ||
11 | Dominik Brodowski <linux@brodo.de> | |
594dd2c9 | 12 | some additions and corrections by Nico Golde <nico@ngolde.de> |
1da177e4 LT |
13 | |
14 | ||
15 | ||
16 | Clock scaling allows you to change the clock speed of the CPUs on the | |
17 | fly. This is a nice method to save battery power, because the lower | |
18 | the clock speed, the less power the CPU consumes. | |
19 | ||
20 | ||
21 | Contents: | |
22 | --------- | |
23 | 1. What Is A CPUFreq Governor? | |
24 | ||
25 | 2. Governors In the Linux Kernel | |
26 | 2.1 Performance | |
27 | 2.2 Powersave | |
28 | 2.3 Userspace | |
594dd2c9 | 29 | 2.4 Ondemand |
537208c8 | 30 | 2.5 Conservative |
6fa3eb70 | 31 | 2.6 Interactive |
1da177e4 LT |
32 | |
33 | 3. The Governor Interface in the CPUfreq Core | |
34 | ||
35 | ||
36 | ||
37 | 1. What Is A CPUFreq Governor? | |
38 | ============================== | |
39 | ||
40 | Most cpufreq drivers (in fact, all except one, longrun) or even most | |
41 | cpu frequency scaling algorithms only allow the CPU to be set to one | |
42 | frequency. In order to offer dynamic frequency scaling, the cpufreq | |
43 | core must be able to tell these drivers of a "target frequency". So | |
44 | these specific drivers will be transformed to offer a "->target" | |
45 | call instead of the existing "->setpolicy" call. For "longrun", all | |
46 | stays the same, though. | |
47 | ||
48 | How to decide what frequency within the CPUfreq policy should be used? | |
49 | That's done using "cpufreq governors". Two are already in this patch | |
50 | -- they're the already existing "powersave" and "performance" which | |
51 | set the frequency statically to the lowest or highest frequency, | |
52 | respectively. At least two more such governors will be ready for | |
53 | addition in the near future, but likely many more as there are various | |
54 | different theories and models about dynamic frequency scaling | |
55 | around. Using such a generic interface as cpufreq offers to scaling | |
56 | governors, these can be tested extensively, and the best one can be | |
57 | selected for each specific use. | |
58 | ||
59 | Basically, it's the following flow graph: | |
60 | ||
2fe0ae78 | 61 | CPU can be set to switch independently | CPU can only be set |
1da177e4 LT |
62 | within specific "limits" | to specific frequencies |
63 | ||
64 | "CPUfreq policy" | |
65 | consists of frequency limits (policy->{min,max}) | |
66 | and CPUfreq governor to be used | |
67 | / \ | |
68 | / \ | |
69 | / the cpufreq governor decides | |
70 | / (dynamically or statically) | |
71 | / what target_freq to set within | |
72 | / the limits of policy->{min,max} | |
73 | / \ | |
74 | / \ | |
75 | Using the ->setpolicy call, Using the ->target call, | |
76 | the limits and the the frequency closest | |
77 | "policy" is set. to target_freq is set. | |
78 | It is assured that it | |
79 | is within policy->{min,max} | |
80 | ||
81 | ||
82 | 2. Governors In the Linux Kernel | |
83 | ================================ | |
84 | ||
85 | 2.1 Performance | |
86 | --------------- | |
87 | ||
88 | The CPUfreq governor "performance" sets the CPU statically to the | |
89 | highest frequency within the borders of scaling_min_freq and | |
90 | scaling_max_freq. | |
91 | ||
92 | ||
594dd2c9 | 93 | 2.2 Powersave |
1da177e4 LT |
94 | ------------- |
95 | ||
96 | The CPUfreq governor "powersave" sets the CPU statically to the | |
97 | lowest frequency within the borders of scaling_min_freq and | |
98 | scaling_max_freq. | |
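The choice made by the two static governors above can be sketched in a few lines of user-space Python. This is only an illustration of the rule stated in sections 2.1 and 2.2, not kernel code; the function name static_target and the kHz unit are assumptions made for the example.

```python
def static_target(governor: str, scaling_min_freq: int, scaling_max_freq: int) -> int:
    """Return the frequency a static governor would request (hypothetical helper).

    Frequencies are assumed to be in kHz for this sketch.
    """
    if scaling_min_freq > scaling_max_freq:
        raise ValueError("scaling_min_freq must not exceed scaling_max_freq")
    if governor == "performance":
        # Highest frequency within the scaling_min_freq..scaling_max_freq borders.
        return scaling_max_freq
    if governor == "powersave":
        # Lowest frequency within the same borders.
        return scaling_min_freq
    raise ValueError(f"not a static governor: {governor}")
```

For example, with borders of 800000 and 2400000, "performance" yields 2400000 and "powersave" yields 800000.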
99 | ||
100 | ||
594dd2c9 | 101 | 2.3 Userspace |
1da177e4 LT |
102 | ------------- |
103 | ||
104 | The CPUfreq governor "userspace" allows the user, or any userspace | |
105 | program running with UID "root", to set the CPU to a specific frequency | |
106 | by making a sysfs file "scaling_setspeed" available in the CPU-device | |
107 | directory. | |
108 | ||
109 | ||
594dd2c9 NG |
110 | 2.4 Ondemand |
111 | ------------ | |
112 | ||
a2ffd275 | 113 | The CPUfreq governor "ondemand" sets the CPU frequency depending on the |
594dd2c9 | 114 | current usage. To do this the CPU must have the capability to |
537208c8 AC |
115 | switch the frequency very quickly. There are a number of sysfs file |
116 | accessible parameters: | |
117 | ||
118 | sampling_rate: measured in uS (10^-6 seconds), this is how often you | |
119 | want the kernel to look at the CPU usage and to make decisions on | |
120 | what to do about the frequency. Typically this is set to values of | |
112124ab TR |
121 | around '10000' or more. Its default value is (cmp. with users-guide.txt): |
122 | transition_latency * 1000 | |
112124ab TR |
123 | Be aware that transition latency is in ns and sampling_rate is in us, so you |
124 | get the same sysfs value by default. | |
125 | Sampling rate should always get adjusted considering the transition latency. | |
126 | To set the sampling rate 750 times as high as the transition latency | |
127 | in the bash (as said, 1000 is default), do: | |
128 | echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) \ | |
129 | >ondemand/sampling_rate | |
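The unit conversion behind this command can be sketched as follows. The sketch assumes, as stated above, that the transition latency is read in ns and that sampling_rate is written in us; the function name sampling_rate_us is made up for the illustration.

```python
def sampling_rate_us(transition_latency_ns: int, multiplier: int = 1000) -> int:
    """Sampling rate in us for a given multiple of the transition latency.

    latency_us = latency_ns / 1000, so (latency_us * multiplier) equals
    latency_ns * multiplier / 1000. With the default multiplier of 1000 the
    result is numerically identical to the latency value in ns.
    """
    return transition_latency_ns * multiplier // 1000
```

With a transition latency of 10000 ns, the default multiplier gives 10000 us, and a multiplier of 750 gives 7500 us.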
537208c8 | 130 | |
e7cbb5b5 | 131 | sampling_rate_min: |
4f4d1ad6 TR |
132 | The sampling rate is limited by the HW transition latency: |
133 | transition_latency * 100 | |
134 | Or by kernel restrictions: | |
3451d024 FW |
135 | If CONFIG_NO_HZ_COMMON is set, the limit is 10ms fixed. |
136 | If CONFIG_NO_HZ_COMMON is not set or nohz=off boot parameter is used, the | |
4f4d1ad6 TR |
137 | limits depend on the CONFIG_HZ option: |
138 | HZ=1000: min=20000us (20ms) | |
139 | HZ=250: min=80000us (80ms) | |
140 | HZ=100: min=200000us (200ms) | |
141 | The highest value of kernel and HW latency restrictions is shown and | |
142 | used as the minimum sampling rate. | |
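The rule for sampling_rate_min can be sketched as taking the larger of the two limits. This is a rough illustration only: the kernel-side values are copied from the list above, and the ns to us conversion of the hardware limit is an assumption carried over from the sampling_rate description. All names are hypothetical.

```python
# Kernel-side minimums in us, as listed above for each CONFIG_HZ value.
KERNEL_MIN_US_BY_HZ = {1000: 20000, 250: 80000, 100: 200000}
# Fixed limit in us when CONFIG_NO_HZ_COMMON is in effect.
NO_HZ_MIN_US = 10000


def min_sampling_rate_us(transition_latency_ns: int, hz: int, no_hz_active: bool) -> int:
    """Return the larger of the hardware and kernel limits, in us."""
    # Assumption: latency is in ns, so latency_us * 100 == latency_ns * 100 / 1000.
    hw_limit = transition_latency_ns * 100 // 1000
    kernel_limit = NO_HZ_MIN_US if no_hz_active else KERNEL_MIN_US_BY_HZ[hz]
    return max(hw_limit, kernel_limit)
```

For a latency of 10000 ns the hardware limit is 1000 us, so the kernel limit wins in every configuration listed above.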
143 | ||
d9195881 | 144 | up_threshold: defines what the average CPU usage between the samplings |
537208c8 AC |
145 | of 'sampling_rate' needs to be for the kernel to make a decision on |
146 | whether it should increase the frequency. For example when it is set | |
292e0041 MF |
147 | to its default value of '95' it means that between the checking |
148 | intervals the CPU needs to be on average more than 95% in use to then | |
537208c8 AC |
149 | decide that the CPU frequency needs to be increased. |
150 | ||
992caacf ML |
151 | ignore_nice_load: this parameter takes a value of '0' or '1'. When |
152 | set to '0' (its default), all processes are counted towards the | |
153 | 'cpu utilisation' value. When set to '1', the processes that are | |
537208c8 | 154 | run with a 'nice' value will not count (and thus be ignored) in the |
992caacf | 155 | overall usage calculation. This is useful if you are running a CPU |
537208c8 AC |
156 | intensive calculation on your laptop that you do not care how long it |
157 | takes to complete as you can 'nice' it and prevent it from taking part | |
158 | in the deciding process of whether to increase your CPU frequency. | |
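How up_threshold and ignore_nice_load interact can be sketched like this. It is a simplified model, not the kernel's accounting: the three time buckets and both function names are assumptions made for the example.

```python
def cpu_load_percent(busy_us: int, nice_us: int, idle_us: int,
                     ignore_nice_load: bool = False) -> int:
    """Average load over one sampling interval, in percent."""
    total = busy_us + nice_us + idle_us
    if total == 0:
        return 0
    # With ignore_nice_load set, time spent in niced processes does not
    # count as usage (it is effectively treated like idle time).
    counted = busy_us if ignore_nice_load else busy_us + nice_us
    return 100 * counted // total


def should_increase(load_percent: int, up_threshold: int = 95) -> bool:
    """The frequency is raised only when the load exceeds up_threshold."""
    return load_percent > up_threshold
```

A niced computation that fills 88% of the interval pushes the load past 95% when it is counted, but leaves the load near 10% when ignore_nice_load is 1, so no frequency increase is triggered.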
159 | ||
5b95364f VB |
160 | sampling_down_factor: this parameter controls the rate at which the |
161 | kernel makes a decision on when to decrease the frequency while running | |
162 | at top speed. When set to 1 (the default) decisions to reevaluate load | |
163 | are made at the same interval regardless of current clock speed. But | |
164 | when set to greater than 1 (e.g. 100) it acts as a multiplier for the | |
165 | scheduling interval for reevaluating load when the CPU is at its top | |
166 | speed due to high load. This improves performance by reducing the overhead | |
167 | of load evaluation and helping the CPU stay at its top speed when truly | |
168 | busy, rather than shifting back and forth in speed. This tunable has no | |
169 | effect on behavior at lower speeds/lower CPU loads. | |
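The effect of sampling_down_factor on the re-evaluation interval can be sketched as below. The function name and the boolean flag are hypothetical; the sketch only restates the multiplier rule described above.

```python
def next_sample_delay_us(sampling_rate_us: int, sampling_down_factor: int,
                         at_top_speed_under_load: bool) -> int:
    """Delay until the load is evaluated again, in us."""
    if at_top_speed_under_load and sampling_down_factor > 1:
        # The factor stretches the interval only while the CPU sits at its
        # top speed because of high load.
        return sampling_rate_us * sampling_down_factor
    # At lower speeds, or with the default factor of 1, nothing changes.
    return sampling_rate_us
```

With sampling_rate at 10000 us and sampling_down_factor at 100, a busy CPU at top speed is re-checked after 1000000 us (one second) instead of every 10000 us.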
170 | ||
9c5320c8 JS |
171 | powersave_bias: this parameter takes a value between 0 and 1000. It |
172 | defines the percentage (times 10) value of the target frequency that | |
173 | will be shaved off of the target. For example, when set to 100 -- 10%, | |
174 | when ondemand governor would have targeted 1000 MHz, it will target | |
175 | 1000 MHz - (10% of 1000 MHz) = 900 MHz instead. This is set to 0 | |
176 | (disabled) by default. | |
177 | When AMD frequency sensitivity powersave bias driver -- | |
178 | drivers/cpufreq/amd_freq_sensitivity.c is loaded, this parameter | |
179 | defines the workload frequency sensitivity threshold in which a lower | |
180 | frequency is chosen instead of ondemand governor's original target. | |
181 | The frequency sensitivity is a hardware reported (on AMD Family 16h | |
182 | Processors and above) value between 0 and 100% that tells software how | |
183 | the performance of the workload running on a CPU will change when | |
184 | frequency changes. A workload with sensitivity of 0% (memory/IO-bound) | |
185 | will not perform any better on higher core frequency, whereas a | |
186 | workload with sensitivity of 100% (CPU-bound) will perform better the | |
187 | higher the frequency. When the driver is loaded, this is set to 400 | |
188 | by default -- for CPUs running workloads with sensitivity value below | |
189 | 40%, a lower frequency is chosen. Unloading the driver or writing 0 | |
190 | will disable this feature. | |
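The basic powersave_bias arithmetic from the first paragraph of this parameter can be sketched as follows. The sketch assumes frequencies in kHz and ignores the AMD frequency sensitivity case; the function name is made up for the example.

```python
def biased_target_khz(target_khz: int, powersave_bias: int = 0) -> int:
    """Shave powersave_bias (percent times 10) off the target frequency."""
    if not 0 <= powersave_bias <= 1000:
        raise ValueError("powersave_bias must be between 0 and 1000")
    return target_khz - target_khz * powersave_bias // 1000
```

With powersave_bias set to 100 (10%), a target of 1000000 kHz (1000 MHz) becomes 900000 kHz (900 MHz), matching the example above. A value of 0 leaves the target unchanged.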
191 | ||
537208c8 AC |
192 | |
193 | 2.5 Conservative | |
194 | ---------------- | |
195 | ||
196 | The CPUfreq governor "conservative", much like the "ondemand" | |
197 | governor, sets the CPU frequency depending on the current usage. It differs in | |
198 | behaviour in that it gracefully increases and decreases the CPU speed | |
199 | rather than jumping to max speed the moment there is any load on the | |
200 | CPU. This behaviour is more suitable in a battery powered environment. | |
201 | The governor is tweaked in the same manner as the "ondemand" governor | |
202 | through sysfs with the addition of: | |
203 | ||
204 | freq_step: this describes what percentage steps the cpu freq should be | |
205 | increased and decreased smoothly by. By default the cpu frequency will | |
206 | increase in 5% chunks of your maximum cpu frequency. You can change this | |
207 | value to anywhere between 0 and 100 where '0' will effectively lock your | |
208 | CPU at a speed regardless of its load whilst '100' will, in theory, make | |
209 | it behave identically to the "ondemand" governor. | |
210 | ||
211 | down_threshold: same as the 'up_threshold' found for the "ondemand" | |
212 | governor but for the opposite direction. For example when set to its | |
213 | default value of '20' it means that the CPU usage needs to be below | |
214 | 20% between samples to have the frequency decreased. | |
1da177e4 | 215 | |
7af1c056 SK |
216 | sampling_down_factor: similar functionality as in "ondemand" governor. |
217 | But in "conservative", it controls the rate at which the kernel makes | |
218 | a decision on when to decrease the frequency while running in any | |
219 | speed. Load for frequency increase is still evaluated every | |
220 | sampling rate. | |
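The stepping behaviour controlled by freq_step and down_threshold can be sketched like this. It is a simplified model under stated assumptions: frequencies are in kHz, up_threshold must be supplied by the caller, and sampling_down_factor is left out. The function name is hypothetical.

```python
def conservative_next_freq(cur_khz: int, load_percent: int,
                           min_khz: int, max_khz: int,
                           up_threshold: int,
                           down_threshold: int = 20,
                           freq_step: int = 5) -> int:
    """Next frequency for one sample, moving in freq_step% chunks of max_khz."""
    step = max_khz * freq_step // 100
    if load_percent > up_threshold:
        # Step up, but never beyond the policy maximum.
        return min(cur_khz + step, max_khz)
    if load_percent < down_threshold:
        # Step down, but never below the policy minimum.
        return max(cur_khz - step, min_khz)
    return cur_khz
```

With a maximum of 2000000 kHz and the default freq_step of 5, each step is 100000 kHz. A freq_step of 0 gives a step of zero, which locks the frequency regardless of load, as described above.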
221 | ||
6fa3eb70 S |
222 | 2.6 Interactive |
223 | --------------- | |
224 | ||
225 | The CPUfreq governor "interactive" is designed for latency-sensitive, | |
226 | interactive workloads. This governor sets the CPU speed depending on | |
227 | usage, similar to "ondemand" and "conservative" governors, but with a | |
228 | different set of configurable behaviors. | |
229 | ||
230 | The tuneable values for this governor are: | |
231 | ||
232 | target_loads: CPU load values used to adjust speed to influence the | |
233 | current CPU load toward that value. In general, the lower the target | |
234 | load, the more often the governor will raise CPU speeds to bring load | |
235 | below the target. The format is a single target load, optionally | |
236 | followed by pairs of CPU speeds and CPU loads to target at or above | |
237 | those speeds. Colons can be used between the speeds and associated | |
238 | target loads for readability. For example: | |
239 | ||
240 | 85 1000000:90 1700000:99 | |
241 | ||
242 | targets CPU load 85% below speed 1GHz, 90% at or above 1GHz, until | |
243 | 1.7GHz and above, at which load 99% is targeted. If speeds are | |
244 | specified these must appear in ascending order. Higher target load | |
245 | values are typically specified for higher speeds, that is, target load | |
246 | values also usually appear in an ascending order. The default is | |
247 | target load 90% for all speeds. | |
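The target_loads format can be illustrated with a small parser. This is a sketch of the format as described above, not the governor's own parsing code; both function names are hypothetical and speeds are assumed to be in kHz.

```python
def parse_target_loads(spec: str) -> list:
    """Parse e.g. '85 1000000:90 1700000:99' into (min_speed, load) pairs."""
    # Colons are only there for readability, so treat them like spaces.
    values = [int(tok) for tok in spec.replace(":", " ").split()]
    if len(values) % 2 == 0:
        raise ValueError("expected one load followed by speed/load pairs")
    table = [(0, values[0])]  # the first load applies from speed 0 upward
    for i in range(1, len(values), 2):
        speed, load = values[i], values[i + 1]
        if speed <= table[-1][0]:
            raise ValueError("speeds must appear in ascending order")
        table.append((speed, load))
    return table


def target_load_for(table: list, freq_khz: int) -> int:
    """Return the target load of the highest entry at or below freq_khz."""
    load = table[0][1]
    for speed, entry_load in table:
        if freq_khz >= speed:
            load = entry_load
    return load
```

For the example string above, a speed just below 1000000 maps to 85, speeds from 1000000 up to just below 1700000 map to 90, and 1700000 and above map to 99.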
248 | ||
249 | min_sample_time: The minimum amount of time to spend at the current | |
250 | frequency before ramping down. Default is 80000 uS. | |
251 | ||
252 | hispeed_freq: An intermediate "hi speed" at which to initially ramp | |
253 | when CPU load hits the value specified in go_hispeed_load. If load | |
254 | stays high for the amount of time specified in above_hispeed_delay, | |
255 | then speed may be bumped higher. Default is the maximum speed | |
256 | allowed by the policy at governor initialization time. | |
257 | ||
258 | go_hispeed_load: The CPU load at which to ramp to hispeed_freq. | |
259 | Default is 99%. | |
260 | ||
261 | above_hispeed_delay: When speed is at or above hispeed_freq, wait for | |
262 | this long before raising speed in response to continued high load. | |
263 | The format is a single delay value, optionally followed by pairs of | |
264 | CPU speeds and the delay to use at or above those speeds. Colons can | |
265 | be used between the speeds and associated delays for readability. For | |
266 | example: | |
267 | ||
268 | 80000 1300000:200000 1500000:40000 | |
269 | ||
270 | uses delay 80000 uS until CPU speed 1.3 GHz, at which speed delay | |
271 | 200000 uS is used until speed 1.5 GHz, at which speed (and above) | |
272 | delay 40000 uS is used. If speeds are specified these must appear in | |
273 | ascending order. Default is 20000 uS. | |
274 | ||
275 | timer_rate: Sample rate for reevaluating CPU load when the CPU is not | |
276 | idle. A deferrable timer is used, such that the CPU will not be woken | |
277 | from idle to service this timer until something else needs to run. | |
278 | (The maximum time to allow deferring this timer when not running at | |
279 | minimum speed is configurable via timer_slack.) Default is 20000 uS. | |
280 | ||
281 | timer_slack: Maximum additional time to defer handling the governor | |
282 | sampling timer beyond timer_rate when running at speeds above the | |
283 | minimum. For platforms that consume additional power at idle when | |
284 | CPUs are running at speeds greater than minimum, this places an upper | |
285 | bound on how long the timer will be deferred prior to re-evaluating | |
286 | load and dropping speed. For example, if timer_rate is 20000uS and | |
287 | timer_slack is 10000uS then timers will be deferred for up to 30msec | |
288 | when not at lowest speed. A value of -1 means defer timers | |
289 | indefinitely at all speeds. Default is 80000 uS. | |
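The relationship between timer_rate and timer_slack can be sketched as below. The treatment of the minimum speed (no upper bound on deferral) is a reading of the timer_rate description above rather than something the text states outright; the function name is hypothetical.

```python
from typing import Optional


def max_timer_deferral_us(timer_rate_us: int, timer_slack_us: int,
                          at_min_speed: bool) -> Optional[int]:
    """Longest time the sampling timer may be deferred; None means no bound."""
    if at_min_speed or timer_slack_us == -1:
        # Assumption: at minimum speed, or with a slack of -1, the
        # deferrable timer may wait until something else wakes the CPU.
        return None
    return timer_rate_us + timer_slack_us
```

With timer_rate at 20000 us and timer_slack at 10000 us, the bound above minimum speed is 30000 us (30 msec), as in the example above.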
290 | ||
291 | boost: If non-zero, immediately boost speed of all CPUs to at least | |
292 | hispeed_freq until zero is written to this attribute. If zero, allow | |
293 | CPU speeds to drop below hispeed_freq according to load as usual. | |
294 | Default is zero. | |
295 | ||
296 | boostpulse: On each write, immediately boost speed of all CPUs to | |
297 | hispeed_freq for at least the period of time specified by | |
298 | boostpulse_duration, after which speeds are allowed to drop below | |
299 | hispeed_freq according to load as usual. | |
300 | ||
301 | boostpulse_duration: Length of time to hold CPU speed at hispeed_freq | |
302 | on a write to boostpulse, before allowing speed to drop according to | |
303 | load as usual. Default is 80000 uS. | |
304 | ||
305 | ||
1da177e4 LT |
306 | 3. The Governor Interface in the CPUfreq Core |
307 | ============================================= | |
308 | ||
309 | A new governor must register itself with the CPUfreq core using | |
310 | "cpufreq_register_governor". The struct cpufreq_governor, which has to | |
311 | be passed to that function, must contain the following values: | |
312 | ||
313 | governor->name - A unique name for this governor | |
314 | governor->governor - The governor callback function | |
315 | governor->owner - THIS_MODULE for the governor module (if | |
316 | appropriate) | |
317 | ||
318 | The governor->governor callback is called with the current (or to-be-set) | |
319 | cpufreq_policy struct for that CPU, and an unsigned int event. The | |
320 | following events are currently defined: | |
321 | ||
322 | CPUFREQ_GOV_START: This governor shall start its duty for the CPU | |
323 | policy->cpu | |
324 | CPUFREQ_GOV_STOP: This governor shall end its duty for the CPU | |
325 | policy->cpu | |
326 | CPUFREQ_GOV_LIMITS: The limits for CPU policy->cpu have changed to | |
327 | policy->min and policy->max. | |
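The event handling of a governor callback can be modelled in user space as follows. This is a Python model of the control flow only, patterned on a "performance"-style governor; the Event enum values, the Policy class and the direct assignment to policy.cur (standing in for a call into the processor driver) are all assumptions made for the illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Event(Enum):
    # Names follow the events listed above; the numeric values are arbitrary.
    CPUFREQ_GOV_START = auto()
    CPUFREQ_GOV_STOP = auto()
    CPUFREQ_GOV_LIMITS = auto()


@dataclass
class Policy:
    cpu: int
    min: int  # policy->min, in kHz
    max: int  # policy->max, in kHz
    cur: int  # currently requested frequency, in kHz


def performance_governor(policy: Policy, event: Event) -> int:
    """Model of a governor->governor callback that always requests policy.max."""
    if event in (Event.CPUFREQ_GOV_START, Event.CPUFREQ_GOV_LIMITS):
        # Stand-in for asking the driver to set the target frequency.
        policy.cur = policy.max
    # CPUFREQ_GOV_STOP needs no action for a static governor.
    return 0
```

On START the model jumps to policy.max, and when the limits change it re-applies the new maximum; a dynamic governor would instead start or stop its sampling work on these events.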
328 | ||
329 | If you need other "events" outside of your driver, _only_ use the | |
330 | cpufreq_governor_l(unsigned int cpu, unsigned int event) call to the | |
331 | CPUfreq core to ensure proper locking. | |
332 | ||
333 | ||
334 | The CPUfreq governor may call the CPU processor driver using one of | |
335 | these two functions: | |
336 | ||
337 | int cpufreq_driver_target(struct cpufreq_policy *policy, | |
338 | unsigned int target_freq, | |
339 | unsigned int relation); | |
340 | ||
341 | int __cpufreq_driver_target(struct cpufreq_policy *policy, | |
342 | unsigned int target_freq, | |
343 | unsigned int relation); | |
344 | ||
345 | target_freq must be within policy->min and policy->max, of course. | |
346 | What's the difference between these two functions? When your governor | |
347 | still is in a direct code path of a call to governor->governor, the | |
348 | per-CPU cpufreq lock is still held in the cpufreq core, and there's | |
349 | no need to lock it again (in fact, this would cause a deadlock). So | |
350 | use __cpufreq_driver_target only in these cases. In all other cases | |
351 | (for example, when there's a "daemonized" function that wakes up | |
352 | every second), use cpufreq_driver_target to lock the cpufreq per-CPU | |
353 | lock before the command is passed to the cpufreq processor driver. | |
354 |