- ===============================================
- ``intel_pstate`` CPU Performance Scaling Driver
- ===============================================
- ::
- Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- General Information
- ===================
- ``intel_pstate`` is a part of the
- :doc:`CPU performance scaling subsystem <cpufreq>` in the Linux kernel
- (``CPUFreq``). It is a scaling driver for the Sandy Bridge and later
- generations of Intel processors. Note, however, that some of those processors
- may not be supported. [To understand ``intel_pstate`` it is necessary to know
- how ``CPUFreq`` works in general, so this is the time to read :doc:`cpufreq` if
- you have not done that yet.]
- For the processors supported by ``intel_pstate``, the P-state concept is broader
- than just an operating frequency or an operating performance point (see the
- `LinuxCon Europe 2015 presentation by Kristen Accardi <LCEU2015_>`_ for more
- information about that). For this reason, the representation of P-states used
- by ``intel_pstate`` internally follows the hardware specification (for details
- refer to `Intel® 64 and IA-32 Architectures Software Developer’s Manual
- Volume 3: System Programming Guide <SDM_>`_). However, the ``CPUFreq`` core
- uses frequencies for identifying operating performance points of CPUs and
- frequencies are involved in the user space interface exposed by it, so
- ``intel_pstate`` maps its internal representation of P-states to frequencies too
- (fortunately, that mapping is unambiguous). At the same time, it would not be
- practical for ``intel_pstate`` to supply the ``CPUFreq`` core with a table of
- available frequencies due to the possible size of it, so the driver does not do
- that. Some functionality of the core is limited by that.
- Since the hardware P-state selection interface used by ``intel_pstate`` is
- available at the logical CPU level, the driver always works with individual
- CPUs. Consequently, if ``intel_pstate`` is in use, every ``CPUFreq`` policy
- object corresponds to one logical CPU and ``CPUFreq`` policies are effectively
- equivalent to CPUs. In particular, this means that they become "inactive" every
- time the corresponding CPU is taken offline and need to be re-initialized when
- it goes back online.
- ``intel_pstate`` is not modular, so it cannot be unloaded, which means that the
- only way to pass early-configuration-time parameters to it is via the kernel
- command line. However, its configuration can be adjusted via ``sysfs`` to a
- great extent. In some configurations it is even possible to unregister it via
- ``sysfs`` which allows another ``CPUFreq`` scaling driver to be loaded and
- registered (see `below <status_attr_>`_).
- Operation Modes
- ===============
- ``intel_pstate`` can operate in three different modes: in the active mode with
- or without hardware-managed P-states support and in the passive mode. Which of
- them will be in effect depends on what kernel command line options are used and
- on the capabilities of the processor.
- Active Mode
- -----------
- This is the default operation mode of ``intel_pstate``. If it works in this
- mode, the ``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq``
- policies contains the string "intel_pstate".
- In this mode the driver bypasses the scaling governors layer of ``CPUFreq`` and
- provides its own scaling algorithms for P-state selection. Those algorithms
- can be applied to ``CPUFreq`` policies in the same way as generic scaling
- governors (that is, through the ``scaling_governor`` policy attribute in
- ``sysfs``). [Note that different P-state selection algorithms may be chosen for
- different policies, but that is not recommended.]
- They are not generic scaling governors, but their names are the same as the
- names of some of those governors. Moreover, confusingly enough, they generally
- do not work in the same way as the generic governors they share the names with.
- For example, the ``powersave`` P-state selection algorithm provided by
- ``intel_pstate`` is not a counterpart of the generic ``powersave`` governor
- (roughly, it corresponds to the ``schedutil`` and ``ondemand`` governors).
- There are two P-state selection algorithms provided by ``intel_pstate`` in the
- active mode: ``powersave`` and ``performance``. The way they both operate
- depends on whether or not the hardware-managed P-states (HWP) feature has been
- enabled in the processor and possibly on the processor model.
- Which of the P-state selection algorithms is used by default depends on the
- :c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option.
- Namely, if that option is set, the ``performance`` algorithm will be used by
- default, and the other one will be used by default if it is not set.
- Active Mode With HWP
- ~~~~~~~~~~~~~~~~~~~~
- If the processor supports the HWP feature, it will be enabled during the
- processor initialization and cannot be disabled after that. It is possible
- to avoid enabling it by passing the ``intel_pstate=no_hwp`` argument to the
- kernel in the command line.
- If the HWP feature has been enabled, ``intel_pstate`` relies on the processor to
- select P-states by itself, but still it can give hints to the processor's
- internal P-state selection logic. What those hints are depends on which P-state
- selection algorithm has been applied to the given policy (or to the CPU it
- corresponds to).
- Even though the P-state selection is carried out by the processor automatically,
- ``intel_pstate`` registers utilization update callbacks with the CPU scheduler
- in this mode. However, they are not used for running a P-state selection
- algorithm, but for periodic updates of the current CPU frequency information to
- be made available from the ``scaling_cur_freq`` policy attribute in ``sysfs``.
- HWP + ``performance``
- .....................
- In this configuration ``intel_pstate`` will write 0 to the processor's
- Energy-Performance Preference (EPP) knob (if supported) or its
- Energy-Performance Bias (EPB) knob (otherwise), which means that the processor's
- internal P-state selection logic is expected to focus entirely on performance.
- This will override the EPP/EPB setting coming from the ``sysfs`` interface
- (see `Energy vs Performance Hints`_ below).
- Also, in this configuration the range of P-states available to the processor's
- internal P-state selection logic is always restricted to the upper boundary
- (that is, the maximum P-state that the driver is allowed to use).
- HWP + ``powersave``
- ...................
- In this configuration ``intel_pstate`` will set the processor's
- Energy-Performance Preference (EPP) knob (if supported) or its
- Energy-Performance Bias (EPB) knob (otherwise) to whatever value it was
- previously set to via ``sysfs`` (or whatever default value it was
- set to by the platform firmware). This usually causes the processor's
- internal P-state selection logic to be less performance-focused.
- Active Mode Without HWP
- ~~~~~~~~~~~~~~~~~~~~~~~
- This is the default operation mode for processors that do not support the HWP
- feature. It also is used by default with the ``intel_pstate=no_hwp`` argument
- in the kernel command line. However, in this mode ``intel_pstate`` may refuse
- to work with the given processor if it does not recognize it. [Note that
- ``intel_pstate`` will never refuse to work with any processor with the HWP
- feature enabled.]
- In this mode ``intel_pstate`` registers utilization update callbacks with the
- CPU scheduler in order to run a P-state selection algorithm, either
- ``powersave`` or ``performance``, depending on the ``scaling_governor`` policy
- setting in ``sysfs``. The current CPU frequency information to be made
- available from the ``scaling_cur_freq`` policy attribute in ``sysfs`` is
- periodically updated by those utilization update callbacks too.
- ``performance``
- ...............
- Without HWP, this P-state selection algorithm is always the same regardless of
- the processor model and platform configuration.
- It selects the maximum P-state it is allowed to use, subject to limits set via
- ``sysfs``, every time the driver configuration for the given CPU is updated
- (e.g. via ``sysfs``).
- This is the default P-state selection algorithm if the
- :c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option
- is set.
- ``powersave``
- .............
- Without HWP, this P-state selection algorithm is similar to the algorithm
- implemented by the generic ``schedutil`` scaling governor except that the
- utilization metric used by it is based on numbers coming from feedback
- registers of the CPU. It generally selects P-states proportional to the
- current CPU utilization.
- This algorithm is run by the driver's utilization update callback for the
- given CPU when it is invoked by the CPU scheduler, but not more often than
- every 10 ms. Like in the ``performance`` case, the hardware configuration
- is not touched if the new P-state turns out to be the same as the current
- one.
- This is the default P-state selection algorithm if the
- :c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option
- is not set.
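The behavior described for ``powersave`` without HWP can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the driver's actual code: the proportional formula, the class name ``RateLimitedSelector`` and the ``hw_writes`` counter are hypothetical and chosen for illustration. It shows a target that is proportional to utilization, evaluated no more often than every 10 ms, with the "hardware" left untouched when the P-state does not change.

```python
def proportional_pstate(utilization, min_pstate, max_pstate):
    """Pick a P-state proportional to utilization (0.0..1.0), clamped to limits.

    The exact formula is an assumption made for this sketch.
    """
    target = round(utilization * max_pstate)
    return max(min_pstate, min(max_pstate, target))


class RateLimitedSelector:
    """Re-evaluate the P-state at most once per `interval_ms` milliseconds."""

    def __init__(self, min_pstate, max_pstate, interval_ms=10):
        self.min_pstate = min_pstate
        self.max_pstate = max_pstate
        self.interval_ms = interval_ms
        self.last_run_ms = None
        self.current = min_pstate
        self.hw_writes = 0  # counts simulated hardware updates

    def update(self, now_ms, utilization):
        """Simulate one utilization update callback from the scheduler."""
        too_soon = (
            self.last_run_ms is not None
            and now_ms - self.last_run_ms < self.interval_ms
        )
        if too_soon:
            return self.current
        self.last_run_ms = now_ms
        target = proportional_pstate(utilization, self.min_pstate, self.max_pstate)
        # Leave the simulated hardware alone if the P-state is unchanged.
        if target != self.current:
            self.current = target
            self.hw_writes += 1
        return self.current
```

Calling ``update()`` twice within 10 ms returns the previously selected P-state without re-running the selection, which mirrors the rate limit mentioned above.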
- Passive Mode
- ------------
- This mode is used if the ``intel_pstate=passive`` argument is passed to the
- kernel in the command line (it implies the ``intel_pstate=no_hwp`` setting too).
- Like in the active mode without HWP support, in this mode ``intel_pstate`` may
- refuse to work with the given processor if it does not recognize it.
- If the driver works in this mode, the ``scaling_driver`` policy attribute in
- ``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
- Then, the driver behaves like a regular ``CPUFreq`` scaling driver. That is,
- it is invoked by generic scaling governors when necessary to talk to the
- hardware in order to change the P-state of a CPU (in particular, the
- ``schedutil`` governor can invoke it directly from scheduler context).
- While in this mode, ``intel_pstate`` can be used with all of the (generic)
- scaling governors listed by the ``scaling_available_governors`` policy attribute
- in ``sysfs`` (and the P-state selection algorithms described above are not
- used). Then, it is responsible for the configuration of policy objects
- corresponding to CPUs and provides the ``CPUFreq`` core (and the scaling
- governors attached to the policy objects) with accurate information on the
- maximum and minimum operating frequencies supported by the hardware (including
- the so-called "turbo" frequency ranges). In other words, in the passive mode
- the entire range of available P-states is exposed by ``intel_pstate`` to the
- ``CPUFreq`` core. However, in this mode the driver does not register
- utilization update callbacks with the CPU scheduler and the ``scaling_cur_freq``
- information comes from the ``CPUFreq`` core (and is the last frequency selected
- by the current scaling governor for the given policy).
- .. _turbo:
- Turbo P-states Support
- ======================
- In the majority of cases, the entire range of P-states available to
- ``intel_pstate`` can be divided into two sub-ranges that correspond to
- different types of processor behavior, above and below a boundary that
- will be referred to as the "turbo threshold" in what follows.
- The P-states above the turbo threshold are referred to as "turbo P-states" and
- the whole sub-range of P-states they belong to is referred to as the "turbo
- range". These names are related to the Turbo Boost technology allowing a
- multicore processor to opportunistically increase the P-state of one or more
- cores if there is enough power to do that and if that is not going to cause the
- thermal envelope of the processor package to be exceeded.
- Specifically, if software sets the P-state of a CPU core within the turbo range
- (that is, above the turbo threshold), the processor is permitted to take over
- performance scaling control for that core and put it into turbo P-states of its
- choice going forward. However, that permission is interpreted differently by
- different processor generations. Namely, the Sandy Bridge generation of
- processors will never use any P-states above the last one set by software for
- the given core, even if it is within the turbo range, whereas all of the later
- processor generations will take it as a license to use any P-states from the
- turbo range, even above the one set by software. In other words, on those
- processors setting any P-state from the turbo range will enable the processor
- to put the given core into all turbo P-states up to and including the maximum
- supported one as it sees fit.
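The difference between Sandy Bridge and later generations can be summarized as a small decision function. This is a simplified model rather than a description of real hardware logic: the function name and parameters are hypothetical, and it ignores package-level effects such as other CPUs in the same package requesting higher P-states.

```python
def highest_pstate_hw_may_use(requested, turbo_threshold, max_turbo,
                              sandy_bridge=False):
    """Return the highest P-state the processor may pick for one core.

    `requested` is the P-state last set by software for that core.
    """
    if requested <= turbo_threshold:
        # Outside the turbo range the request is taken as is in this model.
        return requested
    if sandy_bridge:
        # Sandy Bridge never goes above the P-state set by software.
        return requested
    # Later generations may use the whole turbo range.
    return max_turbo
```

For example, with a turbo threshold of 24 and a maximum turbo P-state of 32, a request for 26 caps a Sandy Bridge core at 26, while a later processor may go all the way up to 32.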
- One important property of turbo P-states is that they are not sustainable. More
- precisely, there is no guarantee that any CPUs will be able to stay in any of
- those states indefinitely, because the power distribution within the processor
- package may change over time or the thermal envelope it was designed for might
- be exceeded if a turbo P-state was used for too long.
- In turn, the P-states below the turbo threshold generally are sustainable. In
- fact, if one of them is set by software, the processor is not expected to change
- it to a lower one unless in a thermal stress or a power limit violation
- situation (a higher P-state may still be used if it is set for another CPU in
- the same package at the same time, for example).
- Some processors allow multiple cores to be in turbo P-states at the same time,
- but the maximum P-state that can be set for them generally depends on the number
- of cores running concurrently. The maximum turbo P-state that can be set for 3
- cores at the same time usually is lower than the analogous maximum P-state for
- 2 cores, which in turn usually is lower than the maximum turbo P-state that can
- be set for 1 core. The one-core maximum turbo P-state is thus the maximum
- supported one overall.
- The maximum supported turbo P-state, the turbo threshold (the maximum supported
- non-turbo P-state) and the minimum supported P-state are specific to the
- processor model and can be determined by reading the processor's model-specific
- registers (MSRs). Moreover, some processors support the Configurable TDP
- (Thermal Design Power) feature and, when that feature is enabled, the turbo
- threshold effectively becomes a configurable value that can be set by the
- platform firmware.
- Unlike ``_PSS`` objects in the ACPI tables, ``intel_pstate`` always exposes
- the entire range of available P-states, including the whole turbo range, to the
- ``CPUFreq`` core and (in the passive mode) to generic scaling governors. This
- generally causes turbo P-states to be set more often when ``intel_pstate`` is
- used relative to ACPI-based CPU performance scaling (see `below <acpi-cpufreq_>`_
- for more information).
- Moreover, since ``intel_pstate`` always knows what the real turbo threshold is
- (even if the Configurable TDP feature is enabled in the processor), its
- ``no_turbo`` attribute in ``sysfs`` (described `below <no_turbo_attr_>`_) should
- work as expected in all cases (that is, if set to disable turbo P-states, it
- always should prevent ``intel_pstate`` from using them).
- Processor Support
- =================
- To handle a given processor ``intel_pstate`` requires a number of different
- pieces of information on it to be known, including:
- * The minimum supported P-state.
- * The maximum supported `non-turbo P-state <turbo_>`_.
- * Whether or not turbo P-states are supported at all.
- * The maximum supported `one-core turbo P-state <turbo_>`_ (if turbo P-states
- are supported).
- * The scaling formula to translate the driver's internal representation
- of P-states into frequencies and the other way around.
- Generally, ways to obtain that information are specific to the processor model
- or family. Although it often is possible to obtain all of it from the processor
- itself (using model-specific registers), there are cases in which hardware
- manuals need to be consulted to get to it too.
- For this reason, there is a list of supported processors in ``intel_pstate`` and
- the driver initialization will fail if the detected processor is not in that
- list, unless it supports the `HWP feature <Active Mode_>`_. [The interface to
- obtain all of the information listed above is the same for all of the processors
- supporting the HWP feature, which is why they all are supported by
- ``intel_pstate``.]
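The pieces of information listed above can be gathered in one small record, shown here as a sketch. The class name ``PStateInfo``, the linear ``khz_per_pstate`` scaling factor and the inclusive counting used for the range sizes are assumptions made for illustration; real processors may need a different scaling formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PStateInfo:
    min_pstate: int
    max_nonturbo_pstate: int  # the turbo threshold
    max_turbo_pstate: int     # equals max_nonturbo_pstate when turbo is absent
    khz_per_pstate: int       # hypothetical linear scaling factor

    @property
    def has_turbo(self):
        return self.max_turbo_pstate > self.max_nonturbo_pstate

    def to_khz(self, pstate):
        """Translate an internal P-state value into a frequency in kHz."""
        return pstate * self.khz_per_pstate

    def to_pstate(self, khz):
        """Translate a frequency in kHz back into a P-state value."""
        return khz // self.khz_per_pstate

    @property
    def num_pstates(self):
        """Size of the whole P-state range, counted inclusively."""
        return self.max_turbo_pstate - self.min_pstate + 1

    @property
    def turbo_pct(self):
        """Share of the whole range taken by the turbo range, in percent."""
        turbo_size = self.max_turbo_pstate - self.max_nonturbo_pstate
        return 100 * turbo_size / self.num_pstates
```

With the illustrative values ``PStateInfo(8, 24, 32, 100000)``, the turbo threshold maps to 2400000 kHz and the range contains 25 P-states.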
- User Space Interface in ``sysfs``
- =================================
- Global Attributes
- -----------------
- ``intel_pstate`` exposes several global attributes (files) in ``sysfs`` to
- control its functionality at the system level. They are located in the
- ``/sys/devices/system/cpu/intel_pstate/`` directory and affect all CPUs.
- Some of them are not present if the ``intel_pstate=per_cpu_perf_limits``
- argument is passed to the kernel in the command line.
- ``max_perf_pct``
- Maximum P-state the driver is allowed to set in percent of the
- maximum supported performance level (the highest supported `turbo
- P-state <turbo_>`_).
- This attribute will not be exposed if the
- ``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel
- command line.
- ``min_perf_pct``
- Minimum P-state the driver is allowed to set in percent of the
- maximum supported performance level (the highest supported `turbo
- P-state <turbo_>`_).
- This attribute will not be exposed if the
- ``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel
- command line.
- ``num_pstates``
- Number of P-states supported by the processor (between 0 and 255
- inclusive) including both turbo and non-turbo P-states (see
- `Turbo P-states Support`_).
- The value of this attribute is not affected by the ``no_turbo``
- setting described `below <no_turbo_attr_>`_.
- This attribute is read-only.
- ``turbo_pct``
- Ratio of the `turbo range <turbo_>`_ size to the size of the entire
- range of supported P-states, in percent.
- This attribute is read-only.
- .. _no_turbo_attr:
- ``no_turbo``
- If set (equal to 1), the driver is not allowed to set any turbo P-states
- (see `Turbo P-states Support`_). If unset (equal to 0, which is the
- default), turbo P-states can be set by the driver.
- [Note that ``intel_pstate`` does not support the general ``boost``
- attribute (supported by some other scaling drivers) which is replaced
- by this one.]
- This attribute does not affect the maximum supported frequency value
- supplied to the ``CPUFreq`` core and exposed via the policy interface,
- but it affects the maximum possible value of per-policy P-state limits
- (see `Interpretation of Policy Attributes`_ below for details).
- ``hwp_dynamic_boost``
- This attribute is only present if ``intel_pstate`` works in the
- `active mode with the HWP feature enabled <Active Mode With HWP_>`_ in
- the processor. If set (equal to 1), it causes the minimum P-state limit
- to be increased dynamically for a short time whenever a task previously
- waiting on I/O is selected to run on a given logical CPU (the purpose
- of this mechanism is to improve performance).
- This setting has no effect on logical CPUs whose minimum P-state limit
- is directly set to the highest non-turbo P-state or above it.
- .. _status_attr:
- ``status``
- Operation mode of the driver: "active", "passive" or "off".
- "active"
- The driver is functional and in the `active mode
- <Active Mode_>`_.
- "passive"
- The driver is functional and in the `passive mode
- <Passive Mode_>`_.
- "off"
- The driver is not functional (it is not registered as a scaling
- driver with the ``CPUFreq`` core).
- This attribute can be written to in order to change the driver's
- operation mode or to unregister it. The string written to it must be
- one of the possible values of it and, if successful, the write will
- cause the driver to switch over to the operation mode represented by
- that string - or to be unregistered in the "off" case. [Actually,
- switching over from the active mode to the passive mode or the other
- way around causes the driver to be unregistered and registered again
- with a different set of callbacks, so all of its settings (the global
- as well as the per-policy ones) are then reset to their default
- values, possibly depending on the target operation mode.]
- That only is supported in some configurations, though (for example, if
- the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
- the operation mode of the driver cannot be changed), and if it is not
- supported in the current configuration, writes to this attribute will
- fail with an appropriate error.
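A user space helper for the ``status`` attribute might look like the sketch below. The helper names are hypothetical, and the choice to report a failed write as ``False`` instead of raising is a design decision of this example. The path and the three accepted strings come from the description above. Writing to the real attribute normally requires root privileges.

```python
from pathlib import Path

STATUS_PATH = Path("/sys/devices/system/cpu/intel_pstate/status")
VALID_MODES = ("active", "passive", "off")


def read_status(path=STATUS_PATH):
    """Return the current operation mode string."""
    return Path(path).read_text().strip()


def write_status(mode, path=STATUS_PATH):
    """Request a mode change; return True on success, False if the write fails.

    A failed write is expected when the current configuration does not
    support switching (for example, with HWP enabled).
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    try:
        Path(path).write_text(mode)
    except OSError:
        return False
    return True
```

Because a successful switch resets the global and per-policy settings to their defaults, a caller that tuned limits beforehand would need to reapply them afterwards.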
- Interpretation of Policy Attributes
- -----------------------------------
- The interpretation of some ``CPUFreq`` policy attributes described in
- :doc:`cpufreq` is special with ``intel_pstate`` as the current scaling driver
- and it generally depends on the driver's `operation mode <Operation Modes_>`_.
- First of all, the values of the ``cpuinfo_max_freq``, ``cpuinfo_min_freq`` and
- ``scaling_cur_freq`` attributes are produced by applying a processor-specific
- multiplier to the internal P-state representation used by ``intel_pstate``.
- Also, the values of the ``scaling_max_freq`` and ``scaling_min_freq``
- attributes are capped by the frequency corresponding to the maximum P-state that
- the driver is allowed to set.
- If the ``no_turbo`` `global attribute <no_turbo_attr_>`_ is set, the driver is
- not allowed to use turbo P-states, so the maximum value of ``scaling_max_freq``
- and ``scaling_min_freq`` is limited to the maximum non-turbo P-state frequency.
- Accordingly, setting ``no_turbo`` causes ``scaling_max_freq`` and
- ``scaling_min_freq`` to go down to that value if they were above it before.
- However, the old values of ``scaling_max_freq`` and ``scaling_min_freq`` will be
- restored after unsetting ``no_turbo``, unless these attributes have been written
- to after ``no_turbo`` was set.
- If ``no_turbo`` is not set, the maximum possible value of ``scaling_max_freq``
- and ``scaling_min_freq`` corresponds to the maximum supported turbo P-state,
- which also is the value of ``cpuinfo_max_freq`` in either case.
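The effect of ``no_turbo`` on the policy limits can be modeled by keeping the value requested by user space and applying the cap only when the value is reported. This is a sketch with hypothetical function names, not the driver's implementation, but it reproduces the behavior described above: the old value reappears once ``no_turbo`` is unset, unless a new value was written in the meantime.

```python
def limit_cap_khz(no_turbo, max_nonturbo_khz, max_turbo_khz):
    """Highest value scaling_max_freq and scaling_min_freq can take."""
    return max_nonturbo_khz if no_turbo else max_turbo_khz


def visible_limit_khz(requested_khz, no_turbo, max_nonturbo_khz, max_turbo_khz):
    """Value reported for a limit that user space set to `requested_khz`."""
    return min(requested_khz, limit_cap_khz(no_turbo, max_nonturbo_khz,
                                            max_turbo_khz))
```

For instance, a request of 3200000 kHz is reported as 2400000 kHz while ``no_turbo`` is set on a system whose maximum non-turbo frequency is 2400000 kHz, and as 3200000 kHz again after ``no_turbo`` is cleared.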
- Next, the following policy attributes have special meaning if
- ``intel_pstate`` works in the `active mode <Active Mode_>`_:
- ``scaling_available_governors``
- List of P-state selection algorithms provided by ``intel_pstate``.
- ``scaling_governor``
- P-state selection algorithm provided by ``intel_pstate`` currently in
- use with the given policy.
- ``scaling_cur_freq``
- Frequency of the average P-state of the CPU represented by the given
- policy for the time interval between the last two invocations of the
- driver's utilization update callback by the CPU scheduler for that CPU.
- The meaning of these attributes in the `passive mode <Passive Mode_>`_ is the
- same as for other scaling drivers.
- Additionally, the value of the ``scaling_driver`` attribute for ``intel_pstate``
- depends on the operation mode of the driver. Namely, it is either
- "intel_pstate" (in the `active mode <Active Mode_>`_) or "intel_cpufreq" (in the
- `passive mode <Passive Mode_>`_).
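Since the ``scaling_driver`` string identifies the operation mode, a script can detect the mode without touching the global attributes. The sketch below assumes the usual per-CPU ``cpufreq`` directory layout in ``sysfs``. The function names are hypothetical.

```python
from pathlib import Path

DRIVER_TO_MODE = {"intel_pstate": "active", "intel_cpufreq": "passive"}


def mode_from_scaling_driver(name):
    """Map a scaling_driver string to a mode, or None for other drivers."""
    return DRIVER_TO_MODE.get(name.strip())


def current_mode(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Read scaling_driver for one CPU; None if it cannot be read."""
    path = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq" / "scaling_driver"
    try:
        return mode_from_scaling_driver(path.read_text())
    except OSError:
        return None
```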
- Coordination of P-State Limits
- ------------------------------
- ``intel_pstate`` allows P-state limits to be set in two ways: with the help of
- the ``max_perf_pct`` and ``min_perf_pct`` `global attributes
- <Global Attributes_>`_ or via the ``scaling_max_freq`` and ``scaling_min_freq``
- ``CPUFreq`` policy attributes. The coordination between those limits is based
- on the following rules, regardless of the current operation mode of the driver:
- 1. All CPUs are affected by the global limits (that is, none of them can be
- requested to run faster than the global maximum and none of them can be
- requested to run slower than the global minimum).
- 2. Each individual CPU is affected by its own per-policy limits (that is, it
- cannot be requested to run faster than its own per-policy maximum and it
- cannot be requested to run slower than its own per-policy minimum).
- 3. The global and per-policy limits can be set independently.
- If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the
- resulting effective values are written into its registers whenever the limits
- change in order to request its internal P-state selection logic to always set
- P-states within these limits. Otherwise, the limits are taken into account by
- scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
- every time before setting a new P-state for a CPU.
- Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
- is passed to the kernel, ``max_perf_pct`` and ``min_perf_pct`` are not exposed
- at all and the only way to set the limits is by using the policy attributes.
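One way to picture the coordination rules is to intersect the two ranges: a CPU may not be asked to run above either maximum or below either minimum. The sketch below is an interpretation of those rules, not the driver's code. The function name, the integer rounding of the percentages and the handling of conflicting limits (the maximum wins) are assumptions.

```python
def effective_limits(global_min_pct, global_max_pct,
                     policy_min_khz, policy_max_khz, max_turbo_khz):
    """Combine global (percent) and per-policy (kHz) limits for one CPU."""
    # Global limits are expressed relative to the highest turbo P-state.
    global_min_khz = max_turbo_khz * global_min_pct // 100
    global_max_khz = max_turbo_khz * global_max_pct // 100
    high = min(global_max_khz, policy_max_khz)
    low = max(global_min_khz, policy_min_khz)
    # Assumption: if the limits conflict, the maximum takes precedence.
    low = min(low, high)
    return low, high
```

With a maximum turbo frequency of 3200000 kHz, global limits of 25 and 75 percent and per-policy limits of 1000000 and 3200000 kHz, the resulting range is 1000000 to 2400000 kHz.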
- Energy vs Performance Hints
- ---------------------------
- If ``intel_pstate`` works in the `active mode with the HWP feature enabled
- <Active Mode With HWP_>`_ in the processor, additional attributes are present
- in every ``CPUFreq`` policy directory in ``sysfs``. They are intended to allow
- user space to help ``intel_pstate`` to adjust the processor's internal P-state
- selection logic by focusing it on performance or on energy-efficiency, or
- somewhere between the two extremes:
- ``energy_performance_preference``
- Current value of the energy vs performance hint for the given policy
- (or the CPU represented by it).
- The hint can be changed by writing to this attribute.
- ``energy_performance_available_preferences``
- List of strings that can be written to the
- ``energy_performance_preference`` attribute.
- They represent different energy vs performance hints and should be
- self-explanatory, except that ``default`` represents whatever hint
- value was set by the platform firmware.
- Strings written to the ``energy_performance_preference`` attribute are
- internally translated to integer values written to the processor's
- Energy-Performance Preference (EPP) knob (if supported) or its
- Energy-Performance Bias (EPB) knob.
- [Note that tasks may be migrated from one CPU to another by the scheduler's
- load-balancing algorithm and if different energy vs performance hints are
- set for those CPUs, that may lead to undesirable outcomes. To avoid such
- issues it is better to set the same energy vs performance hint for all CPUs
- or to pin every task potentially sensitive to them to a specific CPU.]
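Following the advice above, the same hint can be applied to every policy in one pass. This sketch assumes the policy directories are named ``policy*`` under ``/sys/devices/system/cpu/cpufreq``. The function name is hypothetical. It validates the hint against every policy before writing anything, so a rejected hint leaves all policies unchanged.

```python
from pathlib import Path


def set_epp_everywhere(hint, cpufreq_root="/sys/devices/system/cpu/cpufreq"):
    """Write `hint` to energy_performance_preference of every policy.

    Returns the names of the policies that were updated. Policies without
    the two attributes (for example, when HWP is not in use) are skipped.
    """
    targets = []
    for policy in sorted(Path(cpufreq_root).glob("policy*")):
        pref = policy / "energy_performance_preference"
        avail = policy / "energy_performance_available_preferences"
        if not (pref.is_file() and avail.is_file()):
            continue
        if hint not in avail.read_text().split():
            raise ValueError(f"{hint!r} is not accepted by {policy.name}")
        targets.append(pref)
    # Only write once every policy is known to accept the hint.
    for pref in targets:
        pref.write_text(hint)
    return [pref.parent.name for pref in targets]
```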
- .. _acpi-cpufreq:
- ``intel_pstate`` vs ``acpi-cpufreq``
- ====================================
- On the majority of systems supported by ``intel_pstate``, the ACPI tables
- provided by the platform firmware contain ``_PSS`` objects returning information
- that can be used for CPU performance scaling (refer to the `ACPI specification`_
- for details on the ``_PSS`` objects and the format of the information returned
- by them).
- The information returned by the ACPI ``_PSS`` objects is used by the
- ``acpi-cpufreq`` scaling driver. On systems supported by ``intel_pstate``
- the ``acpi-cpufreq`` driver uses the same hardware CPU performance scaling
- interface, but the set of P-states it can use is limited by the ``_PSS``
- output.
- On those systems each ``_PSS`` object returns a list of P-states supported by
- the corresponding CPU which basically is a subset of the P-states range that can
- be used by ``intel_pstate`` on the same system, with one exception: the whole
- `turbo range <turbo_>`_ is represented by one item in it (the topmost one). By
- convention, the frequency returned by ``_PSS`` for that item is greater by 1 MHz
- than the frequency of the highest non-turbo P-state listed by it, but the
- corresponding P-state representation (following the hardware specification)
- returned for it matches the maximum supported turbo P-state (or is the
- special value 255 meaning essentially "go as high as you can get").
- The list of P-states returned by ``_PSS`` is reflected by the table of
- available frequencies supplied by ``acpi-cpufreq`` to the ``CPUFreq`` core and
- scaling governors and the minimum and maximum supported frequencies reported by
- it come from that list as well. In particular, given the special representation
- of the turbo range described above, this means that the maximum supported
- frequency reported by ``acpi-cpufreq`` is higher by 1 MHz than the frequency
- of the highest supported non-turbo P-state listed by ``_PSS`` which, of course,
- affects decisions made by the scaling governors, except for ``powersave`` and
- ``performance``.
- For example, if a given governor attempts to select a frequency proportional to
- estimated CPU load and maps the load of 100% to the maximum supported frequency
- (possibly multiplied by a constant), then it will tend to choose P-states below
- the turbo threshold if ``acpi-cpufreq`` is used as the scaling driver, because
- in that case the turbo range corresponds to a small fraction of the frequency
- band it can use (1 MHz vs 1 GHz or more). In consequence, it will only go to
- the turbo range for the highest loads and the other loads above 50% that might
- benefit from running at turbo frequencies will be given non-turbo P-states
- instead.
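The arithmetic behind this example is easy to check. The sketch below assumes a governor that requests ``load * reported_max`` and illustrative frequencies of 2400 MHz (highest non-turbo) and 3200 MHz (highest turbo). Both the numbers and the function name are hypothetical.

```python
def min_load_for_turbo(max_nonturbo_mhz, reported_max_mhz):
    """Load fraction above which a proportional target enters the turbo range."""
    return max_nonturbo_mhz / reported_max_mhz


# acpi-cpufreq reports the turbo range as a single entry 1 MHz above 2400 MHz.
acpi_threshold = min_load_for_turbo(2400, 2401)    # about 0.9996

# intel_pstate reports the real maximum turbo frequency.
pstate_threshold = min_load_for_turbo(2400, 3200)  # 0.75
```

Under these assumptions the governor reaches the turbo range only at essentially full load with ``acpi-cpufreq``, but already above 75% load with ``intel_pstate``.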
- One more issue related to that may appear on systems supporting the
- `Configurable TDP feature <turbo_>`_ allowing the platform firmware to set the
- turbo threshold. Namely, if that is not coordinated with the lists of P-states
- returned by ``_PSS`` properly, there may be more than one item corresponding to
- a turbo P-state in those lists and there may be a problem with avoiding the
- turbo range (if desirable or necessary). Usually, to avoid using turbo
- P-states overall, ``acpi-cpufreq`` simply avoids using the topmost state listed
- by ``_PSS``, but that is not sufficient when there are other turbo P-states in
- the list returned by it.
- Apart from the above, ``acpi-cpufreq`` works like ``intel_pstate`` in the
- `passive mode <Passive Mode_>`_, except that the number of P-states it can set
- is limited to the ones listed by the ACPI ``_PSS`` objects.
- Kernel Command Line Options for ``intel_pstate``
- ================================================
- Several kernel command line options can be used to pass early-configuration-time
- parameters to ``intel_pstate`` in order to enforce specific behavior. All of
- them must be prefixed with ``intel_pstate=``.
- ``disable``
- Do not register ``intel_pstate`` as the scaling driver even if the
- processor is supported by it.
- ``passive``
- Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
- start with.
- This option implies the ``no_hwp`` one described below.
- ``force``
- Register ``intel_pstate`` as the scaling driver instead of
- ``acpi-cpufreq`` even if the latter is preferred on the given system.
- This may prevent some platform features (such as thermal controls and
- power capping) that rely on the availability of ACPI P-states
- information from functioning as expected, so it should be used with
- caution.
- This option does not work with processors that are not supported by
- ``intel_pstate`` or on platforms where the ``pcc-cpufreq`` scaling
- driver is used instead of ``acpi-cpufreq``.
- ``no_hwp``
- Do not enable the `hardware-managed P-states (HWP) feature
- <Active Mode With HWP_>`_ even if it is supported by the processor.
- ``hwp_only``
- Register ``intel_pstate`` as the scaling driver only if the
- `hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
- supported by the processor.
- ``support_acpi_ppc``
- Take ACPI ``_PPC`` performance limits into account.
- If the preferred power management profile in the FADT (Fixed ACPI
- Description Table) is set to "Enterprise Server" or "Performance
- Server", the ACPI ``_PPC`` limits are taken into account by default
- and this option has no effect.
- ``per_cpu_perf_limits``
- Use per-logical-CPU P-state limits (see `Coordination of P-state
- Limits`_ for details).
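To check which of the options listed above were passed at boot, the kernel command line can be scanned for ``intel_pstate=`` tokens. The sketch below is illustrative only: the helper names and the sample command line are hypothetical, and it assumes each option appears as its own ``intel_pstate=<option>`` token.

```python
# Options documented in this section.
KNOWN_OPTIONS = {
    "disable",
    "passive",
    "force",
    "no_hwp",
    "hwp_only",
    "support_acpi_ppc",
    "per_cpu_perf_limits",
}


def intel_pstate_options(cmdline):
    """Return the values of all intel_pstate=<value> tokens, in order."""
    found = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "intel_pstate" and sep:
            found.append(value)
    return found


def unknown_options(options):
    """Return the values that are not among the documented options."""
    return [opt for opt in options if opt not in KNOWN_OPTIONS]


# Hypothetical command line used for illustration.
SAMPLE_CMDLINE = (
    "BOOT_IMAGE=/vmlinuz root=/dev/sda1 "
    "intel_pstate=passive intel_pstate=support_acpi_ppc quiet"
)

if __name__ == "__main__":
    opts = intel_pstate_options(SAMPLE_CMDLINE)
    print("options:", opts)
    print("not documented here:", unknown_options(opts))
```

On a running system the same helpers can be applied to the contents of ``/proc/cmdline`` instead of the sample string.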
- Diagnostics and Tuning
- ======================
- Trace Events
- ------------
- There are two static trace events that can be used for ``intel_pstate``
- diagnostics. One of them is the ``cpu_frequency`` trace event generally used
- by ``CPUFreq``, and the other one is the ``pstate_sample`` trace event specific
- to ``intel_pstate``. Both of them are triggered by ``intel_pstate`` only if
- it works in the `active mode <Active Mode_>`_.
- The following sequence of shell commands can be used to enable them and see
- their output (if the kernel is generally configured to support event tracing)::
- # cd /sys/kernel/debug/tracing/
- # echo 1 > events/power/pstate_sample/enable
- # echo 1 > events/power/cpu_frequency/enable
- # cat trace
- gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 freq=2474476
- cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2
- If ``intel_pstate`` works in the `passive mode <Passive Mode_>`_, the
- ``cpu_frequency`` trace event will be triggered either by the ``schedutil``
- scaling governor (for the policies it is attached to), or by the ``CPUFreq``
- core (for the policies with other scaling governors).
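For post-processing, trace lines such as the ones shown above can be split into their ``key=value`` fields. The following is a minimal sketch, assuming the line layout of the sample output in this section; the regular expression and the ``parse_trace_line`` helper are hypothetical and may need adjusting for other kernel versions.

```python
import re

# Sample lines taken from the output shown in this section.
PSTATE_SAMPLE = (
    "gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: "
    "core_busy=107 scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 "
    "tsc=29838618 freq=2474476"
)
CPU_FREQUENCY = "cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2"

# "[cpu] flags timestamp: event: key=value ..."
LINE_RE = re.compile(
    r"\[(?P<cpu>\d+)\]\s+\S+\s+(?P<ts>\d+\.\d+):\s+(?P<event>\w+):\s+(?P<fields>.*)$"
)


def parse_trace_line(line):
    """Parse one trace event line; return None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    fields = {}
    for pair in match.group("fields").split():
        key, _, value = pair.partition("=")
        fields[key] = int(value)
    return {
        "cpu": int(match.group("cpu")),
        "timestamp": float(match.group("ts")),
        "event": match.group("event"),
        "fields": fields,
    }


if __name__ == "__main__":
    sample = parse_trace_line(PSTATE_SAMPLE)
    ratio = sample["fields"]["aperf"] / sample["fields"]["mperf"]
    print(sample["event"], "on CPU", sample["cpu"], f"aperf/mperf={ratio:.3f}")
```

In this particular sample, ``aperf / mperf`` is roughly 1.076, which agrees with ``core_busy=107``; treat that as an observation about the sample rather than a definition of the field.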
- ``ftrace``
- ----------
- The ``ftrace`` interface can be used for low-level diagnostics of
- ``intel_pstate``. For example, to check how often the function to set a
- P-state is called, the ``ftrace`` filter can be set to
- :c:func:`intel_pstate_set_pstate`::
- # cd /sys/kernel/debug/tracing/
- # cat available_filter_functions | grep -i pstate
- intel_pstate_set_pstate
- intel_pstate_cpu_init
- ...
- # echo intel_pstate_set_pstate > set_ftrace_filter
- # echo function > current_tracer
- # cat trace | head -15
- # tracer: function
- #
- # entries-in-buffer/entries-written: 80/80 #P:4
- #
- # _-----=> irqs-off
- # / _----=> need-resched
- # | / _---=> hardirq/softirq
- # || / _--=> preempt-depth
- # ||| / delay
- # TASK-PID CPU# |||| TIMESTAMP FUNCTION
- # | | | |||| | |
- Xorg-3129 [000] ..s. 2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func
- gnome-terminal--4510 [002] ..s. 2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func
- gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
- <idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
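To answer the "how often" question from the function tracer output above, the calls can be counted per CPU. This is a rough sketch that assumes the line layout shown in the sample output; the ``calls_per_cpu`` helper and its regular expression are hypothetical.

```python
import re
from collections import Counter

# Function tracer lines copied from the sample output in this section.
TRACE = """\
# tracer: function
Xorg-3129 [000] ..s. 2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func
gnome-terminal--4510 [002] ..s. 2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func
gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
<idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
"""

# "[cpu] flags timestamp: function <-caller"
CALL_RE = re.compile(
    r"\[(?P<cpu>\d+)\]\s+\S+\s+(?P<ts>\d+\.\d+):\s+(?P<func>\w+)\s+<-"
)


def calls_per_cpu(trace_text, func="intel_pstate_set_pstate"):
    """Count traced calls of `func` per CPU number."""
    counts = Counter()
    for line in trace_text.splitlines():
        if line.startswith("#"):
            continue  # skip the tracer header
        match = CALL_RE.search(line)
        if match and match.group("func") == func:
            counts[int(match.group("cpu"))] += 1
    return counts


if __name__ == "__main__":
    for cpu, count in sorted(calls_per_cpu(TRACE).items()):
        print(f"CPU {cpu}: {count} call(s)")
```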
- .. _LCEU2015: http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
- .. _SDM: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
- .. _ACPI specification: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
|