Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nvme: Add support for Autonomous Power State Transition #1444

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

eseipi
Copy link

@eseipi eseipi commented Oct 1, 2024

APST is an optional NVMe power-saving feature that allows devices to automatically enter higher non-operational power states after a certain amount of idle time, reducing the controller's overall power consumption.

The feature configuration involves filling out the transition table, which then needs to be sent to the controller in a data buffer. Each table entry corresponds to one of the available power states and contains two values: idle transition power state (ITPS) and idle time prior to transition (ITPT). The first specifies the next power state the controller should switch to, and the second specifies the amount of idle time required before that switch.

Two sysctls are added: apst_itpt_factor for ITPT calculation (as an integer by which the total latency will be multiplied to get a suitable transition flow), and apst_max_latency for cutting off higher states with unwanted latency (by specifying a maximum value in microseconds).

The thing I struggled with the most was the default settings. I know that different controllers set them differently, and it's probably not a good idea to change them without the user's explicit intent. The best solution I could come up with is to check the return values of the TUNABLE_INT_FETCH calls and, depending on those, decide whether to touch the feature or not. The issue here is that if the feature is enabled by the vendor, sysctls will show zero values by default, but this would be incorrect. I'm not sure if this is a problem, but it's quite possible that a more elegant solution can be found.

@eseipi eseipi requested a review from bsdimp as a code owner October 1, 2024 10:03
@bsdimp
Copy link
Member

bsdimp commented Oct 1, 2024

Generally, I like this. I'm unsure of what we should do by default...

@eseipi
Copy link
Author

eseipi commented Oct 2, 2024

@bsdimp, are you considering the UX issue I described earlier or deciding whether to enable or disable the feature by default?

If it's the former, there's also an option to set both tunables to -1 by default and take no action unless they are changed. This would avoid inconsistency, though it would make the code slightly less straightforward due to handling signed values. I can implement this if needed.

If it's the latter, I strongly believe it's not a good idea to enable the feature by default due to the latency trade-offs and other potential flaws in the higher power states. At the same time, disabling power-saving when it's working properly isn't ideal either, and if it doesn't work properly, that’s not a FreeBSD issue. Adding support for the mechanism without enforcing it won't harm anyone and can only benefit users who are concerned about power consumption.

@wigneddoom
Copy link

This is a great feature. But for some reason, I think that we can implement it in user space - nvmecontrol(8)

@eseipi
Copy link
Author

eseipi commented Oct 3, 2024

@wigneddoom This can easily be implemented in user space (in fact, most of it has already been done, as nvmecontrol(8) is fully capable of sending Set Features commands, even with a payload). However, the main issue is that it needs to be hooked with the controller's initialization/reset process somehow. That could probably be done too (how exactly, by the way?), but I'm not sure if it would simplify things or make them more reliable.

@bsdimp bsdimp self-assigned this Oct 4, 2024
@eseipi eseipi marked this pull request as draft October 6, 2024 08:08
@eseipi
Copy link
Author

eseipi commented Oct 7, 2024

Though my approach to configuring APST closely resembles the Linux implementation, I agree that its design could be improved. In particular, I totally agree with @wigneddoom on moving as much functionality as possible out of kernel, though I'm still considering the best way to achieve this. I've marked this PR as a draft, so I can take some time to give it more thought. Please remove the needs-review label for now.

The functions nvme_ctrlr_cmd_set_feature and nvme_ctrlr_cmd_get_feature
already have payload and payload_size in their parameter lists, but
they were not used. However, this is neccessary for some features
such as APST because they send data through a data buffer.

Signed-off-by: Alexey Sukhoguzov <[email protected]>
@eseipi
Copy link
Author

eseipi commented Oct 9, 2024

It turned out that I had discarded most of the previous code. Pretty much all that's left are Get/Set Feature calls if hw.nvme.apst_enable variable is set to 1. The only additional thing is the hw.nvme.apst_data tunable, which allows for overriding the default transition data if, for some reason, it is unsuitable or missing. Sysctls have also been removed since there are other options already available to toggle the feature at runtime.

At this point, I don't think it's worth adding a command to nvmecontrol(8) to generate this data in a user-friendly way, because AFAIK this should be a rare necessity; if such data is needed, it can be easily calculated using other tools such as bc(1).

@eseipi eseipi marked this pull request as ready for review October 9, 2024 16:51
@eseipi eseipi force-pushed the main branch 2 times, most recently from 21e9258 to 45d3802 Compare October 9, 2024 22:00
APST is an optional NVMe power-saving feature that allows devices
to automatically enter higher non-operational power states after a
certain amount of idle time, reducing the controller's overall power
consumption.

Setting up the feature requires sending an APST data structure (32
64-bit entries) through the data buffer.  This payload could be
obtained from the controller itself, thus maintaining default
behaviour provided by the vendor.

Signed-off-by: Alexey Sukhoguzov <[email protected]>
APST data can be manually overridden using the apst_data tunable,
which should be an array of unsigned integers.  This method offers
more flexibility than automatic generation, but the specific values
might depend on the particular controller.

Signed-off-by: Alexey Sukhoguzov <[email protected]>
APST data can be generated automatically based on the apst_max_latency
tunable.  This method offers the same simple configuration as on
Linux but allows for setting an upper latency limit only.

Signed-off-by: Alexey Sukhoguzov <[email protected]>
@eseipi
Copy link
Author

eseipi commented Oct 11, 2024

I've further split up the commits, made some doc fixes, and slightly clarified the control flow. I also added back automatic APST data generation (i.e. apst_max_latency tunable) as a separate commit, although I’m still not sure if anyone really needs this in a driver.

In any case, everything seems to be ready for review. Apologies for the noise above, I should have kept this PR as a draft until now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants