Soft action masking

Is there such an idea as "soft action masking"? I'll apologize ahead of time to those of you who are sticklers for the raw mathematics of reinforcement learning. There is no formal math behind my idea yet.

Let me illustrate my idea with an example. Imagine an environment with the following constraints:

- One of the agent's available actions is "do nothing".

- Sending too many actions per second is a bad thing. However, a concrete limit is not known here. Maybe we have some data suggesting the maximum is somewhere around 10 actions per second: sometimes 13/second is okay, sometimes even 8/second is undesirable.

One way to prevent the agent from taking too many actions in a given time frame is to use action masking. If the maximum rate were a well-defined quantity, say 10 actions per second, then once the agent has taken 10 actions in the last second, it would be forced to "do nothing" via an action mask. Once the number of actions in the last second falls below 10, we stop applying the mask and let the agent choose freely.
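For concreteness, here is a minimal sketch of that hard-masking version. It assumes a discrete action space where index 0 is "do nothing" and a sliding one-second window of action timestamps that we track ourselves; both are my assumptions for illustration, not anything from a particular library:

```python
import numpy as np

# Hypothetical setup: action 0 is "do nothing", and we track the
# timestamps of the agent's recent actions ourselves.
NOOP = 0
HARD_LIMIT = 10  # max actions allowed in the trailing one-second window

def hard_mask(action_timestamps, now, num_actions):
    """Boolean mask over actions; False entries are forbidden."""
    recent = [t for t in action_timestamps if now - t <= 1.0]
    mask = np.ones(num_actions, dtype=bool)
    if len(recent) >= HARD_LIMIT:
        mask[:] = False    # forbid every "real" action...
        mask[NOOP] = True  # ...but always allow "do nothing"
    return mask
```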

However, now considering our fuzzy requirement, can we gradually force our agent to choose the "do nothing" action as it gets closer to the limit? I'm intentionally not describing this idea in formal mathematical terms, because I think the details depend a lot on which type of algorithm you're using. I'll instead try to describe the intuition. As mentioned in the environment constraints above, our rate limit is somewhere around 8-13 actions per second. If the agent has already taken 10 actions in the last second and is incredibly confident that it would like to take another action, maybe we should allow it. However, if it is on the fence, only slightly preferring to take another action over doing nothing, maybe we should nudge it toward doing nothing. As the number of recent actions increases, this "nudging" becomes stronger and stronger. Once we hit 13, in this example, we essentially fall back to the typical action masking approach described above and force the agent to do nothing, regardless of its preferences.
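One way to make that intuition concrete is a schedule mapping the recent action count to a "nudging" strength. The soft limit, hard limit, and linear ramp below are all assumptions on my part; any monotone schedule that reaches zero at the hard limit captures the same idea:

```python
SOFT_LIMIT = 8   # below this rate, no nudging at all
HARD_LIMIT = 13  # at or above this rate, fall back to hard masking

def soft_mask_factor(actions_in_last_second: int) -> float:
    """Multiplier in [0, 1] for the agent's non-"do nothing" preferences.

    1.0 means the policy is untouched, 0.0 means the agent is forced to
    do nothing, and values in between nudge it toward doing nothing.
    """
    if actions_in_last_second <= SOFT_LIMIT:
        return 1.0
    if actions_in_last_second >= HARD_LIMIT:
        return 0.0
    # Linearly decay from 1 to 0 between the soft and hard limits.
    return 1.0 - (actions_in_last_second - SOFT_LIMIT) / (HARD_LIMIT - SOFT_LIMIT)
```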

In policy gradient algorithms, this approach makes the most sense to me. I could imagine simply multiplying the probabilities of discouraged actions by a value in (0, 1), where traditional action masking would multiply by exactly 0. I haven't yet thought it through for value-based algorithms.
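As a rough sketch of what that could look like for a softmax policy (the NOOP index and the renormalization step are my assumptions, not a standard API), the factor from the schedule above scales every probability except "do nothing", and the result is renormalized. A factor of 1 leaves the policy untouched; a factor of 0 recovers hard masking:

```python
import numpy as np

NOOP = 0  # assumed index of the "do nothing" action

def apply_soft_mask(probs: np.ndarray, factor: float) -> np.ndarray:
    """Scale every non-NOOP probability by `factor` and renormalize."""
    nudged = probs.copy()
    nudged[np.arange(len(probs)) != NOOP] *= factor
    return nudged / nudged.sum()

# The agent slightly prefers acting, but it is already near the rate limit.
probs = np.array([0.45, 0.55])      # [do nothing, act]
print(apply_soft_mask(probs, 0.5))  # -> roughly [0.62, 0.38]
```

If the agent samples from the nudged distribution during training, the log-probabilities used in the policy gradient should presumably come from that same nudged distribution, much like standard invalid-action masking treats the masked policy as the policy being optimized.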

What do you all think? Does this seem like a useful thing? I'm encountering roughly this problem in a project of my own and brainstorming solutions. Another solution I could implement is a reward function that discourages exceeding the limit, but until the agent actually learns this aspect of the reward function, it is likely to vastly exceed the limits, and I'd need to implement some hard action masking anyway. Also, such a reward function seems tricky, since the rate-limit penalty might be orthogonal to the reward I actually want the agent to learn.