Thursday, July 7, 2022

Measuring Goodhart’s Law


Goodhart’s law famously says: “When a measure becomes a target, it ceases to be a good measure.” Although originally from economics, it’s something we have to grapple with at OpenAI when figuring out how to optimize objectives that are difficult or costly to measure. It’s often necessary to introduce some proxy objective that’s easier or cheaper to measure, but when we do this, we need to be careful not to optimize it too much.

For example, as part of our work to align models like GPT-3 with human intent and values, we would like to optimize things like “How helpful is this response?”, or “How factually accurate is this claim?”. These are complex objectives that require humans to carefully check things over. For this reason, we train a model to predict these human preferences, known as a reward model, and use the reward model’s predictions as a proxy objective. But it’s important to keep track of how well the true objective is being optimized.

In this post we’ll look at some of the mathematics behind how we do this. We’ll focus on a setting that is particularly clean to analyze, in which we have access to the true objective. In practice, even human preferences can fail to measure what we really care about, but we’re setting that issue aside in this post.

Best-of-$n$ sampling

There are many ways in which one could optimize the proxy objective, but perhaps the simplest is best-of-$n$ sampling, also known as rejection sampling or reranking. We simply sample $n$ times and take the sample that scores the highest according to the proxy objective.
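
As a minimal sketch of the procedure just described, the Python below draws $n$ candidates and keeps the one with the highest proxy score. The names `best_of_n`, `sample`, and `proxy` are illustrative assumptions, not anything taken from WebGPT; in practice `sample` would query a language model and `proxy` would query a reward model.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(sample: Callable[[], T], proxy: Callable[[T], float], n: int) -> T:
    """Draw n samples and return the one with the highest proxy score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample() for _ in range(n)]
    # max() returns the first maximal candidate if several tie on proxy score.
    return max(candidates, key=proxy)


if __name__ == "__main__":
    # Toy stand-ins (assumptions): samples are standard normal draws and the
    # proxy score is the sample itself, so a larger n pushes the pick upward.
    rng = random.Random(0)
    for n in (1, 4, 64):
        picks = [best_of_n(lambda: rng.gauss(0.0, 1.0), lambda x: x, n) for _ in range(1000)]
        print(n, sum(picks) / len(picks))
```

Note that the cost is $n$ calls to the sampler per returned sample, which is the extra inference-time compute this method trades for its simplicity.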

Although this method is very simple, it can actually be competitive with more advanced techniques such as reinforcement learning, albeit at the cost of more inference-time compute. For example, in WebGPT, our best-of-$64$ model outperformed our reinforcement learning model, perhaps in part because the best-of-$64$ model got to browse many more websites. Even applying best-of-$4$ provided a significant boost to human preferences.

In addition, best-of-$n$ sampling has reliable performance and is straightforward to analyze mathematically, making it well-suited to empirical studies of Goodhart’s law and related phenomena.

The mathematics of best-of-$n$ sampling

Let’s study best-of-$n$ sampling more formally. Suppose we have some sample space $S$ (such as the set of possible question-answer pairs), some probability distribution $P$ over $S$, a true objective (or “reward”) $R_{\text{true}}:S\to\mathbb R$, and a proxy objective $R_{\text{proxy}}:S\to\mathbb R$. Let’s say that we somehow optimize $R_{\text{proxy}}$ and thereby obtain some new distribution $P^\prime$. Then:

  • The expectation $\mathbb E_{x^\prime\sim P^\prime}\left[R_{\text{true}}\left(x^\prime\right)\right]$ measures how well we have optimized the true objective.
  • The KL divergence $D_{\text{KL}}\left(P^\prime\parallel P\right)$ measures how much optimization we have done. For example, if $P^\prime$ is obtained by taking the first sample from $P$ that lies in some subset $S^\prime\subseteq S$, then this KL divergence is just the negative log probability that a sample from $P$ lies in $S^\prime$.
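
The example in the second bullet can be checked directly. As a sketch, write $P(S^\prime)$ for the probability that a sample from $P$ lies in $S^\prime$, and treat $S$ as discrete (an assumption made here only to keep the notation light). Taking the first sample that lands in $S^\prime$ is the same as conditioning $P$ on $S^\prime$, so:

$$
P^\prime(x)=\frac{P(x)}{P(S^\prime)}\quad\text{for }x\in S^\prime,\qquad P^\prime(x)=0\quad\text{otherwise},
$$

$$
D_{\text{KL}}\left(P^\prime\parallel P\right)=\sum_{x\in S^\prime}P^\prime(x)\log\frac{P^\prime(x)}{P(x)}=\sum_{x\in S^\prime}P^\prime(x)\log\frac{1}{P(S^\prime)}=-\log P(S^\prime).
$$

For instance, keeping only samples from a subset that holds half of the probability mass costs $\log 2\approx 0.69$ nats.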

It turns out that in the case of best-of-$n$ sampling, both of these quantities can be estimated efficiently using samples from $P$.

Let’s look at the expectation first. The naive approach is to use a Monte Carlo estimator: run best-of-$n$ sampling many times, measure the true objective on those samples, and average the results. However, there is a better estimator. If we have $N\geq n$ samples from $P$ overall, then we can simultaneously consider every possible subset of these samples of size $n$, weight each sample by the number of subsets for which it is the best according to the proxy objective, and then take the weighted average true objective score. This weight is just the binomial coefficient $\binom{k-1}{n-1}$, where $k$ is the rank of the sample under the proxy objective, from $1$ (worst) up to $N$ (best). As well as using samples more efficiently, this also allows us to reuse samples for different values of $n$.
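
The weighted estimator can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for WebGPT: the function name `best_of_n_expectation` is hypothetical, the inputs are plain lists of scores for the same $N$ samples, and ties in the proxy score are assumed not to occur (if they do, this sketch breaks them by input order). The weights are normalized by $\binom{N}{n}$, the total number of size-$n$ subsets.

```python
from math import comb
from typing import Sequence


def best_of_n_expectation(
    proxy_scores: Sequence[float], true_scores: Sequence[float], n: int
) -> float:
    """Estimate E[R_true] under best-of-n from N >= n samples of the base distribution."""
    N = len(proxy_scores)
    if len(true_scores) != N:
        raise ValueError("proxy_scores and true_scores must have the same length")
    if not 1 <= n <= N:
        raise ValueError("n must satisfy 1 <= n <= N")

    # Sort sample indices by proxy score, worst first, so position k-1 has rank k.
    order = sorted(range(N), key=lambda i: proxy_scores[i])
    total_subsets = comb(N, n)

    estimate = 0.0
    for k, i in enumerate(order, start=1):
        # The rank-k sample is the best of a size-n subset exactly when the other
        # n-1 members come from the k-1 lower-ranked samples.
        weight = comb(k - 1, n - 1) / total_subsets
        estimate += weight * true_scores[i]
    return estimate


if __name__ == "__main__":
    proxy = [0.1, 0.5, 0.3]
    true = [1.0, 3.0, 2.0]
    # The same three samples are reused for every value of n.
    for n in (1, 2, 3):
        print(n, best_of_n_expectation(proxy, true, n))
```

With $n=1$ every weight is $1/N$, so the estimate reduces to the plain average of the true scores, which is a quick sanity check on the weights.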

As for the KL divergence, surprisingly, this turns out to have an exact formula that works for any continuous probability distribution $P$ (i.e., as long as $P$ has no point masses). One might naively guess that the answer is $\log n$, since best-of-$n$ is doing something like taking the top $\frac 1n$ of the distribution, and this is roughly correct: the exact answer is $\log n-\frac{n-1}n$.
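
One way to see where this formula comes from is the following sketch. It assumes the proxy scores have a continuous distribution under $P$, so that ties have probability zero; the symbols $F$ and $u$ are introduced here for illustration. Let $F$ be the cumulative distribution function of $R_{\text{proxy}}(x)$ for $x\sim P$, and let $u=F\left(R_{\text{proxy}}(x)\right)$, which is uniform on $[0,1]$ under $P$. The best of $n$ samples is the one with the largest $u$, and the maximum of $n$ independent uniform variables has density $n\,u^{n-1}$, so:

$$
\frac{dP^\prime}{dP}(x)=n\,u^{n-1},
$$

$$
D_{\text{KL}}\left(P^\prime\parallel P\right)=\mathbb E_{x^\prime\sim P^\prime}\left[\log\left(n\,u^{n-1}\right)\right]=\log n+(n-1)\int_0^1 n\,u^{n-1}\log u\,du=\log n-\frac{n-1}n,
$$

where the last step uses $\int_0^1 u^{n-1}\log u\,du=-\frac{1}{n^2}$.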

Together, these estimators allow us to easily analyze how the true objective varies with the amount of optimization applied to the proxy objective.

Here’s a real-life example from WebGPT:

Best-of-$n$ performance for WebGPT 175B

Best-of-$n$ performance for WebGPT, with shaded regions representing $\pm 1$ standard error, and the KL axis following a square root scale. Here, the original distribution ($P$) is given by the 175B model trained using behavior cloning, the proxy objective used to compute best-of-$n$ ($R_{\text{proxy}}$) is given by the training reward model, and we consider three putatively “true” objectives ($R_{\text{true}}$): the training reward model itself, a validation reward model trained on held-out data, and actual human preferences. There isn’t much over-optimization of the proxy objective, but we would expect there to be at higher KLs.

Going beyond best-of-$n$ sampling

The main limitation of best-of-$n$ sampling is that the KL divergence grows logarithmically with $n$, so it is only suitable for applying a small amount of optimization.

To apply more optimization, we typically use reinforcement learning. In the settings we’ve studied so far, such as summarization, we’ve typically been able to reach a KL of around 10 nats using reinforcement learning before the true objective starts to decrease due to Goodhart’s law. We’d have to take $n$ to be around 60,000 to reach this KL using best-of-$n$, and we hope to be able to reach much larger KLs than this with improvements to our reward modeling and reinforcement learning practices.
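
The figure of around 60,000 follows from the exact KL formula $\log n-\frac{n-1}n$ given earlier: for large $n$ it is approximately $\log n-1$, so a KL of 10 nats needs $n\approx e^{11}\approx 59{,}874$. The short Python sketch below checks this numerically; the function names `best_of_n_kl` and `smallest_n_for_kl` are illustrative assumptions.

```python
import math


def best_of_n_kl(n: int) -> float:
    """Exact KL divergence (in nats) of best-of-n from the base distribution."""
    return math.log(n) - (n - 1) / n


def smallest_n_for_kl(target_nats: float) -> int:
    """Smallest n whose best-of-n KL is at least target_nats."""
    # The KL is non-decreasing in n, so double until the target is reached,
    # then binary search between the last two powers of two.
    hi = 1
    while best_of_n_kl(hi) < target_nats:
        hi *= 2
    if hi == 1:
        return 1
    lo = hi // 2  # best_of_n_kl(lo) < target_nats <= best_of_n_kl(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if best_of_n_kl(mid) >= target_nats:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    for n in (4, 64):
        print(n, best_of_n_kl(n))
    print(smallest_n_for_kl(10.0))
```

Each additional nat multiplies the required $n$ by roughly $e$, which is why best-of-$n$ becomes impractical well before the KLs that reinforcement learning can reach.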

However, not all nats are equal. Empirically, for small KL budgets, best-of-$n$ better optimizes both the proxy and the true objectives than reinforcement learning. Intuitively, best-of-$n$ is the “brute force” approach, making it more information-theoretically efficient than reinforcement learning, but less computationally efficient at large KLs.

We’re actively studying the scaling properties of proxy objectives as part of our work to align our models with human intent and values. If you’d like to help us with this research, we’re hiring!
