Saying what not to do: Can state-of-the-art language models understand negated instructions?

Summary

The sensitivity of language models to subtle differences in their inputs has brought about a new discipline, prompt engineering, with an emerging and evolving set of best practices to make the most of language models’ strengths and sidestep the tasks they struggle with. One such task is understanding negation, and prompt engineering guides often advise readers to avoid “telling the model what not to do”. Indeed, studies have previously demonstrated that models fail to change their answer to a question when the word “not” is inserted into the question. To investigate how well today’s language models understand negation in contexts that more closely resemble the prompts that prompt engineers are likely to write, I tested various models’ ability to answer a question while following an additional instruction to respond in a specific way, first with the constraint phrased as a positive instruction, and then separately with the same constraint phrased as a negated instruction. I found that state-of-the-art language models typically achieve similar results in each case, with some variation from task to task. I conclude that prompt engineers need not apply a blanket rule of avoiding negation when instructing language models, but emphasize the importance of testing for each specific use case.

Background

A surge of interest in generative AI has led to a proliferation of articles providing best practices, tips, and hacks for writing effective prompts for language models.

Among the top five Google results for “prompt engineering best practices”,1 one piece of advice is common to all five articles: “be specific”, telling the model exactly how many blueberry muffins you intend to make with the requested recipe, for which day and location you want to know the high tide, or in which famous poet’s style you would like the requested poem. This advice seems uncontroversial, though of course what constitutes specificity will vary greatly from one use case to the next.

The second most commonly included best practice is more prescriptive: instead of telling models what not to do, for example, “do not use technical jargon”, use positive language to tell the model what it should do, for example, “use simple language”. If models consistently perform poorly at following negated instructions, this is important advice for prompt engineers. But is it true?

Studies have indeed shown poor results when prompting language models with questions that contain negation, for example, the word “not” and contracted forms such as “isn’t” and “don’t”.2 Some of these results suggest that models as recent as the GPT-3 text series are almost entirely insensitive to negation, performing no better than random chance on tasks that test this ability.3

I found this surprising. Could state-of-the-art language models, which have impressed us all with remarkable language comprehension, really be so bad at recognizing and understanding negation that we can’t tell them not to do something?

I decided to dig deeper.

Often, one of the most efficient ways to create a new benchmark for a specific capability is to transform an existing benchmark, modifying each question in an easily automatable way. Many negation benchmarks are derived by applying rules-based logic to existing, non-negation-focused benchmarks in this manner. For example, given the question:

Denmark is a member of ___?
A. nato
B. fun

where the correct answer is “nato”, we might use the rule “insert ‘not’ after the verb, and consider the previously incorrect answer to be correct”. This gives us the following negated question:

Denmark is not a member of ___?
A. nato
B. fun

where the correct answer is “fun”.
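
As a concrete illustration, here is a minimal sketch (in Python) of this kind of rules-based rewrite. The negate_question helper below is hypothetical: it handles only simple auxiliary verbs and is not the actual code used by any of the benchmarks discussed here.

```python
# Hypothetical sketch of a rules-based negation transform for a
# two-choice question; not the exact procedure used by the benchmarks
# discussed in this article.
def negate_question(question, choices, correct_label):
    """Insert 'not' after the first auxiliary verb and flip the answer key."""
    auxiliaries = {"is", "are", "was", "were", "can", "will", "does", "do"}
    words = question.split()
    for i, word in enumerate(words):
        if word.lower() in auxiliaries:
            words.insert(i + 1, "not")
            break
    negated_question = " ".join(words)
    # The choice previously labeled incorrect is now treated as correct.
    new_correct = next(label for label in choices if label != correct_label)
    return negated_question, new_correct

negated, answer = negate_question(
    "Denmark is a member of ___?", {"A": "nato", "B": "fun"}, "A"
)
print(negated)  # Denmark is not a member of ___?
print(answer)   # B
```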

Unfortunately, this new question is not representative of the way we most commonly use negation in the real world. In this case, we are asking the model to choose between a normal-sounding, but incorrect, continuation, and a continuation which is nonsensical. While we might wish the model to prefer nonsensical statements over incorrect ones, this test seems to tell us more about the model’s preference between these two less-than-ideal options than it does about the model’s capability to understand negation. That a model correctly chooses “nato” in the first case and incorrectly chooses “nato” in the second case does not seem like strong evidence that the model does not recognize or understand negation. While I did not evaluate every question in this data set, I found such issues to be common, including some examples where the answer labeled incorrect seems like a clearly better choice (for example, "Some plants are not ___?", A: disrupted, B: edible, correct answer: A).

Experiments

I decided to find out how well language models respond when negation appears in a more natural context, specifically, the context most relevant to prompt engineers: when telling a model what it shouldn’t do.

To avoid challenges with a rules-based approach like the one described above, I manually created 10 sets of instructions that ask a language model to answer a question, with an additional instruction asking the model to respond in a specific way. I chose constraints that could be phrased either positively or negatively in natural-sounding language.4 For example,

Answer the question and provide an explanation. Use only lowercase letters in your response.

and

Answer the question and provide an explanation. Do not use uppercase letters in your response.

This keeps the underlying task the same in each case, such that if the language model understands the negated instruction no better or worse than it understands the positive instruction, then the task should be equally easy or difficult regardless of which version of the instruction is provided. On the other hand, if the model struggles to recognize or understand negation, we would expect the model to perform worse with the negated instruction.

Next, to create a large enough test set to provide informative results, I combined each of these 10 instructions with 114 questions from the popular MMLU benchmark,5 to produce a total of 1140 task instances with positively phrased instructions and the same 1140 task instances with negated instructions.
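
As a rough sketch of how these task instances could be assembled (the variable names, prompt layout, and placeholder question below are my own, not taken from the project repository):

```python
# Hypothetical sketch of assembling the task instances; names and layout
# are illustrative, not the project's actual code.
instruction_pairs = [
    {
        "positive": "Answer the question and provide an explanation. "
                    "Use only lowercase letters in your response.",
        "negated": "Answer the question and provide an explanation. "
                   "Do not use uppercase letters in your response.",
    },
    # ... 9 more instruction pairs, one per constraint
]

# Placeholder; the real list holds 114 MMLU questions
# (the first 2 questions from each of the 57 topics).
questions = [
    "Which of the following is an example of a chemical change?\nA. melting ice\nB. burning wood",
]

def build_prompts(pairs, questions, phrasing):
    """Combine every instruction with every question for one phrasing."""
    return [
        f"{pair[phrasing]}\n\n{question}"
        for pair in pairs
        for question in questions
    ]

positive_prompts = build_prompts(instruction_pairs, questions, "positive")
negated_prompts = build_prompts(instruction_pairs, questions, "negated")
# With 10 instruction pairs and 114 questions, each list has 1140 prompts.
```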

Finally, I evaluated how often various models succeeded in following each of the positively phrased and negated instructions. In each case, I evaluated the model’s response against a regular expression representing the constraint, assessing only whether the model correctly followed the instruction, and not the quality of the responses against any other measure.
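
For example, the check for the lowercase-only constraint could be as simple as the following sketch; the actual patterns for all 10 tasks are in the project repository, so this particular pattern is illustrative only.

```python
import re

# Minimal sketch of the regex-based evaluation for the lowercase-only
# constraint: the response passes if it contains no uppercase letters.
def follows_lowercase_constraint(response: str) -> bool:
    """Return True if the response contains no uppercase letters."""
    return re.search(r"[A-Z]", response) is None

assert follows_lowercase_constraint("denmark is a member of nato.")
assert not follows_lowercase_constraint("Denmark is a member of NATO.")
```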

Results and conclusions

Table 1 and Figure 1 below summarize the results. I include the ratio of (valid responses with negated instructions) to (valid responses with positively phrased instructions) as a metric for “negation understanding”, where a score of 0 indicates that the model is unable to follow negated instructions at all,6 and a score of approximately 1 indicates that the model follows positively phrased and negated instructions approximately equally well.
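
As a worked example of this metric (the function name is mine, not from the project code), using the gpt-3.5-turbo-0613 totals from Table 1:

```python
def negation_understanding_ratio(success_negated: float, success_positive: float) -> float:
    """A value near 1 means negated and positive phrasing are followed about equally well."""
    return success_negated / success_positive

# gpt-3.5-turbo-0613: 50.2% success with negation, 50.7% with positive phrasing.
print(round(negation_understanding_ratio(0.502, 0.507), 2))  # 0.99
```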

| Provider  | Model              | Success rate with positive phrasing | Success rate with negation | Negation understanding ratio |
|-----------|--------------------|-------------------------------------|----------------------------|------------------------------|
| AI21      | j2-mid             | 49.1%                               | 32.4%                      | 0.66                         |
| AI21      | j2-ultra           | 43.0%                               | 36.2%                      | 0.84                         |
| Anthropic | claude-1.3         | 77.4%                               | 68.7%                      | 0.89                         |
| Anthropic | claude-2.0         | 48.5%                               | 49.4%                      | 1.02                         |
| Cohere    | command-light      | 20.9%                               | 19.9%                      | 0.95                         |
| Cohere    | command            | 34.8%                               | 26.5%                      | 0.76                         |
| OpenAI    | text-ada-001       | 13.0%                               | 12.5%                      | 0.97                         |
| OpenAI    | text-babbage-001   | 9.6%                                | 6.8%                       | 0.71                         |
| OpenAI    | text-curie-001     | 20.9%                               | 11.1%                      | 0.53                         |
| OpenAI    | text-davinci-001   | 25.8%                               | 20.4%                      | 0.79                         |
| OpenAI    | text-davinci-002   | 37.4%                               | 31.2%                      | 0.84                         |
| OpenAI    | text-davinci-003   | 52.6%                               | 50.0%                      | 0.95                         |
| OpenAI    | gpt-3.5-turbo-0613 | 50.7%                               | 50.2%                      | 0.99                         |
| OpenAI    | gpt-4-0613         | 51.7%                               | 45.4%                      | 0.88                         |

Table 1: Success rate for following instructions that constrain the language model's response across 1140 task instances.

 

Figure 1: Negation understanding ratio for all models tested, plotted against estimated number of parameters on a log scale.

 

The results confirm that many models indeed perform worse at following instructions that are phrased using negation, with text-curie-001 and j2-mid achieving negation understanding ratios as low as 0.53 and 0.66 respectively (though even these are much better than 0, the value we would expect from complete insensitivity to negation). However, with OpenAI’s GPT-3.5 and Anthropic’s Claude 2.0, I find essentially no difference in results between the positively phrased and negated instructions when averaging across tasks, with some tasks showing better performance with positively phrased instructions and some tasks showing better performance with negated instructions (shown below in Table 2 and Table 3).

| Task   | Success rate with positive phrasing | Success rate with negation |
|--------|-------------------------------------|----------------------------|
| 1      | 1.8%                                | 2.6%                       |
| 2      | 0.0%                                | 0.0%                       |
| 3      | 41.2%                               | 36.8%                      |
| 4      | 99.1%                               | 89.5%                      |
| 5      | 6.1%                                | 3.5%                       |
| 6      | 61.4%                               | 71.9%                      |
| 7      | 100.0%                              | 100.0%                     |
| 8      | 100.0%                              | 97.4%                      |
| 9      | 91.2%                               | 91.2%                      |
| 10     | 6.1%                                | 8.8%                       |
| Totals | 50.7%                               | 50.2%                      |

Table 2: Success rate by task for GPT-3.5 (gpt-3.5-turbo-0613).

 

| Task   | Success rate with positive phrasing | Success rate with negation |
|--------|-------------------------------------|----------------------------|
| 1      | 5.3%                                | 4.4%                       |
| 2      | 1.8%                                | 0.0%                       |
| 3      | 11.4%                               | 19.3%                      |
| 4      | 99.1%                               | 99.1%                      |
| 5      | 0.9%                                | 0.9%                       |
| 6      | 66.7%                               | 71.1%                      |
| 7      | 100.0%                              | 100.0%                     |
| 8      | 100.0%                              | 99.1%                      |
| 9      | 100.0%                              | 100.0%                     |
| 10     | 0.0%                                | 0.0%                       |
| Totals | 48.5%                               | 49.4%                      |

Table 3: Success rate by task for Claude 2.0.

 

Interestingly, while GPT-4 achieves similar performance for most of the 10 tasks, it consistently fails to follow negated instructions on one task, significantly affecting the overall result (shown below as task 1 in Table 4). This is the example task mentioned earlier, requiring the model to use only lowercase letters. On this task, GPT-4 scores 36.8% with positive phrasing, but 0% with the negated instruction “Do not use uppercase letters in your response.”

| Task   | Success rate with positive phrasing | Success rate with negation |
|--------|-------------------------------------|----------------------------|
| 1      | 36.8%                               | 0.0%                       |
| 2      | 1.8%                                | 0.0%                       |
| 3      | 67.5%                               | 63.2%                      |
| 4      | 100.0%                              | 92.1%                      |
| 5      | 21.1%                               | 6.1%                       |
| 6      | 92.1%                               | 92.1%                      |
| 7      | 97.4%                               | 100.0%                     |
| 8      | 100.0%                              | 100.0%                     |
| 9      | 0.0%                                | 0.0%                       |
| 10     | 0.0%                                | 0.0%                       |
| Totals | 51.7%                               | 45.4%                      |

Table 4: Success rate by task for GPT-4 (gpt-4-0613).

 

I conclude that the claim that state-of-the-art language models do not understand negated instructions does not hold in general. I point to Claude 2.0 and GPT-3.5 as examples of models that, on average, perform about equally well with positively phrased and negated instructions for the set of tasks I constructed, and suggest that prompt engineers working with such models should feel free to try out negated instructions in contexts where negative phrasing is natural. Moreover, the fact that these results vary on a task-by-task basis, and the case of GPT-4 achieving significantly different performance on one task, highlights that state-of-the-art language models can still be sensitive to minor changes in prompts, emphasizing the value of testing specific prompts for every use case.

Other observations

Non-standard scaling

Typically, as language models increase in size, their performance on a given task increases monotonically. However, researchers have found some tasks for which performance decreases with model size (“inverse scaling”) or first decreases as models get somewhat larger, and then increases again as they get much larger (“U-shaped scaling”). Zhang et al. demonstrated inverse and U-shaped scaling when prompting models with negated multiple-choice questions using the rules-based approach discussed earlier.3

I also find evidence for non-standard scaling on the task of following negated instructions that constrain a language model’s outputs (shown below in Figure 2). OpenAI’s GPT-3 text series, a set of models sharing the same architecture at 4 different scales, shows clear U-shaped scaling on the negation understanding ratio. Additionally, while Cohere’s command model achieves better absolute results than its smaller command-light model with both positively phrased and negated instructions, it scores lower on the negation understanding ratio. On the other hand, AI21’s J2 series shows standard positive scaling on the negation understanding ratio.7

Figure 2: Negation understanding ratio by model size, for models of the same family.

 

Overall ability to adhere to constraints

While Claude 2.0 improves on Claude 1.3’s ability to understand negation as measured by the negation understanding ratio, its overall ability to follow instructions that constrain its response is significantly worse for both types of phrasing (shown below in Table 5). I suspect that this is an artifact of the Anthropic team’s efforts to improve “harmlessness” in Claude 2, for example, to prevent the model from using toxic or biased language, or assisting with illegal or violent activities. It seems reasonable that a model’s reluctance to deviate from its normal approach to answering questions (including simple behaviors like following normal capitalization rules) might correlate with its reluctance to engage in directly harmful behaviors, as both involve the model being less “steerable”. This poses an interesting challenge for language model providers, because application developers have many reasons to want to steer models to behave or respond in a certain way. In other words, “steerability” is useful for “helpfulness” but detrimental for “harmlessness”, consistent with the broader tension between these two concepts that the Anthropic team has highlighted.8

| Provider  | Model      | Success rate with positive phrasing | Success rate with negation | Negation understanding ratio |
|-----------|------------|-------------------------------------|----------------------------|------------------------------|
| Anthropic | claude-1.3 | 77.4%                               | 68.7%                      | 0.89                         |
| Anthropic | claude-2.0 | 48.5%                               | 49.4%                      | 1.02                         |

Table 5: Extract from Table 1 highlighting Claude 2.0's poorer performance at following instructions that constrain its response.

 

Closing thoughts

Testing language models with unusually worded prompts can teach us something about how the models work, including their preferences between concepts such as factual correctness and normal-sounding sentences. However, these tests aren’t always representative of the more natural language that prompt engineers are likely to use in real-world applications today, and can lead us to false conclusions about how we should instruct models. Further, as language model capabilities continue to evolve at a remarkable rate, our notions of what they can and can’t do become outdated quickly.

Finally, the lack of regularity across the results in these tests underscores the importance of testing for a specific use case rather than relying on general assumptions about language models’ capabilities or popular prompt engineering techniques. Until language models cease to be sensitive to minor changes in language, prompt engineering is best practiced as a science.

 

 



Footnotes and references

1 Top 5 non-paywalled results for “prompt engineering best practices” on 24-Sep-2023, querying google.com in a private session from a US IP address: 1. Best practices for prompt engineering | Google Cloud Blog, 2. Best practices for prompt engineering | OpenAI, 3. Mastering Prompt Engineering for ChatGPT | Karan Kakwani, 4. 10 prompt engineering tips and best practices | TechTarget, 5. Prompt Engineering Tutorial | LambdaTest.

2 Nora Kassner and Hinrich Schütze. 2020. Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7811–7818, Online. Association for Computational Linguistics.

3 Yuhui Zhang, Michihiro Yasunaga, Zhengping Zhou, Jeff Z. HaoChen, James Zou, Percy Liang, and Serena Yeung. 2023. Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7479–7498, Toronto, Canada. Association for Computational Linguistics.

4 The details of all 10 tasks can be found in this project’s git repository.

5 To create a diverse data set, I used the first 2 questions from each of MMLU’s 57 topics.

6 In general, I tried to choose constraints that were unlikely to be fulfilled by chance alone, though not all of the tasks satisfy this criterion equally well.

7 I was unable to test AI21’s smallest model, j2-light, due to a confirmed but not yet fixed bug in the AI21 API.

8 Yuntao Bai, Andy Jones, Kamal Ndousse, et al. 2022. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2204.05862v1.

Code and full results are available at: https://github.com/alexbleakley/cloni.