Evaluating Perplexity on Language Models

AllTopicsToday
Published: January 12, 2026
Last updated: January 12, 2026 7:58 am

A language model is a probability distribution over sequences of tokens. When you train a language model, you want to measure how accurately it predicts human language use. This is a difficult task, and you need a metric to evaluate the model. In this article, you will learn about the perplexity metric. In particular, you will learn:

What perplexity is, and how to compute it
How to evaluate the perplexity of a language model with sample data

Let's get started.

Evaluating Perplexity on Language Models
Photo by Lucas Davies. Some rights reserved.

Overview

This article is divided into two parts; they are:

What Is Perplexity and How to Compute It
Evaluate the Perplexity of a Language Model with the HellaSwag Dataset

What Is Perplexity and How to Compute It

Perplexity is a measure of how well a language model predicts a sample of text. It is defined as the inverse of the geometric mean of the probabilities of the tokens in the sample. Mathematically, perplexity is defined as:

$$
PPL(x_{1:L}) = \prod_{i=1}^L p(x_i)^{-1/L} = \exp\big(-\frac{1}{L} \sum_{i=1}^L \log p(x_i)\big)
$$

Perplexity is a function of a particular sequence of tokens. In practice, it is more convenient to compute perplexity from the mean of the log probabilities, as shown in the formula above.

Perplexity is a metric that quantifies the average degree to which a language model hesitates about the next token. If the language model is completely certain, the perplexity is 1. If the language model is completely uncertain, then every token in the vocabulary is equally likely; the perplexity equals the vocabulary size. You should not expect perplexity to go beyond this range.
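To make the definition concrete, here is a minimal sketch (not from the original article) that computes perplexity directly from a list of per-token probabilities, matching the formula above:

import math

def perplexity(token_probs):
    """Perplexity from the probability assigned to each token in a sequence."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

print(perplexity([1.0, 1.0, 1.0]))    # a completely certain model: 1.0
print(perplexity([1 / 50257] * 3))    # uniform over a 50,257-token vocabulary: ~50257

The two prints illustrate the two extremes: certainty gives a perplexity of 1, and a uniform distribution gives a perplexity equal to the vocabulary size.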

Evaluate the Perplexity of a Language Model with the HellaSwag Dataset

Perplexity is a dataset-dependent metric. One dataset you can use is HellaSwag. It is a dataset with train, test, and validation splits. It is available on the Hugging Face hub, and you can load it with the following code:

import datasets

dataset = datasets.load_dataset("HuggingFaceFW/hellaswag")
print(dataset)

for sample in dataset["validation"]:
    print(sample)
    break

Running this code will print the following:

DatasetDict({
    train: Dataset({
        features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings',
                   'source_id', 'split', 'split_type', 'label'],
        num_rows: 39905
    })
    test: Dataset({
        features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings',
                   'source_id', 'split', 'split_type', 'label'],
        num_rows: 10003
    })
    validation: Dataset({
        features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings',
                   'source_id', 'split', 'split_type', 'label'],
        num_rows: 10042
    })
})
{'ind': 24, 'activity_label': 'Roof shingle removal',
 'ctx_a': 'A man is sitting on a roof.', 'ctx_b': 'he',
 'ctx': 'A man is sitting on a roof. he', 'endings': [
    'is using wrap to wrap a pair of skis.', 'is ripping level tiles off.',
    "is holding a rubik's cube.", 'starts pulling up roofing on a roof.'
 ], 'source_id': 'activitynet~v_-JhWjGDPHMY', 'split': 'val', 'split_type': 'indomain',
 'label': '3'}

You can see that the validation split has 10,042 samples. This is the split you will use in this article. Each sample is a dictionary. The key "activity_label" specifies the activity category, and the key "ctx" provides the context to be completed. The model is expected to complete the sequence by selecting one of the four endings. The key "label", with values 0 to 3, indicates which ending is correct.
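As a quick illustration (a hypothetical snippet, reusing the dataset object loaded above), you can assemble the four candidate completions of a sample and pick out the one marked correct by its label:

sample = dataset["validation"][0]
candidates = [sample["ctx"] + " " + ending for ending in sample["endings"]]
print(candidates[int(sample["label"])])   # the ending marked as correct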

With this, you can write a short piece of code to evaluate your own language model. Let's use a small model from Hugging Face as an example:

import datasets
import torch
import torch.nn.functional as F
import tqdm
import transformers

model = "openai-community/gpt2"

# Load the model
torch.set_default_device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained(model)
model = transformers.AutoModelForCausalLM.from_pretrained(model)

# Load the dataset: HellaSwag has train, test, and validation splits
dataset = datasets.load_dataset("hellaswag", split="validation")

# Evaluate the model: Compute the perplexity of each ending
num_correct = 0
for sample in tqdm.tqdm(dataset):
    # tokenize text from the sample
    text = tokenizer.encode(" " + sample["activity_label"] + ". " + sample["ctx"])
    endings = [tokenizer.encode(" " + x) for x in sample["endings"]]  # 4 endings
    groundtruth = int(sample["label"])  # integer, 0 to 3
    # generate logits for each ending
    perplexities = [0.0] * 4
    for i, ending in enumerate(endings):
        # run the entire input and ending through the model
        input_ids = torch.tensor(text + ending).unsqueeze(0)
        output = model(input_ids).logits
        # extract the logits for each token in the ending
        logits = output[0, len(text)-1:, :]
        token_probs = F.log_softmax(logits, dim=-1)
        # accumulate the log probability of generating the ending
        log_prob = 0.0
        for j, token in enumerate(ending):
            log_prob += token_probs[j, token]
        # convert the sum of log probabilities to perplexity
        perplexities[i] = torch.exp(-log_prob / len(ending))
    # print the perplexity of each ending
    print(sample["activity_label"] + ". " + sample["ctx"])
    correct = perplexities[groundtruth] == min(perplexities)
    for i, p in enumerate(perplexities):
        if i == groundtruth:
            symbol = "(O)" if correct else "(!)"
        elif p == min(perplexities):
            symbol = "(X)"
        else:
            symbol = "   "
        print(f"Ending {i}: {p:.4g} {symbol} - {sample['endings'][i]}")
    if correct:
        num_correct += 1

print(f"Accuracy: {num_correct}/{len(dataset)} = {num_correct / len(dataset):.4f}")

This code loads the smallest GPT-2 model from the Hugging Face Hub. It is a 124M-parameter model that you can easily run on a modest computer. The model and tokenizer are loaded using the Hugging Face transformers library. You also load the HellaSwag validation split.

In the for-loop, you tokenize the activity label and the context. You also tokenize each of the four endings. Note that tokenizer.encode() is the method for using the tokenizer from the transformers library. It is different from the tokenizer object you used in the previous article.

Next, for each ending, you run the concatenated input and ending through the model. The input_ids tensor is a 2D tensor of integer token IDs with a batch dimension of 1. The model returns an object, from which you extract the output logits tensor. This is different from the model you built in the previous article, as this is a model object from the transformers library. You can readily replace it with your own trained model object with minor modifications.

GPT-2 is a decoder-only transformer model. It processes the input with a causal mask. For an input tensor of shape $(1, L)$, the output logits tensor has shape $(1, L, V)$, where $V$ is the vocabulary size. The output at position $p$ corresponds to the model's estimate of the token at position $p+1$, depending on the input at positions 1 to $p$. Therefore, you extract the logits starting at offset $n-1$, where $n$ is the length of the combined activity label and context. You then convert the logits to log probabilities and compute the average over the length of each ending.
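To make the offset arithmetic explicit, here is a hypothetical helper (not part of the original code) that scores one ending given the tokenized context, using the same shift between logit positions and target tokens:

def ending_log_prob(model, context_ids, ending_ids):
    """Mean log probability of the ending tokens, given the context tokens.

    The logit at position p predicts the token at position p + 1, so the
    ending tokens are scored by logits starting at offset len(context_ids) - 1.
    """
    input_ids = torch.tensor(context_ids + ending_ids).unsqueeze(0)
    logits = model(input_ids).logits[0, len(context_ids) - 1:, :]
    log_probs = F.log_softmax(logits, dim=-1)
    total = sum(log_probs[j, tok] for j, tok in enumerate(ending_ids))
    return total / len(ending_ids)

# the perplexity of an ending is then torch.exp(-ending_log_prob(model, text, ending))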

The value token_probs[j, token] is the log probability at position j for the token with ID token. The mean log probability of the tokens in the ending is used to compute the perplexity. A good model is expected to identify the correct ending as the one with the lowest perplexity. You can evaluate a model by counting the number of correct predictions over the entire HellaSwag validation split. When you run this code, you will see the following:

…
Finance and Business. [header] How to buy a peridot … Look at a variety of stones…
Ending 0: 13.02 (X) – Be sure to watch several of the gems, particularly eme…
Ending 1: 30.19     – Not only are they among the delicates among them, but they can be…
Ending 2: 34.96 (!) – Familiarize yourself with the different shades that it comes in, …
Ending 3: 28.85     – Neither peridot nor many other jade or allekite stones are necess…
Family Life. [header] How to tell if your teen is being abused … Pay attention to…
Ending 0: 16.58     – Try to figure out why they are dressing something that is frowned…
Ending 1: 22.01     – Read the following as a rule for determining your teen's behaviou…
Ending 2: 15.21 (O) – [substeps] For instance, your teen may try to hide the signs of a…
Ending 3: 23.91     – [substeps] Ask your teen if they have black tights (with stripper…
Accuracy: 3041/10042 = 0.3028

The code prints the perplexity of each ending, marking the correct answer with (O) or (!) and the model's wrong prediction with (X). You can see that GPT-2 has a perplexity of 10 to 20, even for a correct answer. Advanced LLMs can achieve perplexity below 10, even with a much larger vocabulary size than GPT-2. More important is whether the model can identify the correct ending: the one that naturally completes the sentence. It should be the one with the lowest perplexity; otherwise, the model cannot generate the correct ending. GPT-2 achieves only 30% accuracy on this dataset.

You can also repeat the code with a different model. Here are the results:

Model openai-community/gpt2: This is the smallest GPT-2 model with 124M parameters, used in the code above. The accuracy is 3041/10042, or 30.28%.
Model openai-community/gpt2-medium: This is a larger GPT-2 model with 355M parameters. The accuracy is 3901/10042, or 38.85%.
Model meta-llama/Llama-3.2-1B: This is the smallest model in the Llama family, with 1B parameters. The accuracy is 5731/10042, or 57.07%.

As expected, larger models achieve higher accuracy.

Note that you should not compare perplexities across models with vastly different architectures. Since perplexity is a metric in the range of 1 to the vocabulary size, it depends heavily on the tokenizer. The reason becomes apparent when you compare the perplexity in the code above after replacing GPT-2 with Llama 3.2 1B: the perplexity is an order of magnitude higher for Llama 3, but the accuracy is also higher. This is because GPT-2 has a vocabulary size of only 50,257, whereas Llama 3.2 1B has a vocabulary size of 128,256.
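If you do want a number that is easier to compare across tokenizers, one common workaround is to normalize by the number of characters in the text instead of the number of tokens. The sketch below is a hypothetical variant of the loop above (it assumes you keep the summed log_prob for each ending), not something from the original article:

def char_normalized_perplexity(log_prob_sum, text):
    """Perplexity normalized by string length rather than token count."""
    return torch.exp(-log_prob_sum / len(text))

# e.g. inside the evaluation loop, after summing log_prob over the ending tokens:
# perplexities[i] = char_normalized_perplexity(log_prob, sample["endings"][i])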

Further Readings

Below are some resources that you may find useful:

Summary

In this article, you learned about the perplexity metric and how to evaluate the perplexity of a language model with the HellaSwag dataset. In particular, you learned:

Perplexity measures the average degree to which a model hesitates about the next token.
Perplexity is a metric sensitive to the vocabulary size.
Computing perplexity means computing the inverse of the geometric mean of the probabilities of the tokens in the sample.
