
Build Semantic Search with LLM Embeddings

AllTopicsToday
Published: March 17, 2026
Last updated: March 17, 2026 3:35 am

In this article, you’ll learn how to build a simple semantic search engine using sentence embeddings and nearest neighbors.

Topics we’ll cover include:

Understanding the limitations of keyword-based search.
Generating text embeddings with a sentence transformer model.
Implementing a nearest-neighbor semantic search pipeline in Python.

Let’s get started.

Build Semantic Search with LLM Embeddings
Image by Editor

Introduction

Traditional search engines have historically relied on keyword search. In other words, given a query like “best temples and shrines to visit in Fukuoka, Japan”, results are retrieved based on keyword matching, such that text documents containing co-occurrences of terms like “temple”, “shrine”, and “Fukuoka” are deemed most relevant.

However, this classical approach is notoriously rigid, as it largely relies on exact word matches and misses important semantic nuances such as synonyms or alternative phrasing (for example, “young dog” instead of “puppy”). As a result, highly relevant documents may be inadvertently omitted.
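
To make this rigidity concrete, here is a minimal sketch of keyword-overlap scoring in plain Python. The documents and query are invented for illustration; the point is that a document about puppies scores zero for a query about young dogs because no literal term overlaps:

```python
def keyword_score(query, document):
    # Count how many query terms literally appear in the document
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms)

docs = [
    "puppies are inquisitive by nature",
    "young dogs need daily exercise",
]

query = "young dog behavior"
scores = [keyword_score(query, d) for d in docs]
print(scores)  # the "puppies" document scores 0 despite the semantic match
```

Note that even the second document only matches on the single term “young” ( “dog” vs. “dogs” fails an exact match), which is exactly the brittleness semantic search is meant to fix.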

Semantic search addresses this limitation by focusing on meaning rather than exact wording. Large language models (LLMs) play a key role here, as some of them are trained to translate text into numerical vector representations called embeddings, which encode the semantic information behind the text. When two texts like “small dogs are very curious by nature” and “puppies are inquisitive by nature” are converted into embedding vectors, those vectors will be highly similar due to their shared meaning. Meanwhile, the embedding vectors for “puppies are inquisitive by nature” and “Dazaifu is a signature shrine in Fukuoka” will be very different, as they represent unrelated concepts.
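
To illustrate how similarity between embedding vectors is measured, the sketch below computes cosine similarity with NumPy on small hand-made vectors standing in for real embeddings. The numbers are invented for illustration and are not produced by any model:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their magnitudes;
    # 1.0 means identical direction, 0.0 means orthogonal (unrelated)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (real models output hundreds of dimensions)
small_dogs = np.array([0.9, 0.1, 0.0])
puppies = np.array([0.8, 0.2, 0.1])
shrine = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(small_dogs, puppies))  # high: shared meaning
print(cosine_similarity(puppies, shrine))      # low: unrelated concepts
```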

Following this principle (which you can explore in more depth here), the remainder of this article guides you through the full process of building a compact yet efficient semantic search engine. While minimalistic, it performs effectively and serves as a starting point for understanding how modern search and retrieval systems, such as retrieval augmented generation (RAG) architectures, are built.

The code explained below can be run seamlessly in a Google Colab or Jupyter Notebook instance.

Step-by-Step Guide

First, we make the required imports for this practical example:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

We will use a toy public dataset called “ag_news”, which contains texts from news articles.

We now load the dataset, selecting the first 1000 articles, and extract the “text” column, which contains the article content. Afterwards, we print a short sample from the first article to inspect the data:

print("Loading dataset...")
dataset = load_dataset("ag_news", split="train[:1000]")

# Extract the text column into a Python list
documents = dataset["text"]

print(f"Loaded {len(documents)} documents.")
print(f"Sample: {documents[0][:100]}...")


The next step is to obtain embedding vectors (numerical representations) for our 1000 texts. As mentioned earlier, some LLMs are trained specifically to translate text into numerical vectors that capture semantic traits. Hugging Face sentence transformer models, such as “all-MiniLM-L6-v2”, are a common choice. The following code initializes the model and encodes the batch of text documents into embeddings.

print("Loading embedding model...")
model = SentenceTransformer("all-MiniLM-L6-v2")

# Convert text documents into numerical vector embeddings
print("Encoding documents (this may take a few seconds)...")
document_embeddings = model.encode(documents, show_progress_bar=True)

print(f"Created {document_embeddings.shape[0]} embeddings.")


Next, we initialize a NearestNeighbors object, which implements a nearest-neighbor method to find the k most similar documents to a given query. In terms of embeddings, this means identifying the closest vectors (smallest angular distance). We use the cosine metric, where more similar vectors have smaller cosine distances (and higher cosine similarity values).

search_engine = NearestNeighbors(n_neighbors=5, metric="cosine")

search_engine.fit(document_embeddings)
print("Search engine is ready!")

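Because we chose the cosine metric, the distances that kneighbors returns relate directly to similarity. The sketch below, using toy two-dimensional NumPy vectors assumed purely for illustration, confirms that scikit-learn's cosine distance is 1 minus cosine similarity:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Two toy unit-length vectors; for the second, cos(theta) with [1, 0] is 0.6
vectors = np.array([[1.0, 0.0], [0.6, 0.8]])

nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vectors)
distances, indices = nn.kneighbors([[1.0, 0.0]])

similarity = 1 - distances[0]  # recover similarity from cosine distance
print(similarity)  # approximately [1.0, 0.6]: the point itself, then cos(theta)
```

This is why the search function below converts each reported distance back to a similarity score before printing it.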

The core logic of our search engine is encapsulated in the following function. It takes a plain-text query, specifies how many top results to retrieve via top_k, computes the query embedding, and retrieves the closest neighbors from the index.

The loop inside the function prints the top-k results ranked by similarity:

def semantic_search(query, top_k=3):
    # Embed the incoming search query
    query_embedding = model.encode([query])

    # Retrieve the closest matches
    distances, indices = search_engine.kneighbors(query_embedding, n_neighbors=top_k)

    print(f"\n🔍 Query: '{query}'")
    print("-" * 50)

    for i in range(top_k):
        doc_idx = indices[0][i]
        # Convert cosine distance to similarity (1 - distance)
        similarity = 1 - distances[0][i]

        print(f"Result {i+1} (Similarity: {similarity:.4f})")
        print(f"Text: {documents[int(doc_idx)][:150]}...\n")


And that’s it. To test the function, we can formulate a couple of example search queries:

semantic_search("Wall Street and stock market trends")
semantic_search("Space exploration and rocket launches")


The results are ranked by similarity (truncated here for readability):

🔍 Query: 'Wall Street and stock market trends'
--------------------------------------------------
Result 1 (Similarity: 0.6258)
Text: Stocks Higher Despite Soaring Oil Prices NEW YORK - Wall Street shifted higher Monday as bargain hunters shrugged off skyrocketing oil prices and boug...

Result 2 (Similarity: 0.5586)
Text: Stocks Sharply Higher on Dip in Oil Prices NEW YORK - A drop in oil prices and upbeat outlooks from Wal-Mart and Lowe's prompted new bargain-hunting o...

Result 3 (Similarity: 0.5459)
Text: Strategies for a Sideways Market (Reuters) Reuters - The bulls and the bears are in this together, scratching their heads and wondering what's going t...


🔍 Query: 'Space exploration and rocket launches'
--------------------------------------------------
Result 1 (Similarity: 0.5803)
Text: Redesigning Rockets: NASA Space Propulsion Finds a New Home (SPACE.com) SPACE.com - While the exploration of the Moon and other planets in our solar s...

Result 2 (Similarity: 0.5008)
Text: Canadian Team Joins Rocket Launch Contest (AP) AP - The $10 million competition to send a private manned rocket into space started looking more li...

Result 3 (Similarity: 0.4724)
Text: The Next Great Space Race: SpaceShipOne and Wild Fire to Go For the Gold (SPACE.com) SPACE.com - A piloted rocket ship race to claim a $10 million...


Summary

What we’ve built here can be seen as a gateway to retrieval augmented generation systems. While this example is deliberately simple, semantic search engines like this form the foundational retrieval layer in modern architectures that combine semantic search with large language models.

Now that you know how to build a basic semantic search engine, you may want to explore retrieval augmented generation systems in more depth.
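
As a glimpse of where this leads, the sketch below shows how retrieved documents could be stitched into a prompt for an LLM in a retrieval augmented generation setup. The build_rag_prompt helper and the sample documents are hypothetical; the retrieved_docs list stands in for the output of a search like ours:

```python
def build_rag_prompt(query, retrieved_docs):
    # Concatenate the retrieved passages as numbered context, then append
    # the question; a downstream LLM would answer grounded in this context
    context = "\n\n".join(f"[{i+1}] {doc}" for i, doc in enumerate(retrieved_docs))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

retrieved_docs = [
    "Stocks moved higher Monday as oil prices fell.",
    "Analysts expect a sideways market this quarter.",
]
prompt = build_rag_prompt("How did the stock market move on Monday?", retrieved_docs)
print(prompt)
```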
