7 Advanced Feature Engineering Tricks for Text Data Using LLM Embeddings
Introduction
Large language models (LLMs) are not only good at understanding and generating text; they can also turn raw text into numerical representations called embeddings. These embeddings are useful for incorporating additional information into traditional predictive machine learning models, such as those used in scikit-learn, to improve downstream performance.
This article presents seven advanced Python examples of feature engineering tricks that add extra value to text data by leveraging LLM-generated embeddings. The aim is to improve the accuracy and robustness of downstream machine learning models that rely on text, in applications such as sentiment analysis, topic classification, document clustering, and semantic similarity detection.
Common setup for all examples
Unless stated otherwise, the seven example tricks below make use of this common setup. We rely on Sentence Transformers for embeddings and scikit-learn for modeling utilities.
!pip install sentence-transformers scikit-learn -q
from sentence_transformers import SentenceTransformer
import numpy as np
# Load a lightweight LLM embedding model; builds 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
1. Combining TF-IDF and Embedding Features
The first example shows how to jointly extract both TF-IDF and LLM-generated sentence-embedding features from a source text dataset such as fetch_20newsgroups. We then combine these feature types to train a logistic regression model that classifies news texts based on the combined features, often boosting accuracy by capturing both lexical and semantic information.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# Loading data
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
texts, y = data.data[:500], data.target[:500]
# Extracting features of two broad types
tfidf = TfidfVectorizer(max_features=300).fit_transform(texts).toarray()
emb = model.encode(texts, show_progress_bar=False)
# Combining features and training ML model
X = np.hstack([tfidf, StandardScaler().fit_transform(emb)])
clf = LogisticRegression(max_iter=1000).fit(X, y)
# Note: this score is computed on the same rows used for training
print("Training accuracy:", clf.score(X, y))
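The score in the example above is measured on the rows the model was trained on, so it tends to be optimistic. A held-out evaluation needs the scaling statistics to come from the training rows only. The following is a minimal sketch of that idea, not part of the original example: random arrays stand in for the TF-IDF and embedding blocks, and the helper name combine_blocks is hypothetical.

```python
import numpy as np

def combine_blocks(tfidf_train, emb_train, tfidf_test, emb_test):
    """Standardize the embedding block with training statistics only, then concatenate."""
    mean = emb_train.mean(axis=0)
    std = emb_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    X_train = np.hstack([tfidf_train, (emb_train - mean) / std])
    X_test = np.hstack([tfidf_test, (emb_test - mean) / std])
    return X_train, X_test

# Synthetic stand-ins (assumption): 5 TF-IDF columns, 3 embedding dimensions
rng = np.random.default_rng(0)
tfidf_train, tfidf_test = rng.random((8, 5)), rng.random((4, 5))
emb_train, emb_test = rng.normal(size=(8, 3)), rng.normal(size=(4, 3))

X_train, X_test = combine_blocks(tfidf_train, emb_train, tfidf_test, emb_test)
print(X_train.shape, X_test.shape)
```

With real data, the TF-IDF vectorizer would likewise be fitted on the training texts only and then applied to the test texts.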
2. Topic-Aware Embedding Clusters
This trick takes a few sample text sequences, generates embeddings using the preloaded language model, applies K-Means clustering to those embeddings to assign topics, and then combines the embeddings with a one-hot encoding of each example's cluster identifier (its "topic class") to build a new feature representation. It is a useful strategy for creating compact topic meta-features.
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder
texts = ["Tokyo Tower is a popular landmark.", "Sushi is a traditional Japanese dish.",
         "Mount Fuji is a famous volcano in Japan.", "Cherry blossoms bloom in the spring in Japan."]
emb = model.encode(texts)
# Cluster the embeddings into two topics and one-hot encode the cluster ids
topics = KMeans(n_clusters=2, n_init="auto", random_state=42).fit_predict(emb)
topic_ohe = OneHotEncoder(sparse_output=False).fit_transform(topics.reshape(-1, 1))
X = np.hstack([emb, topic_ohe])
print(X.shape)
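To reuse the topic feature on unseen texts, the cluster assignment has to come from centroids learned on the training embeddings. The sketch below illustrates this with NumPy only. The centroids array stands in for the cluster_centers_ attribute of a fitted KMeans model, the tiny 2-dimensional vectors stand in for real embeddings, and the helper name topic_one_hot is hypothetical.

```python
import numpy as np

def topic_one_hot(emb, centroids):
    """Assign each embedding to its nearest centroid and one-hot encode the result."""
    # Squared Euclidean distance from every embedding to every centroid
    dists = ((emb[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    topics = dists.argmin(axis=1)
    one_hot = np.eye(len(centroids))[topics]
    return topics, one_hot

# Illustrative values (assumption): two centroids, three new "embeddings"
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
new_emb = np.array([[1.0, -1.0], [9.0, 11.0], [0.5, 0.2]])

topics, topic_ohe = topic_one_hot(new_emb, centroids)
X_new = np.hstack([new_emb, topic_ohe])
print(topics)       # [0 1 0]
print(X_new.shape)  # (3, 4)
```

Fixing the one-hot width to the number of centroids keeps the feature matrix the same shape even when a batch of new texts happens to fall into a single cluster.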
3. Semantic Anchor Similarity Features
This simple strategy computes similarity to a small set of fixed "anchor" (or reference) sentences used as compact semantic descriptors: essentially, semantic landmarks. Each column in the similarity-feature matrix contains the similarity of the text to one anchor. The main value lies in allowing the model to learn relationships between the text's similarity to key concepts and a target variable, which is useful for text classification models.
from sklearn.metrics.pairwise import cosine_similarity
anchors = ["space mission", "car performance", "politics"]
anchor_emb = model.encode(anchors)
texts = ["The rocket launch was successful.", "The car handled well on the track."]
emb = model.encode(texts)
# One row per text, one column per anchor
sim_features = cosine_similarity(emb, anchor_emb)
print(sim_features)
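For readers who want to see what the similarity matrix contains, here is a small NumPy sketch of the same computation: normalize each row to unit length, then take dot products. The 2-dimensional vectors are made-up stand-ins for real embeddings, and anchor_similarity is a hypothetical helper name.

```python
import numpy as np

def anchor_similarity(emb, anchor_emb):
    """Cosine similarity of every text embedding (rows) to every anchor (columns)."""
    emb_unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    anchor_unit = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    return emb_unit @ anchor_unit.T

# Illustrative values (assumption): two anchors along the coordinate axes
anchor_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
text_vecs = np.array([[2.0, 0.0], [1.0, 1.0]])

sim = anchor_similarity(text_vecs, anchor_vecs)
print(sim.round(3))
```

The first text points in the same direction as the first anchor, so its row is approximately (1, 0); the second sits halfway between both anchors and gets roughly 0.707 for each.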
4. Meta-Feature Stacking via Auxiliary Sentiment Classifier
For text associated with labels such as sentiments, the following feature-engineering technique adds extra value. A meta-feature is built as the prediction probability returned by an auxiliary classifier trained on the embeddings. This meta-feature is stacked with the original embeddings, resulting in an augmented feature set that can improve downstream performance by exposing potentially more discriminative information than raw embeddings alone.
A slight additional setup is required for this example:
!pip install sentence-transformers scikit-learn -q
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim
# Small dataset containing texts and sentiment labels
texts = ["I love this!", "This is terrible.", "Amazing quality.", "Not good at all."]
y = np.array([1, 0, 1, 0])
# Obtain embeddings from the embedder LLM
emb = embedder.encode(texts, show_progress_bar=False)
# Train an auxiliary classifier on embeddings
X_train, X_test, y_train, y_test = train_test_split(
    emb, y, test_size=0.5, random_state=42, stratify=y
)
meta_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Leverage the auxiliary model's predicted probability as a meta-feature
meta_feature = meta_clf.predict_proba(emb)[:, 1].reshape(-1, 1)  # Probability of the positive class
# Augment original embeddings with the meta-feature
# Do not forget to scale the embeddings for consistency
scaler = StandardScaler()
emb_scaled = scaler.fit_transform(emb)
X_aug = np.hstack([emb_scaled, meta_feature])  # Stack features together
print("emb shape:", emb.shape)
print("meta_feature shape:", meta_feature.shape)
print("augmented shape:", X_aug.shape)
print("meta clf accuracy on test slice:", meta_clf.score(X_test, y_test))
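In the example above, the auxiliary classifier also predicts probabilities for the rows it was trained on, which can leak label information into the meta-feature. A common remedy, shown here as a sketch and not as the original method, is to use out-of-fold predictions via scikit-learn's cross_val_predict. The synthetic array stands in for real sentence embeddings, and out_of_fold_meta_feature is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def out_of_fold_meta_feature(emb, y, n_splits=5):
    """One positive-class probability per row, each predicted by a model that never saw that row."""
    clf = LogisticRegression(max_iter=1000)
    proba = cross_val_predict(clf, emb, y, cv=n_splits, method="predict_proba")
    return proba[:, 1].reshape(-1, 1)

# Synthetic stand-in for embeddings (assumption): 40 texts, 16 dimensions
rng = np.random.default_rng(0)
y = np.array([0, 1] * 20)
emb = rng.normal(size=(40, 16)) + y[:, None]  # shift class 1 so the classes differ

meta_feature = out_of_fold_meta_feature(emb, y)
X_aug = np.hstack([emb, meta_feature])
print(X_aug.shape)  # (40, 17)
```

This needs enough labeled examples per class to form the folds, so it does not apply directly to the four-sentence toy dataset.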
5. Embedding Compression and Nonlinear Expansion
This strategy applies PCA dimensionality reduction to compress the raw embeddings built by the LLM and then polynomially expands these compressed embeddings. It may sound odd at first, but this can be an effective way to capture nonlinear structure while maintaining efficiency.
!pip install sentence-transformers scikit-learn -q
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# Loading a lightweight embedding language model
embedder = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["The satellite was launched into orbit.",
         "Cars require regular maintenance.",
         "The telescope observed distant galaxies."]
# Obtaining embeddings
emb = embedder.encode(texts, show_progress_bar=False)
# Compressing with PCA and enriching with polynomial features
pca = PCA(n_components=2).fit_transform(emb)  # n_components cannot exceed the number of samples (3 here)
poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(pca)
print("Original shape:", emb.shape)
print("After PCA:", pca.shape)
print("After polynomial expansion:", poly.shape)
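The reason for compressing before expanding becomes clearer when counting columns. A full polynomial expansion of degree d over n inputs, without the bias term, yields C(n + d, d) - 1 features. The short sketch below computes that count; poly_feature_count is a hypothetical helper name.

```python
from math import comb

def poly_feature_count(n_inputs, degree):
    """Number of columns from a full polynomial expansion without the bias term."""
    return comb(n_inputs + degree, degree) - 1

print(poly_feature_count(2, 2))    # 5
print(poly_feature_count(384, 2))  # 74304
```

Two PCA components expand to 5 columns, whereas expanding the raw 384-dimensional embeddings at degree 2 would produce 74,304 columns.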
6. Relational Learning with Pairwise Contrastive Features
The goal here is to build pairwise relational features from text embeddings. Interrelated features, built in a contrastive fashion, can highlight aspects of similarity and dissimilarity. This is particularly effective for predictive processes that inherently entail comparisons among texts.
!pip install sentence-transformers -q
from sentence_transformers import SentenceTransformer
import numpy as np
# Loading embedder
embedder = SentenceTransformer("all-MiniLM-L6-v2")
# Example text pairs
pairs = [
    ("The car is fast.", "The vehicle moves quickly."),
    ("The sky is blue.", "Bananas are yellow.")
]
# Generating embeddings for both sides
emb1 = embedder.encode([p[0] for p in pairs], show_progress_bar=False)
emb2 = embedder.encode([p[1] for p in pairs], show_progress_bar=False)
# Building contrastive features: absolute difference and element-wise product
X_pairs = np.hstack([np.abs(emb1 - emb2), emb1 * emb2])
print("Pairwise feature shape:", X_pairs.shape)
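One property of this feature construction worth noting is that both the absolute difference and the element-wise product are symmetric, so swapping the two texts in a pair yields the same feature row. The sketch below checks this on small made-up vectors; pair_features is a hypothetical helper name wrapping the same expression used above.

```python
import numpy as np

def pair_features(emb1, emb2):
    """Absolute difference and element-wise product, concatenated per pair."""
    return np.hstack([np.abs(emb1 - emb2), emb1 * emb2])

# Illustrative 3-dimensional vectors (assumption) in place of real embeddings
a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[3.0, 2.0, 1.0]])

forward = pair_features(a, b)
backward = pair_features(b, a)
print(forward)                         # [[2. 0. 2. 3. 4. 3.]]
print(np.allclose(forward, backward))  # True
```

This suits tasks such as paraphrase or duplicate detection, where the order of the two texts carries no meaning.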
7. Cross-Modal Fusion
The last trick combines LLM embeddings with simple linguistic or numeric features, such as punctuation ratio or other domain-specific engineered features. It contributes to more holistic text-derived features by uniting semantic signals with handcrafted linguistic aspects. Here is an example that measures punctuation in the text.
!pip install sentence-transformers -q
from sentence_transformers import SentenceTransformer
import numpy as np, re
# Loading embedder
embedder = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["Mars mission 2024!", "New electric car model launched."]
# Computing embeddings
emb = embedder.encode(texts, show_progress_bar=False)
# Adding simple numeric text features
lengths = np.array([len(t.split()) for t in texts]).reshape(-1, 1)
# Share of characters that are neither word characters nor whitespace
punct_ratio = np.array([len(re.findall(r"[^\w\s]", t)) / len(t) for t in texts]).reshape(-1, 1)
# Combining all features
X = np.hstack([emb, lengths, punct_ratio])
print("Final feature matrix shape:", X.shape)
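The handcrafted part of this example can be isolated into a small function, which also makes it easy to guard against empty strings (dividing by a length of zero would otherwise raise an error). This is a minimal sketch under that assumption; handcrafted_features is a hypothetical helper name.

```python
import re
import numpy as np

# Matches any character that is neither a word character nor whitespace
PUNCT = re.compile(r"[^\w\s]")

def handcrafted_features(texts):
    """Return one row per text: [word count, punctuation ratio]."""
    rows = []
    for t in texts:
        n_words = len(t.split())
        punct_ratio = len(PUNCT.findall(t)) / len(t) if t else 0.0
        rows.append([n_words, punct_ratio])
    return np.array(rows, dtype=float)

feats = handcrafted_features(["Mars mission 2024!", "New electric car model launched.", ""])
print(feats.shape)  # (3, 2)
```

The resulting array can be stacked next to the embeddings with np.hstack. Because word counts live on a very different scale from embedding values, standardizing these columns first may help scale-sensitive models.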
Wrapping Up
We explored seven advanced feature-engineering tricks that help extract more information from raw text, going beyond LLM-generated embeddings alone. These practical strategies can improve downstream machine learning models that take text as input by capturing complementary lexical, semantic, relational, and handcrafted signals.


