AI

A Coding Guide to Build a Scalable End-to-End Machine Learning Data Pipeline Using Daft for High-Performance Structured and Image Data Processing

AllTopicsToday
Published: March 6, 2026 | Last updated: March 6, 2026 2:33 am

This tutorial shows you how to use Daft as a high-performance Python-native data engine to build end-to-end analysis pipelines. We first load the real-world MNIST dataset and then transform it step by step using UDFs, feature engineering, aggregation, joins, and lazy execution. Along the way, it shows how structured data processing, numerical computation, and machine learning can be seamlessly combined. Finally, you go beyond manipulating data to building full model-aware pipelines that leverage Daft's scalable execution engine.

!pip -q install daft pyarrow pandas numpy scikit-learn

import os
os.environ["DO_NOT_TRACK"] = "true"

import numpy as np
import pandas as pd
import daft
from daft import col

print("Daft version:", getattr(daft, "__version__", "unknown"))

URL = "https://github.com/Eventual-Inc/mnist-json/raw/master/mnist_handwriting_test.json.gz"
df = daft.read_json(URL)

print("\nSchema (sample):")
print(df.schema())

print("\nPeek:")
df.show(5)

Install Daft and its supporting libraries directly into Google Colab to ensure a clean and reproducible environment. Configure optional settings and verify the installed version to confirm everything is working correctly. This establishes a stable foundation for building end-to-end data pipelines.

def to_28x28(pixels):
    arr = np.array(pixels, dtype=np.float32)
    if arr.size != 784:
        return None
    return arr.reshape(28, 28)

df2 = (
    df
    .with_column(
        "img_28x28",
        col("image").apply(to_28x28, return_dtype=daft.DataType.python())
    )
    .with_column(
        "pixel_mean",
        col("img_28x28").apply(
            lambda x: float(np.mean(x)) if x is not None else None,
            return_dtype=daft.DataType.float32()
        )
    )
    .with_column(
        "pixel_std",
        col("img_28x28").apply(
            lambda x: float(np.std(x)) if x is not None else None,
            return_dtype=daft.DataType.float32()
        )
    )
)

print("\nAfter reshaping + simple features:")
df2.select("label", "pixel_mean", "pixel_std").show(5)

Load the actual MNIST JSON dataset directly from a remote URL using Daft's native reader. Inspect the schema and preview the data to understand its structure and column types. This lets you validate the dataset before applying transformations or feature engineering.

@daft.udf(return_dtype=daft.DataType.list(daft.DataType.float32()), batch_size=512)
def featurize(images_28x28):
    out = []
    for img in images_28x28.to_pylist():
        if img is None:
            out.append(None)
            continue
        img = np.asarray(img, dtype=np.float32)
        row_sums = img.sum(axis=1) / 255.0
        col_sums = img.sum(axis=0) / 255.0
        total = img.sum() + 1e-6
        ys, xs = np.indices(img.shape)
        cy = float((ys * img).sum() / total) / 28.0
        cx = float((xs * img).sum() / total) / 28.0
        vec = np.concatenate([
            row_sums,
            col_sums,
            np.array([cy, cx, img.mean() / 255.0, img.std() / 255.0], dtype=np.float32),
        ])
        out.append(vec.astype(np.float32).tolist())
    return out

df3 = df2.with_column("features", featurize(col("img_28x28")))

print("\nFeature column created (list[float]):")
df3.select("label", "features").show(2)

Reshape the raw pixel array into a structured 28 × 28 image using a row-wise UDF. Enrich the dataset by computing statistical features such as the mean and standard deviation of each image. Applying these transformations turns raw image data into a representation suitable for structured models.
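The reshape-and-summarize step is plain NumPy underneath; here is a minimal standalone sketch of the same logic, using a synthetic flat pixel list rather than real MNIST data:

```python
import numpy as np

def to_28x28(pixels):
    # Reshape a flat list of 784 grayscale values into a 28x28 array,
    # returning None for malformed rows (mirrors the UDF used with Daft).
    arr = np.array(pixels, dtype=np.float32)
    if arr.size != 784:
        return None
    return arr.reshape(28, 28)

# Synthetic "image": a single bright pixel on a black background.
pixels = [0.0] * 784
pixels[14 * 28 + 14] = 255.0

img = to_28x28(pixels)
print(img.shape)              # (28, 28)
print(float(img.mean()))      # 255/784, roughly 0.3253
print(to_28x28([0.0] * 100))  # None: wrong length is rejected
```

Returning None for malformed rows (instead of raising) lets the downstream pipeline keep flowing and filter bad records later.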

label_stats = (
    df3.groupby("label")
    .agg(
        col("label").count().alias("n"),
        col("pixel_mean").mean().alias("mean_pixel_mean"),
        col("pixel_std").mean().alias("mean_pixel_std"),
    )
    .sort("label")
)

print("\nLabel distribution + summary statistics:")
label_stats.show(10)

df4 = df3.join(label_stats, on="label", how="left")

print("\nPer-row label statistics after the join:")
df4.select("label", "n", "mean_pixel_mean", "mean_pixel_std").show(5)

Implement a batched UDF to extract richer feature vectors from the reshaped images. Perform per-label aggregations and join the summary statistics back into the dataset to give each row additional context. This shows how scalable computation and advanced analytics can be combined within Daft.
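The aggregate-then-join pattern here is the same one pandas users know. A minimal pandas sketch of the idea, with tiny made-up rows standing in for the MNIST frame:

```python
import pandas as pd

# Tiny made-up frame standing in for df3.
df3 = pd.DataFrame({
    "label": [0, 0, 1],
    "pixel_mean": [10.0, 20.0, 30.0],
})

# Per-label summary, analogous to Daft's groupby(...).agg(...).
label_stats = (
    df3.groupby("label", as_index=False)
    .agg(n=("label", "count"), mean_pixel_mean=("pixel_mean", "mean"))
)

# Broadcast the summaries back onto every row, like Daft's left join.
df4 = df3.merge(label_stats, on="label", how="left")
print(df4)
```

A left join keeps every original row even if a label somehow had no summary, which is the safe default when enriching a dataset in place.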

small = df4.select("label", "features").collect().to_pandas()
small = small.dropna(subset=["label", "features"]).reset_index(drop=True)

X = np.vstack(small["features"].apply(np.array).values).astype(np.float32)
y = small["label"].astype(int).values

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000, n_jobs=None)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)

print("\nBaseline accuracy (LogisticRegression on engineered features):", round(acc, 4))
print("\nClassification report:")
print(classification_report(y_test, pred, digits=4))

out_df = df4.select("label", "features", "pixel_mean", "pixel_std", "n")
out_path = "/content/daft_mnist_features.parquet"
out_df.write_parquet(out_path)
print("\nWrote parquet to:", out_path)

df_back = daft.read_parquet(out_path)
print("\nReadback check:")
df_back.show(3)

Materialize the selected columns into pandas and train a baseline logistic regression model. We evaluate accuracy on a held-out split to verify the usefulness of the engineered features. Finally, we persist the processed dataset to Parquet, completing an end-to-end pipeline from raw data ingestion to production-ready storage.
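The train-and-evaluate step follows the standard scikit-learn pattern. A self-contained sketch on synthetic, well-separated two-class data (not the MNIST features) shows the shape of that loop:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, linearly separable two-class data standing in for (X, y).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-2.0, 0.5, (100, 4)),   # class 0 cluster
    rng.normal(2.0, 0.5, (100, 4)),    # class 1 cluster
]).astype(np.float32)
y = np.array([0] * 100 + [1] * 100)

# Stratified split so both classes appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print("accuracy:", acc)  # clusters are far apart, so this reaches 1.0
```

On the real MNIST features the accuracy will of course be lower; the point is the fit/predict/score scaffolding, which is identical.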

In this tutorial, you used Daft to build a production-style data workflow, moving from raw JSON ingestion to feature engineering, aggregation, model training, and Parquet persistence. We demonstrated how to integrate advanced UDF logic, perform efficient groupby and join operations, and materialize downstream machine learning results within a clean and scalable framework. Through this process, we saw how Daft can handle complex transformations while remaining Pythonic and efficient. We now have a reusable end-to-end pipeline that shows how modern data engineering and machine learning workflows can be combined in an integrated setting.

Check out the full code here. Also, feel free to follow us on Twitter, and don't forget to join the 120,000+ ML SubReddit and subscribe to our newsletter. Are you on Telegram? You can now join us on Telegram as well.

Michal Sutter is a data science professional with a master's degree in data science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
