Santosh Borse

Senior Research Engineer @ IBM Research

Phone: 914-325-9154 | Email: sanborse@gmail.com

Profile Summary

Education

News & Updates

Latest updates and achievements (most recent first)

June 2025: Reviewer for IEEE Transactions on Computational Social Systems

Invited to serve as a reviewer for IEEE Transactions on Computational Social Systems, a premier journal covering computational aspects of social systems, social networks, and socially-oriented big data analytics.

March 2025: Judge at Westchester Science & Engineering Fair (WESEF)

Selected as a judge for the Westchester Science & Engineering Fair, evaluating innovative science and engineering projects by middle and high school students in the Westchester County region.

March 2025: Invited to Judging Panel for Hudson Valley Open Data Hackathon 2025

Invited to serve on the judging panel for the seventh Hudson Valley Open Data Hackathon (2025), evaluating data-driven solutions and applications developed by participants.

March 2024: Judge at Westchester Science & Engineering Fair (WESEF)

Served as a judge for the Westchester Science & Engineering Fair, assessing student projects and providing feedback to encourage young scientists and engineers in their research endeavors.

Jan 2022: AMERICAS Co-Chairperson of IDT Review Board for "Software – Messaging"

Appointed as AMERICAS Co-Chairperson of the Invention Disclosure and Technology (IDT) review board for the "Software – Messaging" category. Responsibilities include evaluating patent ideas submitted by inventors and voting on their potential for patent filing.

Jan 2022: Member of AIOps IDT Disclosure Evaluation Board

Selected as a member of the AIOps (Artificial Intelligence for IT Operations) Invention Disclosure and Technology evaluation board, contributing to the assessment of innovative patent disclosures in the AI operations domain.

Feb 2021: IBM Certification Exam Developer 2021

Earned IBM certification as an Exam Developer, demonstrating expertise in creating and evaluating technical certification assessments.

View Credential on Credly →

March 2013: Featured in "Social Media Analytics" Book

Named in the introduction of the O'Reilly book "Social Media Analytics" in recognition of contributions to social media analytics and big data processing.

View Book on O'Reilly →

Experience

IBM Research - Senior Research Engineer (Apr 2024 – Present)

Yorktown Heights, NY, USA

Achievements:

Tools & Technologies: Deep Learning, AI, Data Engineering, Model Training, RITS, Java, Synthetic Data Generation, Python, Ray, Kubeflow, Kubernetes, OpenShift, Supervised Fine-Tuning, Granite.Build, Data Prep Kit, Open Instruct, Hugging Face TRL

IBM Watson Speech Services - Senior Software Engineer (May 2021 – Apr 2024)

Yorktown Heights, NY, USA

Achievements:

Tools & Technologies: RNNT Models, Java, Kubernetes, Python, Node.js, Envoy, HAProxy, IBM Cloud, Docker

IBM Watson Speech Services - Senior Software Engineer (Jun 2020 – May 2021)

Yorktown Heights, NY, USA

Achievements:

Tools & Technologies: IBM Cloud, Java, REST APIs, Kubernetes, Docker, Node.js, Python, MongoDB, WAS Liberty

IBM Watson Platform Service - Technical Lead (Jan 2017 – Jun 2020)

Yorktown Heights, NY, USA

Tools & Technologies: Cloud-native development, Microservices, IBM Cloud, Kubernetes, Java, REST, Docker, Node.js, WAS Liberty, Python, MongoDB

IBM Social Dashboards - Technical Lead/Architect (Jan 2015 – Jan 2017)

Somers, NY, USA

Tools & Technologies: Java, Hadoop, Node.js, Microservices, Cassandra, DataStax Hadoop Data Platform, HBase, HDFS

IBM CIO Social Media Analytics Services - Technical Lead (Mar 2011 – Jan 2015)

Somers, NY, USA (Employer: Persistent Systems)

Skills

Technologies: Python, Java, Spark, Ray, Kubernetes, Docker, OpenShift, IBM Cloud, AWS

AI/ML: Deep Learning, LLM Training, Synthetic Data Generation, Model Post-Training, Supervised Fine-Tuning

Big Data: Hadoop, MapReduce, Cassandra, HBase, HDFS, Data Processing Pipelines

Cloud: Cloud Native Development, Microservices, Kubernetes Operators, CI/CD

Projects

Data Prep Kit

Open-source modular framework for LLM data processing

IBM Granite Models - Data Preparation

Large-scale data curation for IBM's flagship LLM

Watson Speech Kubernetes Operator

First Watson service operator enabling hybrid cloud deployment

Korean TTS Service for Watson

Cloud-native text-to-speech service for Korean language

KreedaStar (Co-founder)

Social platform for sports industry professionals

Twitter Data Syndication Service

Real-time Twitter data processing and distribution

Social Media Analytics Platform

Enterprise-scale social media analysis infrastructure

Publications

Peer-Reviewed Papers

GneissWeb: Preparing High Quality Data for LLMs at Scale

A comprehensive study on large-scale data preparation techniques for training large language models.

Read on arXiv →

Muted: Multilingual Targeted Offensive Speech Identification and Visualization

Research on identifying and visualizing offensive speech across multiple languages using machine learning techniques.

Research Interests

Academic Profile

View my complete publication list and citation metrics on Google Scholar

Patents

11 Granted Patents from USPTO

View all patents on Google Patents

Automotive & Safety

Social Media & Content

IoT & Systems

Audio & Streaming

Patent Areas

My patent portfolio spans multiple technology domains including:

Talks

As part of my role at IBM Research and my involvement in the open-source community, I regularly deliver technical talks and tutorials at conferences, meetups, and community events.

Talk Topics

Data Prep Kit for LLM Training

Introducing the open-source Data Prep Kit framework and demonstrating how to prepare high-quality data for large language model training at scale.

Large-Scale Data Processing for AI

Best practices and techniques for processing hundreds of terabytes of unstructured data for AI model training, including lessons learned from IBM Granite.

Cloud-Native AI Services Architecture

Designing and deploying scalable, cloud-native AI services using Kubernetes, focusing on cost optimization and reliability.

GneissWeb: High-Quality Data Curation

Presenting the GneissWeb dataset and discussing methodologies for curating large-scale, high-quality training data for LLMs.

Synthetic Data Generation for LLM Post-Training

Techniques and approaches for generating synthetic datasets for fine-tuning and alignment of large language models.

Conference & Community Engagement

Speaking Interests

I'm available to speak on topics including:

For speaking engagements, please contact me.

Contact

I'm always interested in connecting with fellow researchers, engineers, and professionals working on AI, data engineering, and related technologies.

Get in Touch

Email

sanborse@gmail.com

Phone

+1 914-325-9154

LinkedIn

linkedin.com/in/santoshborse

Google Scholar

View my publications and citations

Patents

View my 11 granted patents

Blog

sanborse.medium.com

GitHub

Data Prep Kit - Open Source Project

Location

Yorktown Heights, NY, USA

Professional Interests

I'm particularly interested in discussions around:

Currently

Senior Research Engineer @ IBM Research

Working on data preparation for IBM's large language model training and post-training. Core maintainer of the open-source Data Prep Kit project.