Santosh Borse

Senior Research Engineer @ IBM Research

Phone: 914-325-9154 | Email: sanborse@gmail.com

Profile Summary

Senior Research Engineer at IBM Research, preparing data for training and post-training IBM's large language models
Maintainer of the open-source project Data Prep Kit
Lead an engineering team that produces high-quality LLMs through post-training and prepares high-quality data for post-training
Over 20 years of experience in Big Data, Artificial Intelligence, speech technologies, cloud software development, Java/J2EE, databases, Business Intelligence, open source, Kubernetes, and a varied set of other technologies
Co-founded KreedaStar, a platform connecting professionals in the sports industry
Hold 11 patents granted by the USPTO and have authored multiple research papers

Education

Master of Science (Computer Science) - University of Pune, India - June 2005
Bachelor of Science (Computer Science) - North Maharashtra University, India - June 2003

Experience

IBM Research - Senior Research Engineer (Apr 2024 – Present)

Yorktown Heights, NY, USA

Design and develop scalable data processing libraries capable of handling hundreds of terabytes of unstructured and structured data required for IBM Granite Model training
Conduct ablation studies to evaluate the impact of data preparation techniques on downstream AI model performance
Curate and release high-quality, large-scale open datasets to support the broader AI research community
Research and implement innovative data transformation techniques to enhance the quality, diversity, and utility of training data for LLMs
Engage with the wider open-source community to increase awareness and adoption of Data Prep Kit
Deliver technical talks and tutorials at open-source conferences and meetups

Achievements:

Developed and deployed over 10 robust data processing pipelines, enabling scalable handling of massive datasets (100+ TB)
Open-sourced the Data Prep Kit, a modular framework for high-quality data processing in LLM training workflows
Designed and generated 15+ high-quality synthetic datasets for post-training alignment and fine-tuning of large language models

IBM Watson Speech Services - Senior Software Engineer (May 2021 – Apr 2024)

Yorktown Heights, NY, USA

Developed core services enabling speech-to-text (STT) and text-to-speech (TTS) capabilities for Watson Speech
Designed and implemented a Kubernetes Operator to integrate Watson Speech into Cloud Pak for Data
Led efforts to re-architect service components to improve cost-efficiency, performance, and scalability
Migrated Watson Speech services to modern Gen2 cloud clusters

Achievements:

Implemented the first Kubernetes operator for a Watson service from scratch, enabling the service to run in hybrid cloud environments
Implemented shared compute architecture, saving approximately 2,300 CPUs and 11 TB of memory
Delivered total infrastructure cost savings of approximately $2.35 million per year

IBM Watson Speech Services - Senior Software Engineer (Jun 2020 – May 2021)

Yorktown Heights, NY, USA

Led the design and development of a new cloud-native Text-to-Speech (TTS) service for the Korean language
Conducted comprehensive evaluation of commercial TTS engines and selected NeoSpeech
Successfully launched the Korean TTS service, contributing to the retention of a $1.5M sales deal

Projects

Data Prep Kit - Open-source modular framework for high-quality data processing in LLM training workflows
IBM Granite Models - Data preparation for large language model training and post-training
Watson Speech Services - Kubernetes operator and cloud-native speech services
KreedaStar - Co-founded platform connecting professionals in the sports industry

Publications

GneissWeb: Preparing High Quality Data for LLMs at Scale
Muted: Multilingual Targeted Offensive Speech Identification and Visualization

Patents

11 Granted Patents:

Vehicle information display and management - US9940089B2
Mobile device accident avoidance system - US10311304B2
Modifying a mood through selective feeding of content - US20180063064A1
Method for associating previously created social media data - US10298663B2
Determining relevancy of discussion topics - US10331683B2
Dynamically changing message classification and priority based on IOT device publication - US10659419B2
Redirecting blocked media content - US10706095B2
Method, medium, and system for automated hardware device replacement - US11049162B2
Determining an effect of a message on a personal brand based on future goals - US11636554
Automated individual security - US10798237
Real-time vs non-real time audio streaming - US11355139B2

Talks

Technical talks and tutorials delivered at open source conferences and meetups on topics including:

Data Prep Kit for LLM Training
Large-scale data processing for AI
Cloud-native AI services architecture

Contact

Email: sanborse@gmail.com
Phone: +1 914-325-9154
LinkedIn: linkedin.com/in/santoshborse
Blog: sanborse.medium.com

News & Updates

Latest updates and achievements (most recent first)

June 2025: Reviewer for IEEE Transactions on Computational Social Systems

Invited to serve as a reviewer for IEEE Transactions on Computational Social Systems, a premier journal covering computational aspects of social systems, social networks, and socially-oriented big data analytics.

March 2025: Judge at Westchester Science & Engineering Fair (WESEF)

Selected as a judge for the Westchester Science & Engineering Fair, evaluating innovative science and engineering projects by middle and high school students in the Westchester County region.

March 2025: Invited to Judging Panel for Hudson Valley Open Data Hackathon 2025

Invited to serve on the judging panel to assess projects for the seventh Hudson Valley Open Data Hackathon 2025, evaluating innovative data-driven solutions and applications developed by participants.

March 2024: Judge at Westchester Science & Engineering Fair (WESEF)

Served as a judge for the Westchester Science & Engineering Fair, assessing student projects and providing feedback to encourage young scientists and engineers in their research endeavors.

Jan 2022: AMERICAS Co-Chairperson of IDT Review Board for "Software – Messaging"

Appointed as AMERICAS Co-Chairperson of the Invention Disclosure and Technology (IDT) review board for the "Software – Messaging" category. Responsibilities include evaluating patent ideas submitted by inventors and voting on their potential for patent filing.

Jan 2022: Member of AIOps IDT Disclosure Evaluation Board

Selected as a member of the AIOps (Artificial Intelligence for IT Operations) Invention Disclosure and Technology evaluation board, contributing to the assessment of innovative patent disclosures in the AI operations domain.

Feb 2021: IBM Certification Exam Developer 2021

Earned IBM Certification as Exam Developer, demonstrating expertise in creating and evaluating technical certification assessments.

March 2013: Featured in "Social Media Analytics" Book

Name featured in the introduction of the O'Reilly book "Social Media Analytics" in recognition of contributions to the field of social media analytics and big data processing.

Experience

IBM Research - Senior Research Engineer (Apr 2024 – Present)

Yorktown Heights, NY, USA

Design and develop scalable data processing libraries capable of handling hundreds of terabytes of unstructured and structured data required for IBM Granite Model training
Conduct ablation studies to evaluate the impact of data preparation techniques on downstream AI model performance
Curate and release high-quality, large-scale open datasets, including GneissWeb
Prepare and refine high-quality datasets for both pretraining and post-training stages of LLM development
Research and implement innovative data transformation techniques to enhance training data for LLMs
Generate synthetic datasets for fine-tuning and post-training alignment of AI models
Contribute to and maintain open-source initiatives including Data Prep Kit
Engage with the wider open-source community to increase awareness and adoption
Deliver technical talks and tutorials at open-source conferences and meetups
Author research papers to engage with the wider AI research community
Post-train large language models using the latest datasets and research techniques

Achievements:

Developed and deployed over 10 robust data processing pipelines, enabling scalable handling of massive datasets (100+ TB)
Open-sourced the Data Prep Kit, a modular framework for high-quality data processing in LLM training workflows
Serving as a core maintainer of the Data Prep Kit open-source repository
Designed and generated 15+ high-quality synthetic datasets for post-training alignment and fine-tuning

Tools & Technologies: Deep Learning, AI, Data Engineering, Model Training, RITS, Java, Synthetic Data Generation, Python, Ray, Kubeflow, Kubernetes, OpenShift, Supervised Fine-Tuning, Granite.Build, Data Prep Kit, Open Instruct, Hugging Face TRL

IBM Watson Speech Services - Senior Software Engineer (May 2021 – Apr 2024)

Yorktown Heights, NY, USA

Developed core services enabling speech-to-text (STT) and text-to-speech (TTS) capabilities for Watson Speech
Designed and implemented a Kubernetes Operator to integrate Watson Speech into Cloud Pak for Data
Led efforts to re-architect service components to improve cost-efficiency, performance, and scalability
Migrated Watson Speech services to modern Gen2 cloud clusters, enhancing reliability and observability
Upgraded foundational components in the speech processing stack
Provided technical leadership, mentoring developers and guiding architectural decisions
Debugged and fixed issues reported by clients and users of the service

Achievements:

Implemented the first Kubernetes operator for a Watson service from scratch, enabling hybrid cloud deployment
Implemented shared compute architecture, saving approximately 2,300 CPUs and 11 TB of memory
Replaced the in-house speech routing engine with an open-source solution, delivering a 20% improvement in load balancing
Saved 60 FTE days per year of deployment and maintenance effort
Delivered total infrastructure cost savings of approximately $2.35 million per year (7% of overall infrastructure costs)

Tools & Technologies: RNNT Models, Java, Kubernetes, Python, Node.js, Envoy, HAProxy, IBM Cloud, Docker

IBM Watson Speech Services - Senior Software Engineer (Jun 2020 – May 2021)

Yorktown Heights, NY, USA

Led the design and development of a new cloud-native Text-to-Speech (TTS) service for the Korean language
Conducted comprehensive evaluation of commercial TTS engines and selected NeoSpeech
Built and deployed the service using IBM Cloud, Java, REST APIs, Kubernetes, Docker
Ensured seamless integration with the broader Watson AI platform

Achievements:

Successfully launched the Korean TTS service under tight deadlines, contributing to the retention of a $1.5M sales deal

Tools & Technologies: IBM Cloud, Java, REST APIs, Kubernetes, Docker, Node.js, Python, MongoDB, WAS Liberty

IBM Watson Platform Service - Technical Lead (Jan 2017 – Jun 2020)

Yorktown Heights, NY, USA

Designed and developed common platform services to support multiple Watson AI offerings
Led the architecture and deployment of shared service infrastructure for Watson Visual Recognition
Built and integrated a Billing and Metering system for Watson services
Developed and maintained CI/CD pipelines for seamless automated integration and deployment
Designed and implemented a Multi-Cluster Proxy for high availability and geo-distributed service delivery
Led a team of 4 developers, assigning and scheduling work to achieve milestones

Tools & Technologies: Cloud native development, Microservices, IBM Cloud, Kubernetes, Java, REST, Docker, Node.js, WAS Liberty, Python, MongoDB

IBM Social Dashboards - Technical Lead/Architect (Jan 2015 – Jan 2017)

Somers, NY, USA

Developed the software stack architecture used to implement the dashboards' functionality
Architected, designed, and developed big data applications
Developed processing logic for real-time and non-real-time data using Pig Latin/Java/MapReduce
Worked on the adoption of "Social Dashboard" across IBM
Installed and monitored the Hadoop Data Platform cluster, the Cassandra & DataStax cluster, and the Ambari cluster
Managed the technical team and ran agile scrum development

Tools & Technologies: Java, Hadoop, Node.js, Microservices, Cassandra, DataStax, Hadoop Data Platform, HBase, HDFS

IBM CIO Social Media Analytics Services - Technical Lead (Mar 2011 – Jan 2015)

Somers, NY, USA (Employer: Persistent Systems)

Implemented Social Media Analytics services
Provided technical leadership to the team
Processed large social media data using MapReduce algorithms to extract analytics
Configured and managed a multi-node Big Data cluster
Architected solutions and developed POCs
Developed data ingestion utilities for various data sources
Set up a Linux-based clustered environment with 15+ servers, ensuring 100% availability
Provided analytics services to 15+ teams
Developed Twitter Data Syndication Service handling ~40K tweets per hour from GNIP

Skills

Technologies: Python, Java, Spark, Ray, Kubernetes, Docker, OpenShift, IBM Cloud, AWS

AI/ML: Deep Learning, LLM Training, Synthetic Data Generation, Model Post-Training, Supervised Fine Tuning

Big Data: Hadoop, MapReduce, Cassandra, HBase, HDFS, Data Processing Pipelines

Cloud: Cloud Native Development, Microservices, Kubernetes Operators, CI/CD

Projects

Data Prep Kit

Open-source modular framework for LLM data processing

Core maintainer of the Data Prep Kit open-source repository
Developed scalable data processing libraries handling 100+ TB of data
Created modular framework for high-quality data processing in LLM training workflows
Designed pipelines for both pretraining and post-training stages

IBM Granite Models - Data Preparation

Large-scale data curation for IBM's flagship LLM

Prepared and refined datasets for IBM Granite Model training
Conducted ablation studies on data preparation impact
Released GneissWeb, a large-scale open dataset
Generated 15+ high-quality synthetic datasets for post-training alignment

Watson Speech Kubernetes Operator

First Watson service operator enabling hybrid cloud deployment

Designed and implemented Kubernetes Operator from scratch
Enabled Watson Speech integration into Cloud Pak for Data
Facilitated deployment in hybrid and private cloud environments
Implemented shared compute architecture saving 2,300 CPUs and 11 TB memory

Korean TTS Service for Watson

Cloud-native text-to-speech service for the Korean language

Led end-to-end design and development
Evaluated and integrated NeoSpeech TTS engine
Contributed to retention of $1.5M sales deal
Built using cloud-native architecture on IBM Cloud

KreedaStar (Co-founder)

Social platform for sports industry professionals

Co-founded platform connecting professionals in the sports industry
Created a "LinkedIn for Sports": a networking platform for sports professionals

Twitter Data Syndication Service

Real-time Twitter data processing and distribution

Developed shared service processing ~40K tweets per hour from GNIP
Stored tweets in real time to a file-based store and MongoDB
Provided a REST API for teams to access syndicated Twitter data
Supported 15+ internal teams with social media analytics

Social Media Analytics Platform

Enterprise-scale social media analysis infrastructure

Processed large-scale social media data (200GB+) from Twitter, GNIP, BoardReader
Built data ingestion interfaces for 15+ HTML sites, RSS/CSV/XML feeds, DB2
Developed sentiment analytics, deduplication, and summarization techniques
Managed multi-node cluster environment with 15+ servers

Publications

Peer-Reviewed Papers

GneissWeb: Preparing High Quality Data for LLMs at Scale

A comprehensive study on large-scale data preparation techniques for training large language models.

Available on arXiv.

Muted: Multilingual Targeted Offensive Speech Identification and Visualization

Research on identifying and visualizing offensive speech across multiple languages using machine learning techniques.

Research Interests

Large Language Model Training and Post-Training
Data-Centric AI and Data Quality for ML
Synthetic Data Generation
Speech Technologies
Social Media Analytics
Cloud-Native AI Systems

Academic Profile

View my complete publication list and citation metrics on Google Scholar

Patents

11 Granted Patents from USPTO

View all patents on Google Patents

Automotive & Safety

Vehicle information display and management - US9940089B2
Mobile device accident avoidance system - US10311304B2
Automated individual security - US10798237

Social Media & Content

Modifying a mood through selective feeding of content - US20180063064A1
Method for associating previously created social media data - US10298663B2
Determining relevancy of discussion topics - US10331683B2
Redirecting blocked media content - US10706095B2
Determining an effect of a message on a personal brand based on future goals - US11636554

IoT & Systems

Dynamically changing message classification and priority based on IOT device publication - US10659419B2
Method, medium, and system for automated hardware device replacement - US11049162B2

Audio & Streaming

Real-time vs non-real time audio streaming - US11355139B2

Patent Areas

My patent portfolio spans multiple technology domains including:

Artificial Intelligence and Machine Learning
Social Media Analytics
Internet of Things (IoT)
Automotive Technology
Audio/Video Streaming
Security and Privacy
Content Management and Personalization

Talks

As part of my role at IBM Research and involvement in the open-source community, I regularly deliver technical talks and tutorials at conferences, meetups, and community events.

Talk Topics

Data Prep Kit for LLM Training

Introducing the open-source Data Prep Kit framework and demonstrating how to prepare high-quality data for large language model training at scale.

Large-Scale Data Processing for AI

Best practices and techniques for processing hundreds of terabytes of unstructured data for AI model training, including lessons learned from IBM Granite.

Cloud-Native AI Services Architecture

Designing and deploying scalable, cloud-native AI services using Kubernetes, focusing on cost optimization and reliability.

GneissWeb: High-Quality Data Curation

Presenting the GneissWeb dataset and discussing methodologies for curating large-scale, high-quality training data for LLMs.

Synthetic Data Generation for LLM Post-Training

Techniques and approaches for generating synthetic datasets for fine-tuning and alignment of large language models.