Profile Summary
- Senior Engineer at IBM Research, preparing data for training and post-training of IBM's large language models
- Maintainer of the open-source project Data Prep Kit
- Lead an engineering team that produces high-quality LLMs through post-training and prepares high-quality post-training data
- Over 20 years of experience in Big Data, Artificial Intelligence, speech technologies, cloud software development, Java J2EE, databases, Business Intelligence, open source, Kubernetes, and a varied set of other technologies
- Co-founded KreedaStar, a platform connecting professionals in the sports industry
- Hold 11 granted patents from the USPTO and have authored multiple research papers
Education
- Master of Science (Computer Science) - University of Pune, India - June 2005
- Bachelor of Science (Computer Science) - North Maharashtra University, India - June 2003
Experience
IBM Research - Senior Research Engineer (Apr 2024 – Present)
Yorktown Heights, NY, USA
- Design and develop scalable data processing libraries capable of handling hundreds of terabytes of unstructured and structured data required for IBM Granite Model training
- Conduct ablation studies to evaluate the impact of data preparation techniques on downstream AI model performance
- Curate and release high-quality, large-scale open datasets to support the broader AI research community
- Research and implement innovative data transformation techniques to enhance the quality, diversity, and utility of training data for LLMs
- Engage with the wider open-source community to increase awareness and adoption of Data Prep Kit
- Deliver technical talks and tutorials at open-source conferences and meetups
Achievements:
- Developed and deployed over 10 robust data processing pipelines, enabling scalable handling of massive datasets (100+ TB)
- Open-sourced the Data Prep Kit, a modular framework for high-quality data processing in LLM training workflows
- Designed and generated 15+ high-quality synthetic datasets for post-training alignment and fine-tuning of large language models
IBM Watson Speech Services - Senior Software Engineer (May 2021 – Apr 2024)
Yorktown Heights, NY, USA
- Developed core services enabling speech-to-text (STT) and text-to-speech (TTS) capabilities for Watson Speech
- Designed and implemented a Kubernetes Operator to integrate Watson Speech into Cloud Pak for Data
- Led efforts to re-architect service components to improve cost-efficiency, performance, and scalability
- Migrated Watson Speech services to modern Gen2 cloud clusters
Achievements:
- Implemented the first Watson service Kubernetes operator from scratch, enabling the service to run in hybrid cloud environments
- Implemented shared compute architecture, saving approximately 2,300 CPUs and 11 TB of memory
- Delivered total infrastructure cost savings of approximately $2.35 million per year
IBM Watson Speech Services - Senior Software Engineer (Jun 2020 – May 2021)
Yorktown Heights, NY, USA
- Led the design and development of a new cloud-native Text-to-Speech (TTS) service for the Korean language
- Conducted comprehensive evaluation of commercial TTS engines and selected NeoSpeech
- Successfully launched the Korean TTS service, contributing to the retention of a $1.5M sales deal
Projects
- Data Prep Kit - Open-source modular framework for high-quality data processing in LLM training workflows
- IBM Granite Models - Data preparation for large language model training and post-training
- Watson Speech Services - Kubernetes operator and cloud-native speech services
- KreedaStar - Co-founded platform connecting professionals in the sports industry
Publications
Patents
11 Granted Patents:
- Vehicle information display and management - US9940089B2
- Mobile device accident avoidance system - US10311304B2
- Modifying a mood through selective feeding of content - US20180063064A1
- Method for associating previously created social media data - US10298663B2
- Determining relevancy of discussion topics - US10331683B2
- Dynamically changing message classification and priority based on IOT device publication - US10659419B2
- Redirecting blocked media content - US10706095B2
- Method, medium, and system for automated hardware device replacement - US11049162B2
- Determining an effect of a message on a personal brand based on future goals - US11636554
- Automated individual security - US10798237
- Real-time vs non-real time audio streaming - US11355139B2
Talks
Technical talks and tutorials delivered at open source conferences and meetups on topics including:
- Data Prep Kit for LLM Training
- Large-scale data processing for AI
- Cloud-native AI services architecture
Contact
News & Updates
Latest updates and achievements (most recent first)
June 2025: Reviewer for IEEE Transactions on Computational Social Systems
Invited to serve as a reviewer for IEEE Transactions on Computational Social Systems, a premier journal covering computational aspects of social systems, social networks, and socially-oriented big data analytics.
March 2025: Judge at Westchester Science & Engineering Fair (WESEF)
Selected as a judge for the Westchester Science & Engineering Fair, evaluating innovative science and engineering projects by middle and high school students in the Westchester County region.
March 2025: Invited to Judging Panel for Hudson Valley Open Data Hackathon 2025
Invited to serve on the judging panel to assess projects for the seventh Hudson Valley Open Data Hackathon 2025, evaluating innovative data-driven solutions and applications developed by participants.
March 2024: Judge at Westchester Science & Engineering Fair (WESEF)
Served as a judge for the Westchester Science & Engineering Fair, assessing student projects and providing feedback to encourage young scientists and engineers in their research endeavors.
Jan 2022: AMERICAS Co-Chairperson of IDT Review Board for "Software – Messaging"
Appointed as AMERICAS Co-Chairperson of the Invention Disclosure and Technology (IDT) review board for the "Software – Messaging" category. Responsibilities include evaluating patent ideas submitted by inventors and voting on their potential for patent filing.
Jan 2022: Member of AIOps IDT Disclosure Evaluation Board
Selected as a member of the AIOps (Artificial Intelligence for IT Operations) Invention Disclosure and Technology evaluation board, contributing to the assessment of innovative patent disclosures in the AI operations domain.
Feb 2021: IBM Certification Exam Developer 2021
Earned IBM Certification as Exam Developer, demonstrating expertise in creating and evaluating technical certification assessments.
March 2013: Featured in "Social Media Analytics" Book
Name featured in the introduction of the O'Reilly book "Social Media Analytics" in recognition of contributions to the field of social media analytics and big data processing.
Experience
IBM Research - Senior Research Engineer (Apr 2024 – Present)
Yorktown Heights, NY, USA
- Design and develop scalable data processing libraries capable of handling hundreds of terabytes of unstructured and structured data required for IBM Granite Model training
- Conduct ablation studies to evaluate the impact of data preparation techniques on downstream AI model performance
- Curate and release high-quality, large-scale open datasets (GneissWeb)
- Prepare and refine high-quality datasets for both pretraining and post-training stages of LLM development
- Research and implement innovative data transformation techniques to enhance training data for LLMs
- Generate synthetic datasets for fine-tuning and post-training alignment of AI models
- Contribute to and maintain open-source initiatives including Data Prep Kit
- Engage with the wider open-source community to increase awareness and adoption
- Deliver technical talks and tutorials at open-source conferences and meetups
- Author research papers to engage with the wider AI research community
- Post-train large language models using the latest datasets and research techniques
Achievements:
- Developed and deployed over 10 robust data processing pipelines, enabling scalable handling of massive datasets (100+ TB)
- Open-sourced the Data Prep Kit, a modular framework for high-quality data processing in LLM training workflows
- Serving as a core maintainer of the Data Prep Kit open-source repository
- Designed and generated 15+ high-quality synthetic datasets for post-training alignment and fine-tuning
Tools & Technologies: Deep Learning, AI, Data Engineering, Model Training, RITS, Java, Synthetic Data Generation, Python, Ray, Kubeflow, Kubernetes, OpenShift, Supervised Fine Tuning, Granite.Build, Data Prep Kit, Open Instruct, Hugging Face TRL
IBM Watson Speech Services - Senior Software Engineer (May 2021 – Apr 2024)
Yorktown Heights, NY, USA
- Developed core services enabling speech-to-text (STT) and text-to-speech (TTS) capabilities for Watson Speech
- Designed and implemented a Kubernetes Operator to integrate Watson Speech into Cloud Pak for Data
- Led efforts to re-architect service components to improve cost-efficiency, performance, and scalability
- Migrated Watson Speech services to modern Gen2 cloud clusters, enhancing reliability and observability
- Upgraded foundational components in the speech processing stack
- Provided technical leadership, mentoring developers and guiding architectural decisions
- Debugged and fixed issues reported by clients and users of the service
Achievements:
- Implemented the first Watson service Kubernetes operator from scratch, enabling hybrid cloud deployment
- Implemented shared compute architecture, saving approximately 2,300 CPUs and 11 TB of memory
- Replaced the in-house speech routing engine with an open-source solution, delivering a 20% improvement in load balancing
- Saved 60 FTE days per year of deployment and maintenance effort
- Delivered total infrastructure cost savings of approximately $2.35 million per year (7% of overall infrastructure costs)
Tools & Technologies: RNNT Models, Java, Kubernetes, Python, NodeJS, Envoy, HAProxy, IBM Cloud, Docker
IBM Watson Speech Services - Senior Software Engineer (Jun 2020 – May 2021)
Yorktown Heights, NY, USA
- Led the design and development of a new cloud-native Text-to-Speech (TTS) service for the Korean language
- Conducted comprehensive evaluation of commercial TTS engines and selected NeoSpeech
- Built and deployed the service using IBM Cloud, Java, REST APIs, Kubernetes, Docker
- Ensured seamless integration with the broader Watson AI platform
Achievements:
- Successfully launched the Korean TTS service under tight deadlines, contributing to the retention of a $1.5M sales deal
Tools & Technologies: IBM Cloud, Java, REST APIs, Kubernetes, Docker, Node.js, Python, MongoDB, WAS Liberty
IBM Watson Platform Service - Technical Lead (Jan 2017 – Jun 2020)
Yorktown Heights, NY, USA
- Designed and developed common platform services to support multiple Watson AI offerings
- Led the architecture and deployment of shared service infrastructure for Watson Visual Recognition
- Built and integrated a Billing and Metering system for Watson services
- Developed and maintained CI/CD pipelines for seamless automated integration and deployment
- Designed and implemented a Multi-Cluster Proxy for high availability and geo-distributed service delivery
- Led a team of 4 developers, assigning and scheduling work to achieve milestones
Tools & Technologies: Cloud native development, Microservices, IBM Cloud, Kubernetes, Java, REST, Docker, NodeJS, WAS Liberty, Python, MongoDB
IBM Social Dashboards - Technical Lead/Architect (Jan 2015 – Jan 2017)
Somers, NY, USA
- Developed software stack architecture for implementing functionalities
- Architected, designed, and developed big data applications
- Developed processing logic for real-time and non-real-time data using Pig Latin/Java/MapReduce
- Worked on adoption of "Social Dashboard" across IBM
- Installed and monitored the Hadoop Data Platform, Cassandra & DataStax, and Ambari clusters
- Managed technical team, running agile scrum development
Tools & Technologies: Java, Hadoop, NodeJS, Microservices, Cassandra, DataStax, Hadoop Data Platform, HBase, HDFS
IBM CIO Social Media Analytics Services - Technical Lead (Mar 2011 – Jan 2015)
Somers, NY, USA (Employer: Persistent Systems)
- Implemented Social Media Analytics services
- Provided technical leadership to the team
- Processed large social media data using MapReduce algorithms to extract analytics
- Configured and managed multi-node Big Data cluster
- Architected solutions and developed POCs
- Developed data ingestion utilities for various data sources
- Set up a Linux-based clustered environment with 15+ servers, ensuring 100% availability
- Provided analytics services to 15+ teams
- Developed Twitter Data Syndication Service handling ~40K tweets per hour from GNIP
Skills
Technologies: Python, Java, Spark, Ray, Kubernetes, Docker, OpenShift, IBM Cloud, AWS
AI/ML: Deep Learning, LLM Training, Synthetic Data Generation, Model Post-Training, Supervised Fine Tuning
Big Data: Hadoop, MapReduce, Cassandra, HBase, HDFS, Data Processing Pipelines
Cloud: Cloud Native Development, Microservices, Kubernetes Operators, CI/CD
Projects
Data Prep Kit
Open-source modular framework for LLM data processing
- Core maintainer of the Data Prep Kit open-source repository
- Developed scalable data processing libraries handling 100+ TB of data
- Created modular framework for high-quality data processing in LLM training workflows
- Designed pipelines for both pretraining and post-training stages
IBM Granite Models - Data Preparation
Large-scale data curation for IBM's flagship LLM
- Prepared and refined datasets for IBM Granite Model training
- Conducted ablation studies on data preparation impact
- Released GneissWeb, a large-scale open dataset
- Generated 15+ high-quality synthetic datasets for post-training alignment
Watson Speech Kubernetes Operator
First Watson service operator enabling hybrid cloud deployment
- Designed and implemented Kubernetes Operator from scratch
- Enabled Watson Speech integration into Cloud Pak for Data
- Facilitated deployment in hybrid and private cloud environments
- Implemented shared compute architecture saving 2,300 CPUs and 11 TB memory
Korean TTS Service for Watson
Cloud-native text-to-speech service for Korean language
- Led end-to-end design and development
- Evaluated and integrated NeoSpeech TTS engine
- Contributed to retention of $1.5M sales deal
- Built using cloud-native architecture on IBM Cloud
KreedaStar (Co-founder)
Social platform for sports industry professionals
- Co-founded platform connecting professionals in the sports industry
- Created a "LinkedIn for Sports" networking platform for sports professionals
Twitter Data Syndication Service
Real-time Twitter data processing and distribution
- Developed shared service processing ~40K tweets per hour from GNIP
- Real-time tweet storage to file-based store and MongoDB
- REST API for teams to access syndicated Twitter data
- Supported 15+ internal teams with social media analytics
Social Media Analytics Platform
Enterprise-scale social media analysis infrastructure
- Processed large-scale social media data (200GB+) from Twitter, GNIP, BoardReader
- Built data ingestion interfaces for 15+ HTML sites, RSS/CSV/XML feeds, DB2
- Developed sentiment analytics, deduplication, and summarization techniques
- Managed multi-node cluster environment with 15+ servers
Publications
Peer-Reviewed Papers
GneissWeb: Preparing High Quality Data for LLMs at Scale
A comprehensive study on large-scale data preparation techniques for training large language models.
Read on arXiv →
Muted: Multilingual Targeted Offensive Speech Identification and Visualization
Research on identifying and visualizing offensive speech across multiple languages using machine learning techniques.
Research Interests
- Large Language Model Training and Post-Training
- Data-Centric AI and Data Quality for ML
- Synthetic Data Generation
- Speech Technologies
- Social Media Analytics
- Cloud-Native AI Systems
Academic Profile
View my complete publication list and citation metrics on Google Scholar
Talks
As part of my role at IBM Research and involvement in the open-source community, I regularly deliver technical talks and tutorials at conferences, meetups, and community events.
Talk Topics
Data Prep Kit for LLM Training
Introducing the open-source Data Prep Kit framework and demonstrating how to prepare high-quality data for large language model training at scale.
Large-Scale Data Processing for AI
Best practices and techniques for processing hundreds of terabytes of unstructured data for AI model training, including lessons learned from IBM Granite.
Cloud-Native AI Services Architecture
Designing and deploying scalable, cloud-native AI services using Kubernetes, focusing on cost optimization and reliability.
GneissWeb: High-Quality Data Curation
Presenting the GneissWeb dataset and discussing methodologies for curating large-scale, high-quality training data for LLMs.
Synthetic Data Generation for LLM Post-Training
Techniques and approaches for generating synthetic datasets for fine-tuning and alignment of large language models.
Conference & Community Engagement
- Open source conferences and meetups
- AI and machine learning community events
- Technical tutorials and workshops
- IBM Research technical presentations
Speaking Interests
I'm available to speak on topics including:
- LLM Training and Post-Training
- Data-Centric AI
- Large-Scale Data Engineering
- Cloud-Native Architecture
- Open Source AI Tools
- Speech Technologies
For speaking engagements, please contact me.