I am a Principal Applied Scientist at Amazon AGI, and I collaborate closely with AWS AI Labs. Before this, I was an applied scientist at Amazon Halo, where I worked on problems at the intersection of Computer Vision and Health.
Prior to Amazon, I was a Principal Computer Vision Researcher at Magic Leap. I received my Ph.D. from the School of Interactive Computing at the Georgia Institute of Technology, where I was advised by Professor Henrik I. Christensen and Professor Frank Dellaert. I completed my Bachelor's and Master's in Computer Science at IIIT Hyderabad, where I was advised by Prof. P. J. Narayanan.
What I’m building toward —
The Multimodal Mind
The next frontier in AI is a single model that understands the physical world well enough to imagine its futures and act to shape them — a natively multimodal foundation model that perceives, reasons, and generates across all modalities (image, video, text, speech, action), organized around a three-stage cognitive loop: Perceive → Imagine → Act. Perceive ingests the current state across modalities in a unified space. Imagine thinks in multimodal space — mental video of the planned trajectory, predicted proprioception, anticipated contact forces — and this imagination is the world model. Act generates whatever output the task requires (motor commands, keyboard/mouse, speech), conditioned on the imagined plan. Under this view, controllable video generation, computer use, robotics, world simulation, autonomous driving, and creative tools are all different routes through the same architecture — and improvements to the shared backbone benefit every task simultaneously. [full writeup]
Tech-lead for Nova 2.0 Lite: Core-team member who designed the multimodal architecture and finalized the pretraining data mix and recipe, integrating vision, speech, and language understanding at scale. Achieves state-of-the-art performance on 13 of 15 benchmarks vs. Claude Haiku 4.5.
Nova 2.0 Omni pre-training: Developed a unified multimodal architecture integrating native image generation with text, vision, and speech understanding. Designed a pre-training recipe that enables generative capabilities while preserving performance across all input modalities. Check out the models at nova.amazon.com/chat.
Tech-lead on the multimodal pre-training team responsible for developing image/video pretraining recipes for the Amazon Nova foundation models. Achieved state-of-the-art performance on multimodal understanding benchmarks against Claude 3 Haiku and Gemini 1.5 Pro. The family includes Nova Pro, Nova Lite, Nova Micro, Nova Canvas, and Nova Reel.
We address hallucination in Generative Vision-Language Models by introducing Multi-Modal Mutual-Information Decoding (M3ID), which amplifies the influence of reference images, reducing hallucinated responses by up to 28% without compromising linguistic fluency.
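The decoding-time idea can be illustrated with a small sketch. This is an assumed contrastive form, not the paper's exact objective: tokens whose likelihood rises when the image is present get boosted relative to tokens driven by the language prior alone, and the `gamma` weight and both logit arrays here are hypothetical.

```python
import numpy as np

def image_contrastive_logits(logits_with_image, logits_text_only, gamma=0.5):
    """Sketch of mutual-information-style decoding (assumed form):
    amplify tokens whose logit increases when the image is conditioned on,
    damping purely language-prior (hallucination-prone) continuations."""
    return logits_with_image + gamma * (logits_with_image - logits_text_only)

# Toy two-token vocabulary: token 0 is supported by the image,
# token 1 is favored only by the language prior.
lw = np.array([2.0, 1.0])   # logits with the image in context
lt = np.array([2.0, 2.0])   # logits from text alone
adj = image_contrastive_logits(lw, lt)
```

After adjustment, the margin of the image-supported token over the prior-driven one widens, which is the intended hallucination-reducing effect.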
RAVEN is a multi-task retrieval-augmented vision-language model framework that improves performance on various tasks through efficient fine-tuning without requiring additional parameters.
Propose MeasureNet, a CNN-based model for accurately and reliably predicting body measurements and waist-to-hip ratio. The model is trained on a realistic synthetic dataset.
A fully articulated Gaussian splatting model for human avatars. Our model includes both rigid and non-rigid skinning components, and a Neural Color Field for implicit color regularization.
A two-stage transformer-based system that analyzes real-time fitness videos using pose sequences to detect exercise repetitions, classify movements, and identify form errors with severity levels.
Propose a distributed implementation of two-stage pose graph optimization, using Successive Over-Relaxation (SOR) and Jacobi Over-Relaxation (JOR) as workhorses to split the computation among the robots. Extends it to work with object-based map models.
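As a rough illustration of why JOR-style updates distribute well, here is a minimal sketch on a hypothetical toy linear system (not the paper's pose-graph formulation): each unknown is updated from the previous iterate only, so rows of the system can be assigned to different robots and updated independently within a sweep.

```python
import numpy as np

def jor(A, b, omega=0.8, iters=200):
    """Jacobi Over-Relaxation for A x = b. Every component of x is
    refreshed from the *previous* iterate, so the per-row updates are
    embarrassingly parallel across workers (here, robots)."""
    x = np.zeros_like(b, dtype=float)
    D = np.diag(A)            # diagonal entries
    R = A - np.diag(D)        # off-diagonal remainder
    for _ in range(iters):
        x = (1.0 - omega) * x + omega * (b - R @ x) / D
    return x

# Toy diagonally dominant system standing in for a pose-graph solve.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = jor(A, b)
```

SOR differs in that it reuses already-updated components within a sweep, trading some parallelism for faster convergence.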
Multi-robot SLAM approach that uses 3D objects as landmarks for localization and mapping. Leverages local computation at each robot (e.g., object detection and object pose estimation) to reduce the communication burden.
Leverages recent results showing that the maximum-likelihood trajectory is well approximated by a sequence of two quadratic subproblems, and solves them in a distributed manner using the distributed Gauss-Seidel (DGS) algorithm.
Information-theoretic algorithm to efficiently reduce the number of landmarks and poses in a SLAM estimate without compromising the accuracy of the estimated trajectory.
Propose an approach for online object discovery and object modeling, and extend a SLAM system to utilize these discovered and modeled objects as landmarks to help localize the robot in an online manner.
Structure from Motion / GPU Computing
CPU and/or GPU: Revisiting the GPU Vs. CPU Myth
Kishore Kothapalli, Dip Sankar Banerjee, P. J. Narayanan, Surinder Sood, Aman Kumar Bahl, Shashank Sharma, Shrenik Lad, Krishna Kumar Singh, Kiran Matam, Sivaramakrishna Bharadwaj, Rohit Nigam, Parikshit Sakurikar, Aditya Deshpande, Ishan Misra, Siddharth Choudhary, Shubham Gupta
arXiv, 2013
arXiv