I am an end-to-end AI engineer who builds across the stack, from autonomous AI agents to GPU kernels. I can architect new systems from scratch, and I am just as comfortable jumping into an existing codebase to optimize and maintain it.
I thrive in environments that demand velocity and rapid adaptation. Give me an ambiguous problem, and I will ship the solution.
01 Experience
Sep 2025 — Present
Backend & Systems Engineer
Easley Dunn Productions
Engineered a C# transaction subsystem in Unity governing card pack acquisition and player inventory state.
Implemented atomic purchase flows with rollback logic to guarantee economy consistency.
Reduced screen transition latency by 13% via a unified canvas architecture and decoupled event handling.
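A minimal Python sketch of an atomic purchase with rollback, to illustrate the idea behind the bullets above. It is not the production C# code: `PlayerState`, `purchase_pack`, and `grant_pack` are hypothetical names, and snapshot-and-restore is only one possible way to implement rollback.

```python
import copy
from dataclasses import dataclass, field


class PurchaseError(Exception):
    """Raised when a purchase cannot be completed."""


@dataclass
class PlayerState:
    # Hypothetical player record: a currency balance and owned packs.
    coins: int
    inventory: dict = field(default_factory=dict)  # pack name -> count


def purchase_pack(state, pack, price, grant):
    """Apply a purchase atomically: every step lands, or none do."""
    snapshot = copy.deepcopy(state)  # saved copy used for rollback
    try:
        if state.coins < price:
            raise PurchaseError("insufficient coins")
        state.coins -= price
        grant(state, pack)  # may raise, e.g. on a failed inventory write
    except Exception:
        # Restore the snapshot so coins and inventory stay consistent.
        state.coins = snapshot.coins
        state.inventory = snapshot.inventory
        raise


def grant_pack(state, pack):
    """Add one pack to the player's inventory."""
    state.inventory[pack] = state.inventory.get(pack, 0) + 1
```

If `grant` fails halfway through, the caller still sees the exception, but the player's balance and inventory are back to their pre-purchase values.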
Sep 2024 — May 2025
HPC Researcher
GRIDS @ University of Southern California
Shipped OTTER, a C++17 reverse-mode autodiff library with 16 operations.
Wrote both memory allocators from scratch: a CPU allocator on an mmap pool, and a CUDA allocator with pooled segments, OOM eviction, and a cudaMalloc retry before propagating the error.
Built the autograd engine: runtime graph construction, backward traversal, and gradient accumulation across 4 parallel workers with mutex-protected writes into shared leaf tensors.
Separated memory management and kernel dispatch into abstract interfaces: adding a backend requires no changes to Tensor or autograd code, proven across CPU and CUDA with full kernel parity.
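The core of reverse-mode autodiff (runtime graph construction, backward traversal, gradient accumulation) can be sketched in a few lines of Python. This is a scalar, single-threaded illustration only, not OTTER's C++17 code: the `Value` class is a hypothetical name, and the parallel workers and mutex-protected writes described above are left out.

```python
class Value:
    """Scalar node in a computation graph built as operations run."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self.parents = parents
        self.backward_fn = None  # pushes this node's grad to its parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out.backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out.backward_fn = backward_fn
        return out

    def backward(self):
        """Topologically sort the graph, then walk it in reverse."""
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0  # seed: d(output)/d(output)
        for node in reversed(order):
            if node.backward_fn is not None:
                node.backward_fn()
```

Gradients are accumulated with `+=` rather than assigned, so a leaf that feeds several operations receives the sum of all its contributions.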
Oct 2023 — Dec 2024
Teaching Assistant
University of Southern California
Mentored a cohort of 850+ graduate students across Database Systems and NLP.
Collaborated with faculty to standardize grading criteria, resulting in a 7% improvement in student outcomes.
Developed custom OpenAI Triton kernels across 7 progressive phases: from elementwise ops and reductions to optimized matrix multiply, FFT, convolution, and Flash Attention v2.
Achieved cuBLAS-parity (within 1%) on tiled matmul at N=4096.
Benchmarked every kernel against PyTorch/cuBLAS baselines using roofline-model analysis.
Applied SRAM tiling, L2-reuse group ordering, and memory coalescing.
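L2-reuse group ordering can be illustrated in plain Python. The sketch below follows the grouped program-ID mapping shown in Triton's matrix-multiplication tutorial; it is not the kernel code itself, and the function name, block sizes, and group size are assumptions chosen for illustration.

```python
def cdiv(a, b):
    """Ceiling division."""
    return (a + b - 1) // b


def grouped_tile(pid, M, N, block_m, block_n, group_m):
    """Map a linear program ID to an output tile (pid_m, pid_n).

    Instead of walking tiles row by row, consecutive program IDs cover a
    band of `group_m` tile-rows column by column, so programs that run
    close together in time reuse the same input blocks from L2.
    """
    num_pid_m = cdiv(M, block_m)
    num_pid_n = cdiv(N, block_n)
    num_pid_in_group = group_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_m
    # The last group may hold fewer than group_m tile-rows.
    group_size_m = min(num_pid_m - first_pid_m, group_m)
    pid_m = first_pid_m + (pid % num_pid_in_group) % group_size_m
    pid_n = (pid % num_pid_in_group) // group_size_m
    return pid_m, pid_n


# Hypothetical launch shape: N=4096 square matmul, 128x128 tiles, groups of 8.
first_eight = [grouped_tile(p, 4096, 4096, 128, 128, 8) for p in range(8)]
```

With these assumed sizes, the first eight programs all work on tile-column 0 across tile-rows 0 through 7, so they share the same block of the right-hand matrix.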
Constructed core kernel primitives for process scheduling, thread synchronization, and signal handling. Tested end-to-end in QEMU, debugging concurrency hazards across scheduling, VFS, and VM subsystems.
Enforced process isolation and memory protection via VFS and VM subsystems.
Enabled kernel-level security guarantees for fork, mmap, and open.
Devised a split-debugging strategy in QEMU to isolate critical concurrency bugs across scheduling and VFS.
Modular Denoising Diffusion Probabilistic Model trained on CIFAR-10. Implements a full UNet architecture with custom noise scheduling, variance-preserving forward process, and progressive denoising.
Built a configurable noise scheduler with linear and cosine beta schedules.
Implemented sinusoidal timestep embeddings and residual attention blocks.
Achieved class-conditional generation across all 10 CIFAR categories with FID 87.94.
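A minimal sketch of linear and cosine beta schedules in plain Python. The default values (`beta_start=1e-4`, `beta_end=0.02`, offset `s=0.008`, clip at `0.999`) are the commonly published DDPM and improved-DDPM settings and are assumptions here, not necessarily the ones this project used; the function names are hypothetical.

```python
import math


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Betas spaced evenly from beta_start to beta_end over T steps."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cosine_betas(T, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine cumulative signal level."""

    def alpha_bar(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2

    return [min(1 - alpha_bar(i + 1) / alpha_bar(i), max_beta) for i in range(T)]


def alpha_bars(betas):
    """Cumulative product of (1 - beta_t): the signal retained at step t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def forward_coefficients(alpha_bar_t):
    """Scales for x_t = a * x_0 + b * noise, with a^2 + b^2 = 1."""
    return math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
```

Because the squared coefficients sum to one, the forward process is variance-preserving: unit-variance data mixed with unit-variance noise stays at unit variance at every step.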
End-to-end robotic manipulation pipeline for the Kinova Gen2 arm, integrating BiRRT motion planning, Task-Space Region grasp sampling, and IKFast inverse kinematics via the AIKIDO planning framework.
Implemented a BiRRT planner with TSR-constrained grasp sampling for reliable object pick-and-place.
Integrated the IKFast analytic IK solver for real-time joint-space trajectory generation.
Validated the full pipeline on physical Kinova Gen2 hardware using ROS and AIKIDO.
Comparative study of three deep RL agents (A2C, DQN, PPO) trained on a custom OpenAI Gym Snake environment, with reward shaping and curriculum scheduling to accelerate convergence.
Built a custom Gym-compatible environment with configurable grid sizes and rendering.
Benchmarked A2C, DQN, and PPO; PPO achieved a 60% reduction in training time to target score.
Reward shaping and curriculum scheduling pushed peak score to 16 points.
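One common form of reward shaping for a grid game like Snake is potential-based shaping, sketched below in Python. This is an assumption for illustration: the exact shaping used in this study is not stated above, and `potential`, `shaped_reward`, and the Manhattan-distance potential are hypothetical choices.

```python
def potential(head, food):
    """Negative Manhattan distance from the snake's head to the food."""
    return -(abs(head[0] - food[0]) + abs(head[1] - food[1]))


def shaped_reward(env_reward, head, next_head, food, gamma=0.99):
    """Add the shaping term gamma * phi(s') - phi(s) to the raw reward."""
    return env_reward + gamma * potential(next_head, food) - potential(head, food)


# Food three cells to the right of the head; raw step reward of zero.
toward = shaped_reward(0.0, (0, 0), (0, 1), (0, 3))
away = shaped_reward(0.0, (0, 0), (1, 0), (0, 3))
```

Moving toward the food yields a small positive bonus and moving away a small penalty, which gives the agent a denser learning signal than the sparse reward for eating.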
Compute-graph pipeline that analyzes PyTorch model topology across five signals to recommend kernel optimizations and hardware placement. Recommendations are grounded in graph structure alone; no model name or documentation is required.
Each recommendation cites the graph evidence that triggered it and any evidence that contradicted it.
Arithmetic intensity profiled per op type and matched against chip compute ceilings.
Single trace per model; classification, technique matching, and hardware placement all run weight-free.
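Matching arithmetic intensity against a chip's ceilings is the roofline model, which can be sketched in a few lines of Python. The chip numbers below are made up for illustration, and the function names and cost formulas are assumptions (ideal data movement, each operand touched once), not the pipeline's actual code.

```python
def matmul_cost(m, n, k, bytes_per_elem=2):
    """FLOPs and ideal bytes moved for an (m x k) @ (k x n) matmul."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops, bytes_moved


def elementwise_cost(n, bytes_per_elem=2):
    """One FLOP per element; two input reads and one output write."""
    return n, 3 * n * bytes_per_elem


def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline ceiling: the lower of the compute and memory limits."""
    return min(peak_flops, intensity * mem_bandwidth)


def classify(intensity, peak_flops, mem_bandwidth):
    """Compare intensity to the ridge point where the two limits meet."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical chip: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK, BANDWIDTH = 100e12, 1e12

matmul_ai = arithmetic_intensity(*matmul_cost(4096, 4096, 4096))
add_ai = arithmetic_intensity(*elementwise_cost(1 << 20))
matmul_kind = classify(matmul_ai, PEAK, BANDWIDTH)
add_kind = classify(add_ai, PEAK, BANDWIDTH)
```

On this assumed chip the large matmul sits well above the ridge point and is limited by compute, while the elementwise add is limited by memory bandwidth, which is the kind of distinction that drives kernel and placement recommendations.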
Full-stack stock research and paper trading platform: single-page web app, REST API, and native Android client, all deployed on AWS.
Built a real-time stock search and data visualisation pipeline with an Angular SPA backed by a Node/Express REST API.
Implemented a simulated trading engine with watchlist management and portfolio P&L analytics.
Delivered a native Android client with feature parity to the web app, deployed across AWS EC2 and S3.
04 Archive
May 2025
Multi-Backbone Waste Classifier
9-class waste image classifier benchmarking four frozen ImageNet backbones (VGG16, ResNet101, ResNet50, EfficientNetB0) with augmentation and early stopping.
Oct 2024
Monte Carlo Localization
Particle filter–based robot localization in ROS, fusing lidar measurements with an EKF for position estimation.
Oct 2024
Robot Behavioral Cloning
Imitation learning agent trained via behavioral cloning on expert demonstrations in MuJoCo continuous-control environments.
Feb 2024
Traffic Shaper Simulator
Multithreaded traffic shaping simulator in C implementing token-bucket and leaky-bucket algorithms with mutex-guarded queues.
Dec 2023
Hinglish Language Detection
Code-switched Hinglish language detection using a BiLSTM model with FastText subword embeddings.
Oct 2023
HMM POS Tagger
Hidden Markov Model part-of-speech tagger with Viterbi decoding, built for the USC NLP course.
May 2023
SWINDetector
Deepfake detection pipeline using a SWIN Transformer backbone fine-tuned on FaceForensics++ via HuggingFace.
Open to full-time roles in ML infrastructure, systems engineering, and GPU optimization. If you have an interesting problem, I'd love to hear about it.