
Ansh Shah

@baymax3009

Research Associate at RRC, IIIT Hyderabad. Previously undergrad at BITS Pilani | Interested in Robot Learning


Ansh Shah Reposted

Check out @binghao_huang's new touch sensor from CoRL 2024!

Want to use tactile sensing but not familiar with hardware? No worries! Just follow the steps, and you’ll have a high-resolution tactile sensor ready in 30 mins! It’s as simple as making a sandwich! 🥪 🎥 YouTube Tutorial: youtube.com/watch?v=8eTpFY… 🛠️ Open Source & Hardware…



Ansh Shah Reposted

Excited to finally share Generative Value Learning (GVL), my @GoogleDeepMind project on extracting universal value functions from long-context VLMs via in-context learning! We discovered a simple method to generate zero-shot and few-shot values for 300+ robot tasks and 50+…
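
A rough sketch of the in-context value-prediction idea described in the tweet: hand a long-context VLM the task description plus the rollout frames and ask for a per-frame completion estimate. This is not the paper's actual interface; `query_vlm` is a hypothetical stand-in for whatever multimodal endpoint you have access to.

```python
from typing import List

def query_vlm(prompt: str, frames: List[bytes]) -> str:
    """Hypothetical stand-in: send a multimodal prompt (text + frames) to a
    long-context VLM and return its text reply. Wire this to your own endpoint."""
    raise NotImplementedError

def predict_frame_values(task: str, frames: List[bytes]) -> List[float]:
    """Ask the VLM, in context, for a task-completion estimate per frame."""
    prompt = (
        f"Task: {task}\n"
        "For each of the following frames, output the estimated task completion "
        "percentage (0-100), one number per line."
    )
    reply = query_vlm(prompt, frames)
    return [float(line) / 100.0 for line in reply.strip().splitlines()]
```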


Ansh Shah Reposted

Our seminal paper (yes, I do believe this is transformative to the field) "Spatial Cognition from Egocentric Video: Out of Sight not Out of Mind" is accepted at @3DVconf #3DV2025. Camera-ready coming soon. Congrats to great coauthors @plizzari38126 @goelshbhm Toby @JacobChalkie @akanazawa

🆕 On arXiv: Out of Sight, Not Out of Mind: Spatial Cognition from Egocentric Video. dimadamen.github.io/OSNOM/ arxiv.org/abs/2404.05072 3D tracking of active objects using observations captured through an egocentric camera. Objects are tracked while in hand, from cupboards, and into drawers.



Ansh Shah Reposted

Wrote a blogpost on using image and video diffusion models to "draw actions"

Summary:
- LLMs can model arbitrary sequences; diffusion models can generate arbitrary patterns
- Images can serve as a common format across modalities like vision, audio, and actions

Link below

Tweet Image 1
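
The blogpost link is truncated above, so here is only a generic illustration of the "images as a common format" idea: rasterizing a 2D end-effector trajectory into an image so an ordinary image or video diffusion model could generate it. The encoding itself is an assumption for illustration, not taken from the post.

```python
import numpy as np

def trajectory_to_image(xy: np.ndarray, size: int = 64) -> np.ndarray:
    """Rasterize a normalized (T, 2) trajectory in [0, 1]^2 onto a size x size
    canvas, encoding time as pixel intensity, so a diffusion model can treat
    the action sequence like any other picture."""
    img = np.zeros((size, size), dtype=np.float32)
    pix = np.clip((xy * (size - 1)).round().astype(int), 0, size - 1)
    for t, (x, y) in enumerate(pix):
        img[y, x] = (t + 1) / len(pix)
    return img

# Example: a short diagonal end-effector reach, "drawn" as an image.
traj = np.linspace([0.1, 0.1], [0.9, 0.8], num=32)
print(trajectory_to_image(traj).shape)  # (64, 64)
```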

Ansh Shah Reposted

Very happy to start sharing our work at Pi 🤖❤️
- a 3B pre-trained generalist model trained on 8+ robot platforms
- a post-training recipe that allows robots to do dexterous, long-horizon tasks
physicalintelligence.company/blog/pi0
What's exciting isn't laundry, but the recipe - a short 🧵


Ansh Shah Reposted

Not every foundation model needs to be gigantic. We trained a 1.5M-parameter neural network to control the body of a humanoid robot. It takes a lot of subconscious processing for us humans to walk, maintain balance, and maneuver our arms and legs into desired positions. We…


Ansh Shah Reposted

Diffusion-based approach beats autoregressive models at solving puzzles and planning 🤖 Original Problem: Autoregressive LLMs struggle with complex reasoning and long-term planning tasks despite their impressive capabilities. They have inherent difficulties maintaining global…

Tweet Image 1
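
The thread above is truncated, so the following is only a toy illustration of the contrast it draws: a masked/discrete-diffusion-style planner refines the whole sequence in parallel instead of committing to tokens left to right. The denoiser here is a random stand-in, not a trained model or the paper's method.

```python
import random

VOCAB = ["up", "down", "left", "right"]

def denoise_step(seq, masked):
    """Stand-in for a learned denoiser: refill the masked slots with guesses."""
    return [random.choice(VOCAB) if i in masked else tok for i, tok in enumerate(seq)]

def refine_plan(length=8, steps=5):
    """Start fully masked; each step re-masks a shrinking subset and refills it,
    so every position can still be revised in light of the rest of the plan."""
    seq = ["<mask>"] * length
    for step in range(steps):
        k = max(1, round(length * (1 - step / steps)))
        masked = set(random.sample(range(length), k))
        seq = denoise_step(seq, masked)
    return seq

print(refine_plan())
```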

Ansh Shah Reposted

Last Sunday, we competed in the Vision Assistance Race at the @cybathlon 2024—the "cyber Olympics" designed to push the boundaries of assistive technology. In this race, our system guided a blind participant through everyday tasks such as walking along a sidewalk, sorting colors,…


Ansh Shah Reposted

How do we represent 3D world knowledge for spatial intelligence in next-generation robots? We recently wrote an extensive survey paper on this emerging topic, covering recent state-of-the-art! 🦾 🚀 Check it out below. Feedback/Suggestions welcome! 📖arXiv:…

Tweet Image 1

Ansh Shah Reposted

📢 Excited to share our new paper with @fabreetseo: "Beyond Position: How Rotary Embeddings Shape Representations and Memory in Autoregressive Transformers"! arxiv.org/abs/2410.18067 Keep reading to find out how RoPE affects Transformer models beyond just positional encoding 🧵
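
As background for readers unfamiliar with RoPE itself (this is standard material, not the paper's contribution): rotary embeddings rotate each query/key dimension pair by an angle proportional to its position, so attention scores depend on relative offsets. A minimal NumPy sketch:

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to x of shape (seq_len, dim), dim even.
    Uses the half-split pairing: dims i and i + dim/2 form a rotation pair."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation frequency
    angles = np.outer(np.arange(seq_len), freqs)     # angle grows with position
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Rotate queries and keys before the attention dot product; the scores then
# depend on relative position rather than absolute position alone.
q = rope(np.random.randn(16, 64))
k = rope(np.random.randn(16, 64))
print((q @ k.T).shape)  # (16, 16)
```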


Ansh Shah Reposted

Pretraining can transform RL, but it may require rethinking how we pretrain with RL on unlabeled data to bootstrap downstream exploration. In our new work, we show how to accomplish this with unsupervised skills and exploration.

Latest work on leveraging prior trajectory data with *no* reward label to accelerate online RL exploration! Our method leverages our prior work (ExPLORe) and skill pretraining to achieve better sample efficiency on a range of sparse-reward tasks than all prior approaches!

Tweet Image 1
Tweet Image 2


Ansh Shah Reposted

Sequence models have skyrocketed in popularity for their ability to analyze data & predict what to do next. MIT’s "Diffusion Forcing" method combines the strengths of next-token prediction (like w/ChatGPT) & video diffusion (like w/Sora), training neural networks to handle…
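
The tweet is cut off above, so this is just a hedged sketch of the core trick as I understand it: during training, each frame in the sequence gets its own independently sampled noise level, so a causal denoiser learns to mix next-token-style and diffusion-style generation. The schedule and shapes below are made up for illustration.

```python
import torch

def diffusion_forcing_batch(frames: torch.Tensor, num_levels: int = 10):
    """Noise every frame of a (B, T, D) sequence with its own independently
    sampled noise level, as a training batch for a causal denoiser."""
    B, T, _ = frames.shape
    levels = torch.randint(0, num_levels, (B, T))    # one level per frame
    alpha = 1.0 - levels.float() / num_levels        # toy noise schedule
    noise = torch.randn_like(frames)
    noisy = alpha.sqrt().unsqueeze(-1) * frames + (1 - alpha).sqrt().unsqueeze(-1) * noise
    return noisy, levels, noise

noisy, levels, noise = diffusion_forcing_batch(torch.randn(2, 16, 32))
print(noisy.shape, levels.shape)  # torch.Size([2, 16, 32]) torch.Size([2, 16])
```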


Ansh Shah Reposted

state space models are super neat & interesting, but i have never seen any evidence that they're *smarter* than transformers - only more efficient

any architectural innovation that doesn't advance the pareto frontier of intelligence-per-parameter is an offramp on the road to AGI


Ansh Shah Reposted

Sirui's new work presents a nice system design with a user-friendly interface for data collection without a robot. Collecting robot data without robots, with humans only, is the right way to go.

How can we collect high-quality robot data without teleoperation? AR can help! Introducing ARCap, a fully open-sourced AR solution for collecting cross-embodiment robot data (gripper and dex hand) directly using human hands. 🌐:stanford-tml.github.io/ARCap/ 📜:arxiv.org/abs/2410.08464



Ansh Shah Reposted

Mechazilla has caught the Super Heavy booster!


Ansh Shah Reposted
Tweet Image 1

Ansh Shah Reposted

A perfect real-world example of equivariance haha

Cats are invariant under SO(3) transformations! 😼



Ansh Shah Reposted

The 3D vision community really hates the bitter lesson. Dust3r is what you get when you take the lesson seriously.


Ansh Shah Reposted

The paper contains many ablation studies on various ways to use the LLM backbone 👇🏻
🦩 Flamingo-like cross-attention (NVLM-X)
🌋 LLaVA-like concatenation of image and text embeddings to a decoder-only model (NVLM-D)
✨ a hybrid architecture (NVLM-H)

Tweet Image 1
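
A schematic of the "LLaVA-like concatenation" option (NVLM-D) only, with toy dimensions, a plain linear projector, and a generic transformer standing in for the LLM; this illustrates the idea, not NVLM's actual architecture or code.

```python
import torch
import torch.nn as nn

VISION_DIM, LM_DIM, VOCAB = 64, 128, 1000   # toy sizes, not NVLM's real config

class DecoderOnlyVLM(nn.Module):
    """Project image features into the LM embedding space and prepend them to
    the text token embeddings, so a decoder-only model attends over both."""
    def __init__(self):
        super().__init__()
        self.projector = nn.Linear(VISION_DIM, LM_DIM)
        self.embed = nn.Embedding(VOCAB, LM_DIM)
        layer = nn.TransformerEncoderLayer(LM_DIM, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in LLM
        # (causal masking omitted for brevity)

    def forward(self, image_feats, text_ids):
        img_tok = self.projector(image_feats)   # (B, N_img, LM_DIM)
        txt_tok = self.embed(text_ids)          # (B, N_txt, LM_DIM)
        return self.backbone(torch.cat([img_tok, txt_tok], dim=1))

model = DecoderOnlyVLM()
out = model(torch.randn(1, 16, VISION_DIM), torch.randint(0, VOCAB, (1, 8)))
print(out.shape)  # torch.Size([1, 24, 128])
```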

Ansh Shah Reposted

chat is this true

Tweet Image 1
