
Mu Cai @ Industry Job Market

@MuCai7

Ph.D. student @WisconsinCS working on multimodal large language models. Graduating around May 2025; looking for a Research Scientist position in multimodal models.

Similar Users

Sean Xuefeng Du (on academic job market) (@xuefeng_du)
Zifeng Wang (@ZifengWang315)
Liu Yang (@Yang_Liuu)
Jiefeng Chen (@jiefengchen1)
Tzu-Heng Huang (@zihengh1)
Yuchen Zeng (@yzeng58)
Yiyou Sun (@YiyouSun)
Jitian Zhao (@jzhao326)
Boru Chen (@blue75525366)
Changho Shin (@Changho_Shin_)
Jiachen Sun (@JiachenSun5)
Minghao Yan (@Minghao__Yan)
Zhenmei SHI (@zhmeishi)
Jifan Zhang (@jifan_zhang)
Ziqian Lin (@myhakureimu)

Pinned

Thanks to @_akhaliq for sharing! (1/N) We propose M3: Matryoshka Multimodal Models, arxiv.org/abs/2405.17430 which (1) reduces the number of visual tokens significantly while maintaining performance on par with the vanilla LMM, and (2) organizes visual tokens in a coarse-to-fine nested way.

Matryoshka Multimodal Models Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed large number of visual tokens and then feed them into a Large Language Model (LLM).
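For intuition, here is a minimal sketch of the coarse-to-fine nested token idea: a square grid of visual tokens is average-pooled into progressively coarser grids, so a short prefix of scales already gives a usable (cheaper) representation. The grid sizes and function names below are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of coarse-to-fine nested visual token pooling (Matryoshka-style).
# Assumes a square grid of visual tokens from the vision encoder; average pooling
# is one plausible pooling operator, not necessarily the paper's exact choice.
import torch
import torch.nn.functional as F

def nested_visual_tokens(tokens: torch.Tensor, grid: int, scales=(24, 12, 6, 3, 1)):
    """tokens: (num_tokens, dim) with num_tokens == grid * grid.
    Returns a dict mapping scale -> (scale*scale, dim) pooled tokens,
    ordered coarse-to-fine so a prefix of scales is a cheaper representation."""
    dim = tokens.shape[-1]
    # Reshape the flat token sequence back onto its 2D spatial grid: (1, dim, grid, grid).
    grid_tokens = tokens.view(grid, grid, dim).permute(2, 0, 1).unsqueeze(0)
    nested = {}
    for s in sorted(scales):  # coarse (small s) to fine (large s)
        pooled = F.adaptive_avg_pool2d(grid_tokens, output_size=s)          # (1, dim, s, s)
        nested[s] = pooled.squeeze(0).permute(1, 2, 0).reshape(s * s, dim)  # (s*s, dim)
    return nested

# Example: 576 visual tokens (24x24 grid) pooled to 1 / 9 / 36 / 144 / 576 tokens.
vis = torch.randn(576, 1024)
levels = nested_visual_tokens(vis, grid=24)
print({s: tuple(t.shape) for s, t in levels.items()})
```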



I am not at #EMNLP2024, but @bochengzou is in Florida! Go check out vector graphics, a promising format for visual representation that is completely different from pixels. Thanks to LLMs, vector graphics are more powerful now! Go chat with @bochengzou if you are interested!

VGBench is accepted to the EMNLP main conference! Congratulations to the team @bochengzou @HyperStorm9682 @yong_jae_lee The first work on "Evaluating Large Language Models on Vector Graphics Understanding and Generation," as a comprehensive benchmark!



Now TemporalBench is fully public! See how your video understanding model performs on TemporalBench before CVPR! 🤗 Dataset: huggingface.co/datasets/micro… 📎 Integrated to lmms-eval (systematic eval): github.com/EvolvingLMMs-L… (great work by @ChunyuanLi @zhang_yuanhan ) 📗 Our…
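As a rough usage sketch (not the official evaluation code): once the dataset is on the Hugging Face Hub, it can be loaded with the datasets library and scored with a simple exact-match accuracy. The dataset id and field names below are placeholders, since the links above are truncated.

```python
# Hypothetical sketch: scoring a model on a TemporalBench-style QA split.
# The dataset id and the "id"/"answer" field names are assumptions for illustration.
from datasets import load_dataset

def exact_match_accuracy(model_answers: dict, dataset) -> float:
    """model_answers: dict mapping example id -> predicted answer string."""
    correct = 0
    for ex in dataset:
        pred = model_answers.get(ex["id"], "")
        correct += int(pred.strip().lower() == ex["answer"].strip().lower())
    return correct / len(dataset)

# ds = load_dataset("<org>/TemporalBench", split="test")  # placeholder dataset id
# print(exact_match_accuracy(my_model_outputs, ds))
```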


1/N) Are current large multimodal models like #GPT4o really good at video understanding? 🚀 We are thrilled to introduce TemporalBench to examine temporal dynamics understanding for LMMs! Our TemporalBench reveals even the SOTA LMM #GPT4o achieves only 38.5, far from…



Mu Cai @ Industry Job Market Reposted

Personal update: After 5.5 yrs at @MSFTResearch, I will join @williamandmary in 2025 as an assistant professor. You are welcome to apply for my PhD and intern positions. Interests: ML with foundation models, LLM understanding, and AI for social sciences. More information: jd92wang.notion.site/Professor-Jind…


Mu Cai @ Industry Job Market Reposted

Fine-grained temporal understanding is fundamental for any video understanding model. Excited to see LLaVA-Video showing promising results on TemporalBench, @MuCai7! Yet, there remains a significant gap between the best model and human-level performance. The journey continues!


🔥Check out our new LMM benchmark TemporalBench! Our world is temporal, dynamic, and physical, which can only be captured in videos. To move forward, we need LMMs that understand fine-grained changes and motions to really benefit downstream applications such as video…



Great work applying the multi-granularity idea to image generation/manipulation! This shares the same visual encoding design as our earlier work Matryoshka Multimodal Models (matryoshka-mm.github.io), where pooling is used to control visual granularity, leading to a multi…


Introducing PUMA: a new MLLM for unified vision-language understanding and visual content generation at various granularities, from diverse text-to-image generation to precise image manipulation. rongyaofang.github.io/puma/ arxiv.org/abs/2410.13861 huggingface.co/papers/2410.13…





Mu Cai @ Industry Job Market Reposted

🚀 Excited to share our latest research: "Parameter-Efficient Fine-Tuning of SSMs" Summary: 🧵

