
Yiheng Xu

@yihengxu_

digital ai agent research @hkuniversity | ex @msftresearch | layoutlm / lemur / aguvis | from automation to autonomy

Joined May 2020
Similar Users

Tianbao Xie (@TianbaoX)
Lingpeng Kong (@ikekong)
jinyang (patrick) li (@jinyang34647007)
Lei Li (@_TobiasLee)
Qingxiu Dong (@qx_dong)
Ming Zhong (@MingZhong_)
Jiacheng Ye (@JiachengYe15)
Gordon Lee🍀 (@redoragd)
Siru Ouyang (@Siru_Ouyang)
Chen Wu (@ChenHenryWu)
Bei Chen (@beichen1019)
Zhoujun (Jorge) Cheng (@ChengZhoujun)
Bailin Wang (@bailin_28)
Yifei Li (@YifeiLiPKU)
Jiahui Gao (@jiahuigao3)

Pinned

Very happy to share that Lemur has been accepted to #ICLR2024 as a spotlight! 🥳 Great thanks to all my amazing coauthors!

1/ 🧵 🎉 Introducing Lemur-70B & Lemur-70B-Chat: 🚀Open & SOTA Foundation Models for Language Agents! The closest open model to GPT-3.5 on 🤖15 agent tasks🤖!
📄Paper: arxiv.org/abs/2310.06830
🤗Model @huggingface: huggingface.co/OpenLemur
More details 👇

Tweet Image 1
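For readers who want to try the checkpoints, here is a minimal loading sketch with 🤗 Transformers; the repo id is an assumption, so check huggingface.co/OpenLemur for the exact name.

```python
# Minimal sketch: loading an OpenLemur chat checkpoint with Hugging Face Transformers.
# "OpenLemur/lemur-70b-chat-v1" is an assumed repo id -- verify it on the Hub first.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenLemur/lemur-70b-chat-v1"  # assumed id; a 70B model needs multiple GPUs
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a Python function that reverses a string."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```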


Yiheng Xu Reposted

🚀 Excited to introduce a new member of the OS-Copilot family: OS-Atlas - a foundational action model for GUI agents
Paper: huggingface.co/papers/2410.23…
Website: osatlas.github.io
A thread on why this matters for the future of OS automation 🧵
TL;DR: OS-Atlas offers: 1.…

Tweet Image 1

Yiheng Xu Reposted

Splashdown confirmed! Congratulations to the entire SpaceX team on an exciting fifth flight test of Starship!


Yiheng Xu Reposted

Mechazilla has caught the Super Heavy booster!


Yiheng Xu Reposted

We're launching SWE-bench Multimodal to eval agents' ability to solve visual GitHub issues.
- 617 *brand new* tasks from 17 JavaScript repos
- Each task has an image! Existing agents struggle here!
We present SWE-agent Multimodal to remedy some of these issues. Led w/ @_carlosejimenez 🧵

Tweet Image 1

Yiheng Xu Reposted

Just took a look at the ICLR '25 submissions.
1. 👀 LLMs are still crushing it as the top-1 topic this year, with diffusion models in second place.
2. 🤔 Evaluation & benchmarks might be what we should focus more on in the future, because making something like O1 or…

Tweet Image 1
Tweet Image 2

Yiheng Xu Reposted

🚀 Still relying on human-crafted rules to improve pretraining data? Time to try Programming Every Example (ProX)! Our latest effort uses LMs to refine data with unprecedented accuracy and brings up to 20x faster training in general and math domains! 👇 Curious about the details?


Yiheng Xu Reposted

After months of effort, we are pleased to announce the evolution from Qwen1.5 to Qwen2. This time, we bring to you:
⭐ Base and Instruct models of 5 sizes, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. Having been trained on data in 27 additional…

Tweet Image 1

Yiheng Xu Reposted

Who is ready to rock some AI? 🧑‍🎤🤘 #ICLR2024


Yiheng Xu Reposted

Downstream scores can be noisy. If you wonder about Llama 3's compression perf in this figure, we have tested the BPC:
Llama 3 8B: 0.427, best at its size, comparable to Yi-34B
Llama 3 70B: 0.359, way ahead of all the models here
Details at github.com/hkust-nlp/llm-…

Compression Represents Intelligence Linearly
LLMs' intelligence – reflected by average benchmark scores – almost linearly correlates with their ability to compress external text corpora.
repo: github.com/hkust-nlp/llm-…
abs: arxiv.org/abs/2404.09937

Tweet Image 1
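BPC here is bits per character: the model's total negative log2-likelihood of a text divided by its character count. A rough sketch of measuring it with any causal LM follows; the model id and the single-pass scoring are simplifications of the repo's proper sliding-window evaluation over large corpora.

```python
# Rough bits-per-character (BPC) sketch for a causal LM:
# BPC = total negative log2-likelihood of the text / number of characters.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # illustrative; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

def bits_per_character(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    n_predicted = enc["input_ids"].shape[1] - 1    # labels are shifted internally
    total_nats = out.loss.item() * n_predicted     # loss is mean cross-entropy in nats
    return (total_nats / math.log(2)) / len(text)  # convert to bits, normalize by characters

print(bits_per_character("The quick brown fox jumps over the lazy dog."))
```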


Yiheng Xu Reposted

🔥 Do you want an open and versatile code assistant? Today, we are delighted to introduce CodeQwen1.5-7B and CodeQwen1.5-7B-Chat, are specialized codeLLMs built upon the Qwen1.5 language model! 🔋 CodeQwen1.5 has been pretrained with 3T tokens of code-related data and exhibits…

Tweet Image 1

Yiheng Xu Reposted

🚀 Multimodal agents are on the rise in 2024! But even building an app/domain-specific agent env is hard 😰. Our real-computer OSWorld env lets you define agent tasks for arbitrary apps on diff. OS w/o crafting new envs. 🧐 Benchmarked #VLMs on 369 OSWorld tasks: #GPT4V >> #Claude3

Tweet Image 1

🤔 Can we assess agents across various apps & OS w/o crafting new envs? OSWorld 🖥️: a unified, real computer env for multimodal agents to evaluate open-ended computer tasks with arbitrary apps and interfaces on Ubuntu, Windows, & macOS. + annotated 369 real-world computer tasks…
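The interaction pattern such a benchmark implies looks roughly like the gym-style loop below. The class and method names are invented for illustration, not OSWorld's actual API; see the OSWorld repository for the real interface.

```python
# Illustrative agent-environment loop for a real-computer benchmark.
# Names are hypothetical, NOT OSWorld's actual API -- see the OSWorld repo for that.
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot: bytes   # raw screen capture the VLM agent sees
    a11y_tree: str      # accessibility-tree dump of the current UI

class ToyDesktopEnv:
    """Stand-in for a real computer environment."""

    def reset(self, task_config: dict) -> Observation:
        # A real env would restore a VM snapshot and set up the task here.
        return Observation(screenshot=b"", a11y_tree="<desktop/>")

    def step(self, action: str) -> tuple[Observation, bool]:
        # An action could be a pyautogui-style command string, e.g. "click(120, 340)".
        done = action == "DONE"
        return Observation(screenshot=b"", a11y_tree="<desktop/>"), done

    def evaluate(self) -> float:
        # Execution-based check of the final machine state (files, settings, ...).
        return 0.0

env = ToyDesktopEnv()
obs = env.reset({"instruction": "Rename report.txt to report_final.txt"})
for _ in range(15):            # step budget
    action = "DONE"            # a VLM agent would choose this from obs
    obs, done = env.step(action)
    if done:
        break
print("task score:", env.evaluate())
```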




Yiheng Xu Reposted

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments The first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating…


Yiheng Xu Reposted

Visualization-of-Thought (VoT): Mind's Eye of LLMs

Visualization-of-Thought Elicits Spatial Reasoning in LLMs Inspired by a human cognitive capacity to imagine unseen worlds, this new work proposes Visualization-of-Thought (VoT) prompting to elicit spatial reasoning in LLMs. VoT enables LLMs to "visualize" their reasoning…

Tweet Image 1
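A toy prompt in the spirit of VoT, not the paper's exact template: the model is asked to redraw the intermediate state as ASCII art after every reasoning step.

```python
# Toy prompt in the spirit of VoT prompting (illustrative, not the paper's template):
# the model "visualizes" each intermediate state as ASCII art before the next move.
vot_prompt = """You are navigating a 3x3 grid from S to G. Walls are marked #.

# . G
. # .
S . .

Rules: move up/down/left/right, never onto a wall.
After EACH move, redraw the grid with your current position marked X,
then state the next move. Finish with the full move sequence."""

# Send vot_prompt to any chat LLM (e.g. via an OpenAI-compatible client) and
# compare against a plain chain-of-thought prompt without the redraw instruction.
print(vot_prompt)
```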


Yiheng Xu Reposted

SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on SWE-bench, takes 93 seconds on avg + it's open source! We designed a new agent-computer interface to make it easy for GPT-4 to edit+run code github.com/princeton-nlp/…

Tweet Image 1

Yiheng Xu Reposted

Our arXiv preprint is out now! 🔗: arxiv.org/abs/2403.15452 If you know other awesome papers on tool use in LLMs, please let us know and feel free to open a PR! 👩‍💻: github.com/zorazrw/awesom…

Tools can empower LMs to solve many tasks. But what are tools anyway? github.com/zorazrw/awesom…
Our survey studies tools for LLM agents w/
– A formal def. of tools
– Methods/scenarios to use & make tools
– Issues in testbeds and eval metrics
– Empirical analysis of cost-gain trade-off

Tweet Image 1
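In the survey's sense, a tool is just an external program the LM can invoke. A minimal use-a-tool loop might look like the sketch below; the "CALL name(args)" convention is made up purely for illustration.

```python
# Minimal tool-use sketch: the LM emits a structured call, the program executes it,
# and the result becomes the next observation. The "CALL name(args)" convention is
# invented here for illustration only.
import re

def calculator(expression: str) -> str:
    """A trivial tool: evaluate an arithmetic expression (demo only -- never eval untrusted input)."""
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def run_turn(lm_output: str) -> str:
    """If the LM asked for a tool, run it and return the observation; else pass the text through."""
    match = re.match(r'CALL (\w+)\((.*)\)', lm_output.strip())
    if match and match.group(1) in TOOLS:
        return TOOLS[match.group(1)](match.group(2).strip('"'))
    return lm_output

# Pretend the LM produced this action string:
print(run_turn('CALL calculator("17 * 23")'))   # -> 391
```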


Yiheng Xu Reposted

Wanna train a SOTA reward model? 🌟New Blog Alert: "Reward Modeling for RLHF" (with @weixiong_1 & @RuiYang70669025) is live this weekend! 🌐✨ We delve into the insights behind achieving groundbreaking performance on the RewardBench (by @natolambert). efficient-unicorn-451.notion.site/Reward-Modelin…
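The core of most RLHF reward-model training is a Bradley-Terry pairwise loss over (chosen, rejected) response pairs; here is a minimal sketch of that generic recipe, not necessarily the blog's exact setup.

```python
# Minimal Bradley-Terry pairwise loss used to train RLHF reward models
# (the generic recipe; the blog's exact setup may differ).
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """r_chosen / r_rejected: scalar reward scores for the preferred and dispreferred
    responses to the same prompt, shape (batch,)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy check: when chosen responses already score higher, the loss is small.
print(pairwise_reward_loss(torch.tensor([2.0, 1.5]), torch.tensor([0.5, 1.0])))
```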


Yiheng Xu Reposted

🎉🎉 We are excited to release a full package for AI Agent R&D:
1) For data & training, 🎙️AgentOhana🎙️: Design Unified Data and Training Pipeline for Effective Agent Learning.
2) For models, 🔥xLAM-v0.1-R🔥: a strong large action model for AI agents while maintaining abilities on…

Tweet Image 1

Yiheng Xu Reposted

Since we released SWE-bench, we've been asked for a smaller & slightly easier subset of the benchmark, to make it easier to develop and test new ideas in language modeling for code. Today we're releasing SWE-bench Lite. By @_carlosejimenez @jyangballin @JiayiiGeng!

SWE-bench Lite is a smaller & slightly easier *subset* of SWE-bench, with 23 dev / 300 test examples (full SWE-bench is 225 dev / 2,294 test). We hope this makes SWE-bench evals easier. Special thanks to @JiayiiGeng for making this happen. Download here: swebench.com/lite

Tweet Image 1
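If the subset is published on the Hugging Face Hub, pulling it and checking the split sizes is a few lines with 🤗 Datasets; the dataset id below is an assumption, so confirm it at swebench.com/lite.

```python
# Sketch: load SWE-bench Lite with Hugging Face Datasets and check the split sizes.
# The Hub id is assumed -- confirm the exact one at swebench.com/lite.
from datasets import load_dataset

lite = load_dataset("princeton-nlp/SWE-bench_Lite")          # assumed dataset id
print({split: len(rows) for split, rows in lite.items()})    # expect ~23 dev / 300 test
```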


Yiheng Xu Reposted

Happy to share that REPLUG🔌 has been accepted to #NAACL2024! It introduces a retrieval-augmented LM framework that combines a frozen LM with a frozen/tunable retriever, improving GPT-3 in language modeling & downstream tasks by prepending retrieved docs to LM inputs. arxiv.org/abs/2301.12652

Tweet Image 1
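The ensembling idea behind REPLUG can be sketched in a few lines: prepend each retrieved doc to the input separately, run the frozen LM, and mix the next-token distributions weighted by normalized retrieval scores. Below is a toy reconstruction with a small stand-in model, not the paper's code.

```python
# Toy reconstruction of REPLUG-style ensembling (not the paper's code): prepend each
# retrieved doc to the query separately, then mix the frozen LM's next-token
# distributions, weighted by softmax-normalized retrieval scores.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # small stand-in for the frozen LM; REPLUG was demonstrated on GPT-3
tok = AutoTokenizer.from_pretrained(model_id)
lm = AutoModelForCausalLM.from_pretrained(model_id).eval()

def replug_next_token_probs(query: str, docs: list[str], scores: torch.Tensor) -> torch.Tensor:
    weights = torch.softmax(scores, dim=0)            # retrieval likelihood per doc
    mixed = torch.zeros(lm.config.vocab_size)
    for doc, w in zip(docs, weights):
        ids = tok(doc + "\n\n" + query, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = lm(ids).logits[0, -1]            # next-token logits given this doc
        mixed += w * torch.softmax(logits, dim=-1)    # weighted ensemble of distributions
    return mixed

probs = replug_next_token_probs(
    "The capital of France is",
    ["Paris is the capital and largest city of France.", "France is a country in Europe."],
    torch.tensor([0.9, 0.4]),
)
print(tok.decode(probs.argmax().item()))
```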
