Xueguang Ma @xueguang_ma Twitter Profile

Xueguang Ma

@xueguang_ma

PhD student at @uwaterloo. Working on encoding the world into vectors. Current part-time intern at @Meta. Prev. intern at @MSFTResearch, @amazon

104 Posts 463 Followers 343 Following

Pinned

Xueguang Ma

@xueguang_ma

18 Jun

Introducing Document Screenshot Embedding (DSE): a new retrieval paradigm that unifies various formats and modalities into a single form for direct document encoding. paper: arxiv.org/abs/2406.11251 work done with amazing co-authors: @jacklin_64 @alexlimh23 @WenhuChen @lintool

Xueguang Ma Reposted

Xueguang Ma

@xueguang_ma

18 h

Glad to see over 100 packages and repositories depending on BM25S (bm25s.github.io)! My main goal was to make a BM25 library that was easy to use but fast enough, so I'm glad it's useful to 100+ public projects :)

Xueguang Ma Reposted

Xueguang Ma

@xueguang_ma

13 Nov

If you are looking for an open-source model that can do the same thing (e.g., represent image docs, pictures, or text-only content with a single vector), try or fine-tune our ckpt: huggingface.co/MrLight/dse-qw…

Voyage AI

@VoyageAI

12 Nov

📢 Announcing voyage-multimodal-3, our first multimodal embedding model! It vectorizes interleaved text & images, capturing key visual features from screenshots of PDFs, slides, tables, figures, etc. +19.63% accuracy gain on 3 multimodal retrieval tasks (20 datasets)! 🧵🧵

MrLight/dse-qwen2-2b-mrl-v1 · Hugging Face

Source: https://t.co/SY91m1sgZs

Xueguang Ma

@xueguang_ma

14 Nov

Congrats to @ralph_tang, @crystina_z, et al. on receiving the outstanding paper award!

ralphtang.eth

@ralph_tang

14 Nov

Attending the #EMNLP2024 award ceremony virtually was fun. Many thanks to my collaborators @crystina_z @Ulienida @yaolu_nlp @Wenyan62 Pontus @lintool @ferhanture, without whom the award would not have been possible. Check out the paper here: w1kp.com

Xueguang Ma Reposted

Xueguang Ma

@xueguang_ma

14 Nov

Our #EMNLP2024 work OneGen now supports using Faiss as the vector retrieval engine! 🎉 Just set use_faiss to true in the inference section of the config.json file, and you’re all set! 🚀 #AI #NLP #RAG #LLM #OneGen Paper: arxiv.org/abs/2409.05152 Code: github.com/zjunlp/OneGen
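
The config change described in the post above can be sketched as follows. This is only a minimal illustration, assuming config.json is a JSON object with a top-level "inference" section; the helper name enable_faiss and the exact nesting are hypothetical and are not taken from the OneGen repository.

```python
import json
from pathlib import Path


def enable_faiss(config_path):
    """Set use_faiss to true in the (assumed) "inference" section of a JSON config."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: inference settings live under a top-level "inference" key.
    inference = config.setdefault("inference", {})
    inference["use_faiss"] = True
    # Write the updated config back, leaving all other keys untouched.
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Calling enable_faiss("config.json") would rewrite the file in place and return the updated dictionary; check the repository's own config.json for the actual layout before relying on this.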

Ningyu Zhang@ZJU

@zxlzr

10 Sep

Introducing our latest work, OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs 🚀. OneGen enables LLMs to perform retrieval during generation, utilizing less training data while achieving impressive performance and efficiency. #NLP #LLMs #RAG #Generation…

Xueguang Ma Reposted

Xueguang Ma

@xueguang_ma

14 Nov

🤖Are multilingual LLMs able to understand "relevance" across languages? 🥼 Can we construct a reliable dataset to evaluate this? #EMNLP2024 Stop by our poster on Nov 14 (Thu), 10:30-12:00 @ Riverfront Hall by @crystina_z! Paper🎉: aclanthology.org/2024.findings-… @UWCheritonCS @Huawei

Xueguang Ma Reposted

Xueguang Ma

@xueguang_ma

13 Nov

github.com/vllm-project/v… I just saw that the ckpt has been integrated into vLLM by the community! I appreciate it a lot!

[Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 by...

Source: https://t.co/SpLNCjYM96

Xueguang Ma Reposted

Xueguang Ma

@xueguang_ma

10 Nov

💡Check out MAIR at #EMNLP2024, a large-scale IR benchmark! Highlights:
- Task Diversity: 126 realistic tasks, 8x more than BEIR 📈
- Domain Coverage: 6 domains and heterogeneous sources 📚
- Instruction Following: 805 relevance criteria
- Lightweight & Fast: optimized data sampling ⚡️

Xueguang Ma

@xueguang_ma

12 Nov

Not able to go to Miami for #EMNLP2024 due to visa issues. But @ShengyaoZhuang will present the work in person today. Chat with him and @dwzhu128; they know the details of our DSE work 😁 They will also present their amazing work, PromptReps and LongEmbed, on LLM IR and LongCtx IR.

Xueguang Ma

@xueguang_ma

20 Sep

Now accepted to #EMNLP2024 Main! What is the best way to process a document to be indexed for search? HTML parsing? OCR? Our answer: don’t do any processing. Directly encode the document's original appearance into a vector for search with a VLM. Let the gradients flow to its real appearance.

Xueguang Ma Reposted

Xueguang Ma

@xueguang_ma

10 Nov

@sunweiwei12 will present our work on a massive benchmark for evaluating instructed retrieval at #EMNLP2024, joint work with the Baidu search team @lingyongyan @xinyuma8 @Yiding_tanh @yindawei @Zhengliang_Shi

Weiwei Sun

@sunweiwei12

10 Nov

Xueguang Ma Reposted

Xueguang Ma

@xueguang_ma

11 Nov

This week at #EMNLP2024, I’ll be presenting our PromptReps and proxy presenting DSE for @xueguang_ma at poster session A (Riverfront Hall) from 11:00 to 12:30 on November 12th. After EMNLP, I’m off to DC for TREC. Come say hi!

Xueguang Ma Reposted

Xueguang Ma

@xueguang_ma

8 Nov

1/7 🚨non-LLM paper alert!🚨 Human perception of a sentence is quite robust to interchanging words with similar meanings, not to mention semantically equivalent words across different languages. What about language models? In our recent work, we measure the…

Xueguang Ma Reposted

Xueguang Ma

@xueguang_ma

6 Nov

Introducing MM-Embed, the first multimodal retriever achieving SOTA results on the multimodal M-BEIR benchmark and compelling results (among top-5 retrievers) on the text-only MTEB retrieval benchmark. Paper: arxiv.org/abs/2411.02571 🤗 Model: huggingface.co/nvidia/MM-Embed

nvidia/MM-Embed · Hugging Face

Source: https://t.co/nSb6fFre08

Xueguang Ma Reposted

Xueguang Ma

@xueguang_ma

27 Oct

For multimodal RAG enthusiasts📣 mcdse-2b is a new performant, scalable and efficient multilingual document retrieval model ✨ 🪆 you can shrink it 6x with tiny degradation 🤏🏻 embed 100M pages in 10GB! 💨 run with transformers or vLLM

Xueguang Ma Reposted

Xueguang Ma

@xueguang_ma

27 Oct

Introducing mcdse-2b-v1: a multilingual (🇮🇹 🇪🇸 🇬🇧 🇫🇷 🇩🇪) embedding model for flexible visual document retrieval. Trained on MrLight/dse-qwen2-2b-mrl-v1 (Qwen2-VL) using the DSE approach. It's like ColPali but multilingual and single-vector. - MRL: shrink embeddings from 1536 to…

Xueguang Ma Reposted

Xueguang Ma

@xueguang_ma

22 Oct

🌍 I’ve always had a dream of making AI accessible to everyone, regardless of location or language. However, current open MLLMs often respond in English, even to non-English queries! 🚀 Introducing Pangea: A Fully Open Multilingual Multimodal LLM supporting 39 languages! 🌐✨…

Xueguang Ma Reposted

Xueguang Ma

@xueguang_ma

18 Oct

Are you running out of money to run LLM as a Judge evals? 📉 Introducing 🏜️ MIRAGE-Bench, a multilingual RAG benchmark using heuristic metrics to train a *surrogate* judge to approximate LLM as a Judge evals for a synthetic RAG-based leaderboard! Paper: arxiv.org/abs/2410.13716