I have been curious about what RAG built on top of scientific knowledge would look like… …until @AkariAsai showed me the OpenScholar prototype 😍 For an 8B model, it handles even subtle questions well, like whether BM25 or DPR is better! (demo is CS-only, expanding soon!)
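For anyone unfamiliar with the BM25-vs-DPR distinction that tweet alludes to, here is a toy sketch (definitely not OpenScholar's actual pipeline) contrasting sparse lexical scoring with DPR-style dense scoring; the corpus, query, and MiniLM encoder are all illustrative choices:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "BM25 is a sparse lexical retrieval function based on term statistics.",
    "Dense Passage Retrieval (DPR) embeds queries and passages with dual encoders.",
    "Retrieval-augmented generation conditions an LLM on retrieved documents.",
]
query = "How does dense retrieval differ from lexical matching?"

# Sparse: score by exact-term overlap statistics.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
sparse_scores = bm25.get_scores(query.lower().split())

# Dense: score by cosine similarity of learned embeddings.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
dense_scores = util.cos_sim(encoder.encode(query), encoder.encode(corpus))[0].tolist()

for doc, s, d in zip(corpus, sparse_scores, dense_scores):
    print(f"BM25={s:.2f}  dense={d:.2f}  {doc[:55]}")
```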
Episode 5 of our podcast is out! We discuss how complicated it is to assess intelligence, whether in humans, animals, or machines. With two fantastic guests: Comparative Psychologist Erica Cartmill and Computer Scientist Ellie Pavlick. Check it out! complexity.simplecast.com/episodes/natur…
I really enjoyed working with Wenda 😆💪
I am on the job market for full-time industry positions. My research focuses on text generation evaluation and LLM alignment. If you have relevant positions, I’d love to connect! Here is a list of my publications and a summary of my research:
Since this guarantee is model-agnostic by nature, we no longer have to rely solely on GPT-4 as a judge 🤩 We propose Cascaded Selective Evaluation, which runs a cascade of judge models instead of calling expensive GPT-4 for every evaluation! [3/n]
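A minimal sketch of the cascade control flow, assuming each judge returns a verdict plus a confidence score and that the per-judge thresholds were calibrated offline (that calibration is what the paper's guarantee rests on; all names here are hypothetical):

```python
from typing import Callable, Optional, Tuple

# (prompt, response_a, response_b) -> (verdict, confidence)
Judge = Callable[[str, str, str], Tuple[str, float]]

def cascaded_selective_eval(
    prompt: str,
    resp_a: str,
    resp_b: str,
    judges: list,       # ordered cheap -> expensive
    thresholds: list,   # one calibrated threshold per judge
) -> Optional[str]:
    """Return the verdict of the cheapest sufficiently confident judge,
    abstaining (None) if no judge clears its calibrated threshold."""
    for judge, tau in zip(judges, thresholds):
        verdict, confidence = judge(prompt, resp_a, resp_b)
        if confidence >= tau:
            return verdict  # accept: calibration bounds disagreement risk
    return None             # abstain rather than risk human disagreement
```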
LLM-as-a-judge has become the norm, but how can we be sure that it will really agree with human annotators? 🤔 In our new paper, we introduce a principled approach that provides LLM judges with provable guarantees of human agreement ⚡ #LLM #LLM_as_a_judge #reliable_evaluation 🧵[1/n]
Can OpenAI o1 handle complex planning tasks? Not really! It seems o1 gets even more confused by context than GPT-4o 😲. We test the TravelPlanner validation set with the updated o1 models and a fine-tuned GPT-4o. Key insights: 1. Mixed results for o1 and o1-mini: No…
[NEW PAPER ALERT!] In this work, we present PROFILE, a framework designed to discern the alignment of LLM-generated responses with human preferences at a fine-grained level (e.g., length, formality, and intent). Our key finding is a significant misalignment between LLMs’ outputs and…
📜New preprint! LLMs generate impressive text, but they often miss what humans actually prefer (e.g., by being too wordy). 🤯 The problem? We don’t have a precise way to pinpoint where these misalignments occur. That’s the gap we aim to fill! 🔍
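Not the actual PROFILE framework, just a toy illustration of the fine-grained idea: score model outputs and human-preferred responses along interpretable axes (length, plus a crude contraction-based formality proxy) and report the gaps. All helper names are made up:

```python
import statistics

def length_words(text: str) -> int:
    return len(text.split())

def contraction_rate(text: str) -> float:
    # Crude formality proxy: contractions per word (more = less formal).
    words = text.split()
    return sum("'" in w for w in words) / max(len(words), 1)

def profile_gap(model_outputs: list, human_preferred: list) -> dict:
    """Mean difference (model - human) on each fine-grained dimension."""
    return {
        "length_gap": statistics.mean(map(length_words, model_outputs))
                      - statistics.mean(map(length_words, human_preferred)),
        "contraction_gap": statistics.mean(map(contraction_rate, model_outputs))
                           - statistics.mean(map(contraction_rate, human_preferred)),
    }

# A positive length_gap would echo the "too wordy" misalignment above.
```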
🚨New Benchmark Alert🚨 Our paper, accepted to Findings of EMNLP 2024 🌴, introduces a new dataset, DynamicQA! DynamicQA contains inherently conflicting data (both disputable 🤷♀️ and temporal 🕰️), which is crucial for studying LMs’ internal memory conflicts. Work with @hayu204 🥳 #EMNLP2024 #NLProc
Can LLMs cater to diverse cultures in text generation? We find: 1️⃣lexical variance across nationalities 2️⃣culturally salient words 3️⃣weak correlation w/ cultural values 📜arxiv.org/abs/2406.11565 🤗huggingface.co/datasets/shail… 💻github.com/shaily99/eecc 🎉@emnlpmeeting🎉 w/ @841io 🧵
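Illustrative only, and not necessarily the paper's metric: one simple way to quantify lexical variance across groups is the pairwise Jaccard overlap of the vocabularies a model uses when generating for each nationality (lower overlap = more variance):

```python
from itertools import combinations

def vocab(texts: list) -> set:
    return {w.lower() for t in texts for w in t.split()}

def lexical_overlap(generations_by_group: dict) -> dict:
    """Pairwise Jaccard similarity of per-group vocabularies."""
    vocabs = {g: vocab(ts) for g, ts in generations_by_group.items()}
    return {
        (a, b): len(vocabs[a] & vocabs[b]) / len(vocabs[a] | vocabs[b])
        for a, b in combinations(vocabs, 2)
    }
```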
Excited to attend the first @COLM_conf 😝 Come check our work on Multi-lingual Factuality Evaluation on Wednesday morning and say hi 👋 📚: arxiv.org/pdf/2402.18045
Excited to attend the first @COLM_conf 🦙❤️🤩 Very open to talking with anyone about faculty/postdoc/PhD opportunities at KAIST, as well as multilingual and multicultural LLM research. Come join the multilingual special session on Wednesday morning, and find my students…
Can your LLM stay faithful to context, even if "The Moon 🌕 is Made of Marshmallows 🍡"? We introduce FaithEval, a new and comprehensive benchmark dedicated to evaluating contextual faithfulness in LLMs, with 4.9K high-quality question-context pairs across 3 challenging tasks:…
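A minimal sketch of this kind of counterfactual-faithfulness probe, with generate() standing in for whatever LLM call you use; the prompt template and string-matching check are simplifying assumptions, not FaithEval's actual protocol:

```python
from typing import Callable

def faithfulness_probe(
    generate: Callable[[str], str],
    context: str,
    question: str,
    context_answer: str,      # answer supported by the (counterfactual) context
    parametric_answer: str,   # answer the model likely memorized in pretraining
) -> str:
    prompt = (
        "Answer using ONLY the context below.\n"
        f"Context: {context}\nQuestion: {question}\nAnswer:"
    )
    out = generate(prompt).lower()
    if context_answer.lower() in out:
        return "faithful"      # stuck to the given context
    if parametric_answer.lower() in out:
        return "unfaithful"    # fell back on memorized knowledge
    return "other"

# e.g. context="The Moon is made of marshmallows.", question="What is the
# Moon made of?", context_answer="marshmallows", parametric_answer="rock"
```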
LLMs Know More Than They Show. We know very little about how and why LLMs "hallucinate," but it's an important topic nonetheless. This new paper finds that the "truthfulness" information in LLMs is concentrated in specific tokens. This insight can help enhance error detection…
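A hedged sketch of the probing idea (not the paper's exact recipe): train a linear probe on the hidden state at the answer token to predict whether the answer is correct, reading off the "truthfulness" signal the model encodes internally. GPT-2 is a stand-in model and the two-example dataset is purely illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def answer_token_state(text: str, layer: int = -1) -> torch.Tensor:
    """Hidden state at the final token, where the answer ends."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

# X: hidden states at answer tokens; y: whether the answer was correct.
texts = ["Q: Capital of France? A: Paris", "Q: Capital of France? A: Rome"]
labels = [1, 0]
X = torch.stack([answer_token_state(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict(X))  # a real setup would evaluate on held-out answers
```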
RLHF is a popular method: it boosts your human eval scores and Elo rating 🚀🚀. But does it really❓ Your model might be “cheating” you! 😈😈 We show that LLMs can learn to mislead human evaluators via RLHF. 🧵 below