
Juhyun Oh

@juhyunohh

PhD student at KAIST

Juhyun Oh Reposted

I have been curious about what RAG built on top of scientific knowledge would look like… …until @AkariAsai showed me the OpenScholar prototype 😍 For an 8B model, it handles even subtle questions well, like whether BM25 or DPR is better! (demo is CS-only, expanding soon!)
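The BM25-vs-DPR question contrasts sparse lexical scoring with dense embedding retrieval. As a rough illustration of the sparse side, here is a minimal BM25 sketch; this is not OpenScholar's implementation, and the function name and parameters are illustrative only:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with BM25 (sparse lexical retrieval)."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    N = len(tokenized)
    # document frequency: how many docs contain each term
    df = Counter()
    for d in tokenized:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # term-frequency saturation (k1) and length normalization (b)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

DPR, by contrast, would embed query and passages with learned encoders and rank by dot product, which is why the two can disagree on subtle queries.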


1/ Introducing ᴏᴘᴇɴꜱᴄʜᴏʟᴀʀ: a retrieval-augmented LM to help scientists synthesize knowledge 📚 @uwnlp @allen_ai With open models & 45M-paper datastores, it outperforms proprietary systems & matches human experts. Try out our demo! We also introduce ꜱᴄʜᴏʟᴀʀQᴀʙᴇɴᴄʜ,…



Juhyun Oh Reposted

Episode 5 of our podcast is out! We discuss how complicated it is to assess intelligence, whether in humans, animals, or machines. With two fantastic guests: Comparative Psychologist Erica Cartmill and Computer Scientist Ellie Pavlick. Check it out! complexity.simplecast.com/episodes/natur…


I really enjoyed working with Wenda 😆💪

I am on the job market for full-time industry positions. My research focuses on text generation evaluation and LLM alignment. If you have relevant positions, I’d love to connect! Here are a list of my publications and a summary of my research:



Juhyun Oh Reposted

Since this guarantee is model-agnostic by nature, we no longer have to rely solely on GPT-4 as a judge 🤩 We propose Cascaded Selective Evaluation, which runs cascades of cheaper judge models instead of expensive GPT-4 for every evaluation! [3/n]
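The cascade idea in this tweet can be sketched as: try the cheapest judge first, accept its verdict only if its confidence clears a threshold, otherwise escalate, and abstain if no judge is confident. This is a hypothetical sketch, not the paper's actual API; the names and thresholds are illustrative:

```python
def cascaded_judge(example, judges, thresholds):
    """Run judge models cheapest-first; return the first verdict whose
    confidence clears that judge's threshold, else abstain (None).

    `judges` is a list of callables returning (verdict, confidence);
    `thresholds` pairs one cutoff with each judge.
    """
    for judge, tau in zip(judges, thresholds):
        verdict, confidence = judge(example)
        if confidence >= tau:
            return verdict  # this judge is confident enough; stop here
    return None  # no judge was confident: abstain / defer

# Toy usage with stand-in judges
weak = lambda x: ("A", 0.6)    # cheap model, low confidence
strong = lambda x: ("B", 0.95)  # expensive model, high confidence
cascaded_judge("some pair", [weak, strong], [0.8, 0.9])  # escalates to strong
```

The per-judge thresholds are where the human-agreement guarantee would come in: they are calibrated so that accepted verdicts meet a target agreement rate.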


Juhyun Oh Reposted

LLM-as-a-judge has become a norm, but how can we be sure that it will really agree with human annotators? 🤔In our new paper, we introduce a principled approach to provide LLM judges with provable guarantees of human agreement⚡ #LLM #LLM_as_a_judge #reliable_evaluation 🧵[1/n]


Juhyun Oh Reposted

Can OpenAI o1 handle complex planning tasks? ——Not really! It seems o1 gets even more confused by context than GPT-4o 😲. We test the TravelPlanner validation set with the updated o1 models and fine-tuned GPT-4o. Key insights: 1. Mixed Results for o1 and o1-mini: No…


Thanks AK for sharing! 🫣How far are language agents hill-climbing towards human-level planning?——⚡️Introducing TravelPlanner, a benchmark for real-world planning



Juhyun Oh Reposted

[NEW PAPER ALERT!] In this work, we present PROFILE, a framework designed to discern the alignment of LLM-generated responses with human preferences at a fine-grained level (length, formality, intent, etc.). Our key finding is a significant misalignment between LLM's output and…

📜New preprint! LLMs generate impressive texts, but often miss what humans actually prefer—like being too wordy. 🤯 The problem? We don’t have a precise way to pinpoint where these misalignments occur. That’s the gap we aim to fill!🔍



Juhyun Oh Reposted

🚨New Benchmark Alert🚨 Our paper accepted to Findings of EMNLP 2024🌴 introduces a new dataset, DynamicQA! DynamicQA contains inherently conflicting data (both disputable🤷‍♀️ & temporal🕰️) crucial to studying LM’s internal memory conflict. Work with @hayu204 🥳 #EMNLP2024 #NLProc


Juhyun Oh Reposted

Can LLMs cater to diverse cultures in text generation? We find: 1️⃣lexical variance across nationalities 2️⃣culturally salient words 3️⃣weak correlation w/ cultural values 📜arxiv.org/abs/2406.11565 🤗huggingface.co/datasets/shail… 💻github.com/shaily99/eecc 🎉@emnlpmeeting🎉 w/ @841io 🧵


Excited to attend the first @COLM_conf 😝 Come check our work on Multi-lingual Factuality Evaluation on Wednesday morning and say hi 👋 📚: arxiv.org/pdf/2402.18045

Excited to attend the first @COLM_conf 🦙❤️🤩 Very open to talk to anyone about faculty/postdoc/phd opportunities at KAIST, as well as about multilingual multicultural LLM research. Come join the multilingual special session on Wednesday morning, and find my students…



Juhyun Oh Reposted

Can your LLM Stay Faithful to Context, Even If "The Moon 🌕 is Made of Marshmallows 🍡"? We Introduce FaithEval, a new and comprehensive benchmark dedicated to evaluating contextual faithfulness for LLMs with 4.9K high-quality question-context pairs across 3 challenging tasks:…


Juhyun Oh Reposted

LLMs Know More Than They Show We know very little about how and why LLMs "hallucinate" but it's an important topic nonetheless. This new paper finds that the "truthfulness" information in LLMs is concentrated in specific tokens. This insight can help enhance error detection…


Juhyun Oh Reposted

RLHF is a popular method. It boosts your human eval scores and Elo rating 🚀🚀. But really❓Your model might be “cheating” you! 😈😈 We show that LLMs can learn to mislead human evaluators via RLHF. 🧵below

