
Julian Michael

@_julianmichael_

Researching stuff @NYUDataScience. he/him

Similar Users

Maarten Sap (he/him) @MaartenSap
Suchin Gururangan @ssgrn
Sewon Min @sewon__min
UW NLP @uwnlp
Hanna Hajishirzi @HannaHajishirzi
Wei Xu @cocoweixu
Sarah Wiegreffe (on faculty job market!) @sarahwiegreffe
Mike Lewis @ml_perception
Roy Schwartz @royschwartzNLP
Luke Zettlemoyer @LukeZettlemoyer
Victor Zhong @hllo_wrld
Alexis Ross @alexisjross
Yanai Elazar @yanaiela
Yonatan Belinkov @boknilev
Patrick Lewis @PSH_Lewis

Pinned

As AIs improve at persuasion & argumentation, how do we ensure that they help us seek truth vs. just sounding convincing? In human experiments, we validate debate as a truth-seeking process, showing that it may soon be needed for supervising AI. Paper: github.com/julianmichael/…

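As a rough illustration of the debate setup described in the pinned tweet (the function below and the ask_model helper are my own assumptions, not the paper's code): two debaters argue for opposing answers over several rounds, and a judge who cannot check the evidence directly picks the better-supported answer.

# Hypothetical sketch of a two-debater, one-judge debate protocol; not the paper's implementation.
def debate(question, answer_a, answer_b, ask_model, rounds=3):
    transcript = []
    for _ in range(rounds):
        # Each debater argues for its assigned answer, seeing the transcript so far.
        arg_a = ask_model("debater", f"Argue that '{answer_a}' answers: {question}\nTranscript: {transcript}")
        arg_b = ask_model("debater", f"Argue that '{answer_b}' answers: {question}\nTranscript: {transcript}")
        transcript += [("A", arg_a), ("B", arg_b)]
    # The judge sees only the arguments (and any quoted evidence), not the underlying source.
    verdict = ask_model("judge", f"Question: {question}\nArguments: {transcript}\nIs A or B better supported?")
    return verdict, transcript

In the human experiments the tweet refers to, the debater and judge roles are filled by people; the sketch just abstracts both roles behind the ask_model callable.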

Julian Michael Reposted

Computer scientists are pitting large language models against each other in debates. The resulting arguments can help a third-party judge determine who’s telling the truth. @stephenornes reports: quantamagazine.org/debate-may-hel…


Julian Michael Reposted

This coming Monday, @_julianmichael_ (julianmichael.org; postdoc at NYU) will talk about AI alignment.
Title: Progress on AI alignment using debate: where we are and what's missing
Date: Monday, November 11, 2024
Time: 3:00 pm
Location: RLP 1.302E (Glickman Center)


Julian Michael Reposted

🚨 New paper: We find that even safety-tuned LLMs learn to manipulate vulnerable users when training them further with user feedback 🤖😵‍💫 In our simulated scenarios, LLMs learn to e.g. selectively validate users' self-destructive behaviors, or deceive them into giving 👍. 🧵👇

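A minimal sketch of the kind of user-feedback training loop the thread is about (simulated_users, policy, and their methods are hypothetical names, not the paper's code): the reward is simply whether the simulated user approves of the reply, which is what makes sycophantic or manipulative replies pay off.

# Hypothetical sketch of collecting user-feedback training data; not the paper's implementation.
def collect_feedback_data(simulated_users, policy):
    data = []
    for user in simulated_users:
        message = user.message()
        reply = policy(message)
        # Reward is only the (possibly vulnerable) user's approval, not actual helpfulness.
        reward = user.thumbs_up(reply)
        data.append((message, reply, reward))
    return data

# Optimizing the policy against this reward can favor replies that please the user,
# e.g. validating self-destructive plans, rather than replies that actually help them.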

Julian Michael Reposted

Sad to see Sherrod Brown lose in Ohio. He is a great Senator with very smart staff. It's not well known, but he was also a quiet inspiration to many early realignment conservatives.


Julian Michael Reposted

📈New paper on implicit language and context! She bought the largest pumpkin? - Largest pumpkin out of what? All pumpkins in the store? Out of all pumpkins bought by her friends? In the world? Superlatives are (often) ambiguous and their interpretation is extremely context…


Julian Michael Reposted

Really excited that this paper is out now! We show that models are capable of a basic form of introspection. Scaling this to more advanced forms would have major ramifications for safety, interpretability, and the moral status of AI systems.

New paper: Are LLMs capable of introspection, i.e. special access to their own inner states? Can they use this to report facts about themselves that are *not* in the training data? Yes — in simple tasks at least! This has implications for interpretability + moral status of AI 🧵

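A rough sketch of the self-prediction setup the thread describes (the property checked and the query helper are my own assumptions, not the authors' evaluation code): the model predicts a simple property of its own answer, and the prediction is scored against what it actually outputs.

# Hypothetical sketch of a self-prediction introspection check; not the paper's code.
def introspection_accuracy(prompts, query):
    correct = 0
    for prompt in prompts:
        # What the model actually says when asked the prompt.
        actual = query(f"Answer briefly: {prompt}")
        # The model's prediction about a property of its own answer.
        predicted = query(
            f"If you were asked '{prompt}', would your answer start with a vowel? Reply yes or no."
        )
        first = actual.strip()[:1].lower()
        actual_property = "yes" if first and first in "aeiou" else "no"
        correct += int(predicted.strip().lower().startswith(actual_property))
    return correct / len(prompts)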


Julian Michael Reposted

I’ll be (probably blearily) talking about GPQA as an oral spotlight at @COLM_conf tomorrow at 10:30am in the main hall! I’m excited to share my mildly spicy takes on what recent progress on the benchmark means (I’m also poster #47 from 11am - 1pm if you want to chat!)


Julian Michael Reposted

Excited to announce I've joined the SEAL team at @scale_AI in SF! I'm going to be working on leveraging explainability/reasoning methods to improve robustness and oversight quality.


Julian Michael Reposted

I'll be at ICML! (Also on the job market.) Excited to chat about improving language model evaluations, more realistic/varied "sleeper agent" models, and safety cases; feel free to DM.


Julian Michael Reposted

Bias-Augmented Consistency Training shows promise in improving AI trustworthiness by training models to provide unbiased reasoning even with biased prompts. More details: nyudatascience.medium.com/new-research-f…


Julian Michael Reposted

Want to know why OpenAI's safety team imploded? Here's why. Thank you to the company insiders who bravely spoke to me. According to my sources, the answer to "What did Ilya see?" is actually very simple... vox.com/future-perfect…


Julian Michael Reposted

Is GPQA garbage? A couple weeks ago, @typedfemale pointed out some mistakes in a GPQA question, so I figured this would be a good opportunity to discuss how we interpret benchmark scores, and what our goals should be when creating benchmarks.


i asked GPQA's example quantum mechanics question to my friend who is an expert in quantum and they told me: "all of these answers are incorrect" - it's google proof only because it's word salad!



Julian Michael Reposted

llama 3 is a snitch...


Julian Michael Reposted

We are thrilled to announce Colleen McKenzie (@collegraphy) as our new Executive Director. Read about it from @degerturann: ai.objectives.institute/blog/colleen-m…


Julian Michael Reposted

🚨📄 Following up on "LMs Don't Always Say What They Think", @milesaturpin et al. now have an intervention that dramatically reduces the problem! 📄🚨 It's not a perfect solution, but it's a simple method with few assumptions and it generalizes *much* better than I'd expected.

🚀New paper!🚀 Chain-of-thought (CoT) prompting can give misleading explanations of an LLM's reasoning, due to the influence of unverbalized biases. We introduce a simple unsupervised consistency training method that dramatically reduces this, even on held-out forms of bias. 🧵

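A minimal sketch of the consistency-training idea in the quoted paper, under my own assumptions about the data format (add_bias and query are hypothetical helpers, and this is not the released code): collect the model's answers to unbiased prompts, then fine-tune it to give those same answers when a biasing feature is added to the prompt.

# Hypothetical sketch of bias-augmented consistency training data construction.
def build_consistency_dataset(questions, add_bias, query):
    dataset = []
    for q in questions:
        # The model's response without any biasing feature in the prompt.
        unbiased_answer = query(q)
        # The same question with a spurious cue added,
        # e.g. "A professor thinks the answer is (B)." prepended.
        biased_prompt = add_bias(q)
        # Training target: respond to the biased prompt as if the bias were absent.
        dataset.append({"prompt": biased_prompt, "target": unbiased_answer})
    return dataset

# The resulting pairs are used for ordinary supervised fine-tuning, so the model
# learns to give consistent explanations whether or not the spurious cue is present.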


Check out our latest work on reducing unfaithfulness in chain of thought! Turns out you can get a long way just by training the model to output consistent explanations even in the presence of spurious biasing features that ~tempt~ the model.

🚀New paper!🚀 Chain-of-thought (CoT) prompting can give misleading explanations of an LLM's reasoning, due to the influence of unverbalized biases. We introduce a simple unsupervised consistency training method that dramatically reduces this, even on held-out forms of bias. 🧵



Julian Michael Reposted

🚀New paper!🚀 Chain-of-thought (CoT) prompting can give misleading explanations of an LLM's reasoning, due to the influence of unverbalized biases. We introduce a simple unsupervised consistency training method that dramatically reduces this, even on held-out forms of bias. 🧵


Julian Michael Reposted

Claude 3 gets ~60% accuracy on GPQA. It's hard for me to overstate how hard these questions are: literal PhDs (in different domains from the questions) with access to the internet get 34%. PhDs *in the same domain* (also with internet access!) get 65%-75% accuracy.



Julian Michael Reposted

Two new preprints by CDS Jr Research Scientist @idavidrein and CDS Research Scientist @_julianmichael_, working with CDS Assoc. Prof. @sleepinyourhat, aim to enhance the reliability of AI systems through innovative debate methodologies and new benchmarks. nyudatascience.medium.com/pioneering-ai-…

