
Mayank Mishra

@MishraMish98


Mayank Mishra Reposted

Why do we treat train and test times so differently? Why is one "training" and the other "in-context learning"? Just take a few gradient steps at test time, a simple way to increase test-time compute, and get SoTA on the ARC public validation set: 61%, equal to the average human score! @arcprize

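The core trick reads like ordinary fine-tuning, just applied per task at inference. A minimal sketch, assuming a classifier-style `model` and a list of `(input, label)` demonstration pairs; all names here are placeholders, not the authors' code:

```python
import copy
import torch
import torch.nn.functional as F

def test_time_adapt(model, demos, query, lr=1e-4, steps=5):
    adapted = copy.deepcopy(model)              # keep the base model untouched across tasks
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    adapted.train()
    for _ in range(steps):
        for x, y in demos:                      # the task's in-context examples
            loss = F.cross_entropy(adapted(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    adapted.eval()
    with torch.no_grad():
        return adapted(query).argmax(dim=-1)    # prediction for the test input
```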

Mayank Mishra Reposted

A quantization method that accounts for how many instructions it takes on GPUs to dequantize! We're past just counting FLOPs or memory accesses; now we're counting instructions.

🧵 🏎️ Want faster, better quantized LLMs? Introducing QTIP, a new LLM quantization method that achieves a SOTA combination of quality and speed – outperforming methods like QuIP#! 🧑‍💻+🦙(w/ 2 bit 405B!): github.com/Cornell-RelaxM… 📜arxiv.org/abs/2406.11235

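For intuition about what the dequantization step actually costs, here is a plain 2-bit absmax group quantizer in PyTorch. This is not QTIP (which uses trellis coding); it only shows the quantize/dequantize round trip whose per-weight instruction count the tweet is about.

```python
import torch

def quantize_2bit(w, group_size=64):
    # Group weights; store a 2-bit code per weight plus one fp scale per group.
    w = w.reshape(-1, group_size)
    scale = (w.abs().amax(dim=1, keepdim=True) / 1.5).clamp_min(1e-8)  # codes map to {-1.5,-0.5,0.5,1.5}*scale
    q = torch.clamp(torch.round(w / scale - 0.5), -2, 1)
    return q.to(torch.int8), scale

def dequantize_2bit(q, scale, shape):
    # One multiply-add per weight here; real methods differ in how many GPU
    # instructions this step takes once codes are packed into bit streams.
    return ((q.float() + 0.5) * scale).reshape(shape)

w = torch.randn(128, 64)
q, s = quantize_2bit(w)
print((dequantize_2bit(q, s, w.shape) - w).abs().mean())   # average quantization error
```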


We have released a torch.compile-friendly version of the scatterMoE kernel. Speedups are around 5-7% on 64 H100s for a 1B MoE model. Larger MoEs, or MoEs with higher compute density, will benefit more from the optimization. Code: github.com/mayank31398/ke…


Mayank Mishra Reposted

We have updated our PowerLM series models. They are now under Apache 2.0. And with a slight tweak to the data mix, they perform better than the previous version. PowerLM-3B: huggingface.co/ibm/PowerLM-3b PowerMoE-3B (800M active params): huggingface.co/ibm/PowerMoE-3b


Smol models FTW

PowerMoE from IBM looks underrated
- Trained on just 1T (PowerLM 3B) & 2.5T (PowerMoE 0.8B active, 3B total)
- open model weights
- comparable perf to Gemma, Qwen 🔥
> Two-stage training scheme
> Stage 1 linearly warms up the learning rate and then applies the power decay
> Stage…

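A sketch of what such a schedule can look like: linear warmup followed by power-law decay. This is my reading of the tweet; the exact exponent and constants used for the PowerLM models may differ.

```python
import torch

def warmup_power_decay(step, warmup_steps=2000, power=-0.5):
    # Returns the multiplier applied to the base learning rate.
    if step < warmup_steps:
        return step / warmup_steps                 # Stage 1: linear warmup
    return (step / warmup_steps) ** power          # then power-law decay

params = [torch.nn.Parameter(torch.zeros(1))]      # stand-in for model parameters
optimizer = torch.optim.AdamW(params, lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_power_decay)

for step in range(5000):
    optimizer.step()
    scheduler.step()
```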


Padding-free transformers now accelerate model training natively in the Hugging Face library: 2x throughput improvement without any approximations. NO CUSTOM DEVICE KERNELS!! (except Flash Attention)

Want to get a 2x throughput improvement on your tuning jobs across various HF models without changing any code or affecting model quality? Now you can simply use Hugging Face transformers and TRL to do this! Read more here: research.ibm.com/blog/hugging-f… Key findings: 1. Simple sequence…
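The trick, roughly: instead of padding every sequence in a batch to the same length, concatenate them into one row and give Flash Attention position IDs that reset at each boundary. A toy illustration of the packing (not IBM's or HF's actual collator code):

```python
import torch

seqs = [torch.tensor([5, 6, 7]), torch.tensor([8, 9]), torch.tensor([10, 11, 12, 13])]

input_ids = torch.cat(seqs).unsqueeze(0)
position_ids = torch.cat([torch.arange(len(s)) for s in seqs]).unsqueeze(0)

print(input_ids)     # tensor([[ 5,  6,  7,  8,  9, 10, 11, 12, 13]])
print(position_ids)  # tensor([[0, 1, 2, 0, 1, 0, 1, 2, 3]])
# No pad tokens anywhere: every position carries a real token, which is
# where the throughput win comes from.
```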



Mayank Mishra Reposted

Here is a new Machine Learning Engineering chapter: Network debug github.com/stas00/ml-engi… The intention is to help non-network engineers figure out how to resolve common problems around multi-GPU and multi-node collectives networking - it's heavily NCCL-biased at the moment.…

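A typical first step when following that kind of guide is a bare-bones collective smoke test: if this hangs or errors, the problem sits in NCCL or the network fabric rather than in your training code. This snippet is mine, not from the chapter.

```python
# Launch with e.g.: torchrun --nproc_per_node=8 allreduce_check.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

t = torch.ones(1, device="cuda")
dist.all_reduce(t)                      # a hang or error here usually means NCCL/fabric misconfig
print(f"rank {rank}/{dist.get_world_size()}: all_reduce -> {t.item()}")

dist.destroy_process_group()
```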

Mayank Mishra Reposted

ScatterMoE is accepted by CoLM. See you in Philadelphia!

Scattered Mixture-of-Experts Implementation
- Presents ScatterMoE, an implementation of Sparse Mixture-of-Experts on GPU
- Enables higher throughput and a lower memory footprint
repo: github.com/shawntan/scatt…
abs: arxiv.org/abs/2403.08245

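For readers new to the layer being optimized, here is a deliberately naive top-k routed MoE forward pass; ScatterMoE fuses the scatter/gather and per-expert matmuls that this loop spells out. Reference sketch only, not the ScatterMoE kernels.

```python
import torch
import torch.nn as nn

class NaiveMoE(nn.Module):
    def __init__(self, d_model=64, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            tok, slot = (idx == e).nonzero(as_tuple=True)   # tokens routed to expert e
            if tok.numel():
                out[tok] += weights[tok, slot].unsqueeze(-1) * expert(x[tok])
        return out

print(NaiveMoE()(torch.randn(10, 64)).shape)                # torch.Size([10, 64])
```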


Mayank Mishra Reposted

With a few tricks, Llama-3-8B can be continually pre-trained to outperform GPT-4 on medical tasks. For more details, check our paper Efficient Continual Pre-training by Mitigating the Stability Gap (arxiv.org/abs/2406.14833)!


We have released 4-bit GGUF versions of all Granite Code models for local inference. 💻 The models can be found here: huggingface.co/collections/ib…

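One common way to run a GGUF file locally is via llama-cpp-python; the file name below is a placeholder, so grab the actual 4-bit GGUF from the linked collection.

```python
from llama_cpp import Llama

# Hypothetical local path; download the real file from the collection.
llm = Llama(model_path="granite-8b-code-instruct.Q4_K_M.gguf", n_ctx=4096)

out = llm("def quicksort(arr):", max_tokens=128)
print(out["choices"][0]["text"])
```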

New preprint out with colleagues from MIT and IBM "Reducing Transformer Key-Value Cache Size with Cross-Layer Attention": arxiv.org/abs/2405.12981 We introduce a simple mechanism of sharing keys and values across layers, reducing the memory needed for KV cache during inference!!

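A rough single-head sketch of the idea (my illustration, not the paper's architecture): two attention layers reuse one set of key/value projections, so only one layer's K/V has to be kept in the cache.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVBlock(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.k_proj = nn.Linear(d, d)                       # shared across both layers
        self.v_proj = nn.Linear(d, d)                       # shared across both layers
        self.q_projs = nn.ModuleList(nn.Linear(d, d) for _ in range(2))
        self.o_projs = nn.ModuleList(nn.Linear(d, d) for _ in range(2))

    def forward(self, x):                                   # x: (batch, seq, d), single head
        k, v = self.k_proj(x), self.v_proj(x)               # the only K/V that needs caching
        for q_proj, o_proj in zip(self.q_projs, self.o_projs):
            attn = F.scaled_dot_product_attention(q_proj(x), k, v, is_causal=True)
            x = x + o_proj(attn)
        return x

print(SharedKVBlock()(torch.randn(2, 16, 64)).shape)        # torch.Size([2, 16, 64])
```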

Mayank Mishra Reposted

JetMoE and IBM Granite Code models are now natively available in Hugging Face Transformers v4.41! github.com/huggingface/tr…


Yes sir :)

The SOTA code model on huggingface is from....IBM?



Mayank Mishra Reposted

Our IBM Granite Code series models are finally released today. Despite the strong code performance that you should definitely check out, I also want to point out that the math reasoning performance of our 8B models is unexpectedly good. Congrats to all our teammates!…


Open-sourcing Granite Code models (3B, 8B, 20B, 34B) trained on 3-4 trillion tokens of code.
→ Completely Apache 2.0
→ Outperforming all openly available models
→ Amazing mathematical and reasoning performance
Paper: github.com/ibm-granite/gr…
Models: huggingface.co/collections/ib…

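A standard Transformers quickstart works for these models; the repo id below is a guess, so take the real one from the linked collection.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-8b-code-base"   # placeholder id, check the collection
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tok("def fibonacci(n):", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```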

Unveiling BRAIn, a new method of aligning LLMs with preference data. Kudos to @gauravpandeyamu for leading this effort. The work has been accepted for publication at ICML.

Minimizing the forward KL wrt the PPO-optimal policy (proceedings.neurips.cc/paper_files/pa……) doesn't perform as well for RLHF as PPO and DPO. Or does it? In our ICML paper (arxiv.org/abs/2402.02479), we show that it actually performs much better if an appropriate baseline is chosen.



New way of training MoEs. Thanks for all your hard work @BowenPan7

Thrilled to unveil DS-MoE: a dense training and sparse inference scheme for enhanced computational and memory efficiency in your MoE models! 🚀🚀🚀 Discover more in our blog: huggingface.co/blog/bpan/ds-m… and dive into the details with our paper: arxiv.org/pdf/2404.05567…
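A toy reading of the dense-train / sparse-infer idea (not DS-MoE itself; the actual scheme is in the linked paper): weight all experts by the full router softmax during training, but keep only the top-k experts at inference.

```python
import torch
import torch.nn as nn

class DenseTrainSparseInferMoE(nn.Module):
    def __init__(self, d=64, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d, num_experts)
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d)
        gate = self.router(x).softmax(-1)              # dense gating over all experts
        if not self.training:                          # sparse inference: zero out all but top-k
            keep = gate.topk(self.top_k, dim=-1).indices
            gate = gate * torch.zeros_like(gate).scatter(1, keep, 1.0)
        return sum(gate[:, e:e + 1] * expert(x) for e, expert in enumerate(self.experts))

m = DenseTrainSparseInferMoE()
m.eval()
print(m(torch.randn(10, 64)).shape)                    # torch.Size([10, 64])
```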


