
Arena (arena.ai)

Machine Learning Scientist

Reposted 16 Days Ago
Remote or Hybrid
Hiring Remotely in CA
Expert/Leader
Responsible for designing experiments to evaluate AI models, developing evaluation methodologies, analyzing performance data, and collaborating on production insights.
About Arena Intelligence

Arena is the platform for evaluating how AI models perform in the real world. Founded by researchers from UC Berkeley's SkyLab, we're on a mission to measure and advance the frontier of AI for real-world use, and to build the foundation for everyone to understand, shape, and benefit from it.


Tens of millions of people use Arena each month to evaluate how frontier systems handle the work they actually do. The preferences they share power the most transparent, rigorous, and human-centered evaluations in AI. Leading AI labs, enterprises, and independent researchers rely on our work and open datasets to understand how models behave in real workflows: agentic coding, creative generation, professional productivity, and beyond. We go beyond leaderboards and decompose what human experience reveals about AI, so models advance toward the work people actually do.


We're a team of researchers, academics, builders, and creatives from UC Berkeley, Google, Stanford, and DeepMind. We seek truth, move fast, and value craftsmanship, curiosity, and impact over hierarchy. We're building a company where thoughtful, curious people from all backgrounds can do their best work together, in an office culture that radiates excellence, energy, and focus.

About the Role

Arena Intelligence is seeking a variety of Machine Learning Scientists to help advance how we evaluate and understand AI models. You’ll help design and analyze experiments that uncover what makes models useful, trustworthy, and capable through human preference signals. Your work will contribute to the scientific foundations of understanding AI at scale.

This role is deeply interdisciplinary. You’ll work closely with engineers, product teams, marketing and the broader research community to develop new methods for comparing models, analyzing preference data, and disentangling performance factors like style, reasoning, and robustness. Your work will inform both the public leaderboard and the tools we provide to model developers.

If you’re excited by open-ended questions, rigorous evaluation, and research that’s grounded in real-world impact, you’ll find a meaningful home here.

We’re looking for:

  • Hands-on experience training large-scale models, including reward models, preference models, and fine-tuning LLMs with methods like RLHF, DPO, and contrastive learning.

  • Strong foundation in ML and statistics, with a track record of designing novel training objectives, evaluation schemes, or statistical frameworks to improve model reliability and alignment.

  • Fluency in the full experimental stack, from dataset design and large-batch training to rigorous evaluation and ablation, with an eye for what scales to production.

  • Deeply collaborative mindset, working closely with engineers to productionize research insights and iterating with product teams to align modeling goals with user needs.

You’ll
  • Design and conduct experiments to evaluate AI model behavior across reasoning, style, robustness, and user preference dimensions

  • Develop new metrics, methodologies, and evaluation protocols that go beyond traditional benchmarks

  • Analyze large-scale human voting and interaction data to uncover insights into model performance and user preferences

  • Collaborate with engineers to implement and scale research findings into production systems

  • Prototype and test research ideas rapidly, balancing rigor with iteration speed

  • Author internal reports and external publications that contribute to the broader ML research community

  • Partner with model providers to shape evaluation questions and support responsible model testing

  • Contribute to the scientific integrity and transparency of the Arena Intelligence leaderboard and tools

You’ll have
  • PhD or equivalent research experience in Machine Learning, Natural Language Processing, Statistics, or a related field

  • Strong understanding of LLMs and modern deep learning architectures and methods (e.g., Transformers, diffusion models, reinforcement learning from human feedback)

  • Proficiency in Python and ML research libraries such as PyTorch, JAX, or TensorFlow

  • Demonstrated ability to design and analyze experiments with statistical rigor

  • Experience publishing research or working on open-source projects in ML, NLP, or AI evaluation

  • Comfortable working with real-world usage data and designing metrics beyond standard benchmarks

  • Ability to translate research questions into practical systems and collaborate across engineering and product teams

  • Passion for open science, reproducibility, and community-driven research

What we offer
  • Competitive compensation and equity aligned to the markets where our team members are based. The base salary range depends on the candidate’s permanent work location.

  • Comprehensive health and wellness benefits, including medical, dental, vision, and additional support programs.

  • The opportunity to work on cutting-edge AI with a small, mission-driven team

  • A culture that values transparency, trust, and community impact

Come help build the space where anyone can explore and help shape the future of AI.

Arena Intelligence provides equal employment opportunities (EEO) to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability, genetics, sexual orientation, gender identity, or gender expression. We are committed to a diverse and inclusive workforce and welcome people from all backgrounds, experiences, perspectives, and abilities.
