Evaluating Long-Context Question & Answer Systems
Evaluation metrics, how to build eval datasets, eval methodology, and a review of several benchmarks.
Recsys & search are converging with LLMs via semantic IDs, data augmentation, and unified foundation models.
What makes a good leader? What do good leaders do? And commando, soldier, and police leadership.
Learning to automate simple agentic workflows with Amazon Q CLI, Anthropic MCP, and tmux.
Special thanks to John Schulman for a lot of super valuable feedback and direct edits on this post. Test time compute (Graves et al. 2016, Ling,…
Applying the scientific method, building via eval-driven development, and monitoring AI output.
How I started, why I write, who I write for, how I write, and more.
I’m freezing this blog and starting to post on my Substack instead. The authoring experience is much more convenient for me there. Please follow me there,…
Chip Huyen and I share what we've learned, best practices, and insights at NVIDIA GTC 2025.
Model architectures, data generation, training paradigms, and unified frameworks inspired by LLMs.
As we’re still in the early days of building applications with foundation models, it’s normal to make mistakes. This is a quick note with examples of…
Exploring what an AI-powered reading experience could look like.
Intelligent agents are considered by many to be the ultimate goal of AI. The classic book by Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern…
A peaceful year of steady progress on my craft and health.
nbsanity - Share Notebooks as Polished Web Pages in Seconds
With regard to writing, there are many rules and also no rules at all.
Building an Audience Through Technical Writing: Strategies and Mistakes
Reward hacking occurs when a reinforcement learning (RL) agent exploits flaws or ambiguities in the reward function to achieve high rewards, without genuinely learning or completing…
Benefits of running a weekly paper club, how to start one, and how to read and facilitate papers.
Setting up my new MacBook Pro from scratch
ML systems, production & scaling, execution & collaboration, building for users, conference etiquette.
Using LLM-as-a-Judge For Evaluation: A Complete Guide
Look at and label your data, build and evaluate your LLM-evaluator, and optimize it against your labels.
Being a human judge at the Weights & Biases LLM-as-a-Judge Hackathon