Evidence for defining skills, developing metrics, and evaluating the reliability and validity of these assessments has been described.3 Before the simulation component of the CHEST Challenge began, the facilitator and graders rehearsed the clinical scenarios and agreed on standardized scoring by consensus. Players were oriented to the capabilities and limitations of the mannequin. Graders used valid, behaviorally anchored checklists (each core action was scored as present or absent), although holistic (global) scoring also has value. Two graders were used to support interrater reliability, but adding simulation tasks (broader domain sampling) may be the most effective way to improve overall reliability. Generalizability studies, if performed, can further identify the sources and magnitude of measurement error and inform test design. A trained facilitator played the role of the ICU nurse to ensure consistency and to compensate for the simulator's technical limitations (eg, the mannequin does not sweat or change skin color). As much as possible, player participants were required to perform interventions rather than merely verbalize them. Finally, encounters were video recorded; although these recordings were intended for promotional and quality assurance purposes and are not necessarily considered superior to oral debriefing,5 they can provide learners with valuable feedback and insights.