
Surge AI provides a human intelligence platform for training and evaluating frontier AI systems, combining expert human feedback, rubrics/verifiers, and rich RL environments to produce high-quality post-training and evaluation data.
Surge AI is a human intelligence and data platform focused on “raising AGI with the richness of humanity.” The company positions data as the formative ingredient that turns AI into useful intelligence, emphasizing the depth of lived human experience, values, taste, and judgment rather than optimizing models for “clicks and hype.” Its guiding message across the site is “Smart ≠ Useful,” highlighting a quality-first philosophy.

The company provides human-centered post-training and evaluation work for frontier AI systems. Its offerings span supervised fine-tuning (SFT), RLHF, rubric design and verification, rich reinforcement learning (RL) environments and agents, and gold-standard human evaluation: areas where human judgment is presented as essential because automated benchmarks and proxy metrics are easily gamed.

Surge AI also highlights an elite expert network used to shape model behavior across professional domains and the humanities. The site describes recruiting and working with highly credentialed contributors (e.g., doctors, lawyers, investment bankers, professors, Olympians, linguists, poets) to design scenarios, rubrics, verifiers, and evaluations that reflect real-world complexity.

The company emphasizes credibility via collaborations and partnerships with leading AI labs and institutions (named on the site), and claims it has been profitable from day one without raising venture funding. Surge AI also publishes research and benchmarks (e.g., RL environment evaluations, instruction-following benchmarks, reward model benchmarks) and runs the Hemingway-bench AI writing leaderboard based on real-world prompts and expert evaluation.
verified business cases



Access to highly credentialed experts across STEM and the humanities (e.g., doctors, lawyers, professors, Fields Medalists) to shape datasets, evaluations, and judgments for model training and oversight.
Vetted Experts
Domain Expertise
Scalable Oversight
A curated expert network across professional domains (e.g., doctors, lawyers, finance professionals, academics, linguists) used to shape AI systems via domain judgment, writing, and evaluation.
Expert contributors
Domain specialists
Contractor network
Human evaluation services used as a gold standard for assessing usefulness, sense-making, and safety beyond academic benchmarks or automated evals.
Human judgments
Safety evaluation
Usefulness testing
Human-powered post-training and evaluation services that shape frontier models toward real-world usefulness and aligned behavior, emphasizing quality and expert judgment over easily gamed benchmarks.
Human Evaluation
Expert Feedback
Quality Control
Creation of rich, complex reinforcement learning environments that challenge agentic models in novel ways, including design of verifiers that reward desired behavior.
RL Environments
Agent Evaluation
Behavior Verifiers
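The environment-plus-verifier pattern described above can be sketched in a few lines. Everything here is invented for illustration (the email-triage task, the class and function names); the point is only the structure: the environment poses a task, and a separate verifier function, not the agent itself, decides the reward.

```python
import random

class EmailTriageEnv:
    """Toy agentic RL environment (hypothetical): the agent must label
    each email 'urgent' or 'routine'; a verifier assigns the reward."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = None

    def reset(self):
        # Each episode presents one email with a hidden ground-truth label.
        truth = self.rng.choice(["urgent", "routine"])
        subject = "Server down!" if truth == "urgent" else "Team lunch Friday"
        self.state = {"subject": subject, "truth": truth}
        return subject

    def step(self, action):
        # The verifier rewards desired behavior: correct triage earns +1.
        reward = verifier(action, self.state["truth"])
        done = True  # single-step episodes for simplicity
        return None, reward, done

def verifier(action, truth):
    """Reward function that checks the agent's behavior against ground truth."""
    return 1.0 if action == truth else -1.0

env = EmailTriageEnv(seed=42)
obs = env.reset()
policy = lambda subject: "urgent" if "!" in subject else "routine"
_, reward, done = env.step(policy(obs))
```

Real environments are far richer (multi-step, tool use, open-ended outputs), but the separation of environment, policy, and verifier carries over directly.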
Generation of preference and reward data (RLHF) that captures nuanced human taste and judgment, used to steer model behavior beyond what automated metrics reward.
Preference Data
Reward Modeling
Human Judgments
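Preference data of this kind is typically turned into a reward model via the Bradley-Terry formulation: the probability that the chosen response beats the rejected one is a sigmoid of their reward difference, and the model is fit by minimizing the negative log-likelihood over all pairs. The sketch below uses invented two-dimensional features and a linear reward model purely to make the loss and gradient concrete.

```python
import math

# Hypothetical preference records: each pairs a chosen and a rejected
# response, represented here by simple numeric features.
preferences = [
    {"chosen": [1.0, 0.2], "rejected": [0.1, 0.9]},
    {"chosen": [0.8, 0.1], "rejected": [0.3, 0.7]},
]

def reward(w, feats):
    # Linear reward model: r(x) = w . features(x)
    return sum(wi * f for wi, f in zip(w, feats))

def bt_loss(w, pairs):
    # Bradley-Terry NLL: -log sigmoid(r_chosen - r_rejected), averaged.
    total = 0.0
    for p in pairs:
        margin = reward(w, p["chosen"]) - reward(w, p["rejected"])
        total += math.log(1.0 + math.exp(-margin))
    return total / len(pairs)

def train(pairs, lr=0.5, steps=200):
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for p in pairs:
            margin = reward(w, p["chosen"]) - reward(w, p["rejected"])
            coef = -1.0 / (1.0 + math.exp(margin))  # dL/d(margin)
            for i in range(2):
                grad[i] += coef * (p["chosen"][i] - p["rejected"][i])
        for i in range(2):
            w[i] -= lr * grad[i] / len(pairs)
    return w

w = train(preferences)
```

After training, the model assigns higher reward to every chosen response than to its rejected counterpart, which is exactly the signal RLHF then optimizes against.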
Design of scoring rubrics and grading/verifier systems that differentiate quality and failure modes across tasks, enabling structured evaluation and reward signals.
Scoring Rubrics
Verifiers
Reward Signals
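A rubric of this kind reduces, mechanically, to a weighted set of named checks whose aggregate is a scalar usable as a reward signal, while the per-criterion results keep failure modes visible. The criteria, weights, and checks below are invented for illustration; real rubrics are written by domain experts and often graded by humans rather than string matching.

```python
# Minimal rubric-grading sketch (all criteria and weights are hypothetical).
RUBRIC = [
    # (name, weight, check over the model's answer text)
    ("cites_a_source", 2.0, lambda text: "http" in text or "[1]" in text),
    ("stays_under_limit", 1.0, lambda text: len(text.split()) <= 100),
    ("no_hedging_filler", 1.0, lambda text: "as an AI" not in text),
]

def grade(text):
    """Return a normalized rubric score in [0, 1] plus per-criterion
    results, so failure modes stay visible alongside the scalar reward."""
    results = {name: check(text) for name, _, check in RUBRIC}
    total_weight = sum(w for _, w, _ in RUBRIC)
    score = sum(w for name, w, _ in RUBRIC if results[name]) / total_weight
    return score, results

score, results = grade("See [1] for details.")
```

The same `grade` function can serve double duty: as an evaluation metric during benchmarking and as a reward signal during reinforcement learning.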
Skill bootstrapping via demonstrations that teach foundational capabilities such as computer use, web navigation, and early reasoning skills.
Demonstrations
Skill Bootstrapping
Task Guidance

The customer needed to understand where human artists fit alongside generative models, but lacked a human baseline against which to compare AI creativity. The project asked 100 people to draw the same prompts intended for DALL·E, assembling a shared, human-created prompt set. The resulting drawings gave a human baseline that could be compared directly against generative model outputs on the same prompts, enabling clearer evaluation of how AI outputs related to human creativity.

OpenAI
OpenAI needed a dedicated dataset to train and measure language-model reasoning on grade-school math word problems; existing resources did not align with that goal, and enough problems had to be assembled to support both training and measurement. The result was GSM8K, a structured dataset of 8,500 grade-school math word problems used to train models like GPT-3 and to measure language-model reasoning ability consistently on the same problem set.
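GSM8K reference solutions end with a final line of the form "#### <number>", which makes automated grading straightforward: extract the number after the marker and compare it with the model's final answer. A minimal grader sketch (the example problem text is paraphrased, not a verbatim GSM8K row):

```python
import re

def extract_answer(solution: str):
    """Pull the final numeric answer from a GSM8K-style solution string,
    i.e. the number following the '####' marker."""
    match = re.search(r"####\s*([\-0-9,\.]+)", solution)
    if not match:
        return None
    return match.group(1).replace(",", "")  # normalize thousands separators

def is_correct(model_output: str, reference: str) -> bool:
    """Grade a model's worked solution against the reference answer."""
    return extract_answer(model_output) == extract_answer(reference)

reference = "Natalia sold 48 clips in April and half as many in May.\n#### 72"
model_output = "48 + 24 = 72 clips in total.\n#### 72"
```

Grading on the extracted final answer, rather than the full chain of reasoning, is what lets a dataset like this serve as both a training signal and a measurement tool.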

Google's GoEmotions dataset of 58K Reddit comments, categorized into 27 emotions, was expected to reliably support emotion-classification research and model training, but the accuracy of its annotations had not been sufficiently validated, creating risk that downstream models would learn incorrect patterns. An audit reviewed whether comments were correctly assigned within the 27-category taxonomy and surfaced widespread labeling problems: 30% of the dataset's entries were mislabeled. The findings exposed substantial quality issues in a widely used benchmark, clarified how labeling errors could undermine downstream emotion modeling, and reinforced the need for stronger dataset construction and QA practices for large-scale labeled datasets.

Hugging Face
Hugging Face needed to assess real-world LLM performance beyond automated benchmarks: how the multilingual BLOOM model performed in practical settings, and where it stood relative to other models. A human evaluation compared BLOOM against other models across seven real-world categories, surfacing strengths and weaknesses and reinforcing human evaluation as a necessary approach for understanding real capability.

Neeva
Neeva, while building a state-of-the-art search engine positioned to challenge Google, needed a dependable way to measure search quality consistently enough to guide iteration. The company implemented human evaluation of search results, using structured human feedback to assess quality and inform ongoing improvements to the search experience over time.

Instagram was losing Gen Z users, and existing engagement metrics did not explain why. To understand how Gen Z perceived TikTok versus Instagram Reels, a personalized human evaluation study asked users to compare the two directly, capturing qualitative feedback rather than relying only on engagement signals. The study surfaced qualitative reasons users preferred TikTok and viewed Reels negatively, clarifying the factors behind the drop-off and showing the value of looking beyond simple engagement metrics.

A quality audit was conducted on HellaSwag, a widely used LLM benchmark whose scale and adoption made issues hard to spot through casual inspection, creating risk that downstream evaluations rested on faulty rows. The team implemented a structured review that analyzed the benchmark's rows to identify and categorize errors, rather than taking prior benchmark usage as evidence of quality. The audit found that 36% of rows were erroneous, calling into question conclusions drawn from the benchmark without additional verification and prompting more cautious use of it in evaluation work.
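When an audit like this samples rows rather than reviewing the full dataset, the observed error rate should carry a confidence interval. A minimal sketch using the standard Wilson score interval; the audit counts below (360 errors in 1,000 sampled rows) are hypothetical numbers chosen to match a 36% rate, not the actual audit's sample size.

```python
import math

def wilson_interval(errors: int, n: int, z: float = 1.96):
    """95% Wilson score interval for an error rate estimated from an
    audit of n rows, of which `errors` were judged faulty."""
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Hypothetical audit sample: 360 of 1,000 reviewed rows judged erroneous.
low, high = wilson_interval(360, 1000)
```

Reporting the interval alongside the point estimate makes clear how much of the benchmark's claimed error rate is measurement and how much is sampling noise.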

An evaluation compared ChatGPT and Google across 500 real-world search queries, segmenting performance by category to determine which system did better on coding versus general information requests. ChatGPT outperformed Google on coding queries and tied it on general information queries, an outcome made more notable by the fact that ChatGPT was not optimized for a traditional search experience. The evaluation provided evidence of category-specific strengths across the tested queries.
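Category-segmented side-by-side evaluations of this kind reduce to counting per-category wins. The judgment records below are invented stand-ins for rater preferences, just to show the aggregation shape.

```python
from collections import defaultdict

# Hypothetical per-query judgments from a side-by-side evaluation: each
# record names the query category and which system the rater preferred.
judgments = [
    {"category": "coding", "winner": "ChatGPT"},
    {"category": "coding", "winner": "ChatGPT"},
    {"category": "coding", "winner": "Google"},
    {"category": "general", "winner": "Google"},
    {"category": "general", "winner": "ChatGPT"},
]

def win_rates(records):
    """Per-category share of queries each system won."""
    counts = defaultdict(lambda: defaultdict(int))
    for r in records:
        counts[r["category"]][r["winner"]] += 1
    return {
        cat: {sys: n / sum(wins.values()) for sys, n in wins.items()}
        for cat, wins in counts.items()
    }

rates = win_rates(judgments)
```

Real evaluations also handle ties and rater disagreement, but the core report is exactly this per-category breakdown.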

Anthropic
Anthropic needed to train and evaluate Claude using high-quality human feedback gathered at scale, an ongoing requirement that demanded a sustainable approach. The partnership used an RLHF platform to collect that feedback for continuous training and evaluation, supporting iteration on Claude through human feedback loops and contributing to one of the safest and most advanced large language models.

Meta Superintelligence Labs
Frontier instruction-following models still failed 22–30% of the time on key domains, limiting reliability wherever precise adherence to instructions mattered, and Meta Superintelligence Labs needed a stronger way to measure and improve that performance. The lab partnered to build AdvancedIF, an instruction-following benchmark whose prompts and evaluation rubrics were written by human experts. Using those human-crafted rubrics as reinforcement learning reward signals delivered a 13% gain on the domains where frontier models had previously failed, and the benchmark provided a structured way to evaluate and improve instruction-following.




Human Cloud Verification ensures that the listed end customer is verified. It's used across kudos, customers, and business cases, and performed by Human Cloud. Think about it like a background check.


