Powering LLM Regulation Through Data: Bridging the Gap from Compute Thresholds to Customer Experiences
A Framework for Domain-Specific AI Evaluation, Consumer Protection, and More Innovation
My paper from the Regulatable ML workshop at NeurIPS is now publicly available, so I thought I’d write an analysis discussing the paper and addressing what has changed in the four months since it was finalized. The links can be found below:
Counters to Compute-Level Regulation
The core thesis of the paper is that compute-level regulation is not a sufficient form of consumer protection, because it serves only as a proxy for model capabilities and is not directly related to the content users interact with through large language model (LLM) based experiences.
Compute-level regulation has appeared in some influential regulatory efforts, notably SB-1047 in California and the EU AI Act, which set thresholds that trigger compliance requirements for parties exceeding the stated limits. In isolation, this represents a reasonable means of regulation: there is certainly a correlation between compute power and model capabilities, and given the general-purpose nature of modern models, compute is a simple path to capturing the most capable ones.
I see three primary issues with using compute thresholds as the central regulatory mechanism:
Compute Thresholds are a Fast-Moving Target
AI and LLM innovation is far from a static target. Compute thresholds that reflect the most capable models today quickly become obsolete as algorithmic optimizations or higher-quality data emerge. Since this paper was accepted, the new class of LLM “reasoning” models, highlighted by OpenAI’s o3 and DeepSeek-R1, has validated this concern. The shift to inference-time optimization as a means of performance gains makes training compute a less representative proxy for model capabilities.
Another recent example of impressive performance relative to compute is the release of DeepSeek v3, which achieved performance competitive with other major models while requiring significantly less training compute. As articulated in this great post from Interconnects, the compute reported for the “final” version of DeepSeek v3 obscures all of the compute spent on experimentation to reach that final version. While that complexity matters for understanding this particular model’s development, the techniques behind it can be applied to develop more efficient models in the future.
Compute is Many Layers Abstracted from Consumer Experiences
Compute is just a single variable associated with model performance, and finding the “right” threshold that reflects model capabilities is nearly impossible given how disconnected compute is from the final product users receive.
The general-purpose nature of these models means that individual use cases are often disconnected from existing benchmarks, or even from what model providers focus on when making model improvements.
I learned much of this firsthand during my time at Lark Health, when we built an LLM-based nutrition coach on top of major third-party models. We observed that both major model updates and version iterations within the same model affected the performance of our nutrition coaching experience. We knew this because we built a robust evaluation framework that let us quantitatively measure the quality of nutrition coaching, using subject matter expert data labeling and a corresponding scoring model trained on that labeled data.
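To make that concrete, here is a minimal sketch of what such a framework can look like. This is not Lark’s actual system; the names (LabeledTurn, Scorer, etc.) and the 0–1 quality scale are assumptions for illustration, and the scorer stands in for a model trained on the expert labels.

```python
# A minimal sketch of a domain-specific evaluation framework, not Lark's
# actual system. Experts label transcripts once, a scorer is calibrated on
# those labels, and the scorer is rerun whenever the underlying LLM or
# prompt changes so quality shifts show up as numbers, not vibes.

from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class LabeledTurn:
    user_message: str      # what the user asked the nutrition coach
    model_response: str    # what the LLM-based coach replied
    expert_score: float    # 0.0-1.0 quality score assigned by a subject matter expert

# A scorer maps (user_message, model_response) to a 0.0-1.0 quality estimate.
# In practice this would be a model trained on the expert-labeled turns.
Scorer = Callable[[str, str], float]

def scorer_agreement(scorer: Scorer, labeled: list[LabeledTurn]) -> float:
    """Mean absolute error between the scorer and expert labels, as a sanity
    check that the automated scorer still tracks human judgment."""
    return mean(abs(scorer(t.user_message, t.model_response) - t.expert_score)
                for t in labeled)

def version_quality(scorer: Scorer, transcripts: list[tuple[str, str]]) -> float:
    """Average predicted quality for transcripts produced by one model/prompt version."""
    return mean(scorer(user, reply) for user, reply in transcripts)

def regressed(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """Flag when a new model version scores meaningfully worse than the baseline."""
    return candidate < baseline - tolerance
```

The useful property is that the expensive step, expert labeling, happens once per dataset refresh, while the scorer can be rerun cheaply against every new model version or prompt change, so regressions surface quantitatively rather than anecdotally.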
While it’s reasonable to expect new models or model versions to improve performance for your specific experience, there are no guarantees. The crucial insight is that without a robust evaluation framework you have no way of knowing whether that is true, beyond “vibes” or purely subjective evaluation.
While I have no way of knowing this for sure, it seems improbable that OpenAI, Anthropic, or any other foundation model provider maintains nutrition-coaching-specific evaluations they consult when deciding to release a new model. Additionally, model performance is not dictated purely by the foundation model itself; it also depends on the unique data and prompts that companies building on top of the models bring to the table.
Nutrition coaching is just a single example - you can imagine the same concerns for any specific use case. So while compute is a reasonable metric for reflecting general capabilities, it is not sufficient to reflect the experience of any individual use case.
The Risk of Regulatory Capture
The risk of regulatory capture presents another significant challenge. Transparency and rigor around ensuring model-based experiences are appropriate are worthy objectives. At the same time, requiring intense compliance above a certain level of compute disincentivizes smaller companies and open-source initiatives from exceeding those levels, because they don’t have the resources to keep up with compliance1. This has the unintended effect of concentrating power over this potentially transformative technology among the very small group of companies that do have the resources.
General Public Reaction and Understanding
While compute-based regulatory efforts may aim more for transparency from major model providers than for consumer protection, this nuanced distinction appears lost on the general public. This is reflected in Gavin Newsom’s veto message for SB-1047:
“By focusing only on the most expensive and large-scale models, SB 1047 establishes a regulatory framework that could give the public a false sense of security about controlling this fast-moving technology. Smaller, specialized models may emerge as equally or even more dangerous than the models targeted by SB 1047 - at the potential expense of curtailing the very innovation that fuels advancement in favor of the public good.”
This is perhaps the most subjective form of pushback, but understanding public perception is a crucial consideration. Thematically, it is increasingly important to step out of the AI/tech bubble when evaluating the impact of AI on society. The purpose of compute-level regulation might be obvious to most of the AI community, but for the general public the likely expectation is that AI regulation means consumer protection.
Curated Data and Domain-Specific Evaluation as the Alternative or Complement to Compute-Level Regulation
As discussed, compute-level regulation is not inherently wrong as a means of transparency, but it should be accompanied by efforts that ensure appropriate consumer protections.
In this paper I propose a data-based alternative built around identifying high-value, high-risk use cases that warrant specific evaluation. The approach involves curating datasets (user interactions and model responses) for domain-specific experiences, having experts in the area manually review and score the content, and ultimately converting those manual reviews into a model-based approach for scalability. This could then become a certification process, where a centralized authority independently evaluates a specific experience and grants a certification if it exceeds a defined quality threshold (a minimal sketch of these mechanics follows the benefits below). This has several potential benefits:
Consumer Protection
Since evaluation is domain-specific, consumers can have confidence that their specific experience is safe to use with LLMs. Examples could include nutrition coaching and mental health coaching, among many others. This evaluation would apply not just to model providers but to all companies leveraging models, so it would more accurately reflect the full AI supply chain rather than just foundation model providers.
Stimulating Innovation in More Sensitive Areas
The lack of a certifying authority and a means of evaluating specific experiences actually inhibits innovation; at worst, only those willing to take the most risk will release products in certain areas. I felt this firsthand at Lark, where we needed to develop our own evaluation framework to convince our healthcare partners that our experience was safe to use with their clients. This is suboptimal because we essentially had no choice but to grade our own homework. An established, centralized means of evaluation would drive more responsible innovation in areas like healthcare, carrying the dual benefit of increased economic activity and better outcomes for consumers.
Data has Persistent Value
For these centralized authorities, data carries significantly more value than shifting a compute threshold around. A growing theme in AI that I strongly believe in is that “Evals are all you need.” Being able to measure and understand what AI is doing will be a critical challenge, notably for government entities as AI proliferates in the private sector and among communities.
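To make the certification mechanics concrete, here is a minimal sketch of what a centralized authority’s check might look like under this proposal. The hidden prompt set, the expert-calibrated scorer, and the 0.90 threshold are all illustrative assumptions, not a reference implementation.

```python
# A minimal sketch of the certification process proposed above; every name
# and number here is an illustrative assumption.

from statistics import mean
from typing import Callable

Scorer = Callable[[str, str], float]   # (prompt, response) -> 0.0-1.0 quality estimate
Experience = Callable[[str], str]      # the applicant's deployed prompt-in / response-out system

def certify(hidden_prompts: list[str], scorer: Scorer, experience: Experience,
            threshold: float = 0.90) -> tuple[bool, float]:
    """Run the applicant's experience against the authority's hidden,
    expert-curated prompt set and compare average quality to the threshold."""
    avg_quality = mean(scorer(p, experience(p)) for p in hidden_prompts)
    return avg_quality >= threshold, avg_quality
```

Because the prompt set stays hidden and the scorer is calibrated on domain-expert labels, the same check can be rerun as models and products evolve, which is where the dataset’s persistent value comes from.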
I do not mean to trivialize this data-based effort; it would be a significant challenge to create and maintain these datasets. Benchmarks quickly become outdated as models are trained to optimize for them, so it will be critical to keep evaluation datasets out of model training data and to update them as model capabilities improve.
Furthermore, there is no clear institutional home for this process, and it remains uncertain whether this would be a government exercise or something that trade associations or even private companies would perform.
At its core, my proposal for a data-based approach moves us in the right direction on measurement, consumer protection, and stimulating innovation around LLMs. Compute-level regulation could certainly have a place alongside this data-based approach for transparency purposes, but as LLM capabilities continue to advance, measurement and evaluation are going to be key areas for human intervention, and this regulatory approach starts us down that path.
Reflections on More Recent Developments
My proposal evaluates the functional efficacy of LLM-based experiences in specific domains. While that is relevant across essentially all domains, it may not be sufficient in cases where the model’s reasoning matters for judging whether the output is acceptable. Right now it can be hard to reconcile LLMs achieving state-of-the-art performance on tests like the Bar Exam and the US Medical Licensing Exam with their propensity to make simple mistakes. This often makes conversations about LLM capabilities devolve into dismissals of hype, a focus on anecdotes, and anchoring on emotionally charged positions.
The performance of o3 on the ARC Challenge has pushed me to think more in this direction: whether there is a way to incorporate the principles of the ARC Challenge into domain-specific evaluation, combining functional evaluation with an assessment of reasoning capabilities. My hope is that this could move discussions about LLM capabilities to a more logical and analytical place, and lead to more productive decisions and a realistic view of what these models can do.
I realize that on one hand I’m saying compute can’t be the basis of regulation because it’s just a proxy, while on the other hand I’m saying that a compute threshold disincentivizes innovation by discouraging smaller players from exceeding the thresholds that trigger required compliance. As argued in the paper and this article, both can be true: compute is a reasonable proxy for general model capabilities, and compute on its own is not a sufficient means of regulation for consumer protection.