Large Language Model Regulation in the US
We're in the first inning, but need to pivot our strategy
This is the fifth post in a series exploring the societal impacts of AI, with an eye towards the future. In this post, we'll discuss existing regulatory efforts around large language models (LLMs) and potential future directions. I will be presenting my paper "Powering LLM Regulation through Data: Bridging the Gap from Compute Thresholds to Customer Experiences" at the NeurIPS RegML workshop next month. I'll share this paper when it becomes publicly available!
Current State of US Regulatory Efforts
This section does not aim to capture every current regulatory effort around LLMs, and it focuses only on the United States. The goal is to broadly summarize the regulatory landscape and where it might move in the future.
In the United States, regulatory efforts to date have been concentrated primarily at the state level. The generative AI profile of the NIST AI Risk Management Framework and Executive Order 14110 both represent federal-level guidance on how to develop and deploy AI, but no federal regulation has emerged.
The most publicized state-level bill to date was the now-vetoed SB-1047 in California. This bill drew a strong reaction from industry, with some providing tepid support and others stronger pushback. Texas has recently pushed forward a proposal with a wide-sweeping purview, and Colorado passed a bill earlier in 2024.
Two consistent themes across both proposed legislation and guidance are:
A focus on catastrophic and wide-sweeping outcomes from AI
Using compute power as a proxy for model capabilities to trigger regulation
These are the core areas that I believe future guidance and legislation should pivot away from. Instead, we need more focus on practical capabilities and the importance of data.
If You Have Two Quarterbacks, You Really Have None
This football analogy suggests that having multiple quarterbacks likely means none are exceptional. Similarly, the sweeping purview of existing AI guidance and regulatory efforts is so broad that it fails to make meaningful progress in any single area.
While I don't minimize AI's catastrophic potential, my previous posts about AI's impact on education and the economy detail why I'm more concerned about near-term societal impacts than hypothetical concerns. Addressing these requires focus, specificity, and practical steps for developers. Currently, the emphasis has shifted too much toward the hypothetical and an impractical list of requirements for developers.
In a recent post on his great blog Hyperdimensional, Dean Ball describes this as the "everything is everything" approach to AI policy. He references NIST Risk Management Framework recommendations that both AI developers and corporate users should consult numerous stakeholders about various societal impacts before deploying AI. As Ball notes:
“Another passage recommends that both AI developers and corporate users talk to “trade associations, standards developing organizations, researchers, advocacy groups, environmental groups, civil society organizations, end users, and potentially impacted individuals and communities” about “the tradeoffs needed to balance societal values and priorities related to civil liberties and rights, equity, the environment and the planet, and the economy” before they have released or begun using AI.
So, talk to everyone about everything that could go wrong—including issues relating to the planet, “society,” and struggles that have persisted for all of human history—with your use of a general-purpose technology that is changing constantly. Got it.”
This broad, sweeping scope has turned much of the guidance into a form of virtue signaling, and it leads potential developers either to ignore the guidance entirely or to hesitate to innovate for fear of wide-ranging regulatory retribution down the road.
Compute is Just a Proxy
Using compute power as a means of consumer protection is a related but distinct issue. While compute power correlates with model capabilities—though possibly with diminishing returns—it's insufficient alone for consumer protection. Given LLMs' multi-purpose nature and that LLM-based experiences typically combine models, data, and architecture, compute power doesn't adequately indicate whether an LLM-based experience can perform specific tasks effectively.
It's reasonable to use compute power to identify extremely capable models, at least for now, but it is not sufficient on its own as a sweeping means of consumer protection. Gavin Newsom's pushback in his SB-1047 veto message articulates the limitations of compute as a standalone basis for regulation.
“By focusing only on the most expensive and large-scale models, SB 1047 establishes a regulatory framework that could give the public a false sense of security about controlling this fast-moving technology. Smaller, specialized models may emerge as equally or even more dangerous than the models targeted by SB 1047 - at the potential expense of curtailing the very innovation that fuels advancement in favor of the public good”
While compute levels could trigger certain regulatory requirements, they shouldn't be the sole regulatory mechanism. Combined with the "regulate everything" approach, this could create unintentional regulatory capture, where only the largest companies can afford to meet requirements.
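To make the mechanism concrete, here is a minimal sketch of how a compute-based trigger works, using the roughly 10^26-operation threshold referenced in Executive Order 14110 and SB-1047 and the common ~6 × parameters × tokens back-of-the-envelope estimate of training compute. The model sizes below are illustrative placeholders, not real disclosures.

```python
# Sketch of a compute-threshold trigger; figures are illustrative, not real disclosures.

COMPUTE_THRESHOLD_FLOPS = 1e26  # rough threshold referenced in EO 14110 / SB-1047

def estimated_training_flops(n_parameters: float, n_training_tokens: float) -> float:
    # Common back-of-the-envelope estimate: ~6 FLOPs per parameter per training token.
    return 6 * n_parameters * n_training_tokens

def triggers_regulation(n_parameters: float, n_training_tokens: float) -> bool:
    # The trigger depends only on estimated training compute, not on what the
    # model is used for or how well it performs any particular task.
    return estimated_training_flops(n_parameters, n_training_tokens) >= COMPUTE_THRESHOLD_FLOPS

# A frontier-scale run crosses the line...
print(triggers_regulation(n_parameters=1e12, n_training_tokens=2e13))  # True  (~1.2e26 FLOPs)
# ...while a smaller, specialized model deployed in a sensitive setting does not.
print(triggers_regulation(n_parameters=7e9, n_training_tokens=2e12))   # False (~8.4e22 FLOPs)
```

Nothing in this check depends on what the system is asked to do or how well it does it, which is exactly the gap the veto message points to.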
Shifting to a Data and Use-Case Focus
LLM developers know that moving from demos to production applications is challenging. Given LLMs' flexible nature, evaluating whether outputs truly "work" can be difficult. Basic metrics like API response success are easy to measure, but validating whether responses properly reflect user intent across the range of possible outcomes is a common example of the more complex evaluation that working with LLMs often requires.
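To make that distinction concrete, here is a minimal, hypothetical sketch contrasting the easy check (did the API return something?) with the harder one (did the response satisfy the user's intent?). The `judge_intent` scorer below is a keyword-overlap stand-in for whatever rubric, expert label, or LLM-as-judge comparison a real evaluation would use.

```python
def api_call_succeeded(response: dict) -> bool:
    # Easy to measure: the request completed and returned non-empty text.
    return response.get("status") == 200 and bool(response.get("text", "").strip())

def judge_intent(response_text: str, expert_reference: str) -> float:
    # Placeholder scorer: a real evaluation would use a rubric, an expert
    # grader, or an LLM-as-judge. Keyword overlap is purely illustrative.
    reference_terms = set(expert_reference.lower().split())
    response_terms = set(response_text.lower().split())
    return len(reference_terms & response_terms) / max(len(reference_terms), 1)

def reflects_user_intent(response_text: str, expert_reference: str) -> bool:
    # Harder to measure: does the answer actually satisfy what the user asked,
    # judged against an expert-written reference?
    return judge_intent(response_text, expert_reference) >= 0.5

response = {"status": 200, "text": "Sure! Here is a sample 7-day meal plan ..."}
print(api_call_succeeded(response))  # True, but says nothing about quality
print(reflects_user_intent(response["text"],
                           "A 7-day meal plan tailored to a low-sodium diet"))
```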
To make regulation more practical, we should:
Curate data around high-risk, high-value use cases
Develop expert-labeled data (scalable through synthetic generation) for evaluating specific use cases
Measure and validate user experiences based on actual content delivery, not just foundation model capabilities
This approach could support a centralized certification process, detailed in my forthcoming paper, which I will link to when it becomes publicly available.
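As a rough illustration (and only a sketch, not the approach detailed in the paper), the code below assumes a hypothetical expert-labeled evaluation set for a single high-risk use case and a simple pass-rate check applied to a deployed LLM-based experience. The case structure, grading logic, and threshold are all placeholders that a real third-party process would define.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    user_prompt: str        # a realistic request within the use case
    expert_reference: str   # expert-written answer; scalable via synthetic generation
    must_escalate: bool     # e.g., cases that should be referred to a human clinician

def grade_case(system_response: str, case: EvalCase) -> bool:
    # Placeholder grading: a real process would apply an expert-defined rubric
    # or compare against case.expert_reference with a trusted judge.
    if case.must_escalate:
        return "talk to a doctor" in system_response.lower()
    return len(system_response.split()) > 20  # stand-in for rubric-based scoring

def certify(responses: dict[str, str], cases: list[EvalCase], pass_threshold: float = 0.95) -> bool:
    # Certification applies to the deployed experience (model + data + prompts),
    # not to the foundation model in isolation.
    passed = sum(grade_case(responses[case.user_prompt], case) for case in cases)
    return passed / len(cases) >= pass_threshold

cases = [EvalCase("I get dizzy when I skip meals, what should I do?",
                  "Recommend medical follow-up for recurring dizziness.", True)]
responses = {cases[0].user_prompt: "Recurring dizziness is worth a check-up; please talk to a doctor."}
print(certify(responses, cases, pass_threshold=1.0))  # True
```

The important design choice is the unit of measurement: the customer-facing experience for a specific use case, not the raw scale of the underlying foundation model.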
The Benefits
The obvious consumer benefit is clear understanding of an LLM-based experience's capabilities. Less obviously, this approach could encourage business innovation in otherwise risky areas.
For businesses, this offers a path to recognized credibility. Current enterprise-level LLM deployment challenges often stem from regulatory uncertainty, leading to inaction, particularly in sensitive areas like healthcare.
Consider healthcare as an example. Current LLMs could serve as nutrition coaches. While imperfect, they raise key questions: How effective can LLMs be? Which models excel? How do they compare to human coaches?
Human nutrition coaches have clear credentialing paths, making verification straightforward. With LLMs, there's generally just "vibes," or you have to hope that a sufficient publicly available benchmark happens to fit your exact use case. This often requires businesses to develop internal evaluation frameworks, as my previous company Lark Health did for our AI-based food coach.
At Lark, this custom approach helped us build trust with partners and users, but ideally such evaluation would come from unbiased third parties, providing regulatory credibility and encouraging innovation. Nutrition coaching is just one example; in healthcare alone, offering LLM-based products as a low-cost alternative could democratize care in areas we know are currently underserved, like mental health and proactive diagnostic care.
There's No Sugarcoating It - Evaluation is Hard, But It's Worth It
I don't mean to trivialize these efforts. Creating these datasets is labor intensive and often requires input from specialized experts. It is not a one-time effort: the datasets would need to be maintained, and there will always be a risk of the certification data leaking into training data and diminishing its value, as we've seen with some LLM benchmarks.
For that reason, we should first target the use cases with the highest potential to drive both economic impact and improvements to consumers' lives.
These datasets should also provide persistent strategic benefits. We are seeing an overall trend of algorithmic commoditization, with data and system integration that balances probabilistic and deterministic approaches emerging as the means to unlock new capabilities. New algorithmic approaches could of course emerge, but to date algorithmic advantages have been short-lived, while data has proven to be a persistent moat (see: Google, Amazon, Facebook, etc.).
Conclusion
This use-case specific approach requires pivoting from "Everything is Everything" AI policy to tactical implementation. Rather than attempting to regulate everything these general-purpose models can do, we should excel in specific areas and expand methodically, focusing on areas where measuring LLM-based efficacy could drive significant societal change.
We must also embrace humanity's role in a world of increasingly capable AI systems. We've created flexible knowledge stores that can perform many previously human tasks. While we're justifiably cautious about ceding control—both for livelihood preservation and valid trust concerns—these systems need not be perfect to be valuable.
Perhaps the future lies in humans serving as validators and arbiters of truth, possibly in conjunction with models. What I think is clear is that it is worth it for us to collectively establish ground truth through data, which allows us to measure this societal transformation.