We Are All Data Curators Now
The shift from knowledge work to data management
As artificial intelligence (AI¹) continues to progress, a profound shift is occurring: we are becoming a society of data curators that outsource knowledge distillation. As AI systems distill and process information more efficiently than humans across a growing number of tasks, the role of people in this process is transforming. The ability to manage, organize, and leverage data is becoming as crucial as traditional forms of knowledge and expertise. This is a significant shift, and it carries serious implications for how we structure society moving forward.
This is the first post in a series that will cover some of the broad potential impacts of AI on society.
The AI Revolution: From Niche to Mainstream
The AI revolution has been brewing for decades, but recent developments have catapulted it into the spotlight. If the deep learning breakthroughs of the early-to-mid 2010s lit the match, then the 2022 release of ChatGPT emptied a gasoline tank onto the fire. By amplifying already promising language models with reinforcement learning from human feedback, ChatGPT brought AI out of the shadows and into everyday conversation. Today, you wouldn't think twice if AI came up during Thanksgiving dinner.
This rapid advancement has sparked both excitement and concern. Claims of artificial general intelligence (AGI) are met with skepticism about AI's practical usefulness. The truth, as usual, lies somewhere in the middle. While current tools often struggle to move beyond impressive demos into trusted production applications, they have already demonstrated capabilities that surprised and even scared experts in the field.
The Commoditization of Methodologies and Value of Data
Discussions about AI often focus on the methodologies that have enabled recent innovations. However, history shows that algorithmic or methodological advantages tend to be short-lived. They are quickly commoditized by competitors, driving continuous innovation as companies strive to maintain an edge in the rapidly advancing field of AI.
What seems to provide a persistent competitive advantage is data, as evidenced by the rise of "Big Tech." Google's search algorithm, for example, drove its initial advantage, but over time, its data collection machine—amplified by free-to-end-user products—enabled its establishment as one of the world's largest companies. Now, user preference data has become the most valuable resource in steering large language models towards correct and human-preferred outputs. With current state-of-the-art models, high-quality data is the primary bottleneck, spurring significant investment in synthetic data generation strategies.
The Rise of Computer-Based Information Distillation
Outside of the marketplace dynamics of data, we as individuals should consider the consequences of computer-based systems possessing a generalized version of human expertise.
Amid the hype and debates, a fundamental shift is occurring: We are creating computer-based experiences that are more efficient than humans at distilling information across a wide range of tasks. This efficiency stems from our growing ability to represent any arbitrary entity—be it a person, place, or concept—in quantitative form across generalized mediums like text, images, and audio.
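As a loose illustration of what "representing an entity in quantitative form" can mean, here is a toy sketch that maps a short text description onto a numeric vector using simple word counts. Real systems learn dense embeddings from data, so the fixed vocabulary and the resulting vector below are purely illustrative assumptions.

```python
# Toy "quantitative representation": describe an entity as word counts over a
# fixed vocabulary. Real systems learn dense embedding vectors; this scheme
# and the vocabulary below are purely illustrative.
VOCAB = ["city", "river", "music", "person", "teacher"]

def to_vector(text: str) -> list[int]:
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]

print(to_vector("Vienna is a city on a river with a long music tradition"))
# -> [1, 1, 1, 0, 0]: the entity is now a point in a shared numeric space.
```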
This concept of machines becoming more efficient than humans at certain tasks is not new (calculators replacing math on paper, for instance). For decades, computers and the internet have amplified our ability to capture knowledge: quick search across nearly unlimited data, the ability to store and manage data and content, and countless efficiency tools.
What sets the current state-of-the-art generative AI applications apart is the flexibility with which they can answer questions and create solutions. Access to the internet did not mean computers could answer questions in a vacuum; they still required humans to orchestrate and aggregate information, making it more efficient for humans to serve as the primary knowledge distillation engine.
Consider this example: Someone wants to build an application that splits restaurant bills automatically based on a set of input parameters, like the number of people and desired tip percentage. Previously, that person could use the internet to learn how to write code, read documentation and discussion forums to help them debug, and leverage integrated development environments. While these tools significantly improved efficiency, they still required direct human involvement to execute a series of tasks.
Now, LLM-based experiences are capable of taking free-form user input, understanding the semantic intent, and producing a final output. Using the bill-splitting example, the LLM can produce the code and run the application by itself, rendering human orchestration largely unnecessary aside from asking the initial question.
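To make the bill-splitting example concrete, below is a minimal sketch of the kind of program an LLM might generate from a free-form request. The function name, parameters, and sample values are illustrative assumptions, not output from any specific model.

```python
def split_bill(total: float, num_people: int, tip_percent: float) -> float:
    """Return each person's share of a restaurant bill, tip included."""
    if num_people <= 0:
        raise ValueError("num_people must be a positive integer")
    tip = total * (tip_percent / 100)
    return round((total + tip) / num_people, 2)

# Example: a $120.00 bill split four ways with a 20% tip -> 36.0 per person.
print(split_bill(120.00, 4, 20))
```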
In some situations, the output will not be correct or will not fit the user's desired intent. But in other situations, it will completely satisfy the user request. This, of course, is not isolated to just programming. Some AI services can now pass tests that once required years or even decades of human study and experience, including the Bar exam and the US Medical Licensing Examination.
It would be naive to claim that the ability of these experiences to pass these tests makes them sufficient replacements for their human counterparts, especially given the probabilistic nature of these models and their inherent imperfections. I would argue it is just as naive to brush these capabilities aside as parlor tricks when they have shown this level of efficacy.
The New Paradigm: Data Curation as a Core Skill
As AI systems become more adept at processing and generating information, our relationship with knowledge is changing.
The more examples these models see, including feedback on whether their responses are accurate, the more capable they will become (not to mention further methodological advancements). This suggests that the capabilities we see now are the worst these models will ever be.
This should cause us to question whether our current operating model, which places humans at the center of knowledge distillation, is really the most efficient path forward. Rather than focusing solely on human knowledge distillation, memorization, and recall, the more efficient path forward is to curate data effectively. This shift requires us to:
Identify areas where AI models are effective and where they need improvement
Provide high-quality, relevant data to enhance AI performance
Critically evaluate and contextualize AI-generated outputs
In this new paradigm, intelligence manifests differently. The ability to manage and curate the right data becomes paramount, complementing traditional forms of knowledge acquisition and problem-solving.
This does not in any way invalidate human expertise, as these tools are not always right. Using the example from the prior section, if the code provided did not match the user's intent or did not work as expected, someone without programming expertise would be stuck. If the person did have the required expertise, they could tweak the provided code and arrive at a solution that would have taken significantly longer to build from scratch.
The key theme is that the more examples a model has seen on a particular subject, the more likely it is to produce the correct response. So at both the corporate and personal level, curating data that is relevant to your specific situation and consumable by these models, complemented by some form of human expertise, is likely the most efficient knowledge store of the future.
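As a rough sketch of what "curating data in a way that can be consumed by these models" might look like in practice, the snippet below keeps a small store of situation-specific notes, selects the most relevant ones for a question using simple word overlap, and packs them into a prompt. The scoring method and the final hand-off to a model are assumptions for illustration; a real system would likely use embedding-based retrieval and an actual LLM call.

```python
def score(query: str, note: str) -> int:
    """Crude relevance proxy: count words shared between the query and a note."""
    return len(set(query.lower().split()) & set(note.lower().split()))

def build_prompt(query: str, notes: list[str], k: int = 2) -> str:
    """Select the k most relevant curated notes and pack them into a prompt."""
    top = sorted(notes, key=lambda n: score(query, n), reverse=True)[:k]
    context = "\n".join(f"- {n}" for n in top)
    return f"Use only the context below to answer.\nContext:\n{context}\nQuestion: {query}"

# Curated, situation-specific notes a general-purpose model would not know.
notes = [
    "Our team deploys the billing service every Tuesday at 10am.",
    "Refunds over $500 require manager approval.",
    "The office cafeteria closes at 3pm on Fridays.",
]

prompt = build_prompt("When do we deploy the billing service?", notes)
print(prompt)  # In practice, this prompt would be sent to whatever LLM you use.
```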
Redefining Human Intelligence
This represents a radical shift in how we view human intelligence. Traditionally, both in education and society at large, we've associated intelligence with the ability to retain and memorize information, as well as the capacity to apply knowledge creatively to solve problems. This skill set has been uniquely human and underpins much of our society.
This post doesn't suggest abandoning the pursuit of knowledge. Rather, it proposes approaching this pursuit differently. Intelligence may manifest in new ways in the future, with the ability to manage and curate the right data taking center stage. We might think of people as both the writers and editors of the future, with the primary skill being knowing when to switch between these roles.
The fully realized vision of AI puts a lawyer, a doctor, and a confidant who knows your own distinct preferences in your pocket. The dystopian equivalent mirrors a form of "Idiocracy" that leaves society intellectually crippled, or perhaps even worse. Regardless of which direction we're headed, the proverbial cork is out of the bottle. Given the level of progress and investment (both corporate and sovereign) in recent years, continued advancement seems inevitable. It's imperative that we try to predict AI's trajectory to maximize its benefits and minimize its drawbacks.
Looking Ahead: Key Areas for Further Exploration
This post is the first in a series exploring how AI might alter our society and how we can proactively position ourselves for a more promising future, both for society as a whole and at the individual level. While there are no clear, definitive answers to the questions below, these areas certainly warrant further discussion and require us to think differently than we have before:
Employment: What does the white-collar job market look like when knowledge is commoditized?
Education: How can we pivot education to fit this new paradigm rather than resisting generative AI?
Government: What type of regulation protects consumers from the dangers of AI but doesn’t bog down progress?
Economy: Is it possible to disentangle the capitalistic benefits of AI from technical progress?
Risk: Beyond obvious downside risks, what are the risks of inaction in the face of AI advancement?
Future: How can society keep pace with AI progress?
Throughout this series, we'll continue to explore the themes of efficient knowledge distillation and the pivotal role of data curation in our collective future. These concepts will serve as a framework for understanding and navigating the challenges and opportunities that lie ahead.
Thanks for reading!
¹ I am using AI as a broad catch-all for machine learning and generative AI in this post. Using the term this loosely is a personal pet peeve, but it's the easiest way to convey the concepts in the post without getting too confusing.

