The South Korean government’s proposal to regulate the use of copyrighted works in AI training under a “use first, compensate later” framework is drawing growing controversy. The plan seeks to loosen copyright restrictions to allow companies easier access to data for AI development. Copyright groups, however, argue that it effectively shifts the costs of industrial expansion onto creators.
Under the proposed system, AI developers would be permitted to use copyrighted works to train their models first and compensate rights holders later out of the revenue generated. Copyright organizations have raised strong objections, questioning whether creators can receive fair compensation when developers are not required to disclose which works they use or the extent of that use. Under such conditions, they argue, bargaining power lies overwhelmingly with companies.
Supporters of technology-led growth often dismiss these concerns as an attempt by creators to protect their own interests. But the issue extends beyond redistribution. AI systems depend on high-quality training data to improve performance. Weakening the ecosystem that generates human-created data risks a future shortage of reliable materials, which would ultimately undermine the effectiveness and sustainability of AI itself.
Signs of data scarcity are already emerging. A 2024 study by the nonprofit AI research institute Epoch AI projects that the stock of high-quality language data for AI training could be exhausted within two to five years, leaving supplies critically low by around 2030.
In response, AI developers are exploring ways to replace scarce training material with AI-generated synthetic data. However, repeatedly training models on AI-generated outputs can trigger a phenomenon known as model collapse.
Model collapse occurs when an AI system is trained, generation after generation, largely on its own outputs, causing results to converge toward a narrow range and diversity to erode. As synthetic data is reused, embedded biases are amplified and rare, tail-end patterns disappear, leading models to reproduce familiar AI-generated regularities rather than the complexities of the real world.
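The dynamic can be illustrated with a deliberately simple simulation (a hypothetical sketch in Python with NumPy, not a description of any production system): fit a trivially simple generative "model," here just a Gaussian, to a dataset, replace the dataset entirely with the model's own samples, and repeat.

```python
import numpy as np

# Toy illustration of model collapse (illustrative only): a Gaussian
# fit is "retrained" on its own samples for many generations, and the
# spread of its outputs decays.

rng = np.random.default_rng(42)

# Generation 0: "human" data with real diversity.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for generation in range(101):
    # "Train": estimate the model from the current dataset.
    mu, sigma = data.mean(), data.std()
    if generation % 20 == 0:
        print(f"gen {generation:3d}: mean={mu:+.3f}, std={sigma:.3f}")
    # "Generate": replace the dataset with the model's own outputs,
    # mimicking a model trained purely on synthetic data.
    data = rng.normal(loc=mu, scale=sigma, size=50)

# Each refit slightly underestimates the spread, and sampling error
# compounds across generations, so the standard deviation tends to
# drift toward zero: the tails of the original distribution are lost,
# a statistical analogue of the diversity loss described above.
```

Real training pipelines are vastly more complex, but the underlying feedback loop is the same: once a model's outputs become its inputs, variation that only fresh human data can supply is progressively squeezed out.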
This phenomenon poses serious risks for companies that rely on AI for decision-making and can further entrench existing social biases. AI systems affected by model collapse may ultimately lose their usefulness as users reject outputs that fail to reflect real-world conditions.
Experts broadly agree that human-generated data must remain a core component of AI training. Diverse human experiences, unexpected events and natural language use are inherently difficult for machines to replicate and are essential to building resilient AI systems.
Creators are the primary source of such high-quality human data. Policies that weaken the position of copyright holders threaten the foundation of South Korea’s sovereign AI ambitions. Sacrificing the long-term sustainability of the AI ecosystem for short-term industrial gains would be counterproductive. A framework that enables copyright holders and AI developers to coexist is essential for responsible and durable AI development.