The global AI competition has entered an era where data is king, and giants like Meta and xAI are betting heavily on high-quality data. Appen China, a leading data services provider, leveraged its technological strengths and talent network to post revenue of 306 million yuan in the first half of the year, up roughly 90% year on year. This article analyzes in depth how Appen builds core data-to-intelligence competencies for over 450 clients.

Originally published by: New Intelligence

Editors: KingHZ, Taozi


[New Intelligence Introduction] The true determinant of AI's ceiling has shifted from "model scale" to "data quality." From Meta's bet on data platforms to xAI's layoffs and its recruitment of specialist "AI tutors," the global "data war" has fully entered its second half. Among Chinese players, Appen China holds a commanding position, with revenue reaching 306 million yuan in the first half of 2025 alone. High-quality, traceable, engineerable data production is becoming the AI industry's new barrier to entry.

In 2025, large models continue to evolve at high speed, and technology giants are competing fiercely for their "fuel": high-quality data.

Data is no longer a behind-the-scenes supporting player but the core battlefield that directly determines AI's success or failure.

Today, the "data war" in the global AI circle is intensifying.

Zuckerberg spent $14.3 billion to buy a 49% stake in Scale AI and poach its founder, just to seize the commanding heights of high-quality data.

This "marriage of the century," originally seen as Meta's trump card in the AI race, soon revealed "awkward cracks":

TBD Labs, the Meta unit responsible for next-generation model training, grew disappointed with Scale AI's data quality and turned to competitors such as Surge AI and Mercor.

Before that storm subsided, Anthropic agreed to a "sky-high" settlement of US$1.5 billion for allegedly using pirated copyrighted books to train Claude.

This astonishing figure set a record for the largest copyright payout in the United States and signaled the end of the era of AI's indiscriminate data grabbing.


At the same time, Musk's xAI laid off 500 generalist data labelers overnight and moved to recruit ten times as many specialist "AI tutors."

The hiring focuses on areas such as STEM, finance, medicine, and safety, pointing directly to AI's deep shift from massive data accumulation to professional refinement.


The "data anxiety" of these technology giants and AI unicorns is not an isolated case, but a common mirror in the AI ecosystem.

Data has become the "new oil" in the AI era.

Overseas, upstarts such as Scale AI, Surge AI, and Mercor have become the behind-the-scenes driving force for giants like OpenAI and Google on the strength of refined labeling and expert resources.

In China, the pioneer of this "data revolution," Appen China, is rising fast on local innovation and a global vision.

Few people know that the high-quality data behind China's top ten internet giants, top ten autonomous driving manufacturers, and 450+ leading companies all comes from Appen's AI data engine.

Results for the first half of 2025 show that Appen China set a new revenue record of RMB 306 million, a fresh industry benchmark.

The full-year figure is expected to exceed 700 million yuan.

This is not just a number; it is proof of a roughly 90% compound annual growth rate over the past five years.

Dr. Tian Xiaopeng, Appen’s Senior Vice President and General Manager of Greater China and North Asia, said:

We are witnessing a fundamental paradigm shift. The ultimate competitive advantage in AI lies in building a robust "closed data loop." Powered by "data engineering," it continuously produces scarce, high-quality data fuel.

Put bluntly, future competition will be decided not only by computing power or model architecture, but by who can systematically produce accurate, scarce data.

This is the root cause of much of the industry's current churn, and it points to the next key direction in the evolution of AI data.

From zero to 306 million yuan in half-year revenue

In the eyes of many, this is undoubtedly the "Chinese version of Scale AI"!

That's right, but it goes further. More precisely, Appen China is a cutting-edge company combining the strengths of both Scale AI and Surge AI.

Founded in 2019 and headquartered in Shanghai, Appen China is a leading data company backed by Appen's investment, established and independently operated by a local management team.

It combines Scale AI's depth in autonomous driving and multimodal data with Surge AI's high-quality labeling and refined vertical services.

Compared with those two, Appen has a better feel for the pulse of the Chinese market and seamlessly integrates global resources with localized delivery.

As early as 2023, Appen's revenue had surpassed its domestic competitors', making it the "dark horse king" of China's data services.

In the first half of this year, Appen's revenue reached 306 million yuan, roughly ten times its full-year 2020 figure, making it indisputably the largest AI data service provider in China.


This "counterattack" is not accidental, but the result of five years of hard work.

Looking back at its growth path, Appen accurately seized three key market windows:

  • 2020-2021: intelligent voice
  • 2022-2023: autonomous driving
  • 2024-2025: large models

In 2020-2021, Appen seized the boom in traditional AI business.

At that time, the demand for speech recognition and image labeling was booming, and Appen quickly laid the foundation with its global resource network and localized team.

Revenue started at just over 30 million yuan in 2020 and grew roughly fivefold in 2021, to about 160 million yuan.

In 2022-2023, the rapid rise of autonomous driving technology became Appen's second growth engine.

Through deep cooperation with China's top ten autonomous driving companies, Appen's revenue kept doubling over this period, reaching nearly 244 million yuan in 2023.


In 2024-2025, Appen caught the large-model wave, having positioned itself early in vertical large models.

From ChatGPT to DeepSeek, large models have not only reshaped the global AI competitive landscape but also brought unprecedented opportunities to the data service industry.

In 2024, Appen China's annual growth exceeded 70%, with large-model and generative-AI-related business growing more than 500%.

In the first half of 2025, riding the booming domestic AI industry, Appen's revenue hit a new high, driven mainly by five engines:

1. Structural growth dividend

The industry's focus has shifted from "model competition" to "application deployment," steadily releasing demand for high-quality vertical data and raising its priority.

2. Supplier concentration trend

To cut costs and improve efficiency, leading customers are consolidating their supply chains; service providers with comprehensive capabilities are taking on high-difficulty, high-complexity, high-security projects, and concentration is rising.

3. Breakthrough in overseas data services

As Chinese internet companies accelerate their global expansion, demand for compliance and localization is surging. With a delivery network spanning the Philippines, Malaysia, Vietnam, and Europe, Appen's overseas business now accounts for nearly 40% of revenue, delivered as multilingual, cross-cultural, and compliance solutions.

4. Opportunities for cold-start data productization

As large models are iterating at an increasingly rapid pace, the demand for finished datasets is growing. Appen transforms data into modular, composable, high-quality data products, significantly shortening client development cycles while maintaining high gross margins.

5. High-end data resources and service barriers

Appen proactively deploys high-end talent and platforms (medical experts, professional musicians, competition winners, and more), linking its technology platforms with ten vertical capabilities to support large-model training and evaluation. High-quality data is setting the upper limit of model capability.

Faced with these unprecedented opportunities, what has Appen done to stand out from the fierce competition?

Five major platforms, industry-leading technology

At the crest of the technological wave, Appen has built long-term technical barriers with a forward-looking vision and driven industry change through innovation.

Its pre-annotation large model, China's first end-to-end general-purpose one, combined with project-level fine-tuning, closes the annotation loop automatically and improves efficiency by 25%. Its integrated collection-annotation-quality-inspection-delivery pipeline, another industry first, cuts redundant storage and improves data processing efficiency by 30%.

Starting from the "first principles", Appen reconstructed "data engineering".

At its core is a self-developed, industry-grade pre-labeling large model: it understands context and completes a high-precision first pass in advance.

Human experts then handle only the "hardest 5%": ambiguous samples, boundary cases, and rule conflicts.

Finally, the corrections are fed back to re-optimize the model parameters, closing the loop of "pre-labeling, manual correction, model optimization."


This model increases data labeling efficiency several times, while greatly reducing labor costs and subjective errors, achieving a double leap in efficiency and accuracy.
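To make the closed loop concrete, here is a minimal Python sketch of one "pre-labeling, manual correction, model optimization" cycle. Everything in it is a hypothetical stand-in rather than Appen's actual system; the point is the control flow, where the model labels everything, experts touch only the low-confidence slice, and corrections feed the next round of fine-tuning.

```python
import random
from dataclasses import dataclass

@dataclass
class Sample:
    data: str
    label: str = ""
    confidence: float = 0.0

class ToyModel:
    """Stand-in for the pre-labeling large model (hypothetical)."""
    def predict(self, data: str) -> tuple[str, float]:
        # A real system returns a label plus a calibrated confidence score.
        return ("positive" if "good" in data else "negative",
                random.uniform(0.5, 1.0))

REVIEW_THRESHOLD = 0.9  # below this confidence, a sample goes to human experts

def expert_review(sample: Sample) -> str:
    """Placeholder for the human-in-the-loop correction step."""
    return sample.label  # assume the expert confirms or fixes the label

def annotation_cycle(model: ToyModel, samples: list[Sample]) -> list[Sample]:
    # 1) Pre-labeling: the model makes a high-precision first pass on everything.
    for s in samples:
        s.label, s.confidence = model.predict(s.data)

    # 2) Manual correction: experts handle only the "hardest" slice --
    #    ambiguous samples, boundary cases, rule conflicts.
    for s in (s for s in samples if s.confidence < REVIEW_THRESHOLD):
        s.label = expert_review(s)

    # 3) Model optimization: corrections would be fed back as fine-tuning
    #    data so the next batch needs even less manual work (elided here).
    return samples

batch = [Sample(f"item {i} {'good' if i % 2 else 'bad'}") for i in range(10)]
annotation_cycle(ToyModel(), batch)
```

In production it is the third step that compounds the gains: each batch of corrections shrinks the low-confidence slice for the next project.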

At AI's cutting edge sit large models, embodied intelligence, autonomous driving, and more.

These fields impose even more stringent data requirements: quality, quantity, and multidimensionality are all essential, and general-purpose tools simply cannot keep up.

To this end, Appen invested heavily in its own R&D, building an industry-grade platform matrix covering multiple fields:

MatrixGo, MediGo, RoboGo, AI Agent, and the large-model intelligent development platform, each handling its own tasks yet working in concert.

For example, the large-model line covers multimodal data cleaning → SFT instruction fine-tuning data construction → RLHF preference labeling and evaluation.
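The record shapes below sketch what each of those three stages typically produces. The schemas follow common open-source conventions (instruction/response pairs for SFT, chosen/rejected pairs for RLHF preference data); they are illustrative assumptions, not Appen's internal formats.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CleanRecord:
    """Output of multimodal data cleaning: deduplicated, filtered, scrubbed."""
    text: str
    image_path: Optional[str] = None  # non-text modality, if present

@dataclass
class SFTExample:
    """Instruction fine-tuning example: a task plus a human-verified answer."""
    instruction: str
    response: str

@dataclass
class PreferencePair:
    """RLHF preference label: two candidate answers ranked by annotators."""
    prompt: str
    chosen: str    # the response annotators preferred
    rejected: str  # the response annotators ranked lower

# A cleaned record flows downstream into both dataset types.
rec = CleanRecord(text="Photosynthesis converts light into chemical energy.")
sft = SFTExample(instruction="Explain photosynthesis in one sentence.",
                 response=rec.text)
pref = PreferencePair(prompt=sft.instruction,
                      chosen=rec.text,
                      rejected="Plants eat sunlight.")
```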


For embodied intelligence, the robot's "hand-eye-brain" coordination training requires data as "fuel".

Multi-sensor fusion labeling, complex motion-trajectory labeling, multimodal chain-of-thought labeling... Appen's RoboGo platform handles them all in one stop, a line of business Scale AI does not even have.
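As a rough picture of what a one-stop embodied record can bundle, the hypothetical schema below combines the three label types named above: fused sensor streams, a motion trajectory, and a multimodal chain of thought. Field names are illustrative, not RoboGo's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class TrajectoryPoint:
    t: float                   # timestamp in seconds
    joint_angles: list[float]  # robot pose at time t ("hand")

@dataclass
class EmbodiedSample:
    """One 'hand-eye-brain' training record (hypothetical schema)."""
    camera_frames: list[str] = field(default_factory=list)  # RGB frame paths ("eye")
    depth_frames: list[str] = field(default_factory=list)   # second fused sensor
    trajectory: list[TrajectoryPoint] = field(default_factory=list)
    # Annotated step-by-step reasoning ("brain"),
    # e.g. ["locate cup", "align gripper above handle", "close gripper"].
    thought_chain: list[str] = field(default_factory=list)
```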

[Figure: long-term logical reasoning]

AI + medical application scenarios are even more specialized.

The MediGo platform has built-in intelligent labeling, multimodal fusion, and private deployment, providing a high-precision, compliant, and secure data foundation for medical large models and applications across eight core scenarios, including diagnosis and treatment, consultation and triage, and health education.

[Figure: medical imaging detection]

Today MatrixGo, the enterprise-grade high-precision data production platform, connects the whole chain, iterates fast, and optimizes steadily.

In autonomous driving, demands are even more diverse: LiDAR 3D point clouds, high-precision map feature extraction, 4D time-series annotation...


Appen strictly complies with L4+ safety standards to support the deployment of high-level intelligent driving algorithms.
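"4D" here means 3D geometry plus time: the same physical object keeps a single identity across consecutive LiDAR sweeps. Below is a minimal sketch of such a tracked annotation, with illustrative field names rather than any platform's real schema.

```python
from dataclasses import dataclass, field

@dataclass
class Box3D:
    x: float    # box center in LiDAR coordinates (meters)
    y: float
    z: float
    l: float    # box length, width, height (meters)
    w: float
    h: float
    yaw: float  # heading angle (radians)

@dataclass
class TrackedObject:
    """A 4D annotation: one object's 3D boxes linked across timestamps."""
    track_id: int  # stays constant across frames -- this is the 4th dimension
    category: str  # e.g. "vehicle", "pedestrian"
    boxes: dict[float, Box3D] = field(default_factory=dict)  # timestamp -> box

car = TrackedObject(track_id=7, category="vehicle")
car.boxes[0.0] = Box3D(10.0, 2.0, 0.5, 4.5, 1.8, 1.5, yaw=0.05)
car.boxes[0.1] = Box3D(10.6, 2.0, 0.5, 4.5, 1.8, 1.5, yaw=0.05)  # 100 ms later
```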

Beyond that, Appen is developing next-generation data-production agents that autonomously collect, clean, label, and augment data to generate high-quality datasets; a sketch of such a pipeline follows.
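Appen has not published how these agents work internally. As a hedged sketch, such a pipeline could compose the four stages like this, with each stub standing in for a model-driven component:

```python
def collect(topic: str) -> list[str]:
    """Gather raw candidates (crawls, sensor logs, licensed corpora)."""
    return [f"raw text about {topic}", f"raw text about {topic}"]

def clean(raw: list[str]) -> list[str]:
    """Deduplicate, filter, and normalize."""
    return sorted(set(raw))

def label(docs: list[str]) -> list[tuple[str, str]]:
    """Auto-label; low-confidence items would route to human experts."""
    return [(d, "auto-label") for d in docs]

def amplify(labeled: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Expand coverage with augmented or synthetic variants."""
    return labeled + [(d + " (augmented)", y) for d, y in labeled]

def produce_dataset(topic: str) -> list[tuple[str, str]]:
    """Autonomous collect -> clean -> label -> amplify chain."""
    return amplify(label(clean(collect(topic))))

print(produce_dataset("lane changes in heavy rain"))
```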

It is worth mentioning that the Appen engineering team has always held to the maxim that "only speed is unbeatable," demonstrating strong engineering execution:

  • Iterate and update products at least once a week;
  • Convert the latest technical advances into usable product features as fast as possible.

It is not hard to see that, technologically, Appen has stayed at the forefront of the industry.

In finished datasets, Appen offers more than 800 professional datasets, including nearly 100,000 hours of audio, more than 500,000 images, and more than 100 million words of text, covering over 80 languages and dialects.

For high-difficulty datasets, Appen has drawn on a huge domain-expert network to hand-pick more than 1,000 industry experts across sub-disciplines and build a set of more than 100,000 high-difficulty chain-of-thought examples spanning mathematics, computer science, physics, chemistry, biology, and the humanities.

With this data, some clients have achieved a 40% improvement in model performance over a public-data baseline, giving them confidence for what comes next.

The second half of AI: High-quality data is key

The AI industry is currently in a "supercycle," with large-model technologies surging forward like a tide.

The Scaling Law has neither expired nor slowed.

As long as you keep investing computing power and feeding the model enough high-quality data, LLM capabilities keep improving, with almost no ceiling in sight.

According to Our World in Data, from 2010 to October 2024 the amount of AI training data doubled roughly every 9-10 months.

In particular, LLM training datasets have roughly tripled in size every year since 2010.


In 2019, GPT-2 trained on roughly 4 billion tokens; in 2020, GPT-3 expanded that to 300 billion; GPT-4 is even speculated to have used 13 trillion.

The data involved in LLM training has long since jumped from the traditional TB level to the PB level, and has nearly exhausted the public resources of the internet.
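A quick back-of-envelope conversion shows the storage those token counts imply, assuming roughly 4 bytes of raw UTF-8 text per token (a common rule of thumb, not an exact figure):

```python
# Rough storage implied by reported training token counts,
# assuming ~4 bytes of raw text per token (rule of thumb).
BYTES_PER_TOKEN = 4

for name, tokens in [("GPT-2", 4e9), ("GPT-3", 300e9), ("GPT-4 (rumored)", 13e12)]:
    gigabytes = tokens * BYTES_PER_TOKEN / 1e9
    print(f"{name:>16}: {tokens:.0e} tokens ~= {gigabytes:,.0f} GB of raw text")
```

By this estimate even a rumored 13-trillion-token corpus is only tens of terabytes of filtered text; the PB scale above refers to the raw, increasingly multimodal data that must be collected and processed upstream before filtering.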

One forecast puts the AI data center market at US$78.91 billion by 2032, a compound annual growth rate of 24.5%.

As AI competition escalates across the board, the three major "shovel-selling" data service providers have collectively struck gold, their valuations soaring:

  • Meta spent $14.3 billion on a 49% stake, pushing Scale AI's valuation to $29 billion;
  • Surge AI is seeking $1 billion in funding at a valuation exceeding $25 billion;
  • Mercor is in talks for a Series C round at a valuation exceeding $10 billion.

These vivid cases precisely highlight the king status of data in the AI ecosystem.

Data "black holes" are expanding infinitely

At NeurIPS 2024, Ilya Sutskever stated bluntly: "The era of pre-training is coming to an end; internet data is exhausted and will not keep growing."

This prediction sparked heated debate. But in reality, has AI data really run out?

Obviously not. In an interview, Dr. Tian Xiaopeng of Appen offered a forceful rebuttal:

What AI lacks is not data, but high-quality data.

In reality, there is still a large amount of data that has not been effectively utilized. After cleaning and processing, this data can be further used as training data, especially multimodal and domain-specific data.

Ultimately, general AI serves humanity and must meet ever-changing information needs. For ordinary people, near-term information needs far outweigh long-term ones, which requires LLMs to continuously receive the latest training data.

Just as with computing power, AI's demand for data has not diminished; rather, the industry is transforming and upgrading, with huge changes in scale, quality, and complexity.

Traditional deep learning needed only gigabytes to terabytes of data; the LLM era has already reached the petabyte level.

Second, data quality requirements are rising. The 95% accuracy of the early days no longer suffices; in professional fields such as quantum mechanics and healthcare, annotation accuracy must now exceed 99.5%. The difference is stark: at 95%, one label in 20 is wrong; at 99.5%, fewer than one in 200.

In addition, multimodal data fusion has also become a mainstream trend.

Complexity has grown from 2D/3D annotation to 4D annotation that includes the time dimension, along with the coordinated processing of text, images, audio, and video.

All these put higher demands on the technical capabilities of data service providers.

In response to these new challenges, Appen has adopted three main strategies:

  1. Forward-looking technology layout and productization capabilities
  2. Highly flexible intelligent platform
  3. Professional talent network and precise matching mechanism

Appen built emerging data-production platforms ahead of demand, such as GUI trajectory collection, multimodal annotation tools, and embodied-intelligence platforms; plug-and-play modular product design supports rapid deployment and flexible adaptation, significantly improving data service efficiency.

At the same time, Appen has built standardized finished datasets (such as code and high-difficulty question banks), significantly shortening model development cycles in many sub-fields.

Appen's intelligent platform not only responds quickly to multimodal, multi-scenario labeling requirements but also emphasizes agile iteration and fine-grained management of business rules.

This ensures that Appen can implement and deliver projects with complex requirements efficiently and accurately.

In addition, Appen has established an expert resource library and talent-tagging system spanning many fields, intelligently matching talent to task requirements.

In high-barrier verticals such as medicine in particular, Appen dispatches professionals with the corresponding qualifications to guarantee the quality and professionalism of data delivery.

Model evaluation > training, data quality > scale

In April this year, OpenAI research scientist Shunyu Yao argued:

As AI enters the second half, evaluation is more important than training.

On multiple benchmarks, AI has already surpassed most humans, yet the world has not changed dramatically, at least by the measures of economics and GDP.

Shunyu Yao calls this the "utility problem" and regards it as the most critical issue in AI.

He believes that players in the second half of AI will build billion-dollar businesses by transforming intelligence into practical products.

This is a huge change in the data industry: data quality is more important than scale!

The latest figures show that, as of June 2025, China had built more than 35,000 high-quality datasets totaling over 400 PB. Building high-quality datasets has even become a national strategy.

The development of general large models has made AI applications in various vertical fields possible. Even OpenAI has gradually turned its attention to specific fields such as programming.

Data labeling in professional fields such as medicine, law, and finance requires industry experts, and required labeling accuracy has risen from 95% to above 99.5%.

It is estimated that by 2028, the market for medical and health data elements will exceed 25 billion yuan, and industrial manufacturing will reach 30.2 billion yuan.

This has produced coexisting "data deserts" and "data oases": general data faces a bottleneck, while high-value vertical data remains underdeveloped.

In many vertical scenarios, what is lacking is high-quality data, such as extreme accident data in autonomous driving and medical data that is difficult to obtain from the public domain.

And "synthetic data" can fill some of the market gaps.

For example, Nvidia's open-source world model Cosmos can synthesize some of the data needed for autonomous driving.

In some scenarios, real and synthetic data complement each other; a few, such as game imagery, can rely entirely on synthetic data.

However, synthetic data always embeds assumptions and misses special cases that safety-critical industries cannot afford to get wrong.

At present, most application scenarios still require real data to train AI, and improving performance takes professionals producing data that empowers the model.

Indeed, the data industry's professional bar keeps rising: a bachelor's degree no longer suffices, and some companies have begun recruiting PhDs to construct training data!

The domestic AI data service industry has also transformed and upgraded from a labor-intensive industry to a technology-intensive industry.

To meet these challenges, in addition to developing five major technology platforms, including MatrixGo, Appen also established vertical teams led by top industry experts:

  • The medical team has more than 500 medical experts, 15% of whom hold licensed physician qualifications;
  • The financial team has more than 300 experts covering finance, insurance, funds and other fields, 70% of whom have professional qualifications;
  • The code team has more than 120 full-time engineers, covering mainstream programming languages;
  • The legal team is composed of practicing lawyers and legal experts;
  • The mathematics and science team is composed of national competition winners;
  • The music team has more than 500 part-time musicians;
  • The multilingual team covers 200+ languages;
  • The TTS team has thousands of hours of data collection experience in dozens of countries around the world;
  • The literary team brings together talents from 985/211 universities;
  • The aesthetics team consists of more than 50 professional designers.

Healthcare is one of the sectors with the highest data thresholds: representation across patient groups, compliance red lines, and schedule and cost pressures all coexist.

To this end, Appen uses a dual-track approach of "platform + experts" to solve the problem:

  • The data engineering platform integrates intelligent annotation, multimodal fusion, and private deployment capabilities;
  • The expert network ensures that the annotation accuracy approaches clinical-level requirements.

The entire process aligns strictly with GDPR, ISO, and other standards, and standardized SOPs shorten project cycles by 30%-50%.

The result is a faster, more accurate, and more compliant medical AI data foundation that accelerates product rollout and international deployment.

The future of AI, the future of data

In the past, the outside world and even the AI industry often focused on breakthroughs in algorithms and computing power, but had many stereotypes and misunderstandings about the data industry.

Many people believe that the data industry has no future, that a "data desert" is imminent, and that data labeling requires no technical skills and is simply manual labor.

In fact, nothing could be further from the truth.

Yet this industry is advancing at double-digit annual growth, and as a leader, Appen has grown for six consecutive years, taking the top spot in Chinese market share.

The old misconceptions no longer hold: without a technical platform there is no way forward, and manual labor alone cannot handle complex requirements. Technical platforms and data-engineering capabilities have long since become the industry's core competitive advantages.

Today, AI is moving from perception to cognition and reasoning, and its capabilities are expanding from 2D static recognition to 4D spatiotemporal modeling, achieving multimodal fusion.

As a result, data and computing power requirements have grown by orders of magnitude, and quality, traceability, and refinement have become hard requirements.

Once effective breakthroughs are achieved in scenarios such as autonomous driving and medical care, they can be quickly replicated and applied globally.

To support this process, two types of infrastructure need to be completed: high-confidence physical world data for world models, and a multimodal content platform that supports secure connection between enterprises and individuals.

The data industry has shifted from passive supply to co-building cognitive systems and evaluation standards.

Relying on its global resource network, platform-based R&D, and AI-native process transformation, Appen will keep breaking through in the AI wave.

Looking ahead 3-5 years, Appen has clear strategic priorities: deepening its global resource network, building depth in vertical categories, and turning its platforms into products.

Their next goal is to achieve revenue of over 2 billion yuan in China by 2030.

During the interview, Dr. Tian Xiaopeng, Appen's Global Senior Vice President and General Manager of Greater China and North Asia, shared the three principles that will guide its future.

First, data services must be global and delivered compliantly. This is not only about risk control; it is also a source of competitiveness as companies expand overseas.

Secondly, we need to manage the breadth and complexity of our client base to build a true moat. This means being more than just a data labeler; we need to be a data consultant, providing value-added services beyond labeling, such as model evaluation and process optimization.

Finally, we need to build a strong platform. Relying on the dual platform of "technology + human resources," Appen delivers services more competitive than its rivals'.

As long as it holds to these principles and maintains its past growth rates, Appen believes the next "small goal" of 2 billion yuan will be no empty talk.