This case study from Appen demonstrates a sophisticated approach to integrating LLMs into production data annotation workflows, highlighting both the opportunities and challenges of combining artificial and human intelligence in practical applications.
The context of this implementation is particularly important: Appen's research showed a 177% increase in generative AI adoption in 2024, yet simultaneously observed an 8% drop in projects making it to deployment since 2021. Their analysis identified data management as a key bottleneck, with a 10 percentage point increase in data-related challenges from 2023 to 2024. This speaks to the broader industry challenge of scaling high-quality data annotation for AI systems.
The technical architecture of their co-annotation system consists of several key components:
- A co-annotation engine that intelligently routes work between LLMs and human annotators
- An uncertainty calculation system using GPT-3.5 Turbo with multiple prompt variations to assess confidence
- A flexible threshold system for balancing accuracy against cost
- Integration with their AI data platform (ADAP) through a feature called Model Mate

One of the most interesting aspects of their implementation is the uncertainty measurement approach. Rather than relying on the model's self-reported confidence scores (which they found to be unreliable), they generate multiple annotations using different prompt variations and measure the consistency of the outputs. This provides a more robust measure of model uncertainty and helps determine which items need human review.

The system's workflow is particularly noteworthy from an LLMOps perspective:
- Initial data processing through LLMs (primarily GPT-3.5 Turbo)
- Uncertainty calculation using multiple prompt variations
- Automated routing based on entropy/uncertainty thresholds
- Human review for high-uncertainty cases
- Quality sampling of high-confidence cases
- Feedback loop for continuous improvement
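The entropy-based routing step described above can be sketched as follows. This is a minimal illustration, not Appen's actual implementation: the function names, the label-based entropy measure, and the threshold value are all assumptions.

```python
import math
from collections import Counter

def prompt_variation_entropy(labels):
    """Shannon entropy (bits) over the labels an LLM assigns to the
    same item when asked with several prompt variations. Unanimous
    answers give 0.0; maximal disagreement gives the highest value."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def route(labels, threshold=0.5):
    """Send an item to human review when answers disagree across
    prompt variations (high entropy); otherwise auto-accept it
    (subject to downstream quality sampling)."""
    if prompt_variation_entropy(labels) > threshold:
        return "human_review"
    return "auto_accept"
```

For example, five identical answers route to `auto_accept`, while a 3-to-2 split (entropy about 0.97 bits) exceeds the illustrative 0.5 threshold and routes to `human_review`. Raising the threshold trades accuracy for cost by sending fewer items to humans.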
The system demonstrated impressive results in production:
- 87% accuracy with the hybrid approach (compared to 95% with pure human annotation)
- 62% cost reduction ($450 to $169 per thousand items)
- 63% reduction in labor time (150 hours to 55 hours)
- 8 seconds per item for LLM processing versus 180 seconds for human annotation

A particularly interesting production case study involved a leading US electronics company seeking to improve the accuracy of its search relevance data. The implementation used GPT-4 for multimodal analysis of search queries, product titles, and images. Key findings included:
- A 3-4 percentage point accuracy increase across different components
- 94% accuracy when combining LLM assistance with human annotation (up from 90%)
- Importantly, incorrect LLM suggestions did not negatively impact human accuracy

Their Model Mate feature implementation shows careful consideration of production requirements, including:
- Support for multiple LLMs in the same workflow
- Real-time integration within existing task designs
- Flexibility to handle various data types (text, image, audio, video, geospatial)
- Built-in monitoring and validation capabilities
- Support for custom routing rules and multi-stage reviews

From an LLMOps perspective, several best practices emerge from this implementation:
- Use of multiple prompt variations for robust uncertainty estimation
- Flexible thresholds that can be adjusted to meet accuracy and cost requirements
- Integration of human expertise at strategic points in the workflow
- Regular sampling of high-confidence predictions to ensure quality
- Support for multimodal inputs and various data types
- Built-in monitoring and evaluation capabilities

The system also addresses common LLM challenges in production:
- Hallucination mitigation through human verification
- Bias protection through diverse human annotator pools
- Quality control through strategic sampling
- Cost management through intelligent routing

The implementation demonstrates a sophisticated understanding of both the capabilities and limitations of LLMs in production. Rather than attempting to fully automate annotation, Appen has created a system that leverages the strengths of both AI and human annotators while mitigating their respective weaknesses. This approach appears particularly well suited to scaling annotation operations while maintaining quality standards.
Looking forward, the system appears designed to accommodate future improvements in LLM technology while maintaining its core benefits. The flexible architecture allows for easy integration of new models and adjustment of routing thresholds as model capabilities evolve.
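A pluggable design of this kind can be sketched roughly as below. The class, parameter names, and audit-sampling rate are hypothetical illustrations, not part of ADAP's or Model Mate's actual API; the sketch only shows how a model backend and routing thresholds could be swapped without changing the surrounding workflow.

```python
import random

class CoAnnotationRouter:
    """Illustrative sketch of a model-agnostic co-annotation router.

    Any callable mapping an item to (label, uncertainty) can serve as
    the backend, so newer LLMs can be dropped in as they improve, and
    the uncertainty threshold can be retuned as capabilities evolve.
    """

    def __init__(self, annotate_fn, uncertainty_threshold=0.5, audit_rate=0.05):
        self.annotate_fn = annotate_fn                  # item -> (label, uncertainty)
        self.uncertainty_threshold = uncertainty_threshold
        self.audit_rate = audit_rate                    # fraction of confident items spot-checked

    def process(self, item):
        label, uncertainty = self.annotate_fn(item)
        if uncertainty > self.uncertainty_threshold:
            # Low-confidence items go to human annotators.
            return {"item": item, "label": label, "route": "human_review"}
        # High-confidence items are auto-accepted, with a small random
        # sample routed to human audit for ongoing quality control.
        if random.random() < self.audit_rate:
            return {"item": item, "label": label, "route": "human_audit"}
        return {"item": item, "label": label, "route": "auto_accept"}
```

Lowering `uncertainty_threshold` shifts more volume to humans (higher accuracy, higher cost); raising it does the reverse, mirroring the accuracy/cost trade-off discussed above.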