Overview

We are a dynamic and innovative small SaaS company specializing in language data products and services. We are a team of 17, distributed across two offices in Amsterdam and Thessaloniki.

About the Project

TAUS is executing technical workstreams for the European Commission’s BEACON project, focused on collecting, curating, and publishing high-quality parallel text corpora for machine translation in EU candidate country languages. This 9-month project involves processing hundreds of millions of sentences from diverse sources, applying rigorous quality assurance frameworks, and preparing publication-ready datasets for seven language pairs: English paired with Ukrainian, Serbian, Bosnian, Macedonian, Albanian, Montenegrin, and Romanian/Moldovan, with particular focus on legal and administrative domains.

Position Overview

We seek a skilled and motivated Language Data Engineer to join our technical team for large-scale parallel corpus collection, processing, and quality assurance. You will work hands-on with real-world challenges in low-resource language processing and quality assurance at scale, and contribute directly to expanding Europe’s multilingual digital infrastructure.

Responsibilities

Data Collection & Acquisition

Download and catalog parallel corpora from public repositories and implement targeted web crawling for legal/administrative domain content,
Extract text from diverse formats (PDFs, HTML, document archives) and apply bilingual as well as monolingual corpus mining techniques,
Document source provenance, licensing, and metadata comprehensively.

Corpus Processing & Pipeline Management

Execute preprocessing pipelines: format normalization, sentence segmentation, alignment, language identification, and quality filtering,
Handle large-scale data processing with deduplication and anonymization,
Maintain detailed processing logs and quality metrics throughout the pipeline.

Quality Assurance & Validation

Validate NLP tool performance across seven language pairs and implement automated quality checks (alignment confidence, language ID accuracy, domain classification),
Coordinate with linguists for human validation and generate quality reports with statistical metrics,
Troubleshoot and resolve quality issues in processing workflows.

Documentation & Collaboration

Contribute to technical deliverables and project documentation meeting EC standards,
Collaborate with European Commission experts and cross-functional teams on methodology and quality criteria,
Ensure compliance with EU data governance, GDPR, and licensing requirements.

Company:

TAUS

Qualifications:

• 3+ years of work experience with Natural Language Processing (NLP)

• 3+ years of work experience with Python (Programming Language)

Specific requirements:

• Authorized to work: Yes

Level of experience (years):

Mid Career (2+ years of experience)

