Quality management in data collection projects: 3 things to consider

Demand for data collection projects has grown enormously over the past 10 years. The fundamental work of these projects typically involves translating large quantities of language data, which is then used to train natural language processing (NLP) or machine learning engines. Because data collection projects need to be managed according to their own specific requirements, Transee has adapted its language quality assurance (LQA) strategies to meet this need. Below are several tips for data collection quality management that we’ve found effective.

Translation quality management

Most localization experts are familiar with translation quality concepts, but the process of running a language quality program can often be unclear to their clients. To borrow an analogy from manufacturing, LQA can be described as a sampling process whose goal is to prevent defects from reaching customers. A sample of a finished product is tested, and non-conformances may be detected. If any are found, an attempt is made to identify their causes and adopt preventive measures. Replace the batch sample of a product with a selection from a translated text and you have a good idea of how LQA works. In an LQA, a text sample is reviewed by an independent assessor who reports errors (non-conformances) and assigns error categories and severities to them.
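
As a rough illustration, the errors logged in a sample could be turned into a single score along the following lines. This is only a sketch: the severity weights, the normalization to 1,000 words, the pass threshold and all names in the code are illustrative assumptions, not Transee's or any client's actual LQA model.

```python
from dataclasses import dataclass

# Hypothetical severity weights; a real LQA model defines its own.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


@dataclass
class LoggedError:
    category: str  # e.g. "accuracy" or "terminology" (illustrative labels)
    severity: str  # must be a key of SEVERITY_WEIGHTS


def penalty_per_1000_words(errors, sample_word_count):
    """Return weighted error points normalized to 1,000 words of sample."""
    if sample_word_count <= 0:
        raise ValueError("sample_word_count must be positive")
    points = sum(SEVERITY_WEIGHTS[e.severity] for e in errors)
    return points * 1000 / sample_word_count


def passes(errors, sample_word_count, threshold=10.0):
    """A sample passes if its normalized penalty is at or below the threshold."""
    return penalty_per_1000_words(errors, sample_word_count) <= threshold


# One major and two minor errors logged in a 500-word sample.
errors = [
    LoggedError("accuracy", "major"),
    LoggedError("terminology", "minor"),
    LoggedError("grammar", "minor"),
]
score = penalty_per_1000_words(errors, sample_word_count=500)
```

Normalizing by sample size is one way to make results comparable between samples of different lengths; the threshold itself would normally come from the client's quality requirements.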

In short, LQA is an attempt to apply quality control to the very human activity of translation. It proceeds on the assumption that language quality can be assessed and evaluated, and that any two (or more) professional linguists can agree on whether a translation meets certain criteria.

As in manufacturing, quality assurance in translation is about tactically managing risk. Language quality managers typically focus on:

Selecting an appropriate sample size and ensuring consistent quality. A small sample may not represent the general quality of the project, because it can easily miss the more challenging parts of the documents. On the other hand, a large sample will affect the deadline and project margin, and still may not provide a clear picture of quality.

Reviewing the quality of the LQA. An experienced LQA reviewer plays an important role here: the reviewer must understand the subject, comprehend all client instructions, work thoroughly, and strike a good balance between harshness and leniency in their assessment.

The internal risks of a project can be managed through quality management strategies, including:

· Sampling frequency

· Selecting representative samples. When translating a larger document, the sample should be selected from different parts of the file, and these samples should cover all the content types included in the project.

· Careful selection, vetting and training of LQA reviewers

· LQA follow-up. Following up on issues that arise in an LQA review will make that review far more effective than an LQA with no follow-up.

Adapting LQA to data collection projects

Data collection projects pose a number of challenges for translators. Clients want every project to benefit their machine learning models, so they generally give translators complex instructions. The source sentences are often incomplete or fragmented and cover a wide range of subject matter, requiring translators to do extensive research to become familiar with the context. Text selections are often presented out of context and can be ungrammatical, colloquial, slangy or very technical with a specialized vocabulary. Finally, most data collection projects are translation-only, which means that language solution providers can’t rely on the normal 3-step workflow of translation, editing and proofreading by a second linguist.

The Transee advantage

As Transee has worked with many data collection clients, we’ve adopted the same LQA standards that our clients use to evaluate deliveries from suppliers. These standards are specific to data collection requirements and are different from the ones applied to typical translation projects for reader consumption. For example, they typically discourage reviewers from logging stylistic improvements, which don’t add much benefit given the purpose and size of these projects. In some cases, Transee has developed hybrid models that draw on the client’s LQA standards but weigh certain error categories or severities differently.

All LQA reviewers should attend training sessions alongside the translators to ensure that everyone is on the same page about the parameters and objectives of each project. We’ve found that this initial step results in a translation and review team that is aligned and thoroughly understands all client instructions, goals and the LQA process.

In our experience, a very hands-on approach is necessary in managing data collection LQAs. The quality team often needs to give suggestions at different phases of the LQA process, train the reviewers further on specific aspects of the process, correct misperceptions and act as moderators when the translators and reviewers have differing interpretations of the client instructions and requirements. Transee has built up in-house expertise in this area, and we’re familiar with the typical issues and questions that arise during translation and LQA of data collection projects. With this body of experience, we can guide the teams to settle issues in accordance with the client’s expectations.

When sampling, we distinguish between data collection projects that relate to training AI-powered virtual assistants and projects aimed at training machine translation engines. The data for virtual assistants can usually be sorted according to scenarios (smart home, TV, music, driving directions, weather, planning, shopping lists, and more), and we ensure the LQA samples cover as many of these scenarios as possible. Preparing samples for virtual assistant data can require a lot of up-front work, but we can be fairly confident that the sample is representative, because the content that is not reviewed will consist of variations of the same commands and responses. In projects related to training machine translation engines, the language data is typically random, so the samples can be randomly selected.
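
The two sampling approaches can be sketched roughly as follows. The segment structure (a dict with `id` and `scenario` keys), the function names and the sample sizes are assumptions made for illustration only; they do not describe Transee's actual tooling.

```python
import random
from collections import defaultdict


def stratified_sample(segments, per_scenario, seed=0):
    """Pick up to `per_scenario` segments from every scenario.

    Intended for virtual assistant data, where each segment is tagged
    with a scenario and the sample should cover all of them.
    """
    rng = random.Random(seed)
    by_scenario = defaultdict(list)
    for seg in segments:
        by_scenario[seg["scenario"]].append(seg)
    sample = []
    for scenario in sorted(by_scenario):
        group = by_scenario[scenario]
        sample.extend(rng.sample(group, min(per_scenario, len(group))))
    return sample


def random_sample(segments, size, seed=0):
    """Pick `size` segments uniformly at random (machine translation data)."""
    rng = random.Random(seed)
    return rng.sample(segments, min(size, len(segments)))


# Twelve toy segments spread evenly over three scenarios.
segments = [
    {"id": i, "scenario": scenario}
    for i, scenario in enumerate(["smart home", "music", "weather"] * 4)
]
va_sample = stratified_sample(segments, per_scenario=2)
mt_sample = random_sample(segments, size=5)
```

Grouping by scenario before drawing guarantees that no scenario is left out of the sample, which a purely random draw cannot promise when some scenarios are rare.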

Because of the huge volume of data, a large number of linguists work on these projects. The variations in translators’ work quality and in the difficulty of the content, however, stay within reasonable limits that satisfy the clients’ quality requirements, and we can use the results of an LQA to provide extra training or clarification to a translator with a higher error rate.
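
One simple way to spot translators who might benefit from extra training is to compare error rates across the reviewed segments. The sketch below assumes each reviewed segment records a translator ID and an error count; the field names and the 25% threshold are hypothetical.

```python
from collections import Counter


def error_rates(reviewed_segments):
    """Map each translator to the share of their reviewed segments with an error."""
    reviewed = Counter()
    with_errors = Counter()
    for seg in reviewed_segments:
        reviewed[seg["translator"]] += 1
        if seg["error_count"] > 0:
            with_errors[seg["translator"]] += 1
    return {t: with_errors[t] / reviewed[t] for t in reviewed}


def needs_follow_up(rates, threshold=0.25):
    """Return translators whose error rate exceeds the (assumed) threshold."""
    return sorted(t for t, rate in rates.items() if rate > threshold)


# Toy LQA results for two translators.
reviewed_segments = [
    {"translator": "T1", "error_count": 0},
    {"translator": "T1", "error_count": 0},
    {"translator": "T1", "error_count": 1},
    {"translator": "T1", "error_count": 0},
    {"translator": "T2", "error_count": 2},
    {"translator": "T2", "error_count": 0},
]
rates = error_rates(reviewed_segments)
flagged = needs_follow_up(rates)
```

With only a handful of reviewed segments per translator, such rates are noisy, so a flag is best treated as a prompt for a closer look rather than a verdict.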

For a quality management program in data collection projects to succeed, the most important thing is that it be cost-effective and reliable in detecting issues, especially issues with following client instructions. Several factors need to align for this to work. It’s a very hands-on process where all parties involved (supply chain, quality, production, translators and LQA reviewers) must be flexible and open to learning and improving.

Transee Translation