Date: 2023-02-13

Today, the fanatical pursuit of ChatGPT seems to have become a kind of “political correctness”.

As soon as ChatGPT came out, both academia and industry were shaken. A senior researcher at a research institution commented on AI technology: “When ChatGPT came out, we were simply at a loss. Never mind that its generation is better than ours; its NLP (natural language processing) ability is also better than ours, by quite a lot.”

Microsoft is investing tens of billions of dollars, and Google is on alert as if facing a formidable enemy. The wave that ChatGPT has set off in the technology world is still building.

However, ChatGPT is not a “master key”: in certain specialized fields, the accuracy of large models still cannot surpass dedicated vertical products. Tencent AI Lab recently showed experimentally that in machine translation, ChatGPT is weaker than commercial translation products in some cases.

Paper address: https://arxiv.org/pdf/2301.08745v1.pdf

1

Is ChatGPT a good translator?

The evaluation paper from Tencent AI Lab points out:

First, in high-resource settings, such as European languages, ChatGPT is competitive with commercial translation products (such as Google Translate and DeepL Translate), but in low-resource settings or on distant languages it lags significantly behind;

Second, in terms of translation robustness, ChatGPT does not match commercial translation products on biomedical abstracts or Reddit comments, but it may be a good translator for spoken language.

In order to better understand the translation ability of ChatGPT, Tencent AI Lab conducted experiments from the following three aspects:

Translation prompts: ChatGPT is a large language model and needs a prompt to steer it toward translation, so the wording of the prompt affects the quality of the translation output. For example, in multilingual machine translation models, how the source and target language information is conveyed matters a great deal, and this is usually handled by attaching language tags.

Multilingual translation: ChatGPT is a single model that handles various NLP tasks and covers different languages, so it can be regarded as a unified multilingual machine translation model. Its performance across resource levels (high versus low) and across language families (for example, European versus Asian) is therefore one of the key questions explored in this experiment.

Translation robustness: ChatGPT is a model developed on the basis of GPT-3, which was trained on large-scale data sets covering various domains. Its performance in specific domains is therefore another focus of the researchers.

Translation prompts

To design prompts that trigger ChatGPT’s machine translation capability, the Tencent AI Lab team put the following request to ChatGPT itself:

Provide ten concise prompts or templates that can make you translate.

and obtained the results shown in Figure 1:

Figure 1: 10 prompts recommended by ChatGPT that can trigger machine translation

The generated prompts look reasonable, but they all share a similar format. The researchers condensed them into three candidate prompts, Tp1 to Tp3 (as shown in Figure 2), where [SRC] and [TGT] represent the source and target languages of the translation, respectively. The researchers also added an extra instruction to Tp2 asking ChatGPT not to put double quotes around translated sentences (which often happens with the original format). Even so, ChatGPT remains unstable; for example, it sometimes merges multi-line sentences from the same batch into a single line.

Figure 2: Candidate translation prompts
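To make the placeholder mechanism concrete, here is a minimal sketch of how a template with [SRC] and [TGT] slots could be filled in before a batch of sentences is sent to a chat model. The template wording and the `build_prompt` helper are illustrative assumptions, not the exact text of Tp1 to Tp3 from Figure 2.

```python
from typing import List

# Hypothetical template for illustration; the real Tp1-Tp3 wording is in the paper's Figure 2.
TEMPLATE = "Translate these sentences from [SRC] to [TGT]:"


def build_prompt(template: str, src: str, tgt: str, sentences: List[str]) -> str:
    """Fill the [SRC]/[TGT] placeholders and append one sentence per line."""
    header = template.replace("[SRC]", src).replace("[TGT]", tgt)
    return header + "\n" + "\n".join(sentences)


prompt = build_prompt(TEMPLATE, "German", "English", ["Guten Morgen.", "Vielen Dank."])
print(prompt)
```

Keeping one sentence per line in the prompt is what makes the instability described above visible: if the model returns fewer lines than it was given, the batch was merged.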

The researchers compared the three candidate prompts on the Flores-101 test set for the Chinese-to-English translation task. Figure 3 shows the results for ChatGPT and three other translation systems. Although ChatGPT produces reasonably good translations, it still lags behind the baselines by at least 5.0 BLEU points. Among the three candidate prompts, Tp3 performed best on all metrics, so the researchers use Tp3 by default in the rest of the paper.

Figure 3: Comparison of the translation performance of ChatGPT using different prompts in the Chinese-to-English translation task
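For readers unfamiliar with the metric, BLEU scores a translation by its n-gram overlap with a reference, multiplied by a penalty for output that is too short. The sketch below is a simplified version under stated assumptions (whitespace tokenization, a single reference per sentence, no smoothing); it is not the scoring tool used in the paper, and its numbers will not match published BLEU scores.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Simplified corpus-level BLEU on a 0-100 scale (one reference per sentence)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: only hypotheses shorter than the reference are penalized.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


print(corpus_bleu(["the cat sat on the mat today"], ["the cat sat on the mat"]))
```

A hypothesis identical to its reference scores 100, and one sharing no words with it scores 0, so a gap of 5.0 points is a gap on this 0 to 100 scale.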

Multilingual translation

Tencent AI Lab selected four languages to evaluate ChatGPT’s multilingual translation ability: German (De), English (En), Romanian (Ro) and Chinese (Zh), all of which are commonly used in both research and competitions. The first three are Indo-European languages written in the Latin script, while Chinese belongs to the Sino-Tibetan family and is written in Chinese characters. The researchers tested translation performance between every pair of languages, for a total of 12 translation directions.
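The count of 12 follows from treating each ordered pair of the four languages as its own direction (4 × 3 = 12). A short sketch, using the language codes from the article:

```python
from itertools import permutations

LANGUAGES = ["De", "En", "Ro", "Zh"]

# Each ordered (source, target) pair is a separate translation direction.
directions = [f"{src}->{tgt}" for src, tgt in permutations(LANGUAGES, 2)]

print(len(directions))
print(directions)
```

Direction matters here because, as the results show, translating into a language can be much harder than translating out of it.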

Resource differences

Even languages within the same family differ in the resources available for them. In machine translation, German-English translation is generally considered a high-resource task, with more than 10 million parallel sentence pairs. Far less parallel data is available for translation between Romanian and English.

As shown in Figure 4, ChatGPT is comparable to Google Translate and DeepL on German-to-English and English-to-German translation, while it lags significantly behind on Romanian-to-English and English-to-Romanian translation. Specifically, ChatGPT’s BLEU score on English-to-Romanian translation is 46.4% lower than that of Google Translate.

Figure 4: Performance of ChatGPT in multilingual translation
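Note that the 46.4% figure is a relative gap, whereas the earlier “5.0 BLEU points” is an absolute one. A minimal sketch of the difference, using made-up scores that are not values from Figure 4 (`relative_drop` is a hypothetical helper name):

```python
def relative_drop(baseline: float, system: float) -> float:
    """Percentage by which `system` scores below `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (baseline - system) / baseline


# Hypothetical BLEU scores for illustration only, not taken from the paper.
baseline_bleu, system_bleu = 40.0, 30.0

print(f"absolute gap: {baseline_bleu - system_bleu:.1f} BLEU points")
print(f"relative gap: {relative_drop(baseline_bleu, system_bleu):.1f}%")
```

With these example numbers, a 10-point absolute gap corresponds to a 25% relative drop, so the two kinds of figures should not be compared directly.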

The researchers believe that the large gap in monolingual data between English and Romanian limits the model’s language modeling capability for Romanian, which partly explains the poor performance of English-to-Romanian translation.

Conversely, translation from Romanian to English can benefit from the model’s strong English modeling ability, which compensates to some extent for the shortage of parallel data.

Language family

At the same time, the researchers also considered the influence of language family.

It is generally believed that in machine translation, translating between different language families is harder than translating within the same family. Taking cultural and script differences into account, the researchers compared German-English translation with Chinese-English and German-Chinese translation.

In the cross-family pairs, the gap between ChatGPT and the commercial translation systems becomes larger. The researchers attribute this to knowledge transferring better within a language family than across families. For pairs that are both low-resource and cross-family (such as translation between Romanian and Chinese), the gap widens further.

Because ChatGPT handles different tasks in a single model, low-resource translation tasks compete for model capacity not only with high-resource translation tasks but also with other NLP tasks, which helps explain the weaker performance.

Translation robustness

Tencent AI Lab further evaluates the translation robustness of ChatGPT on WMT19 Bio and WMT20Rob2 and Rob3 test sets, which introduce domain bias and potentially noisy data.

For example, the WMT19 Bio test set is composed of Medline abstracts, which require domain-specific knowledge to translate, while WMT20Rob2 consists of comments from Reddit, which may contain various errors such as typos, word omissions, inserted repetitions, grammatical errors, disruptive language, and internet slang. Figure 5 lists the BLEU scores, and it is clear that ChatGPT does not perform as well as Google Translate and DeepL Translate on the WMT19 Bio and WMT20Rob2 test sets.

Figure 5: ChatGPT’s performance on translation robustness

The reason may be that commercial translation products like Google Translate must continuously improve their ability to translate domain-specific (such as biomedical) or noisy sentences, because they are real-world applications that need to handle out-of-distribution data well. ChatGPT, by contrast, has not been tuned for this.

However, one interesting finding is that ChatGPT greatly outperformed Google Translate and DeepL Translate on the WMT20Rob3 test set, which contains a crowdsourced speech recognition corpus. This suggests that ChatGPT, being essentially an AI dialogue tool, can generate more natural spoken language than commercial translation software (see Figure 6).

Figure 6: Example from the WMT20 robustness test set 3 (WMT20Rob3)

2

How should ChatGPT maximize its strengths and avoid its weaknesses?

This research shows that ChatGPT, for all the acclaim it receives and all the computing power each training run consumes, cannot be perfect in every field. Some people have therefore begun to wonder whether they should “abandon” the idea of large models and turn to the “intensive cultivation” of small models.

Tencent AI Lab noted in its ChatGPT “evaluation” that translation between Romanian and English lags far behind translation between German and English. The reason is that the huge resource gap limits the model’s language modeling ability for Romanian, which shows that an AI system’s learning ability is often constrained by low resources.

However, some senior scholars believe that although ChatGPT still has many shortcomings, it offers a great deal of inspiration to researchers and entrepreneurs. AI 3.0, as represented by ChatGPT, is taking a different path from previous AI waves: it is more grounded and closer to the real world, and it applies more directly to industry. The path from academic research to industrial deployment has also become shorter and faster.

In the future, “helpful, truthful, and harmless” AI systems will become a reality.

