During the back-to-school season, an AI paper marking system in the United States that claims to serve 20,000 schools was called into question. By exploiting a vulnerability in the system and entering the right keywords, students could easily get high scores even when there was no connection between the keywords.

With the development of artificial intelligence, many educational apps have adopted intelligent scoring systems. These systems mark papers quickly and return scores promptly, which many teachers and students welcome. At the same time, however, many parents complain about them. With the scoring feature of one English reading app, for example, even users who have passed TEM-8, the highest level of China's Test for English Majors, sometimes score only 80 points.

AI is used not only in intelligent scoring systems for spoken English but also in marking written papers. Yet this kind of intelligent marking system has had its own embarrassing failures. Reportedly, during the back-to-school season, an AI paper marking system that claims to serve 20,000 schools in the United States came under question after students easily passed exams without studying by exploiting its loopholes. The loophole they took advantage of is that the system scores only on keywords: as long as students enter the expected keywords, they can pass and even get high scores, even if there is no connection between the keywords.

Evaluation criteria must be set before papers are marked

. "Automatic evaluation and scoring systems generally need to set the evaluation criteria first, and then design the appropriate evaluation algorithm and model according to the set standards." Xiong Deyi, professor and doctoral supervisor of the Department of Intelligence and Computing at Tianjin University, introduced that for example, oral evaluation and scoring, a machine needs to judge whether the person's pronunciation is standard, whether the stress of the sentences read is correct, whether the read sentences are coherent and smooth, and whether the continuous reading is accurate. The

The AI marking system involves judging language and text, covering many aspects such as grammar and semantics, and therefore makes extensive use of natural language processing technology.

"Natural language processing technology is an important branch of artificial intelligence. It studies the use of computers to intelligently process natural language. The basic natural language processing technology mainly revolves around different levels of language, including 7 levels: phoneme (the pronunciation pattern of language), morphology (how words and letters form words, and morphological changes of words), vocabulary (the relationship between words), syntax (how words form sentences), semantics (the meaning corresponding to language expression), pragmatics (semantic interpretation in different contexts), and chapters (how sentences are combined into paragraphs). "Xiong Deyi emphasized that these basic natural language processing technologies are often used in various downstream natural language processing tasks (such as machine translation, dialogue, question and answer, document summary, etc.). Language and text evaluation in automatic marking usually involves several layers of these 7 levels. There are many ways to design automatic evaluation indicators in

There are many ways to design automatic evaluation metrics, and the appropriate method is usually chosen according to the type of question being judged. "For example, to automatically grade translation questions, the teacher can write several reference translations in advance; the system then compares the student's answer with the references and uses their similarity as an indicator of how good the answer is," Xiong Deyi said, adding that BLEU, a commonly used evaluation metric for machine translation, computes similarity based on the degree of n-gram matching between the reference translation and the machine translation.

A single word is a unigram, two consecutive words form a bigram, and trigrams and 4-grams follow the same pattern. If a word in the answer matches a word in the reference answer, it earns unigram credit; bigram, trigram and 4-gram credits are calculated in the same way. Researchers assign different weights to the different n-gram orders and combine the scores into a single objective value: the higher the score, the greater the similarity between the two texts.
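To make the idea concrete, here is a minimal sketch of n-gram matching between a student answer and a reference answer. It is not the BLEU implementation of any actual marking system; full BLEU uses a geometric mean and a brevity penalty, which are simplified away here, and the example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all consecutive n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n):
    """Fraction of the candidate's n-grams that also appear in the reference (clipped counts)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not cand:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / sum(cand.values())

def similarity_score(candidate, reference, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted combination of 1- to 4-gram precisions (a simplified, BLEU-like score)."""
    cand_tokens, ref_tokens = candidate.split(), reference.split()
    return sum(w * ngram_precision(cand_tokens, ref_tokens, n + 1)
               for n, w in enumerate(weights))

reference = "the silk road carried wealth between china and india"
student = "wealth moved along the silk road between china and india"
print(round(similarity_score(student, reference), 3))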

Different AI scoring systems produce very different results

The incident that set off the controversy over the AI paper marking system was that the son of an American history professor scored only 50% on a history exam. After the professor reviewed her son's answers, she felt there was essentially nothing wrong with them.

Why, for the same answer, is there such a big difference between manual evaluation and machine evaluation?

"This is the biggest challenge facing automatic evaluation based on AI algorithms: how to be consistent with manual evaluation. There are many problems that need to be solved to deal with this challenge.For example, how to formulate appropriate evaluation standards, and the automatic evaluation of subjective questions must have appropriate evaluation standards and specifications; for example, how to deal with the ever-changing language, the diversity of language is one of the main challenges of natural language processing technology, and automatic evaluation and automatic processing of language must face the challenges of diversity; for example, how to design a comprehensive evaluation indicator. Although there are various indicators at present, few indicators comprehensively consider all aspects of language and text, such as automatic essay marking, it may be necessary to consider whether the words are used reasonably (vocabulary), whether the sentences are fluent (syntax), whether the paragraph organization is organized (section), and whether the content is subject to the topic (semantics, pragmatics), etc. "Xiong Deyi said that the BLEU mentioned above only considers the strict matching of word forms, and does not consider factors such as morphological changes, semantic similarity, syntactic rationality of translations, etc.

"The evaluation rules followed and the starting points of judgment are different, and the corresponding algorithm models are different, so the final results will be very different. "Xiong Deyi said.

Relying on a single evaluation method is clearly inadequate, which explains why, when the child's mother experimentally padded the answer with keywords from the question such as "wealth, caravan, China, India," she received full marks even though the keywords had no connection to one another. "Perhaps this AI marking system uses only simple keyword matching, which is why a 'keyword salad' can slip through," Xiong Deyi explained.
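To illustrate why pure keyword matching can be gamed, here is a hypothetical scorer, not the real system, that awards points simply for the presence of expected keywords. A "keyword salad" with no coherent meaning scores just as well as a genuine answer; the keywords are taken from the example above, the answers are invented.

```python
def keyword_score(answer, keywords):
    """Score an answer by the fraction of expected keywords it mentions,
    ignoring whether the sentence is coherent or even grammatical."""
    words = set(answer.lower().split())
    hits = sum(1 for kw in keywords if kw.lower() in words)
    return hits / len(keywords)

def clean(text):
    """Crudely strip punctuation so 'India.' still matches 'india'."""
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())

keywords = ["wealth", "caravan", "china", "india"]
real_answer = "A caravan carried wealth along trade routes linking China and India."
keyword_salad = "wealth caravan china india"

print(keyword_score(clean(real_answer), keywords))    # 1.0
print(keyword_score(clean(keyword_salad), keywords))  # 1.0 -- the salad gets full marks too
```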

In addition, there are big differences between manual and machine evaluation of spoken language. "Although speech recognition performance has improved significantly in recent years thanks to deep learning, recognition accuracy drops considerably in open and noisy environments," Xiong Deyi explained. If the machine first mis-"hears" a word and then evaluates based on that transcription, the error propagates: a mistake in the upstream system causes mistakes downstream, the errors compound, and the evaluation result drifts further and further from a human's.

"There are currently many methods for designing evaluation indicators, and there are many improved methods, such as calculating the recall while calculating the accuracy rate. In addition, there are also evaluations of evaluation indicators, that is, evaluations, to see which evaluation indicator is more perfect and more consistent with people's evaluations. "Xiong Deyi sighed that in many cases, the difficulty of automatic evaluation and the difficulty of corresponding natural language processing tasks are the same from a technical perspective. For example, using a machine to evaluate the quality of a translation is similar to using a machine to generate a translation. Using a machine to judge the quality of a document summary is similar to using a machine to generate a summary.

Combining manual evaluation can make the system smarter

" Traditional automatic evaluation indicators are usually calculated based on symbols, and now AI technologies such as deep learning are increasingly used in evaluation tools. "Xiong Deyi introduced that using deep learning, language symbols can be mapped to the semantic space of real dense vectors, and semantic vectors can be used to calculate similarity. Even if the words spoken are different from those learned by computers, as long as the semantics are consistent, the machine can conduct accurate evaluations. Therefore, automatic evaluation based on deep learning can some extent cope with the challenge of language diversity. However, deep learning also has a problem, which is that it requires a large amount of data for the machine to learn.

In recent years, pre-trained language models based on self-supervised learning have made breakthrough progress in language representation learning. "OpenAI's pre-trained language model GPT-3 trained a neural network with 175 billion parameters on a massive corpus of 500 billion words. By learning from text in many languages on the web, GPT-3 has developed a powerful ability to represent language and can perform a variety of tasks, such as automatic translation, story generation, common-sense reasoning, question answering, and even addition and subtraction. Its accuracy on two-digit addition and subtraction reaches 100%, while on five-digit addition and subtraction it is close to 10%," Xiong Deyi said. However, if such a huge neural network is stored as single-precision floating-point numbers, it requires about 700 GB of storage space, and training the model once costs 4.6 million US dollars. So even though GPT-3 has strong zero-shot and few-shot learning ability, its high cost keeps it far from being universally available.
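The 700 GB figure follows from simple arithmetic: 175 billion parameters at 4 bytes each (one single-precision float per parameter) comes to roughly 700 gigabytes, as the quick calculation below shows.

```python
parameters = 175_000_000_000  # GPT-3 parameter count
bytes_per_param = 4           # single-precision (FP32) float

total_bytes = parameters * bytes_per_param
print(total_bytes / 1e9, "GB")  # 700.0 GB
```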

Nevertheless, as a "teacher" doing marking and evaluation, AI has advantages that human markers cannot match. The AI automatic marking system is faster than manual marking: teachers cannot remember every answer to every multiple-choice question at once and must repeatedly consult the answer key, which is time-consuming, so the automatic system greatly improves their efficiency. The automatic system is also more dispassionate: it is not swayed by external conditions, will not misjudge because of fatigue, and can still produce correct results even in complex, distracting environments. After scoring, the AI marking system can also analyze learning outcomes directly, compiling exam data, wrong-answer statistics and other teaching material, which lightens teachers' workloads and helps students learn more efficiently.

"Reasonably objectifying subjective questions can reduce the difficulty of automatic marking of papers." Xiong Deyi said that although it is difficult to set comprehensive evaluation standards for subjective questions that cannot be objectified, it is still feasible to set evaluation standards in a certain aspect, such as the evaluation of word lexicon and sentence grammar. The accuracy rate is still quite high. This type of technology can move from laboratory to product application.

Manual evaluation can also be introduced to review and correct the scores given by the AI marking system. Through repeated correction of this kind, a large amount of evaluation training data accumulates, making machine scoring more intelligent.
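One way to picture this human-in-the-loop process is as a simple correction loop. The sketch below is a hypothetical illustration, not any particular product: every teacher override becomes a training example that can later be used to retrain the scoring model.

```python
def review_scores(machine_scores, human_review):
    """Collect (answer, corrected_score) pairs wherever a teacher overrides the machine.

    machine_scores: dict mapping answer text -> machine score
    human_review:   callable returning a corrected score, or None to accept the machine score
    """
    training_data = []
    for answer, score in machine_scores.items():
        corrected = human_review(answer, score)
        if corrected is not None and corrected != score:
            training_data.append((answer, corrected))  # accumulated data for retraining
    return training_data

# Hypothetical usage: a teacher lowers the score of a keyword-salad answer.
machine_scores = {"wealth caravan china india": 1.0, "A caravan carried wealth to India.": 0.9}
corrections = {"wealth caravan china india": 0.1}
training_data = review_scores(machine_scores, lambda ans, s: corrections.get(ans))
print(training_data)  # [('wealth caravan china india', 0.1)]
```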

"Using artificial intelligence technologies such as natural language processing and further improving subjective intelligent scoring systems will be a very important topic in the field of education in the future." Xiong Deyi said that future AI automatic review systems will definitely become more and more "smart", and the combination of artificial intelligence and education will become closer and closer. (Reporter Chen Xi)