A Review on Human-Computer Interaction and Intelligent Robots

In the field of artificial intelligence, human-computer interaction (HCI) technology and the related intelligent robot technologies are essential and interesting research topics. From the perspectives of software algorithms and hardware systems, these technologies aim to build a natural HCI environment. The purpose of this paper is to provide an overview of HCI and intelligent robots. It highlights the existing technologies for listening, speaking, reading, writing, and other senses, which are widely used in human interaction. Based on these technologies, it introduces some intelligent robot systems and platforms. This paper also forecasts some vital challenges in researching HCI and intelligent robots. The authors hope that this work will help researchers in the field to acquire the necessary information and technologies to conduct more advanced research.


Introduction
Artificial intelligence (AI) technology is a technical science that studies and develops theories, methods, technologies, and application systems for the simulation, extension, and expansion of human intelligence. It has been one of the most popular and fastest-growing technologies in recent years and has already achieved significant success in many areas such as robots, speech recognition, computer vision, and natural language processing. [1][2][3][4] AI is regarded as one of the most valuable technologies, with high potential to achieve many breakthroughs. It attempts to understand the essence of intelligence and produces intelligent machines that can respond in the
This is an Open Access article published by World Scientific Publishing Company. It is distributed under the terms of the Creative Commons Attribution 4.0 (CC BY) License which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
should possess for man-machine interactions, and our review will also include literature in most of the fields in this figure.

Definition and categorization of HCI
HCI, or human-machine interaction (HMI), is an interdisciplinary science that combines computer science, design, behavioral science, AI, and several other subjects. It involves thorough research into the scientific implications and practices of the interfaces between people and computers or intelligent agents. There are two levels of meaning associated with the related research (see Fig. 2). On the primary level, it covers the study and design of new technologies that make computers more useful tools; on the higher level, it covers intelligent technologies that adopt natural ways of interaction between humans and computers, so that computers become more harmonious partners to get along with. The term HCI was first used in 1976, 10 and it was popularized by the book The Psychology of Human-Computer Interaction, published in 1983. 11 In 1992, an HCI curriculum was developed by Hewett and other leading HCI educators to serve the needs of the HCI community. 12 At CES 2008, Bill Gates emphasized the role of the natural user interface and predicted that HCI would change radically in the next few years. Thereafter, HCI researchers expounded the definition of natural HCI by employing different approaches. [13][14][15] As far as we know, the development of HCI has gone through five major stages: the manual stage, the interactive command language stage, the graphical user interface (GUI) stage, the network user interface stage, and the natural HCI stage. As their names imply, we can understand the characteristics of each stage. A situation of tripartite emotion" and "Artificial psychology". 27,28 Based on information science, artificial psychology is the theory and methodology for intelligent machines simulating people's psychological activity, such as emotions, volition, and personality. 
29 Emotion is an adaptive physiological expression that humans produce spontaneously when they are stimulated by the external environment in daily life. Research results from the anatomical and behavioral sciences suggest that emotional activities and expressions are under the control of the human brain. Studies have found that emotion is associated with multiple brain regions, including the prefrontal cortex, hypothalamus, and cingulate cortex, and that the amygdala serves as the center of all emotions. 30,31 Researchers generally use the discrete emotional states model and the dimensional model to construct and understand the emotional space. The discrete emotional states model divides emotions into a variety of discrete states, which can be further divided into several different emotional states (e.g., happiness or disgust). 32 The most common classification scheme uses six emotional states: happiness, sadness, anger, fear, surprise, and disgust. Human emotional states are continuous and dynamic in a natural interaction scene, so the discrete emotional states model is unable to accurately represent the change of human emotions. 33 The dimensional model considers the emotional space as a continuous space composed of different dimensions, which can better characterize and simulate human emotions. 34 There are the two-dimensional valence-arousal model 35 and activation-evaluation model, 36 as well as the three-dimensional pleasure-arousal-dominance model 37 and arousal-valence-stance model. 38 Reference 39 proposed a new academic system called "Enriching Mental Engineering", which aims to deal with the mental system of human beings. It measures and enriches mental richness by employing engineering methods. Reference 40 carried out research on affective computing from the viewpoint of psychology and proposed a mental state transition network model to dynamically detect human emotions. 
After that, these researchers conducted a series of experiments involving basic theories, the construction of emotional data resources, and their applications. [41][42][43][44] Table 1 summarizes the models reviewed above. In addition, there is a large body of literature on the applications of affective computing in the field of HCI and intelligent robots, which will be reviewed in the following sections.
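The contrast between the discrete and the dimensional models can be illustrated with a minimal Python sketch. It assumes a two-dimensional valence-arousal space with each axis scaled to [-1, 1]; the class `EmotionPoint`, the helper functions, and the two example states are hypothetical and are not taken from the reviewed works.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EmotionPoint:
    """A point in a valence-arousal space; both axes are assumed to lie in [-1, 1]."""
    valence: float  # unpleasant (negative) to pleasant (positive)
    arousal: float  # calm (negative) to activated (positive)


def quadrant(p: EmotionPoint) -> str:
    """Coarse, discrete-style description of a continuous point."""
    v = "pleasant" if p.valence >= 0 else "unpleasant"
    a = "activated" if p.arousal >= 0 else "calm"
    return f"{v}/{a}"


def interpolate(start: EmotionPoint, end: EmotionPoint, steps: int) -> List[EmotionPoint]:
    """Linear path between two states, which a fixed label set cannot express."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append(EmotionPoint(
            start.valence + t * (end.valence - start.valence),
            start.arousal + t * (end.arousal - start.arousal),
        ))
    return path


# Hypothetical example states, chosen only for illustration.
calm_content = EmotionPoint(valence=0.6, arousal=-0.4)
agitated = EmotionPoint(valence=-0.7, arousal=0.8)
trajectory = interpolate(calm_content, agitated, steps=5)
labels = [quadrant(p) for p in trajectory]
```

The trajectory changes gradually along both axes, whereas the coarse labels jump from one category to another at a single step, which is the limitation of discrete models noted above.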

Human-Computer Interaction
In this section, we review the relevant literature in the field of HCI from the aspect of the interactional abilities possessed by humans, such as listening, speaking, reading, writing, vision, and other senses. These same abilities are desired in an intelligent robot.

Listening and speaking
Auditory sense is one of the most important senses of the human body. It is used for mutual interaction among humans and its main forms include listening and speaking.
Listening is used to receive voices from the outside world, and speaking is used to express one's own ideas and opinions to the outside world. A robot's listening and speaking abilities aim to imitate the auditory ability of humans in the interaction process, and these two abilities are realized via the spoken dialogue system in intelligent robots. Figure 3 shows the framework of a spoken dialogue system. Generally speaking, the spoken dialogue system comprises five modules: automatic speech recognition (ASR), natural language understanding (NLU), dialogue management (DM), natural language generation (NLG), and automatic speech synthesis (ASS). The primary responsibility of ASR is to transform the continuous time signal of a user's speech into a series of discrete syllable units or words. The primary responsibility of NLU is to analyze the result of the speech recognition process and, via syntactic and semantic analysis, transform the user's dialogue information into a representation that can be utilized by the dialogue system. DM makes a comprehensive analysis based on the result of language understanding, the context of the dialogue, the historical information of the dialogue, etc., to determine the current intention of the user. Thereafter, the system adopts a response or response strategy. Then, NLG organizes the appropriate response statement and converts the system's response into natural language that users can understand. The primary responsibility of ASS is to synthesize the text generated by NLG into the final answering voice and feed it back to the user. Extensive efforts have been made in the field of dialogue systems, which are divided into two categories: acoustic-based and text-based.
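The data flow through the five modules can be sketched as a chain of functions. This is a minimal illustration only: every module below is a hypothetical stub (keyword rules instead of real recognition or understanding, encoded text instead of audio), and the function names simply mirror the module abbreviations used above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DialogueState:
    """Dialogue history kept by the dialogue manager across turns."""
    history: List[str] = field(default_factory=list)


def asr(audio: bytes) -> str:
    """Stub ASR: pretends the audio bytes are already UTF-8 text."""
    return audio.decode("utf-8")


def nlu(text: str) -> Dict[str, str]:
    """Stub NLU: keyword rules stand in for syntactic and semantic analysis."""
    lowered = text.lower()
    if "weather" in lowered:
        return {"intent": "ask_weather"}
    if "hello" in lowered:
        return {"intent": "greet"}
    return {"intent": "unknown"}


def dm(frame: Dict[str, str], state: DialogueState) -> str:
    """Stub DM: records the turn and chooses a system action."""
    state.history.append(frame["intent"])
    actions = {"ask_weather": "inform_weather", "greet": "greet_back"}
    return actions.get(frame["intent"], "ask_repeat")


def nlg(action: str) -> str:
    """Stub NLG: template lookup turns an action into a sentence."""
    templates = {
        "inform_weather": "I cannot check the weather in this sketch.",
        "greet_back": "Hello! How can I help?",
        "ask_repeat": "Sorry, could you repeat that?",
    }
    return templates[action]


def ass(text: str) -> bytes:
    """Stub speech synthesis: returns encoded text in place of a waveform."""
    return text.encode("utf-8")


def respond(audio: bytes, state: DialogueState) -> bytes:
    """One dialogue turn: ASR -> NLU -> DM -> NLG -> ASS."""
    return ass(nlg(dm(nlu(asr(audio)), state)))


state = DialogueState()
reply = respond(b"Hello there", state)
```

Keeping each module behind its own function makes the acoustic-based parts (`asr`, `ass`) and the text-based parts (`nlu`, `dm`, `nlg`) replaceable independently.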
One of the key terminals of the auditory module is ASR, which has changed the way we interact with intelligent agents/systems. The development of ASR benefits from both academic research and industry; companies including Google, Microsoft, IBM, Baidu, Amazon, and iFLYTEK have all developed speech recognition engines. In the traditional solution, hidden Markov models (HMMs) are widely used in speech recognition systems, and most modern general-purpose speech recognition systems are based on HMMs. 45 HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary or short-time stationary signal. Such a signal suits the HMM hypothesis: the states are hidden variables, the speech is the observed value, and the transitions among states follow a Markov process.
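To make the HMM view concrete, the sketch below recovers the most likely hidden-state sequence for a short observation sequence with the Viterbi algorithm, a standard decoding procedure for HMMs. The two states, the quantized observation symbols, and all probabilities are toy assumptions chosen for illustration; they do not come from any reviewed system.

```python
import math
from typing import Dict, List, Sequence


def viterbi(
    observations: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most likely hidden-state path (all probabilities must be > 0)."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back: List[Dict[str, str]] = [{}]
    for obs in observations[1:]:
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans_p[p][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy model: two hypothetical sound units "A" and "B" observed through
# quantized acoustic symbols "lo" and "hi".
states = ["A", "B"]
start_p = {"A": 0.8, "B": 0.2}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.1, "B": 0.9}}
emit_p = {"A": {"lo": 0.9, "hi": 0.1}, "B": {"lo": 0.2, "hi": 0.8}}

decoded = viterbi(["lo", "lo", "hi", "hi"], states, start_p, trans_p, emit_p)
# decoded == ["A", "A", "B", "B"]
```

Log-probabilities are used so that long observation sequences do not underflow; a practical recognizer would also need smoothing for zero probabilities, which this sketch omits.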
Reference 46 introduced the application of the theory of probabilistic functions of a (hidden) Markov chain to ASR for isolated words. Following that, Ref. 47 described a maximum likelihood approach to continuous speech recognition. There are also many other works that focus on probability models and HMMs for ASR. [48][49][50] According to the way the observation probability is described, there are two HMM-based models: the CD-GMM-HMM architecture 51 and the CD-DNN-HMM architecture. 52 References 53 and 54 were committed to solving the problem of phoneme recognition and classification by using neural network models. Reference 55 reviewed the time delay neural network architectures for speech recognition. In the last five years, research on speech recognition has focused on deep neural network-based methods, such as CNN, 56,57 LSTM, 6,58 and RNN. 59,60 ASR is also researched as an end-to-end problem, and many works have shown that end-to-end deep learning-based methods obtain encouraging results. [61][62][63][64]
Apart from the speech content, speech also carries the rich emotions of the speaker. In most natural interactions, we not only need to know the contents of speech but, more importantly, the emotions present in it, which are also an important part of natural HCI research. Therefore, many researchers have been focusing on speech emotion recognition to mine the emotion labels of speech, so that the emotional information can be used by other interactive tasks. [65][66][67]
ASS is the other terminal of the auditory module. Released in 1975, Multichannel Speaking Automaton was considered one of the first ASS systems, whereas the Bell Labs system was one of the first multilingual language-independent systems and made extensive use of natural language processing methods. 
68 According to the design idea, speech synthesis methods are mainly classified into rule-driven methods and data-driven methods 69 (see Fig. 4 for details). The main principle of rule-driven methods is to simulate the physical process of human pronunciation by establishing a series of rules. The resonance peak (formant) synthesis method 70 and the pronunciation simulation-based synthesis method are rule-driven methods.
The data-driven synthesis methods mainly include the concatenative synthesis method, the HMM-based method, and the deep neural network-based method. The concatenative synthesis method synthesizes sounds by identifying and concatenating the units that best match the specified criterion, followed by prosodic modification. 71 The difficulty and deficiency of this kind of approach is that the speech corpus consumes a lot of resources and requires a sophisticated design. The HMM- and STRAIGHT-based method overcame this barrier and is suitable for mobile embedded platforms. [72][73][74] Deep learning networks were first applied in the field of speech recognition, where the recognition rate increased by more than 10%, which greatly attracted the attention of researchers. There are also abundant research achievements in the field of speech synthesis with the use of deep neural networks. [75][76][77][78][79][80][81][82][83] In a conventional neural network-based approach, text analysis and acoustic modeling are processed separately. However, Ref. 75 attempted to integrate them and proposed a novel end-to-end framework for speech synthesis. By combining memory-less modules and stateful recurrent neural networks, unconditional audio generation in the raw acoustic domain was researched in Refs. 75 and 76. Reference 77 introduced WaveNet to generate raw audio waveforms and yielded state-of-the-art performance when applied to speech synthesis. References 78-80 carried out a series of meaningful works in this field based on deep neural networks. References 81 and 82 focused on vocoder-based speech synthesis systems to improve the sound quality and real-time performance of speech synthesis. Some research works also aim to synthesize the speech of a specific type or person. 74,83 Reference 83 introduced an emotional speech synthesizer based on an end-to-end neural model, which could be used to generate speech for given emotion labels. 
Reference 84 used a Variational AutoEncoder (VAE) to synthesize speech and control it in an unsupervised manner. Certain types of speech synthesis tasks, especially the emotional speech synthesis task, are of great significance and value. They affect the content and effect to be expressed, because the effect differs greatly when the same content is expressed with different emotional semantics.
Although the quality of speech synthesis has steadily improved over the past decades, especially with the rapid development of deep neural network technology, the output of speech synthesis systems remains clearly distinguishable from natural human speech. The challenges of emotional speech synthesis and of natural language processing accompanying speech synthesis still urgently need to be addressed.
Another research direction of acoustic-based work related to HCI is Voiceprint Recognition (VPR). In a natural HCI scenario, intelligent robots need to know what the interlocutor says; to be more natural, they also need to learn the identity of the interlocutor, so that they can adjust their way of speaking according to the speaker's personality. There are two application scenarios of VPR: speaker identification and speaker verification. The former is used to determine which of several people spoke a particular utterance, whereas the latter is used to confirm whether an utterance was spoken by a specified person. In 1995, Reynolds successfully applied the Gaussian mixture model (GMM) to the text-independent VPR task for the first time 85 and established the foundational position of GMM in acoustic pattern recognition. 86,87 Traditional acoustic features, including MFCC, PLP, and PNCC, 88 can be used in the VPR task. There are also works that have focused on deep learning and i-vector-based VPR. [89][90][91]
The text-related research between ASR and ASS is at the core of the dialogue system. Many AI companies have launched a series of new information services based on dialogue systems, such as Google's ALLO, Apple's Siri, Microsoft's Cortana, and Baidu's Duer. Dialogue systems can be divided into task-oriented and non-task-oriented systems based on whether the dialogue system is meant to achieve a specific goal (see Fig. 5 for details). 
Also, the above-mentioned dialogue systems are both task-oriented and non-task-oriented.
There are pipeline-based methods and end-to-end methods for task-oriented dialogue systems. The pipeline method includes NLU, dialogue state tracking, policy learning, and NLG. NLU is used for topic recognition, intention mining, and semantic annotation in the dialogue system. Topic recognition and intention mining are usually considered classification tasks, and a number of studies have been published in these fields. [92][93][94][95][96][97] References 7, 98 and 99 focused on semantic annotation (slot filling), which is a sequence labeling problem over words. In recent years, research on dialogue state tracking has mainly focused on deep neural network-based methods. [100][101][102] The Dialog State Tracking Challenge (DSTC), which has been held annually since 2013, has given a strong impetus to the study of dialogue state tracking. Deep reinforcement learning is often used in policy learning, 103,104 and other approaches have also been tried. 105 NLG is used to generate dialogue responses under the guidance of the dialogue strategy, using either generative model-based methods 106,107 or retrieval-based methods. 108,109 Other works have researched affective DM, which is one of the cores of the dialogue system. Unlike pipeline methods, end-to-end methods treat dialogue system learning as the problem of learning a mapping from dialogue histories to system responses, and apply an encoder-decoder model to train the whole system. 113,114
Non-task-oriented dialogue systems are often called chatbots, whose main purpose is to chat with people in an open domain, and there are rule-based methods, retrieval-based methods, and generation-based methods. In fact, we can think of it as the joint modeling of all modules in the pipeline-based methods. In recent years, research in this field has mainly concentrated on generation models based on deep neural networks. 
Referring to the Seq2Seq model of machine translation, multiple end-to-end response systems based on deep neural network models emerged in 2015. [115][116][117] Thereafter, the attention mechanism was introduced to generate context-sensitive dialogue responses. 118 In addition, research in this field also includes deep reinforcement learning-based dialogue generation, 119 dialogue generation models based on VAE and CVAE, 120,121 and dialogue generation based on GAN. 122,123
Affective computing and dialogue systems are two emerging and interesting research directions in the field of AI. Many scholars have conducted a lot of research on each of these two aspects; however, the two lines of research have been largely independent of each other. With the gradual maturing of dialogue systems and the deepening of affective computing, some scholars have begun to explore a new cross-research topic, that is, how to integrate emotion into the dialogue system to build an emotional dialogue system. 124 Reference 125 combined affective computing theory with the spoken language dialogue system and proposed using the spoken language dialogue system as the carrier for integrating multi-modal emotion recognition, effective emotional interaction, and the emotion generation and expression of intelligent robots. The generation of emotional dialogue responses is mainly achieved by learning emotional labels. 22,126
The question answering system, which focuses more on factual questions, can be regarded as a special case of the dialogue system, and it can answer the questions posed by humans in more accurate and concise natural language. There are also plenty of good works in this field. Reference 127 proposed a distantly supervised open-domain question answering (DS-QA) system, which retrieves the relevant text from Wikipedia and extracts the answer by reading comprehension. 
Reference 128 proposed a denoising DS-QA, which contains a paragraph selector and a paragraph reader to make full use of all informative paragraphs and alleviate the wrong labeling problem in DS-QA. Reference 129 proposed a method of answer extraction for long documents, which separated answer generation in DS-QA into selecting a target paragraph in the document and extracting the correct answer from the target paragraph by reading comprehension. Reference 130 proposed Question Condensing Networks (QCN) to utilize the subject-body relationship of community questions.
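The two-stage structure described above (first select a paragraph, then extract the answer from it) can be sketched as follows. This is a deliberately simplified stand-in: plain word overlap replaces the learned paragraph selector and the neural reader of the cited systems, the "answer" is a whole sentence rather than a span, and all function names are hypothetical.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap(question: str, passage: str) -> int:
    """Number of distinct question words that also occur in the passage."""
    return len(set(tokenize(question)) & set(tokenize(passage)))


def select_paragraph(question: str, paragraphs: List[str]) -> str:
    """Stage 1: pick the paragraph that shares the most words with the question."""
    return max(paragraphs, key=lambda p: overlap(question, p))


def read_answer(question: str, paragraph: str) -> str:
    """Stage 2: pick the best-matching sentence inside the selected paragraph."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    return max(sentences, key=lambda s: overlap(question, s))


paragraphs = [
    "Listening is used to receive voices. Speaking is used to express ideas.",
    "The term HCI was first used in 1976. It was popularized by a book published in 1983.",
]
question = "When was the term HCI first used?"
answer = read_answer(question, select_paragraph(question, paragraphs))
# answer == "The term HCI was first used in 1976."
```

Separating the two stages mirrors the motivation given above: a noisy retrieval step can return several paragraphs, and only the informative ones should be passed to the reader.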

Reading & writing
Another form of mutual human interaction involves written characters, such as reading a book or writing a letter. People express their thoughts and feelings (love, for example) by using characters, and readers can understand the meaning and thoughts carried by the characters by reading them. Endowing intelligent robots with the abilities of reading and writing still falls within the category of natural language processing, whose purposes are to enable robots to read human characters and understand human thoughts, and to express their own thoughts and ideas by generating a specific character sequence. In the following, several tasks are introduced to reflect robots' abilities of reading and writing (see Table 2 for summaries), including part-of-speech (POS) tagging, named entity recognition (NER), text classification, text sentiment analysis, machine translation, machine reading comprehension (MRC), machine writing, etc.
To handle the problems of machine reading and writing, the first task to be solved is the representation of text in the computer. In some languages, such as Chinese, word segmentation is also needed before word representation. In traditional statistical natural language processing tasks, text representation is mostly based on the discrete feature vector method, which relies heavily on handcrafted feature engineering (e.g., the vector space model (VSM)). 131 Feature engineering is time consuming and incomplete, and it also suffers from dimensional explosion. With the rise of deep learning methods and computing hardware, deep learning methods have produced state-of-the-art results in many domains, ranging from computer vision to speech processing. References 132-134 put forward neural network-based methods to embed words into low-dimensional distributed vectors known as word embeddings. Word embedding is also a statistical method, which follows the distributional hypothesis that words occurring in similar contexts tend to have similar meanings. Thus, word embeddings can be thought of as containing syntactic and semantic information, and their major advantage is that they can capture the similarity between words by measuring the similarity between vectors. Distributed representations have been the basis of deep learning-based NLP tasks and have helped achieve encouraging results in a wide range of NLP tasks. [135][136][137]
POS tagging is the process of marking up a word in a text with a particular part of speech based on both its definition and its context. The difficulty of this problem is that the same word will show different parts of speech in different contexts. 
Rule-based methods and statistics-based methods are the main approaches in traditional POS tagging, and most machine learning methods have achieved accuracy above 95%, whereas recent research on deep learning-based methods has achieved even better accuracy. Reference 137 proposed a deep neural network that learns the character-level representation of words and associates them with
NER, which is one of the most important bases of NLP tasks, refers to the task of recognizing entities with a specific meaning in the text, mainly including person names, place names, institution names, and proper nouns. It includes two subtasks: entity boundary recognition and entity type determination. The conditional random field (CRF) is a traditional discriminative probability model recognized as a good algorithm for solving NER problems. Many methods that fuse CRF and neural networks have emerged in this field. Reference 142 is one of the representative works in which neural networks were used for NER, with CRF fused into a CNN. On the basis of similar ideas, Refs. 143-145 combined RNN and CRF to deal with NER. Novel architectures have also been proposed for combining word embeddings with character-level representations, in which the attention mechanism was introduced to dynamically extract information from both word- and character-level components. 146,147 Generally speaking, deep learning relies on a large number of annotated samples as training data. To overcome the limitation caused by the need for massive annotated data, many works have studied NER methods based on a small amount of annotated data, such as transfer learning, 148 semi-supervised methods, 149 and active learning. 150
Text classification is the technology that automatically marks text with labels according to a certain or standard classification system. 
Research on text classification has gone through several stages, including keyword matching-based methods, rule-based knowledge engineering, statistical machine learning-based methods (e.g., SVM, KNN), and deep learning-based methods. Reference 151 is an excellent recent work that provides an overview of the state-of-the-art elements of text classification. Reference 152 explored a simple but efficient baseline for text classification, fastText, which suggests that some tasks can be solved by extremely simple models. Reference 153 studied convolutional neural networks for text sequences and carried out experiments on sentence-level classification, achieving compelling results. Following this route, character-level convolutional neural networks were studied for text classification. 154 Reference 155 divided the text into three levels (word, sentence, and document) and constructed a hierarchical model for long text classification by using a hierarchical attention mechanism. Reference 156 proposed deep average networks (DAN) and attentional DAN for conversational topic classification in the evaluation of conversational bots. Lai et al. introduced recurrent convolutional neural networks for this task, applying a recurrent structure to capture the contextual information and a convolutional neural network to construct the representation of text. 157 Much of the success that transfer learning has achieved in computer vision cannot yet be fully transplanted into NLP, and text categorization still requires task-specific modifications and training from scratch. Howard et al. proposed an effective transfer learning method for text classification, known as universal language model fine-tuning, and introduced some key techniques for model fine-tuning. 158
Also known as opinion mining and inclination analysis, text sentiment analysis is the process of analyzing the emotions present in the text. 
Reference 159 gave a macroscopic introduction to the field of sentiment analysis, such as research objects and venues. Reference 160 summarized several major deep learning models and comprehensively introduced their applications to sentiment analysis. Additionally, it reviewed research at three levels of granularity for sentiment analysis and their subtasks. Reference 161 researched the usage of autoencoders in modeling textual data and sentiment analysis, and tried to address the problems of scalability with the high dimensionality of the vocabulary and of task-irrelevant words by introducing a loss function for autoencoders. There are also many other leading research works that focus on text sentiment analysis and its applications. [162][163][164][165][166][167][168][169][170] Although sentiment analysis is often treated as a classification problem, it is actually a suitcase research problem that requires dealing with many NLP tasks. 171 Reference 172 proposed a novel tagging scheme to jointly extract entities and relations, which can be seen as subtasks of sentiment analysis, by using several end-to-end models.
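When sentiment analysis or topic classification is treated as a classification problem, a very simple model can already serve as a baseline: average the word vectors of a text and feed the result to a linear layer with a softmax. The sketch below illustrates that averaging idea only. The two-dimensional vectors and class weights are hand-set, hypothetical values rather than trained parameters, and the code is not the implementation of fastText or DAN.

```python
import math
from typing import Dict, List

# Hypothetical 2-dimensional word vectors, hand-set for illustration only.
EMBEDDINGS: Dict[str, List[float]] = {
    "great": [0.9, 0.1],
    "good": [0.8, 0.2],
    "awful": [-0.9, 0.1],
    "bad": [-0.8, 0.2],
    "movie": [0.0, 0.5],
}
# Hypothetical linear layer: one weight vector per class.
CLASS_WEIGHTS: Dict[str, List[float]] = {
    "positive": [2.0, 0.0],
    "negative": [-2.0, 0.0],
}


def average_embedding(tokens: List[str]) -> List[float]:
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(2)]


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize raw scores into probabilities (shifted for numerical stability)."""
    top = max(scores.values())
    exps = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def classify(text: str) -> Dict[str, float]:
    """Bag-of-words classifier: averaged embedding -> linear scores -> softmax."""
    features = average_embedding(text.lower().split())
    scores = {
        label: sum(w * x for w, x in zip(weights, features))
        for label, weights in CLASS_WEIGHTS.items()
    }
    return softmax(scores)


probs = classify("great movie")
```

Because word order is discarded by the average, such a baseline cannot distinguish "not good" from "good", which is one reason the convolutional, recurrent, and attention-based models reviewed above were explored.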
Machine translation is a cross-language literacy task that automatically translates the source language into the target language. Traditional machine translation consists of rule-based methods built on experience and statistical machine translation. In recent years, research has mainly focused on neural machine translation (NMT). Reference 173 summarized the successful usage of neural networks in machine translation systems. Cho et al. proposed a novel neural network model called the RNN encoder-decoder for statistical machine translation and showed that the proposed model is capable of learning semantically and syntactically meaningful representations of linguistic phrases. 174 This research also involved an empirical evaluation of a novel hidden gated unit. Reference 175 presented a general end-to-end approach to sequence learning for machine translation and suggested that NMT can achieve results similar to the traditional techniques. Reference 176 proposed the attention mechanism, which achieved state-of-the-art results for statistical machine translation. Following this research, Ref. 177 explored attention-based NMT architectures, including a global approach and a local one, to improve NMT performance and achieved remarkable results. Reference 178 introduced a sequence-to-sequence architecture based entirely on CNN, whereas such architectures had usually been deployed via RNN, and achieved better accuracy and time efficiency. Different from the previous encoder-decoder architectures, Ref. 179 proposed a neural network architecture that only used the attention mechanism, and the experimental results on the machine translation task showed that the architecture performed well in both quality and training speed. The Google team presented Google's NMT system to address some relevant problems such as robustness, accuracy, and speed. 180 Thereafter, the team tried to solve the problem of multilingual translation by using a single NMT model. 181 GANs were also applied to NMT, and Ref. 
182 introduced a conditional sequence GAN, in which the generator aims to translate the sentences while the discriminator tries to discriminate the outputs of the generator from sentences translated by a human being. Reference 183 proposed a novel model that produces translation outputs in parallel instead of one after another, so as to reduce the latency during inference.
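The attention mechanism mentioned above can be illustrated with scaled dot-product attention, one common formulation in which a query is compared with every key and the resulting weights are used to average the values. The sketch below is a plain-Python illustration under that assumption; it is not claimed to be the exact formulation of any particular reference, and the toy vectors are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    """Normalize scores into weights that sum to one."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query: Vector, keys: List[Vector], values: List[Vector]) -> Tuple[Vector, List[float]]:
    """Return the context vector and the attention weights for one query."""
    scale = math.sqrt(len(query))
    # Similarity of the query with each key (e.g., each source position).
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Context vector: weighted average of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


# Toy example: the query is aligned with the first key.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
context, weights = attention([5.0, 0.0], keys, values)
```

In a translation model, the keys and values would typically come from encoder states of the source sentence and the query from the decoder state, so the weights indicate which source positions the decoder attends to at each step.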
MRC, also researched as open-domain QA, is the ability of intelligent robots to comprehend a given context and answer questions related to that context. Information retrieval can also be considered an MRC issue. 184 Many MRC datasets have been released to train and evaluate MRC models, such as Machine Comprehension Test, Children's Book Test, CNN/Daily Mail, the Stanford Question Answering Dataset (SQuAD), and DuReader. [185][186][187][188][189][190][191] Traditional research on MRC focused on pipeline-based methods consisting of several NLP subtasks. With the popularity of neural network models in NLP tasks, a series of works have focused on end-to-end neural networks for the MRC task, in which the answers are chosen from candidates. 186,192 A novel end-to-end neural architecture using match-LSTM 193 and an answer pointer was then proposed for SQuAD, which provides no candidate answers and was thought to be difficult to deal with. 194 Thereafter, R-NET, introduced by MSRA, addressed the problem in four steps, following match-LSTM. 195 Reference 196 presented the bi-directional attention flow (BiDAF) network, in which the context is represented at different granularities and bi-directional attention flow is used to locate the key context. Reference 197 improved the performance of MRC from the viewpoints of the objective function and the network model, using a mixed objective that combines cross-entropy loss with self-critical policy learning. This research also proposed a DCN that was improved by a deep residual coattention encoder. Reference 198 summarized the advantages and disadvantages of match-LSTM, R-NET, and other previous models, and made significant improvements through the reattention mechanism and dynamic critical reinforcement learning. Another excellent work is Ref. 199, which proposed a novel model known as the attention-over-attention reader to address cloze-style MRC and achieved state-of-the-art performance on many public datasets. 
Transfer learning was also introduced into MRC, and a two-stage synthesis network was presented by Ref. 200 to answer questions in one domain with a model trained on another domain. Reference 201 proposed a novel dynamic fusion network model for MRC, in which the attention strategy was chosen flexibly according to the question type. A novel architecture called QANet, consisting of local convolution and global self-attention, was proposed to improve the speed of training and inference. 8 The experimental results showed that the proposed model achieved a large increase in speed with accuracy equivalent to that of recurrent models. Reference 202 extended paragraph-level MRC to the document level, where whole documents are given as context, and introduced a novel objective function to produce a global answer. Reference 203 proposed a meaningful assumption that if MRC models could combine textual evidence from multiple contexts, then the scope of these models would be extended. Based on this novel task, the literature produced datasets and validated some methods.
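When no candidate answers are given, as in SQuAD, an answer pointer selects the answer as a span of the context. The following is a minimal Python sketch of only that final selection step, assuming a model has already produced per-token start and end probabilities. The function name `best_span`, the `max_len` cap, and the toy probabilities are illustrative assumptions and are not taken from the cited works.

```python
from typing import List, Tuple


def best_span(p_start: List[float], p_end: List[float],
              max_len: int = 15) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing p_start[i] * p_end[j], i <= j."""
    if not p_start or len(p_start) != len(p_end):
        raise ValueError("p_start and p_end must be non-empty and equally long")
    best = (0, 0, -1.0)
    for i, ps in enumerate(p_start):
        # Only spans that start at i and contain at most max_len tokens.
        for j in range(i, min(i + max_len, len(p_end))):
            score = ps * p_end[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Toy context and hypothetical model outputs (one probability per token).
context = "the stanford question answering dataset was released in 2016".split()
p_start = [0.05, 0.60, 0.10, 0.05, 0.05, 0.05, 0.03, 0.03, 0.04]
p_end = [0.02, 0.05, 0.08, 0.10, 0.60, 0.05, 0.04, 0.03, 0.03]

start, end, score = best_span(p_start, p_end)
answer = " ".join(context[start:end + 1])
print(answer)
```

The exhaustive search is quadratic in the worst case but is bounded by `max_len`, which is a common practical simplification rather than a property of any particular model.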
By machine writing, we mean generating text, not writing calligraphy by robots. In essence, all of the above-mentioned methods such as the dialogue system, QA system, and machine translation belong to the category of machine writing, and the difference lies in their application premises and application scenarios. Machine writing will be one of the most important ways for intelligent robots to express themselves and also one of the most important means of natural HCI. There are many other forms of machine writing tasks such as text summarization, news writing, image description, etc. Reference 204 first proposed the technology of text summarization, which refers to analyzing background documents, summarizing their main points, and extracting or generating summaries that are short relative to the original documents. Traditional machine learning-based text summarization mainly adopted the extraction method, in which the summary was produced in two steps: sorting sentences by importance [205][206][207] and arranging the selected sentences. [208][209][210] To make the generated summaries more compact, sentence compression and sentence fusion are commonly employed in text summarization systems. Sentence compression can be seen as sentence-level text summarization, through which a long sentence is shortened, and several approaches were employed in this direction. 211,212 Sentence fusion combines sentences with overlapping content into a single one so as to reduce repetition in the generated summary. 213 In generative text summarization methods, the sentences in the abstract are not extracted and rewritten from the original text, but are generated based on semantic information. 9,214 Similar research works also include methods based on various deep neural network models, such as RNN, CNN, GAN, etc., whose hidden layers can be regarded as abstract semantic information. [215][216][217][218]
The researchers also developed some interesting applications based on the technique, such as academic summaries 219,220 and student course feedback summarization. 221 Another area of research in machine writing is automatic text generation based on data, which has been widely used in many fields, such as weather reports, news report generation, and biography domains. [222][223][224] In recent years, many Chinese scholars have used text generation technology to create Chinese poems on specific subjects or emotions and achieved prominent results. [225][226][227] Another data-based machine writing field is image caption generation, whose task is to generate text describing the content of a given image. This task has led to a series of works on the joint modeling of image semantic annotation and NLG. [228][229][230] Automatic music generation is also an interesting research avenue, which is related to artistic creation. A large number of novel deep learning methods have been proposed to address this challenge. 231-233
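The two-step extraction method described above (sorting sentences by importance, then arranging the selected sentences) can be illustrated with a minimal Python sketch. It assumes a simple word-frequency importance score and document-order arrangement. The stopword list, the scoring rule, and the names `split_sentences`, `tokenize`, and `summarize` are illustrative assumptions, not the methods of the cited works.

```python
import re
from collections import Counter
from typing import List

# A tiny illustrative stopword list (assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "was", "also", "and", "of", "to", "in", "is"}


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence: str) -> List[str]:
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]


def summarize(text: str, k: int = 2) -> List[str]:
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence: str) -> float:
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Step 1: sort sentence indices by importance (highest score first).
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Step 2: arrange the k selected sentences in their original order.
    return [sentences[i] for i in sorted(ranked[:k])]


doc = ("Robots summarize text. Text summarization helps robots read text "
       "quickly. The weather was pleasant yesterday. "
       "Robots also write text summaries.")
print(summarize(doc, k=2))
```

Restoring document order in the second step is the simplest arrangement strategy. The cited works on sentence arrangement use more elaborate ordering models.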

Visual sense
Vision is the most important sense in human beings, and more than 80% of the information received from the outside world is obtained through vision. Machine vision, or computer vision, is a science that studies how to make a machine "see" like humans. This means using a camera in place of the human eye to obtain images and a computer in place of the human brain to process them, so that machines can gain a high-level understanding of images and simulate the functions of the human visual system. In the process of interpersonal communication, human beings recognize and judge the other party's identity, expression, physical behavior, etc., through vision, and consider this as the basis of interaction. In the following sections, we will briefly summarize these contents (see Table 3 for an outline).
Table 3. Outline of the visual sense research reviewed in this section.

Category                           Content                                           Refs.
...                                ...                                               238
                                   Facial feature point detection                    239
Face recognition                   Shallow representations-based methods             240, 241
                                   Deep learning-based methods                       242
Iris recognition                   Machine learning-based methods                    243
                                   Long-range iris recognition                       244
Fingerprint recognition            Fingerprint recognition for young children        245
                                   Fingerprint recognition at crime scenes           246
                                   Review works                                      247, 248
Others                             Age and gender recognition                        249, 250
Facial expression recognition      CNN-based methods                                 252, 253
                                   Multi-modality feature fusion-based method        254
                                   Expression recognition based on static images     255
                                   Micro-expression recognition                      256-258
Facial expressions generation      Interactive GAN-based method                      260
                                   3D facial expression generation                   261
                                   Humanoid robot expression generation              23
                                   Three-dimensional speaking characters             262
                                   Expression generation from natural description    264, 265
Posture or gestures recognition    Driving posture recognition                       266
                                   Weighted fusion method for gesture recognition    267
                                   Posture recognition for hazard prevention         268
                                   Emotional body gesture recognition                269
                                   Gesture recognition in video                      271
                                   Hand gesture recognition                          272

Identification is a technique used in computer vision to determine one's identity and characteristics. The most common identification methods are biometrics-based methods, such as face recognition, iris recognition, and fingerprint recognition. The research of biometric recognition has a long history, and its development can be divided into four stages. 234 Although biometric identification based on traditional methods has achieved satisfactory results, with the rise of deep learning, deep neural network methods have been introduced into the field to seek better recognition performance. The first step of face recognition is face detection, which aims to determine whether faces exist in a given image. If they do, the location and size of the faces are also determined. A number of studies have focused on this area. 235,236 Reference 237 provided a review of face detection for low-quality images. Face alignment is the process of marking out the important organs, such as eyes, nose, and mouth, in the image with feature points, and Refs. 238 [...] 245. An automated latent fingerprint recognition algorithm was proposed for the comparison of latents found at crime scenes. 246 References 247 and 248 are the latest reviews conducted in this field. Besides, there are other works related to identification, which focused on age and gender recognition. 249,250
Facial expression recognition refers to recognizing the expression states contained in a given static image or dynamic video sequence, so as to determine the psychological emotions of the identified subject. 251 Reference 252 proposed a neural network-based expression recognition method to improve the generalizability of the model, which consisted of two convolutional layers, each followed by max pooling, and then four inception layers.
Reference 253 proposed another CNN-based expression recognition scheme, which was combined with specific image pre-processing steps to address the problems of limited training samples and the uncertainty of sampling during training. A multi-modality feature fusion-based framework was proposed for facial expression recognition in videos to improve the system's robustness. 254 Expression recognition based on static images was also studied: Ref. 255 proposed a novel method to train an expression recognition network on static images. Micro-expression recognition, which is regarded as a harder problem, has also been studied in many works. [256][257][258] Corresponding to facial expression recognition is the automatic generation of facial expressions, which produces various emotional expressions from a given facial image or a specific text. This research is considered important as it can be seen as feedback in the HMI. In Ref. 259, a chaotic feature-extracting associative memory was proposed to simulate the human brain in generating facial expressions. Reference 260 proposed an interactive GAN-based method for generating facial behaviors in a dyadic interaction scene. A novel point clouds-based method was introduced for 3D facial expression generation in Ref. 261. Additionally, there are research works in this field combined with robotics and bionics to generate or imitate expressions on robots or virtual faces. References 23 and 262 researched automatic facial expression learning methods for a humanoid robot to generate vivid expressions and increase the interactivity of the humanoid robot. Reference 263 developed free software and an API that can generate dynamic facial expressions for three-dimensional speaking characters. Reference 264 investigated a novel problem of generating images from natural descriptions and proposed a CAVE-based method to address this problem.
A similar research work was conducted in Ref. 265.
The detection and recognition of posture, gestures, and eye movements are of great significance in the interactive process, and a lot of research work remains to be done in this area. A CNN-based method for driving posture recognition was introduced in Ref. 266 to detect the driver's fatigue and inattention. Reference 267 researched the problem of gesture recognition with a weighted fusion method of D-S evidence theory by fusing Kinect and surface electromyogram (EMG) signals. An ergonomic posture recognition technique was discussed in Ref. 268, which aimed to prevent construction hazards by using an ordinary 2D camera. Reference 269 defined a framework for automatic emotional body gesture recognition and reviewed the related research results in this field. Besides, multi-modal approaches for improved emotion recognition were also discussed in both this work and Ref. 270. An end-to-end architecture incorporating temporal convolutions and bidirectional recurrence was proposed in Ref. 271 for gesture recognition in videos. A novel approach and a real-time system for static hand gesture recognition were introduced in Ref. 272, which could vastly improve the accuracy and speed of recognition. Research works on vision-based gesture recognition were reviewed by Ref. 273, which also included a discussion of the technical aspects of the whole pipeline and the challenges in this field. It is very useful to recognize and track eye movements during HCI, as they can be used to detect the direction of human attention. Reference 274 introduced an approach integrating eye movement recognition and tracking, and application scenarios were designed to evaluate the proposed method. A robust online saccade recognition algorithm was proposed, which involved the integration of electrooculography (EOG) and video signals. The experimental results showed that the multimodal fusion technology was helpful in improving the accuracy of eye movement recognition. 275
Optical character recognition (OCR) is the process of converting typewritten or handwritten characters present in an image into a format that the computer can identify and edit, which is one of the most important ways of interaction. Reference 276 surveyed OCR systems based on soft computing methods for different languages, such as English, French, German, and Latin, whereas the methods of feature extraction for OCR were summarized in Ref. 277. A method for improving the OCR performance on low-quality images was studied in Ref. 278. Reference 279 proposed a CNN-based method to learn the features of Chinese characters. It then addressed the problem of recognizing Chinese characters in the completely automated public Turing test to tell computers and humans apart (CAPTCHA), which is increasingly used in many web applications for security reasons. Another aspect of OCR is to train the computer to automatically write characters or generate images containing characters, which is also challenging and interesting. Reference 280 proposed an RNN-based framework to train a discriminative model and a generative model for the recognition and generation of Chinese characters, respectively. Reference 281 proposed a novel RNN-based model to overcome the challenge of handwritten character generation.
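As an illustration of the D-S evidence theory mentioned above for fusing Kinect and surface EMG signals in gesture recognition, the following Python sketch implements the basic Dempster rule of combination for two sources. It is a sketch under stated assumptions: the weighting scheme of Ref. 267 is not reproduced, and the gesture labels, mass values, and the function name `combine` are hypothetical.

```python
from itertools import product
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def combine(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule: multiply masses, keep non-empty intersections,
    and renormalize by the non-conflicting mass."""
    combined: Mass = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory hypotheses
    if 1.0 - conflict < 1e-12:
        raise ValueError("sources are in total conflict")
    return {h: w / (1.0 - conflict) for h, w in combined.items()}


# Hypothetical basic probability assignments from two sensors.
FIST = frozenset({"fist"})
PALM = frozenset({"palm"})
EITHER = frozenset({"fist", "palm"})  # uncertainty: "could be either"

kinect = {FIST: 0.6, PALM: 0.1, EITHER: 0.3}
semg = {FIST: 0.5, PALM: 0.2, EITHER: 0.3}

fused = combine(kinect, semg)
for hypothesis, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(sorted(hypothesis), round(mass, 3))
```

In this toy example both sensors lean toward the same gesture, so the fused belief in it is higher than either sensor reports alone, which is the motivation for evidence-level fusion.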

Other senses
Reference 282 discussed the influences of physiological signals on cognition, and there are a large number of signal sources that can be detected and processed by special equipment and, later, used for interaction, for example the representation and detection of human emotional states. According to the source of the emotions, affective computing can be divided into two categories: external nonphysical performance-based affective computing (e.g., facial expressions, text, body gestures, and speech) and inherent physiological information-based affective computing, such as electroencephalography (EEG), electrocardiogram (ECG), and EMG.
Reference 283 proposed an EEG-based method for the recognition of human intentions, which can be used for brain-computer interfaces, by employing both cascade and parallel convolutional recurrent neural network models. Reference 284 explored the feasibility of using wireless EEG signals to assess memory workload levels in special tasks, and the experimental results indicated that the proposed approach can be used for mental workload identification when humans are engaged in cognitive activities. EEG signals are also applied to emotion detection tasks. 285,286 Both EEG signals and facial expressions were used for continuous emotion detection in Ref. 287, and the relationship between them was analyzed. More literature on emotion recognition based on physiological signals is presented in Ref. 288, which is a newly published review in this field. Reference 289 summarized the application of deep learning and reinforcement learning to several different biological datasets and discussed future development perspectives. In Ref. 290, sleep apnea features were extracted from capacitively coupled ECG signals to monitor sleep apnea. Reference 291 researched ECG used for healthcare monitoring by employing residential wireless sensor networks.
The EMG signal is also widely used in man-machine control systems. An upper limb rehabilitation training system combining portable accelerometers and EMG was designed and developed for children with cerebral palsy to capture their functional movements and address the problems of in-home training. 292 In Ref. 293, an EMG- and AdaBoost-based movement recognition method was introduced into a robotic hand-eye system as a control strategy for grasping and manipulation.
In Ref. 294, high-density surface EMG signals were decomposed from the forearm muscles in the non-isometric wrist motor tasks of normally limbed and limb-deficient individuals, which could be used for prosthesis control with the help of the decoded neural information. Reference 295 proposed an optimal control framework based on EMG for the design of physical human-robot interaction in rehabilitation applications. In Ref. 296, EMG signals were collected in a natural manner by introducing a physical haptic feedback mechanism, and an interface was designed for transferring human adaptive impedance extracted from the EMG signals. An algorithmic framework was proposed in Ref. 297 for EMG-based gesture recognition, and a prototype system along with an application program was developed to realize gesture-based real-time interaction.
An EOG-based eye-movement tracking system was proposed for HCI in Ref. 298. Reference 299 developed a real-time eye-writing recognition system based on EOG, with which users can write 29 predefined symbolic patterns (26 lower case alphabet characters and 3 functional input patterns representing the space, backspace, and enter keys) with their volitional eye movements. The blood volume pulse (BVP) signal is a weak physiological signal formed by the periodic contraction and expansion of the heart, which leads to periodic changes in the blood volume of the face. Therefore, the BVP signal is often used to detect the heart rate and breathing rate. 300,301 Galvanic skin response (GSR) was used in Refs. 302 and 303 to design GSR-based sensors for the detection of stress states and the prediction of performance under stressful conditions. GSR applied to sentiment classification was also studied. 304 Tactile ability is essential for intelligent robots to interact with humans in an HCI environment. Electronic devices with tactile ability were designed in Refs. 305 and 306 to address this challenge. Tactile sensors were also used for object recognition in Refs. 307 and 308. Additionally, methods and technologies for the implementation of large-scale robot tactile sensors were researched in Ref. 309. In addition, WiFi can also be deployed in HCI systems to implement functions such as motion detection, activity recognition, and sleep monitoring. [310][311][312]
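Because the BVP signal described above is periodic with the heartbeat, the heart rate can be estimated from the spacing of its peaks. The following Python sketch illustrates this on a clean synthetic signal. The sampling rate, the sinusoidal pulse model, and the function name `heart_rate_bpm` are illustrative assumptions. Real BVP signals are weak and noisy and would need filtering before peak detection, which is not shown here.

```python
import math
from typing import List


def heart_rate_bpm(signal: List[float], fs: float) -> float:
    """Estimate heart rate from the mean spacing between local maxima.

    signal: evenly sampled BVP values; fs: sampling rate in Hz.
    """
    mean = sum(signal) / len(signal)
    # A peak is a local maximum that lies above the signal mean.
    peaks = [i for i in range(1, len(signal) - 1)
             if signal[i] > mean and signal[i - 1] < signal[i] >= signal[i + 1]]
    if len(peaks) < 2:
        raise ValueError("need at least two peaks to estimate a rate")
    intervals = [b - a for a, b in zip(peaks, peaks[1:])]
    mean_interval = sum(intervals) / len(intervals)  # in samples per beat
    return 60.0 * fs / mean_interval


# Synthetic 10 s recording at 30 samples/s with a 1.2 Hz pulse (assumed values).
fs = 30.0
pulse_hz = 1.2
bvp = [math.sin(2 * math.pi * pulse_hz * n / fs) for n in range(300)]

bpm = heart_rate_bpm(bvp, fs)
print(round(bpm, 1))
```

The same peak-spacing idea applies to the slower breathing component of the signal, provided the two frequency bands are separated first.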

Intelligent Robots
Intelligent robots are an updated version of traditional robots in both software and hardware systems. Through the software upgrade, intelligent robots have higher-level brains, which bestow them with a comprehensive improvement in perception, reasoning, and decision-making. Through the hardware upgrade, intelligent robots have more capable bodies, so that they can better imitate human behaviors while completing delicate and laborious work. With both improvements combined, intelligent robots can execute human commands or think independently to complete certain tasks, and learn and improve autonomously. They can also interact with human beings in a friendly manner.
Motion elements are the centralized embodiment of robot positioning, obstacle recognition, navigation, and other functions in an unstructured environment, thereby reflecting the autonomous ability of an intelligent robot to adapt to complex environments. Reference 313 developed RHex, a simple and highly mobile hexapod robot, which can traverse solid, broken, and obstructed ground without any terrain sensing or active control. Boston Dynamics has developed two four-legged robots, a rough-terrain robot 314 and a small four-legged robot 315, that mimic the mobility, autonomy, and speed of living creatures. The robots can move flexibly over various outdoor terrains, such as steep, rutted, rocky, wet, muddy, and snowy ground. ATLAS is a two-legged humanoid robot developed by Boston Dynamics, 316 which realizes the dynamic planning, control, and state estimation of a two-legged robot. The robot can operate reliably in complex environments and can regain its balance even after slipping on snow, or get up if it is pushed down deliberately.
Another important element of an intelligent robot is the control element, which can perceive a human's control intention in various ways and execute relevant actions according to commands. It is often used to assist the control of prostheses for patients with paralysis. Spinal cord injuries, brain stem stroke, and other diseases make it impossible for patients with paralysis to control their limbs autonomously. A prosthesis with controllable capability can detect and execute the patient's intention via signal sources, such as neural interfaces and physiological signals, so as to realize the patient's control of the prosthesis. Reference 317 demonstrated the ability of people with chronic tetraplegia to perform three-dimensional reaching and grasping motions by using a robotic arm controlled through a neural interface. This work also showed that it is possible for tetraplegic patients to regain useful multi-dimensional neural control of complex devices directly, even years after central nervous system injury. References 318 and 319 studied the control of a robotic arm by modeling multi-channel EEG signals and the motion state together. Using pneumatic artificial muscles and inflatable sleeves, Ref. 320 developed a robotic arm with seven degrees of freedom (DOFs), which combined elements and positive qualities of rigid and soft robotics. Brain-computer interfaces (BCIs) were employed in Ref. 321 to stimulate muscles and control a robotic arm for reaching and grasping movements in people with tetraplegia.
The above-mentioned robots generally have solid bodies, complex structures, and limited DOFs, whereas soft robots can achieve continuous deformation and, therefore, have infinite DOFs. References 322 and 323 conducted research on soft robots. The development of 3D printing technology and materials science has greatly benefited research on soft robots, which has shown significant progress and achieved tasks such as grasping and human-robot collaboration.
The interactive elements of intelligent robots are studied and practiced by a large number of researchers, and Sec. 4 introduces many research works and technologies focused on these interactions. In fact, scientists have developed several intelligent platforms and robots with a rudimentary ability of natural HCI. For example, the MIT affective computing research team launched the Tega and Jibo platforms successively in 2016, which have certain emotional computing and perception abilities. In 2014, Microsoft launched the interactive platform Xiaobing, which can understand the emotional context to a certain extent. In 2015, the Turing robot team released Turing OS, an AI robot operating system with a multimodal interaction mode. Turing OS simulates human-to-human interaction, giving the robot a wealth of input and output modes, including text, voice, action, environment, etc. IBM teamed up with Japan's SoftBank in 2016 to develop Pepper, an "emotional" robot that responds to parts of the spoken language in limited settings. ABC Robot is a leading multi-modal human-computer interaction platform of Baidu. The platform can realize multimodal HCI such as speech recognition, semantic understanding, face recognition, gesture recognition, and multi-sensor fusion.
The Ren team from Hefei University of Technology in China studied the emotion computing system on the platform of a humanoid robot; constructed a heart state transfer network, which combined universality and individuality, for mental health problems; developed a multi-modal emotional response model based on the established heart state transfer network; and established an evaluation system for coping strategies. The emotional robot platform and its cloud system developed by the team mainly provide character identity and emotion recognition, gesture and voice interaction, intelligent emotional conversation and chat, and emotional interaction. Emotional robots can be used at home and in medical settings for people of different ages (especially the elderly) and for the assisted rehabilitation of specific conditions (autism and depression).
The above content roughly belongs to the intelligent system of intelligent robots. In fact, more research works are focused on the hardware system of robots, such as the actuator, driving device, sensing device, control system, etc. However, these topics are not within the scope of this review. For more literature about intelligent robot systems, see Ref. 1. The authors of that work reviewed the current research on intelligent robot systems and discussed the future development trends in this field.

Challenges for HCI and Intelligent Robots
HCI and intelligent robot technologies have broad development prospects in various industries. However, although there are many achievements in these two fields, there is still considerable room for raising their level of intelligence. Future intelligent robots and their interaction technologies need to be developed in the following aspects.

Technologies of multimodal fusion perception and human-like intelligent perception
Human beings express their emotions and intentions through multiple signals, such as language, pronunciation and intonation, facial expressions and gestures, as well as some physiological signals, such as blood pressure and heartbeat. Most of the existing perception methods focus on a single mode, whereas the correlation between the multiple modes is ignored. Therefore, multimodal databases, hierarchical fusion perception of multimodal data, and human-like intelligent perception technologies based on such databases will become an important direction for research. The existing mainstream approach to multimodal fusion perception depends on large-scale neural networks and big data. In addition, group decision making and multiple criteria decision making in management can provide good references for studying the decision-making process in multimodal fusion perception. 324-328

Mechanism of multimodal cooperative analysis and intelligent reasoning
At present, research works mainly start from the external appearance of HCI and adopt traditional engineering methods. Then, they focus on the research and implementation of perception theory and technology. However, at the level of real thinking, cooperative analysis and intelligent reasoning mechanisms for multi-source data have not yet been formed. The cooperative representation of multimodal heterogeneous emotional data, the deep adaptive cooperative semantic understanding mechanism, and the efficient reasoning mechanism integrating ontology knowledge and containing knowledge should be the major topics of research in this field.

Technologies of emotion creation and natural HCI
Emotion is a very important factor in the process of natural HCI and is the key to establishing it. Emotion is still a stumbling block on the path to natural HCI and will restrict the further development of intelligent robots. Existing interaction platforms do not integrate multi-channel information and the corresponding feedback mechanisms, and cannot achieve emotional interaction. Methods for mapping human emotions to the emotions of machines and a dynamic feedback mechanism for the emotional loop are possible ways to realize the creation of emotions and natural HCI.

Mental health perception and calculation based on an emotional interaction
Psychology studies have demonstrated that emotional state is an important indicator of mental health, and behaviors such as language, voice, facial expressions, and gestures in the process of interaction always convey emotions. These interaction behaviors have become important ways for people to express their feelings and a visible indicator of their psychological health. Sometimes, these behaviors even reflect a variety of psychological crises. A sudden low voice, for example, may simply be a sign of poor physical health (such as a cold) rather than a sign of poor mental health. However, if the voice is low and the expression is painful at the same time, and the text of the speech contains negative emotional content, the mental health of the person can be judged to be in a bad state, which needs to be adjusted and dealt with. It is a great challenge to accurately perceive and calculate people's mental health state through multi-source signals. Additionally, guiding and improving people's psychological state in the process of interaction is another great challenge.

Deep understanding of natural language and personalized interaction
Deep understanding of natural language and personalized interaction are also difficult challenges faced by HCI. First, only by combining the scene, historical interactive information, pragmatics, and even emotions with a deep understanding of semantics can interaction be natural and efficient. Second, in personalized interaction, the intelligent robot should adjust its interaction method and strategy flexibly according to the scene, interaction object, interaction state, etc.

Human-machine integration and intelligent human-machine interface technology
The degree of intelligence of intelligent robots is growing higher and higher, and human beings are becoming more and more dependent on intelligent robot technology. Ways to promote the integration of humans and robots will become an important research avenue in this field. In order to better adapt to different users and different tasks, improving the harmony of human-robot interaction and the intelligent man-machine interface will become an effective way of achieving human-machine integration.

Conclusions
Ever since computers were born, there have been various interactions between people and computers in order to make computers more responsive to humans' needs. The continuous growth in human demand and curiosity drives the development of HCI technology and intelligent robot technologies. A large number of research works have been carried out to make HCI more natural and harmonious and, at the same time, make robots more intelligent and adaptable. The rapid development of AI technology in recent years provides unprecedented development opportunities for research on these two technologies. This paper summarizes the development status of HCI from the aspect of interaction abilities and introduces the related technologies of intelligent robots. Thereafter, the challenges for these two fields in their future development and possible research approaches are expounded.