THE IMPORTANCE OF LINGUISTIC MODELS IN THE DEVELOPMENT OF LANGUAGE BASES

Relevance. In Uzbek linguistics, a number of studies have addressed automatic translation, the linguistic foundations of an author's corpus, the processing of lexicographic texts, and linguistic-statistical analysis. However, no monograph has yet studied the processing of Uzbek as a language of the Internet: spelling, automatic processing and translation programs, search programs based on various features, text generation, the linguistic basis of text corpora and the national corpus, and the software technology behind them. The article discusses the transformation of language into a language of the Internet and of computer technology; mathematical linguistics and its continuation in the formation and development of computational linguistics; and, in particular, the modeling of natural languages for artificial intelligence. The Uzbek National Corpus plays an important role in enhancing the international status of the Uzbek language.
Objective. To emphasize the importance of linguistic models, such as phonology, morphology and spelling, in forming the linguistic base of the national corpus of the Uzbek language.
Methods. The article uses rational-typological, comparative, meaningful and discursive methods of analysis.
1 Toirova: THE IMPORTANCE OF LINGUISTIC MODELS IN THE DEVELOPMENT OF LANGUAG Published by 2030 Uzbekistan Research Online, 2020

The following examples illustrate the model for attaching the given affixes to the base (A = base, N = derivative):
1. N=A q_а; боланики = бола ники
2. N=A u_j; boladagi = bola dagi
3. N=A ch_а [1]; bolagacha = bola gacha
4. N=A Pl_a; bolalar = bola lar
5. N=A k_а [7]; bolaning = bola ning
6. N=A e_а [6]; bolam = bola m
7. N=A k_a e_а [6]; bolalarim = bola lar im
8. N=A k_а [6]; bolamga = bola m ga
The modeling continues in this order" [2]. In creating the national corpus of the Uzbek language, the optimal version proposed by M. Abjalova is being used. An algorithm of phonological, morphological and orthographic rules must be established in order to form the lexical-grammatical codes in the linguistic-norms module for Uzbek phrases.
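The base-plus-affix pattern in the examples above can be illustrated with a minimal Python sketch. It is not the algorithm of [2]: the stem set, the suffix list and the function name `segment` are hypothetical, and the sketch ignores the ordering constraints and phonological rules that a real morphological analyzer would need.

```python
# Hypothetical toy lexicon and suffix list; a real corpus needs far larger ones.
STEMS = {"bola"}
SUFFIXES = ["gacha", "ning", "niki", "dagi", "lar", "im", "ga", "m"]


def segment(word):
    """Return [stem, affix, ...] for a word, or None if no analysis is found."""
    if word in STEMS:
        return [word]
    for suffix in SUFFIXES:
        # Strip one suffix from the right and try to analyze what remains.
        if word.endswith(suffix) and len(word) > len(suffix):
            analysis = segment(word[: -len(suffix)])
            if analysis is not None:
                return analysis + [suffix]
    return None


for form in ["bolalar", "bolaning", "bolalarim", "bolamga"]:
    print(form, "=", " + ".join(segment(form)))
```

Stripping suffixes from the right mirrors the way the examples stack affixes on the base, as in bolalarim = bola + lar + im.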

Methodology of research.
What is an algorithm? [6] An algorithm is a clear rule (program) for carrying out actions in a fixed order to solve problems of a particular type. It is one of the basic concepts of cybernetics and mathematics. In the Middle Ages, the rules for performing the four arithmetic operations in the decimal number system were called an algorithm [15]. The computer, for all its computing power, is fast, clean, accurate and at the same time "completely incomprehensible" [7]. It is a mistake to think that the computer invents something on its own when we use it to solve problems; it needs clear and complete instructions in order to work. An algorithm is a rigidly fixed sequence of actions that leads to the final result. This may sound strange, but we constantly encounter algorithms in everyday life. One example is using a payphone, which requires a sequence of actions for a successful call. The instructions for home appliances likewise tell us briefly and clearly what to do in a given situation, and so determine the algorithm of our actions. According to historians and mathematicians [21], the word "algorithm" is derived from the name of our great ancestor Abu Abdullah Muhammad ibn Musa al-Khorazmi, and his famous book "Kitab al-jabr wa al-muqabala" gave rise to another popular term, "algebra". It is fair to say that an algorithm is the basis of the instructions that control every computer-assisted activity. We cannot, however, hand our written algorithm directly to the computer, because it is written in a language that only people understand. For a computer to understand an algorithm, it must be translated into machine language; algorithms written in machine language are called programs, or computer programs.
Important properties of any algorithm [20]: definiteness, meaning that each step has a precise meaning; discreteness, meaning that the process of solving the problem can be divided into simple steps (execution steps) that cause no difficulty for the computer or the person; generality; and effectiveness (finiteness), meaning that the actions of the algorithm come to an end and yield the desired result from the initial data in the final steps.
In practice, the following types of algorithms are distinguished: linear, in which actions are carried out sequentially without any conditions being checked; branching, in which the instructions to be carried out depend on whether conditions are met; and cyclic, in which individual actions or groups of actions are repeated. Algorithms may be written down verbally, as formulas, in tables, or graphically.
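The three types can be shown side by side in a short Python sketch. The function names and the arithmetic they perform are illustrative assumptions chosen only to make each control structure visible.

```python
def linear(a, b):
    """Linear: steps run one after another with no condition checked."""
    total = a + b
    return total * 2


def branching(n):
    """Branching: which step runs depends on a condition."""
    if n % 2 == 0:
        return "even"
    return "odd"


def cyclic(n):
    """Cyclic: a group of steps is repeated (here, summing 1..n)."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


print(linear(1, 2), branching(3), cyclic(4))
```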
The available information serves as the raw material that computers process, just as ore is the raw material of metallurgical production. To be processed effectively, however, any raw material needs initial preparation. First we collect information about the event we are interested in; then we systematize and classify it.
Next, we build a model that represents the given event. The model represents the event by means of a mathematical apparatus, graphs or diagrams, and is structured to show the characteristics and key aspects of the situation. Both mathematical and simulation modeling are used. Mathematical modeling is the application of mathematical tools to the study and description of an event; an exact mathematical model makes it possible to observe and analyze the state of an object. Simulation modeling, used mainly in industry, makes it possible to run a series of tests on devices that do not yet exist in reality, using computer equipment and special software. Such modeling speeds up production, because the design and research process is shortened and the number of errors and their cost are reduced. For example, Boeing abandoned its long-standing practice of building physical cabin mock-ups to plan the placement of passenger seats and replaced them with computer models. This saved millions of dollars and reduced the time needed to produce new aircraft parts. Once the model is built, the next step is to create an algorithm that corresponds to it; problems are then solved by means of algorithms. An algorithm for solving a problem that is written in machine language (machine code) as a series of commands is called a machine program. A machine command is an elementary instruction that the machine executes automatically, without additional instructions or explanations. Programming is both a theoretical and a practical activity. The process of translating an algorithm into machine language is called compiling. The first step in "humanizing" machine language was to create programs that convert symbolic names to machine code.
Then programs for converting arithmetic expressions were created, and finally, in 1957, the translator for Fortran, a programming language that became widely used, came into being. Since then, many programming languages have been developed. The computer processes information under the control of the commands of a machine program, using different data in the process. The data used are divided into:
1. Input data: entered into the computer and used as the conditions of the problem.
2. Current or internal data: used to store and process information inside the program.
3. Output data: generated by the program as the result of processing; it may take the form of text, graphics, video, etc.
This means that creating an algorithm is essential for building the national corpus of the Uzbek language, because the computer works under the control of such an algorithm.
Analysis and results. The national corpus of the Uzbek language must cover the lexical units that exist in the language, such as synonyms, antonyms, homonyms, loanwords and hierarchies of words; it must be able to analyze automatically the morphological structure of a word, its formation, its meaning, and its morphological features. In other words, when the corpus is compiled, lemmatized and annotated, it must be possible, for individual search queries, to find and interpret the words contained in the corpus texts. To do this, the algorithms and linguistic modeling described above must be carried out. For the automatic analysis of the morphological characteristics of words, it is necessary to use parts of M. Abjalova's research "Linguistic modules of the program for editing and analyzing texts in the Uzbek language" [2], A. Eshmominov's research "Synonymous database of the Uzbek national corpus" [17], Sh. Khamroeva's research "Linguistic bases for the creation of the author's corpus of the Uzbek language" [18], and N. Abdurahmanova's research "Linguistic support for the program for the translation of English texts into Uzbek" [1].
"Dictionary of synonyms of Uzbek language", "Explanatory dictionary of Uzbek words", "Dictionary of obsolete words of Uzbek language", "Dictionary of Uzbek words", "Dictionary of contradictory words of the Uzbek language", "Dictionary of word classification of the Uzbek language", "Educational etymological dictionary of the Uzbek language" and "Educational toponymic dictionary of the Uzbek language" can serve as linguistic support. Such dictionaries must first be reworked: the words are lemmatized and, depending on the nature of the words, their series are delimited and the members of each lemma series are linked with one another. Only then can the revised dictionary serve as the basis for the programmer's software.
In the final stage, texts prepared with meta-markup and morphological markup undergo several more automatic transformations. The following programs, written in Perl, are used:
1) The converter. It converts the working format of the markup to the final format, turning the morphological analysis given in parentheses into the final form <w lex =… ..gr =….>. To further improve search quality, it also checks for some spelling errors, transliterates names into Latin script, adds missing tags, and distinguishes different forms of the verb.
2) The semantic markup program (Semmarkup). It adds basic semantic tags to words using a special semantic dictionary, which makes semantic search in the corpus much easier. The semantic dictionary is formalized as a table: the first column contains a lexeme or phrase, and the remaining columns contain semantic tags. After the program compares the morphological tags of a word with the dictionary and finds a match, it copies the semantic tags into the sem attribute of the <w> tag. For polysemous words, however, certain errors may occur in semantic search.
3) Statistical programs (Gramstat, Metastat). These programs collect statistics on the distribution of grammatical and metatextual tags in texts, which makes it possible to find tagging errors quickly. The Gramstat program reports the distribution of each component of the morphological analysis (lexeme, part of speech, and the grammatical features of the lexeme and of the word form) separately.
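The kind of count a statistics program such as Gramstat produces can be sketched in a few lines of Python. This is an assumption-laden illustration, not the Perl program itself: the quoted attribute syntax, the tag values (`N`, `V`, `pl`, `past`, `sg`) and the function name `gram_stats` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample in the <w lex=... gr=...> format described above.
SAMPLE = (
    '<w lex="bola" gr="N,pl">bolalar</w> '
    '<w lex="kel" gr="V,past">keldi</w> '
    '<w lex="kitob" gr="N,sg">kitob</w>'
)

W_TAG = re.compile(r'<w lex="([^"]*)" gr="([^"]*)">([^<]*)</w>')


def gram_stats(text):
    """Count how often each grammatical tag and each lexeme occurs."""
    tags, lexemes = Counter(), Counter()
    for lex, gr, _form in W_TAG.findall(text):
        lexemes[lex] += 1
        tags.update(gr.split(","))
    return tags, lexemes


tags, lexemes = gram_stats(SAMPLE)
print(tags.most_common())
```

A tag that occurs only once or twice in a large corpus is a natural candidate for a typing error, which is how such counts help to locate mistakes in the markup.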
The above technology helps automate complex processes in the preparation of corpus texts. Some operations (text cleaning, homonymy resolution, meta-markup) are not automated at all, but a number of service tools have been developed that make them much easier. From the start, the data encoding was deliberately kept simple so that the additional marks would not interfere with editing the text. The complex final format is produced automatically at the last stage.
The Russian National Corpus, the Corpus of Contemporary American English, the Oxford English Corpus and the Czech National Corpus are among the corpora that have been established worldwide. Uzbekistan, however, has not yet created such a linguistic foundation. Although Ziyonet currently has an electronic library, it has no system for processing texts automatically or for searching texts by different features. It is not intended for vocabulary study or language learning, and its texts cannot be listened to. In the national corpus program and database, by contrast, a system is provided for automatic text processing and for searching by several features. Rarely used words, phrases and combinations are easy to find, use and check for spelling. The learner can also hear the text read aloud. This opens up possibilities for directed education. A key task in building a corpus is markup, or annotation (linguistic analysis). Markup means assigning special tags to texts and their components; it may be linguistic or extralinguistic. Currently, the following types of markup exist: morphological, semantic, syntactic, anaphoric, prosodic, discourse, and others [11]. Extralinguistic markup comprises markup that reflects the formatting of the text (chapter, paragraph, section, etc.) and markup that carries information about its author.
Most modern markup languages are based on SGML / XML, in which the marked-up text contains two parallel layers of data: visible (the text itself) and hidden (the tags, or markup) [11]. The hidden information is placed inside the text, but enclosed in special markers <…> that separate it from the visible text. Unlike external methods of annotation (e.g. comments), the markup is always incorporated into the text and is an integral part of it. Deeper levels of structural analysis are used in some corpora. In particular, some small corpora are annotated on the basis of a complete syntactic analysis. Such corpora are usually described as deeply annotated or syntactic corpora; a syntactic markup, for example, resembles a large tree.
We know that manual analysis of texts is a costly and time-consuming task. Currently, various software analysis tools are openly (directly) accessible on Russian and foreign sites. They fall into stand-alone programs and web-based applications, and in recent years developers have focused on web applications. These systems have several advantages: several users can annotate a single document at once; no additional software needs to be installed apart from a browser; access rights can be restricted; and the annotation process can be monitored.
In particular, let us consider the analysis of a text from the story "Speech" by A. Qahhor. The text reads as follows: "You don't love me, you're not happy with our marriage, I've been waiting until this hour, this minute, you haven't said a word, it's been a year since we put our heads on a pillow ... The speaker really forgot about it, but he was talking." The features that distinguish the text above are given in the Table.
The main advantage of SGML / XML over other markup languages (TeX, RTF) is its strict syntax of markup commands: attributes and elements are distinguished, element boundaries are clearly indicated, the markup is self-documenting, and the correctness of the entry can be verified automatically.
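Because the syntax is strict, a standard XML parser can both verify the markup and separate the visible layer from the hidden one. The following Python sketch uses the standard library's `xml.etree.ElementTree`; the sample fragments, the `ana` values and the function name `visible_text` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def visible_text(marked_up):
    """Return the visible text layer, or None if the markup is not well-formed."""
    try:
        root = ET.fromstring(marked_up)
    except ET.ParseError:
        return None
    # itertext() walks the tree and yields only text, skipping tags and attributes.
    return "".join(root.itertext())


good = '<s><w ana="N">bola</w> <w ana="V">keldi</w></s>'
bad = '<s><w ana="N">bola</s>'  # the <w> element is never closed

print(visible_text(good))
print(visible_text(bad))
```

The second fragment is rejected by the parser itself, which is the automatic verification that formats without strict element boundaries cannot offer.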
The most authoritative standards for corpus data encoding are TEI (Text Encoding Initiative) [5], CES / XCES (Corpus Encoding Standard and its XML version) [8], and EAGLES (Expert Advisory Group on Language Engineering Standards) [10]. TEI in particular is recognized as a well-developed standard. It defines rules for representing different types of texts and elements of textual information, with particular attention to structure, titles, type of text (prose, poetry, drama), pages, quotations, notes and references (footnotes, comments), corrections, tables, formulas, special characters, linguistic annotations, etc. A special section of the standard is devoted to the rules for encoding corpora. Although TEI is not specifically tailored for corpus applications, it is often used for them, alone or together with similar standards, for example in the British National Corpus (BNC), the Czech National Corpus and the Hungarian National Corpus. The XCES standard is an application of TEI designed specifically for corpora and intended to define corpus-specific tags.
But when we studied the universal TEI and XCES standards in detail, we found them too complex, redundant, and inconvenient for the mass markup of texts. The full provisions of TEI are very broad and not always justified, so it is difficult to comply with all the requirements of the standard. The format is not compact, and the size of the content usually grows. The format also loses transparency: for example, it proposes that meta-attributes be written as text inside tags, so that removing the markup does not cleanly return the text to its original state and errors occur.
One can also restrict oneself to a TEI application by rejecting "redundant" tags. A minimum set of tags is selected from TEI to represent the corpus: <text> for the text, <p> for a paragraph, <s> for a sentence and <w> for a word, with the morphological analysis written as an attribute in the form <w ana = ...>. However, such a representation does not fully comply with the corpus markup standard. It is reminiscent of a simplified version of HTML.
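A minimal Python sketch of such a reduced tag set follows, assuming (as in TEI) that <p> marks a paragraph, <s> a sentence and <w> a word. The function name `build_text`, the nested-list input shape and the `ana` values are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_text(paragraphs):
    """Build <text><p><s><w ana=...> from nested lists.

    paragraphs: list of paragraphs; each paragraph is a list of sentences;
    each sentence is a list of (word_form, analysis) pairs.
    """
    text = ET.Element("text")
    for sentences in paragraphs:
        p = ET.SubElement(text, "p")
        for words in sentences:
            s = ET.SubElement(p, "s")
            for form, ana in words:
                w = ET.SubElement(s, "w", ana=ana)
                w.text = form
                w.tail = " "  # keep a space between words in the visible layer
    return text


doc = build_text([[[("bola", "N,sg"), ("keldi", "V,past")]]])
print(ET.tostring(doc, encoding="unicode"))
```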
The complexity of XML formats is not the main problem; the major problem is the complete lack of widely available programs for preparing, processing, indexing and searching such data. Linguists have only relatively simple programs available to them: XML parsers, editors, converters and linear search programs are widely used. Such a set of programs turns out to be insufficient for a corpus with a volume of millions of words. Of course, internal tasks such as preparing the corpus and its markup can be solved with the help of specially written converters, macros and other tools.
The data representation format of the corpus is developed on the basis of existing encoding standards (TEI, XCES). HTML belongs to the SGML / XML family, is the most common format, and can be used in many applications [19]. Today, search engines are able to interpret the semantics and structure of HTML tags.
HTML is a very simple format that imposes minimal requirements on content and on the amount of markup, and in practice many of its commands can be left unused. It is very convenient and compact for manual editing and visual inspection. The standard itself has no tags for representing linguistic units, but HTML permits non-standard tags, and the problem is resolved by specially configuring the search server.
The corpus format is a subset of HTML, with some special tags added for linguistic units. This format specifies the encoding requirements for the important text information and includes: 1) metatext attributes; 2) elements of text structure (title, paragraph, verse, notes and references (footnotes, comments), and tables); 3) linguistic units (sentences, words); 4) lexical information (grammatical and semantic tags); 5) text formatting parameters, special characters, etc. [20].
Metatext attributes are entered into the texts at different points, so that steps 2 and 3 can be done in parallel or in any order. But the file name of each text must be fixed and recorded. Actions such as renaming an individual link or file are not performed, since they could disrupt the operation of the entire system. Metadata are stored in simple Excel spreadsheets with a predefined structure: the first column contains the name of the file (with a clearly specified path), and the other columns contain metadata attributes and processing information. This makes it possible to use Excel's built-in tools effectively for searching, filtering, analysis and data processing (lists, auto-fill, statistics), and makes searching much easier. The tables must be stored in a text format that Excel can read. This allows the files to be opened not only in Excel but also in other spreadsheet programs, and makes processing by programs more efficient.
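One text format that Excel and other spreadsheet programs can read is a tab-separated table. The Python sketch below reads such a table with the standard `csv` module; the column names, the sample path and the function name `load_meta_table` are assumptions, since the source does not specify the table's exact structure beyond the first column.

```python
import csv
import io

# Hypothetical metadata table: the first column is the file path,
# the remaining columns are metadata attributes (left empty where unknown).
TABLE = (
    "path\tauthor\ttitle\tcreated\n"
    "texts/qahhor_01.html\tA. Qahhor\tSpeech\t\n"
)


def load_meta_table(text):
    """Read a tab-separated metadata table into {path: {attribute: value}}."""
    table = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        path = row.pop("path")
        table[path] = dict(row)
    return table


meta = load_meta_table(TABLE)
print(meta["texts/qahhor_01.html"]["author"])
```

Keying the table by file path reflects the requirement that file names stay fixed: if a file were renamed, its row could no longer be matched to its text.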
Theoretically, metadata can be stored separately from each text, but according to the HTML rules the data must be stored in the file header so that the Yandex-server can index them. When metadata are stored separately, there is always the problem of keeping the meta-tables and the texts synchronized with each other.
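Reading metadata back out of file headers can be sketched with Python's standard `html.parser`. The sketch assumes that each attribute is stored as a `<meta name=... content=...>` tag, which the source does not state explicitly; the class name `MetaCollector` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from every <meta> tag in a document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes:
                self.meta[attributes["name"]] = attributes.get("content", "")


def collect_meta(html_text):
    """Return the metadata attributes found in an HTML text as a dict."""
    collector = MetaCollector()
    collector.feed(html_text)
    return collector.meta


PAGE = """<html><head>
<meta name="author" content="A. Qahhor">
<meta name="title" content="Speech">
</head><body><p>...</p></body></html>"""

print(collect_meta(PAGE))
```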
Suggestions. The following programs are used when metadata are stored separately: 1) Metas creates the meta-tables by collecting metatext attributes from the file headers. The tables can then be modified manually in Excel. At the initial processing stage, some metadata, such as the author's name, the title and the date of creation, can be added to the text. At the final stage, the Metas.bat program collects all attributes and completes the verification phase.
2) Meta.txt takes the metatext attributes from the modified meta-tables and transfers them to the existing texts. This program checks that the file is available and updates its header. In the tables, multiple values of an attribute are separated by a " " symbol. When transferred to the text, each value appears as a separate attribute. Metadata attributes can therefore move freely between the texts and the meta-tables. Meta-markup, on the other hand, has to be carried out iteratively, with several cycles of verification.
3) MetaTest checks the accuracy of the meta-table. The attribute values in the table are compared with those listed in the templates. The program marks incorrect values with a "#" character so that they can be checked and corrected manually.
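The check performed by a program like MetaTest can be sketched in Python as a comparison against a template of allowed values. The template contents, the attribute names `genre` and `medium`, and the function name `check_row` are hypothetical; the handling of multi-valued attributes is left out.

```python
# Hypothetical template: the allowed values for each metadata attribute.
TEMPLATE = {
    "genre": {"prose", "poetry", "drama"},
    "medium": {"book", "newspaper"},
}


def check_row(row, template=TEMPLATE):
    """Return a copy of row with every value missing from the template prefixed by '#'."""
    checked = {}
    for attribute, value in row.items():
        allowed = template.get(attribute)
        if allowed is not None and value not in allowed:
            checked[attribute] = "#" + value  # flag for manual correction
        else:
            checked[attribute] = value
    return checked


print(check_row({"genre": "prose", "medium": "radio"}))
```

Flagging a value in place rather than rejecting the row keeps the table editable, so the person correcting it can find every "#" with an ordinary search.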
All the above programs are written in Perl. At the final stage of processing, texts prepared with meta-markup and morphological markup undergo several more automatic transformations. The converter turns the morphological analysis given in parentheses into the final form <w lex =… ..gr =….> and checks for some markup errors in order to further improve search quality.
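The conversion step can be illustrated with a short Python sketch. The source does not give the exact working format, so the pattern `form(lexeme=grammar)` used here is an assumption, as are the tag values and the function name `to_final`; the original converter is a Perl program whose details are not reproduced.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed working format: form(lexeme=grammar), e.g. "bolalar(bola=N,pl)".
WORKING = re.compile(r"(\w+)\((\w+)=([\w,]+)\)")


def to_final(line):
    """Rewrite every form(lexeme=grammar) token as a <w lex=... gr=...> element."""

    def repl(match):
        form, lex, gr = match.groups()
        # quoteattr() adds the surrounding quotes and escapes special characters.
        return "<w lex={} gr={}>{}</w>".format(quoteattr(lex), quoteattr(gr), escape(form))

    return WORKING.sub(repl, line)


print(to_final("bolalar(bola=N,pl) keldi(kel=V,past)"))
```

Tokens without an analysis in parentheses are left untouched, so the conversion can be run repeatedly over partly annotated text.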
The semantic markup program adds basic semantic tags to words using a special semantic dictionary, which greatly facilitates semantic search in the corpus. The semantic dictionary is formalized as a table: the first column contains a lexeme or phrase, and the remaining columns contain semantic tags. After the program compares the morphological tags of a word with the dictionary and finds a match, it copies the semantic tags into the sem attribute of the <w> tag. For polysemous words, however, various errors can occur in semantic search.
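A simplified Python sketch of this lookup follows. It matches on the lexeme alone, whereas the program described also compares morphological tags; the dictionary entries, the semantic tag names and the function name `add_semantics` are hypothetical.

```python
import re

# Hypothetical semantic dictionary: lexeme -> semantic tags
# (standing in for the first and remaining columns of the table).
SEM_DICT = {
    "bola": ["human", "kin"],
    "kitob": ["concrete", "text"],
}

W_OPEN = re.compile(r'<w lex="([^"]*)"([^>]*)>')


def add_semantics(markup, sem_dict=SEM_DICT):
    """Add a sem attribute to each <w> tag whose lexeme is in the dictionary."""

    def repl(match):
        lex, rest = match.groups()
        tags = sem_dict.get(lex)
        if tags is None:
            return match.group(0)  # unknown lexeme: leave the tag unchanged
        return '<w lex="{}"{} sem="{}">'.format(lex, rest, ",".join(tags))

    return W_OPEN.sub(repl, markup)


print(add_semantics('<w lex="bola" gr="N,pl">bolalar</w>'))
```

Because the lookup returns every tag listed for a lexeme, a polysemous word receives the tags of all its senses, which is one source of the search errors mentioned above.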
The above technology helps to automate complex operations in the preparation of texts for the corpus. Some operations are not automated at all (text cleaning, homonymy resolution, meta-markup), but a set of service tools has been developed that makes them much easier. From the very beginning, the data encoding format was kept deliberately simple. The complex final markup format is then produced automatically at the final stage.
Conclusion (Recommendations). In conclusion, it should be noted that the role of linguistic modeling in forming the linguistic base of the national corpus cannot be overstated. It is therefore necessary to create an algorithm as the basis of the instructions that control the computer's work. In developing a morphological markup algorithm, it is important to develop specific linguistic models by marking up each part of speech.