
NATIONAL UNIVERSITY OF UZBEKISTAN NAMED AFTER MIRZO ULUGBEK

FACULTY OF COMPUTER TECHNOLOGIES

On the topic: Semantic parsing of an EXE file.

Completed:

Tashkent 2003.

Preface.

Assembly language and instruction structure.

EXE file structure (semantic parsing).

Structure of a COM file.

How the virus works and spreads.

Disassembler.

Programs.

Foreword

The programmer's profession is remarkable and unique. Today neither science nor everyday life can be imagined without modern technology, and almost every kind of human activity relies on computers, which in turn drives their rapid development. Although personal computers appeared fairly recently, enormous progress has been made in software over this time, and these products will remain in wide use for a long while. Computer-related knowledge has grown explosively, as has the related technology. Leaving the commercial side aside, there are no outsiders in this field: many people develop programs not for profit but of their own free will, out of enthusiasm. This must not affect the quality of the program, of course; here, too, there is competition and a demand for quality, stable operation, and compliance with modern requirements.

It is also worth noting the appearance of microprocessors in the early 1970s, which came to replace bulky vacuum-tube assemblies. Microprocessors vary widely; they differ in bit width and in their built-in instruction sets. The most common names are Intel, IBM, Celeron, AMD, and others. All of these are related to the processor architecture developed by Intel. The spread of microcomputers caused attitudes toward assembly language to be reconsidered, for two main reasons. First, programs written in assembly language require significantly less memory and execution time. Second, knowledge of assembly language and the resulting machine code gives an understanding of the machine architecture that working in a high-level language hardly provides.

Although most software engineers develop in high-level languages such as Pascal, C, or Delphi, in which programs are easier to write, the most powerful and efficient software is written wholly or partly in assembly language. High-level languages were designed to hide the technical details of particular computers. Assembly language, in turn, is designed around the specifics of a particular processor. Therefore, to write an assembly language program for a particular computer, one must know its architecture.

Nowadays the main form of a software product is the EXE file. Given its advantages, the author of a program may feel sure that it cannot be tampered with. Often this is far from the case, because disassemblers exist. With a disassembler, one can discover the interrupts a program uses and its code. Someone well versed in assembler will have little difficulty reworking the entire program to his taste. Perhaps this is where the most intractable problem comes from: the virus. Why do people write viruses? Some ask this question with surprise, some with anger, yet there are people who are interested in the task not in order to cause harm but out of interest in system programming. Viruses are written for various reasons: some authors like system calls, others want to improve their knowledge of assembler. I will try to explain all this in my term paper, which covers not only the structure of the EXE file but also assembly language.

Assembly Language.

It is interesting to trace how programmers' ideas about assembly language have changed from the appearance of the first computers to the present day.

Once, assembler was a language without which it was impossible to make a computer do anything useful. Gradually the situation changed, and more convenient means of communicating with a computer appeared. But, unlike other languages, assembler did not die; moreover, it could not do so in principle. Why? In search of an answer, let us try to understand what assembly language is in general.

In short, assembly language is a symbolic representation of machine language. All processes in the machine at the lowest, hardware level are driven only by commands (instructions) of the machine language. It follows that, despite the common name, the assembly language for each type of computer is different. This applies both to the appearance of programs written in assembler and to the ideas that the language reflects.

Problems related to the hardware (or, moreover, dependent on it, such as increasing the speed of a program) cannot really be solved without knowledge of assembler.

A programmer or any other user can use any high-level tools, up to programs for building virtual worlds, and perhaps not even suspect that the computer actually executes not the commands of the language in which the program is written, but their transformed representation: a boring and dull sequence of commands of a completely different language, machine language. Now let's imagine that such a user has a non-standard problem, or that something simply went wrong. For example, the program must work with some unusual device or perform other actions that require knowledge of how computer hardware works. No matter how smart the programmer is, and no matter how good the language in which the program is written, he cannot do without knowledge of assembler. It is no coincidence that almost all compilers of high-level languages provide a means of linking their modules with assembler modules or support access to the assembler level of programming.

Of course, the time of computer generalists has already passed. As the saying goes, you cannot embrace the boundless. But there is something common to all, a kind of foundation on which any serious computer education is built: knowledge of the principles of computer operation, of computer architecture, and of assembly language as a reflection and embodiment of this knowledge.

A typical modern computer (i486 or Pentium based) consists of the following components (Figure 1).

Fig. 1. Computer and peripherals

Fig. 2. Block diagram of a personal computer

Figure 1 shows that the computer is made up of several physical devices, each connected to a single unit called the system unit. Logically, it plays the role of a coordinating device. Let's look inside the system unit (there is no need to try to get inside the monitor: there is nothing interesting there, and it is dangerous besides): we open the case and see boards, blocks, and connecting wires. To understand their functional purpose, let's look at the block diagram of a typical computer (Fig. 2). It does not claim absolute accuracy and aims only to show the purpose, interconnection, and typical composition of the elements of a modern personal computer.

Let's discuss the diagram in Fig. 2 in a somewhat unconventional style.
It is human nature, when meeting something new, to look for associations that help make the unknown familiar. What associations does the computer evoke? For me, for example, the computer is often associated with the human being. Why?

A person creating a computer thought, somewhere deep inside, that he was creating something similar to himself. The computer has organs for perceiving information from the outside world: the keyboard, the mouse, and magnetic disk drives. In Fig. 2 these organs are located to the right of the system buses. The computer has organs that “digest” the received information: the CPU and working memory. And, finally, the computer has speech organs that deliver the results of processing. These, too, are some of the devices on the right.

Modern computers are, of course, far from human. They can be compared to beings that interact with the outside world at the level of a large but limited set of unconditioned reflexes.
This set of reflexes forms the system of machine instructions. No matter how high the level at which you communicate with a computer, in the end it all comes down to a boring and monotonous sequence of machine instructions.
Each machine instruction is a kind of stimulus that triggers one or another unconditioned reflex. The reaction to this stimulus is always unambiguous and is “hardwired” in the microcommand block in the form of a microprogram. The microprogram carries out the actions needed to execute the machine instruction, but at the level of signals supplied to particular logic circuits of the computer, thereby controlling its various subsystems. This is the so-called principle of microprogram control.

Continuing the analogy with a person, we note that many operating systems, compilers for hundreds of programming languages, and so on have been invented so that the computer can eat properly. But all of them are, in fact, just a dish on which food (programs) is delivered, according to certain rules, to the stomach (the computer). The computer's stomach, however, loves dietary, monotonous food: give it structured information in the form of strictly organized sequences of zeros and ones, whose combinations make up the machine language.

Thus, outwardly being a polyglot, the computer understands only one language - the language of machine instructions. Of course, to communicate and work with a computer, it is not necessary to know this language, but almost any professional programmer sooner or later faces the need to learn it. Fortunately, the programmer does not need to try to figure out the meaning of various combinations of binary numbers, since as early as the 50s, programmers began to use the symbolic analogue of machine language for programming, which was called assembly language. This language accurately reflects all the features of machine language. That is why, unlike high-level languages, the assembly language is different for each type of computer.

From the foregoing, we can conclude that, since assembly language is “native” to the computer, the most efficient program can only be written in it (provided that it is written by a qualified programmer). There is one small “but” here: this is a very laborious process that requires a lot of attention and practical experience. In practice, therefore, assembler is used mainly for programs that must work efficiently with the hardware. Sometimes the parts of a program that are critical in terms of execution time or memory consumption are written in assembler. They are then packaged as subroutines and combined with code in a high-level language.

It makes sense to start learning the assembly language of any computer only after finding out which part of the computer remains visible and available for programming in this language. This is the so-called programming model of the computer, part of which is the programming model of the microprocessor; it contains 32 registers that are, to a greater or lesser degree, available to the programmer.

These registers can be divided into two large groups:

16 user registers;

16 system registers.

Assembly language programs use registers very heavily. Most registers have a specific functional purpose.

As the name implies, user registers are so called because the programmer can use them when writing programs. These registers include (Fig. 3):

eight 32-bit registers that programmers can use to store data and addresses (also called general-purpose registers, GPRs);

six segment registers: cs, ds, ss, es, fs, gs;

status and control registers:

Flags register eflags/flags;

eip/ip command pointer register.

Fig. 3. User registers of i486 and Pentium microprocessors

Why are many of these registers shown with a slash? These are not different registers but parts of one large 32-bit register, which can be used in a program as separate objects. This was done so that programs written for Intel's earlier 16-bit microprocessor models, starting with the i8086, continue to work. The i486 and Pentium microprocessors have mostly 32-bit registers. Their number, with the exception of the segment registers, is the same as in the i8086, but their size is larger, which is reflected in their names: they have the prefix e (Extended).

General purpose registers
All registers of this group allow access to their “lower” parts (see Fig. 3). Looking at the figure, note that only the lower 16-bit and 8-bit parts of these registers can be addressed independently. The upper 16 bits are not available as independent objects. As noted above, this is done for compatibility with Intel's earlier 16-bit microprocessor models.
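
The way a 32-bit register shares its bits with its lower parts can be sketched in Python. This is only a model, under the assumption that eax is treated as a plain 32-bit integer; the function names are invented for the illustration and belong to no assembler.

```python
# Model of how eax, ax, ah and al share the same 32 bits.
# All function names here are illustrative assumptions.

def ax(eax):
    """Lower 16 bits of eax."""
    return eax & 0xFFFF

def al(eax):
    """Lowest 8 bits of eax."""
    return eax & 0xFF

def ah(eax):
    """Bits 8..15 of eax."""
    return (eax >> 8) & 0xFF

def set_ax(eax, value):
    """Write ax; the upper 16 bits of eax stay untouched."""
    return (eax & 0xFFFF0000) | (value & 0xFFFF)

eax = 0x12345678
print(hex(ax(eax)), hex(ah(eax)), hex(al(eax)))  # 0x5678 0x56 0x78
print(hex(set_ax(eax, 0xABCD)))                  # 0x1234abcd
```

Note that there is no helper for the upper 16 bits alone, which mirrors the fact that they are not available as an independent object.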

Let's list the registers belonging to the group of general purpose registers. Since these registers are physically located in the microprocessor inside the arithmetic logic unit (ALU), they are also called ALU registers:

eax/ax/ah/al (Accumulator register) - accumulator.
Used to store intermediate data. In some commands, the use of this register is required;

ebx/bx/bh/bl (Base register) - base register.
Used to store the base address of some object in memory;

ecx/cx/ch/cl (Count register) - counter register.
It is used in commands that perform some repetitive actions. Its use is often implicit and hidden in the algorithm of the corresponding command.
For example, the loop organization command, in addition to transferring control to a command located at a certain address, analyzes and decrements the value of the ecx/cx register by one;

edx/dx/dh/dl (Data register) - data register.
Just like the eax/ax/ah/al register, it stores intermediate data. Some commands require its use; for some commands this happens implicitly.

The following two registers support the so-called string (chain) operations, that is, operations that sequentially process chains of elements, each of which can be 32, 16, or 8 bits long:

esi/si (Source Index register) - source index.
In string operations this register contains the current address of the element in the source chain;

edi/di (Destination Index register) - destination (receiver) index.
In string operations this register contains the current address in the destination chain.

In the architecture of the microprocessor at the hardware and software level, such a data structure as a stack is supported. To work with the stack in the microprocessor instruction system there are special commands, and in the microprocessor software model there are special registers for this:

esp/sp (Stack Pointer register) - stack pointer register.
Contains a pointer to the top of the stack in the current stack segment.

ebp/bp (Base Pointer register) - stack frame base pointer register.
Designed to organize random access to data inside the stack.

A stack is an area of the program used for temporary storage of arbitrary data. Data can of course also be stored in the data segment, but then a separate named memory cell must be created for each temporarily stored item, which increases the size of the program and the number of names used. The stack is convenient because its area is reused, and data is stored on the stack and fetched from it with the efficient push and pop commands without specifying any names.

The stack is traditionally used, for example, to save the contents of the registers used by the program before calling a subroutine, which will in turn use the processor registers "for its own purposes." The original contents of the registers are restored from the stack upon return from the subroutine. Another common technique is passing a subroutine the parameters it requires via the stack. The subroutine, knowing the order in which the parameters are placed on the stack, can take them from there and use them during its execution.

A distinctive feature of the stack is the order in which its data is retrieved: at any time only the top element is available, that is, the element loaded onto the stack last. Popping the top element makes the next element available. The elements of the stack are located in the memory area allocated for the stack, starting from the bottom of the stack (that is, from its maximum address) and proceeding toward successively lower addresses. The address of the top, accessible element is stored in the stack pointer register SP.

Like any other area of program memory, the stack must be part of some segment or form a separate segment. In either case, the segment address of that segment is placed in the stack segment register SS. Thus the register pair SS:SP describes the address of the accessible stack cell: SS holds the segment address of the stack, and SP holds the offset of the last item stored on the stack (Fig. 4, a). Note that in the initial state the stack pointer SP points to a cell that lies below the bottom of the stack and is not part of it.

Fig 4. Stack organization: a - initial state, b - after loading one element (in this example, the contents of the AX register), c - after loading the second element (contents of the DS register), d - after unloading one element, e - after unloading two elements and return to the original state.

Loading onto the stack is carried out by the special push command. This instruction first decrements the contents of the stack pointer by 2 and then places the operand at the address in SP. If, for example, we want to temporarily save the contents of the AX register on the stack, we should execute the command

push AX

The stack passes to the state shown in Fig. 4, b. The stack pointer has moved up by two bytes (toward lower addresses), and the operand specified in the push command has been written at that address. The next command that loads onto the stack, for example,

push DS

moves the stack to the state shown in Fig. 4, c. The stack now holds two elements, and only the top one, pointed to by the stack pointer SP, is accessible. If, after some time, we need to restore the original contents of the registers saved on the stack, we must execute the pop commands, which unload data from the stack:

pop DS
pop AX
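
The push and pop behavior described above can be modeled in Python. This is a minimal sketch, assuming a 16-bit stack whose memory is a dictionary keyed by offset within the stack segment; the class name Stack16 and the register values are invented for the illustration.

```python
# Minimal model of the 16-bit stack. Stack16 is a hypothetical helper:
# mem maps offsets inside the stack segment to stored words.

class Stack16:
    def __init__(self, size_bytes):
        self.sp = size_bytes   # initial SP points just below the bottom of the stack
        self.mem = {}

    def push(self, word):
        self.sp -= 2                        # first decrement SP by 2
        self.mem[self.sp] = word & 0xFFFF   # then store the operand at SS:SP

    def pop(self):
        word = self.mem[self.sp]   # read the top element
        self.sp += 2               # then move SP back toward the bottom
        return word

stack = Stack16(256)           # 128 words
ax, ds = 0x1234, 0x0B00        # illustrative register contents
stack.push(ax)                 # push AX
stack.push(ds)                 # push DS
restored_ds = stack.pop()      # pop DS
restored_ax = stack.pop()      # pop AX
print(hex(restored_ds), hex(restored_ax), stack.sp)  # 0xb00 0x1234 256
```

The values come back in the reverse of the order in which they were pushed, which is why the pop commands are written in the opposite order to the push commands.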

How big should the stack be? It depends on how intensively the program uses it. If, for example, you plan to store an array of 10,000 bytes on the stack, then the stack must be at least that size. Keep in mind that in some cases the stack is used automatically by the system, in particular when the interrupt command int 21h is executed. With this command the processor first pushes the flags and the return address onto the stack, and then DOS pushes the contents of the registers and other information related to the interrupted program there. Therefore, even if the program does not use the stack at all, a stack must still be present in the program and have a size of at least several tens of words. In our first example, we allocated 128 words for the stack, which is definitely enough.

Assembly program structure

An assembly language program is a collection of blocks of memory called memory segments. A program may consist of one or more of these block-segments. Each segment contains a collection of language sentences, each of which occupies a separate line of program code.

Assembly statements are of four types:

commands or instructions that are symbolic counterparts of machine instructions. During the translation process, assembly instructions are converted into the corresponding commands of the microprocessor instruction set;

macro-commands - sentences of the program text that are designed in a certain way and are replaced by other sentences during translation;

directives that tell the assembler compiler to perform some action. Directives have no counterparts in machine representation;

comment lines containing any characters, including letters of the Russian alphabet. Comments are ignored by the translator.

Assembly language syntax

Each sentence of a program is a syntactic construct corresponding to a command, macro, directive, or comment. For the assembler translator to recognize them, they must be formed according to certain syntactic rules. These rules are best given as a formal description of the language syntax, like the rules of a grammar. The most common ways to describe a programming language in this way are syntax diagrams and extended Backus-Naur forms. Syntax diagrams are more convenient for practical use. For example, the syntax of assembly language sentences can be described by the syntax diagrams shown in the following figures.

Fig. 5. Assembler sentence format

Fig. 6. Directive format

Fig. 7. Format of commands and macros

In these figures:

label name - an identifier whose value is the address of the first byte of the program source code sentence it denotes;

name - an identifier that distinguishes this directive from other directives of the same name. As a result of processing by the assembler of a certain directive, certain characteristics can be assigned to this name;

operation code (opcode) and directive - mnemonic notation for the corresponding machine instruction, macro instruction, or translator directive;

operands - parts of the command, macro or assembler directives, denoting objects on which operations are performed. Assembler operands are described by expressions with numeric and text constants, variable labels and identifiers using operation signs and some reserved words.

How to use syntax diagrams? It is very simple: find and then follow a path from the diagram's input (left) to its output (right). If such a path exists, the sentence or construct is syntactically correct. If there is no such path, the translator will not accept the construct. When working with syntax diagrams, pay attention to the direction of traversal indicated by the arrows, since some paths may be followed from right to left. In essence, syntax diagrams reflect the logic the translator follows when parsing the input sentences of a program.

The following characters are allowed in program text:

all Latin letters: A-Z, a-z. Uppercase and lowercase letters are considered equivalent;

digits from 0 to 9;

the characters ?, @, $, _, &;

separators: , . ( ) < > { } + / * % ! " " ? \ = # ^.

Assembler sentences are formed from lexemes, which are syntactically inseparable sequences of valid language symbols that make sense for the translator.

The lexemes are:

identifiers - sequences of valid characters used to designate program objects such as opcodes, variable names, and label names. The rule for writing identifiers is as follows: an identifier consists of one or more characters. The characters may be letters of the Latin alphabet, digits, and some special characters: _, ?, $, @. An identifier cannot start with a digit. An identifier can be up to 255 characters long, although the translator takes only the first 32 into account and ignores the rest. The number of significant characters can be adjusted with the mv command-line option. In addition, the translator can be told either to distinguish between uppercase and lowercase letters or to ignore the difference (which is the default).
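
The identifier rules above can be written down as a small Python check. This is a sketch under stated assumptions: it implements only the rules listed in the text (allowed characters, no leading digit, 255-character limit, 32 significant characters, case-insensitivity by default) and ignores reserved words; the names IDENT_RE, is_identifier, and canonical are invented for the illustration.

```python
import re

# Latin letters, digits, and the special characters _ ? $ @;
# the first character must not be a digit.
IDENT_RE = re.compile(r"[A-Za-z_?$@][A-Za-z0-9_?$@]*")
SIGNIFICANT = 32  # the text says only the first 32 characters are used

def is_identifier(text):
    """True if text follows the identifier rules stated in the text."""
    return len(text) <= 255 and IDENT_RE.fullmatch(text) is not None

def canonical(text, case_sensitive=False):
    """Form under which two identifiers count as the same name."""
    key = text[:SIGNIFICANT]
    return key if case_sensitive else key.upper()

print(is_identifier("@loop_1"))                  # True
print(is_identifier("1st"))                      # False
print(canonical("Start") == canonical("START"))  # True
```

Two names that differ only after the 32nd character map to the same canonical form, which models the translator ignoring the rest of a long identifier.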

Assembly language commands.

Assembler commands make it possible to convey the programmer's requirements to the computer, including the mechanisms for transferring control within a program (loops and jumps) that support logical comparisons and program organization. Programming tasks are rarely simple: most programs contain a series of loops, in which several instructions are repeated until a certain condition is met, and various checks that determine which of several actions to perform. Some commands transfer control by changing the normal sequence of steps, directly modifying the offset value in the instruction pointer. As mentioned earlier, different processors have different commands; we will consider a number of commands for the 80186, 80286, and 80386 processors.

To describe the state of the flags after executing a certain command, we will use a selection from the table that reflects the structure of the eflags flag register:

The bottom row of this table lists the values ​​of the flags after the command is executed. In this case, the following notations are used:

1 - after the command is executed, the flag is set (equal to 1);

0 - after the command is executed, the flag is reset (equal to 0);

r - the value of the flag depends on the result of the command;

After executing the command, the flag is undefined;

space - after executing the command, the flag does not change;

The following notation is used to represent operands in syntax diagrams:

r8, r16, r32 - operand in one of the registers, of size byte, word, or double word;

m8, m16, m32, m48 - operand in memory, of size byte, word, double word, or 48 bits;

i8, i16, i32 - immediate operand of size byte, word or double word;

a8, a16, a32 - relative address (offset) in the code segment.

Commands (in alphabetical order):

*These commands are described in detail.

ADD
(ADDition)

Addition

Command outline:

add destination, source

Purpose: addition of the source and destination operands, which may be bytes, words, or double words.

Work algorithm:

add the source and destination operands;

write the result of addition to the receiver;

set flags.

Status of flags after command execution:

Application:
The add command is used to add two integer operands. The result of the addition is placed at the address of the first operand. If the result of the addition goes beyond the bounds of the destination operand (an overflow occurs), then this situation should be taken into account by analyzing the cf flag and then possibly using the adc command. For example, let's add the values ​​in register ax and memory area ch. When adding, you should take into account the possibility of overflow.
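
The role of the cf flag and of adc can be illustrated with a Python model of 16-bit addition. This is a sketch, not the processor's definition: the functions add16 and adc16 are invented names, and only the carry flag is modeled.

```python
# Model of 16-bit add and adc; only the carry flag cf is tracked.
# add16 and adc16 are hypothetical helper names.

def add16(dest, src):
    """Return (result, cf) for a 16-bit add."""
    total = (dest & 0xFFFF) + (src & 0xFFFF)
    return total & 0xFFFF, int(total > 0xFFFF)

def adc16(dest, src, cf):
    """Return (result, cf) for a 16-bit add that also adds the incoming carry."""
    total = (dest & 0xFFFF) + (src & 0xFFFF) + cf
    return total & 0xFFFF, int(total > 0xFFFF)

# Add 0x0001FFFF + 0x00000001 as two 16-bit halves.
low, cf = add16(0xFFFF, 0x0001)        # low halves: overflow sets cf
high, _ = adc16(0x0001, 0x0000, cf)    # high halves plus the carry
print(hex(low), cf, hex(high))         # 0x0 1 0x2
```

The low-half addition overflows and sets cf, and adc carries that bit into the high half, giving the 32-bit result 0x00020000.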

Register plus register or memory:

|000000dw|modregr/m|

Register AX (AL) plus immediate value:

|0000010w|--data--|data if w=1|

Register or memory plus immediate value:

|100000sw|mod000r/m|--data--|data if sw=01|
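
The object code formats above can be made concrete by assembling two forms of add by hand. The Python sketch below assumes 16-bit operands (w=1), a register-to-register form with d=0 (the reg field names the source), and the standard 16-bit register numbering; the function names are invented for the illustration.

```python
# Hand-encoding two forms of add from the bit layouts above.
# Function names are illustrative assumptions.

REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}

def encode_add_r16_r16(dest, src):
    """|000000dw|mod reg r/m| with d=0, w=1 and mod=11 (register operand)."""
    opcode = 0b00000001
    modrm = (0b11 << 6) | (REG16[src] << 3) | REG16[dest]
    return bytes([opcode, modrm])

def encode_add_ax_imm16(value):
    """|0000010w|data-low|data-high| with w=1."""
    return bytes([0b00000101, value & 0xFF, (value >> 8) & 0xFF])

print(encode_add_r16_r16("ax", "bx").hex())  # 01d8
print(encode_add_ax_imm16(0x1234).hex())     # 053412
```

The immediate value is stored low byte first, so 1234h appears in the object code as 34 12.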

CALL
(CALL)

Calling a procedure or task

Command outline:

Purpose:

transfer of control to a near or far procedure, with the address of the return point stored on the stack;

task switching.

Work algorithm:
determined by the operand type:

near label - the contents of the eip/ip command pointer are pushed onto the stack, and the new address value corresponding to the label is loaded into the same register;

far label - the contents of the cs register and the eip/ip command pointer are pushed onto the stack, and the new address values corresponding to the far label are then loaded into the same registers;

r16, r32 or m16, m32 - a register or memory cell containing the offset in the current code segment to which control is transferred. When control is transferred, the contents of the eip/ip command pointer are pushed onto the stack;

memory pointer - a memory location containing a 4- or 6-byte pointer to the procedure being called. The structure of such a pointer is 2+2 or 2+4 bytes. The interpretation of such a pointer depends on the operating mode of the microprocessor:

State of flags after command execution (except task switch):

command execution does not affect flags

When a task is switched, the values ​​of the flags are changed according to the information about the eflags register in the TSS status segment of the task being switched to.
Application:
The call command allows you to organize a flexible and multivariate transfer of control to a subroutine while maintaining the address of the return point.

Object code (four formats):

Direct addressing in a segment:

|11101000|disp-low|disp-high|

Indirect addressing in a segment:

|11111111|mod010r/m|

Indirect addressing between segments:

|11111111|mod011r/m|

Direct addressing between segments:

|10011010|offset-low|offset-high|seg-low|seg-high|
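
For the direct in-segment format, the displacement is counted from the address of the instruction that follows the call. A Python sketch of this calculation, assuming 16-bit code where the instruction is 3 bytes long (the function name and the offsets are invented for the illustration):

```python
# Encoding a direct near call: |11101000|disp-low|disp-high|.
# encode_near_call is a hypothetical helper; offsets are within one code segment.

def encode_near_call(call_offset, target_offset):
    next_ip = (call_offset + 3) & 0xFFFF        # ip after the 3-byte call
    disp = (target_offset - next_ip) & 0xFFFF   # wraps for backward calls
    return bytes([0xE8, disp & 0xFF, disp >> 8])

print(encode_near_call(0x0100, 0x0200).hex())  # e8fd00
print(encode_near_call(0x0200, 0x0100).hex())  # e8fdfe
```

Because the displacement is relative, the same bytes remain valid if the whole code segment is loaded at a different address.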

CMP
(compare operands)

Operand comparison

Command outline:

cmp operand1, operand2

Purpose: comparison of two operands.

Work algorithm:

perform subtraction (operand1-operand2);

depending on the result, set flags, do not change operand1 and operand2 (that is, do not store the result).

Application:
This instruction compares two operands by subtraction without changing them. Flags are set as a result of command execution. The cmp instruction is used together with the conditional jump instructions and with setcc, the instruction that sets a byte according to a condition.
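
How cmp sets the flags can be modeled in Python. This is a sketch for 16-bit operands that tracks only zf, cf, sf, and of; the function name cmp16 is invented for the illustration.

```python
# Model of a 16-bit cmp: subtract, set flags, discard the difference.
# cmp16 is a hypothetical helper name.

def cmp16(op1, op2):
    """Return the flags produced by cmp op1, op2; the operands are not changed."""
    a, b = op1 & 0xFFFF, op2 & 0xFFFF
    diff = (a - b) & 0xFFFF
    return {
        "zf": int(diff == 0),                     # operands are equal
        "cf": int(a < b),                         # borrow: op1 is below op2 (unsigned)
        "sf": diff >> 15,                         # sign bit of the difference
        "of": int(((a ^ b) & (a ^ diff)) >> 15),  # signed overflow
    }

print(cmp16(5, 5))  # {'zf': 1, 'cf': 0, 'sf': 0, 'of': 0}
print(cmp16(3, 5))  # {'zf': 0, 'cf': 1, 'sf': 1, 'of': 0}
```

A conditional jump placed after cmp inspects exactly these flags: zf for equality, cf for unsigned ordering, and sf together with of for signed ordering.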

Object code (three formats):

Register or memory with register:

|001110dw|modregr/m|

Immediate value with register AX (AL):

|0011110w|--data--|data if w=1|

Immediate value with register or memory:

|100000sw|mod111r/m|--data--|data if sw=01|

DEC
(DECrement operand by 1)

Operand decrement by one

Command outline:

dec operand

Purpose: decrease the value of the operand in memory or register by 1.

Work algorithm:
the instruction subtracts 1 from the operand.

Status of flags after command execution:

Application:
The dec command is used to decrement the value of a byte, word, double word in memory or register by one. Note that the command does not affect the cf flag.
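
The remark about the cf flag can be shown with a small Python model. It is a sketch for 16-bit operands that tracks only cf, zf, and sf; the function name dec16 is invented for the illustration.

```python
# Model of a 16-bit dec: the result wraps around, and cf is left as it was.
# dec16 is a hypothetical helper name.

def dec16(value, flags):
    result = (value - 1) & 0xFFFF
    new_flags = dict(flags)             # cf is copied unchanged
    new_flags["zf"] = int(result == 0)
    new_flags["sf"] = result >> 15
    return result, new_flags

print(dec16(0, {"cf": 0, "zf": 1, "sf": 0}))
# (65535, {'cf': 0, 'zf': 0, 'sf': 1})
```

Decrementing 0 wraps to 0FFFFh, yet cf stays 0, so a borrow from dec cannot be detected through the carry flag.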

Register: |01001reg|

Register or memory: |1111111w|mod001r/m|

DIV
(DIVide unsigned)

Division unsigned

Command outline:

div divisor

Purpose: to perform a division operation on two binary unsigned values.

Work algorithm:
The command requires two operands, the dividend and the divisor. The dividend is specified implicitly, and its size depends on the size of the divisor, which is specified in the command:

if the divisor is a byte, then the dividend must be located in register ax. After the operation, the quotient is placed in al and the remainder in ah;

if the divisor is a word, then the dividend must be located in the register pair dx:ax, with the low part of the dividend in ax. After the operation, the quotient is placed in ax and the remainder in dx;

if the divisor is a double word, then the dividend must be located in the register pair edx:eax, with the low part of the dividend in eax. After the operation, the quotient is placed in eax and the remainder in edx.

State of flags after command execution:

Application:
The command performs an integer division of the operands, returning the result of the division as a quotient and the remainder of the division. When performing a division operation, an exception may occur: 0 - division error. This situation occurs in one of two cases: the divisor is 0, or the quotient is too large to fit into the eax/ax/al register.
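
The byte and word cases of div, including both causes of the division error, can be modeled in Python. This is a sketch: Python exceptions stand in for exception 0, and the function names div8 and div16 are invented for the illustration.

```python
# Model of unsigned div for byte and word divisors.
# div8 and div16 are hypothetical helpers; Python exceptions stand in
# for the processor's exception 0 (division error).

def div8(ax, divisor):
    """Dividend in ax; returns (al, ah) = (quotient, remainder)."""
    ax &= 0xFFFF
    divisor &= 0xFF
    if divisor == 0:
        raise ZeroDivisionError("division error: divisor is 0")
    quotient, remainder = divmod(ax, divisor)
    if quotient > 0xFF:
        raise OverflowError("division error: quotient does not fit in al")
    return quotient, remainder

def div16(dx, ax, divisor):
    """Dividend in dx:ax; returns (ax, dx) = (quotient, remainder)."""
    dividend = ((dx & 0xFFFF) << 16) | (ax & 0xFFFF)
    divisor &= 0xFFFF
    if divisor == 0:
        raise ZeroDivisionError("division error: divisor is 0")
    quotient, remainder = divmod(dividend, divisor)
    if quotient > 0xFFFF:
        raise OverflowError("division error: quotient does not fit in ax")
    return quotient, remainder

print(div8(1000, 7))      # (142, 6)
print(div16(1, 0, 16))    # (4096, 0)
```

Dividing 1000 by 2 with div8 would raise the overflow error, because the quotient 500 does not fit in the 8-bit al register.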

Object code:

|1111011w|mod110r/m|

INT
(INTerrupt)

Calling an interrupt service routine

Command outline:

int interrupt_number

Purpose: call the interrupt service routine with the interrupt number specified by the instruction operand.

Work algorithm:

push the eflags/flags register and the return address onto the stack. When writing the return address, the contents of the cs segment register are first written, then the contents of the eip/ip command pointer;

reset the if and tf flags to zero;

transfer control to the interrupt handler with the specified number. The control transfer mechanism depends on the operating mode of the microprocessor.

State of flags after command execution:

Application:
As you can see from the syntax, there are two forms of this command:

int 3 - has its own individual opcode 0cch and occupies one byte. This circumstance makes it very convenient to use in various software debuggers to set breakpoints by replacing the first byte of any instruction. The microprocessor, encountering a command with the opcode 0cch in the sequence of commands, calls the interrupt handler with the vector number 3, which serves to communicate with the software debugger.

The second form of the instruction is two bytes long, has an opcode of 0cdh, and allows you to initiate a call to an interrupt service routine with a vector number in the range 0-255. Features of the transfer of control, as noted, depend on the operating mode of the microprocessor.

Object code (two formats):

int 3: |11001100|

int interrupt_number: |11001101|interrupt_number|
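The two-byte form of the command is what a program uses to request operating system services. The sketch below is a 16-bit MASM program that assumes MS-DOS and its int 21h services (function 09h prints a '$'-terminated string, function 4Ch terminates the program); these services, the small memory model and the name MSG are not part of the description above and serve only as an illustration.

```asm
.MODEL SMALL
.STACK 100h
.DATA
MSG DB "Hello", 13, 10, "$"   ; function 09h expects a '$'-terminated string
.CODE
START:
    mov  ax, @data       ; address of the data segment
    mov  ds, ax
    mov  dx, OFFSET MSG  ; ds:dx -> string to print
    mov  ah, 09h         ; assumed MS-DOS service: print string
    int  21h             ; two-byte form: 0CDh, 21h
    mov  ax, 4C00h       ; assumed MS-DOS service 4Ch: exit, return code 0
    int  21h
END START
```

Writing int 3 in place of int 21h would produce the single byte 0CCh, which is why debuggers can plant it over the first byte of any instruction.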

JCC
(Jump if condition)

Jump if condition is met

JCXZ/JECXZ
(Jump if CX=Zero/ Jump if ECX=Zero)

Jump if CX/ECX is zero

Command outline:

jcc label
jcxz label
jecxz label

Purpose: jump within the current code segment, depending on some condition.

Command algorithm (except for jcxz/jecxz):
Check the state of the flags selected by the opcode (the opcode reflects the condition being checked):

if the condition is true, transfer control to the label given by the operand;

if the condition is false, pass control to the next command.

jcxz/jecxz command algorithm:
Check whether the contents of the ecx/cx register are equal to zero:

if the condition is true (ecx/cx = 0), transfer control to the label given by the operand;

if the condition is false, pass control to the next command.
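Both kinds of jump can be seen in a short sketch that finds the largest element of an array. It uses MASM syntax and a flat-model skeleton; the data names ARR, COUNT, MAXV and the labels are illustrative assumptions. jecxz protects the loop from a zero counter, jae tests the flags left by cmp, and jnz tests the zf flag left by dec.

```asm
.686P
.MODEL FLAT, STDCALL
.DATA
ARR   DD 3, 9, 4        ; sample unsigned data
COUNT DD 3              ; number of elements (may be 0)
MAXV  DD 0              ; receives the largest element
.CODE
START:
    mov   ecx, COUNT
    jecxz DONE          ; ecx = 0: nothing to scan, skip the loop
    mov   edx, OFFSET ARR
    mov   eax, [edx]    ; current maximum = first element
NEXT_ELEM:
    cmp   eax, [edx]    ; compare the maximum with the current element
    jae   KEEP          ; unsigned eax >= element: keep the maximum
    mov   eax, [edx]    ; otherwise the element becomes the maximum
KEEP:
    add   edx, 4        ; advance to the next doubleword
    dec   ecx
    jnz   NEXT_ELEM     ; zf = 0 after dec: elements remain
    mov   MAXV, eax     ; 9 for the sample data
DONE:
    ret
END START
```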

By purpose, the following groups of commands can be distinguished (examples of the mnemonic opcodes of an assembler for a PC such as the IBM PC are given in brackets):

· arithmetic operations (ADD and ADC - addition and addition with carry, SUB and SBB - subtraction and subtraction with borrow, MUL and IMUL - unsigned and signed multiplication, DIV and IDIV - unsigned and signed division, CMP - comparison, etc.);

· logical operations (OR, AND, NOT, XOR, TEST, etc.);

· data transfer (MOV - move, XCHG - exchange, IN - input to the microprocessor, OUT - output from the microprocessor, etc.);

· transfer of control (program branches: JMP - unconditional jump, CALL - procedure call, RET - return from a procedure, J* - conditional jump, LOOP - loop control, etc.);

· processing of character strings (MOVS - move, CMPS - compare, LODS - load, SCAS - scan; these commands are usually used with the repetition prefix REP);

· program interrupts (INT - software interrupt, INTO - conditional interrupt on overflow, IRET - return from interrupt);

· microprocessor control (ST* and CL* - set and clear flags, HLT - halt, WAIT - wait, NOP - no operation, etc.).

A complete list of assembler commands can be found in the works.

Data transfer commands

· MOV dst, src - data transfer (move from src to dst).

Transfers one byte (if src and dst are bytes) or one word (if src and dst are words) between registers or between a register and memory, and also writes an immediate value to a register or to memory.

The operands dst and src must have the same format: byte or word.

Src can be of type r (register), m (memory) or i (immediate value). Dst can be of type r or m. The following combinations cannot be used in one command: rsegm (segment register) together with i; two operands of type m; two operands of type rsegm. Operand i can also be a simple expression:

mov AX, (152 + 101B) / 15

The expression is evaluated only during translation. Flags do not change.

· PUSH src - push a word onto the stack. Places the contents of src on the top of the stack; src is any 16-bit register (including a segment register) or two memory cells containing a 16-bit word. Flags do not change.

· POP dst - pop a word from the stack. Removes a word from the top of the stack and places it in dst: any 16-bit register (including a segment register other than cs) or two memory cells. Flags do not change.
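The three commands can be tried together in a small 16-bit MASM program. This is only a sketch: the small memory model, the variable VAL and the MS-DOS exit call (int 21h, function 4Ch) are assumptions made to obtain a complete program, not part of the command descriptions.

```asm
.MODEL SMALL
.STACK 100h
.DATA
VAL DW 1234h                    ; illustrative 16-bit variable
.CODE
START:
    mov  ax, @data              ; i -> r
    mov  ds, ax                 ; r -> rsegm (i -> rsegm is not allowed)
    mov  ax, (152 + 101B) / 15  ; computed by the translator: (152 + 5) / 15 = 10
    mov  bx, VAL                ; m -> r
    push ax                     ; top of stack: 10
    push bx                     ; top of stack: 1234h, then 10
    pop  ax                     ; ax = 1234h
    pop  bx                     ; bx = 10, the registers are now exchanged
    mov  VAL, bx                ; r -> m (m -> m in one mov is not allowed)
    mov  ax, 4C00h              ; assumed MS-DOS service 4Ch: exit
    int  21h
END START
```

Because the stack returns words in the reverse order of pushing, the push/push/pop/pop sequence exchanges ax and bx without a third register.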

Course work

In the discipline "System Programming"

Topic number 4: "Solving problems for procedures"

Option 2

EAST SIBERIAN STATE UNIVERSITY

TECHNOLOGY AND MANAGEMENT

____________________________________________________________________

TECHNOLOGICAL COLLEGE

ASSIGNMENT

for the term paper

Discipline:
Topic: Problem solving for procedures
Performer(s): Glavinskaya Arina Alexandrovna
Supervisor: Sesegma Viktorovna Dambaeva
Brief summary of the work: the study of subroutines in assembly language, problem solving using subroutines
1. Theoretical part: basic information about the assembly language (command set, etc.), organization of subroutines, ways of passing parameters to subroutines
2. Practical part: develop two subroutines, one of which converts any given letter to uppercase (including Russian letters), and the other converts the letter to lowercase.
Project timeline according to the schedule:
1. Theoretical part - 30% by week 7.
2. Practical part - 70% by week 11.
3. Defense - 100% by week 14.
Design requirements:
1. The calculation and explanatory note (RPP) of the course project must be submitted in electronic and hard copies.
2. The report must be at least 20 typewritten pages long, excluding appendices.
3. The RPP is drawn up in accordance with GOST 7.32-91 and signed by the supervisor.

Supervisor __________________

Performer __________________

Date of issue: September 26, 2017


Introduction

1.1 Basic information about the assembly language

1.1.1 Command set

1.2 Organization of subroutines in assembly language

1.3 Methods for passing parameters in subroutines

1.3.1 Passing parameters through registers

1.3.2 Passing parameters through the stack

2 PRACTICAL SECTION

2.1 Statement of the problem

2.2 Description of the problem solution

2.3 Testing the program

Conclusion

References


Introduction

It is well known that programming in Assembly language is difficult. There are now many high-level languages that let a programmer write programs with much less effort. Naturally, the question arises of when a programmer may need Assembler at all. At present there are two areas where the use of assembly language is justified, and often necessary.

First, there are the so-called machine-dependent system programs, which usually control the various devices of a computer (such programs are called drivers). These system programs use special machine instructions that are not needed in ordinary (or, as they say, application) programs. Such instructions are impossible or very difficult to express in a high-level language.

The second area of application of Assembler is the optimization of program execution. Translator programs (compilers) for high-level languages often produce an inefficient machine language program. This usually matters for programs of a computational nature, in which a very small section of the program (about 3-5%, the main loop) accounts for most of the execution time. To solve this problem, so-called multilingual programming systems can be used, which allow parts of a program to be written in different languages. Usually the main part of the program is written in a high-level programming language (Fortran, Pascal, C, etc.), and the time-critical sections are written in Assembler. The speed of the whole program can then increase significantly. Often this is the only way to make a program produce a result in an acceptable time.

The purpose of this term paper is to gain practical skills in programming in assembly language.

Work tasks:

1. Study the basic information about the Assembler language (the structure and components of an Assembler program, the format of commands, the organization of subroutines, etc.);

2. Study the types of bit operations and the format and logic of the Assembler logical commands;

3. Solve an individual problem on the use of subroutines in Assembler;

4. Formulate a conclusion about the work done.

1 THEORETICAL SECTION

Assembly language basics

Assembler is a low-level programming language that is a format for writing machine instructions that is convenient for human perception.

Assembly language commands correspond one to one to processor commands and, in fact, represent a convenient symbolic form of notation (mnemonic code) of commands and their arguments. Assembly language also provides basic programming abstractions: linking parts of a program and data through labels with symbolic names and directives.

Assembly directives allow you to include blocks of data (described explicitly or read from a file) into the program; repeat a certain fragment a specified number of times; compile the fragment according to the condition; set the fragment execution address, change label values ​​during compilation; use macro definitions with parameters, etc.

Advantages and disadvantages

Advantages:

· a minimum of redundant code (fewer commands and memory accesses) and, as a consequence, greater speed and smaller program size;

· direct access to hardware: input-output ports, special processor registers;

· maximum "fitting" to the target platform (use of special instructions and technical features of the hardware).

Disadvantages:

· large amounts of source code and a large number of additional small tasks;

· poor readability of the code, difficulty of maintenance (debugging, adding features);

· difficulty of implementing programming paradigms and any other reasonably complex conventions, complexity of joint development;

· fewer available libraries and their low compatibility;

· non-portability to other platforms (except binary compatible ones).

In addition to instructions, a program may contain directives: commands that are not translated directly into machine instructions but control the operation of the translator. Their set and syntax vary considerably and depend not on the hardware platform but on the translator used (which gives rise to dialects of the language within one family of architectures). The following groups of directives can be distinguished:

· definition of data (constants and variables);

· control of the organization of the program in memory and of the parameters of the output file;

· setting the operating mode of the translator;

· all kinds of abstractions (i.e. elements of high-level languages), from the declaration of procedures and functions (to simplify the procedural programming paradigm) to conditional structures and loops (for the structured programming paradigm);

· macros.
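A few of these groups can be shown in one short sketch. It uses MASM syntax; other translators spell the same ideas differently, and the names DEBUG, COUNT, BUFFER and TABLE are hypothetical. None of the directives produces a machine instruction by itself: they define data, select the mode of the translator, repeat a fragment and include a fragment conditionally.

```asm
.686P                     ; translator mode: allowed instruction set
.MODEL FLAT, STDCALL      ; organization of the program in memory
DEBUG EQU 1               ; hypothetical translation-time constant
.DATA
COUNT  DD 10              ; data definition: a 32-bit variable
BUFFER DB 16 DUP (0)      ; data definition: 16 zero bytes
TABLE  LABEL BYTE         ; name for the bytes generated below
REPT 4                    ; repeat the enclosed fragment four times
    DB 0FFh
ENDM
.CODE
START:
    mov  eax, COUNT
IF DEBUG                  ; translated only when DEBUG is nonzero
    mov  COUNT, 0
ENDIF
    ret
END START
```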

Command set

Typical assembly language instructions are:

Data transfer commands (mov, etc.)

Arithmetic commands (add, sub, imul, etc.)

Logical and bitwise operations (or, and, xor, shr, etc.)

Commands for managing the program execution (jmp, loop, ret, etc.)

Interrupt call commands (sometimes referred to as control commands): int

I/O commands for ports (in, out)

Microcontrollers and microcomputers are also characterized by commands that perform a check and a conditional jump, for example:

· jne - jump if not equal;

· jge - jump if greater than or equal.
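As a minimal illustration in MASM syntax for x86, the sketch below computes the absolute value of a number. jge is taken when the signed comparison made by the preceding cmp gives "greater than or equal"; jne is used the same way but looks only at the zf flag. The names VALUE, RESULT and STORE are illustrative assumptions.

```asm
.686P
.MODEL FLAT, STDCALL
.DATA
VALUE  DD -25        ; sample signed value
RESULT DD 0          ; receives the absolute value (25 here)
.CODE
START:
    mov  eax, VALUE
    cmp  eax, 0
    jge  STORE       ; signed eax >= 0: nothing to change
    neg  eax         ; eax was negative: change its sign
STORE:
    mov  RESULT, eax
    ret
END START
```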

For the machine to carry out human commands at the hardware level, the sequence of actions has to be specified in the language of "zeros and ones". An assembler helps with this: it is a utility that translates commands into machine language. Writing a program in assembly language, however, is a time-consuming and complex process, and the language is not intended for quick, simple tasks. Its main purpose is to create small, efficient pieces of code that work closely with the hardware. In this respect the language gives the programmer more control over the machine than, for example, Pascal or C.

Brief description of assembly languages

All programming languages are divided into low-level and high-level ones. The languages of the Assembler "family" are low-level, yet they share some conveniences with the most common modern languages, and they give the programmer full use of the computer system.

A distinctive feature of an assembler as a translator is its simplicity compared with compilers for high-level languages. A carefully written assembly program can also run noticeably faster than the equivalent program compiled from a high-level language. Writing a small program in it does not take long.

Briefly about the structure of the language

In general terms, the commands of the language correspond fully to the commands of the processor. The assembler uses mnemonic codes, which are more convenient for a person to write than numeric machine codes.

Unlike machine code, Assembler lets the programmer refer to memory cells by symbolic labels instead of numeric addresses. During translation the labels are replaced by the corresponding relative addresses. A program also contains directives, which are not translated into machine language and do not affect the operation of the processor, but are needed by the translator itself.

Each processor family has its own instruction set and therefore its own assembly language; a program is correct only for the family for which it was written and translated.

Assembly language has several syntaxes, which will be discussed in the article.

Language pros

The most important advantage of assembly language is that any program the processor can execute can be written in it, and the result is very compact. If the code is very large, part of the work is moved to RAM. Such programs run quickly and reliably, provided, of course, that they are written by a qualified programmer.

Drivers, operating systems, the BIOS, compilers, interpreters, etc. have traditionally been written, at least in part, in assembly language.

With a disassembler, which translates machine code back into assembly language, it is possible to understand how a particular system task works even when no documentation is available. However, this is practical only for small programs; non-trivial code is quite difficult to understand.

Cons of the language

Unfortunately, the language is difficult for novice programmers (and often for professionals) to master. Assembler requires a detailed description of every required action. Because the programmer works directly with machine instructions, both the probability of errors and the complexity of the work increase.

To write even the simplest program, the programmer must be qualified and have a high level of knowledge. An average specialist, unfortunately, often writes poor code.

If the platform for which the program was created changes, all commands must be rewritten by hand; this follows from the nature of the language. The assembler does not adapt a program to a new platform automatically and does not substitute any of its elements.

Language commands

As mentioned above, each processor has its own set of instructions. The simplest commands, recognized by any type of processor, are the following:


Using directives

Microcontrollers are very often programmed in the lowest-level language, and an assembler copes with this task well. Assembly language is most useful for processors with limited resources, but it is equally suitable for 32-bit hardware. Assembly code often contains directives. What are they, and what are they used for?

First of all, directives are not translated into machine language. They control how the translator works. Unlike commands, directives differ not from processor to processor but from translator to translator. The main directives include the following:


Origin of the name

Where does the name of the language, "Assembler", come from? It refers to the translator that converts the program text into machine code. The English word assembler means one who assembles: the program was put together not by hand but automatically. Today users and specialists often blur the distinction between the two terms, and the programming language itself is called assembler, although strictly speaking the assembler is only a utility.

Because of the common collective name, some people mistakenly assume that there is a single low-level language (or a standard for it). For a programmer to understand which language is meant, the platform for which a given assembly language is intended has to be specified.

Macro tools

Relatively recent assemblers have macro facilities. They make a program easier to write: instead of repeating a large block of commands, for example for each branch of a conditional choice, the programmer defines a macro once and refers to it by name, and the translator substitutes the block wherever the macro is used.

Using the directives of the macro language, the programmer defines Assembler macros. A macro may stand for a large fragment of code or for a single command. Macros make the code easier to work with, clearer and more readable. Some care is still needed: in some cases macros make the program harder to follow.
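A minimal sketch of a macro in MASM syntax is given below. The macro name SWAP32, its parameters and the variables A_VAL and B_VAL are hypothetical. Each use of the macro is replaced by the translator with the four commands of its body, with the actual operands substituted for the parameters.

```asm
.686P
.MODEL FLAT, STDCALL
; Hypothetical macro: exchange two 32-bit operands through the stack.
SWAP32 MACRO first, second
    push first
    push second
    pop  first
    pop  second
ENDM
.DATA
A_VAL DD 1
B_VAL DD 2
.CODE
START:
    SWAP32 A_VAL, B_VAL   ; after expansion: A_VAL = 2, B_VAL = 1
    SWAP32 eax, edx       ; the same body works for registers
    ret
END START
```

The trade-off mentioned above is visible even here: the source text is shorter, but the reader has to know what SWAP32 expands to in order to see which commands are actually executed.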

Assembly language instruction structure

Programming at the level of machine instructions is the lowest level at which computer programming is possible. The system of machine instructions must be sufficient to implement the required actions by issuing instructions to the machine hardware. Each machine instruction consists of two parts: an operation part that defines "what to do", and an operand part that defines the objects to be processed, that is, "what to do it with". A microprocessor machine instruction written in assembly language is a single line of the following form:

label instruction/directive operand(s) ; comments

The label, the instruction/directive and the operand are separated by at least one space or tab character. The operands of an instruction are separated by commas.

An assembly language instruction tells the translator what action the microprocessor should perform. Assembly directives are parameters specified in the program text that affect the assembly process or the properties of the output file. The operand specifies the initial value of the data (in the data segment) or the elements acted upon by the instruction (in the code segment). An instruction may have one or two operands, or none. The number of operands is determined implicitly by the instruction code. If a command or directive has to be continued on the next line, the backslash character "\" is used. By default, the assembler does not distinguish between uppercase and lowercase letters in commands and directives.

Directive and command examples:

Count db 1 ; name, directive, one operand
mov eax, 0 ; command, two operands

Identifiers

Identifiers are sequences of valid characters used for the names of variables and labels. An identifier may consist of one or more of the following characters: all letters of the Latin alphabet; digits from 0 to 9; the special characters _, @, $, ?. A dot can be used as the first character of a label. Reserved assembler names (directives, operators, command names) cannot be used as identifiers. The first character of an identifier must be a letter or a special character. The maximum identifier length is 255 characters, but the translator accepts the first 32 characters and ignores the rest.

Labels

All labels written on a line that does not contain an assembler directive must end with a colon ":". The label, the command (directive) and the operand do not have to start at any particular position in the line. It is recommended to write them in columns to make the program more readable.

Comments

The use of comments in a program improves its clarity, especially where the purpose of a set of instructions is not obvious. A comment begins with a semicolon (;) on any line of a source module. All characters to the right of ";" up to the end of the line are a comment. A comment can contain any printable characters, including the space. A comment can occupy an entire line or follow a command on the same line.

Structure of an assembly language program

An assembly language program can be composed of several parts, called modules, each of which can define one or more data, stack and code segments. Any complete assembly language program must include one main module, from which its execution begins. A module may contain code, data and stack segments declared with the appropriate directives.

Memory models

Before declaring segments, the memory model must be specified with the directive

.MODEL modifier memory_model, calling_convention, OS_type, stack_parameter

Basic assembly language memory models:

Memory model   Code addressing   Data addressing   Operating system                       Code and data interleaving
TINY           NEAR              NEAR              MS-DOS                                 Allowed
SMALL          NEAR              NEAR              MS-DOS, Windows                        No
MEDIUM         FAR               NEAR              MS-DOS, Windows                        No
COMPACT        NEAR              FAR               MS-DOS, Windows                        No
LARGE          FAR               FAR               MS-DOS, Windows                        No
HUGE           FAR               FAR               MS-DOS, Windows                        No
FLAT           NEAR              NEAR              Windows 2000, Windows XP, Windows NT   Allowed

The tiny model works only in 16-bit MS-DOS applications. In this model all data and code reside in one physical segment, so the size of the program file does not exceed 64 KB. The small model supports one code segment and one data segment; data and code are addressed as near. The medium model supports multiple code segments and one data segment; all references in the code segments are considered far by default, and references in the data segment are considered near. The compact model supports multiple data segments that use far data addressing and one code segment that uses near code addressing. The large model supports multiple code segments and multiple data segments; by default, all code and data references are considered far. The huge model is almost equivalent to the large memory model.

The flat model assumes a non-segmented program configuration and is used only on 32-bit operating systems. This model is similar to the tiny model in that the data and code reside in the same segment, but the segment is 32-bit. To develop a program for the flat model, one of the directives .386, .486, .586 or .686 must be placed before the .MODEL FLAT directive. The processor selection directive determines the set of commands available when writing programs. The letter P after the processor selection directive means the protected mode of operation. Data and code addressing is near, and all addresses and pointers are 32-bit.

The modifier parameter is used to define segment types and can take the following values: use16 (segments of the selected model are used as 16-bit) or use32 (segments of the selected model are used as 32-bit). The calling_convention parameter determines how parameters are passed when a procedure is called from other languages, including high-level languages (C++, Pascal). The parameter can take the following values: C, BASIC, FORTRAN, PASCAL, SYSCALL, STDCALL.

The OS_type parameter is OS_DOS by default, and this is currently the only supported value of the parameter. The stack_parameter parameter is set to NEARSTACK (the SS register equals DS; the data and stack regions are located in the same physical segment) or FARSTACK (the SS register is not equal to DS; the data and stack regions are located in different physical segments). The default is NEARSTACK.

An example of a "doing nothing" program

.686P
.MODEL FLAT, STDCALL
.DATA
.CODE
START:
RET
END START

RET is a microprocessor command. It ensures the correct termination of the program. The rest of the program concerns the operation of the translator.

.686P - allows the protected mode commands of the P6 processor family (Pentium Pro, Pentium II). This directive selects the supported assembler instruction set by specifying the processor model.

.MODEL FLAT, STDCALL - the flat memory model. This memory model is used in the Windows operating system. STDCALL is the procedure calling convention to use.

.DATA - the program segment containing data. This program does not use the stack, so the .STACK segment is missing.

.CODE - the program segment containing the code.

START - a label.

END START - the end of the program and a message to the translator that the program must be started from the label START. Every program must contain an END directive marking the end of the program source code. All lines that follow the END directive are ignored. The label after the END directive tells the translator the name of the main module from which program execution begins. If the program contains one module, the label after the END directive can be omitted.

Assembly language translators

A translator is a program or technical means that converts a program in one programming language into a program in the target language, called object code. In addition to supporting machine instruction mnemonics, each translator has its own set of directives and macro facilities, often incompatible with anything else. The main assembly language translators are:

· MASM (Microsoft Assembler);

· TASM (Borland Turbo Assembler);

· FASM (Flat Assembler) - a freely distributed multi-pass assembler written by Tomasz Grysztar (Poland);

· NASM (Netwide Assembler) - a free assembler for the Intel x86 architecture, created by Simon Tatham together with Julian Hall and currently developed by a small team at SourceForge.net.

Translation of the program in Microsoft Visual Studio 2005

1) Create a project by selecting the File->New->Project menu and specifying the project name (hello.prj) and the project type: Win32 Project. In the additional options of the project wizard, specify "Empty Project".

2) In the project tree (View->Solution Explorer), add the file that will contain the program text: Source Files->Add->New Item.

3) Select the Code C++ file type, but give the file a name with the .asm extension.

5) Set the compiler options. Right-click the project file and select Custom Build Rules... in the menu, and in the window that appears select Microsoft Macro Assembler.

To check the setting, right-click the file hello.asm in the project tree, select Properties in the menu and set General->Tool: Microsoft Macro Assembler.

6) Compile the file by selecting Build->Build hello.prj.

7) Run the program by pressing F5 or selecting the Debug->Start Debugging menu.

Programming in Windows OS

Programming in Windows is based on the use of API functions (Application Program Interface, i.e. the application programming interface). There are about 2000 of them. A program for Windows consists largely of such calls. All interaction with external devices and operating system resources occurs, as a rule, through these functions. The Windows operating system uses a flat memory model: the address of any memory location is determined by the contents of one 32-bit register. There are 3 types of program structure for Windows: dialog (the main window is a dialog), console or windowless structure, and the classical structure (window, frame).

Calling Windows API functions

In the help file, any API function is presented as

type function_name (FA1, FA2, FA3)

where type is the type of the return value and FAx are the formal arguments in the order they appear. For example,

int MessageBox (HWND hWnd, LPCTSTR lpText, LPCTSTR lpCaption, UINT uType);

This function displays a window with a message and one or more exit buttons. Meaning of the parameters: hWnd - handle of the window in which the message window will appear, lpText - the text that will appear in the window, lpCaption - the text of the window title, uType - the window type; in particular, it determines the number of exit buttons.

Almost all API function parameters are actually 32-bit integers: HWND is a 32-bit integer, LPCTSTR is a 32-bit pointer to a string, UINT is a 32-bit integer. The suffix "A" at the end of a function name (MessageBoxA) denotes the ANSI version of the function, which takes single-byte character strings; the suffix "W" denotes the Unicode version.

When using MASM, @N must be added at the end of the name, where N is the number of bytes that the passed arguments occupy on the stack. For Win32 API functions this number is the number of arguments n times 4 (bytes per argument): N = 4*n. A function is called with the assembler CALL instruction. All arguments are passed to the function through the stack (PUSH command). The arguments are pushed from right to left, so the uType argument is pushed onto the stack first. The call of the function above looks like this:

CALL MessageBoxA@16

The result of any API function is usually an integer, which is returned in the EAX register. The OFFSET operator yields the "offset in the segment" or, in high-level language terms, a "pointer" to the start of a string. The EQU directive, like #define in C, defines a constant. The EXTERN directive tells the translator that a function or identifier is external to the module.

An example of the program "Hello everyone!"

.686P
.MODEL FLAT, STDCALL
.STACK 4096
.DATA
MB_OK EQU 0
STR1 DB "My first program", 0
STR2 DB "Hello everyone!", 0
HW DD ?
EXTERN MessageBoxA@16:NEAR
.CODE
START:
PUSH MB_OK
PUSH OFFSET STR1
PUSH OFFSET STR2
PUSH HW
CALL MessageBoxA@16
RET
END START
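Since the result of an API function comes back in EAX, it can be examined right after the call. The sketch below is a variation of the program above and makes several assumptions that are not in the original: the WinUser constants MB_YESNO = 4 and IDYES = 6, the variable ANSWER and the label FINISH, and linking with the import library that provides MessageBoxA (user32.lib).

```asm
.686P
.MODEL FLAT, STDCALL
.STACK 4096
.DATA
MB_YESNO EQU 4            ; assumed constant: "Yes" and "No" buttons
IDYES    EQU 6            ; assumed constant: value returned for "Yes"
STR1 DB "Question", 0
STR2 DB "Continue?", 0
ANSWER DD 0               ; set to 1 if the user pressed "Yes"
EXTERN MessageBoxA@16:NEAR
.CODE
START:
    PUSH MB_YESNO         ; uType is pushed first
    PUSH OFFSET STR1      ; lpCaption
    PUSH OFFSET STR2      ; lpText
    PUSH 0                ; hWnd: no owner window
    CALL MessageBoxA@16   ; the function itself removes the 16 bytes
    CMP  EAX, IDYES       ; the result is returned in EAX
    JNE  FINISH
    MOV  ANSWER, 1
FINISH:
    RET
END START
```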

The INVOKE directive

The MASM translator also makes it possible to simplify a function call with a macro facility, the INVOKE directive:

INVOKE function, parameter1, parameter2, ...

There is no need to add @16 to the function name, and the parameters are written in exactly the order in which they are given in the function description. The translator itself generates the commands that push the parameters onto the stack. To use the INVOKE directive, the function prototype must be described with the PROTO directive in the form:

MessageBoxA PROTO :DWORD, :DWORD, :DWORD, :DWORD
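A minimal sketch of the "Hello everyone!" program rewritten with INVOKE is given below. The INCLUDELIB line naming user32.lib and the value 0 passed as the window handle are assumptions made to obtain a complete program; the strings and the MB_OK constant are taken from the earlier example.

```asm
.686P
.MODEL FLAT, STDCALL
INCLUDELIB user32.lib     ; assumed import library that provides MessageBoxA
MessageBoxA PROTO :DWORD, :DWORD, :DWORD, :DWORD
.DATA
MB_OK EQU 0
STR1 DB "My first program", 0
STR2 DB "Hello everyone!", 0
.CODE
START:
    ; arguments in declaration order: hWnd, lpText, lpCaption, uType
    INVOKE MessageBoxA, 0, OFFSET STR2, OFFSET STR1, MB_OK
    RET
END START
```

The translator expands the INVOKE line into the same four PUSH commands and the CALL shown in the manual version, so the generated code is equivalent; only the source text is shorter and less error-prone.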
