
The second part of the series is devoted to functions for optimizing the dynamic range of images. In it we explain why such solutions are needed and examine various ways of implementing them, along with their advantages and disadvantages.

Embrace the immensity

Ideally, a camera should capture an image of the surrounding world as a person perceives it. However, because the "vision" mechanisms of a camera and the human eye differ significantly, a number of limitations prevent this from being achieved.

One of the problems that users of film cameras faced in the past, and that owners of digital cameras face now, is the inability to adequately capture scenes with large differences in illumination without special devices and/or special shooting techniques. The human visual system perceives details of high-contrast scenes equally well in both brightly lit and dark areas. Unfortunately, a camera sensor is not always able to capture an image the way we see it.

The greater the difference in brightness in the photographed scene, the higher the likelihood of loss of detail in highlights and/or shadows. As a result, instead of a blue sky with lush clouds, the picture turns out to be only a whitish spot, and objects located in the shadows turn into indistinct dark silhouettes or completely merge with the surrounding environment.

In classical photography, the concept of photographic latitude is used (see the sidebar for details). Theoretically, the photographic latitude of a digital camera is determined by the bit depth of its analog-to-digital converter (ADC). For example, with an 8-bit ADC, taking the quantization error into account, the theoretically achievable photographic latitude is 7 EV; for a 12-bit ADC it is 11 EV, and so on. However, in real devices the dynamic range of images falls short of this theoretical maximum because of various kinds of noise and other factors.

A large difference in brightness levels poses a serious problem when taking photographs. In this case the camera's capabilities proved insufficient to adequately render the brightest areas of the scene, and as a result, instead of an area of blue sky (marked with hatching), the image shows only a white "patch".

The maximum brightness value that a light-sensitive sensor can record is determined by the saturation level of its cells. The minimum value depends on several factors, including the amount of thermal noise of the matrix, charge transfer noise and ADC error.

It is also worth noting that the photographic latitude of the same digital camera can vary depending on the sensitivity set in the menu. The maximum dynamic range is achieved at the so-called base sensitivity (the lowest available numerical value). As this parameter is increased, the dynamic range decreases because the noise level rises.

The photographic latitude of modern digital cameras equipped with large sensors and 14- or 16-bit ADCs ranges from 9 to 11 EV, which is significantly more than the corresponding figure for 35 mm color negative film (on average 4 to 5 EV). Thus, even relatively inexpensive digital cameras have a photographic latitude sufficient to adequately render most typical amateur shooting scenes.

However, there is a problem of a different kind. It is associated with the limitations imposed by existing standards for recording digital images. Using the JPEG format with 8 bits per color channel (which has now become the de facto standard for recording digital images in the computer industry and digital technology), it is even theoretically impossible to save an image with a photographic latitude greater than 8 EV.

Let's assume that the camera's ADC allows you to obtain an image with a bit depth of 12 or 14 bits, containing discernible details in both highlights and shadows. However, if the photographic latitude of this image exceeds 8 EV, then in the process of conversion to a standard 8-bit format without any additional actions (that is, simply by discarding “extra” bits), part of the information recorded by the photosensitive sensor will be lost.

Dynamic range and photographic latitude

To put it simply, dynamic range is defined as the ratio of the maximum brightness value of an image to its minimum value. In classical photography, the term photographic latitude is traditionally used, which essentially means the same thing.

Dynamic range can be expressed as a ratio (for example, 1000:1, 2500:1, etc.), but most often it is given on a logarithmic scale. In this case the decimal logarithm of the ratio of the maximum brightness to the minimum is calculated, and the number is followed by the capital letter D (from density) or, less often, by the abbreviation OD (optical density). For example, if the ratio of the maximum brightness to the minimum brightness that a device can handle is 1000:1, its dynamic range is 3.0 D: lg(1000) = 3.0.

To measure photographic latitude, exposure units are traditionally used, abbreviated EV (exposure values; professionals often call them "stops" or "steps"). It is in these units that the exposure compensation value is usually set in the camera menu. Increasing the photographic latitude by 1 EV is equivalent to doubling the difference between the maximum and minimum brightness levels. Thus, the EV scale is also logarithmic, but in this case the base-2 logarithm is used. For example, if a device is capable of capturing images with a maximum-to-minimum brightness ratio of 256:1, its photographic latitude is 8 EV: log2(256) = 8.
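To make the two scales concrete, here is a minimal Python sketch (purely illustrative; the 1000:1 and 256:1 ratios are simply the examples used above):

```python
import math

def dynamic_range_d(contrast_ratio: float) -> float:
    """Dynamic range in density units: decimal logarithm of the max/min brightness ratio."""
    return math.log10(contrast_ratio)

def photographic_latitude_ev(contrast_ratio: float) -> float:
    """Photographic latitude in EV (stops): binary logarithm of the same ratio."""
    return math.log2(contrast_ratio)

print(dynamic_range_d(1000))          # 3.0 D
print(photographic_latitude_ev(256))  # 8.0 EV
```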

Compression is a reasonable compromise

The most effective way to preserve all of the image information recorded by the camera's light-sensitive sensor is to record images in RAW format. However, not every camera has such a function, and not every amateur photographer is ready to do the painstaking work of selecting individual settings for each photograph.

To reduce the likelihood of losing detail when high-contrast images are converted inside the camera into 8-bit JPEG, many manufacturers have introduced special functions, in compact as well as SLR models, that compress the dynamic range of saved images without user intervention. At the cost of lowering overall contrast and losing a small part of the original image information, such solutions make it possible to preserve, in 8-bit JPEG, the detail in highlights and shadows captured by the device's light-sensitive sensor, even when the dynamic range of the original image is wider than 8 EV.

One of the pioneers in this area was HP. Released in 2003, the HP Photosmart 945 digital camera featured the world's first HP Adaptive Lighting technology, which automatically compensates for low light levels in dark areas of photos and thus preserves shadow detail without the risk of overexposure (which is very important when shooting high-contrast scenes). The HP Adaptive Lighting algorithm is based on the principles set out by the American scientist Edwin Land in his Retinex theory of human visual perception.

HP Adaptive Lighting menu

How does Adaptive Lighting work? After a 12-bit image of the frame is obtained, an auxiliary monochrome image is extracted from it, which is essentially an illumination map. During processing, this map is used as a mask that controls the strength of a rather complex digital filter applied to the image. Thus, in the areas corresponding to the darkest points of the map the effect on the future image is minimal, and vice versa. This approach reveals shadow detail by selectively brightening those areas and, accordingly, reducing the overall contrast of the resulting image.

It should be noted that when Adaptive Lighting is enabled, the captured image is processed in the manner described above before the finished image is written to a file. All the described operations are performed automatically, and the user can only select one of two Adaptive Lighting operating modes (low or high exposure) in the camera menu or disable this function.
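HP has not published the exact filter, but the general mask-driven approach described above can be sketched roughly as follows (a simplified illustration only, not HP's actual algorithm; the box blur, its radius and the gain factor are arbitrary assumptions):

```python
import numpy as np

def box_blur(channel: np.ndarray, radius: int = 15) -> np.ndarray:
    """Cheap separable box blur, standing in for a proper low-pass filter."""
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    blurred = np.apply_along_axis(lambda row: np.convolve(row, kernel, mode="same"), 1, channel)
    return np.apply_along_axis(lambda col: np.convolve(col, kernel, mode="same"), 0, blurred)

def brighten_shadows(rgb: np.ndarray, strength: float = 0.6) -> np.ndarray:
    """Selectively lift shadows using a blurred luminance map as a mask.

    rgb: float array in [0, 1] with shape (H, W, 3); strength = 0 disables the effect.
    """
    luminance = rgb @ np.array([0.299, 0.587, 0.114])   # auxiliary monochrome map
    mask = 1.0 - box_blur(luminance)                    # large where the scene is dark
    gain = 1.0 + strength * mask[..., None]             # per-pixel brightening factor
    return np.clip(rgb * gain, 0.0, 1.0)
```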

Generally speaking, many specialized functions of modern digital cameras (including the face recognition systems discussed in the previous article) are a kind of by-product of research originally carried out for military customers. When it comes to image dynamic range optimization, one of the best-known providers of such solutions is Apical. Its algorithms, in particular, underlie the SAT (Shadow Adjustment Technology) function implemented in a number of Olympus digital camera models. Briefly, SAT works as follows: a mask corresponding to the darkest areas is created from the original frame, and the exposure value is then automatically corrected for those areas.

Sony has also licensed Apical's developments. Many compact cameras in the Cyber-shot series and DSLRs in the Alpha series implement the so-called Dynamic Range Optimizer (DRO) function.

Photos taken with the HP Photosmart R927 with the Adaptive Lighting function turned off (top) and enabled (bottom)

When DRO is activated, the image is corrected during the initial processing of the frame (that is, before the finished JPEG file is written). In the basic version, DRO has a two-level setting (standard or advanced mode can be selected in the menu). In Standard mode, the exposure level is adjusted based on an analysis of the frame, and a tone curve is then applied to the image to even out the overall balance. The advanced mode uses a more complex algorithm that allows correction in both shadows and highlights.

Sony developers are constantly working to improve the DRO algorithm. For example, in the a700 SLR camera, when the advanced DRO mode is activated, it is possible to select one of five correction options. In addition, it is possible to save three versions of one image at once (a kind of bracketing) with different DRO settings.

Many Nikon digital camera models have a D-Lighting function, which is also based on Apical algorithms. However, unlike the solutions described above, D-Lighting is implemented as a filter for processing previously saved images with a tone curve whose shape lightens the shadows while leaving the other areas of the image unchanged. But since in this case ready-made 8-bit images are processed (rather than the original frame, which has a higher bit depth and, accordingly, a wider dynamic range), the capabilities of D-Lighting are quite limited. The user could obtain the same result by processing the image in a graphics editor.

A comparison of enlarged fragments clearly shows that the dark areas of the original image (left) became lighter when the Adaptive Lighting function was turned on

There are also a number of solutions based on other principles. Many cameras of Panasonic's Lumix family (in particular, the DMC-FX35, DMC-TZ4, DMC-TZ5, DMC-FS20, DMC-FZ18, etc.) implement the Intelligent Exposure function, which is an integral part of the iA intelligent automatic shooting control system. Intelligent Exposure is based on automatic analysis of the frame and correction of its dark areas in order to avoid loss of detail in the shadows, as well as (if necessary) compression of the dynamic range of high-contrast scenes.

In some cases, the dynamic range optimization function involves not only processing of the captured image but also correction of the shooting settings. For example, recent Fujifilm digital cameras (in particular, the FinePix S100FS) implement a Wide Dynamic Range (WDR) function which, according to the developers, increases photographic latitude by one or two stops (200% and 400% in the terminology of the settings).

When WDR is activated, the camera shoots with exposure compensation of -1 or -2 EV (depending on the selected setting). The frame is thus deliberately underexposed, which is necessary to preserve as much information as possible about detail in the highlights. The resulting image is then processed with a tone curve that equalizes the overall balance and adjusts the black level. The image is then converted to 8-bit format and recorded as a JPEG file.
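The WDR-style pipeline just described can be imitated in a few lines (a rough sketch of the underexpose-then-lift idea only; the tone-curve shape and black level below are arbitrary choices, not Fujifilm's):

```python
import numpy as np

def wdr_style(linear: np.ndarray, ev_compensation: float = -1.0) -> np.ndarray:
    """Underexpose to protect highlights, then lift midtones with a tone curve.

    linear: sensor values scaled to [0, 1]; ev_compensation: -1 or -2 EV.
    """
    underexposed = np.clip(linear * 2.0 ** ev_compensation, 0.0, 1.0)   # -1 EV halves exposure
    lifted = underexposed ** (1.0 / (1.0 - 0.35 * ev_compensation))     # gamma-style lift
    black_level = 0.02                                                  # assumed black-point shift
    return np.clip((lifted - black_level) / (1.0 - black_level), 0.0, 1.0)
```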

Dynamic range compression preserves more detail in highlights and shadows, but an inevitable consequence of such processing is a decrease in overall contrast. In the bottom image the texture of the clouds is rendered much better, but because of the lower contrast this version of the photo looks less natural

A similar function called Dynamic Range Enlargement is implemented in a number of compact and SLR cameras from Pentax (Optio S12, K200D, etc.). According to the manufacturer, the use of the Dynamic Range Enlargement function allows you to increase the photographic latitude of images by 1 EV without losing detail in highlights and shadows.

A similar function called Highlight Tone Priority (HTP) is implemented in a number of Canon DSLR models (EOS 40D, EOS 450D, etc.). According to the user manual, activating HTP improves the rendering of detail in the highlights (specifically, in the range from 18% gray up to the maximum highlights).

Conclusion

Let's summarize. Built-in dynamic range compression allows a source image with a high dynamic range to be converted into an 8-bit JPEG file with minimal losses. For cameras that cannot save images in RAW format, a dynamic range compression mode allows photographers to make fuller use of the camera's potential when shooting high-contrast scenes.

Of course, it is important to remember that dynamic range compression is not a miracle cure, but rather a compromise. Preserving detail in highlights and/or shadows comes at the cost of increasing the noise level in the dark areas of the image, reducing its contrast, and somewhat coarsening smooth tonal transitions.

Like any automatic function, the dynamic range compression algorithm is not a universal solution that can improve absolutely any photo. It therefore makes sense to activate it only when it is really necessary. For example, to shoot a silhouette against a well-rendered background, the dynamic range compression function must be turned off; otherwise the spectacular scene will be hopelessly ruined.

Concluding our consideration of this topic, it should be noted that dynamic range compression cannot "pull out" details that were never captured by the camera's sensor in the first place. To achieve satisfactory results when shooting high-contrast scenes, you may need additional tools (such as graduated filters for landscape photography) or special techniques (such as shooting several frames with exposure bracketing and then merging them into one image using tone mapping).

The next article will focus on the burst function.

To be continued


The dynamic range (or photographic latitude) of a photographic material is the ratio between the maximum and minimum exposure values that can be correctly captured in a photograph. Applied to digital photography, dynamic range is effectively equivalent to the ratio of the maximum and minimum possible values of the useful electrical signal generated by the photosensor during exposure.

Dynamic range is measured in exposure stops (EV). Each stop corresponds to a doubling of the amount of light. So, for example, if a camera has a dynamic range of 8 EV, this means that the maximum possible value of the useful signal of its sensor relates to the minimum as 2^8:1, and that the camera can capture objects within one frame that differ in brightness by no more than a factor of 256. More precisely, it can capture objects of any brightness, but objects whose brightness exceeds the maximum permissible value will appear dazzling white in the image, while objects whose brightness is below the minimum will appear pitch black. Details and texture will be visible only on objects whose brightness falls within the camera's dynamic range.

To describe the relationship between the brightness of the lightest and darkest objects being photographed, the not entirely correct term “scene dynamic range” is often used. It would be more correct to talk about the brightness range or the contrast level, since dynamic range is usually a characteristic of the measuring device (in this case, the matrix of a digital camera).

Unfortunately, the brightness range of many beautiful scenes we encounter in real life can significantly exceed the dynamic range of a digital camera. In such cases, the photographer is forced to decide which objects should be worked out in full detail, and which can be left outside the dynamic range without compromising the creative intent. In order to make the most of your camera's dynamic range, you may sometimes need not so much a thorough understanding of how the photosensor works, but rather a developed artistic sense.

Factors limiting dynamic range

The lower limit of the dynamic range is set by the photosensor's own noise. Even an unlit sensor generates a background electrical signal called dark noise. Interference also arises when the charge is transferred to the analog-to-digital converter, and the ADC itself introduces a certain error into the digitized signal, the so-called quantization noise.

If you take a photo in complete darkness or with a lens cap on, the camera will only record this meaningless noise. If a minimal amount of light is allowed to reach the sensor, the photodiodes will begin to accumulate an electrical charge. The magnitude of the charge, and hence the intensity of the useful signal, will be proportional to the number of captured photons. In order for any meaningful details to appear in the image, it is necessary that the level of the useful signal exceeds the level of background noise.

Thus, the lower limit of the dynamic range or, in other words, the sensor sensitivity threshold can be formally defined as the level of the output signal at which the signal-to-noise ratio is greater than unity.

The upper limit of the dynamic range is determined by the capacity of an individual photodiode. If during exposure a photodiode accumulates a charge equal to its maximum value, the image pixel corresponding to the overloaded photodiode turns out completely white, and further irradiation has no effect on its brightness. This phenomenon is called clipping. The larger the capacity of a photodiode, the greater the output signal it can produce before reaching saturation.

For greater clarity, let us turn to the characteristic curve, a graph of output signal versus exposure. The horizontal axis represents the binary logarithm of the radiation received by the sensor, and the vertical axis the binary logarithm of the electrical signal generated by the sensor in response to that radiation. My drawing is largely schematic and serves purely illustrative purposes. The characteristic curve of a real photosensor has a somewhat more complex shape, and the noise level is rarely so high.

The graph clearly shows two critical points: at the first, the level of the useful signal crosses the noise threshold, and at the second, the photodiodes reach saturation. The exposure values lying between these two points make up the dynamic range. In this abstract example it is, as is easy to see, 5 EV, i.e. the camera can handle five doublings of exposure, which is equivalent to a 32-fold (2^5 = 32) difference in brightness.
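In code, the technical dynamic range is simply the base-2 logarithm of the saturation-to-noise ratio; the sensor numbers below are hypothetical and chosen only to match the 5 EV example:

```python
import math

def dynamic_range_ev(saturation_level: float, noise_floor: float) -> float:
    """Technical dynamic range in EV (stops): log2 of saturation over the noise floor."""
    return math.log2(saturation_level / noise_floor)

# Hypothetical sensor: full-well signal of 32 000 units, noise floor of 1 000 units.
print(dynamic_range_ev(32_000, 1_000))  # 5.0 EV, i.e. a 32:1 usable brightness range
```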

The exposure zones that make up the dynamic range are not equal. The upper zones have a higher signal-to-noise ratio and therefore look cleaner and more detailed than the lower ones. As a result, the upper limit of the dynamic range is very tangible and noticeable: clipping cuts off the highlights at the slightest overexposure, while the lower limit drowns inconspicuously in noise, and the transition to black is not nearly as abrupt as the transition to white.

The linear dependence of the signal on exposure, as well as the sharp rise to a plateau, are distinctive features of the digital photographic process. For comparison, take a look at the characteristic curve of traditional photographic film.

The shape of the curve and especially the angle of inclination strongly depend on the type of film and on the procedure for its development, but the main, striking difference between the film graph and the digital one remains unchanged - the nonlinear nature of the dependence of the optical density of the film on the exposure value.

The lower limit of the photographic latitude of negative film is determined by the density of the fog, and the upper limit by the maximum achievable optical density of the photographic layer; for reversal films it is the other way around. Both in the shadows and in the highlights the characteristic curve bends smoothly, indicating a drop in contrast as the boundaries of the dynamic range are approached, because the slope of the curve is proportional to the contrast of the image. Thus, the exposure zones lying in the middle part of the graph have maximum contrast, while in the highlights and shadows the contrast is reduced. In practice, the difference between film and a digital sensor is especially noticeable in the highlights: where in a digital image the highlights are burned out by clipping, on film the details are still visible, although low in contrast, and the transition to pure white looks smooth and natural.

In sensitometry, two separate terms are in fact used: photographic latitude proper, limited to the relatively linear portion of the characteristic curve, and useful photographic latitude, which in addition to the linear section also includes the toe and shoulder of the curve.

It is noteworthy that when processing digital photographs, as a rule, a more or less pronounced S-shaped curve is applied to them, increasing the contrast in midtones at the cost of reducing it in shadows and highlights, which gives the digital image a more natural and pleasing appearance to the eye.

Bit depth

Unlike the sensor of a digital camera, human vision has what might be called a logarithmic view of the world. Successive doublings of the amount of light are perceived by us as equal changes in brightness. Exposure steps can even be compared to musical octaves, because a doubling of sound frequency is perceived by the ear as a single musical interval. Our other senses work on the same principle. The nonlinearity of perception greatly expands the range of human sensitivity to stimuli of varying intensity.

When a RAW file containing linear data is converted (whether in the camera or in a RAW converter), a so-called gamma curve is applied to it; it is designed to non-linearly increase the brightness of the digital image and bring it into line with the characteristics of human vision.

With linear conversion, the image is too dark.

After gamma correction, the brightness returns to normal.

The gamma curve stretches the dark tones and compresses the light ones, making the distribution of gradations more uniform. The result is a natural-looking image, but noise and quantization artifacts in the shadows inevitably become more noticeable, which is only aggravated by the small number of brightness levels in the lower zones.

Linear distribution of brightness gradations.
Uniform distribution after applying the gamma curve.
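The effect of the gamma curve on linear data can be shown in a couple of lines (an illustrative sketch; the value 2.2 is the common display gamma, assumed here rather than taken from any particular camera):

```python
import numpy as np

def apply_gamma(linear: np.ndarray, gamma: float = 2.2) -> np.ndarray:
    """Map linear sensor values in [0, 1] onto a perceptually more uniform scale."""
    return np.clip(linear, 0.0, 1.0) ** (1.0 / gamma)

# An 18% midtone on the linear scale ends up at roughly 46% after gamma correction.
print(apply_gamma(np.array([0.18])))
```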

ISO and dynamic range

Although digital photography uses the same concept of photosensitivity of the photographic material as film photography does, it should be understood that this is purely a matter of tradition: the approaches to changing photosensitivity in digital and film photography are fundamentally different.

Increasing ISO sensitivity in traditional photography means replacing one film with another with coarser grain, i.e. There is an objective change in the properties of the photographic material itself. In a digital camera, the light sensitivity of the sensor is strictly determined by its physical characteristics and cannot be changed in the literal sense. When increasing ISO, the camera does not change the actual sensitivity of the sensor, but only amplifies the electrical signal generated by the sensor in response to irradiation and adjusts the digitization algorithm for this signal accordingly.

An important consequence of this is that the effective dynamic range decreases in proportion to the increase in ISO, because the noise is amplified along with the useful signal. If at ISO 100 the entire range of signal values is digitized, from zero to the saturation point, then at ISO 200 only half of the photodiodes' capacity is taken as the maximum. With each doubling of the ISO sensitivity, the top stop of the dynamic range is cut off and the remaining stops are pulled up in its place. This is why using ultra-high ISO values makes no practical sense: you could just as well brighten the photo in a RAW converter and get a comparable noise level. The difference between raising the ISO and artificially brightening the image is that when the ISO is raised the signal is amplified before it enters the ADC, so quantization noise is not amplified, unlike the sensor's own noise, whereas brightening in a RAW converter amplifies everything, including the ADC errors. In addition, reducing the digitization range means more accurate sampling of the remaining values of the input signal.

Incidentally, lowering the ISO below the base value (for example, to ISO 50), which is available on some cameras, does not expand the dynamic range at all; it simply attenuates the signal by half, which is equivalent to darkening the image in a RAW converter. This function can even be considered harmful, since using a below-base ISO value provokes the camera into increasing the exposure, which, with the sensor's saturation threshold unchanged, increases the risk of clipping in the highlights.

True Dynamic Range

There are a number of programs (DxO Analyzer, Imatest, RawDigger, etc.) that allow you to measure the dynamic range of a digital camera at home. In principle, this is not really necessary, since data for most cameras can be freely found on the Internet, for example on the website DxOMark.com.

Should we believe the results of such tests? By and large, yes, with the one caveat that all these tests measure the effective or, so to speak, technical dynamic range, i.e. the ratio between the saturation level and the noise level of the sensor. For a photographer, what matters most is the useful dynamic range, i.e. the number of exposure zones that actually allow some useful information to be captured.

As you remember, the lower limit of the dynamic range is set by the noise level of the photosensor. The problem is that, in practice, the lower zones that technically already fall within the dynamic range still contain too much noise to be of much use. A lot here depends on individual tolerance: everyone determines the acceptable noise level for themselves.

My subjective opinion is that details in the shadows begin to look more or less decent when the signal-to-noise ratio is at least eight. On this basis, I define useful dynamic range as technical dynamic range minus about three stops.

For example, if a DSLR camera, according to reliable tests, has a dynamic range of 13 EV, which is very good by today's standards, then its useful dynamic range will be about 10 EV, which, in general, is also quite good. Of course, we are talking about shooting in RAW, with minimum ISO and maximum bit depth. When shooting JPEG, dynamic range is highly dependent on contrast settings, but on average you should give up another two or three stops.

For comparison: color reversal films have a useful photographic latitude of 5-6 stops; black and white negative films give 9-10 stops with standard developing and printing procedures, and with certain manipulations - up to 16-18 stops.

To summarize the above, let's try to formulate a few simple rules, the observance of which will help you squeeze maximum performance out of your camera's sensor:

  • The dynamic range of a digital camera is only fully accessible when shooting in RAW.
  • Dynamic range decreases as light sensitivity increases, so avoid high ISO settings unless absolutely necessary.
  • Using a higher bit depth for RAW files does not increase true dynamic range, but it does improve tonal separation in the shadows due to more brightness levels.
  • Expose to the right. The upper exposure zones always contain the maximum useful information with a minimum of noise and should be used to the fullest. At the same time, do not forget about the danger of clipping: pixels that have reached saturation are absolutely useless.

And most importantly: don't worry too much about the dynamic range of your camera. Its dynamic range is fine. Your ability to see light and manage exposure correctly is much more important. A good photographer will not complain about the lack of photographic latitude, but will try to wait for more comfortable lighting, or change the angle, or use the flash, in a word, will act in accordance with the circumstances. I'll tell you more: some scenes only benefit from the fact that they do not fit into the dynamic range of the camera. Often an unnecessary abundance of details simply needs to be hidden in a semi-abstract black silhouette, which makes the photo both more laconic and richer.

High contrast is not always a bad thing – you just need to know how to work with it. Learn to exploit the shortcomings of the equipment as well as its advantages, and you will be surprised how much your creative possibilities will expand.

Thank you for your attention!

Vasily A.

Post scriptum

If you found the article useful and informative, you can kindly support the project by making a contribution to its development. If you didn’t like the article, but you have thoughts on how to make it better, your criticism will be accepted with no less gratitude.

Please remember that this article is subject to copyright. Reprinting and quoting are permissible provided there is a valid link to the source, and the text used must not be distorted or modified in any way.

This group of methods is based on subjecting the transmitted signals to nonlinear amplitude transformations, with the nonlinearities in the transmitting and receiving parts being reciprocal. For example, if the nonlinear function √u is used in the transmitter, u² is used in the receiver. Consistent application of reciprocal functions ensures that the overall transformation remains linear.

The idea of nonlinear data compression methods is that, with the same amplitude of the output signals, the transmitter can convey a larger range of variation of the transmitted parameter (that is, a larger dynamic range). Dynamic range is the ratio of the largest permissible signal amplitude to the smallest, expressed in relative units or in decibels:

D = U_max / U_min;    (2.17)
D [dB] = 20·lg(U_max / U_min).    (2.18)

The natural desire to increase the dynamic range by decreasing U min is limited by the sensitivity of the equipment and the increasing influence of interference and self-noise.

Most often, dynamic range compression is carried out using a pair of reciprocal functions: the logarithm and the exponential. The first operation, which changes the amplitude, is called compression (compaction); the second is called expansion (stretching). These particular functions are chosen because they provide the greatest compression capability.

At the same time, these methods also have disadvantages. The first is that the logarithm of a small number is negative, and in the limit

lim (u → 0) log u = −∞,

that is, the sensitivity is highly nonlinear.

To reduce these shortcomings, both functions are modified by offset and approximation. For example, for telephone channels the approximated function has the form (A-law):

F(x) = A·|x| / (1 + ln A) for |x| < 1/A;
F(x) = (1 + ln(A·|x|)) / (1 + ln A) for 1/A ≤ |x| ≤ 1,

with A = 87.6. The gain from compression is 24 dB.
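For reference, the A-law pair of reciprocal functions can be written down directly (a sketch following the continuous A-law definition used in ITU-T G.711; quantization itself is omitted):

```python
import math

def a_law_compress(x: float, a: float = 87.6) -> float:
    """A-law compressor: x is the normalized input amplitude in [-1, 1]."""
    ax = abs(x)
    if ax < 1.0 / a:
        y = a * ax / (1.0 + math.log(a))
    else:
        y = (1.0 + math.log(a * ax)) / (1.0 + math.log(a))
    return math.copysign(y, x)

def a_law_expand(y: float, a: float = 87.6) -> float:
    """Reciprocal expander restoring the original amplitude."""
    ay = abs(y)
    if ay < 1.0 / (1.0 + math.log(a)):
        x = ay * (1.0 + math.log(a)) / a
    else:
        x = math.exp(ay * (1.0 + math.log(a)) - 1.0) / a
    return math.copysign(x, y)

# Quiet samples are boosted before quantization and restored afterwards.
print(round(a_law_expand(a_law_compress(0.01)), 4))  # ~0.01
```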

Data compression using nonlinear procedures is implemented by analog means with large errors. The use of digital technology can significantly improve the accuracy or speed of the conversion. At the same time, directly applying general-purpose computing (that is, directly calculating logarithms and exponentials) would not give the best results because of low performance and accumulating calculation errors.

Because of these accuracy limitations, data compression by companding is used in non-critical cases, for example for transmitting speech over telephone and radio channels.

Efficient Coding

Efficient codes were proposed by C. Shannon, Fano and Huffman. The essence of these codes is that they are non-uniform, that is, their codewords have unequal numbers of bits, and the length of a codeword is inversely related to the probability of its occurrence. Another great feature of efficient codes is that they do not require delimiters, i.e. special characters separating adjacent code combinations. This is achieved by observing a simple rule: shorter codewords are never the beginning of longer ones. In this case the continuous bit stream is decoded unambiguously, because the decoder detects the shorter codewords first. For a long time, efficient codes were of purely academic interest, but recently they have been used successfully in building databases, as well as in compressing information in modern modems and software archivers.

Because the code is non-uniform, the concept of average code length is introduced. The average length is the mathematical expectation of the codeword length:

l_av = Σ p_i·l_i ,

and l_av tends to H(x) from above (that is, l_av ≥ H(x)).

Condition (2.23) is satisfied ever more closely as N increases.

There are two kinds of efficient codes: Shannon-Fano codes and Huffman codes. Let's look at how they are obtained using an example. Assume that the probabilities of the symbols in the sequence have the values given in Table 2.1.

Table 2.1.

Symbol probabilities

N     1     2     3     4     5     6     7     8     9
p_i   0.1   0.2   0.1   0.3   0.05  0.15  0.03  0.02  0.05

The symbols are ranked, that is, arranged in a row in descending order of probability. After this, the Shannon-Fano method repeats the following procedure: the entire group of events is divided into two subgroups with equal (or approximately equal) total probabilities. The procedure continues until one element remains in the current subgroup, after which this element is set aside and the same actions continue with the remaining ones. This goes on until only one element is left in each of the last two subgroups. Let's continue with our example, which is summarized in Table 2.2.

Table 2.2.

Shannon-Fano coding

N    p_i     Group at successive partitions
4    0.3     I
2    0.2     I  II
6    0.15    I  I
3    0.1     II
1    0.1     I  I
9    0.05    II II
5    0.05    II I
7    0.03    II II I
8    0.02    II

As can be seen from Table 2.2, the first symbol, with probability p4 = 0.3, took part in two partitioning steps and both times ended up in group I. Accordingly, it is encoded with the two-digit code 11. The second element belonged to group I at the first partition and to group II at the second; its code is therefore 10. The codes of the remaining symbols need no additional comment.
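A compact recursive sketch of the Shannon-Fano procedure described above (the split point here is chosen to balance the group probabilities, so the codes of the rarer symbols may differ slightly from Table 2.2):

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, probability) pairs sorted by descending probability.
    Returns a dict mapping each symbol to its code string."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(p for _, p in symbols)
    best_diff, split = float("inf"), 1
    # Choose the split that makes the two groups' total probabilities as equal as possible.
    for i in range(1, len(symbols)):
        diff = abs(2 * sum(p for _, p in symbols[:i]) - total)
        if diff < best_diff:
            best_diff, split = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:split]).items():
        codes[sym] = "1" + code      # group I receives a 1, following the article's convention
    for sym, code in shannon_fano(symbols[split:]).items():
        codes[sym] = "0" + code      # group II receives a 0
    return codes

ranked = [(4, 0.3), (2, 0.2), (6, 0.15), (3, 0.1), (1, 0.1),
          (9, 0.05), (5, 0.05), (7, 0.03), (8, 0.02)]
print(shannon_fano(ranked))   # symbol 4 -> '11', symbol 2 -> '10', ...
```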

Non-uniform codes are usually depicted as code trees. A code tree is a graph showing the allowed code combinations. The directions of the edges of this graph are agreed in advance, as shown in Fig. 2.11 (the choice of directions is arbitrary).

The graph is traversed as follows: a route is drawn to the selected symbol; the number of bits in its code equals the number of edges in the route, and the value of each bit equals the direction of the corresponding edge. The route is drawn from the starting point (marked with the letter A in the drawing). For example, the route to vertex 5 consists of five edges, all but the last of which have direction 0; this gives the code 00001.

Let's calculate the entropy and average word length for this example.

H(x) = −(0.3·log2 0.3 + 0.2·log2 0.2 + 0.15·log2 0.15 + 2·0.1·log2 0.1 + 2·0.05·log2 0.05 + 0.03·log2 0.03 + 0.02·log2 0.02) ≈ 2.76 bits.

l_av = 0.3·2 + 0.2·2 + 0.15·3 + 0.1·3 + 0.1·4 + 0.05·5 + 0.05·4 + 0.03·6 + 0.02·6 = 2.9.

As you can see, the average word length is close to entropy.

Huffman codes are constructed using a different algorithm. The coding procedure consists of two stages. In the first stage the alphabet is repeatedly compressed, one step at a time. A single compression replaces the two symbols with the lowest probabilities by one symbol with their total probability. Compressions are carried out until only two symbols remain. At the same time, a coding table is filled in, in which the resulting probabilities are entered and the routes along which the new symbols move at the next stage are drawn.

In the second stage the actual encoding takes place, starting from the last compression stage: the first of the two remaining symbols is assigned the code 1, the second the code 0. The procedure then moves back to the previous stage. Symbols that did not take part in the compression at that stage keep the codes assigned to them later, while the code of the symbol obtained by merging is given to both of the last two symbols: a 1 is appended for the upper symbol and a 0 for the lower one. If a symbol takes no further part in merging, its code remains unchanged. The procedure continues until the first stage is reached.

Table 2.3 shows the Huffman coding. As can be seen from the table, the coding was carried out in seven stages. The symbol probabilities are on the left, the intermediate codes on the right. The arrows show the movements of the newly formed symbols. At each stage, the last two symbols differ only in the least significant bit, which corresponds to the encoding technique. Let's calculate the average word length:

l_av = 0.3·2 + 0.2·2 + 0.15·3 + 2·0.1·3 + 0.05·4 + 0.05·5 + 0.03·6 + 0.02·6 = 2.8.

This is even closer to the entropy: the code is even more efficient. Fig. 2.12 shows the Huffman code tree.
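For comparison, here is a short sketch of the classic two-pass Huffman construction using a priority queue; being optimal, it reproduces the 2.8-bit average length, although the individual bit patterns may differ from those in Table 2.3:

```python
import heapq

def huffman_codes(probabilities: dict) -> dict:
    """Build Huffman codes for {symbol: probability}; returns {symbol: bit string}."""
    # Heap entries: (probability, tie-breaker, {symbol: partial code}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)   # the two least probable nodes
        p2, _, codes2 = heapq.heappop(heap)
        merged = {s: "1" + c for s, c in codes1.items()}
        merged.update({s: "0" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {4: 0.3, 2: 0.2, 6: 0.15, 3: 0.1, 1: 0.1, 9: 0.05, 5: 0.05, 7: 0.03, 8: 0.02}
codes = huffman_codes(probs)
print(round(sum(probs[s] * len(c) for s, c in codes.items()), 2))  # average length: 2.8
```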

Table 2.3.

Huffman coding

p_i     I            II           III          IV           V            VI          VII
0.3     0.3  11      0.3  11      0.3  11      0.3  11      0.3  11      0.4  0      0.6  1
0.2     0.2  01      0.2  01      0.2  01      0.2  01      0.3  10      0.3  11     0.4  0
0.15    0.15 101     0.15 101     0.15 101     0.2  00      0.2  01      0.3  10
0.1     0.1  001     0.1  001     0.15 100     0.15 101     0.2  00
0.1     0.1  000     0.1  000     0.1  001     0.15 100
0.05    0.05 1000    0.1  1001    0.1  000
0.05    0.05 10011   0.05 1000
0.03    0.05 10010
0.02

(each stage column shows the probability and the intermediate code of the symbol at that compression stage)

Both codes satisfy the requirement of unambiguous decoding: as can be seen from the tables, shorter combinations are never the beginning of longer codes.

As the number of symbols increases, so does the efficiency of the codes, so in some cases larger blocks are encoded (for example, when texts are involved, some of the most frequently occurring syllables, words and even phrases can be encoded).

The effect of introducing such codes is determined by comparing them with a uniform code:

K = n / l_av ,    (2.24)

where n is the number of bits of the uniform code that is replaced by the effective one.

Modifications of Huffman codes

The classic Huffman algorithm is a two-pass algorithm: statistics on the symbols of the message must first be collected, and only then are the procedures described above carried out. This is inconvenient in practice because it increases the time needed to process messages and to accumulate the dictionary. One-pass methods, in which the accumulation and encoding procedures are combined, are used more often. Such methods are also called adaptive Huffman compression [46].

The essence of adaptive Huffman compression comes down to building an initial code tree and successively modifying it as each next symbol arrives. As before, the trees here are binary, i.e. at most two arcs emanate from each vertex of the tree graph. It is customary to call the original vertex the parent and the two subsequent vertices connected to it its children. Let's introduce the concept of vertex weight: this is the number of symbols (words) corresponding to a given vertex, obtained as the original sequence is fed in. Obviously, the sum of the children's weights equals the weight of the parent.

After introducing the next symbol of the input sequence, the code tree is revised: the weights of the vertices are recalculated and, if necessary, the vertices are rearranged. The rule for permuting vertices is as follows: the weights of the lower vertices are the smallest, and the vertices located on the left of the graph have the least weights.

At the same time, the vertices are numbered. The numbering starts with the lower (hanging, i.e. childless) vertices from left to right, then moves to the next level up, and so on, until the last, root vertex is numbered. This achieves the following result: the smaller the weight of a vertex, the smaller its number.

The permutation is carried out mainly for hanging vertices. When permuting, the rule formulated above must be taken into account: vertices with greater weight have a higher number.

After the sequence (also called the control or test sequence) has been passed, all hanging vertices are assigned code combinations. The rule for assigning codes is similar to the one above: the number of bits in the code equals the number of vertices through which the route passes from the root to the given hanging vertex, and the value of a particular bit corresponds to the direction from the parent to the child (say, going to the left child corresponds to 1, to the right to 0).

The resulting code combinations are stored in the memory of the compression device along with their analogues and form a dictionary. The use of the algorithm is as follows. The compressed sequence of characters is divided into fragments in accordance with the existing dictionary, after which each of the fragments is replaced with its code from the dictionary. Fragments not found in the dictionary form new hanging vertices, acquire weight and are also entered into the dictionary. In this way, an adaptive algorithm for replenishing the dictionary is formed.

To increase the efficiency of the method, it is desirable to increase the size of the dictionary; in this case the compression ratio increases. In practice, the size of the dictionary is 4 - 16 KB of memory.


Let us illustrate the algorithm with an example. Fig. 2.13 shows the initial diagram (also called a Huffman tree). Each vertex of the tree is shown as a rectangle containing two numbers separated by a slash: the first is the vertex number, the second its weight. As you can see, the correspondence between the vertex weights and their numbers is satisfied.

Let us now assume that the symbol corresponding to vertex 1 appears a second time in the test sequence. The weight of the vertex has changed as shown in Fig. 2.14, as a result of which the rule for numbering vertices is violated. At the next stage, we change the location of the hanging vertices, for which we swap vertices 1 and 4 and renumber all the vertices of the tree. The resulting graph is shown in Fig. 2.15. The procedure then continues in the same way.

It should be remembered that each hanging vertex in the Huffman tree corresponds to a specific symbol or group of symbols. The parent differs from its children in that the group of symbols corresponding to it is one symbol shorter than that of its children, and the children differ from each other in their last symbol. For example, the symbols "car" may correspond to the parent; its children would then correspond to the sequences "care" and "carp".

This algorithm is not merely academic; it is actively used in archiver programs, including for compressing graphic data (these will be discussed below).

Lempel–Ziv algorithms

These are the most commonly used compression algorithms today. They are used in most archiving programs (for example, PKZIP, ARJ, LHA). The essence of the algorithms is that a certain set of symbols is replaced during archiving by its number in a specially generated dictionary. For example, the phrase "The outgoing number for your letter...", which is often found in business correspondence, might occupy position 121 in the dictionary; then, instead of transmitting or storing the phrase itself (30 bytes), its number can be stored (1.5 bytes in binary-coded decimal or 1 byte in binary).

The algorithms are named after the authors who first proposed them in 1977; the first of them is LZ77. For archiving, a so-called sliding message window is created, consisting of two parts. The first, larger part serves to form the dictionary and is several kilobytes in size. The second, smaller part (usually up to about 100 bytes) receives the current characters of the text being examined. The algorithm tries to find in the dictionary a set of characters matching those in the look-ahead window. If it succeeds, a code is generated consisting of three parts: the offset of the matching substring in the dictionary, the length of that substring, and the character following it. For example, suppose the matched substring consists of the characters "app" (3 characters in total) and the next character is "e". Then, if the substring's address (its place in the dictionary) is 45, the output entry looks like "45, 3, e". After this, the contents of the window are shifted by one position and the search continues. This is how the dictionary is formed.
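A simplified LZ77-style encoder can be sketched as follows (the window sizes and token format are illustrative choices, not those of any particular archiver):

```python
def lz77_encode(data: str, window: int = 4096, lookahead: int = 100):
    """Produce a list of (offset, length, next_char) tokens, LZ77-style."""
    tokens, pos = [], 0
    while pos < len(data):
        start = max(0, pos - window)
        best_off, best_len = 0, 0
        # Search the dictionary part of the window for the longest match.
        for off in range(start, pos):
            length = 0
            while (length < lookahead and pos + length < len(data)
                   and data[off + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_off, best_len = pos - off, length
        next_char = data[pos + best_len] if pos + best_len < len(data) else ""
        tokens.append((best_off, best_len, next_char))
        pos += best_len + 1
    return tokens

print(lz77_encode("abracadabra abracadabra"))
```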

The advantage of the algorithm is that the procedure for building the dictionary is easily formalized. In addition, decompression is possible without the original dictionary (although it is desirable to have a test sequence), since the dictionary is re-created during decompression.

The disadvantages of the algorithm appear as the size of the dictionary grows: the search time increases. In addition, if a string of characters that is not in the dictionary appears in the current window, each character is written with a three-element code, so the result is not compression but expansion.

The LZSS algorithm, proposed in 1978, has better characteristics. It differs in how the sliding window is maintained and in the codes the compressor outputs. In addition to the window, the algorithm maintains a binary tree, similar to a Huffman tree, to speed up the search for matches: each substring leaving the current window is added to the tree as one of its children. This makes it possible to further increase the size of the current window (it is desirable that its size be a power of two: 128, 256, etc. bytes). Sequence codes are also formed differently: an additional one-bit prefix is introduced to distinguish uncoded characters from "offset, length" pairs.

An even greater degree of compression is obtained with algorithms of the LZW type. The algorithms described earlier have a fixed window size, which makes it impossible to enter phrases longer than the window into the dictionary. In LZW (and its predecessor LZ78) the viewing window has unlimited size and the dictionary accumulates phrases (rather than collections of characters, as before). The dictionary has unlimited length, and the encoder (decoder) works in phrase-waiting mode. When a phrase matching the dictionary is formed, the match code (i.e. the code of that phrase in the dictionary) and the code of the character following it are emitted. If, as symbols accumulate, a new phrase is formed, it is entered into the dictionary, just like the shorter one. The result is a recursive procedure that provides fast encoding and decoding.
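The phrase-accumulating behaviour of LZW can be seen in a minimal encoder sketch (the dictionary here is seeded from the input's own alphabet, purely for brevity):

```python
def lzw_encode(data: str):
    """Minimal LZW encoder; returns the code stream and the final dictionary."""
    dictionary = {ch: i for i, ch in enumerate(sorted(set(data)))}
    phrase, output = "", []
    for ch in data:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate                       # keep growing the current phrase
        else:
            output.append(dictionary[phrase])        # emit the longest known phrase
            dictionary[candidate] = len(dictionary)  # remember the new, longer phrase
            phrase = ch
    if phrase:
        output.append(dictionary[phrase])
    return output, dictionary

codes, _ = lzw_encode("abababababab")
print(codes)   # repeated fragments quickly collapse into single dictionary codes
```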

Additional compression can be obtained by run-length encoding of repeated characters. If some characters in a sequence follow one another (for example, "space" characters in text, or consecutive zeros in a numeric sequence), it makes sense to replace them with the pair "character, length" or "flag, character, length". In the first case the code contains a flag indicating that a run is being encoded (usually 1 bit), then the code of the repeated character and the length of the run. In the second case (provided for the most frequently repeated characters), the prefix simply carries the repetition flag.
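Run-length encoding itself is trivial to express (a sketch of the "character, length" variant only; the one-bit run flag is omitted):

```python
def run_length_encode(data: str):
    """Collapse runs of identical characters into (character, run length) pairs."""
    if not data:
        return []
    runs, current, count = [], data[0], 1
    for ch in data[1:]:
        if ch == current:
            count += 1
        else:
            runs.append((current, count))
            current, count = ch, 1
    runs.append((current, count))
    return runs

print(run_length_encode("aaaabbc   "))  # [('a', 4), ('b', 2), ('c', 1), (' ', 3)]
```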

When researchers were just beginning to tackle the problem of creating a speech interface for computers, they often had to build their own equipment to input audio information into the computer and to output it from the computer. Today such devices are probably only of historical interest, since modern computers can easily be equipped with audio input and output devices such as sound adapters, microphones, headphones and speakers.

We will not delve into the details of the internal structure of these devices, but we will talk about how they work and provide some recommendations for choosing audio computer devices for working with speech recognition and synthesis systems.

As we already said in the previous chapter, sound is nothing more than air vibrations, the frequency of which lies in the range of frequencies perceived by humans. The exact boundaries of the audible frequency range may vary from person to person, but sound vibrations are believed to lie in the range of 16-20,000 Hz.

The job of a microphone is to convert sound vibrations into electrical vibrations, which can then be amplified, filtered to remove interference, and digitized so that the audio information can be entered into the computer.

By operating principle, the most common microphones are divided into carbon, electrodynamic, condenser and electret types. Some of them require an external current source to operate (for example, carbon and condenser microphones), while others are capable of generating an alternating electrical voltage on their own under the influence of sound vibrations (electrodynamic and electret microphones).

You can also separate the microphones according to their purpose. There are studio microphones that can be held in your hand or mounted on a stand, there are radio microphones that can be clipped to clothing, and so on.

There are also microphones designed specifically for computers. Such microphones are usually mounted on a stand placed on the surface of a table. Computer microphones can be combined with headphones, as shown in Fig. 2-1.

Fig. 2-1. Headphones with microphone

How, out of all this variety, do you choose the microphone best suited to speech recognition systems?

In principle, you can experiment with any microphone you have, as long as it can be connected to your computer's audio adapter. However, developers of speech recognition systems recommend purchasing a microphone that, during operation, will be at a constant distance from the speaker’s mouth.

If the distance between the microphone and the mouth does not change, then the average level of the electrical signal coming from the microphone will not change too much either. This will have a positive impact on the performance of modern speech recognition systems.

What's the problem?

A person is able to successfully recognize speech, the volume of which varies over a very wide range. The human brain is able to filter out quiet speech from interference, such as the noise of cars passing on the street, outside conversations and music.

As for modern speech recognition systems, their abilities in this area leave much to be desired. If the microphone is on a table, then when you turn your head or change your body position, the distance between your mouth and the microphone will change. This will change the microphone output level, which in turn will reduce the reliability of speech recognition.

Therefore, when working with speech recognition systems, the best results will be achieved if you use a microphone attached to headphones, as shown in Fig. 2-1. When using such a microphone, the distance between the mouth and the microphone will be constant.

We also draw your attention to the fact that all experiments with speech recognition systems are best carried out alone, in a quiet room. In this case the influence of interference will be minimal. Of course, if you need to select a speech recognition system that can operate under strong interference, the tests have to be conducted differently. However, as far as the authors of this book know, the noise immunity of speech recognition systems is still very, very low.

The microphone converts sound vibrations into fluctuations of electrical current for us. These fluctuations can be seen on an oscilloscope screen, but do not rush to the store to buy that expensive device. All oscillographic studies can be carried out with an ordinary computer equipped with a sound adapter, for example a Sound Blaster. Later we will tell you how to do this.

Fig. 2-2 shows an oscillogram of the sound signal obtained when a long sound "a" is pronounced. The waveform was obtained using the GoldWave program, which we will discuss later in this chapter, together with a Sound Blaster audio adapter and a microphone similar to the one shown in Fig. 2-1.

Fig. 2-2. Audio signal oscillogram

The GoldWave program allows you to stretch the oscillogram along the time axis, which makes it possible to see the smallest details. Fig. 2-3 shows a stretched fragment of the above-mentioned oscillogram of the sound "a".

Fig. 2-3. Fragment of an oscillogram of an audio signal

Please note that the magnitude of the input signal coming from the microphone changes periodically and takes on both positive and negative values.

If there was only one frequency present in the input signal (that is, if the sound was “clean”), the waveform received from the microphone would be a sine wave. However, as we have already said, the spectrum of human speech sounds consists of a set of frequencies, as a result of which the shape of the oscillogram of the speech signal is far from sinusoidal.

A signal whose magnitude changes continuously over time is called an analog signal. This is exactly the kind of signal that comes from the microphone. Unlike an analog signal, a digital signal is a set of numerical values that change discretely over time.

For a computer to process an audio signal, the signal must be converted from analog to digital form, that is, represented as a set of numerical values. This process is called digitization of the analog signal.

Digitization of an audio (or any other analog) signal is performed by a special device called an analog-to-digital converter (ADC). This device is located on the sound adapter board and looks like an ordinary microchip.

How does an analog-to-digital converter work?

It periodically measures the level of the input signal and outputs the numerical value of the measurement result. This process is illustrated in Fig. 2-4, where the gray rectangles indicate the input signal values measured at a constant time interval. The set of such values is the digitized representation of the input analog signal.

Fig. 2-4. Measurements of signal amplitude over time

Fig. 2-5 shows the connection of an analog-to-digital converter to a microphone. The analog signal is fed to input x1, and the digital signal is taken from outputs u1-un.

Fig. 2-5. Analog-to-digital converter

Analog-to-digital converters are characterized by two important parameters: the conversion (sampling) frequency and the number of quantization levels of the input signal. Choosing these parameters correctly is critical for obtaining an adequate digital representation of the analog signal.

How often do you need to measure the amplitude of the input analog signal so that information about changes in the input analog signal is not lost as a result of digitization?

It would seem that the answer is simple - the input signal needs to be measured as often as possible. Indeed, the more often an analog-to-digital converter makes such measurements, the better it will be able to track the slightest changes in the amplitude of the input analog signal.

However, excessively frequent measurements can lead to an unjustified increase in the flow of digital data and a waste of computer resources when processing the signal.

Fortunately, choosing the conversion frequency (sampling frequency) correctly is quite simple. It is enough to refer to Kotelnikov's theorem, well known to specialists in digital signal processing. The theorem states that the conversion frequency must be at least twice the maximum frequency in the spectrum of the signal being converted. Therefore, to digitize without loss of quality an audio signal whose frequencies lie in the 16-20,000 Hz range, the conversion frequency must be no lower than 40,000 Hz.

Note, however, that in professional audio equipment the conversion frequency is chosen several times higher than this value. This is done to achieve very high quality of the digitized sound. Such quality is not needed for speech recognition systems, so we will not dwell on this choice.

What conversion frequency is needed to digitize the sound of human speech?

Since the sounds of human speech lie in the frequency range of 300-4000 Hz, the minimum required conversion frequency is 8000 Hz. However, many computer speech recognition programs use the standard conversion frequency of 44,100 Hz supported by conventional sound adapters. On the one hand, such a conversion frequency does not lead to an excessive increase in the digital data flow, and on the other, it ensures that speech is digitized with sufficient quality.
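Expressed in code, the rule given by Kotelnikov's theorem is a one-line calculation. The illustrative sketch below (the function name is chosen here purely for clarity) computes the minimum sampling frequency for the full audio range and for the speech range:

    def min_sampling_rate(max_signal_frequency_hz):
        """Kotelnikov (Nyquist-Shannon) criterion: sample at least twice as
        fast as the highest frequency present in the signal."""
        return 2 * max_signal_frequency_hz

    print(min_sampling_rate(20000))   # 40000 Hz for the 16-20,000 Hz audio range
    print(min_sampling_rate(4000))    # 8000 Hz for the 300-4000 Hz speech range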

Back in school, we were taught that errors arise with any measurement and can never be completely eliminated. Such errors arise due to the limited resolution of measuring instruments, as well as because the measurement process itself can introduce changes into the measured value.

An analog-to-digital converter represents the input analog signal as a stream of numbers of limited bit depth. Conventional sound adapters contain 16-bit ADCs capable of representing the amplitude of the input signal as 2^16 = 65,536 different values. ADC devices in high-end audio equipment can be 20-bit, providing greater accuracy in representing the amplitude of the audio signal.
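To make the effect of bit depth tangible, here is a small illustrative sketch (not tied to any particular adapter) that rounds a sample in the range -1.0 to 1.0 to the nearest of 2^bits quantization levels and prints the resulting error:

    def quantize(sample, bits):
        """Round a sample in the range [-1.0, 1.0] to one of 2**bits levels."""
        levels = 2 ** bits                 # 65,536 levels for a 16-bit ADC
        step = 2.0 / (levels - 1)          # amplitude range covered by one level
        return round(sample / step) * step

    x = 0.123456789
    for bits in (8, 16):
        q = quantize(x, bits)
        print(bits, q, abs(x - q))         # the error shrinks as bit depth grows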

Modern speech recognition systems and programs were created for ordinary computers equipped with conventional sound adapters. Therefore, to experiment with speech recognition, you do not need to purchase a professional audio adapter. An adapter such as Sound Blaster is quite suitable for digitizing speech for subsequent recognition.

Along with the useful signal, various noises usually enter the microphone - noise from the street, wind noise, extraneous conversations, etc. Noise has a negative impact on the performance of speech recognition systems, so it has to be dealt with. We have already mentioned one of the ways - today's speech recognition systems are best used in a quiet room, alone with the computer.

However, it is not always possible to create ideal conditions, so special methods of getting rid of interference have to be used. To reduce the noise level, special design tricks are applied in microphones, and special filters remove from the spectrum of the analog signal the frequencies that carry no useful information. In addition, a technique such as compression of the dynamic range of input signal levels is used.

Let's talk about all this in order.

A frequency filter is a device that transforms the frequency spectrum of an analog signal. In the process of this transformation, oscillations of certain frequencies are either passed or suppressed.

You can think of this device as a kind of black box with one input and one output. In our case, a microphone will be connected to the input of the frequency filter, and an analog-to-digital converter to the output.

There are several types of frequency filters:

· low-pass filters;

· high-pass filters;

· band-pass filters;

· band-stop filters.

Low-pass filters (low-pass filter) remove from the spectrum of the input signal all frequencies above a certain cutoff frequency, which depends on the filter setting.

Humans cannot hear sounds with frequencies of 20,000 Hz and higher, so such frequencies can be removed from the spectrum without noticeable deterioration in sound quality. As for speech recognition, only the 300-4000 Hz range matters here, so all frequencies above 4000 Hz can be cut out, which significantly reduces the level of high-frequency interference.

Likewise, high-pass filters (high-pass filter) remove from the spectrum of the input signal all frequencies below a certain cutoff frequency.

Since audio signals lie in the range of 16-20,000 Hz, all frequencies below 16 Hz can be cut off without degrading the sound quality. For speech recognition, frequencies below 300 Hz can also be cut out. In this case, all interference whose frequency spectrum lies below 300 Hz is removed from the input signal and will not interfere with the speech recognition process.

A band-pass filter (band-pass filter) can be thought of as a combination of a low-pass and a high-pass filter. Such a filter blocks all frequencies below the so-called lower cutoff frequency, as well as all frequencies above the upper cutoff frequency.

Thus, a band-pass filter that blocks all frequencies except those in the 300-4000 Hz range is convenient for a speech recognition system.

As for band-stop filters, they allow you to cut out all frequencies lying in a given range from the spectrum of the input signal. Such a filter is convenient, for example, for suppressing interference that occupies a certain continuous part of the signal spectrum.

In Fig. 2-6 we showed the connection of such a band-pass filter.

Fig. 2-6. Filtering the audio signal before digitization

It must be said that conventional sound adapters installed in computers include a band-pass filter through which the analog signal passes before digitization. The passband of such a filter usually corresponds to the range of audio signals, namely 16-20,000 Hz (in different sound adapters the upper and lower cutoff values may vary within small limits).

How to achieve a narrower bandwidth of 300-4000 Hz, corresponding to the most informative part of the spectrum of human speech?

Of course, if you have a penchant for designing electronic equipment, you can make your own filter from an operational amplifier chip, resistors and capacitors. This is roughly what the first creators of speech recognition systems did.

However, industrial speech recognition systems must work on standard computer hardware, so the route of building a special hardware band-pass filter is not suitable here.

Instead, modern speech processing systems use so-called digital frequency filters implemented in software. This became possible once the computer's central processor became powerful enough.

A digital frequency filter implemented in software converts an input digital signal into an output digital signal. During the conversion, the program processes in a special way the stream of numerical values of the signal amplitude coming from the analog-to-digital converter. The result of the transformation is also a stream of numbers, but this stream corresponds to the already filtered signal.
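As a sketch of what such a software filter might look like (the article does not reference any particular library; SciPy's signal module is used here merely as one possible implementation), the stream of samples coming from the ADC can be passed through a 300-4000 Hz Butterworth band-pass filter:

    import numpy as np
    from scipy.signal import butter, lfilter

    def bandpass_300_4000(samples, sample_rate_hz):
        """Keep only the 300-4000 Hz band of an already digitized signal."""
        b, a = butter(N=4, Wn=[300, 4000], btype='bandpass', fs=sample_rate_hz)
        return lfilter(b, a, samples)

    # Example: filter one second of random "noise" digitized at 44,100 Hz.
    noisy = np.random.randn(44100)
    filtered = bandpass_300_4000(noisy, 44100)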

While talking about the analog-to-digital converter, we noted such an important characteristic as the number of quantization levels. If a 16-bit analog-to-digital converter is installed in the sound adapter, then after digitization the audio signal levels can be represented by 2^16 = 65,536 different values.

If there are few quantization levels, so-called quantization noise appears. To reduce this noise, high-quality audio digitization systems should use analog-to-digital converters with the maximum available number of quantization levels.

However, there is another technique for reducing the impact of quantization noise on the quality of the audio signal, which is used in digital audio recording systems. With this technique, the signal is passed through a nonlinear amplifier before digitization, which emphasizes low-amplitude signals. Such a device amplifies weak signals more strongly than strong ones.

This is illustrated by the graph of the output signal amplitude versus the input signal amplitude shown in Fig. 2-7.

Fig. 2-7. Nonlinear amplification before digitization

In the step of converting digitized audio back to analog (we'll look at this step later in this chapter), the analog signal is again passed through a nonlinear amplifier before being output to the speakers. This time, a different amplifier is used, which emphasizes high-amplitude signals and has a transfer characteristic (the dependence of the amplitude of the output signal on the amplitude of the input signal) inverse to that used during digitization.
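The article does not name a specific transfer characteristic for this pair of mutually inverse nonlinear transformations; mu-law companding, used in digital telephony, is one well-known example of the idea. A minimal sketch, assuming samples normalized to -1.0...1.0 and the conventional value mu = 255, might look like this:

    import math

    MU = 255.0   # standard companding parameter in telephony

    def mu_law_compress(x):
        """Nonlinear 'amplifier' before digitization: boosts weak signals
        more strongly than strong ones.  x is assumed to lie in [-1.0, 1.0]."""
        return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

    def mu_law_expand(y):
        """Inverse transfer characteristic applied before playback."""
        return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

    for x in (0.01, 0.1, 1.0):
        y = mu_law_compress(x)
        print(x, round(y, 3), round(mu_law_expand(y), 3))  # weak signals gain most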

How can all this help the creators of speech recognition systems?

A person, as is known, recognizes speech spoken in a quiet whisper or in a fairly loud voice quite well. We can say that the dynamic range of loudness levels of successfully recognized speech for a person is quite wide.

Today's computer speech recognition systems, unfortunately, cannot yet boast of this. However, to expand this dynamic range somewhat, the signal from the microphone can be passed through a nonlinear amplifier with the transfer characteristic shown in Fig. 2-7 before digitization. This will reduce the level of quantization noise when digitizing weak signals.

Developers of speech recognition systems, again, are forced to focus primarily on mass-produced sound adapters, which do not provide the nonlinear signal conversion described above.

However, it is possible to create a software equivalent of a nonlinear amplifier that converts the already digitized signal before passing it on to the speech recognition module. Although such a software amplifier cannot reduce quantization noise, it can be used to emphasize the signal levels that carry the most speech information. For example, the amplitude of weak signals can be reduced, thereby ridding the signal of noise.
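A minimal sketch of such a software stage, under the assumption that samples are normalized to the range -1.0 to 1.0 and that anything quieter than a chosen threshold is treated as noise, could look like this (the threshold and attenuation values are arbitrary):

    def suppress_weak_samples(samples, threshold=0.02, attenuation=0.1):
        """Attenuate samples whose amplitude is below the threshold so that
        low-level noise does not reach the speech recognition module."""
        return [s if abs(s) >= threshold else s * attenuation for s in samples]

    print(suppress_weak_samples([0.5, 0.01, -0.3, -0.005]))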

Let us think about why we need to turn up the volume at all: in order to hear quiet sounds that are inaudible in our conditions (for example, when you cannot listen loudly or there is extraneous noise in the room). Is it possible to amplify quiet sounds while leaving loud ones alone? It turns out that it is. This technique is called dynamic range compression (DRC). To do this, the current gain has to be changed continuously: quiet sounds are amplified, loud ones are not. The simplest law of volume change is linear, i.e. the volume changes according to the law output_loudness = k * input_loudness, where k is the dynamic range compression ratio:

Figure 18. Dynamic range compression.

When k = 1, no changes are made (the output volume equals the input volume). When k < 1, the volume will increase and the dynamic range will narrow. Look at the graph (k = 1/2): a quiet sound with a volume of -50 dB will become 25 dB louder, which is significantly louder, while the volume of dialogue (-27 dB) will rise by only 13.5 dB, and the volume of the loudest sounds (0 dB) will not change at all. When k > 1, the volume will decrease and the dynamic range will widen.
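Since this law operates on loudness levels expressed in decibels, it can be sketched in a few lines of Python (illustrative only); the printed values reproduce the example above with k = 1/2:

    def compress_db(level_db, k):
        """Linear dynamic range compression: the output level in dB equals
        the input level in dB multiplied by the compression ratio k."""
        return k * level_db

    for level in (-50.0, -27.0, 0.0):          # quiet sound, dialogue, loudest sound
        print(level, compress_db(level, 0.5))  # -25.0, -13.5, 0.0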

Let's look at the volume graphs (k = 1/2: the dynamic range is compressed by a factor of two):

Figure 19. Loudness graphs.

As you can see, the original contained both very quiet sounds, 30 dB below the dialogue level, and very loud ones, 30 dB above it; the dynamic range was thus 60 dB. After compression, loud sounds are only 15 dB above and quiet sounds only 15 dB below the dialogue level (the dynamic range is now 30 dB). Thus, loud sounds became significantly quieter, quiet sounds became significantly louder, and no overflow occurs!

Now let's look at the histograms:

Figure 20. Compression example.

As you can clearly see, with amplification up to +30 dB the shape of the histogram is well preserved, which means that loud sounds remain well expressed (they do not hit the maximum and are not clipped, as happens with simple amplification). At the same time, quiet sounds are brought out. The histogram shows this poorly, but the difference is very noticeable by ear. The disadvantage of this method is, again, volume jumps. However, the mechanism of their occurrence differs from that of the jumps caused by clipping, and their character is different: they appear mainly with a very strong amplification of quiet sounds (rather than when loud sounds are clipped, as with ordinary gain). An excessive level of compression flattens the sound picture: all sounds tend toward the same loudness and become inexpressive.

Excessive amplification of quiet sounds may cause recording noise to become audible. Therefore, the filter uses a slightly modified algorithm so that the noise level rises less:

Figure 21. Increasing volume without increasing noise.

That is, at a volume level of -50 dB there is an inflection in the transfer function, and the noise is amplified less (yellow line). Without such an inflection the noise would be much louder (gray line). This simple modification significantly reduces the amount of noise even at very high compression levels (1:5 compression in the picture). The "DRC" level in the filter sets the gain for quiet sounds (at -50 dB); the 1/5 compression shown in the figure corresponds to a +40 dB level in the filter settings.
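The exact shape of the modified curve is not given here; the sketch below assumes the simplest variant, in which the slope of the transfer function returns to 1 below the -50 dB inflection point, so that levels quieter than the knee receive no additional gain:

    KNEE_DB = -50.0    # inflection point of the transfer function

    def compress_with_knee(level_db, k):
        """Compress levels above the knee by the ratio k; below the knee the
        gain stays constant, so low-level noise is amplified less than it
        would be by the unbroken compression line."""
        if level_db >= KNEE_DB:
            return k * level_db
        gain_at_knee = KNEE_DB * k - KNEE_DB       # +40 dB for k = 1/5
        return level_db + gain_at_knee

    for level in (-80.0, -50.0, -20.0, 0.0):
        print(level, compress_with_knee(level, 1 / 5))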
