Indoor Localization Using Barely Perceptible Audio Signals
João Moutinho, Diamantino Freitas, Rui Esteves Araújo

This paper presents a new approach to audio-based indoor localization. Using audio signals emitted by a public address sound system, mobile devices can localize themselves globally in indoor environments where global navigation satellite systems are not viable or reliable. Data hiding techniques such as spread spectrum coding and echo hiding make it possible to convey information to a receiver while keeping the added audio content below people's perception. Results demonstrate localization with centimetre-level accuracy and precision, together with successful data transmission, using barely perceptible audio signals.


Introduction
Location awareness in context-based applications is becoming one of the most compelling areas in information and communications technology. For instance, the Global Positioning System (GPS) is built into most mobile devices such as smartphones, laptops and tablets. However, there are many situations, typically indoors, where GPS-based systems do not work properly because their weak radio frequency (RF) signals are strongly attenuated by walls and ceilings. This limitation opens the way for alternative position determination technologies (Merry et al. 2010), namely those involving the use of audio signals. There are many places where an audio-based indoor localization approach could be used: train or subway stations, airports, large department stores, shopping plazas, amusement parks, museums, office buildings, etc. The subway station example is in fact the starting point of this research work, as previous work by some of the authors in the NAVMETRO project (Moutinho 2009, Moutinho et al. 2010) revealed the urgent need for an automatic solution to the personal indoor localization problem. The system developed there is an assisted living technology that supports visually impaired people by providing information and acoustic guidance throughout an indoor space as intricate as a multi-line public subway station. In order to provide routing cues to a destination, selected by means of an interactive response system available through a telephone call, it is fundamental to know where the starting point is. No commercial solution at the time was able to provide a precise, accurate and reliable automatic localization system. Even the most promising one, based on Wi-Fi, had insurmountable problems due to the electromagnetic interference caused mostly by the electric subway vehicles.
A new approach is presented to develop an audio-based indoor localization system that takes advantage of the availability and ease of use of audio technologies. The putative disadvantage that possibly explains the current lack of audio-based approaches in the research community is the fact that the employed audio signals may be heard by the people who use the space. This document presents some possible ways to avoid that and details how such a system can be implemented taking advantage of an existing public address sound system, without affecting the acoustic environment and the people in it. This paper is organized as follows. In Section 2, background and related work concerning indoor localization is described. In Section 3, localization using audio signals is approached in its many dimensions, with an explanation of the several parts of the proposed approach involving data hiding and masking techniques. In Section 4, results are presented regarding the performance of the data hiding methods. Section 5 summarizes the presented work and addresses future work.

Background and related work
Most GPS alternatives, particularly for indoor spaces, are based on wireless communication systems like IEEE 802.11 (Wi-Fi), Ultra Wide Band (UWB) or Radio Frequency Identification (RFID). These approaches are characterized by low accuracy or by the high costs associated with achieving accurate enough ranging (Amundson and Koutsoukos 2009). More recently, with the availability of smartphones and tablets, new approaches that rely on dead reckoning, using fused data provided by Inertial Measurement Units, can provide a localization system almost independent of the infrastructure. However, even with the use of tools that evaluate human gait and behaviour, cumulative errors occur and, in order to provide accurate information, navigational aids are needed, not only to correct positional errors but also to calibrate the dead reckoning algorithms. Absolute reference points become necessary, and the use of inertial information alone does not solve the problem (Jimenez et al. 2009). Other localization possibilities use ultrasonic signals, inaudible to humans, to measure short distances such as those involved in indoor spaces. However, ultrasonic signals require line-of-sight between emitting anchors and the mobile device, as the wavelength of this type of signal is relatively short and significant diffraction or spreading problems occur. As such, many anchors transmitting ultrasonic signals would be required to ensure sufficient coverage of any area. It is also important to consider that these signals are highly attenuated through the air, and both transmitter and receiver need to be custom made, as ultrasound signals are typically not used in everyday life or in off-the-shelf devices. The illustration in Figure 1 compares the most relevant indoor localization technologies by accuracy and by deployment potential. The same illustration also describes the predictable evolution of deployment from the moment the technologies were developed to a possible future.

Localization with audio signals
An audio signal is one of the most difficult types of signal to use in indoor localization, as it imposes several constraints and is severely affected by the acoustic environment. However, it also has several advantages, especially the wide relative bandwidth and the ubiquity of sound-related equipment. Audio-capable devices are present in people's everyday life and, when one considers a possible usage scenario for this purpose, it is almost immediate to assume loudspeakers as fixed anchors and smartphones (along with their microphones) as mobile devices. This would allow widespread dissemination of indoor localization possibilities to everyone in every public space (Moutinho et al. 2015).

System's architecture
A general view of the proposed system is presented in Figure 2. On the infrastructure side, a special audio emission is combined with a possibly pre-existent public address sound programme so as to be as unnoticeable as possible to people while still being well picked up by the mobile receivers. In order to avoid the perception of the added audio signals, spread spectrum and echo hiding techniques are used (Hatfull 2011). These approaches are based on signals of opportunity, as they take advantage of pre-existent signals (Merry et al. 2010). This audio is transmitted by the loudspeakers into the channel (an indoor area). Mobile devices are responsible for receiving the signals broadcast into the acoustic environment and interpreting them to determine their localization, just as a Global Navigation Satellite System receiver does. Considering a room in a building with a pre-existent public address sound system, the only necessary addition is an appliance between the original sound source (possibly a mix of music and voice) and the sound transducers. This appliance, illustrated in Figure 3, would be configured considering the absolute global localization of the beacons (loudspeakers) and the environmental conditions. Previous work on this system (Moutinho et al. 2013) already demonstrated the possibility of using audible sound to perform indoor localization precisely and robustly. Those experimental activities in a real setup have shown that the use of existing loudspeakers and their typical frequency range provides good area coverage and good localization without the need for a dense network of beacons, as near-ultrasound or even common ultrasound solutions typically require.

Spread spectrum
Spread spectrum technology uses different codes to distinguish different emitters, rather than different frequencies or time slots as in other multiple access technologies such as Frequency Division or Time Division. Spread spectrum modulation techniques alter the signal so that its bandwidth becomes much greater than the bandwidth of the original information to be transmitted (Proakis 2003). The bandwidth of the modulated signal is determined by the information to be transmitted and by a spreading code that is responsible for providing a distinct identification to each emitted signal. This code, a pseudo-random noise (PN) sequence, is turned into a low-power signal spread across a wide frequency interval, as Figure 4 depicts. The scaling block of Figure 4 is responsible for levelling the added spread spectrum signal as a function of the energy of the audio signal s(t), or of the environmental noise when s(t) = 0, to ensure the best possible masking conditions. Each loudspeaker's PN sequence should be statistically uncorrelated with the others so that each anchor signal is correctly identified. An illustrative picture is shown in Figure 5, where colours are used to illustrate the different codes present in the signals that are reproduced together with the pre-existent audio transmission. It is also possible to observe that the receiver's side picks up the mixed contribution of all sound sources with different intensities (as a function of the room's frequency response and the distances from the sound sources). The receiver is therefore required to know which signals to expect. Each beacon signal also has unique characteristics so that separation is possible at the mobile device's side. Gold codes are a suitable example of a PN sequence for this purpose, as the cross-correlation between codes is low and the auto-correlation is high (Boney et al. 1996).
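To illustrate why such a low-level PN signal can still be identified by the receiver, the following sketch (Python with NumPy; the code length, noise level and delay are illustrative values, not the system's actual parameters) buries a ±1 chip sequence roughly 5 dB below white background noise and recovers its arrival delay by cross-correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
PN_LEN = 1023            # hypothetical PN code length (chips)
TRUE_DELAY = 777         # unknown arrival delay to be estimated (samples)

# A pseudo-random +/-1 chip sequence standing in for one beacon's PN code
pn = rng.choice([-1.0, 1.0], PN_LEN)

# Received signal: the code embedded at the unknown delay, with white
# background noise about 5 dB above the chip level (SNR ~ -5 dB)
rx = np.zeros(4 * PN_LEN)
rx[TRUE_DELAY:TRUE_DELAY + PN_LEN] += pn
rx += 1.78 * rng.standard_normal(rx.size)

# Cross-correlating with the known code concentrates the code's energy
# into a single lag, so the peak reveals the delay despite negative SNR
corr = np.correlate(rx, pn, mode="valid")
estimated_delay = int(np.argmax(corr))
print(estimated_delay)
```

With this seed the correlation peak lands exactly at the true delay, which is the quantity the time-of-flight ranging described later builds upon.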
Gold codes are also very appropriate because a large number of codes can be generated with good auto-correlation and cross-correlation properties, which is necessary to allow identification of each one out of a large set of beacons. Spread spectrum allows the transmitted signal to have a low power density, since the transmitted energy is spread over a wide band, as Figure 6 illustrates in a situation where a Spread Spectrum Binary Phase Shift Keying audio signal lies approximately 5 dB below the environmental noise level, making it virtually inaudible (Garcia 1999; Johnston 1988). Consequently, the low power density of such a signal will not disturb or interfere with receivers, other mobile devices or persons in the same area.
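The correlation properties that make Gold codes attractive can be checked with a short sketch (Python/NumPy). The degree-5 tap sets [5, 2] and [5, 4, 3, 2] used below form a classic textbook preferred pair for length-31 codes; they are an illustration, not necessarily the pair used in the system:

```python
import numpy as np

def mseq(taps, n):
    """Maximal-length sequence of period 2^n - 1 from a Fibonacci LFSR;
    `taps` are 1-indexed register stages fed back into the shift register."""
    state = [1] * n
    seq = []
    for _ in range(2 ** n - 1):
        seq.append(state[-1])
        fb = 0
        for t in taps:
            fb ^= state[t - 1]
        state = [fb] + state[:-1]
    return np.array(seq)

def gold_family(n=5):
    """Gold family: the two m-sequences of a preferred pair plus their
    XOR combinations over all relative shifts."""
    u = mseq([5, 2], n)
    v = mseq([5, 4, 3, 2], n)
    family = [u, v] + [u ^ np.roll(v, s) for s in range(2 ** n - 1)]
    return [1 - 2 * c for c in family]   # map {0,1} -> {+1,-1}

def pcorr(a, b):
    """Periodic (circular) correlation of two equal-length +/-1 codes."""
    return np.array([np.dot(a, np.roll(b, k)) for k in range(len(a))])

codes = gold_family()
auto = pcorr(codes[2], codes[2])
cross = pcorr(codes[2], codes[3])
print(auto[0], int(np.max(np.abs(cross))))
```

The auto-correlation peak equals the code length (31), while the cross-correlation between distinct family members stays bounded at a small fraction of it, which is what allows each beacon to be separated at the receiver.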

Echo hiding
An interesting method of embedding watermarks into audio data is to take advantage of the human auditory system's tolerance to early sound echoes or reverberations. Echo hiding exploits human perception by adding one of two different kinds of sub-perceptible echoes to segments of the cover audio, which can be detected at reception. The original cover signal is separated into audio segments that are sufficiently large to include a discernible echo and small enough to maximize data rate. Each of these segments is transformed by inserting an echo, which is determined by the binary representation of the watermark (Gruhl et al. 1996). One of two possible echoes, as illustrated in Figure 7 on the left, is applied to each of the previously separated segments according to the watermark. These echo kernels should not exceed the "fusion limit delay for echo perception", beyond which they would significantly affect the cover signal and produce two auditory images of the sound (Haas 1972). This echo delay limit should be around 1 ms so that no effect is perceived. The "zero" and "one" mixer signals, as depicted in Figure 7 on the right, are used to select which echo is used in each segment, according to the corresponding bit of the binary watermark to insert.

Figure 7: Echo hiding kernels on the left and the "zero" and "one" mixer signals according to a binary 1 0 1 1 0 0 watermark example on the right

A general overview is illustrated in Figure 8, where it is possible to observe the processing from the original signal (the cover) to the encoded signal. Recovery of the watermark from the encoded received signal is accomplished by using signal analysis techniques that detect the type of echo, in order to discern whether a "one" or a "zero" was encoded in each segment of the signal. Cepstral analysis, as in equation (1), allows unique echo detection possibilities and makes the echo effect more salient and easier to detect. Its autocorrelation provides even more distinctive peaks to correctly receive the watermark information.
c(n) = F^-1( ln |F( x(n) )| )    (1)

where x(n) represents a segment of the encoded signal, F represents the Fourier transform and F^-1 the inverse Fourier transform.
Figure 9: Autocepstrum of an Echo Hiding frame where the peak is found at a 600-sample delay, allowing it to be interpreted as a "zero" bit frame; an 800-sample delay would be interpreted as a "one" bit

Interpreting the series of echo types and converting it into a binary string allows the watermark to be recovered. The plot of Figure 9 depicts a situation where a "zero", corresponding to a 600-sample delay, was encoded in a 1400-sample frame. The autocepstrum of the received frame reveals its peak distinctly, while no peaks are found around the 800-sample delay where a "one" would be expected. This method provides a data rate of about 16 bits per second on average, with minimal cover signal degradation over a wide range of host signals. Using music as a cover signal, watermark recovery rates are 100%, while remaining imperceptible even to very acute listeners.
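A minimal sketch of this embed/detect loop is shown below (Python with NumPy). The frame length, echo delays and echo amplitude are illustrative choices rather than the paper's parameters, a white-noise cover stands in for music, and plain cepstral peaks are compared instead of the autocepstrum:

```python
import numpy as np

FRAME = 4096          # samples per frame, one watermark bit each (illustrative)
D0, D1 = 50, 100      # echo delays (in samples) encoding "0" and "1"
ALPHA = 0.7           # echo amplitude (kept much lower in practice)

def embed(cover, bits):
    """Add an echo at delay D0 or D1 to each frame of the cover signal."""
    out = cover.copy()
    for i, bit in enumerate(bits):
        seg = cover[i * FRAME:(i + 1) * FRAME]
        d = D1 if bit else D0
        echo = np.concatenate([np.zeros(d), seg[:-d]])
        out[i * FRAME:(i + 1) * FRAME] = seg + ALPHA * echo
    return out

def extract(signal, nbits):
    """Recover each bit by comparing the real cepstrum at the two delays."""
    bits = []
    for i in range(nbits):
        seg = signal[i * FRAME:(i + 1) * FRAME]
        ceps = np.fft.irfft(np.log(np.abs(np.fft.rfft(seg)) + 1e-12))
        bits.append(1 if ceps[D1] > ceps[D0] else 0)
    return bits

rng = np.random.default_rng(0)
watermark = [1, 0, 1, 1, 0, 0]                       # the Figure 7 example
cover = rng.standard_normal(len(watermark) * FRAME)  # noise stands in for music
recovered = extract(embed(cover, watermark), len(watermark))
print(recovered)   # -> [1, 0, 1, 1, 0, 0]
```

The cepstrum turns the multiplicative echo in the spectrum into an additive spike at the echo's delay, which is why a simple peak comparison suffices in this noise-free channel.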

The Steganographer block
As previously mentioned, concealing data in an audio stream greatly depends on the type of sound emitted in the pre-existent transmission. Considering the typical public address sound system in a mall or in public transportation systems, there are essentially three possible states: music, voice or no emission. Each of these possibilities implies different handling regarding steganography. For instance, one can easily imagine that it will not be feasible to echo-hide data when there is no sound emission at all. Therefore, the steganographer block is responsible for detecting each of these possible states and for choosing the best method for the transmission of the data. Concerning state detection, the live audio signal is broken into short-term non-overlapping windows (frames). The frame duration determines the delay that is introduced in the public address sound emission, and allows information to be embedded in it depending on its content (music/voice/silence). The frame classification is based on standard low level (SLL) features that need to be computed easily so that the process can run in real time.
For each frame, two features are currently calculated: zero crossing rate and signal energy. The zero crossing rate is characterized in equation (2): in discrete-time signals, a zero crossing occurs when successive samples have different algebraic signs, and the rate at which zero crossings occur is a simple measure of the frequency content of the frame under analysis. The signal energy provides a representation that reflects amplitude variations and is defined in equation (3).

Z = (1/(2N)) * sum_{n=1}^{N-1} | sgn(x(n)) - sgn(x(n-1)) |    (2)

E = (1/N) * sum_{n=0}^{N-1} x(n)^2    (3)

These two features were found to be sufficient to correctly detect the three possible states. This classification does not detect when music and voice are emitted together; in those situations, the state is classified as music, as the musical content is considered prominent. The choice of the method to conceal data in each frame is therefore based on this classification. Spread Spectrum, by its characteristics, is permanently being sent: it is barely perceptible at its low level, slightly above the environmental noise level. It is important to remember that this signal itself is used to perform time delay estimation and is therefore always necessary. However, in a music/voice content frame, it would be necessary to raise the Spread Spectrum signal level to assure successful data transmission, increasing the risk of it becoming perceptible to people. Therefore, in those frames, Echo Hiding is used to transmit the information, while the Spread Spectrum signal, even though at smaller amplitude levels, continues to be used to evaluate the distance between the beacon and the mobile device, as the level required for the cross-correlation methods to perform time delay estimation at the receiver is low and barely perceptible to listeners. Voice-only content frames are handled just as the music ones, as Figure 4 depicts.
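The two SLL features and a toy decision rule can be sketched as follows (Python/NumPy; the sampling rate, frame length and both thresholds are hypothetical values chosen for the example, not the paper's tuned ones):

```python
import numpy as np

RATE = 16000      # assumed sampling rate (Hz)
FRAME = 1024      # assumed short-term window length (samples)

def zero_crossing_rate(frame):
    """Eq. (2): fraction of successive sample pairs whose signs differ."""
    signs = np.sign(frame)
    signs[signs == 0] = 1          # treat exact zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))

def energy(frame):
    """Eq. (3): mean squared amplitude of the frame."""
    return float(np.mean(frame ** 2))

def classify(frame, e_min=1e-4, z_voice=0.15):
    """Toy decision rule with hypothetical thresholds: very low energy
    means no emission; a high zero crossing rate is taken as voice-like,
    a steadier, lower rate as music-like."""
    if energy(frame) < e_min:
        return "no emission"
    return "voice" if zero_crossing_rate(frame) > z_voice else "music"
```

For instance, a 440 Hz tone yields a ZCR of about 2·440/16000 ≈ 0.055 and classifies as "music", while a silent frame classifies as "no emission".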
The altered sound signal, to be sent to the public address sound system, is delayed by one frame duration so that the original signal can first be classified as a "music/voice" or "no emission" frame. That classification allows the steganographic method to be chosen according to the previously defined criterion illustrated in Figure 10. This required delay, a few hundred milliseconds long, does not affect the transmission, as it is not perceptible to people and does not compromise any plausible application of a public address sound system. Even when time-critical public address information is emitted, such as a train departure announcement, the exact instant when the sound actually radiates from the loudspeakers will not affect its function, as this type of information does not require real-time emission. There is no problem if the audio announcement comes a second later.

Localization estimation
To estimate localization it is necessary to establish a relation with some reference frame fixed with respect to the Earth. There are many possibilities for acquiring the necessary measurements (Zekavat 2011). The most appropriate one, if one considers using a pre-existent public address sound system, is to measure the distance to each of the emitting loudspeakers. Then, by circle intersection, it is possible to determine the point where the receiver device is. The process is illustrated in Figure 11, which describes a scenario with four available beacons.

Figure 11: Illustrating sequence of the localization estimation process using distances to each beacon

However, when the distance measurements are affected by noise, the circles will not intersect at a single point and the uncertainty will instead define an area, as large as the errors in the information. In that case, the localization estimate may be obtained by non-linear estimation methods or by a geometrical approach. This so-called "range measurement" is performed using a time-of-flight technique that calculates distance as the product of the time the signal takes to travel from the loudspeaker to the microphone and the speed of sound (which can be known as a function of the temperature and humidity of the environment). To measure this time of flight it is necessary to determine the delay of each signal when it arrives at the mobile device. In this case, this is achieved by using cross-correlation techniques that identify the codes that were sent in the audio signals of each loudspeaker (Xinwang et al. 2013).
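Converting times of flight into ranges and intersecting the resulting circles can be sketched as below (Python/NumPy). This is a linearized least-squares variant rather than the paper's non-linear estimator, and the room layout and positions are illustrative:

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s at ~20 degrees C; in practice a function
                         # of the temperature and humidity of the air

def ranges_from_tof(tofs):
    """Range measurement: distance = time of flight x speed of sound."""
    return SPEED_OF_SOUND * np.asarray(tofs)

def trilaterate(beacons, ranges):
    """Least-squares position from ranges to known beacons. Subtracting
    the first circle equation from the others removes the quadratic
    terms, leaving a linear system in (x, y)."""
    beacons = np.asarray(beacons, dtype=float)
    ranges = np.asarray(ranges, dtype=float)
    A = 2.0 * (beacons[1:] - beacons[0])
    b = (np.sum(beacons[1:] ** 2, axis=1) - np.sum(beacons[0] ** 2)
         - ranges[1:] ** 2 + ranges[0] ** 2)
    pos, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pos

# Four loudspeakers in the corners of a hypothetical 7 m x 6 m room
beacons = [(0.0, 0.0), (7.0, 0.0), (7.0, 6.0), (0.0, 6.0)]
true_pos = np.array([2.0, 3.0])
tofs = np.linalg.norm(np.array(beacons) - true_pos, axis=1) / SPEED_OF_SOUND
print(trilaterate(beacons, ranges_from_tof(tofs)))   # -> approx. [2. 3.]
```

With noisy ranges the same least-squares solve returns the point that best reconciles the non-intersecting circles, which is the role the non-linear estimator plays in the actual system.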

Results on Data Hiding experiments
A localization experiment using range estimation with spread spectrum audio signals, covering 23 receiver positions and 460 localization estimations, was performed in a real setup (Moutinho et al. 2013). This 42 square metre ordinary room provided a test bed for several localization experiments. The results are summarized in Table 2. Spread Spectrum low-energy signals were emitted by 4 ordinary loudspeakers, one placed in each corner of the room. These signals were used at the mobile device end to estimate the range to each loudspeaker by measuring the time of flight. Then, using a non-linear estimation method, the several ranges were used to estimate the localization of the mobile device (Zekavat 2011).
Table 2: Localization average accuracy and precision

    Average accuracy:   4.5 cm
    Average precision:  5.4 cm

The experiment transmitted data with no pre-existent audio programme (s(t) = 0), in the presence of a panel of 5 persons (with no known hearing impairment) who were asked to classify the added audio content with three possible results: perceptible, barely perceptible and non-perceptible. Due to the inherent characteristics of this low-energy, noise-like broadband signal and its ability to be masked by the environmental noise, the spread spectrum signal level used for these results was evaluated by the whole panel as barely perceptible. In another set of experiments, regarding the possibility of allowing the mobile device to globally localize itself, a downlink transmission was established between the infrastructure and a mobile device. The previously discussed data hiding methods were used to transmit global position information over a 4 metre distance between loudspeaker and microphone. The results are summarized in Table 3, where bit rate and bit error rate are evaluated in the limit situation where people's perception is avoided and data is successfully transmitted (using error correction methods). The methods have shown very different bit rates: Spread Spectrum was able to achieve 600 bit/s, while Echo Hiding did not exceed 16 bit/s, quite in line with the literature (Gruhl et al. 1996).

Table 3: Bit rate and bit error rate of the Spread Spectrum and Echo Hiding methods, using the minimum signal-to-noise ratios that achieve successful data transmission with error correcting codes while avoiding people's auditory perception

                        Spread Spectrum   Echo Hiding
    Bit rate (bit/s)          600              16
    Bit error rate (%)          8              12

These very different maximum bit rates for the two data hiding methods are predictable, as the methods rely on very different approaches. Spread Spectrum transmits data very efficiently, as the information itself is used as a noise-like signal; bit rate is therefore only limited by the frequency range imposed by the chip time, the sampling frequency and the PN sequence used. Echo Hiding is limited by the ability to separate the environment's acoustic reflections and the direct sound component from the echoes introduced by the method.
Although Spread Spectrum may therefore seem the obvious choice, its data transmission performance is significantly affected in moments when music or speech is being transmitted; Echo Hiding, which requires precisely such a cover signal, complementarily assures data transmission in those moments. A 16 bit/s data transmission speed is sufficient to convey the information on the infrastructure's global position. A music-intensive scenario in the public address sound system may delay initial position acquisition (a condition similar to GNSS under difficult satellite line of sight). Nevertheless, this will only limit applications where the localization refresh rate needs to be higher than usual. The steganographer block successfully classified each moment ("music/voice" or "no emission") in the conducted experiments. As it imposes a delay on the public address sound transmission, the classification allows deciding whether or not a given moment is suitable for Echo Hiding, based on the SLL features. As these features are very distinctive, classification results are highly successful.

Conclusions and future work
One of the main reasons why most indoor localization approaches do not use audio signals is that such signals may be heard by people, disturbing the acoustic environment and annoying anyone present in the indoor space. Using lower frequency sound, off-the-shelf loudspeakers can be employed, and everyday microphones like the ones in cell phones can capture the signals. The suggested approach of using data hiding methods, such as Spread Spectrum or Echo Hiding, to avoid people's perception while enabling the transmission of the required signals has proven successful. Using these methodologies, range estimation and absolute (global) localization are achieved, allowing the use of audio signals for indoor localization purposes without significantly disturbing the acoustic environment. Future work will focus on providing the steganographer block with the ability to automatically adjust the Spread Spectrum signal intensity, adapting it to the audio signal or to the environmental noise while minimizing its perception.