Audio can play a significant role in ensuring events and incidents are automatically detected and responded to. While some audio detection options have specialist applications, such as gun shot detection, others enable a wide range of sounds to be used as full or partial triggers for both alarms and automation. The challenge is determining whether a site requires basic audio detection or sound classification, as the two are very different.
The use of audio detection in video surveillance solutions has always created a fair degree of controversy. While many people are willing to be subjected to video-based surveillance as a part of creating secure and safe environments, any mention of audio surveillance immediately instigates debates about privacy. People have more liberal attitudes to being ‘watched’ than ‘listened to’. This is based upon the thinking that if you are behaving within the law, you have nothing to fear, but ‘eavesdropping’ is perceived as being a negative approach and an invasion of personal rights.
In any society that values free speech, the ability to speak without fear of reprimand resonates highly with data subjects. As such, while audio offers a lot in terms of situational awareness and qualifying what might be inconclusive surveillance, its value is rarely realised in security systems used for surveillance and/or monitoring.
However, audio-based functionality does not solely consist of speech capture and review, and it is non-speech implementations of audio technology that are increasingly useful in a wide range of smart applications.
Often, debate about the rights of data subjects to enjoy free speech is something of a red herring, as in many applications it is not speech which is being detected. What the system monitors are certain sounds which are exceptions to ambient background noise, and these are then used as a trigger. Whether the exception is created based upon volume or sound type depends very much upon the application and the exceptions being detected.
Audio, as a term, has numerous connotations with regard to features and functions in video surveillance. It is predominantly associated with verbal communication. For example, the term ‘two-way audio’ is generally understood to indicate a device or system component which allows verbal communication between a person close to an edge device and an operator in a control room or administrator at the ‘centre’ of the system.
Acceptable uses of two-way audio include help-points, customer support, intercom services, communications with personnel and verbal warnings with regard to security or safety. However, as soon as the functionality is used to ‘listen in’, it becomes a taboo subject.
The implementation of intelligent video analytics (IVA) in recent years has significantly expanded the boundaries of system performance, allowing many of the benefits of video to be fully exploited by those seeking an advanced level of protection. A well-designed and correctly implemented video system using IVA will often match or outperform most other technologies. Audio also has a role to play in such solutions when deployed as a trigger element.
Detection and Classification
Audio detection and the classification of sounds have become realistic and affordable options in surveillance applications. The technology has been driven by advances in specific audio-based analytics algorithms, coupled with the shift to GPU-based hardware. As a result, a more proactive approach to the use of audio data as a trigger event is possible.
Systems can initiate actions, events and alerts based upon a wide range of sounds. Triggers can be caused by ‘exceptional’ volume levels or by certain types of sounds.
In the first instance, audio detection generally senses changes in typical ambient sound levels, such as sudden increases in volume. These can work well within certain environments. For example, an office which handles customers or clients may generally have a low level of ambient noise. If a customer starts shouting or screaming, the volume spikes and audio detection will identify this.
If basic audio detection only uses volume levels as the basis for judging exceptions, then a wide range of other sounds – laughter, calling after a customer who may have left an item at the counter, staff issuing general instructions to a crowd, a passing vehicle backfiring – could also trigger an alert or action.
The issue with basic audio detection is that it is monitoring for defined parameters across a wide range of sounds. A shout, a loud laugh, a dog’s bark, a window being broken or an item being dropped can all be seen as exceptions, because there is a chance the resulting sound will go beyond the established threshold. Volume-based systems are simply looking for a spike in sound intensity rather than a specific type of noise.
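To illustrate why volume-based detection cannot tell one loud sound from another, the following is a minimal sketch of threshold detection against an ambient baseline. It is not taken from any specific product: the function names, the 15 dB margin and the 20-frame history are assumptions chosen purely for illustration, and audio frames are assumed to arrive as lists of samples in the range -1.0 to 1.0.

```python
import math
from collections import deque
from statistics import median


def rms_db(frame):
    """Return the RMS level of one frame of samples (floats in -1..1) in dBFS."""
    if not frame:
        return -120.0
    mean_square = sum(s * s for s in frame) / len(frame)
    # Floor the value so that silence does not produce log(0).
    return 10.0 * math.log10(max(mean_square, 1e-12))


def detect_spikes(frames, margin_db=15.0, history=20):
    """Yield the index of each frame whose level exceeds ambient by margin_db.

    The ambient level is the median of the last `history` frames that did
    not trigger. Both parameters are illustrative assumptions.
    """
    ambient = deque(maxlen=history)
    for index, frame in enumerate(frames):
        level = rms_db(frame)
        if len(ambient) == history and level > median(ambient) + margin_db:
            yield index  # exception: a spike in sound intensity
        else:
            ambient.append(level)  # non-triggering frames update the baseline
```

Nothing in this logic inspects what the sound actually is. A scream, a loud laugh and a dropped item that reach the same level all produce the same result.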
In the case of sound classification, a more complex and accurate layer of analytics is added. It is still possible to trigger an event following unexpected noises and volume increases, but these can be filtered by the type of sound. By recognising specific classifications of sound to generate triggers, the system will be able to differentiate between different types of sound.
Analysing audio
Audio analytics deploying basic volume-based audio detection will be somewhat limited. This approach is best suited to closed sites or locations where exceptional sounds are unlikely to be confused with innocuous noises.
In such cases, unless filters can be applied, it must be accepted that a window being broken, a gun shot or a scream may all be treated in the same way as thunder, a vehicle with a faulty exhaust or an emergency services vehicle with its siren sounding.
Audio analysis algorithms have improved in recent years (due in no small part to the greater use of GPUs), and today sound classification makes audio detection a worthwhile consideration. Detected sounds need to match a range of criteria before triggering an action or alarm: these might include (but will not be limited to) sound type, frequency or multiple frequencies, volume, duration and characteristics.
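The idea of matching a detected sound against several criteria before raising an alarm can be sketched as a simple rule check. This is a hedged illustration only: `SoundEvent`, `TriggerRule`, the label strings and all default thresholds are hypothetical, and a real classifier would supply its own labels and scores.

```python
from dataclasses import dataclass


@dataclass
class SoundEvent:
    label: str         # class reported by the classifier, e.g. "glass_break"
    confidence: float  # classifier score, 0.0 to 1.0
    level_db: float    # level above ambient, in dB
    duration_s: float  # length of the detected sound, in seconds


@dataclass
class TriggerRule:
    label: str
    min_confidence: float = 0.8  # assumed default, not a recommended value
    min_level_db: float = 0.0
    min_duration_s: float = 0.0
    max_duration_s: float = float("inf")

    def matches(self, event: SoundEvent) -> bool:
        """An event must satisfy every criterion, not just volume."""
        return (
            event.label == self.label
            and event.confidence >= self.min_confidence
            and event.level_db >= self.min_level_db
            and self.min_duration_s <= event.duration_s <= self.max_duration_s
        )


def should_trigger(event, rules):
    """Return True if any configured rule matches the event."""
    return any(rule.matches(event) for rule in rules)
```

With a single rule for `"glass_break"`, a louder and longer event labelled `"thunder"` would not trigger, whereas a purely volume-based detector would treat both as exceptions.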
In some applications, it might also be appropriate to include key word recognition and behavioural analysis to spot high risk exceptions. Whether such an approach is acceptable or not will ultimately depend on the operational requirements of the system and the risk being protected against. If the role of the analytics is to alert an operator or security personnel to a high risk event such as a threat being made at an airport, it may be acceptable. However, if it is used in the workplace to trigger an action when a member of staff complains about their employer, the system may be violating rights with regard to privacy.
Where sound classification can make a significant difference is with regard to filtering out false activations. For example, in a closed building at night or during weekends, sounds such as those created by the general building fabric will be ignored. With sound classification, breaking glass can specifically be identified as a trigger that should warrant further investigation. If the security team has an on-site presence, such alerts could be sent to an operator or to a patrol via a handheld device.
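The out-of-hours scenario above might be expressed along the following lines. This is a sketch under stated assumptions: the opening hours, the label sets and the destination names are invented for illustration, and it covers only the closed-building case. Sending the alert to a patrol when one is on site, and to an operator otherwise, is also an assumption.

```python
from datetime import datetime, time

# Hypothetical label sets; a real classifier defines its own classes.
IGNORED_WHEN_CLOSED = {"building_creak", "hvac"}
INVESTIGATE = {"glass_break"}


def site_is_closed(moment, opens=time(8, 0), closes=time(18, 0)):
    """Treat weekends and hours outside opens..closes as closed."""
    if moment.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (opens <= moment.time() < closes)


def route_alert(label, moment, patrol_on_site):
    """Return an alert destination for an out-of-hours sound, or None."""
    if not site_is_closed(moment):
        return None  # this sketch only handles the closed-building case
    if label in IGNORED_WHEN_CLOSED:
        return None  # sounds from the building fabric are filtered out
    if label in INVESTIGATE:
        return "patrol_handheld" if patrol_on_site else "operator"
    return None
```

For example, `route_alert("glass_break", datetime(2024, 1, 6, 14, 0), True)` falls on a Saturday and is routed to the patrol's handheld device, while a `"building_creak"` at the same moment is ignored.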
As with any detection technology, if the rate of nuisance alarms is high, the effectiveness of the system may suffer as events will be ignored. Therefore, sound classification becomes more important to ensure an effective solution.
If a more intelligent approach is required, the ability to filter and identify whether an alert is created by breaking glass, a gun shot, a crying baby, an impact, aggressive behaviour, fire or smoke alarms, keywords or machinery malfunctioning allows audio analytics to automate actions both for security and business management purposes.
The right delivery
There are a variety of ways to implement audio analytics. Many camera manufacturers are utilising spare processing power in devices to add audio detection or audio classification analytics.
Where a more specialised approach is required, many of the ‘open platform’ cameras and encoders allow third-party apps to run, providing dedicated audio analytics options.
The app-based approach reduces costs, as often there is not a requirement for a full suite of audio sensors. App-based analytics permit a ‘mix and match’ approach to maximise the potential from significant and powerful detection options.
Where site-wide sound classification is required, the best approach is to deploy a dedicated server running multiple channels of analytics software. This allows audio analytics to operate on an optimised platform, but does increase the capital investment for the user.
Increasingly, VMS providers are also working with providers of audio analytics. This means audio detection options can be added and managed directly from the VMS GUI, and processing can be allocated to specific cameras or groups of devices.
Whichever route is taken, professional audio analytics can significantly enhance situational awareness.