
Audio Detection and Sound Classification

by Benchmark

Increasingly, audio plays a positive role in ensuring that events and incidents are automatically responded to. While some audio technologies have specialist applications, such as gun shot detection, others enable a wide range of sounds to be used as full or partial triggers for both alarms and automation. The challenge is determining whether a site requires audio detection or sound classification, as the two are very different.

The use of audio in video surveillance solutions has always created a fair degree of controversy. While many are willing to be subjected to video-based surveillance as a part of creating secure and safe environments, any mention of audio surveillance immediately instigates debates about privacy. People have more liberal attitudes to being ‘watched’ than ‘listened to’. This is based upon the thinking that if you are behaving within the law, you have nothing to fear, but ‘eavesdropping’ is negative and an invasion of personal rights.

In a society that values free speech, the ability to speak without fear of reprimand resonates highly with data subjects. As such, while audio offers a lot in terms of situational awareness, its value is rarely realised in security systems used for surveillance and/or monitoring.

However, audio-based functionality does not solely consist of speech capture and review, and it is non-speech implementations of audio technology that are increasingly useful in a wide range of applications.

Often, the use of the word ‘audio’ is something of a red herring. What we are really considering is the use of certain sounds, as exceptions to ambient background noise, being used as part of a trigger mechanism. Whether the exception is created by volume or sound type depends very much upon the application and the exceptions being detected.

Audio, as a term, has numerous connotations with regard to features and functions in video surveillance. It is predominantly associated with verbal communication. For example, the term ‘two-way audio’ is generally understood to indicate a device or system component which allows verbal communication between a person close to an edge device and an operator in a control room or administrator at the ‘centre’ of the system.

Acceptable uses of two-way audio include help-points, customer support, intercom services, communications with personnel and verbal warnings with regard to security or safety. However, as soon as the functionality is used to ‘listen in’, it becomes a taboo subject.

Through the implementation of intelligent video analytics (IVA) in recent years, the boundaries of system performance have been expanded, allowing many of the benefits of video to be more fully exploited by those seeking an advanced level of protection. The power offered by a well-designed and correctly implemented video system using IVA is often unmatched by other technologies. Audio also has a role to play in such solutions.

Detection and Classification

Audio detection and the classification of sounds have become realistic and affordable options. This has been driven by advances in specific audio-based analytics algorithms, coupled with the shift to GPU-based hardware. As a result, a more proactive approach to the use of audio data is possible.

Systems can trigger actions, events and alerts based upon a wide range of sounds. Triggers can be caused by ‘exceptional’ volume levels or by certain types of sounds. It is important to take note of the word ‘exceptional’, as well as considering the type of analytic being deployed: audio detection or sound classification.

In the first instance, audio detection generally looks for changes in the typical ambient sound levels, such as increases in volume. These can work well within certain environments. For example, an office which handles customers or clients may generally have a low level of ambient noise. If a customer starts shouting, audio detection will identify this.

If basic detection only uses volume levels as the basis for judging exceptions, then a wide range of other sounds – laughter, calling after a customer who may have left an item at the counter, staff issuing general instructions to a crowd – could also trigger an alert or action.

The issue with audio detection is that it applies the same defined parameters to a wide range of sounds. A shout, a laugh, a dog’s bark, a window being broken or a party popper being let off can all be seen as exceptions because they go beyond an established threshold. Volume-based systems are simply looking for a spike in loudness rather than a specific type of noise.
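To see why a purely volume-based trigger cannot tell a shout from a laugh, consider a minimal Python sketch of the idea. It assumes audio arrives as frames of samples normalised between -1 and 1. The function names, the 15 dB margin and the smoothing factor are illustrative assumptions, not values taken from any particular product.

```python
import math


def rms_db(frame):
    """Return the RMS level of a frame of samples (floats in -1..1) in dBFS."""
    if not frame:
        raise ValueError("frame must not be empty")
    mean_square = sum(s * s for s in frame) / len(frame)
    # The floor avoids log10(0) on digital silence.
    return 10.0 * math.log10(max(mean_square, 1e-12))


def detect_volume_exceptions(frames, margin_db=15.0, smoothing=0.1):
    """Return indices of frames that exceed the ambient level by margin_db.

    The ambient level is an exponential moving average of earlier frames
    that were not themselves flagged as exceptions.
    """
    exceptions = []
    ambient_db = None
    for index, frame in enumerate(frames):
        level = rms_db(frame)
        if ambient_db is None:
            ambient_db = level  # the first frame seeds the baseline
            continue
        if level - ambient_db > margin_db:
            # Loud frame: report it, but keep it out of the baseline.
            exceptions.append(index)
        else:
            ambient_db += smoothing * (level - ambient_db)
    return exceptions
```

Nothing in this logic considers what the sound actually is. Any frame that is loud enough relative to the ambient level is reported, whether it contains a scream or a party popper.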

In the case of sound classification another layer of analytics is added. It is still possible to trigger an event following unexpected noises and volume increases, but these can be filtered by the type of sound.

By recognising specific classifications of sound to trigger actions, events or alarms, the system will be able to differentiate gun shots from breaking glass or sirens, for example.

Analysing audio

Audio analytics deploying basic audio detection functionality can be somewhat limited. This approach is best suited to detect sounds that are unlikely to be replicated by innocuous noises.

Unless multiple filters can be applied, it must be accepted that a window being broken, a gun shot or a scream may all be treated in the same way as thunder, a vehicle with a faulty exhaust, an emergency services vehicle with siren sounding or even an ice cream van!

As audio analysis algorithms improve (due in no small part to the greater use of GPUs), sound classification is making audio detection a more worthwhile consideration. Sounds will need to match a range of criteria before triggering an action or alarm: these might include (but will not be limited to) sound type, frequency or multiple frequencies, volume, duration and characteristics. In some applications, it might also be appropriate to include keyword recognition and behavioural analysis to spot exceptions.
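The idea of a sound having to match several criteria at once can be sketched in Python as follows. This is an illustration only: it assumes an upstream classifier has already produced a label and confidence score for each sound, and the class names, field names and numeric thresholds are all hypothetical rather than drawn from any real analytics package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoundEvent:
    """A classified sound, as an assumed upstream classifier might report it."""
    label: str
    confidence: float   # 0.0 to 1.0
    level_db: float     # level in dBFS
    duration_s: float
    dominant_hz: float


@dataclass(frozen=True)
class TriggerRule:
    """Criteria a sound must satisfy before an action or alarm is raised."""
    label: str
    min_confidence: float
    min_level_db: float
    max_duration_s: float
    band_hz: tuple      # (lowest, highest) acceptable dominant frequency

    def matches(self, event):
        low_hz, high_hz = self.band_hz
        # Every criterion must hold; failing any one suppresses the trigger.
        return (
            event.label == self.label
            and event.confidence >= self.min_confidence
            and event.level_db >= self.min_level_db
            and event.duration_s <= self.max_duration_s
            and low_hz <= event.dominant_hz <= high_hz
        )


def should_trigger(event, rules):
    """Return True if the event satisfies at least one configured rule."""
    return any(rule.matches(event) for rule in rules)
```

With a single rule configured for breaking glass, a loud siren would be ignored however high its volume, while a short, confidently classified glass break would raise an alert. The thresholds themselves would need to be established through field trials at the site.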

Whether such an approach is acceptable or not will ultimately depend on the operational requirements of the system and the risk being protected against. If the role of the analytics is to alert an operator or security personnel to a high-risk event, such as a threat being made at an airport, it may be acceptable.


However, if it is used in the workplace to trigger an action when a member of staff complains about their employer, the system may be violating rights with regard to privacy.

Where sound classification can make a significant difference is with regard to filtering out false activations. For example, in a closed building at night or during weekends, sounds such as those created by the general building fabric should be ignored. With sound classification, breaking glass can specifically be identified as a trigger that should warrant further investigation.

If the security team has an on-site presence, such alerts could be sent to an operator or to a patrol via a handheld device. As with any detection technology, if the rate of nuisance alarms is high, the effectiveness of the system may suffer as events will inevitably be ignored. Therefore, sound classification becomes more important to ensure an effective solution.

If a more intelligent approach is required, the ability to filter and identify whether an alert is created by breaking glass, a gun shot, a crying baby, an impact, aggressive behaviour, fire or smoke alarms, keywords or machinery malfunctioning allows audio analytics to automate actions both for security and business management purposes. This enables a higher degree of flexibility.

The right delivery

Installers and integrators have a variety of ways to implement audio detection and its associated analytics. A growing number of camera manufacturers are now utilising spare processing power in their devices to add either audio detection or audio analytics functionality. As such, it is increasingly becoming a standard feature.

Where a more specialised approach is required, many of the ‘open platform’ cameras and encoders allow installers and integrators to run third party Apps offering higher level audio analytics. These enable installers and integrators to select specific audio analytics software options for any given application.

The app-based approach can reduce costs, as many sites will not require a full suite of audio sensors. As with all app-based functionality, installers and integrators can take a ‘mix and match’ approach to ensure they maximise the potential on offer from what is a very significant and powerful detection option.

As with many things in surveillance today, there exists something of a debate over whether video analytics are better served by being executed centrally or at the edge. This is also true for audio-based analytics.

Some will argue that the process is optimised by deploying a dedicated server running multiple channels of analytics software. This does allow all analytics to operate on an optimised platform, but can also increase the capital investment for the user. Others believe that analytics at the edge sits better with modern system design. It also allows the use of specific analytics at certain locations, thus enabling a best-of-breed solution.

Increasingly, VMS providers are also working with providers of the leading audio analytics. This means that audio detection options can be managed directly from the VMS GUI and processing can be allocated to specific cameras or groups of devices.

Some audio analytic providers claim their audio sensors are compatible with low cost microphones.

Given that many cameras are equipped for two-way audio, it makes sense to minimise installation time and use these rather than fitting discrete devices. Of course, it pays to be prudent and carry out field trials with the specified audio package prior to installation.

In summary

Audio detection for alarm triggering is widely available as an integral feature of cameras and encoders. Whilst the majority of options are limited by basic filtering, they can still deliver benefits in a wide range of applications.

For those seeking a higher degree of intelligence, systems using advanced processing and classification are becoming more common and easier to implement. If seeking such solutions, it is worth looking at advanced cameras with good audio analytics, an open platform device or a compatible VMS; a dedicated server could also be deployed if a variety of analytics are to be implemented.

Whichever route is taken, professional audio analytics can significantly enhance situational awareness.


The Hartford Police Department, based in the capital of Connecticut, makes use of ShotSpotter, an audio-based analytics tool that identifies gunfire and plots its location. ShotSpotter uses acoustic sensors to detect and notify authorities about gun-based incidents in real time. The system generates alerts that include precise location information, including latitude and longitude, along with corresponding data including information about the type of gunfire. This information can be delivered to any browser-enabled mobile device as well as to a central control room.

The information gathered by ShotSpotter is sent back to the police department’s XProtect Corporate VMS system, which in turn links with The Hawkeye Effect geospatial mapping tool. The software uses GPS location mapping to drive absolute positioning PTZ cameras, tracking events as they happen.
The system utilises triggers – in the Hartford Police Department case the alerts from ShotSpotter – to track events and provide real-time visual verification following an incident.

Because an exact location is known, the movements of suspects can be tracked and visually verified, allowing the police to maximise resources, ensuring that differing levels of response are implemented.
The system ‘drives’ the absolute positioning PTZ cameras using the X-Y coordinates captured by ShotSpotter to cover entry and egress roads and other significant points within the area of the detected gunshot.
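As a generic illustration of how a reported location can be turned into a camera direction, the Python sketch below computes the compass bearing from a camera to an event using the standard initial-bearing formula. It is not a description of how ShotSpotter, XProtect or The Hawkeye Effect work internally; the function names and the idea of a fixed mounting heading are assumptions made for the example.

```python
import math


def bearing_deg(cam_lat, cam_lon, target_lat, target_lon):
    """Return the initial compass bearing (0 = north, 90 = east) in degrees
    from the camera position to the target position."""
    phi1 = math.radians(cam_lat)
    phi2 = math.radians(target_lat)
    dlon = math.radians(target_lon - cam_lon)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    # atan2 returns -180..180 degrees; the modulo maps this onto 0..360.
    return math.degrees(math.atan2(y, x)) % 360.0


def pan_angle_deg(bearing, mount_heading_deg):
    """Return the pan angle relative to the direction the camera's zero
    position faces (mount_heading_deg is a hypothetical install setting)."""
    return (bearing - mount_heading_deg) % 360.0
```

A camera whose zero position faces east (a mounting heading of 90 degrees) would need to pan 270 degrees to view an event lying due north of it.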

The VMS integration with ShotSpotter and The Hawkeye Effect is a key tool for the police department, and as a result of its success the system has recently been expanded to cover every residential zone in the Hartford city limits.
