The Path to Human-Level Perception of Physical Security
How the Ambient.ai AI system sees the world and protects it
The idea of creating systems with human-level intelligence has always fascinated engineers, AI researchers, sci-fi writers, and geeks everywhere. Lately, the concept of Artificial General Intelligence (AGI) has captured the imagination of the general public.
After all, who wouldn't want a friend like R2-D2? AGI is no longer just a sci-fi dream. Active research could one day lead to machines that think and perceive the world much as humans do. On a more pragmatic note, we already have AI systems that achieve human-level performance on certain specialized tasks. Working alongside human agents, such AI systems have already delivered billions of dollars of value to society.
Our focus at Ambient.ai is physical security: we have built AI systems with near human-level visual perception for physical security tasks. In this post, I am going to elaborate on our approach to building such a system.
Consider a building equipped with multiple security cameras. In the broadest sense, we want the computer vision system monitoring these cameras to perceive and respond to this information the way a human operator would. To do this, we break the task down into several components.
AI should not only see the world but also understand it
The first step is to detect objects and entities of interest in every single frame – their color, texture, shape, and relative positioning in space, among other things. To achieve this, we leverage and build upon state-of-the-art machine learning models that have already surpassed human-level performance on tasks like image recognition.
But physical security monitoring is more than object detection on well-defined datasets. The system needs to handle and respond to events in the wild that have never occurred before. To truly make sense of what is happening, we need to understand how everything interacts over a time horizon: we take the prior knowledge we have from events in the recent past and use it to interpret new information arriving in real time. We have built a graph-based approach that contextualizes this information across physical space in a semantically meaningful way, and we use that context to identify, understand, and analyze more than 100 threat signatures in real time. All of these signatures rely only on non-personally identifiable information. The system scales across cameras, and, more importantly, it is engineered so that additional skills and new threat signatures can be added easily in the future.
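To make the graph-based idea concrete, here is a heavily simplified sketch. Everything in it – the `Detection` and `EventGraph` names, the 30-second linking window, and the signature-as-entity-sequence matcher – is a hypothetical illustration of contextualizing detections over time, not our actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Detection:
    entity: str   # e.g. "person", "door", "laptop"
    camera: str
    t: float      # timestamp in seconds

@dataclass
class EventGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (earlier, later) index pairs

    def add(self, det, link_window=30.0):
        """Add a detection, linking it to recent detections for context."""
        i = len(self.nodes)
        for j, prior in enumerate(self.nodes):
            if det.t - prior.t <= link_window:
                self.edges.append((j, i))
        self.nodes.append(det)
        return i

    def matches(self, signature):
        """Check whether a threat signature, modeled here as an ordered
        sequence of entity sightings, occurs in temporal order."""
        ordered = sorted(self.nodes, key=lambda d: d.t)
        it = iter(signature)
        want = next(it)
        for d in ordered:
            if d.entity == want:
                try:
                    want = next(it)
                except StopIteration:
                    return True  # whole sequence observed in order
        return False

g = EventGraph()
g.add(Detection("person", "cam1", 0.0))
g.add(Detection("door", "cam2", 10.0))
g.add(Detection("laptop", "cam2", 20.0))
print(g.matches(["person", "door", "laptop"]))  # True
```

A real system would match far richer patterns over the graph structure (camera adjacency, entity identity, dwell time); the point here is only that detections become nodes and temporal proximity becomes edges.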
We train our AI system with a combination of publicly available open source data, in-house generated data, and aggregated and anonymized field data that we manage according to Privacy by Design standards. At the highest level, we use this data to train our system for the task of visual monitoring. At a more granular level, we model this as a mathematical optimization problem.
The two most important metrics we optimize for are precision and recall. These are co-dependent metrics with a soft trade-off between them in most cases, but real improvements come from optimizing both simultaneously.
Recall: The proportion of real security incidents we catch. We want to prevent every security incident possible, so the system must have high recall.
Precision: The proportion of real security events among all the alerts we raise. Precision must be high as well; low precision means raising a lot of false alerts, which is undesirable.
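The two definitions above reduce to simple ratios over alert counts. A minimal illustration, using made-up numbers for a hypothetical day of monitoring:

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Compute precision and recall from alert outcome counts."""
    precision = true_pos / (true_pos + false_pos)  # real events among alerts raised
    recall = true_pos / (true_pos + false_neg)     # real events actually caught
    return precision, recall

# Hypothetical counts: 85 real incidents caught, 10 false alerts, 15 missed.
p, r = precision_recall(true_pos=85, false_pos=10, false_neg=15)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.89, recall=0.85
```

The trade-off shows up when you tune an alerting threshold: alert more aggressively and false negatives fall (recall up) while false positives rise (precision down).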
We've achieved a good baseline precision for our computer vision system, so the main focus of this blog post is to explain our recall capabilities and compare them to a scenario where a human operator is monitoring the space. Below, I will walk you through the simple math that explains how Ambient.ai achieves near human-level performance at scale in the domain of physical security.
AI can surpass humans in task-specific problems
Consider a security operator monitoring a single camera. Assume this operator has 100% recall, i.e. they catch every single security event that happens on that camera. Additionally, assume our AI system has 85% recall, i.e. it catches 85% of all security events that happen. This is a conservative value, but it works for our analysis.
For the sake of simplicity, assume a human operator can monitor 20 cameras at the same time. As more cameras are added to the operator's monitoring load, their recall drops for a combination of reasons, with fatigue being the primary one. Even assuming just 3 minutes of downtime per hour reduces the recall of human monitoring to 95%. Again, this is a very conservative estimate.
Now, consider a site with 100 cameras. A single security operator can only look at 20 cameras at once without compromising on recall – a physical and biological limitation of the operator. A common way operators cover all 100 cameras is time-division multiplexing: periodically switching between video walls of 20 cameras each. This effectively cuts the operator's recall to 19% (one fifth of the 95% above), so about 81% of the video security footage is not being monitored at any given point in time. But the AI system we have built still maintains a recall of 85%, because it constantly monitors every camera at scale.
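The arithmetic behind those figures is straightforward. Working through it with the assumed numbers from the scenario above:

```python
# Assumed numbers from the scenario: 100 cameras, 20 per video wall,
# 3 minutes of operator downtime per hour.
cameras_total = 100
cameras_per_wall = 20
fatigue_recall = 57 / 60                             # 0.95 after downtime

# Time-division multiplexing: each wall is watched 1/5 of the time.
fraction_watched = cameras_per_wall / cameras_total  # 0.20
human_recall = fraction_watched * fatigue_recall     # 0.20 * 0.95 = 0.19
ai_recall = 0.85                                     # watches every camera, always

print(f"human={human_recall:.0%}, ai={ai_recall:.0%}")  # human=19%, ai=85%
```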
We know that every security incident comprises multiple events, any of which can be caught to avert the incident. Consider, for example, the theft of a data asset like a laptop. The perpetrator might first breach the perimeter, loiter around for a while, try to force open a door, and then carry the asset out. Both the human operator and the AI have about 3 – 4 chances to catch this incident during any of these events, and at least one event needs to be caught in order to prevent it. This is where our large and growing list of threat signatures becomes very valuable: our comprehensive set of signatures already models most of the common events leading up to an incident, and additional signatures are being added to extend it even further. With a baseline recall of 19% on a single event, a human operator catches at least one of three events about 47% of the time. Whereas, with a baseline recall of 85% on a single event, our AI system catches at least one of three events about 99.7% of the time. Even if we give a human operator 10 chances to catch the incident, their recall would be 88% – still far lower than the 99.7% recall of our system.
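These at-least-one probabilities follow from treating each chance as an independent trial with per-event recall r – a simplifying assumption, since events in a real incident are correlated:

```python
def catch_at_least_one(r, n):
    """Probability of catching at least one of n events,
    assuming independence and per-event recall r."""
    return 1 - (1 - r) ** n

print(f"{catch_at_least_one(0.19, 3):.1%}")   # human, 3 chances: ~46.9%
print(f"{catch_at_least_one(0.85, 3):.1%}")   # AI, 3 chances: ~99.7%
print(f"{catch_at_least_one(0.19, 10):.1%}")  # human, 10 chances: ~87.8%
```

Note how sharply the per-event recall dominates: even tripling the human operator's number of chances cannot close the gap to the AI's three-chance figure.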
On a system level, we have achieved near human-level visual perception for the task of physical security monitoring. Whereas human performance is largely limited by physical and biological factors, our AI system is constantly evolving as we continue adding new threat signatures to our already large repository. Any additional performance improvements we make are instantly available to everyone using the system and build towards our exciting vision of preventing every security incident possible.

