In imitation learning, an expert demonstrates how to perform a task (e.g., driving a car, filling a cup, playing a game) for the benefit of an agent.
In each demo, the agent has access both to its n-dimensional state observations at each time t, Xt = [X1t, X2t, ..., Xnt] (e.g., a video feed from a camera), and to the expert's action At.
Behavioral cloning approaches learn a mapping π from Xt to At using all (Xt, At) tuples from the demonstrations.
Seems easy enough, right?
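Concretely, behavioral cloning is just supervised learning on the demonstration tuples. Here's a minimal sketch (assuming PyTorch, with random placeholder tensors standing in for the demonstration data; a real pipeline would use actual observations, minibatches, and validation):

```python
# Minimal behavioral cloning: supervised learning from states to expert actions.
# `obs` and `actions` are placeholders for the (Xt, At) tuples from the demos.
import torch
import torch.nn as nn

obs = torch.randn(1000, 16)              # stand-in for demonstration states Xt
actions = torch.randint(0, 4, (1000,))   # stand-in for discrete expert actions At

policy = nn.Sequential(                  # pi: Xt -> distribution over At
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 4),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):
    logits = policy(obs)                 # predict an action for every state
    loss = loss_fn(logits, actions)      # match the expert's action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```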
Why do causal models matter? In short, because we want to learn accurate descriptions of the world, ones that still help the agent when the world changes! People do this all the time. We understand that things fall because of gravity, not because of some coincidence like the floor being a particular color, and that understanding carries over when we move somewhere the floor is a different color. Robots need the same kind of thing. But to a robot, which has no common sense, every part of the world looks equally important. So if the world is a grid of black and white squares (images!), learning a causal model amounts to learning which of the squares actually matter.
In this paper, Pim de Haan, Dinesh Jayaraman and Sergey Levine have a similar idea. Instead of learning directly on the images, we might consider a world that consists of disentangled factors. Then, learning the causal model means learning which of those factors matter, which they proceed to do by turning them off one by one[1].
In driving, one of the causes of confusion is the past state. This example is due to Bansal:
Say the agent is learning to stop at stop signs. You pass it a vector of images and a vector of past speeds, and sure enough, it predicts an output speed that brings the car to a stop! Great! But this might only be because you fed it the past speeds of an expert driver - the model may have (spuriously) learned the correlation that if the expert slowed down, it should probably slow down even more. When you take this agent on the road, however, it sees its own past speeds, not the expert's. It might never have slowed down in the first place, so off it goes, past the stop sign...
So even if a model can predict the actions of the expert driver, that doesn't mean it'll be able to do the same for itself. Why? In the stop sign example, it's because the model misattributed the need to stop to the expert's past slowing down, not to the actual stop sign!
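The trouble surfaces at deployment time, when the policy starts consuming its own past actions instead of the expert's. Here's a rough sketch of that feedback loop, assuming a gym-style environment and a hypothetical `policy` trained on [state, previous action] inputs:

```python
# Deployment: the policy now sees its *own* previous action, not the expert's.
# `env` and `policy` are placeholders; env follows the classic gym API.
import numpy as np

def rollout(env, policy, horizon=200):
    state = env.reset()
    prev_action = np.zeros(1)                       # no expert history to lean on anymore
    for _ in range(horizon):
        obs = np.concatenate([state, prev_action])  # same augmented input as in training
        action = policy(obs)
        state, reward, done, info = env.step(action)
        prev_action = np.atleast_1d(action)         # next step sees the agent's own action
        if done:
            break
```

If the policy learned "I slowed down before, so slow down more" rather than "there's a stop sign", this loop never starts slowing down in the first place.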
How do you tell the agent that something is important? Or, more generally, what can you do to help your agents learn better?
Let's take a detour - there's some fancy stuff in causal models that will help!
The idea of causal models goes back to Pearl. Causal models are like... what you'd get if you wanted the easiest version of math ever. Just some shapes!
X -> Y intuitively means X leads to Y, or that intervening on X can produce a change in Y; mathematically, Y's distribution depends on X. That is, the joint density P(X,Y) factorizes as P(X) (this one stands by itself, since nothing points into it) times P(Y|X) (because this one can change depending on the value of X).
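If you prefer code to densities, here's that two-node graph as an ancestral-sampling sketch (NumPy, with made-up probability tables): draw X from its own marginal, then draw Y conditioned on the X you got.

```python
# Ancestral sampling from the graph X -> Y: X ~ P(X), then Y ~ P(Y|X).
import numpy as np

rng = np.random.default_rng(0)

def sample_x():
    return rng.binomial(1, 0.5)          # P(X): nothing points into X, so just a coin flip

def sample_y_given_x(x):
    p = 0.9 if x == 1 else 0.2           # P(Y|X): Y's distribution shifts with X
    return rng.binomial(1, p)

samples = []
for _ in range(10_000):
    x = sample_x()
    y = sample_y_given_x(x)
    samples.append((x, y))

xs, ys = np.array(samples).T
print("P(Y=1 | X=1) ~", ys[xs == 1].mean())   # about 0.9
print("P(Y=1 | X=0) ~", ys[xs == 0].mean())   # about 0.2
```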
Anyways, you can draw these shapes with as many letters as you want. The fun part is when you try to figure out if the arrows are actually there. Given you have some observations, can you even determine which variables influence which other ones? This is a huge tossup, and in most cases, the situation is hopeless. But when did that ever stop us from trying?
A confounder Zt = [Xt−1, At−1] influences each state variable in Xt, so that some nuisance variables may still be correlated with At among (Xt, At) pairs from demonstrations.
How prevalent are confounders in real-life problems?
Applying this formalism to our imitation learning setting, any distributional shift in the state Xt may be modeled by intervening on Xt, so that correctly modeling the “interventional query” p(At|do(Xt)) is sufficient for robustness to distributional shifts. Now, we may formalize the intuition that only a policy relying solely on true causes can robustly model the mapping from states to optimal/expert actions under distributional shift.
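To see the gap between the observational query p(At|Xt) and the interventional one p(At|do(Xt)), here's a tiny discrete example with made-up numbers (a generic confounded graph Z -> X, Z -> A, X -> A, not the paper's exact setup). Conditioning on X lets information leak back through the confounder Z; do(X) cuts that link, so Z keeps its own marginal, which is the back-door adjustment:

```python
# Conditional vs interventional query in the graph Z -> X, Z -> A, X -> A.
# All variables are binary; the probability tables are invented for illustration.
import numpy as np

p_z = np.array([0.5, 0.5])                  # P(Z)
p_x_given_z = np.array([[0.9, 0.1],         # P(X|Z): row z, column x
                        [0.2, 0.8]])
p_a_given_xz = np.array([[[0.8, 0.2],       # P(A|X,Z): indexed [z][x][a]
                          [0.6, 0.4]],
                         [[0.5, 0.5],
                          [0.1, 0.9]]])

x = 1  # query value of X

# Observational p(A=1 | X=x): Z is weighted by its posterior given X=x.
p_zx = p_z[:, None] * p_x_given_z           # joint P(Z, X)
p_z_given_x = p_zx[:, x] / p_zx[:, x].sum()
p_a_cond = (p_z_given_x * p_a_given_xz[:, x, 1]).sum()

# Interventional p(A=1 | do(X=x)): Z keeps its marginal (back-door adjustment).
p_a_do = (p_z * p_a_given_xz[:, x, 1]).sum()

print("p(A=1 | X=1)     =", round(float(p_a_cond), 3))   # ~0.844
print("p(A=1 | do(X=1)) =", round(float(p_a_do), 3))     # 0.65
```

A policy that nails the first number can still be badly wrong about the second, and the second is what matters once the agent's own behavior shifts the distribution of X.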
For each task, we study imitation learning in two scenarios. In scenario A (henceforth called "CONFOUNDED"), the policy sees the augmented observation vector, including the previous action.
In scenario B ("ORIGINAL"), the previous action variable is replaced with random noise for low-dimensional observations.
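Concretely, the only difference between the two scenarios is how the observation vector gets assembled. Here's a sketch of my reading of that setup (placeholder names; not the authors' code):

```python
# Building the two observation variants for a low-dimensional control task.
# `state` is the environment's raw observation, `prev_action` the action taken
# at the previous timestep; both are placeholders for illustration.
import numpy as np

def confounded_obs(state, prev_action):
    """Scenario A (CONFOUNDED): augment the state with the previous action."""
    return np.concatenate([state, np.atleast_1d(prev_action)])

def original_obs(state, prev_action, rng):
    """Scenario B (ORIGINAL): same dimensionality, but the previous action is noise."""
    noise = rng.normal(size=np.atleast_1d(prev_action).shape)
    return np.concatenate([state, noise])
```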
ORIGINAL produces rewards tending towards expert performance as the size of the imitation dataset increases. CONFOUNDED either requires many more demonstrations to reach equivalent performance, or fails completely to do so.
This is weird, no? By simply giving the model access to additional information (the previous action), we've totally messed up its ability to learn!
Let's detour to Wang et al. [56].
We draw the reader's attention to particularly telling results from Wang et al. [56] for learning to drive in near-photorealistic GTA-V [24] environments, using behavior cloning with DAgger-inspired expert perturbation. Imitation learning policies are trained using overhead image observations with and without "history" information (HISTORY and NO-HISTORY) about the ego-position trajectory of the car in the past.
And once again, like in our tests above, HISTORY has better performance on held-out demonstration data, but much worse performance when actually deployed.
So what can we do to help the agent learn good, helpful causal models?
[1] Okay, it's a little more complicated. Turning them off one by one is good, but what they actually learn is a causal graph: a bit vector over the latent variables indicating which ones to leave on.
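A minimal sketch of what conditioning on such a graph looks like (names and shapes are mine; the paper also has to figure out which mask is the right one, via targeted interventions):

```python
# A candidate "causal graph" is a binary mask over the disentangled latent
# factors: 1 = keep the factor, 0 = switch it off before feeding the policy.
import torch

def masked_policy_input(latents, graph_mask):
    """Zero out the factors that this candidate graph says are not true causes."""
    return latents * graph_mask                        # (batch, n_factors) * (n_factors,)

latents = torch.randn(32, 8)                           # a batch of 8 disentangled factors
graph_mask = torch.bernoulli(0.5 * torch.ones(8))      # one sampled candidate graph
policy_in = masked_policy_input(latents, graph_mask)
```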