Atari early

Apr 01, 2020

By Katja Grace, 1 April 2020

DeepMind announced that their Agent57 beats the ‘human baseline’ at all 57 Atari games usually used as a benchmark. I think this is probably enough to resolve one of the predictions we had respondents make in our 2016 survey.

Our question was when it would be feasible to ‘outperform professional game testers on all Atari games using no game specific knowledge’.[note]

Full question wording:

How many years until you think the following AI tasks will be feasible with:

a small chance (10%)? an even chance (50%)? a high chance (90%)?

Let a task be ‘feasible’ if one of the best resourced labs could implement it in less than a year if they chose to. Ignore the question of whether they would choose to.

[…]

Outperform professional game testers on all Atari games using no game-specific knowledge. This includes games like Frostbite, which require planning to achieve sub-goals and have posed problems for deep Q-networks [1, 2].

[1] Mnih et al. (2015). Human-level control through deep reinforcement learning
[2] Lake et al. (2015). Building Machines That Learn and Think Like People

small chance (10%)
even chance (50%)
high chance (90%)[/note] 'Feasible' was defined as meaning that one of the best resourced labs could do it in a year if they wanted to.

As I see it, there are four non-obvious things to resolve in determining whether this task has become feasible:

Did or could they outperform ‘professional game testers’?
Did or could they do it ‘with no game specific knowledge’?
Did or could they do it for ‘all Atari games’?
Is anything wrong with the result?

I. Did or could they outperform ‘professional game testers’?

It looks like yes, for at least 49 of the games: the 'human baseline' appears to have come from 'professional human games testers' described in this paper.[note]"In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions..."

"The professional human tester used the same emulator engine as the agents, and played under controlled conditions. The human tester was not allowed to pause, save or reload games. As in the original Atari 2600 environment, the emulator was run at 60 Hz and the audio output was disabled: as such, the sensory input was equated between human player and agents. The human performance is the average reward achieved from around 20 episodes of each game lasting a maximum of 5min each, following around 2 h of practice playing each game."[/note] (What exactly the comparison was for the other games is less clear, but it sounds like what they mean by 'human baseline' is 'professional game tester', so I guess the other games meet a similar standard.)

I'm not sure how good professional games testers are. It sounds like they were not top-level players, given that the paper doesn't say that they were, that they were given two hours to practice the games, and that randomly searching for high scores online for a few of these games (e.g. here) yields higher ones (though this could be complicated by e.g. their only being allowed a short time to play).
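For concreteness, here is a minimal sketch of the comparison as I understand it from the quoted paper: the human baseline for a game is the tester's average reward over their episodes, and the task is only met if the agent beats that average on every game. The function names, game names and scores below are made up for illustration and are not taken from DeepMind's code or results.

```python
from statistics import mean

def human_baseline(episode_rewards):
    """Average reward over a tester's episodes (the paper reports around 20 per game)."""
    return mean(episode_rewards)

def outperforms_on_all(agent_scores, human_episodes):
    """True only if the agent's score exceeds the human baseline on every game."""
    return all(
        agent_scores[game] > human_baseline(rewards)
        for game, rewards in human_episodes.items()
    )

# Hypothetical numbers, for illustration only.
human_episodes = {"GameA": [100, 120, 110], "GameB": [5, 7, 6]}
strong_agent = {"GameA": 300, "GameB": 6.5}
uneven_agent = {"GameA": 3000, "GameB": 5.5}

print(outperforms_on_all(strong_agent, human_episodes))
print(outperforms_on_all(uneven_agent, human_episodes))
```

The second agent is far better on one game but falls short on the other, so it fails the 'all games' standard. That is why a single hard game (like Frostbite in the question wording) matters so much for this prediction.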

II. Did or could they do it with ‘no game specific knowledge’?

My impression is that their system does not involve 'game specific knowledge' under likely meanings of this somewhat ambiguous term. However, I don't know a lot about the technical details here or how such things are usually understood, and would be interested to hear what others think.

III. Did or could they do it for ‘all Atari games’?

Agent57 only plays 57 Atari 2600 games, whereas there are hundreds of Atari 2600 games (and other Atari consoles with presumably even more games).

Supposing that Atari57 is a longstanding benchmark including only these 57 Atari games, it seems likely that the survey participants interpreted the question as about only those games. Or at least about all Atari 2600 games, rather than every game associated with the company Atari.

Interpreting it as written though, does Agent57's success suggest that playing all Atari games is now feasible? My guess is yes, at least for Atari 2600 games.

Fifty-five of the fifty-seven games were proposed in this paper[note]Section 3.1.2, https://arxiv.org/pdf/1207.4708.pdf[/note], which describes how they chose fifty of them:

Our testing set was constructed by choosing semi-randomly from the 381 games listed on Wikipedia [http://en.wikipedia.org/wiki/List_of_Atari_2600_games (July 12, 2012)] at the time of writing. Of these games, 123 games have their own Wikipedia page, have a single player mode, are not adult-themed or prototypes, and can be emulated in ALE. From this list, 50 games were chosen at random to form the test set.

The other five games in that paper were a 'training set', and I'm not sure where the other two came from, but as long as fifty of them were chosen fairly randomly, the provenance of the last seven doesn't seem important.

My understanding is that none of the listed constraints should make the subset of games chosen particularly easy rather than random. So being able to play these games well suggests being able to play any Atari 2600 game well, without too much additional effort.

This might not be true if systems developed since those games were chosen (about eight years ago) have become good at this particular set of games, while a different subset of games would have called for a different set of methods, to the extent that more than an additional year would be needed to close the gap now. My impression is that this isn't very likely.
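The sampling argument above can be made a bit more concrete with a rough sketch. Suppose some number of the 123 eligible games were 'hard' in a way that current methods couldn't handle, and that the 50 test games were a uniform random draw from the 123. Both are simplifying assumptions of mine: the paper says 'semi-randomly', and this ignores the worry just described, that methods were developed against this particular set. Under those assumptions, the chance that the test set happens to contain none of the hard games is a simple counting ratio.

```python
from math import comb

ELIGIBLE_GAMES = 123  # games meeting the paper's listed criteria
TEST_SET_SIZE = 50    # games drawn at random for the test set

def prob_sample_misses_all_hard(num_hard):
    """Chance that a uniform random test set contains none of `num_hard` hard games."""
    easy = ELIGIBLE_GAMES - num_hard
    # math.comb returns 0 when there are fewer than 50 easy games to draw from.
    return comb(easy, TEST_SET_SIZE) / comb(ELIGIBLE_GAMES, TEST_SET_SIZE)

# Hypothetical counts of hard games, for illustration only.
for num_hard in (1, 5, 10):
    print(num_hard, round(prob_sample_misses_all_hard(num_hard), 3))
```

With a single hard game among the 123, the test set would miss it with probability 73/123, or about 59%, so success on the sample says little about one or two outliers. With five hard games the chance of missing all of them drops below 10%, which is roughly why success on a random sample suggests that hard games are at most rare.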

In sum, my guess is that respondents usually interpreted the ambiguous 'all Atari games' at least as narrowly as Atari 2600 games, and that a well-resourced lab could now develop AI that played all Atari 2600 games within a year (e.g. plausibly DeepMind could already do that).

IV. Is there anything else wrong with it?

Not that I know of, but let's wait a few weeks and see if anything comes up.

Given all this, I think it is more likely than not that this Atari task is feasible now. Which would be interesting, because the median 2016 survey response put a 10% chance on it being feasible in five years, i.e. by 2021.[note]Though note that only 19 participants answered the question about when there was a 10% chance.

We surveyed 352 machine learning researchers publishing at top conferences, asking each a random subset of many questions. Some of these questions were about when they expected thirty-two concrete AI tasks would become ‘feasible’. We asked each of those questions in two slightly different ways. The relevant Atari questions had 19 and 20 responses for the two wordings, only one of which gave an answer for 2021.[/note] They more robustly put a median 50% chance on ten years out (2026).[note]Half the time we asked about chances in N years, and half the time we asked about years until P probability, and people fairly consistently had earlier distributions when asked the second way. Both methods yielded a 50% chance in ten years here, though later the distributions diverge, with a 90% chance in 15 years yet a 60% chance in 20 years. Note that small numbers of different respondents answered each question, so inconsistency is not a huge red flag, though the consistent inconsistency across many questions is highly suspicious.[/note]

It's exciting to resolve expert predictions about early tasks, so we know more about how to treat their later predictions about, for instance, human-level science research and the obsolescence of all human labor. But we should probably wait for a few more resolutions before reading much into it.

At a glance, some other tasks which we are already learning something about, or might soon:

The 'reading aloud' task[note]'Take a written passage and output a recording that can’t be distinguished from a voice actor, by an expert listener.'[/note] seems to be coming along to my very non-expert ear, but I know almost nothing about it.
It seems like we are close on StarCraft, though as far as I know the prediction hasn’t been exactly resolved as stated.


Thanks to Rick Korzekwa, Jacob Hilton and Daniel Filan for answering many questions.