Multitask and Transfer Learning

  • Benchmark and build RL architectures that can do multitask and transfer learning.
  • Date: December 2016
  • Category: Fundamental Research
  • Contact: [email protected]
  • Join the chat at https://gitter.im/ai-open-network/multitask_and_transfer_learning

Problem description

  1. Create a benchmark for transfer learning and multitask learning.
    • Should measure improvement in learning that is directly attributable to knowledge transfer between games.
    • Should also be able to measure performance by a single agent on multiple games.
    • Should use cross-validation to mitigate the effects of a small number of games to test on.
  2. Design and implement deep reinforcement learning architectures that do well on the benchmark.
    • For methodological reasons, we think it's important to design the ideal benchmark before getting too attached to a particular architecture.
    • It's important that we're sure the benchmark is measuring the crux of the transfer and multi-task problem rather than measuring something our architecture is good at.
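One way to picture the cross-validation requirement is to hold out each game exactly once while training on the rest. The sketch below is only an illustration under that assumption: the function name `game_folds`, the seed handling, and the example game list are hypothetical and not taken from the project's code.

```python
import random


def game_folds(games, k, seed=0):
    """Split game names into k (train_games, test_games) pairs.

    Every game lands in exactly one held-out test fold.
    Hypothetical helper, not part of the project's library.
    """
    if not 2 <= k <= len(games):
        raise ValueError("k must be between 2 and the number of games")
    # Sort first so the split depends only on the seed, not on input order.
    shuffled = sorted(games)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled games into k folds, round-robin.
    folds = [shuffled[i::k] for i in range(k)]
    return [([g for g in shuffled if g not in fold], fold) for fold in folds]


if __name__ == "__main__":
    # Illustrative game list only.
    example_games = ["Pong", "Breakout", "Seaquest", "Qbert", "Asteroids", "Frostbite"]
    for train_games, test_games in game_folds(example_games, k=3):
        print("train on", train_games, "-> test transfer on", test_games)
```

Holding out whole games, rather than episodes, keeps the test games truly unseen during pre-training.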

Contributing

We have a few different "threads" going on right now, so there are several different ways you can get involved if you're interested:

A few notes on contributing

  • Be Kind and Be Respectful
  • Value other people's work: please reference it. This also helps other people in the project find valuable prior work. Don't just copy & paste what you find elsewhere when it comes to sharing information.
  • Give constructive criticism. If you see something not working or wrong, open an issue, or bring it up in the chat. Avoid criticizing people or making things personal, but feel free to criticize code, ideas, and project direction constructively. If you come with a proposed solution in hand, all the better!
  • Please Ask Questions! An important part of this project is to open up the opportunity for everyone to contribute. We want anyone who is interested to be able to add value to these research topics.
  • Keep in mind that most of the researchers who open these projects have full-time work/research. If you have a specific question, use the gitter channel or open an issue rather than emailing them directly.

Project Status:

See detailed status on the project tracker

  • Currently writing and testing the benchmarks for measuring performance.
  • Looking for people to trawl papers for ideas, and to implement some existing architectures so their performance can be benchmarked.
  • Check the README in the AMTLB directory to learn more about the tool/library used.
    • "This is a library to test how a reinforcement learning architecture performs on all Atari games in OpenAI's gym. It performs two kinds of tests, one for transfer learning and one for multitask learning."

Why this problem matters:

Generalizing across tasks is a crucial component of human intelligence. Current deep RL architectures get less effective the more tasks they are put to, whereas for humans, diversity of experience is a strength that improves performance on new tasks. Overcoming catastrophic forgetting and achieving one-shot learning are abilities that should fall out naturally if this task is solved convincingly.

At a more meta-level, this problem is out of reach of current reinforcement learning architectures, yet it seems reasonably attainable within a year or two. Much like ImageNet spurred innovation by creating a common target for researchers to aim for, this project could provide a common idea of success for multitask and transfer learning. Many papers researching multi-task and transfer learning using Atari do it in ad-hoc ways that cherry-pick games that get good results.

How to measure success:

Success comes in degrees, since an architecture could (in principle) surpass human ability in multi-task Atari, both getting higher scores on all games and picking up new games faster than a human does. Ideally, a good waterline would be human-level performance on the benchmark, but creating a robust dataset on human performance is beyond the scope of this project.

The fundamental benchmark then will be two measures:

  1. Transfer Learning: How much better a given architecture does on an unseen game when it has first been trained on other games than when it starts untrained. Measured as the ratio of total score, pre-trained vs. untrained. The ratio is averaged using cross-validation, because only a small number of games is available and high scores are not comparable across games.
  2. Multitask Learning: How well a given architecture does across all games with a single architecture and set of weights. Rather than an aggregate, this result will be a vector of top scores achieved for each game.
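The two measures can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual implementation: the function names and input layouts are hypothetical. It also assumes positive score totals, since the text does not say how zero or negative baselines (possible in some Atari games) would be handled.

```python
from statistics import mean


def transfer_ratio(pretrained_total, untrained_total):
    """Ratio of total score on a held-out game: pre-trained vs. untrained."""
    if untrained_total <= 0:
        # Assumption of this sketch: the ratio is only defined for positive baselines.
        raise ValueError("untrained total must be positive")
    return pretrained_total / untrained_total


def transfer_benchmark(fold_results):
    """Average the per-game ratios over all cross-validation folds.

    fold_results: list of dicts mapping a held-out game name to a
    (pretrained_total, untrained_total) pair. Hypothetical layout.
    """
    ratios = [
        transfer_ratio(pretrained, untrained)
        for fold in fold_results
        for pretrained, untrained in fold.values()
    ]
    return mean(ratios)


def multitask_benchmark(episode_scores):
    """Return the top score per game as a vector (here a dict), not an aggregate.

    episode_scores: dict mapping a game name to the list of scores that one
    agent, with one set of weights, achieved on that game.
    """
    return {game: max(scores) for game, scores in episode_scores.items()}
```

Using a ratio rather than a raw score difference is what makes results comparable across games whose score scales differ by orders of magnitude.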

In addition to the scores, the benchmark will also make some strict demands on the architecture itself due to the testing/training regime:

  • Training will happen on random games sequentially. After each loss a new random game from the training set will be selected to play next.
  • No out-of-band signal will be given to indicate which game is being played, so architectures that need to allocate a set of extra weights for each game will have to be more clever.
  • All games in ALE will be used, even ones on which standard DQNs perform poorly, such as Montezuma’s Revenge.
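The training regime above could be driven by a loop like the following. It is a sketch under stated assumptions: "loss" is read as the end of an episode, and `run_training` and the `play_episode` callback are hypothetical names, not the project's API. The harness knows which game it picked, but that name is never handed to the agent.

```python
import random


def run_training(play_episode, training_games, num_episodes, seed=None):
    """Play episodes sequentially, picking a new random game each time.

    play_episode: hypothetical callback that takes a game name, builds the
    environment, lets the agent play until the game is over, and returns the
    score. It must pass the agent only observations and rewards, never the
    game name.
    """
    rng = random.Random(seed)
    history = []
    for _ in range(num_episodes):
        game = rng.choice(training_games)  # new random game after every episode
        score = play_episode(game)
        history.append((game, score))
    return history
```

A caller would supply its own `play_episode`. The returned history is for the experimenter's bookkeeping only and should not be fed back into the agent.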

Datasets:

There are currently no datasets, but the dataset being created at atarigrandchallenge.com may be a useful comparison once it’s available. Measuring human performance needs to be done with a large sample size, both to control for pre-training (some people have played Atari games, or other video games, before) and to control for individual human skill levels (which could be seen as pre-training on non-Atari games, generalization from real life, natural ability, etc.).

Akin to a dataset will be the benchmark framework itself. Since this is a reinforcement learning problem, the testing environment provides the data, rather than a static dataset.

Relevant/Related Work

Since the original Mnih et al. paper, the Atari 2600 environment has been a popular target for testing out RL architectures.

Note: More work will be added here. For now, check the chat for the latest related work.