Ravi Munde uses reinforcement learning to control Dino Run

Google Chrome hides a little Easter egg: when you have no network connection, opening any URL brings up the Dino Run game, and pressing the space bar makes the dinosaur jump. You can also open chrome://dino directly to play this mini-game. Recently, Ravi Munde, a graduate student at Northeastern University (USA), used reinforcement learning to control Dino Run.

The following content is compiled and translated from Ravi Munde's blog:

This article will start from the foundations of reinforcement learning and walk through the following steps:

Build a bidirectional interface between the browser (JavaScript) and the model (Python)

Capture and preprocess images

Train the model

Evaluate the results

Reinforcement learning

Reinforcement learning may be a new term for many people, but children actually learn using the same concept, and it is still how our brains work. The reward system is the basis of any RL algorithm: just as with a child's first steps, a positive reward comes in the form of a parent's applause or candy, while a negative reward means no candy. The child learns to stand up before it starts walking. In artificial intelligence terms, the main objective of the agent (Dino, in our case) is to maximize a numerical reward by performing a particular sequence of actions in the environment. The biggest challenge in RL is the absence of supervision (labeled data) to guide the agent; it must explore and learn on its own. The agent starts with random actions, observes the reward each action brings, and learns to predict the best action when it faces a similar state of the environment.

Figure: Vanilla reinforcement learning framework
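To make the loop in this figure concrete, the agent-environment interaction can be sketched roughly as follows. This is not part of the article's code; the environment here is a trivial stand-in that only mimics the reward scheme used later (+0.1 per surviving step, -1 on game over):

import random

def step(action):
    # toy environment: a small chance the episode ends at every step
    done = random.random() < 0.05
    reward = -1 if done else 0.1      # mirrors the survival / crash rewards used later
    return reward, done

total_reward, done = 0.0, False
while not done:
    action = random.choice([0, 1])    # 0 => do nothing, 1 => jump (random policy for now)
    reward, done = step(action)
    total_reward += reward            # the agent's objective is to maximize this sum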

Q-learning

We use Q-learning, a form of RL, to try to approximate a special function that drives the action-selection policy for any sequence of environment states. Q-learning is a model-free implementation of RL in which a table of Q-values is updated for every state visited, action taken, and reward received; it lets us understand the structure of the problem. In our example, a state is a screenshot of the game, and the actions are "do nothing" and "jump" ([0, 1]).

Because the state space here (raw screenshots) is far too large to enumerate in a table, we instead solve the problem by regression: a neural network predicts the Q-values, and we choose the action with the highest predicted Q-value.

Figure: Sample Q-table
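For reference, the tabular update that such a Q-table performs can be sketched as follows (a minimal sketch; the learning rate alpha and discount gamma are assumed values, not taken from the article):

alpha, gamma = 0.1, 0.99

def q_update(Q, state, action, reward, next_state):
    best_next = max(Q[next_state].values())                  # max over a' of Q(s', a')
    target = reward + gamma * best_next                      # Bellman target
    Q[state][action] += alpha * (target - Q[state][action])  # move the estimate toward the target

# Example: two states, two actions, all zero-initialized
Q = {"s0": {0: 0.0, 1: 0.0}, "s1": {0: 0.0, 1: 0.0}}
q_update(Q, "s0", 1, 0.1, "s1")   # Q["s0"][1] moves toward 0.1 + gamma * 0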

▌ Setup

First, set up the environment:

1. Select a virtual machine

We need a complete desktop environment where we can capture and use screenshots to train the model. I chose the Paperspace ML-in-a-box (MLIAB) Ubuntu image. The advantage of MLIAB is that it comes preloaded with Anaconda and many other ML libraries.

2. Set up and install Keras to use the GPU

Paperspace's virtual machine comes with these pre-installed. If yours does not, install them with:

pip install keras
pip install tensorflow

In addition, to make sure Keras can recognize the GPU, execute the following Python code; you should see the available GPU devices listed:

from keras import backend as K
K.tensorflow_backend._get_available_gpus()
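Note that K.tensorflow_backend._get_available_gpus() only exists in older Keras releases running on the TensorFlow 1.x backend; if you are on TensorFlow 2.x, an equivalent check is:

import tensorflow as tf
tf.config.list_physical_devices('GPU')   # should list any available GPU devices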

3. Install dependencies

Selenium:

pip install selenium

OpenCV:

pip install opencv-python

Download ChromeDriver:

http://chromedriver.chromium.org

▌ Game framework

Open chrome://dino and press the spacebar to start the game. If you need to modify the game code, you can extract the game from Chromium's open-source repository.

Since the game is written in JavaScript and our model is written in Python, we need some interface tools between the two.

Selenium is a popular browser-automation tool used to send actions to the browser and to retrieve game parameters such as the current score.

Once we have an interface for sending actions, we also need a mechanism to capture the game screen:

Selenium and OpenCV provide the best performance for screen capture and image preprocessing, achieving frame rates of 6-7 fps.

Game Module

We use this module to implement the interface between Python and JavaScript. The following code shows how the module works:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class Game:
    def __init__(self):
        self._driver = webdriver.Chrome(executable_path=chrome_driver_path)
        self._driver.set_window_position(x=-10, y=0)
        self._driver.get(game_url)
    def restart(self):
        self._driver.execute_script("Runner.instance_.restart()")
    def press_up(self):
        self._driver.find_element_by_tag_name("body").send_keys(Keys.ARROW_UP)
    def get_score(self):
        score_array = self._driver.execute_script("return Runner.instance_.distanceMeter.digits")
        score = ''.join(score_array)
        return int(score)
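The class above relies on chrome_driver_path and game_url, which are not defined in this excerpt, and the agent module below calls a get_crashed() method that is not shown either. A plausible sketch of these missing pieces, using the same Runner.instance_ JavaScript object queried elsewhere in the code (the exact values are assumptions), might be:

# Assumed configuration values; adjust to your own setup
chrome_driver_path = "/path/to/chromedriver"
game_url = "chrome://dino"   # or the URL of a locally hosted copy of the extracted game

# Sketch of the crash check used by the agent module (add this method to the Game class)
def get_crashed(self):
    return self._driver.execute_script("return Runner.instance_.crashed")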

Agent module

We use an agent module to encapsulate all the interfaces. With this module we control Dino and get the agent's status in the environment.

class DinoAgent:
    def __init__(self, game):  # takes the game as input for taking actions
        self._game = game
        self.jump()  # to start the game, we need to jump once
    def is_crashed(self):
        return self._game.get_crashed()
    def jump(self):
        self._game.press_up()

Game State Module

In order to send an action to the game and get the resulting state back, we use the Game State module. It simplifies the process: it receives an action, executes it, determines the reward, and returns the experience tuple.

class Game_sate:
    def __init__(self, agent, game):
        self._agent = agent
        self._game = game
    def get_state(self, actions):
        score = self._game.get_score()
        reward = 0.1  # survival reward
        is_over = False  # game-over flag
        if actions[1] == 1:  # else do nothing
            self._agent.jump()
        image = grab_screen(self._game._driver)
        if self._agent.is_crashed():
            self._game.restart()
            reward = -1
            is_over = True
        return image, reward, is_over  # return the experience tuple

▌ Image pipeline

Image capture

We can capture the game screen in a variety of ways, for example using the PIL or MSS Python libraries to capture the entire screen and then crop the region of interest (ROI). The biggest drawback of that approach, however, is its sensitivity to screen resolution and window position. Fortunately, the game uses an HTML Canvas, so we can use JavaScript to easily get the image in base64 format and run that script through Selenium.

// JavaScript code to get the image data from the canvas
var canvas = document.getElementsByClassName('runner-canvas')[0];
var img_data = canvas.toDataURL();
return img_data;

import base64
from io import BytesIO
import numpy as np
from PIL import Image

def grab_screen(_driver=None):
    image_b64 = _driver.execute_script(getbase64Script)
    screen = np.array(Image.open(BytesIO(base64.b64decode(image_b64))))
    image = process_img(screen)  # process the image as required
    return image
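The getbase64Script variable passed to execute_script above is just the JavaScript snippet shown earlier stored as a Python string. One way to wire it up, as a sketch, is below; stripping the first 22 characters removes the "data:image/png;base64," prefix so that b64decode receives raw base64:

# The JavaScript from above as a Python string for Selenium to execute (sketch)
getbase64Script = ("var canvas = document.getElementsByClassName('runner-canvas')[0]; "
                   "return canvas.toDataURL().substring(22)")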

Image Processing

The captured raw image has a resolution of 600x150 with 3 channels (RGB). We intend to use 4 consecutive screenshots as a single input to the model, which would make a single input of size 600x150x3x4. That input is too large, demands a lot of computing power, and not all of its features are useful, so we use the OpenCV library to resize, crop, and process the image. The final processed input is only 80x80 pixels, in a single grayscale channel.

import cv2

def process_img(image):
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # convert to grayscale
    image = image[:300, :500]  # crop the region of interest
    return image

Figure: Image processing
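Note that the snippet above only converts to grayscale and crops; the resize to the final 80x80 input mentioned earlier is not shown in this excerpt (and the training code later works with 20x40 frames, so the exact target size depends on which configuration you follow). A sketch of the missing resize step could be:

# Sketch: downscale the cropped grayscale frame to the model's input size
# (80x80 per the text above; use (40, 20) if you follow the training code's shapes)
image = cv2.resize(image, (80, 80), interpolation=cv2.INTER_AREA)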

Model Architecture

Now let us look at the model architecture. We use a series of three convolutional layers, which are then flattened into a dense layer and an output layer. The CPU-only model does not include pooling layers, because I had already removed many features and adding pooling would cause a large loss of the already sparse features. With a GPU, however, the model can accommodate more features without reducing the frame rate.

Max-pooling layers significantly improve the processing of the denser feature set.

Figure: Model architecture

The output layer consists of two neurons, each representing the maximum predicted return for one of the two actions. We then choose the action with the highest predicted return (Q-value).

# Keras imports needed for the model definition
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense
from keras.optimizers import Adam

def buildmodel():
    print("Now we build the model")
    model = Sequential()
    model.add(Conv2D(32, (8, 8), padding='same', strides=(4, 4), input_shape=(img_cols, img_rows, img_channels)))  # 80*80*4
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Activation('relu'))
    model.add(Conv2D(64, (4, 4), strides=(2, 2), padding='same'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Activation('relu'))
    model.add(Conv2D(64, (3, 3), strides=(1, 1), padding='same'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Activation('relu'))
    model.add(Flatten())
    model.add(Dense(512))
    model.add(Activation('relu'))
    model.add(Dense(ACTIONS))
    adam = Adam(lr=LEARNING_RATE)  # lowercase to avoid shadowing the Adam class
    model.compile(loss='mse', optimizer=adam)
    print("We finish building the model")
    return model
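The model and training code reference several constants (ACTIONS, LEARNING_RATE, img_rows, img_cols, img_channels, GAMMA, OBSERVATION, INITIAL_EPSILON, BATCH) that are not defined in this excerpt. A plausible set of values, consistent with the shapes and comments in the code but assumed here rather than taken from the original source, might be:

# Assumed hyperparameters; the exact values are not given in this excerpt
ACTIONS = 2                   # 0 => do nothing, 1 => jump
GAMMA = 0.99                  # discount factor for future rewards
OBSERVATION = 100             # steps to observe before training starts
INITIAL_EPSILON = 0.1         # starting exploration rate
BATCH = 32                    # minibatch size sampled from replay memory
LEARNING_RATE = 1e-4          # Adam learning rate
img_rows, img_cols = 80, 80   # per the image-processing text; the training code reshapes to 20x40
img_channels = 4              # four stacked frames per input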

Training

Start by doing nothing and get the initial state (s_t)

Observe the game for a fixed number of steps (OBSERVATION)

Predict and perform an action

Store the experience in replay memory

Randomly sample a batch from replay memory and train the model on it

Restart once the game is over

from collections import deque
import random
import numpy as np

def trainNetwork(model, game_state):
    # store the previous observations in replay memory
    D = deque()  # experience replay memory
    # get the first state by doing nothing
    do_nothing = np.zeros(ACTIONS)
    do_nothing[0] = 1  # 0 => do nothing, 1 => jump
    x_t, r_0, terminal = game_state.get_state(do_nothing)  # get the next state after performing the action
    s_t = np.stack((x_t, x_t, x_t, x_t), axis=2).reshape(1, 20, 40, 4)  # stack 4 images to create placeholder input, reshaped to 1*20*40*4
    OBSERVE = OBSERVATION
    epsilon = INITIAL_EPSILON
    t = 0
    while (True):  # endless running
        loss = 0
        Q_sa = 0
        action_index = 0
        r_t = 0  # reward at t
        a_t = np.zeros([ACTIONS])  # action at t
        q = model.predict(s_t)  # input a stack of 4 images, get the prediction
        max_Q = np.argmax(q)  # choose the index with the maximum q value
        action_index = max_Q
        a_t[action_index] = 1  # 0 => do nothing, 1 => jump
        # run the selected action and observe the next state and reward
        x_t1, r_t, terminal = game_state.get_state(a_t)
        x_t1 = x_t1.reshape(1, x_t1.shape[0], x_t1.shape[1], 1)  # 1x20x40x1
        s_t1 = np.append(x_t1, s_t[:, :, :, :3], axis=3)  # append the new image to the input stack and drop the oldest one
        D.append((s_t, action_index, r_t, s_t1, terminal))  # store the transition
        # only train once done observing; sample a minibatch to train on
        if t > OBSERVE:
            trainBatch(random.sample(D, BATCH))
        s_t = s_t1
        t += 1
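Note that the loop above initializes epsilon but, as excerpted, always picks the greedy action. In a typical DQN setup (and in keeping with the introduction, where the agent starts with random actions), action selection is epsilon-greedy; a sketch of how that could be wired into the loop, with assumed constants, is:

# Sketch of epsilon-greedy action selection with a slowly decaying epsilon.
# FINAL_EPSILON and EXPLORE are assumed constants, not taken from the excerpt.
FINAL_EPSILON = 0.0001
EXPLORE = 100000
if random.random() <= epsilon:
    action_index = random.randrange(ACTIONS)        # explore: random action
else:
    action_index = np.argmax(model.predict(s_t))    # exploit: greedy action
a_t = np.zeros([ACTIONS])
a_t[action_index] = 1
if epsilon > FINAL_EPSILON and t > OBSERVE:
    epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE  # decay exploration after observing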

Please note that we sample 32 random experiences from replay memory and train in batches. The reasons are the unbalanced distribution of actions in the game and the need to avoid overfitting.

def trainBatch(minibatch):
    # build one training batch from the sampled experiences
    loss = 0
    inputs = np.zeros((BATCH, s_t.shape[1], s_t.shape[2], s_t.shape[3]))  # 32, 20, 40, 4
    targets = np.zeros((inputs.shape[0], ACTIONS))  # 32, 2
    for i in range(0, len(minibatch)):
        state_t = minibatch[i][0]    # 4D stack of images
        action_t = minibatch[i][1]   # this is the action index
        reward_t = minibatch[i][2]   # reward at state_t due to action_t
        state_t1 = minibatch[i][3]   # next state
        terminal = minibatch[i][4]   # whether the agent died or survived due to the action
        inputs[i:i + 1] = state_t
        targets[i] = model.predict(state_t)  # predicted q values
        Q_sa = model.predict(state_t1)       # predict q values for the next state
        if terminal:
            targets[i, action_t] = reward_t  # if terminated, the target is just the reward
        else:
            targets[i, action_t] = reward_t + GAMMA * np.max(Q_sa)
    loss += model.train_on_batch(inputs, targets)

Results

We achieved good results with this architecture. The chart below shows the average score from the beginning of training; by the end of training, the average score per 10 games was well above 1,000.

The highest recorded score is 4000+, far above the 250 points of the previous model (and far beyond what most people can do!). The figure below shows the progression of the game's highest score during training.

Dino's speed is proportional to the score, which makes obstacles harder to detect and actions harder to time at higher speeds. The entire game was therefore trained at a constant speed.
