Q-learning Reward Propagation Method for Reducing the Transmission Power of Sensor Nodes in Wireless Sensor Networks

  • Authors:
  • Yunsick Sung; Eunyoung Ahn; Kyungeun Cho

  • Affiliations:
  • Dongguk University-Seoul, Seoul, Republic of Korea; Hanbat National University, Daejeon, Republic of Korea; Dongguk University-Seoul, Seoul, Republic of Korea

  • Venue:
  • Wireless Personal Communications: An International Journal
  • Year:
  • 2013

Abstract

In wireless sensor networks (WSNs), sensor nodes and sink nodes communicate among themselves to collect and send data. Because of the volume of data transmitted, it is important to minimize the power used for this communication. Q-learning can be applied to find the optimal path between two nodes. However, Q-learning requires a significant learning time, meaning that the learning process must be conducted in advance in order to be applicable to WSNs. Many studies have proposed methods to decrease the learning time by reducing the state space or by updating more Q-values at each update. Reducing the size of the Q-learning state space can prevent the optimal action from being executed, because the correct state may no longer be available. Other methods utilize additional information from a teacher, control the time flow of Q-learning by considering the number of updates for each Q-value, or use a prioritized queue in the update procedure. Such methods are not well suited to the real-world environment and complexity of WSNs. A more suitable approach is to update the Q-values iteratively, and a combination of such updating methods may further reduce the learning time. This paper proposes a reward propagation method (RPM), which integrates several updating algorithms to propagate the reward of the goal state to more Q-values, thus reducing the learning time required for Q-learning. By updating not only the Q-value of the last visited state and executed action, but also the Q-values of unvisited states and unexecuted actions, the learning time can be reduced considerably. The method integrates the following three Q-value updating algorithms. First, we incorporate the concept of $$\text{Q}(\lambda)$$-learning, which iteratively propagates the terminal reward to the Q-values of the states visited before the terminal reward is received. If the terminal reward were also propagated to the Q-values of unvisited states, the learning time could be reduced further. Second, the concept of reward propagation is expanded. An existing one-step reward update method updates, once the terminal reward has been received, only the Q-values of states from which the terminal reward can be obtained by executing a single action. If the terminal reward is also propagated to states that require more than one action to reach the reward, more Q-values can be updated. Third, we apply the idea behind a type of fuzzy Q-learning with eligibility, which updates the Q-values of unexecuted actions. Although this method is not used directly, the concept of updating the Q-values of unexecuted actions is adopted. To investigate how much the proposed method reduces the learning time, we compare it with conventional Q-learning. Given that the optimal-path problem of WSNs can be remodeled as a reinforcement learning problem, a hunter-prey capture game involving two agents in a grid environment is used. The agent controlled by RPM plays the role of the prey and learns to receive the terminal reward and escape from the hunter. The hunter randomly selects one of two movement policies and executes one action based on the selected policy. To measure the difference between the success rates of conventional Q-learning and RPM, the same environment and parameters are used for both methods.
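To make the update scheme described above concrete, the following is a minimal sketch, not the authors' implementation: tabular Q-learning on a small grid combined with a one-step propagation of the terminal reward to neighboring, possibly unvisited states and unexecuted actions, in the spirit of RPM. The grid size, reward value, hyperparameters, and helper names (step, choose_action, propagate_terminal_reward) are illustrative assumptions; the paper's hunter-prey dynamics, wider propagation scopes, and the fuzzy-eligibility component are not reproduced.

```python
import random
from collections import defaultdict

# Illustrative parameters (assumptions, not taken from the paper).
GRID_SIZE = 10
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]    # up, down, right, left
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
TERMINAL_REWARD = 1.0

Q = defaultdict(float)                          # Q[(state, action)] -> value


def step(state, action):
    """Deterministic grid transition with clamping at the borders."""
    x, y = state
    dx, dy = action
    return (min(max(x + dx, 0), GRID_SIZE - 1),
            min(max(y + dy, 0), GRID_SIZE - 1))


def choose_action(state):
    """Epsilon-greedy selection over the tabular Q-values."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])


def q_update(state, action, reward, next_state):
    """Conventional one-step Q-learning update."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next
                                   - Q[(state, action)])


def propagate_terminal_reward(goal_state):
    """Once the terminal reward is received, also update the Q-values of
    states and actions that lead into the goal, even if they were never
    visited or executed, so that more Q-values carry reward information."""
    for x in range(GRID_SIZE):
        for y in range(GRID_SIZE):
            for action in ACTIONS:
                if (x, y) != goal_state and step((x, y), action) == goal_state:
                    q_update((x, y), action, TERMINAL_REWARD, goal_state)
```

In such a sketch, propagate_terminal_reward(goal_state) would be called in addition to the usual q_update at the end of a successful episode; extending the loop to states two or more actions away from the goal corresponds to the wider propagation scope discussed in the abstract, at a cost that grows rapidly with the scope.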
We conduct three experiments: the first to compare RPM with conventional Q-learning, the second to test the scalability of RPM, and the third to evaluate the performance of RPM in a differently configured environment. The RPM results are compared with those of conventional Q-learning in terms of the success rate of receiving the terminal reward, which provides a measure of the difference in required learning times. The greatest reduction in learning time is obtained with a $$10 \times 10$$ grid, no obstacles, and 3,000 episodes to be learned; in these episodes, the success rate of RPM is 232 % higher than that of conventional Q-learning. We perform two experiments to verify the scalability of RPM by changing the size of the environment. In a $$12 \times 12$$ grid environment, RPM initially exhibits a maximum success rate 176 % higher than that of conventional Q-learning. However, as the environment grows, the effect of propagating the terminal rewards decreases, and the improvement in the success rate over conventional Q-learning is smaller than in the $$10 \times 10$$ grid environment. In a $$14 \times 14$$ grid environment, the relative effect of RPM declines further, giving a maximum success rate that is around 138 % higher than that of conventional Q-learning. The scalability experiments show that increasing the size of the environment without widening the scope of terminal reward propagation decreases the success rate of RPM; nevertheless, RPM still exhibits a higher success rate than conventional Q-learning. The improvement in peak success rate, of between 138 % and 232 %, can greatly reduce the learning time in difficult environments. If the computational cost of terminal reward propagation is not a critical concern, the learning time can easily be reduced further by widening the propagation scope, although the amount of computation grows exponentially with that scope. Finally, we compare the success rates when obstacles are deployed within the $$10 \times 10$$ environment. The obstacles naturally degrade the performance of both RPM and conventional Q-learning, but the proposed method still outperforms conventional Q-learning by about 20 % to 59 %. Our experimental results show that the peak success rate of the proposed method is consistently superior to that of conventional Q-learning. Both the size of the environment and the number of obstacles affect the improvement of RPM over conventional Q-learning: the improvement is proportional to the number of additional Q-values that are updated. As the environment grows, the fraction of all Q-values that receive these additional updates decreases, so the effect of RPM also decreases. Likewise, as the number of obstacles increases, fewer Q-values are updated, which also reduces the effect of RPM. Because RPM reduces the learning time, Q-learning can be applied to diverse fields in which the learning-time problem has hindered its application to WSNs. Although many methods have been proposed to solve the learning-time problem of Q-learning, the need to reduce learning time remains pressing. Given that reducing learning time by updating more Q-values per step is an active line of research, we expect RPM to be extended further by combining additional concepts for reducing learning time.