Phil Winder, Ph.D.
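Published works include 『강화학습 개념 및 산업 현장의 적용 사례』 (Reinforcement Learning: Concepts and Industrial Application Cases), among others.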
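CHAPTER 1 Why Reinforcement Learning?
1.1 Why Is Reinforcement Learning Needed Now?
1.2 Machine Learning
1.3 Reinforcement Learning
1.3.1 When Should You Use Reinforcement Learning?
1.3.2 Applications of Reinforcement Learning
1.4 Taxonomy of RL Approaches
1.4.1 Model-Free or Model-Based
1.4.2 How Agents Use and Update Their Strategy
1.4.3 Discrete or Continuous Actions
1.4.4 Optimization Methods
1.4.5 Policy Evaluation and Improvement
1.5 Fundamental Concepts of Reinforcement Learning
1.5.1 The First Reinforcement Learning Algorithm
1.5.2 Is RL the Same as ML?
1.5.3 Reward and Feedback
1.6 Reinforcement Learning as a Discipline
1.7 Summary
1.8 Further Reading
Reference

CHAPTER 2 Markov Decision Processes, Dynamic Programming, and Monte Carlo Methods
2.1 Multi-Arm Bandit Testing
2.1.1 Reward Engineering
2.1.2 Policy Evaluation: The Value Function
2.1.3 Policy Improvement: Choosing the Best Action
2.1.4 Simulation Environment
2.1.5 Running the Experiment
2.1.6 Improving the ε-greedy Algorithm
2.2 Markov Decision Processes
2.2.1 Inventory Control
2.2.2 Inventory Control Simulation
2.3 Policies and Value Functions
2.3.1 Discounted Rewards
2.3.2 Predicting Rewards with the State-Value Function
2.3.3 Predicting Rewards with the Action-Value Function
2.3.4 Optimal Policies
2.4 Monte Carlo Policy Generation
2.5 Value Iteration with Dynamic Programming
2.5.1 Implementing Value Iteration
2.5.2 Results of Value Iteration
2.6 Summary
2.7 Further Reading
Reference

CHAPTER 3 Temporal-Difference Learning, Q-Learning, and n-Step Algorithms
3.1 Definition of Temporal-Difference Learning
3.2 Q-Learning
3.3 SARSA
3.4 Q-Learning Versus SARSA
3.5 Case Study: Reducing Cost by Automatically Scaling Application Containers
3.6 Industrial Example: Real-Time Bidding in Advertising
3.6.1 Defining the MDP
3.6.2 Results of the Real-Time Bidding Environments
3.6.3 Further Improvements
3.7 Extensions to Q-Learning
3.7.1 Double Q-Learning
3.7.2 Delayed Q-Learning
3.7.3 Comparing Standard, Double, and Delayed Q-Learning
3.7.4 Opposition Learning
3.8 n-Step Algorithms
3.9 n-Step Algorithms on Grid Environments
3.10 Eligibility Traces
3.11 Extensions to Eligibility Traces
3.11.1 Watkins's Q(λ)
3.11.2 Fuzzy Wipes in Watkins's Q(λ)
3.11.3 Speedy Q-Learning
3.11.4 Accumulating and Replacing Eligibility Traces
3.12 Summary
3.13 Further Reading
Reference

CHAPTER 4 Deep Q-Networks (DQN)
4.1 Deep Learning Architectures
4.1.1 Basic Deep Learning Architecture
4.1.2 Common Neural Network Architectures
4.1.3 Deep Learning Frameworks
4.1.4 Deep Reinforcement Learning
4.2 Deep Q-Learning
4.2.1 Experience Replay
4.2.2 Q-Network Clones
4.2.3 Neural Network Architecture
4.2.4 Implementing DQN
4.2.5 Example: DQN on the CartPole Environment
4.2.6 Case Study: Reducing Energy Usage in Buildings
4.3 Rainbow DQN
4.3.1 Distributional RL
4.3.2 Prioritized Experience Replay (PER)
4.3.3 Noisy Nets
4.3.4 Dueling Networks
4.4 Example: Rainbow DQN on Atari Games
4.4.1 Results
4.4.2 Further Discussion
4.5 Other DQN Implementations
4.5.1 Improving Exploration
4.5.2 Improving Rewards
4.5.3 Learning from Offline Data
4.6 Summary
4.7 Further Reading
Reference

CHAPTER 5 Policy Gradient Methods
5.1 Benefits of Learning a Policy Directly
5.2 How to Calculate the Gradient of a Policy
5.3 Policy Gradient Theorem
5.4 Policy Functions
5.4.1 Linear Policies
5.4.2 Arbitrary Policies
5.5 Basic Implementations
5.5.1 Monte Carlo (REINFORCE)
5.5.2 REINFORCE with Baseline
5.5.3 Gradient Variance Reduction
5.5.4 n-Step Actor-Critic and Advantage Actor-Critic (A2C)
5.5.5 Eligibility Traces Actor-Critic
5.5.6 A Comparison of Basic Policy Gradient Algorithms
5.6 Industrial Example: Automatically Purchasing Products for Customers
5.6.1 The Environment: Gym-Shopping-Cart
5.6.2 Expectations
5.6.3 Results from the Shopping Cart Environment
5.7 Summary
5.8 Further Reading
Reference

CHAPTER 6 Beyond Policy Gradients
6.1 Off-Policy Algorithms
6.1.1 Importance Sampling
6.1.2 Behavior and Target Policies
6.1.3 Off-Policy Q-Learning
6.1.4 Gradient Temporal-Difference (GTD) Learning
6.1.5 Greedy-GQ
6.1.6 Off-Policy Actor-Critic
6.2 Deterministic Policy Gradients
6.2.1 Deterministic Policy Gradients
6.2.2 Deep Deterministic Policy Gradients (DDPG)
6.2.3 Twin Delayed DDPG (TD3)
6.2.4 Case Study: Recommendations Using Reviews
6.2.5 Improvements to DPG
6.3 Trust Region Methods
6.3.1 Kullback-Leibler (KL) Divergence
6.3.2 Natural Policy Gradients and Trust Region Policy Optimization
6.3.3 Proximal Policy Optimization (PPO)
6.4 Example: Using Servo Motors to Reach a Target in a Real Environment
6.4.1 Environment Setup
6.4.2 RL Algorithm Implementation
6.4.3 Increasing the Complexity of the Algorithm
6.4.4 Hyperparameter Tuning in a Simulation
6.4.5 Resulting Policies
6.5 Other Policy Gradient Algorithms
6.5.1 Retrace(λ)
6.5.2 Actor-Critic with Experience Replay (ACER)
6.5.3 Actor-Critic Using Kronecker-Factored Trust Regions (ACKTR)
6.5.4 Emphatic Methods
6.6 Extensions to Policy Gradient Algorithms
6.6.1 Quantile Regression in Policy Gradient Algorithms
6.7 Summary
6.7.1 Which Algorithm Should I Use?
6.7.2 Asynchronous Methods
6.8 Further Reading
Reference

CHAPTER 7 Learning All Possible Policies with Entropy Methods
7.1 What Is Entropy?
7.2 Maximum Entropy Reinforcement Learning
7.3 Soft Actor-Critic (SAC)
7.3.1 SAC Implementation Details and Discrete Action Spaces
7.3.2 Automatically Adjusting the Temperature Parameter
7.3.3 Case Study: Automated Traffic Management to Reduce Queuing
7.4 Extensions to Maximum Entropy Methods
7.4.1 Other Measures of Entropy (and Ensembles)
7.4.2 Optimistic Exploration Using the Upper Bound of Double Q-Learning
7.4.3 Adjusting Experience Replay
7.4.4 Soft Policy Gradient
7.4.5 Soft Q-Learning and Its Derivatives
7.4.6 Path Consistency Learning
7.5 Performance Comparison: SAC Versus PPO
7.6 How Does Entropy Encourage Exploration?
7.6.1 How Does the Temperature Parameter Alter Exploration?
7.7 Industrial Example: Learning to Drive a Remote Control Car
7.7.1 Problem Definition
7.7.2 Minimizing Training Time
7.7.3 Dramatic Actions
7.7.4 Hyperparameter Search
7.7.5 Final Policy
7.7.6 Further Improvements
7.8 Summary
7.8.1 Equivalence Between Policy Gradients and Soft Q-Learning
7.8.2 What Does This Mean for the Future?
7.8.3 What Does This Mean Now?
Reference

CHAPTER 8 Improving How an Agent Learns
8.1 Rethinking the MDP
8.1.1 Partially Observable Markov Decision Process (POMDP)
8.1.2 Case Study: Using POMDPs in Autonomous Vehicles
8.1.3 Contextual Markov Decision Processes
8.1.4 MDPs with Changing Actions
8.1.5 Regularized MDPs
8.2 Hierarchical Reinforcement Learning
8.2.1 Naive Hierarchical Reinforcement Learning
8.2.2 High-Low Hierarchies with Intrinsic Rewards (HIRO)
8.2.3 Learning Skills and Unsupervised RL
8.2.4 Using Skills in HRL
8.2.5 HRL Conclusions
8.3 Multi-Agent Reinforcement Learning
8.3.1 MARL Frameworks
8.3.2 Centralized or Decentralized
8.3.3 Single-Agent Algorithms
8.3.4 Case Study: Using Single-Agent Decentralized Learning in UAVs
8.3.5 Centralized Training, Decentralized Execution
8.3.6 Decentralized Learning
8.3.7 Other Combinations
8.3.8 Challenges of MARL
8.3.9 MARL Conclusions
8.4 Expert Guidance
8.4.1 Behavior Cloning
8.4.2 Imitation RL
8.4.3 Inverse RL
8.4.4 Curriculum RL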