How to optimize and tune Hadoop cluster performance
I am not familiar with Hadoop cluster configuration. I have integrated Apache Nutch with Apache Hadoop, and I have crawled data and indexed it in Solr successfully. My master and slave resources are listed below:
master: CPU: 4 cores, memory: 12 GB, hard disk: 37 GB
slave1: CPU: 2 cores, memory: 4 GB, hard disk: 18 GB
slave2: CPU: 2 cores, memory: 4 GB, hard disk: 16 GB
slave3: CPU: 2 cores, memory: 4 GB, hard disk: 16 GB
slave4: CPU: 4 cores, memory: 4 GB, hard disk: 50 GB
I have configured core-site.xml, mapred-site.xml, hdfs-site.xml, masters, and slaves.
Here is core-site.xml:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/my project name/hadoop-datastore</value>
    <description>store data</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
    <description>the name of default file system</description>
  </property>
</configuration>
Here is mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
    <description>host, port</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>10</value>
    <description></description>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>20</value>
    <description></description>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
    <description></description>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>8</value>
    <description></description>
  </property>
</configuration>
And here is hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>default block</description>
  </property>
</configuration>
And here is conf/masters:
master
And conf/slaves:
master
slave1
slave2
slave3
slave4
So far this goes well: when I start Hadoop on the master and run the jps command, I see the following on the master:
19031 TaskTracker
18644 DataNode
18764 SecondaryNameNode
18884 JobTracker
13226 Jps
18506 NameNode
And when I run the jps command on the slaves, I see the following:
4969 DataNode
5057 TaskTracker
5592 Jps
When I look at the Hadoop Map/Reduce Administration page on the master, I see the following cluster summary:
<h2>Cluster summary (heap size 114.5 MB/889 MB)</h2>
<table border="1" cellpadding="5" cellspacing="0">
<tr><th>running map tasks</th><th>running reduce tasks</th><th>total submissions</th><th>nodes</th><th>occupied map slots</th><th>occupied reduce slots</th><th>reserved map slots</th><th>reserved reduce slots</th><th>map task capacity</th><th>reduce task capacity</th><th>avg. tasks/node</th><th>blacklisted nodes</th><th>graylisted nodes</th><th>excluded nodes</th></tr>
<tr><td>8</td><td>8</td><td>1607</td><td><a href="machines.jsp?type=active">1</a></td><td>8</td><td>8</td><td>0</td><td>0</td><td>8</td><td>8</td><td>16.00</td><td><a href="machines.jsp?type=blacklisted">0</a></td><td><a href="machines.jsp?type=graylisted">0</a></td><td><a href="machines.jsp?type=excluded">0</a></td></tr>
</table> <br>
The problem is this: the procedure works fine with topN: 1000, but the load on the master is high in both CPU and memory usage, whereas when I run top on the slaves, neither CPU nor memory shows any load. I mean that both CPU and memory usage are low and CPU idle is high.
I wonder whether this is normal and OK or not. I am looking for solutions and configuration changes that would share the load across the slaves and make the procedure faster. Any links, documentation, and solutions are appreciated.
Your master node is running a lot of services:
TaskTracker, DataNode, SecondaryNameNode, JobTracker, NameNode
Typically, in a decent-sized cluster, the master does not run the DataNode service.
The NameNode and the SecondaryNameNode should be on different nodes. You can set up the SecondaryNameNode on one of the data nodes.
The same goes for the TaskTracker: the master typically does not run one, i.e. you do not run MR tasks on the master.
On the other hand, for a pure experimentation setup, what you have done is OK, and the CPU usage you are noticing on the master is to be expected.
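If you do want to move those services off the master, a minimal sketch of the two host files could look like the following. It assumes the Hadoop 1.x style start scripts that go with the configuration in the question: conf/masters lists the hosts where the SecondaryNameNode is started, and conf/slaves lists the hosts where a DataNode and a TaskTracker are started. Picking slave4 for the SecondaryNameNode is only an illustration, not a recommendation from your data; any of the slaves could take that role.

conf/masters:

```
slave4
```

conf/slaves:

```
slave1
slave2
slave3
slave4
```

The NameNode and the JobTracker would stay on the master, since fs.default.name and mapred.job.tracker in your configuration already point there. After restarting the daemons, jps on the master should no longer list DataNode, TaskTracker, or SecondaryNameNode.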