How to optimize and tune Hadoop cluster performance

April 15, 2011

I am not familiar with Hadoop cluster configs. I have integrated Apache Nutch with Apache Hadoop, and have crawled data and indexed it in Solr successfully. My master and slave resources are listed below:

master: CPU: 4 cores, memory: 12 GB, hard disk: 37 GB

slave1: CPU: 2 cores, memory: 4 GB, hard disk: 18 GB

slave2: CPU: 2 cores, memory: 4 GB, hard disk: 16 GB

slave3: CPU: 2 cores, memory: 4 GB, hard disk: 16 GB

slave4: CPU: 4 cores, memory: 4 GB, hard disk: 50 GB

I have configured core-site.xml, mapred-site.xml, hdfs-site.xml, masters, and slaves.

Here is core-site.xml:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/my project name/hadoop-datastore</value>
    <description>store data</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
    <description>the name of default file system</description>
  </property>
</configuration>

Here is mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
    <description>host , port</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>10</value>
    <description></description>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>20</value>
    <description></description>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
    <description></description>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>8</value>
    <description></description>
  </property>
</configuration>

And here is hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>default block</description>
  </property>
</configuration>

And here is conf/masters:

master

And conf/slaves:

master
slave1
slave2
slave3
slave4

Everything starts up fine: when I start the cluster and run the jps command on the master, I get the following:

19031 TaskTracker
18644 DataNode
18764 SecondaryNameNode
18884 JobTracker
13226 Jps
18506 NameNode

And when I run the jps command on the slaves, I get the following:

4969 DataNode
5057 TaskTracker
5592 Jps

When I look at the Hadoop Map/Reduce Administration page on the master, I see the following cluster summary:

Cluster summary (heap size 114.5 MB/889 MB)

Running map tasks: 8
Running reduce tasks: 8
Total submissions: 1607
Nodes: 1
Occupied map slots: 8
Occupied reduce slots: 8
Reserved map slots: 0
Reserved reduce slots: 0
Map task capacity: 8
Reduce task capacity: 8
Avg. tasks/node: 16.00
Blacklisted nodes: 0
Graylisted nodes: 0
Excluded nodes: 0

The problem is that the procedure works fine with topN: 1000, but the load on the master is high, in both CPU and memory usage. When I run top on the slaves, neither CPU nor memory is loaded: both CPU and memory usage are low, and CPU idle is high.

I wonder whether this is normal and OK or not. I am looking for solutions and configs that can share the load across the slaves and make the procedure faster. Any links, documentation, and solutions are appreciated.

Your master node is running a lot of services:

TaskTracker
DataNode
SecondaryNameNode
JobTracker
NameNode

Typically, in a decent-sized cluster, the master does not run the DataNode service.

The NameNode and SecondaryNameNode should be on different nodes. You can set up the SecondaryNameNode on one of the data nodes.
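A rough sketch of what that move could involve, assuming the Hadoop 0.20/1.x layout that the JobTracker and TaskTracker daemons in the question suggest. In that layout the start scripts launch the SecondaryNameNode on the hosts listed in conf/masters, so listing a data node there (say slave1, chosen here only for illustration) instead of master would relocate it. Once it runs on another host it also has to reach the NameNode's web interface. The fragment below assumes the NameNode uses the default HTTP port 50070, which the question does not state, and that the property name matches the installed release.

```xml
<!-- hdfs-site.xml (sketch, not taken from the question) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>default block</description>
  </property>
  <property>
    <!-- Assumption: Hadoop 0.20/1.x property name and default port 50070.
         A SecondaryNameNode on another host uses this address to fetch
         the namespace image and edit log from the NameNode. -->
    <name>dfs.http.address</name>
    <value>master:50070</value>
    <description>NameNode web address, reachable from the other nodes</description>
  </property>
</configuration>
```

The same file would need to be present on the node that hosts the SecondaryNameNode, since that daemon reads its own local configuration.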

The same goes for the TaskTracker: the master typically does not have one, i.e. do not run MR tasks on the master.

On the other hand, for a pure experimentation setup, what you have done is OK, and the CPU usage you are noticing is to be expected.
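If the setup grows beyond experimentation, the per-node slot counts may also be worth a look. mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum are read by each TaskTracker from its own mapred-site.xml, and the value of 8 for each in the question is well above the 2 cores available on slave1 to slave3. A hedged sketch for one of those 2-core slaves follows. The values 2 and 1 are illustrative starting points, not measured recommendations.

```xml
<!-- mapred-site.xml on a 2-core, 4 GB slave (illustrative values only) -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
    <description>host , port</description>
  </property>
  <property>
    <!-- Assumption: about one map slot per core on this node -->
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <!-- Assumption: fewer reduce slots than map slots, to leave memory headroom -->
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
</configuration>
```

Because these two settings are per TaskTracker, a 4-core node such as slave4 could carry larger values in its own copy of the file.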
