Hadoop nodes die (crash) after a while

I have a Hadoop cluster of 16 nodes (Ubuntu 12.04 Server): 1 master and 15 slaves. They are connected through a private network, and the master also has a public IP (it belongs to two networks). When I run small tasks, i.e. with small input and short processing time, everything works. However, when I run bigger tasks, e.g. with 7-8 GB of input data, my slave nodes start dying one after another.

From the web UI (http://master:50070/dfsnodelist.jsp?whatNodes=LIVE) I see that the "Last Contact" time starts increasing, and from my cluster provider's web UI I see that the nodes have crashed. Here is the screenshot of a node (I cannot scroll up):

Another machine showed this error with HDFS (hadoop dfs) running but no job running:

BUG: soft lockup - CPU#7 stuck for 27s! [java:4072]

and

BUG: soft lockup - CPU#5 stuck for 41s! [java:3309]
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata2.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in
         res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
ata2.00: status: { DRDY }

Here is another screenshot (which I cannot make any sense of):

Here is the log of a crashed datanode (with IP 192.168.0.9):

2014-02-01 15:17:34,874 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving blk_-2375077065158517857_1818 src: /192.168.0.7:53632 dest: /192.168.0.9:50010
2014-02-01 15:20:14,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for blk_-2375077065158517857_1818 java.io.EOFException: while trying to read 65557 bytes
2014-02-01 15:20:17,556 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-2375077065158517857_1818 0 : Thread is interrupted.
2014-02-01 15:20:17,556 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for blk_-2375077065158517857_1818 terminating
2014-02-01 15:20:17,557 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-2375077065158517857_1818 received exception java.io.EOFException: while trying to read 65557 bytes
2014-02-01 15:20:17,560 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.0.9:50010, storageID=DS-271028747-192.168.0.9-50010-1391093674214, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:296)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:340)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:404)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:582)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:404)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:112)
    at java.lang.Thread.run(Thread.java:744)
2014-02-01 15:21:48,350 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.19:60853, bytes: 132096, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000018_0_1657459557_1, offset: 0, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_-6962923875569811947_1279, duration: 276262265702
2014-02-01 15:21:56,707 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.19:60849, bytes: 792576, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000013_0_1311506552_1, offset: 0, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_4630218397829850426_1316, duration: 289841363522
2014-02-01 15:23:46,614 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call getProtocolVersion(org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol, 3) from 192.168.0.19:48460: output error
2014-02-01 15:23:46,617 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50020 caught: java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
    at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1756)
    at org.apache.hadoop.ipc.Server.access$2000(Server.java:97)
    at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:780)
    at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:844)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1472)
2014-02-01 15:24:26,800 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.9:36391, bytes: 10821100, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000084_0_-2100756773_1, offset: 0, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_496206494030330170_1187, duration: 439385255122
2014-02-01 15:27:11,871 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.20:32913, bytes: 462336, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000004_0_-1095467656_1, offset: 19968, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_-7029660283973842017_1326, duration: 205748392367
2014-02-01 15:27:57,144 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.9:36393, bytes: 10865080, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000033_0_-1409402881_1, offset: 0, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_-8749840347184507986_1447, duration: 649481124760
2014-02-01 15:28:47,945 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded blk_887028200097641216_1396
2014-02-01 15:30:17,505 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.8:58304, bytes: 10743459, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000202_0_1200991434_1, offset: 0, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_887028200097641216_1396, duration: 69130787562
2014-02-01 15:32:05,208 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.0.9:50010, storageID=DS-271028747-192.168.0.9-50010-1391093674214, infoPort=50075, ipcPort=50020) Starting thread to transfer blk_-7029660283973842017_1326 to 192.168.0.8:50010
2014-02-01 15:32:55,805 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.0.9:50010, storageID=DS-271028747-192.168.0.9-50010-1391093674214, infoPort=50075, ipcPort=50020) Starting thread to transfer blk_-34479901

Here is how the mapred-site.xml file on each node is set up:

<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
</property>

<property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
</property>

<property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
</property>

Each node has 8 CPUs and 8 GB of RAM. I know that I have set mapred.child.java.opts too high, but the same jobs used to run with these settings and this data. I have set reduce slowstart to 1.0, so reducers start only after all the mappers have finished.
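For completeness, the slowstart setting looks roughly like this in mapred-site.xml (a sketch assuming the Hadoop 1.x / MR1 property name, mapred.reduce.slowstart.completed.maps):

<property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>1.0</value>
</property>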

Pinging some nodes results in a small percentage of packet loss, and SSH connections freeze for a while, but I don't know if that is relevant. I have added the following line to the /etc/security/limits.conf file on each node:

hadoop hard nofile 16384

but that didn't work either.

SOLUTION: It seems that it was indeed a memory error after all. I had too many running tasks and the machines crashed. After they crashed and I rebooted them, the Hadoop jobs did not run correctly, even after I set the correct number of mappers. The solution was to remove the bad datanodes (through decommissioning) and then include them again. Here is what I did, and everything worked perfectly again, without losing any data:

How do I correctly remove nodes in Hadoop?

And of course, set a sensible maximum number of map and reduce tasks per node.
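In short, the decommission/recommission procedure boils down to the following (a minimal sketch for Hadoop 1.x; the excludes file path is hypothetical, adjust it to your installation):

<!-- In conf/hdfs-site.xml on the master (hypothetical path): -->
<property>
    <name>dfs.hosts.exclude</name>
    <value>/usr/local/hadoop/conf/excludes</value>
</property>
<!-- List the hostnames of the bad datanodes in that excludes file and run
     "hadoop dfsadmin -refreshNodes". Wait until the web UI shows them as
     "Decommissioned", then remove them from the excludes file, run
     "hadoop dfsadmin -refreshNodes" again and restart the datanode
     daemons to bring the nodes back into the cluster. -->

The jobtracker side has an analogous mapred.hosts.exclude property, refreshed with hadoop mradmin -refreshNodes.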

Answers


You are running out of memory here: you have allowed 2 GB of heap per task (mapred.child.java.opts) and up to 4 map tasks at once, which on its own can consume the full 8 GB of RAM on a node.

Try running the same job with a 1 GB -Xmx; it will most likely work.

If you want to use your cluster efficiently, set the -Xmx according to the HDFS block size of your files.

If your block size is 128 MB, then 512 MB is sufficient.
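For example, with the cluster described above that could look like this in mapred-site.xml (a sketch; pick the -Xmx that fits your block size and slot counts):

<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
</property>

With 4 map and 4 reduce slots at 512 MB each, the task JVMs top out at about 4 GB, leaving headroom for the datanode and tasktracker daemons and the OS on an 8 GB machine.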


Does the job have a combiner step between the map and reduce phases? I have had problems with memory-heavy map and combine steps happening at the same time on the same nodes, in tasks that take a lot of memory. With your config, if 2 maps and 2 combines are running and hold large objects in memory, you could be using 8 GB of RAM. Just a thought.

