
Even if the data isn't big, there can be a benefit from the Hadoop infrastructure. Say you have just 86,400 rows of data, but each row takes 1 second to process. That adds up to 24 hours of elapsed time, and waiting for that run can be painful, especially if you are trying to experiment and iterate. With HDFS/MapReduce you can distribute that work across N machines and divide the elapsed time by N, speeding up the pace of iteration. I worked on a project with exactly this challenge before Hadoop was available, so we had to invent our own crappy ways of distributing the data to the N machines, monitoring them, and collecting the results. Hadoop HDFS and MapReduce, with the JobTracker etc., would have been much better than what we came up with.
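Hadoop aside, the underlying pattern — an embarrassingly parallel per-row job split across workers — can be sketched on a single machine with Python's multiprocessing; `process_row` here is a hypothetical stand-in for the 1-second-per-row computation:

```python
from multiprocessing import Pool

def process_row(row):
    # Stand-in for the hypothetical 1-second-per-row computation.
    return row * 2

if __name__ == "__main__":
    rows = range(86_400)
    # With N worker processes, elapsed time drops roughly by a factor of N
    # (for CPU-bound work, up to the number of physical cores).
    with Pool(processes=8) as pool:
        results = pool.map(process_row, rows)
    print(len(results))  # 86400 results, computed in parallel
```

The same idea scales out to N machines; the hard parts Hadoop solved are shipping the data, scheduling, and recovering from failed workers.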


Unless your problem is I/O bound (you can't get the data off the disks fast enough) or network bound (transferring data to worker nodes takes too long), Hadoop is the wrong choice. CPU-bound problems are better solved with grid solutions that do a better job of scaling up (within a single node) as well as scaling out to multiple machines. Taking a step back, you should always ask yourself whether this can be done on a single machine, taking advantage of Moore's Law.


What kind of processing takes 1s per row? That's several billion instructions. And you can easily fit 86400 rows in memory, so disk seeks aren't an issue.
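The "several billion instructions" and "fits in memory" claims are just back-of-the-envelope arithmetic; a quick sketch (the 3 GHz clock and 1 KB-per-row figures are assumptions for illustration):

```python
clock_hz = 3_000_000_000      # assume a ~3 GHz core, roughly one instruction per cycle
seconds_per_row = 1
instructions_per_row = clock_hz * seconds_per_row   # ~3 billion instructions per row

rows = 86_400
bytes_per_row = 1_000          # assume ~1 KB per row
total_bytes = rows * bytes_per_row                  # ~86 MB: fits comfortably in RAM
print(instructions_per_row, total_bytes)
```

At ~86 MB the whole dataset sits in memory, so the cost is pure computation, not disk seeks.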


Decent RDBMS servers will parallelise where possible and use the server's 8 cores (or whatever) to optimise such a problem.



