Hacker News

There's an important distinction to be made between the storage layer and the analysis layer. Something like HDFS can make sense as a storage layer once you hit the >10TB range, even if your average dataset for analysis is reasonably small (and it should be; 99% of the time you can get by with sampling down to single-machine size). That doesn't mean you need to set up all your analysis jobs to run via map-reduce; you can usually dump the dataset to a dedicated machine and do it all in one go with sequential algorithms. As a side benefit, you get access to algorithms that are really difficult to express efficiently as map-reduce (e.g., computations over ordered time series).
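To illustrate that last point, here's a minimal sketch (hypothetical example, not from any particular codebase): an exponentially weighted moving average over an ordered series. It's a one-line recurrence as a sequential pass on a single machine, but awkward in map-reduce because each output depends on the previous one, so the computation can't be freely partitioned.

```python
def ewma(values, alpha=0.5):
    """Sequential pass over an ordered time series.

    Each output depends on the previous one, which is exactly the
    dependency structure that map-reduce handles poorly.
    """
    result = []
    avg = None
    for v in values:
        # Recurrence: avg_t = alpha * v_t + (1 - alpha) * avg_{t-1}
        avg = v if avg is None else alpha * v + (1 - alpha) * avg
        result.append(avg)
    return result

print(ewma([1.0, 2.0, 3.0, 4.0]))  # [1.0, 1.5, 2.25, 3.125]
```

Once the (sampled) dataset fits on one box, this kind of order-dependent pass is both simpler to write and usually faster than the distributed equivalent.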


