Blood from a turnip
It has been argued that data is growing faster than our ability to process it. The challenge is turning that data into intelligence in a way that is clean, fast and scalable.
We've all heard that it is impossible to squeeze blood from a turnip, yet that is exactly what we must do with data. We must be prepared to keep delivering lightning-fast analytics at scale, and to answer questions that haven't been asked yet.
There is plenty of room to debate the pros and cons of the various methodologies in use today. Generally speaking, I would argue that when it comes to big data it will always be to your advantage to move the data as few times as possible. Often that means bringing the compute to the data rather than the other way around.
Early on I used HDFS file formats to achieve this, which in itself can mean moving that data. I still use and recommend Spark or Hive in places that have already invested, or intend to invest, in Hadoop infrastructure.
More recently I have built data warehouses using the columnar database technology provided by Vertica. While relatively expensive, these warehouses perform consistently well on large datasets. Still, in a world driven by ROI, it is impossible to overlook the substantial cost of storing petabyte-scale data using this method.
Seizing on the practicality of bringing compute to your data, AWS has taken a decidedly purposeful step in this arena by introducing Athena. Athena's basic premise is that you can now use an infrastructure-free engine to access and query data right in S3. Considering the huge footprint Amazon has when it comes to data stored in its S3 service, there is tremendous potential for this service.
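To make that premise a little more concrete, here is a minimal sketch of submitting a query to Athena from Python. It assumes boto3 is installed and AWS credentials are already configured. The database name "weblogs", the table "page_views" and the results bucket are hypothetical placeholders, not anything from a real deployment.

```python
def build_athena_request(sql, database, output_location):
    """Assemble the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_athena_query(sql, database, output_location):
    """Submit a query to Athena and return its execution id.

    Requires boto3 and configured AWS credentials.
    """
    import boto3  # imported here so the request builder works without it

    client = boto3.client("athena")
    response = client.start_query_execution(
        **build_athena_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]


# Hypothetical names: the database, table and bucket below are placeholders.
request = build_athena_request(
    sql="SELECT status, COUNT(*) AS hits FROM page_views GROUP BY status",
    database="weblogs",
    output_location="s3://example-athena-results/queries/",
)
```

The data itself never leaves S3 here. Only the query text goes to Athena, and the results are written back to the output location. The call is asynchronous, so in practice you would poll get_query_execution with the returned id and then read the rows with get_query_results.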
Similarly, Vertica has introduced Eon Mode in version 9 of their product. Eon Mode allows the use of Vertica's powerful SQL engine to access data directly in S3.