Converting large CSVs to a nested data structure using Apache Spark
What is Apache Spark? Apache Spark brings fast, in-memory data processing to Hadoop. Elegant and expressive development APIs in Scala, Java, and Python let data workers efficiently run streaming, machine learning, or SQL workloads that need fast, iterative access to datasets.
Quick start guide
Problem Statement / Task: Read a large number of very big CSVs (~GBs each) from Hadoop HDFS, clean them, convert them to a nested data structure, and write the results to MongoDB using Apache Spark.
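A minimal PySpark sketch of that pipeline, assuming the MongoDB Spark Connector 10.x is on the classpath; the HDFS path, the column names (user_id, street, city, zip), and the connection URI are illustrative placeholders, not part of the original post:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical database/collection URI; adjust for your deployment.
spark = (
    SparkSession.builder
    .appName("csv-to-nested-mongo")
    .config("spark.mongodb.write.connection.uri",
            "mongodb://localhost:27017/mydb.mycoll")
    .getOrCreate()
)

# Read every CSV in an HDFS directory; Spark parallelises the scan
# across the cluster, so multi-GB files are split into partitions.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///data/input/*.csv")  # hypothetical path
)

# Clean: drop rows missing the key field and trim stray whitespace.
cleaned = (
    df.dropna(subset=["user_id"])          # hypothetical key column
      .withColumn("city", F.trim(F.col("city")))
)

# Nest flat columns into a sub-document with struct(); the MongoDB
# connector maps a StructType column to an embedded document.
nested = cleaned.select(
    "user_id",
    F.struct("street", "city", "zip").alias("address"),
)

# Append the nested documents to the target MongoDB collection.
nested.write.format("mongodb").mode("append").save()
```

The key step is struct(), which folds flat CSV columns into a nested column on the DataFrame, so the rows land in MongoDB as documents with embedded sub-documents rather than flat key-value pairs.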