Note : This article applies only to those who code. I have seen many strugling with MS Excel trying to figure out data in a large csv file, I don’t blame them beacause most people I have met ignore standard unix command line tools just because they cared about commandline tools.
When the data is BIG(anything above .5GBs) and if we are trying to figure out say even the coloumn names of a csv file MS Excel will get stuck and we will see a MS Windows Not Responding
. And if we are going to script using python/R/perl/* we will spend some time to script and more time for the script to complete.
Assume that we have a csv “car.csv” as shown below:
"","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
"Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
"Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
"Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
Let’s assume that we have a BIG csv with many coloumns and we are interested in the sum of 12th coloumn (carb) of the file.
cat car.csv | awk -F "," '{ sum += $12 } END { printf "%.2f\n", sum}'
# 29.00
cat
and awk
and we don’t even load the entire data into memory.The above line says:
- Using
cat
stream the contents into stdout - Pipe the stdout from
cat
to the next commandawk
. - With Awk
-F
parameter takes the deliminator that is used in big.csv.- once you specify the
-F
to deliminator all the fields will be available to you as { $1, $2 … } sum += $12
will increment the variable sum with 12th coloumn of each line.- use
printf
to format result.
Other useful commands:
Head and tail
If we want to get a sample data from BIG csv. Again using head or tail command we are not loading the entire file into memory, we are just reading them line by line.
# To get the first 3 lines of your csv
head car.csv -n 3
# "","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
# "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# Similarly for the last 5 lines
tail car.csv -n 3
# "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
>
in bash you can redirect the output to a file.
# first 100 lines into a new file first100.txt
head -n 100 big.csv > first100.txt
Word count
wc
command is used to get the number of line, words and bytes in your file
# number of lines, words and bytes
wc car.csv
# 11 25 602 car.csv
# to only output number of lines
wc -l car.csv
# 11
Grep
We wish to search for the line containing some text Datsun
we can use grep to search and return those rows for you. Grep supports regular expression and hell lot of options.
# Search for text Datsun
grep Datsun car.csv
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
# subset data with only lines containing Datsun
grep Datsun car.csv > datsun.csv
Cut is used to cut a line into fields according to a given deliminator or number of characters or any pattern.
# To cut your csv and show only first and third coloumn
cut -d "," -f 1,3 car.csv
# "","cyl"
# "Mazda RX4",6
# "Mazda RX4 Wag",6
# "Datsun 710",4
# "Hornet 4 Drive",6
# "Hornet Sportabout",8
# "Mazda RX4",6
# "Mazda RX4 Wag",6
# "Mazda RX4",6
# "Mazda RX4 Wag",6
# "Datsun 710",4
# You can cut by specifying start/end/number of characters
# To print only first 3 characters
cut -c-3 car.csv
# "","m
# "Mazd
# "Mazd
# "Dats
# "Horn
# "Horn
# "Mazd
# "Mazd
# "Mazd
# "Mazd
# "Dats
Sed and Awk
These two commands are more of a programing language than being just commands. We use sed and Awk to find and replace, count, add, basically to manipulate data.
Sed/Awk again combined with some regular expression will let we to do most manipulations on a text file.
# Replace all occurence of "Mazda" to "Maa" in your data.
sed s/Mazda/Maa/g car.csv
# "","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
# "Maa RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Maa RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
# "Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
# "Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
# "Maa RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Maa RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# "Maa RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Maa RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
# Delete all the lines containing Mazda
sed /Mazda/d car.csv > noMazda.csv
Sort and uniq
Using sort command will sort the csv treating each line as a single string, we can use sort to even sort based on a coloumn, numerical order or in reverse order. Uniq can be used to return only uniq rows or return the duplicated ones.
Say you want to sort a csv according to coloumn 12 (carb), -k (key) specifies the coloumn to sorted, -nr specifies sort to sort in reverse numeric order and -t specifies the coloumn deliminator.
# if you want to sort the csv according to carb(Coloumn 12)
sort -k 12 -nr -t "," car.csv
# "Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# "Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# "Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
# "Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
# "","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
sort car.csv | uniq -c
# 2 "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
# 1 "Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
# 1 "Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
# 3 "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# 3 "Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# 1 "","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
sort car.csv | uniq -d
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
# "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
uniq
output to wc
.
sort car.csv | uniq | wc -l
6
Sometimes you just want to split the file into n small parts and run your script over these n part files.
#Split the csv into part files of 100 lines
split big.csv --lines 100
GNU Plot
GNU plot is awesome tool for plotting. Read more.
I just want to say that working with unix command line tools is easier than using any other graphical tools, all you have to do is to try them for once. There are many custom tools to ease your workflow like jsontoCsv, GNU plot etc check them out. High level scripting languages can do a lot, I’m big fan of python but then using unix commandline is so much easier and reliable IMHO. R is pretty good tool too, but when data gets bigger R sucks, use commandline to make data small and tidy for R.
And you dont have to remember any of these parameters that we passed to unix commands, always use man page!
Links:
- Using command line for datascience, huge collection!
- A beginners guide that I wrote with abijith for fosscell juniors
- Noufal Ibrahim’s unix blog post
- I copied many examples from here