DataFrames
What is a DataFrame?
In short, a DataFrame is a table of data, where each column has a title and a type. Operations can be performed on a per column or per row basis, and columns and rows can be added or removed easily. The DataFrames package in Julia is the de facto standard for working with tabular data and is integrated in many other packages. In the previous section we have seen that you can read a data file directly into a DataFrame via the CSV package, but this is only one example of other packages integrating with DataFrames.
The DataFrames package provides a lot of functionality for manipulating the data contain in a DataFrame. There are also additional packages that provide either more options, or alternative (simpler) syntax for convenience, such as DataFramesMeta and Tidier.
We shall look at the basic use of DataFrames and leave these auxiliary packages and some more advanced functionality of DataFrames for self-study.
Creating a DataFrame
Usually, you would create a DataFrame by reading in data from a file. You can however also create a DataFrame from variables directly:
df = DataFrame(X = [1, 2, 3], Y = [2, 3, 4], Z = 2.0)
# 3×3 DataFrame
# Row │ X Y Z
# │ Int64 Int64 Float64
# ─────┼───────────────────────
# 1 │ 1 2 2.0
# 2 │ 2 3 2.0
# 3 │ 3 4 2.0Note that columns X and Y are specified as arrays of integers, and so the columns are typed as Int64. Z is specified as a floating point value and hence the column is types as Float64. The length of Z is also expanded to be the same as the other columns, with a fixed value. The column names, X, Y and Z, are all legal Julia symbols (variable names) and so we can use the assignment operator (=) here.
You can also give columns longer names, with spaces, but then instead you need to either use the pair operator, =>, or define a Dict and then convert that to a DataFrame:
df = DataFrame("X values" => [1, 2, 3], "Y values" => [2, 3, 4], "Z" => 2.0)
# 3×3 DataFrame
# Row │ X values Y values Z
# │ Int64 Int64 Float64
# ─────┼─────────────────────────────
# 1 │ 1 2 2.0
# 2 │ 2 3 2.0
# 3 │ 3 4 2.0
d = Dict("X values" => [1, 2, 3], "Y values" => [2, 3, 4], "Z" => 2.0)
# Dict{String, Any} with 3 entries:
# "Z" => 2.0
# "X values" => [1, 2, 3]
# "Y values" => [2, 3, 4]
df = DataFrame(d)
# 3×3 DataFrame
# Row │ X values Y values Z
# │ Int64 Int64 Float64
# ─────┼─────────────────────────────
# 1 │ 1 2 2.0
# 2 │ 2 3 2.0
# 3 │ 3 4 2.0Don’t use strings for column names just because you can. Using a symbol, like :X_values, instead of the string “X values” will give faster results and is just as readable. This does mean you need to use underscores instead of spaces, but all-in-all that is not too high a price to pay.
A DataFrame can also be constructed one column at a time:
df = DataFrame()
# 0×0 DataFrame
df.A = 1:3
# 1:3
df.B = 2:2:6
# 2:2:6
df
# 3×2 DataFrame
# Row │ A B
# │ Int64 Int64
# ─────┼──────────────
# 1 │ 1 2
# 2 │ 2 4
# 3 │ 3 6Or row by row:
df = DataFrame(A=String[], B=Int64[])
# 0×2 DataFrame
# Row │ A B
# │ String Int64
# ─────┴───────────────
push!(df, ("Bob", 1))
# 1×2 DataFrame
# Row │ A B
# │ String Int64
# ─────┼───────────────
# 1 │ Bob 1
push!(df, ("Tom", 5))
# 2×2 DataFrame
# Row │ A B
# │ String Int64
# ─────┼───────────────
# 1 │ Bob 1
# 2 │ Tom 5Note how we forced types of the columns in the creation of the empty DataFrame. This is needed to give the columns names, and tell the empty DataFrame how many columns it will hold. You could specify the types as Any, but that would not lead to good performance.
Lastly, you can also convert a matrix to a DataFrame. This requires you to either allow DataFrames to automatically generate column names, or to pass a vector of names:
m = rand(5, 5)
# 5×5 Matrix{Float64}:
# 0.460589 0.734461 0.771352 0.292767 0.0078547
# 0.254958 0.153814 0.571355 0.196476 0.739701
# 0.0799407 0.457583 0.553493 0.25409 0.245967
# 0.106429 0.435509 0.0184347 0.932008 0.805972
# 0.0877892 0.0231817 0.435262 0.0187286 0.812151
df = DataFrame(m, :auto) # Automatically generate names
# 5×5 DataFrame
# Row │ x1 x2 x3 x4 x5
# │ Float64 Float64 Float64 Float64 Float64
# ─────┼───────────────────────────────────────────────────────
# 1 │ 0.460589 0.734461 0.771352 0.292767 0.0078547
# 2 │ 0.254958 0.153814 0.571355 0.196476 0.739701
# 3 │ 0.0799407 0.457583 0.553493 0.25409 0.245967
# 4 │ 0.106429 0.435509 0.0184347 0.932008 0.805972
# 5 │ 0.0877892 0.0231817 0.435262 0.0187286 0.812151
df = DataFrame(m, [:X, :Y, :Z, :W, :Q]) # Specify names
# 5×5 DataFrame
# Row │ X Y Z W Q
# │ Float64 Float64 Float64 Float64 Float64
# ─────┼───────────────────────────────────────────────────────
# 1 │ 0.460589 0.734461 0.771352 0.292767 0.0078547
# 2 │ 0.254958 0.153814 0.571355 0.196476 0.739701
# 3 │ 0.0799407 0.457583 0.553493 0.25409 0.245967
# 4 │ 0.106429 0.435509 0.0184347 0.932008 0.805972
# 5 │ 0.0877892 0.0231817 0.435262 0.0187286 0.812151Accessing the Data
Once the data is in a DataFrame, you can access column like they are individual vectors1:
df.X
# 5-element Vector{Float64}:
# 0.460589206535297
# 0.25495814887990786
# 0.07994066869519556
# 0.10642869978604408
# 0.0877892134158087You can also add columns, even via calculations:
df.New = (df.X).^2 .- df.Y
# 5-element Vector{Float64}:
# -0.5223183397440386
# -0.08881079559662731
# -0.45119243718262425
# -0.42418203290738643
# -0.015474773743873368
df
# 5×6 DataFrame
# Row │ X Y Z W Q New
# │ Float64 Float64 Float64 Float64 Float64 Float64
# ─────┼───────────────────────────────────────────────────────────────────
# 1 │ 0.460589 0.734461 0.771352 0.292767 0.0078547 -0.522318
# 2 │ 0.254958 0.153814 0.571355 0.196476 0.739701 -0.0888108
# 3 │ 0.0799407 0.457583 0.553493 0.25409 0.245967 -0.451192
# 4 │ 0.106429 0.435509 0.0184347 0.932008 0.805972 -0.424182
# 5 │ 0.0877892 0.0231817 0.435262 0.0187286 0.812151 -0.0154748Note the use of broadcasting in the first command - using .^ instead of ^ and .- instead of -. This tells Julia to execute these operations in element-wise fashion. The new column, df.New, is added to the DataFrame automatically.
You can also access individual entries by specifying the row and column, the latter using either the names or indexes:
df[2, 3] # Row 2, Column 3 (Z)
# 0.5713552793835407
df[2:3, 3:5]
# 2×3 DataFrame
# Row │ Z W Q
# │ Float64 Float64 Float64
# ─────┼──────────────────────────────
# 1 │ 0.571355 0.196476 0.739701
# 2 │ 0.553493 0.25409 0.245967
df[1:3, :W]
# 3-element Vector{Float64}:
# 0.29276715395306163
# 0.19647571959450094
# 0.25409034700483624To select an entire row or column, the : operator is used, just like in Arrays. This returns a copy of the slice of the DataFrame. When using a !, however, a view to the data is returned. Modifying the view will modify the contents of the DataFrame as well:
df = DataFrame(X = [1, 2, 3], Y = [2, 4, 6], Z = [3, 6, 9])
# 3×3 DataFrame
# Row │ X Y Z
# │ Int64 Int64 Int64
# ─────┼─────────────────────
# 1 │ 1 2 3
# 2 │ 2 4 6
# 3 │ 3 6 9
xvals = df[:, :X]
# 3-element Vector{Int64}:
# 1
# 2
# 3
xvals[1] = 100
# 100
df
# 3×3 DataFrame
# Row │ X Y Z
# │ Int64 Int64 Int64
# ─────┼─────────────────────
# 1 │ 1 2 3
# 2 │ 2 4 6
# 3 │ 3 6 9
xview = df[!, :X]
# 3-element Vector{Int64}:
# 1
# 2
# 3
xview[1] = 100
# 100
df
# 3×3 DataFrame
# Row │ X Y Z
# │ Int64 Int64 Int64
# ─────┼─────────────────────
# 1 │ 100 2 3
# 2 │ 2 4 6
# 3 │ 3 6 9You can also select just specific rows and columns:
df[[1, 3], [1, 3]]
# 2×2 DataFrame
# Row │ X Z
# │ Int64 Int64
# ─────┼──────────────
# 1 │ 1 3
# 2 │ 3 9Data Wrangling with DataFrames
Data wrangling is the pre-processing of your data to get it into the form you need for the modelling or analysis you intend to do with it. This can include many steps. DataFrames has a rich feature set to support you in this. We’ll look at several examples here, but if you are going to be doing this type of work often, save yourself a lot of page-turning and download, print and laminate a copy of the Data Wrangling with DataFrames.jl Cheat Sheet.
We start off by creating a dataset to work with:
df = DataFrame(X = 1:3:1500, Y = repeat(1:100, outer=5), Z = repeat(1:100, inner=5)
# 500×3 DataFrame
# Row │ X Y Z
# │ Int64 Int64 Int64
# ─────┼─────────────────────
# 1 │ 1 1 1
# 2 │ 4 2 1
# 3 │ 7 3 1
# 4 │ 10 4 1
# 5 │ 13 5 1
# 6 │ 16 6 2
# 7 │ 19 7 2
# 8 │ 22 8 2
# 9 │ 25 9 2
# ⋮ │ ⋮ ⋮ ⋮
# 492 │ 1474 92 99
# 493 │ 1477 93 99
# 494 │ 1480 94 99
# 495 │ 1483 95 99
# 496 │ 1486 96 100
# 497 │ 1489 97 100
# 498 │ 1492 98 100
# 499 │ 1495 99 100
# 500 │ 1498 100 100
# 482 rows omitted
first(df,3)
# 3×3 DataFrame
# Row │ X Y Z
# │ Int64 Int64 Int64
# ─────┼─────────────────────
# 1 │ 1 1 1
# 2 │ 4 2 1
# 3 │ 7 3 1
last(df,4)
# 4×3 DataFrame
# Row │ X Y Z
# │ Int64 Int64 Int64
# ─────┼─────────────────────
# 1 │ 1489 97 100
# 2 │ 1492 98 100
# 3 │ 1495 99 100
# 4 │ 1498 100 100
names(df)
# 3-element Vector{String}:
# "X"
# "Y"
# "Z"
describe(df)
# 3×7 DataFrame
# Row │ variable mean min median max nmissing eltype
# │ Symbol Float64 Int64 Float64 Int64 Int64 DataType
# ─────┼──────────────────────────────────────────────────────────────
# 1 │ X 749.5 1 749.5 1498 0 Int64
# 2 │ Y 50.5 1 50.5 100 0 Int64
# 3 │ Z 50.5 1 50.5 100 0 Int64In generating this data, we used Julia’s repeat function. For the Y column, we specify outer=5, which means the data (1:100) is repeated 5 times as a set, i.e. 1 to 100, then 1 to 100 again, for a total of 5 times. For the Z column, we specify inner=5, which means each of the values from 1 to 100 is repeated 5 times: 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,… etc.
We can quickly see the first or last few rows using the Julia functions first and last, with the second parameter showing how many entries to display. The number of rows displayed by Julia for the full DataFrame depends on the size of the window it is being displayed in. The middle omitted rows are indicated with the ⋮ characters. If the DataFrame is too wide, columns will also be omitted, with a message indicating how many columns are hidden.
If you really want to see all the rows or columns, you can call show and specify allrows=true and/or allcols=true
show(df, allrows=true, allcols=true)For obvious reasons, we won’t include the 500 rows of data here.
The names() function returns a vector of the column names as strings. You can also get summary statistics of the data with the describe() function.
If you only want specific statistics, you can specify them in the call:
describe(df, :mean, :std)
# 3×3 DataFrame
# Row │ variable mean std
# │ Symbol Float64 Float64
# ─────┼────────────────────────────
# 1 │ X 749.5 433.445
# 2 │ Y 50.5 28.895
# 3 │ Z 50.5 28.895You can also apply a custom function to describe the data. Here we use the Julia function extrema(), which returns the minimum and maximum value in an array. We’ll specify that we want the results in a column call minmax:
describe(df, extrema => :minmax)
# 3×2 DataFrame
# Row │ variable minmax
# │ Symbol Tuple…
# ─────┼─────────────────────
# 1 │ X (1, 1498)
# 2 │ Y (1, 100)
# 3 │ Z (1, 100)Lastly, the number of rows and column can be easily retrieved:
nrow(df)
# 500
ncol(df)
# 6Selecting and Transforming Data
Selecting Columns
In the introductory section, we selected specific rows and columns based on their indices. You can also select columns based on their names:
select(df, [:X, :Z])
# 500×2 DataFrame
# Row │ X Z
# │ Int64 Int64
# ─────┼──────────────
# 1 │ 1 1
# 2 │ 4 1
# 3 │ 7 1
# ⋮ │ ⋮ ⋮
# 498 │ 1492 100
# 499 │ 1495 100
# 500 │ 1498 100
# 494 rows omittedThere are also selector function to allow for selection of ranges of columns:
Not(): Select all the columns expect the specified one(s)Between(): Select all column starting with the first specified column and ending with the second. You can use names or indices to specify the columnsCols(): Use this to combine other selector functionsAll(): Select all columns - same as specifying:
Examples:
In the previous example we selected columns X and Z, which is the same as selecting everything expect Y:
select(df, Not(:Y))
# 500×2 DataFrame
# Row │ X Z
# │ Int64 Int64
# ─────┼──────────────
# 1 │ 1 1
# 2 │ 4 1
# 3 │ 7 1
# ⋮ │ ⋮ ⋮
# 498 │ 1492 100
# 499 │ 1495 100
# 500 │ 1498 100
# 494 rows omittedWe only passed a single symbol to Not, but could also pass an array of symbols to exclude multiple columns.
To make things more interesting, we can first add a few more columns:
df.A = 2 .* df.X
df.B = df.Y .+ df.Z
df.C = df.Z .^ 2
df
# 500×6 DataFrame
# Row │ X Y Z A B C
# │ Int64 Int64 Int64 Int64 Int64 Int64
# ─────┼──────────────────────────────────────────
# 1 │ 1 1 1 2 2 1
# 2 │ 4 2 1 8 3 1
# 3 │ 7 3 1 14 4 1
# ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
# 498 │ 1492 98 100 2984 198 10000
# 499 │ 1495 99 100 2990 199 10000
# 500 │ 1498 100 100 2996 200 10000
# 494 rows omittedselect(df, Cols(:A, Between(:X, :Z)))
# 500×4 DataFrame
# Row │ A X Y Z
# │ Int64 Int64 Int64 Int64
# ─────┼────────────────────────────
# 1 │ 2 1 1 1
# 2 │ 8 4 2 1
# 3 │ 14 7 3 1
# ⋮ │ ⋮ ⋮ ⋮ ⋮
# 498 │ 2984 1492 98 100
# 499 │ 2990 1495 99 100
# 500 │ 2996 1498 100 100
# 494 rows omittedHere we used Cols to combine selecting column A and all columns from X to Z. This also changed the order of the columns.
We used the select function to do the selections, but could as easily have used the normal array-type notation:
df[:, Cols(:A, Between(:X, :Z))]
# 500×4 DataFrame
# Row │ A X Y Z
# │ Int64 Int64 Int64 Int64
# ─────┼────────────────────────────
# 1 │ 2 1 1 1
# 2 │ 8 4 2 1
# 3 │ 14 7 3 1
# ⋮ │ ⋮ ⋮ ⋮ ⋮
# 498 │ 2984 1492 98 100
# 499 │ 2990 1495 99 100
# 500 │ 2996 1498 100 100
# 494 rows omittedTo permanently change the DataFrame, either assign the returned result to the same variable, or use the mutating version of select(): select!().
df = select(df, Cols(:A, Between(:X, :Z)))
df
# 500×4 DataFrame
# Row │ A X Y Z
# │ Int64 Int64 Int64 Int64
# ─────┼────────────────────────────
# 1 │ 2 1 1 1
# 2 │ 8 4 2 1
# 3 │ 14 7 3 1
# ⋮ │ ⋮ ⋮ ⋮ ⋮
# 498 │ 2984 1492 98 100
# 499 │ 2990 1495 99 100
# 500 │ 2996 1498 100 100
# 494 rows omittedor
select!(df, Cols(:A, Between(:X, :Z)))
# 500×4 DataFrame
# Row │ A X Y Z
# │ Int64 Int64 Int64 Int64
# ─────┼────────────────────────────
# 1 │ 2 1 1 1
# 2 │ 8 4 2 1 Da
# 3 │ 14 7 3 1
# ⋮ │ ⋮ ⋮ ⋮ ⋮
# 498 │ 2984 1492 98 100
# 499 │ 2990 1495 99 100
# 500 │ 2996 1498 100 100
# 494 rows omittedIt is also possible to select columns from a DataFrame that match to a regular expression. This is a template that describes a string of text. For more information, see one the many tutorials on the web.
df = DataFrame(
Names = ["Tom", "Dick", "Harry"],
AddressLine1 = ["2 Maple Drive", "4 Oak Street", "6 Pine Road"],
AddressLine2 = ["Hopetown", "Smallville", "Metropolis"],
AddressZip = [1234, 2345, 3456]
)
3×4 DataFrame
# Row │ Names AddressLine1 AddressLine2 AddressZip
# │ String String String Int64
# ─────┼─────────────────────────────────────────────────
# 1 │ Tom 2 Maple Drive Hopetown 1234
# 2 │ Dick 4 Oak Street Smallville 2345
# 3 │ Harry 6 Pine Road Metropolis 3456
df[:, r"Addr"]
# 3×3 DataFrame
# Row │ AddressLine1 AddressLine2 AddressZip
# │ String String Int64
# ─────┼─────────────────────────────────────────
# 1 │ 2 Maple Drive Hopetown 1234
# 2 │ 4 Oak Street Smallville 2345
# 3 │ 6 Pine Road Metropolis 34Here we select all rows, but only the column that have the sub-string “Addr” in their names.
Selecting Rows (Filtering) and Sorting
You can select rows by filtering the data. There are two equivalent ways:
filter(:X => <=(10), df)
# 4×6 DataFrame
# Row │ X Y Z A B C
# │ Int64 Int64 Int64 Int64 Int64 Int64
# ─────┼──────────────────────────────────────────
# 1 │ 1 1 1 2 2 1
# 2 │ 4 2 1 8 3 1
# 3 │ 7 3 1 14 4 1
# 4 │ 10 4 1 20 5 1
df[df.X .<= 10, :]
# 4×6 DataFrame
# Row │ X Y Z A B C
# │ Int64 Int64 Int64 Int64 Int64 Int64
# ─────┼──────────────────────────────────────────
# 1 │ 1 1 1 2 2 1
# 2 │ 4 2 1 8 3 1
# 3 │ 7 3 1 14 4 1
# 4 │ 10 4 1 20 5 1In both cases we pass a predicate function, i.e. a function that returns true for rows that should be included. In the first case, a pair operator links a column name to the predicate function and in the second case we simply pass an expression that returns true for all the correct instances. Note the use of broadcasting in the second example.
The contents of the column being checked in the predicate function is passed as the first parameter, so :X => <=(10) is in fact df[<current row>, X] <= 10, which can also be written as <=(df[<current row>, X], 10), since operators in Julia are just functions with special syntax - they can still be called like normal functions.
You can use any predicate function. A new useful ones include:
df[findfirst(==(4), df.C), :] # Find the first row where column C has a value of 4
# DataFrameRow
# Row │ X Y Z A B C
# │ Int64 Int64 Int64 Int64 Int64 Int64
# ─────┼──────────────────────────────────────────
# 6 │ 16 6 2 32 8 4
df[findall(==(4), df.C), :] # Find all rows where column C has a value of 4
# 5×6 DataFrame
# Row │ X Y Z A B C
# │ Int64 Int64 Int64 Int64 Int64 Int64
# ─────┼──────────────────────────────────────────
# 1 │ 16 6 2 32 8 4
# 2 │ 19 7 2 38 9 4
# 3 │ 22 8 2 44 10 4
# 4 │ 25 9 2 50 11 4
# 5 │ 28 10 2 56 12 4
df[df.C .== 4, :] # Find all rows where column C has a value of 4, alternative form
# 5×6 DataFrame
# Row │ X Y Z A B C
# │ Int64 Int64 Int64 Int64 Int64 Int64
# ─────┼──────────────────────────────────────────
# 1 │ 16 6 2 32 8 4
# 2 │ 19 7 2 38 9 4
# 3 │ 22 8 2 44 10 4
# 4 │ 25 9 2 50 11 4
# 5 │ 28 10 2 56 12 4It may also be useful to sort the data according to specified columns:
sort(df, :Y)
# 500×6 DataFrame
# Row │ X Y Z A B C
# │ Int64 Int64 Int64 Int64 Int64 Int64
# ─────┼──────────────────────────────────────────
# 1 │ 1 1 1 2 2 1
# 2 │ 301 1 21 602 22 441
# 3 │ 601 1 41 1202 42 1681
# 4 │ 901 1 61 1802 62 3721
# 5 │ 1201 1 81 2402 82 6561
# ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
# 497 │ 598 100 40 1196 140 1600
# 498 │ 898 100 60 1796 160 3600
# 499 │ 1198 100 80 2396 180 6400
# 500 │ 1498 100 100 2996 200 10000
# 491 rows omitted
sort(df, [:Y, :B])
# 500×6 DataFrame
# Row │ X Y Z A B C
# │ Int64 Int64 Int64 Int64 Int64 Int64
# ─────┼──────────────────────────────────────────
# 1 │ 1 1 1 2 2 1
# 2 │ 301 1 21 602 22 441
# 3 │ 601 1 41 1202 42 1681
# 4 │ 901 1 61 1802 62 3721
# 5 │ 1201 1 81 2402 82 6561
# ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
# 497 │ 598 100 40 1196 140 1600
# 498 │ 898 100 60 1796 160 3600
# 499 │ 1198 100 80 2396 180 6400
# 500 │ 1498 100 100 2996 200 10000
# 491 rows omittedApplying Function to Columns and Rows
Applying a function to a column or multiple columns has been shown before when we added columns to a DataFrame. Since each column acts like a vector, we simply use normal Julia broadcasting to apply functions to whole columns:
df.BdivC = df.B ./ df.C
# 500-element Vector{Float64}:
# 2.0
# 3.0
# 4.0
# 5.0
# 6.0
# 2.0
# 2.25
# ⋮
# 0.01979389858177737
# 0.0196
# 0.0197
# 0.0198
# 0.0199
# 0.02
# df
# 500×7 DataFrame
# Row │ X Y Z A B C BdivC
# │ Int64 Int64 Int64 Int64 Int64 Int64 Float64
# ─────┼─────────────────────────────────────────────────────
# 1 │ 1 1 1 2 2 1 2.0
# 2 │ 4 2 1 8 3 1 3.0
# 3 │ 7 3 1 14 4 1 4.0
# 4 │ 10 4 1 20 5 1 5.0
# 5 │ 13 5 1 26 6 1 6.0
# ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
# 497 │ 1489 97 100 2978 197 10000 0.0197
# 498 │ 1492 98 100 2984 198 10000 0.0198
# 499 │ 1495 99 100 2990 199 10000 0.0199
# 500 │ 1498 100 100 2996 200 10000 0.02
# 491 rows omittedIn order to apply functions across rows, we need to use some of the built-in functions of DataFrames, select and transform, along with ByRow.
The basic syntax is:
select(df, <array of columns to select>)select(df, <array of columns to select>, <array of columns to pass to function> => <function> => <name(s) of column(s) to hold result(s)>)select(df, AsTable(:) => <function> => <name(s) of column(s) to hold result>)# pass the entire table to the function
Some examples:
select(df, :X, :X => cumsum => :cumX) # Put the cumulative sum of X in column cumX
# 500×2 DataFrame
# Row │ X cumX
# │ Int64 Int64
# ─────┼───────────────
# 1 │ 1 1
# 2 │ 4 5
# 3 │ 7 12
# 4 │ 10 22
# 5 │ 13 35
# ⋮ │ ⋮ ⋮
# 497 │ 1489 370265
# 498 │ 1492 371757
# 499 │ 1495 373252
# 500 │ 1498 374750
# 491 rows omitted
select(df, [:X, :Y], [:X, :Y] => ByRow((x, y) -> sin(x)*cos(y)) => :sinXcosY) # Put the value of sin(X)*cos*(Y) in column sinXcosY
# 500×3 DataFrame
# Row │ X Y sinXcosY
# │ Int64 Int64 Float64
# ─────┼───────────────────────────
# 1 │ 1 1 0.454649
# 2 │ 4 2 0.314941
# 3 │ 7 3 -0.650412
# 4 │ 10 4 0.355596
# 5 │ 13 5 0.119185
# ⋮ │ ⋮ ⋮ ⋮
# 497 │ 1489 97 0.106082
# 498 │ 1492 98 -0.207859
# 499 │ 1495 99 -0.0154374
# 500 │ 1498 100 0.443124
# 491 rows omitted
select(df, AsTable(:) => ByRow(extrema) => [:min, :max]) # Find the smallest and largest value in each row
# 500×2 DataFrame
# Row │ min max
# │ Int64 Int64
# ─────┼──────────────
# 1 │ 1 2
# 2 │ 1 8
# 3 │ 1 14
# 4 │ 1 20
# 5 │ 1 26
# ⋮ │ ⋮ ⋮
# 497 │ 97 10000
# 498 │ 98 10000
# 499 │ 99 10000
# 500 │ 100 10000
# 491 rows omittedIn each case, we used select and it only returned the specified columns. In the last example, we didn’t select any column, but just specified the function to apply. As a result, only the columns holding the results of the function are selected. If we instead call transform, the entire DataFrame is returned, as well as any new columns added in the calculations, e.g.
transform(df, AsTable(:) => ByRow(extrema) => [:min, :max])
# 500×8 DataFrame
# Row │ X Y Z A B C min max
# │ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
# ─────┼────────────────────────────────────────────────────────
# 1 │ 1 1 1 2 2 1 1 2
# 2 │ 4 2 1 8 3 1 1 8
# 3 │ 7 3 1 14 4 1 1 14
# 4 │ 10 4 1 20 5 1 1 20
# 5 │ 13 5 1 26 6 1 1 26
# ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
# 497 │ 1489 97 100 2978 197 10000 97 10000
# 498 │ 1492 98 100 2984 198 10000 98 10000
# 499 │ 1495 99 100 2990 199 10000 99 10000
# 500 │ 1498 100 100 2996 200 10000 100 10000
# 491 rows omittedJust like before, select and transform return new DataFrames, while select! and transform! will modify the existing DataFrame.
Grouping Data
DataFrames will group your data by column values through the groupby() function. One or more columns can be specified and you will get a GroupedDataFrame - a container of SubDataFrames for each unique combination of the values in the specified columns.
df = DataFrame(X = 1:50, Y = repeat(1:2, outer=25), Z = repeat(2:2:10, inner=10))
# 50×3 DataFrame
# Row │ X Y Z
# │ Int64 Int64 Int64
# ─────┼─────────────────────
# 1 │ 1 1 2
# 2 │ 2 2 2
# 3 │ 3 1 2
# 4 │ 4 2 2
# 5 │ 5 1 2
# 6 │ 6 2 2
# ⋮ │ ⋮ ⋮ ⋮
# 45 │ 45 1 10
# 46 │ 46 2 10
# 47 │ 47 1 10
# 48 │ 48 2 10
# 49 │ 49 1 10
# 50 │ 50 2 10
# 38 rows omitted
gdf = groupby(df, [:Y, :Z]
# GroupedDataFrame with 10 groups based on keys: Y, Z
# First Group (5 rows): Y = 1, Z = 2
# Row │ X Y Z
# │ Int64 Int64 Int64
# ─────┼─────────────────────
# 1 │ 1 1 2
# ⋮ │ ⋮ ⋮ ⋮
# 5 │ 9 1 2
# 3 rows omitted
# ⋮
# Last Group (5 rows): Y = 2, Z = 10
# Row │ X Y Z
# │ Int64 Int64 Int64
# ─────┼─────────────────────
# 1 │ 42 2 10
# ⋮ │ ⋮ ⋮ ⋮
# 5 │ 50 2 10
# 3 rows omitted
counts = combine(gdf, nrow, :X => mean)
# 10×4 DataFrame
# Row │ Y Z nrow X_mean
# │ Int64 Int64 Int64 Float64
# ─────┼──────────────────────────────
# 1 │ 1 2 5 5.0
# 2 │ 1 4 5 15.0
# 3 │ 1 6 5 25.0
# 4 │ 1 8 5 35.0
# 5 │ 1 10 5 45.0
# 6 │ 2 2 5 6.0
# 7 │ 2 4 5 16.0
# 8 │ 2 6 5 26.0
# 9 │ 2 8 5 36.0
# 10 │ 2 10 5 46.0
counts = select(gdf, nrow, :X => mean)
# 50×4 DataFrame
# Row │ Y Z nrow X_mean
# │ Int64 Int64 Int64 Float64
# ─────┼──────────────────────────────
# 1 │ 1 2 5 5.0
# 2 │ 2 2 5 6.0
# 3 │ 1 2 5 5.0
# 4 │ 2 2 5 6.0
# 5 │ 1 2 5 5.0
# 6 │ 2 2 5 6.0
# ⋮ │ ⋮ ⋮ ⋮ ⋮
# 45 │ 1 10 5 45.0
# 46 │ 2 10 5 46.0
# 47 │ 1 10 5 45.0
# 48 │ 2 10 5 46.0
# 49 │ 1 10 5 45.0
# 50 │ 2 10 5 46.0
# 38 rows omittedHere we create a DataFrame with three columns: X ranges from 1 to 50, Y alternates between 1 and 2, and Z is a sequence of the even numbers from 2 to 10. We then group by Y and Z, which gives us a GroupedDataFrame with 10 groups, for the 10 unique combinations of values in Y and Z. The second command created a new DataFrame, called counts which contains the number of rows in each of the 10 SubDataFrames in our GroupedDataFrame (gdf) as well as the mean2 value of the X column in each of the 10 groups. Lastly, we repeat the second example, but with select instead of combine. This returns a DataFrame with the same number (and order) of rows as in the original source DataFrame (not the grouped one!), but with only the newly created column. In this case, we see 50 rows, not the expected 10.
A More Complete Example
For this example, we use a dataset from Kaggle3. The data file includes information on the ingredients of 399 types of food.
df = CSV.read(raw"D:\JuliaCode\KaggleData\food_ingredients_and_allergens.csv", DataFrame; normalizenames=true)
select!(df, [:Food_Product, :Allergens])
# 399×2 DataFrame
# Row │ Food_Product Allergens
# │ String String31
# ─────┼─────────────────────────────────────────────────
# 1 │ Almond Cookies Almonds, Wheat, Dairy
# 2 │ Almond Cookies Almonds, Wheat, Dairy
# 3 │ Chicken Noodle Soup Chicken, Wheat, Celery
# 4 │ Chicken Noodle Soup Chicken, Wheat, Celery
# 5 │ Cheddar Cheese Dairy
# 6 │ Ranch Dressing Dairy
# 7 │ Caramel Popcorn Dairy
# 8 │ Caesar Salad Dairy
# ⋮ │ ⋮ ⋮
# 393 │ Chocolate Chip Pancakes Wheat, Dairy, Cocoa
# 394 │ Lemon Bars Wheat, Dairy, Eggs
# 395 │ Lemon Bars Wheat, Dairy, Eggs
# 396 │ Pecan Pie Wheat, Dairy, Nuts
# 397 │ Zucchini Bread Wheat, Dairy, Nuts
# 398 │ Banana Bread Wheat, Dairy, Nuts
# 399 │ Hawaiian Pizza Wheat, Pork, Dairy
# 384 rows omittedWe are going to focus on the allergens in these foods. We select to keep only the two column containing the names of the foods and the allergens contained in them. Unfortunately, the allergens are given as a single string of comma-separated values. This means we have to do a little processing of the data.
We ideally want a separate column for each allergen that contains a true (1) or false (0) to indicate if the allergen is present or not. To get this, we must process all the entries on allergens and split the strings into the individual allergens. We then need to identify the unique values, and add a true or false in a new column for each food and each allergen. This sounds like a lot of work, but Julia makes this easy.
allAllergens = String[] #empty array of strings
for row in eachrow(df)
allergens = row.Allergens
for s in eachsplit(allergens, ", ")
push!(allAllergens, s)
end
end
unique!(allAllergens)
# 24-element Vector{String}:
# "Almonds"
# "Wheat"
# "Dairy"
# "Chicken"
# "Celery"
# "Anchovies"
# "Cocoa"
# ⋮
# "Oats"
# "Peanuts"
# "Pine nuts"
# "Rice"
# "Shellfish"
# "Pork"
# "Alcohol"We create an empty vector of Strings to hold the allergen names - allAllergens. Then we iterate trough the DataFrame, one row at a time, using the eachrow iterator. We extract the string holding the allergens into a temporary string allergens with allergens = row.Allergens (where row is the current row in the iterator eachrow).
To split the string (allergens) into the various allergens, we use the Julia function eachsplit, which splits the string with the specified delimiter (“,”), and returns an iterator of substrings. We loop through this iterator and push each substring into the empty vector we created at the start.
Lastly, we use the Julia function unique! to remove all duplicates from this vector. We now have a list of 24 possible allergens.
for allergen in allAllergens
transform!(df, :Allergens => ByRow(s -> contains(string(s), allergen)) => Symbol(allergen))
end
df[:, [:Food_Product, :Almonds, :Wheat, :Dairy]]
# 399×4 DataFrame
# Row │ Food_Product Almonds Wheat Dairy
# │ String Bool Bool Bool
# ─────┼────────────────────────────────────────────────
# 1 │ Almond Cookies true true true
# 2 │ Almond Cookies true true true
# 3 │ Chicken Noodle Soup false true false
# 4 │ Chicken Noodle Soup false true false
# 5 │ Cheddar Cheese false false true
# 6 │ Ranch Dressing false false true
# ⋮ │ ⋮ ⋮ ⋮ ⋮
# 394 │ Lemon Bars false true true
# 395 │ Lemon Bars false true true
# 396 │ Pecan Pie false true true
# 397 │ Zucchini Bread false true true
# 398 │ Banana Bread false true true
# 399 │ Hawaiian Pizza false true true
# 387 rows omittedWe now run through our newly created list of allergens and call the DataFrames function transform! for each allergen. transform! is the mutating version of transform and will modify the specified DataFrame rather than return a new object. transform includes all of the columns in the original DataFrame, unlike select, which will include only the specified columns.
We specify here than we want to take the contents of the Allergens column, and pass it, one row at a time (ByRow()) to an anonymous function (s -> contains(string(s), allergen))) and store the result in a new column Symbol(allergen). Note that we are converting the name of the allergen, which is a String to a Symbol, which is what DataFrames expects here.
The anonymous function we define essentially only calls the Julia function contains, which will return a true or false depending on whether the string contains a specified substring. We convert the value from the DataFrame to a string (with the string() function), as DataFrames uses a more specialised data type for string values to save memory and improve performance. We have a quick look at three of the allergens and see that we indeed have exactly what we wanted.
We can now analyse the data any way we want to. Let’s group the data to see combinations of almonds and wheat in the various foods:
gdf = groupby(df, [:Almonds, :Wheat])
for (keys, sdf) in pairs(gdf)
println("Number of foods with $keys: $(nrow(sdf))")
end
# Number of foods with GroupKey: (Almonds = true, Wheat = true): 2
# Number of foods with GroupKey: (Almonds = false, Wheat = true): 94
# Number of foods with GroupKey: (Almonds = false, Wheat = false): 303Joining DataFrames
DataFrames allows for joining of tables, just like in a SQL database. We shall give a few key definitions here to illustrate the concept.
Firstly, what is a join? It is a combination of two tables, based on the values in a specified column, typically an ID or similar index value. The resulting table may include columns from one or both of the tables, for the rows that are included in the joined table, depending on the type of join.
We define two DataFrames to use in our examples:
df1 = DataFrame(ID = [1, 2, 3], Floor = ["A", "B", "C"], Office = [11, 12, 12])
# 3×3 DataFrame
# Row │ ID Floor Office
# │ Int64 String Int64
# ─────┼───────────────────────
# 1 │ 1 A 11
# 2 │ 2 B 12
# 3 │ 3 C 12
df2 = DataFrame(ID = [1, 2, 7, 11], Name = ["Tom", "Dan", "Harry", "Bob"])
# 4×2 DataFrame
# Row │ ID Name
# │ Int64 String
# ─────┼───────────────
# 1 │ 1 Tom
# 2 │ 2 Dan
# 3 │ 7 Harry
# 4 │ 11 BobInner join: An inner join between two tables will include only the rows where the field (e.g. ID) on which you join, are the same in each table. Rows that include a value of ID that is only present in one of the two tables will not be included. All columns from both tables are included.
innerjoin(df1, df2, on=:ID)
# 2×4 DataFrame
# Row │ ID Floor Office Name
# │ Int64 String Int64 String
# ─────┼───────────────────────────────
# 1 │ 1 A 11 Tom
# 2 │ 2 B 12 DanLeft join: A left join between two tables will include all the rows from the left table, as well as those rows from right table where the joining field (e.g. ID) match. All columns are included, which may result in missing values in columns from the right table.
leftjoin(df1, df2, on=:ID)
# 3×4 DataFrame
# Row │ ID Floor Office Name
# │ Int64 String Int64 String?
# ─────┼────────────────────────────────
# 1 │ 1 A 11 Tom
# 2 │ 2 B 12 Dan
# 3 │ 3 C 12 missing Right join: Similar to a left join, but all rows from the right table are included and only matching rows from the left table. This may result in missing values in columns from the left table.
rightjoin(df1, df2, on=:ID)
# 4×4 DataFrame
# Row │ ID Floor Office Name
# │ Int64 String? Int64? String
# ─────┼─────────────────────────────────
# 1 │ 1 A 11 Tom
# 2 │ 2 B 12 Dan
# 3 │ 7 missing missing Harry
# 4 │ 11 missing missing BobOuter join: This will include all rows from both left and right tables, possibly resulting in missing values in all columns.
# outerjoin(df1, df2, on=:ID)
# 5×4 DataFrame
# Row │ ID Floor Office Name
# │ Int64 String? Int64? String?
# ─────┼──────────────────────────────────
# 1 │ 1 A 11 Tom
# 2 │ 2 B 12 Dan
# 3 │ 3 C 12 missing
# 4 │ 7 missing missing Harry
# 5 │ 11 missing missing BobSemi join (a.k.a. Left semi join): Like an inner join, this will only return rows that match in both columns, but will only return columns from the left table. This is useful if you want to select, for example, names from a list that are also present in another list, without including additional data from that list. To do a right semi join, just reverse the order to the two DataFrames.
semijoin(df1, df2, on=:ID)
# 2×3 DataFrame
# Row │ ID Floor Office
# │ Int64 String Int64
# ─────┼───────────────────────
# 1 │ 1 A 11
# 2 │ 2 B 12Anti join (a.k.a. Left anti join): Like the semi join, but will only return rows from the left table that are NOT matched in the right table.
antijoin(df1, df2, on=:ID)
# 1×3 DataFrame
# Row │ ID Floor Office
# │ Int64 String Int64
# ─────┼───────────────────────
# 1 │ 3 C 12Footnotes
Unless, of course, you used strings with spaces for column names.↩︎
Note that to use the
meanfunction, you need to first include theStatisticsstandard library↩︎Kaggle is a website for data science enthusiasts. It has many data sets and run competitions to generate predictions based on these sets.↩︎