DataFrames

What is a DataFrame?

In short, a DataFrame is a table of data, where each column has a title and a type. Operations can be performed on a per column or per row basis, and columns and rows can be added or removed easily. The DataFrames package in Julia is the de facto standard for working with tabular data and is integrated in many other packages. In the previous section we have seen that you can read a data file directly into a DataFrame via the CSV package, but this is only one example of other packages integrating with DataFrames.

The DataFrames package provides a lot of functionality for manipulating the data contain in a DataFrame. There are also additional packages that provide either more options, or alternative (simpler) syntax for convenience, such as DataFramesMeta and Tidier.

We shall look at the basic use of DataFrames and leave these auxiliary packages and some more advanced functionality of DataFrames for self-study.

Creating a DataFrame

Usually, you would create a DataFrame by reading in data from a file. You can however also create a DataFrame from variables directly:

df = DataFrame(X = [1, 2, 3], Y = [2, 3, 4], Z = 2.0)
# 3×3 DataFrame
#  Row │ X      Y      Z
#      │ Int64  Int64  Float64
# ─────┼───────────────────────
#    1 │     1      2      2.0
#    2 │     2      3      2.0
#    3 │     3      4      2.0

Note that columns X and Y are specified as arrays of integers, and so the columns are typed as Int64. Z is specified as a floating point value and hence the column is types as Float64. The length of Z is also expanded to be the same as the other columns, with a fixed value. The column names, X, Y and Z, are all legal Julia symbols (variable names) and so we can use the assignment operator (=) here.

You can also give columns longer names, with spaces, but then instead you need to either use the pair operator, =>, or define a Dict and then convert that to a DataFrame:

df = DataFrame("X values" => [1, 2, 3], "Y values" => [2, 3, 4], "Z" => 2.0)
# 3×3 DataFrame
#  Row │ X values  Y values  Z
#      │ Int64     Int64     Float64
# ─────┼─────────────────────────────
#    1 │        1         2      2.0
#    2 │        2         3      2.0
#    3 │        3         4      2.0

d = Dict("X values" => [1, 2, 3], "Y values" => [2, 3, 4], "Z" => 2.0)
# Dict{String, Any} with 3 entries:
#   "Z"        => 2.0
#   "X values" => [1, 2, 3]
#   "Y values" => [2, 3, 4]

df = DataFrame(d)
# 3×3 DataFrame
#  Row │ X values  Y values  Z
#      │ Int64     Int64     Float64
# ─────┼─────────────────────────────
#    1 │        1         2      2.0
#    2 │        2         3      2.0
#    3 │        3         4      2.0
Caution

Don’t use strings for column names just because you can. Using a symbol, like :X_values, instead of the string “X values” will give faster results and is just as readable. This does mean you need to use underscores instead of spaces, but all-in-all that is not too high a price to pay.

A DataFrame can also be constructed one column at a time:

df = DataFrame()
# 0×0 DataFrame

df.A = 1:3
# 1:3

df.B = 2:2:6
# 2:2:6

df
# 3×2 DataFrame
#  Row │ A      B
#      │ Int64  Int64
# ─────┼──────────────
#    1 │     1      2
#    2 │     2      4
#    3 │     3      6

Or row by row:

df = DataFrame(A=String[], B=Int64[])
# 0×2 DataFrame
#  Row │ A       B
#      │ String  Int64
# ─────┴───────────────

push!(df, ("Bob", 1))
# 1×2 DataFrame
#  Row │ A       B
#      │ String  Int64
# ─────┼───────────────
#    1 │ Bob         1

push!(df, ("Tom", 5))
# 2×2 DataFrame
#  Row │ A       B
#      │ String  Int64
# ─────┼───────────────
#    1 │ Bob         1
#    2 │ Tom         5

Note how we forced types of the columns in the creation of the empty DataFrame. This is needed to give the columns names, and tell the empty DataFrame how many columns it will hold. You could specify the types as Any, but that would not lead to good performance.

Lastly, you can also convert a matrix to a DataFrame. This requires you to either allow DataFrames to automatically generate column names, or to pass a vector of names:

m = rand(5, 5)
# 5×5 Matrix{Float64}:
#  0.460589   0.734461   0.771352   0.292767   0.0078547
#  0.254958   0.153814   0.571355   0.196476   0.739701
#  0.0799407  0.457583   0.553493   0.25409    0.245967
#  0.106429   0.435509   0.0184347  0.932008   0.805972
#  0.0877892  0.0231817  0.435262   0.0187286  0.812151

df = DataFrame(m, :auto) # Automatically generate names
# 5×5 DataFrame
#  Row │ x1         x2         x3         x4         x5
#      │ Float64    Float64    Float64    Float64    Float64
# ─────┼───────────────────────────────────────────────────────
#    1 │ 0.460589   0.734461   0.771352   0.292767   0.0078547
#    2 │ 0.254958   0.153814   0.571355   0.196476   0.739701
#    3 │ 0.0799407  0.457583   0.553493   0.25409    0.245967
#    4 │ 0.106429   0.435509   0.0184347  0.932008   0.805972
#    5 │ 0.0877892  0.0231817  0.435262   0.0187286  0.812151

df = DataFrame(m, [:X, :Y, :Z, :W, :Q]) # Specify names
# 5×5 DataFrame
#  Row │ X          Y          Z          W          Q
#      │ Float64    Float64    Float64    Float64    Float64
# ─────┼───────────────────────────────────────────────────────
#    1 │ 0.460589   0.734461   0.771352   0.292767   0.0078547
#    2 │ 0.254958   0.153814   0.571355   0.196476   0.739701
#    3 │ 0.0799407  0.457583   0.553493   0.25409    0.245967
#    4 │ 0.106429   0.435509   0.0184347  0.932008   0.805972
#    5 │ 0.0877892  0.0231817  0.435262   0.0187286  0.812151

Accessing the Data

Once the data is in a DataFrame, you can access column like they are individual vectors1:

df.X
# 5-element Vector{Float64}:
#  0.460589206535297
#  0.25495814887990786
#  0.07994066869519556
#  0.10642869978604408
#  0.0877892134158087

You can also add columns, even via calculations:

df.New = (df.X).^2 .- df.Y
# 5-element Vector{Float64}:
#  -0.5223183397440386
#  -0.08881079559662731
#  -0.45119243718262425
#  -0.42418203290738643
#  -0.015474773743873368

df
# 5×6 DataFrame
#  Row │ X          Y          Z          W          Q          New
#      │ Float64    Float64    Float64    Float64    Float64    Float64
# ─────┼───────────────────────────────────────────────────────────────────
#    1 │ 0.460589   0.734461   0.771352   0.292767   0.0078547  -0.522318
#    2 │ 0.254958   0.153814   0.571355   0.196476   0.739701   -0.0888108
#    3 │ 0.0799407  0.457583   0.553493   0.25409    0.245967   -0.451192
#    4 │ 0.106429   0.435509   0.0184347  0.932008   0.805972   -0.424182
#    5 │ 0.0877892  0.0231817  0.435262   0.0187286  0.812151   -0.0154748

Note the use of broadcasting in the first command - using .^ instead of ^ and .- instead of -. This tells Julia to execute these operations in element-wise fashion. The new column, df.New, is added to the DataFrame automatically.

You can also access individual entries by specifying the row and column, the latter using either the names or indexes:

df[2, 3] # Row 2, Column 3 (Z)
# 0.5713552793835407

df[2:3, 3:5]
# 2×3 DataFrame
#  Row │ Z         W         Q
#      │ Float64   Float64   Float64
# ─────┼──────────────────────────────
#    1 │ 0.571355  0.196476  0.739701
#    2 │ 0.553493  0.25409   0.245967

df[1:3, :W]
# 3-element Vector{Float64}:
#  0.29276715395306163
#  0.19647571959450094
#  0.25409034700483624

To select an entire row or column, the : operator is used, just like in Arrays. This returns a copy of the slice of the DataFrame. When using a !, however, a view to the data is returned. Modifying the view will modify the contents of the DataFrame as well:

df = DataFrame(X = [1, 2, 3], Y = [2, 4, 6], Z = [3, 6, 9])
# 3×3 DataFrame
#  Row │ X      Y      Z
#      │ Int64  Int64  Int64
# ─────┼─────────────────────
#    1 │     1      2      3
#    2 │     2      4      6
#    3 │     3      6      9

xvals = df[:, :X]
# 3-element Vector{Int64}:
#  1
#  2
#  3

xvals[1] = 100
# 100

df
# 3×3 DataFrame
#  Row │ X      Y      Z
#      │ Int64  Int64  Int64
# ─────┼─────────────────────
#    1 │     1      2      3
#    2 │     2      4      6
#    3 │     3      6      9

xview = df[!, :X]
# 3-element Vector{Int64}:
#  1
#  2
#  3

xview[1] = 100
# 100

df
# 3×3 DataFrame
#  Row │ X      Y      Z
#      │ Int64  Int64  Int64
# ─────┼─────────────────────
#    1 │   100      2      3
#    2 │     2      4      6
#    3 │     3      6      9

You can also select just specific rows and columns:

df[[1, 3], [1, 3]]
# 2×2 DataFrame
#  Row │ X      Z     
#      │ Int64  Int64 
# ─────┼──────────────
#    1 │     1      3
#    2 │     3      9

Data Wrangling with DataFrames

Data wrangling is the pre-processing of your data to get it into the form you need for the modelling or analysis you intend to do with it. This can include many steps. DataFrames has a rich feature set to support you in this. We’ll look at several examples here, but if you are going to be doing this type of work often, save yourself a lot of page-turning and download, print and laminate a copy of the Data Wrangling with DataFrames.jl Cheat Sheet.

We start off by creating a dataset to work with:

df = DataFrame(X = 1:3:1500, Y = repeat(1:100, outer=5), Z = repeat(1:100, inner=5)
# 500×3 DataFrame
#  Row │ X      Y      Z     
#      │ Int64  Int64  Int64 
# ─────┼─────────────────────
#    1 │     1      1      1 
#    2 │     4      2      1 
#    3 │     7      3      1 
#    4 │    10      4      1 
#    5 │    13      5      1 
#    6 │    16      6      2 
#    7 │    19      7      2 
#    8 │    22      8      2 
#    9 │    25      9      2 
#   ⋮  │   ⋮      ⋮      ⋮   
#  492 │  1474     92     99 
#  493 │  1477     93     99
#  494 │  1480     94     99
#  495 │  1483     95     99
#  496 │  1486     96    100
#  497 │  1489     97    100
#  498 │  1492     98    100
#  499 │  1495     99    100
#  500 │  1498    100    100
#            482 rows omitted

first(df,3)
# 3×3 DataFrame
#  Row │ X      Y      Z     
#      │ Int64  Int64  Int64 
# ─────┼─────────────────────
#    1 │     1      1      1
#    2 │     4      2      1
#    3 │     7      3      1

last(df,4)
# 4×3 DataFrame
#  Row │ X      Y      Z     
#      │ Int64  Int64  Int64 
# ─────┼─────────────────────
#    1 │  1489     97    100
#    2 │  1492     98    100
#    3 │  1495     99    100
#    4 │  1498    100    100

names(df)
# 3-element Vector{String}:
#  "X"
#  "Y"
#  "Z"

describe(df)
# 3×7 DataFrame
#  Row │ variable  mean     min    median   max    nmissing  eltype   
#      │ Symbol    Float64  Int64  Float64  Int64  Int64     DataType 
# ─────┼──────────────────────────────────────────────────────────────
#    1 │ X           749.5      1    749.5   1498         0  Int64
#    2 │ Y            50.5      1     50.5    100         0  Int64
#    3 │ Z            50.5      1     50.5    100         0  Int64

In generating this data, we used Julia’s repeat function. For the Y column, we specify outer=5, which means the data (1:100) is repeated 5 times as a set, i.e. 1 to 100, then 1 to 100 again, for a total of 5 times. For the Z column, we specify inner=5, which means each of the values from 1 to 100 is repeated 5 times: 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,… etc.

We can quickly see the first or last few rows using the Julia functions first and last, with the second parameter showing how many entries to display. The number of rows displayed by Julia for the full DataFrame depends on the size of the window it is being displayed in. The middle omitted rows are indicated with the characters. If the DataFrame is too wide, columns will also be omitted, with a message indicating how many columns are hidden.

If you really want to see all the rows or columns, you can call show and specify allrows=true and/or allcols=true

show(df, allrows=true, allcols=true)

For obvious reasons, we won’t include the 500 rows of data here.

The names() function returns a vector of the column names as strings. You can also get summary statistics of the data with the describe() function.

If you only want specific statistics, you can specify them in the call:

describe(df, :mean, :std)
# 3×3 DataFrame
#  Row │ variable  mean     std     
#      │ Symbol    Float64  Float64 
# ─────┼────────────────────────────
#    1 │ X           749.5  433.445
#    2 │ Y            50.5   28.895
#    3 │ Z            50.5   28.895

You can also apply a custom function to describe the data. Here we use the Julia function extrema(), which returns the minimum and maximum value in an array. We’ll specify that we want the results in a column call minmax:

describe(df, extrema => :minmax)
# 3×2 DataFrame
#  Row │ variable  minmax    
#      │ Symbol    Tuple…    
# ─────┼─────────────────────
#    1 │ X         (1, 1498)
#    2 │ Y         (1, 100)
#    3 │ Z         (1, 100)

Lastly, the number of rows and column can be easily retrieved:

nrow(df)
# 500

ncol(df)
# 6

Selecting and Transforming Data

Selecting Columns

In the introductory section, we selected specific rows and columns based on their indices. You can also select columns based on their names:

select(df, [:X, :Z])
# 500×2 DataFrame     
#  Row │ X      Z     
#      │ Int64  Int64 
# ─────┼──────────────
#    1 │     1      1
#    2 │     4      1
#    3 │     7      1
#   ⋮  │   ⋮      ⋮
#  498 │  1492    100
#  499 │  1495    100
#  500 │  1498    100
#     494 rows omitted

There are also selector function to allow for selection of ranges of columns:

  • Not(): Select all the columns expect the specified one(s)
  • Between(): Select all column starting with the first specified column and ending with the second. You can use names or indices to specify the columns
  • Cols(): Use this to combine other selector functions
  • All(): Select all columns - same as specifying :

Examples:

In the previous example we selected columns X and Z, which is the same as selecting everything expect Y:

select(df, Not(:Y))
# 500×2 DataFrame
#  Row │ X      Z     
#      │ Int64  Int64 
# ─────┼──────────────
#    1 │     1      1
#    2 │     4      1
#    3 │     7      1
#   ⋮  │   ⋮      ⋮
#  498 │  1492    100
#  499 │  1495    100
#  500 │  1498    100
#     494 rows omitted

We only passed a single symbol to Not, but could also pass an array of symbols to exclude multiple columns.

To make things more interesting, we can first add a few more columns:

df.A = 2 .* df.X
df.B = df.Y .+ df.Z
df.C = df.Z .^ 2
df
# 500×6 DataFrame
#  Row │ X      Y      Z      A      B      C     
#      │ Int64  Int64  Int64  Int64  Int64  Int64 
# ─────┼──────────────────────────────────────────
#    1 │     1      1      1      2      2      1
#    2 │     4      2      1      8      3      1
#    3 │     7      3      1     14      4      1
#   ⋮  │   ⋮      ⋮      ⋮      ⋮      ⋮      ⋮
#  498 │  1492     98    100   2984    198  10000
#  499 │  1495     99    100   2990    199  10000
#  500 │  1498    100    100   2996    200  10000
#                                 494 rows omitted
select(df, Cols(:A, Between(:X, :Z)))
# 500×4 DataFrame
#  Row │ A      X      Y      Z     
#      │ Int64  Int64  Int64  Int64 
# ─────┼────────────────────────────
#    1 │     2      1      1      1
#    2 │     8      4      2      1
#    3 │    14      7      3      1
#   ⋮  │   ⋮      ⋮      ⋮      ⋮
#  498 │  2984   1492     98    100
#  499 │  2990   1495     99    100
#  500 │  2996   1498    100    100
#                   494 rows omitted

Here we used Cols to combine selecting column A and all columns from X to Z. This also changed the order of the columns.

We used the select function to do the selections, but could as easily have used the normal array-type notation:

df[:, Cols(:A, Between(:X, :Z))]
# 500×4 DataFrame
#  Row │ A      X      Y      Z     
#      │ Int64  Int64  Int64  Int64 
# ─────┼────────────────────────────
#    1 │     2      1      1      1
#    2 │     8      4      2      1
#    3 │    14      7      3      1
#   ⋮  │   ⋮      ⋮      ⋮      ⋮
#  498 │  2984   1492     98    100
#  499 │  2990   1495     99    100
#  500 │  2996   1498    100    100
#                   494 rows omitted

To permanently change the DataFrame, either assign the returned result to the same variable, or use the mutating version of select(): select!().

df = select(df, Cols(:A, Between(:X, :Z)))
df
# 500×4 DataFrame
#  Row │ A      X      Y      Z     
#      │ Int64  Int64  Int64  Int64 
# ─────┼────────────────────────────
#    1 │     2      1      1      1
#    2 │     8      4      2      1
#    3 │    14      7      3      1
#   ⋮  │   ⋮      ⋮      ⋮      ⋮
#  498 │  2984   1492     98    100
#  499 │  2990   1495     99    100
#  500 │  2996   1498    100    100
#                   494 rows omitted

or

select!(df, Cols(:A, Between(:X, :Z)))
# 500×4 DataFrame
#  Row │ A      X      Y      Z     
#      │ Int64  Int64  Int64  Int64 
# ─────┼────────────────────────────
#    1 │     2      1      1      1
#    2 │     8      4      2      1 Da
#    3 │    14      7      3      1
#   ⋮  │   ⋮      ⋮      ⋮      ⋮
#  498 │  2984   1492     98    100
#  499 │  2990   1495     99    100
#  500 │  2996   1498    100    100
#                   494 rows omitted

It is also possible to select columns from a DataFrame that match to a regular expression. This is a template that describes a string of text. For more information, see one the many tutorials on the web.

df = DataFrame(
    Names = ["Tom", "Dick", "Harry"], 
    AddressLine1 = ["2 Maple Drive", "4 Oak Street", "6 Pine Road"],
    AddressLine2 = ["Hopetown", "Smallville", "Metropolis"],
    AddressZip = [1234, 2345, 3456]
)
3×4 DataFrame
#  Row │ Names   AddressLine1   AddressLine2  AddressZip 
#      │ String  String         String        Int64      
# ─────┼─────────────────────────────────────────────────
#    1 │ Tom     2 Maple Drive  Hopetown            1234
#    2 │ Dick    4 Oak Street   Smallville          2345
#    3 │ Harry   6 Pine Road    Metropolis          3456


df[:, r"Addr"]
# 3×3 DataFrame
#  Row │ AddressLine1   AddressLine2  AddressZip 
#      │ String         String        Int64      
# ─────┼─────────────────────────────────────────
#    1 │ 2 Maple Drive  Hopetown            1234
#    2 │ 4 Oak Street   Smallville          2345
#    3 │ 6 Pine Road    Metropolis          34

Here we select all rows, but only the column that have the sub-string “Addr” in their names.

Selecting Rows (Filtering) and Sorting

You can select rows by filtering the data. There are two equivalent ways:

filter(:X => <=(10), df)
# 4×6 DataFrame
#  Row │ X      Y      Z      A      B      C     
#      │ Int64  Int64  Int64  Int64  Int64  Int64 
# ─────┼──────────────────────────────────────────
#    1 │     1      1      1      2      2      1
#    2 │     4      2      1      8      3      1
#    3 │     7      3      1     14      4      1
#    4 │    10      4      1     20      5      1

df[df.X .<= 10, :]
# 4×6 DataFrame
#  Row │ X      Y      Z      A      B      C     
#      │ Int64  Int64  Int64  Int64  Int64  Int64 
# ─────┼──────────────────────────────────────────
#    1 │     1      1      1      2      2      1
#    2 │     4      2      1      8      3      1
#    3 │     7      3      1     14      4      1
#    4 │    10      4      1     20      5      1

In both cases we pass a predicate function, i.e. a function that returns true for rows that should be included. In the first case, a pair operator links a column name to the predicate function and in the second case we simply pass an expression that returns true for all the correct instances. Note the use of broadcasting in the second example.

Note

The contents of the column being checked in the predicate function is passed as the first parameter, so :X => <=(10) is in fact df[<current row>, X] <= 10, which can also be written as <=(df[<current row>, X], 10), since operators in Julia are just functions with special syntax - they can still be called like normal functions.

You can use any predicate function. A new useful ones include:

df[findfirst(==(4), df.C), :] # Find the first row where column C has a value of 4
# DataFrameRow
#  Row │ X      Y      Z      A      B      C     
#      │ Int64  Int64  Int64  Int64  Int64  Int64 
# ─────┼──────────────────────────────────────────
#    6 │    16      6      2     32      8      4

df[findall(==(4), df.C), :] # Find all rows where column C has a value of 4
# 5×6 DataFrame
#  Row │ X      Y      Z      A      B      C     
#      │ Int64  Int64  Int64  Int64  Int64  Int64 
# ─────┼──────────────────────────────────────────
#    1 │    16      6      2     32      8      4
#    2 │    19      7      2     38      9      4
#    3 │    22      8      2     44     10      4
#    4 │    25      9      2     50     11      4
#    5 │    28     10      2     56     12      4

df[df.C .== 4, :] # Find all rows where column C has a value of 4, alternative form
# 5×6 DataFrame
#  Row │ X      Y      Z      A      B      C     
#      │ Int64  Int64  Int64  Int64  Int64  Int64 
# ─────┼──────────────────────────────────────────
#    1 │    16      6      2     32      8      4
#    2 │    19      7      2     38      9      4
#    3 │    22      8      2     44     10      4
#    4 │    25      9      2     50     11      4
#    5 │    28     10      2     56     12      4

It may also be useful to sort the data according to specified columns:

sort(df, :Y)
# 500×6 DataFrame
#  Row │ X      Y      Z      A      B      C     
#      │ Int64  Int64  Int64  Int64  Int64  Int64 
# ─────┼──────────────────────────────────────────
#    1 │     1      1      1      2      2      1
#    2 │   301      1     21    602     22    441
#    3 │   601      1     41   1202     42   1681
#    4 │   901      1     61   1802     62   3721
#    5 │  1201      1     81   2402     82   6561
#   ⋮  │   ⋮      ⋮      ⋮      ⋮      ⋮      ⋮
#  497 │   598    100     40   1196    140   1600
#  498 │   898    100     60   1796    160   3600
#  499 │  1198    100     80   2396    180   6400
#  500 │  1498    100    100   2996    200  10000
#                                 491 rows omitted

sort(df, [:Y, :B])
# 500×6 DataFrame
#  Row │ X      Y      Z      A      B      C
#      │ Int64  Int64  Int64  Int64  Int64  Int64
# ─────┼──────────────────────────────────────────
#    1 │     1      1      1      2      2      1
#    2 │   301      1     21    602     22    441
#    3 │   601      1     41   1202     42   1681
#    4 │   901      1     61   1802     62   3721
#    5 │  1201      1     81   2402     82   6561
#   ⋮  │   ⋮      ⋮      ⋮      ⋮      ⋮      ⋮
#  497 │   598    100     40   1196    140   1600
#  498 │   898    100     60   1796    160   3600
#  499 │  1198    100     80   2396    180   6400
#  500 │  1498    100    100   2996    200  10000
#                                 491 rows omitted
Applying Function to Columns and Rows

Applying a function to a column or multiple columns has been shown before when we added columns to a DataFrame. Since each column acts like a vector, we simply use normal Julia broadcasting to apply functions to whole columns:

df.BdivC = df.B ./ df.C
# 500-element Vector{Float64}:
#  2.0
#  3.0
#  4.0
#  5.0
#  6.0
#  2.0
#  2.25
#  ⋮
#  0.01979389858177737
#  0.0196
#  0.0197
#  0.0198
#  0.0199
#  0.02

# df
# 500×7 DataFrame
#  Row │ X      Y      Z      A      B      C      BdivC     
#      │ Int64  Int64  Int64  Int64  Int64  Int64  Float64   
# ─────┼─────────────────────────────────────────────────────
#    1 │     1      1      1      2      2      1  2.0
#    2 │     4      2      1      8      3      1  3.0
#    3 │     7      3      1     14      4      1  4.0
#    4 │    10      4      1     20      5      1  5.0
#    5 │    13      5      1     26      6      1  6.0
#   ⋮  │   ⋮      ⋮      ⋮      ⋮      ⋮      ⋮        ⋮
#  497 │  1489     97    100   2978    197  10000  0.0197
#  498 │  1492     98    100   2984    198  10000  0.0198
#  499 │  1495     99    100   2990    199  10000  0.0199
#  500 │  1498    100    100   2996    200  10000  0.02
#                                            491 rows omitted

In order to apply functions across rows, we need to use some of the built-in functions of DataFrames, select and transform, along with ByRow.

The basic syntax is:

  • select(df, <array of columns to select>)
  • select(df, <array of columns to select>, <array of columns to pass to function> => <function> => <name(s) of column(s) to hold result(s)>)
  • select(df, AsTable(:) => <function> => <name(s) of column(s) to hold result>) # pass the entire table to the function

Some examples:


select(df, :X, :X => cumsum => :cumX) # Put the cumulative sum of X in column cumX
# 500×2 DataFrame
#  Row │ X      cumX   
#      │ Int64  Int64  
# ─────┼───────────────
#    1 │     1       1
#    2 │     4       5
#    3 │     7      12
#    4 │    10      22
#    5 │    13      35
#   ⋮  │   ⋮      ⋮
#  497 │  1489  370265
#  498 │  1492  371757
#  499 │  1495  373252
#  500 │  1498  374750
#      491 rows omitted

select(df, [:X, :Y], [:X, :Y] => ByRow((x, y) -> sin(x)*cos(y)) => :sinXcosY) # Put the value of sin(X)*cos*(Y) in column sinXcosY
# 500×3 DataFrame
#  Row │ X      Y      sinXcosY    
#      │ Int64  Int64  Float64     
# ─────┼───────────────────────────
#    1 │     1      1   0.454649
#    2 │     4      2   0.314941
#    3 │     7      3  -0.650412
#    4 │    10      4   0.355596
#    5 │    13      5   0.119185
#   ⋮  │   ⋮      ⋮         ⋮
#  497 │  1489     97   0.106082
#  498 │  1492     98  -0.207859
#  499 │  1495     99  -0.0154374
#  500 │  1498    100   0.443124
#                  491 rows omitted

select(df, AsTable(:) => ByRow(extrema) => [:min, :max]) # Find the smallest and largest value in each row
# 500×2 DataFrame
#  Row │ min    max   
#      │ Int64  Int64 
# ─────┼──────────────
#    1 │     1      2
#    2 │     1      8
#    3 │     1     14
#    4 │     1     20
#    5 │     1     26
#   ⋮  │   ⋮      ⋮
#  497 │    97  10000
#  498 │    98  10000
#  499 │    99  10000
#  500 │   100  10000
#     491 rows omitted

In each case, we used select and it only returned the specified columns. In the last example, we didn’t select any column, but just specified the function to apply. As a result, only the columns holding the results of the function are selected. If we instead call transform, the entire DataFrame is returned, as well as any new columns added in the calculations, e.g.

transform(df, AsTable(:) => ByRow(extrema) => [:min, :max])
# 500×8 DataFrame
#  Row │ X      Y      Z      A      B      C      min    max   
#      │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64 
# ─────┼────────────────────────────────────────────────────────
#    1 │     1      1      1      2      2      1      1      2
#    2 │     4      2      1      8      3      1      1      8
#    3 │     7      3      1     14      4      1      1     14
#    4 │    10      4      1     20      5      1      1     20
#    5 │    13      5      1     26      6      1      1     26
#   ⋮  │   ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮
#  497 │  1489     97    100   2978    197  10000     97  10000
#  498 │  1492     98    100   2984    198  10000     98  10000
#  499 │  1495     99    100   2990    199  10000     99  10000
#  500 │  1498    100    100   2996    200  10000    100  10000
#                                               491 rows omitted

Just like before, select and transform return new DataFrames, while select! and transform! will modify the existing DataFrame.

Grouping Data

DataFrames will group your data by column values through the groupby() function. One or more columns can be specified and you will get a GroupedDataFrame - a container of SubDataFrames for each unique combination of the values in the specified columns.

df = DataFrame(X = 1:50, Y = repeat(1:2, outer=25), Z = repeat(2:2:10, inner=10))
# 50×3 DataFrame
#  Row │ X      Y      Z     
#      │ Int64  Int64  Int64 
# ─────┼─────────────────────
#    1 │     1      1      2 
#    2 │     2      2      2
#    3 │     3      1      2
#    4 │     4      2      2
#    5 │     5      1      2
#    6 │     6      2      2
#   ⋮  │   ⋮      ⋮      ⋮
#   45 │    45      1     10
#   46 │    46      2     10
#   47 │    47      1     10
#   48 │    48      2     10
#   49 │    49      1     10
#   50 │    50      2     10
#             38 rows omitted

gdf = groupby(df, [:Y, :Z]
# GroupedDataFrame with 10 groups based on keys: Y, Z
# First Group (5 rows): Y = 1, Z = 2
#  Row │ X      Y      Z     
#      │ Int64  Int64  Int64 
# ─────┼─────────────────────
#    1 │     1      1      2
#   ⋮  │   ⋮      ⋮      ⋮
#    5 │     9      1      2
#              3 rows omitted
# ⋮
# Last Group (5 rows): Y = 2, Z = 10
#  Row │ X      Y      Z     
#      │ Int64  Int64  Int64 
# ─────┼─────────────────────
#    1 │    42      2     10
#   ⋮  │   ⋮      ⋮      ⋮
#    5 │    50      2     10
#              3 rows omitted

counts = combine(gdf, nrow, :X => mean)
#     10×4 DataFrame
#  Row │ Y      Z      nrow   X_mean
#      │ Int64  Int64  Int64  Float64
# ─────┼──────────────────────────────
#    1 │     1      2      5      5.0
#    2 │     1      4      5     15.0
#    3 │     1      6      5     25.0
#    4 │     1      8      5     35.0
#    5 │     1     10      5     45.0
#    6 │     2      2      5      6.0
#    7 │     2      4      5     16.0
#    8 │     2      6      5     26.0
#    9 │     2      8      5     36.0
#   10 │     2     10      5     46.0

counts = select(gdf, nrow, :X => mean) 
# 50×4 DataFrame
#  Row │ Y      Z      nrow   X_mean  
#      │ Int64  Int64  Int64  Float64 
# ─────┼──────────────────────────────
#    1 │     1      2      5      5.0
#    2 │     2      2      5      6.0
#    3 │     1      2      5      5.0
#    4 │     2      2      5      6.0
#    5 │     1      2      5      5.0
#    6 │     2      2      5      6.0
#   ⋮  │   ⋮      ⋮      ⋮       ⋮
#   45 │     1     10      5     45.0
#   46 │     2     10      5     46.0
#   47 │     1     10      5     45.0
#   48 │     2     10      5     46.0
#   49 │     1     10      5     45.0
#   50 │     2     10      5     46.0
#                      38 rows omitted

Here we create a DataFrame with three columns: X ranges from 1 to 50, Y alternates between 1 and 2, and Z is a sequence of the even numbers from 2 to 10. We then group by Y and Z, which gives us a GroupedDataFrame with 10 groups, for the 10 unique combinations of values in Y and Z. The second command created a new DataFrame, called counts which contains the number of rows in each of the 10 SubDataFrames in our GroupedDataFrame (gdf) as well as the mean2 value of the X column in each of the 10 groups. Lastly, we repeat the second example, but with select instead of combine. This returns a DataFrame with the same number (and order) of rows as in the original source DataFrame (not the grouped one!), but with only the newly created column. In this case, we see 50 rows, not the expected 10.

A More Complete Example

For this example, we use a dataset from Kaggle3. The data file includes information on the ingredients of 399 types of food.

df = CSV.read(raw"D:\JuliaCode\KaggleData\food_ingredients_and_allergens.csv", DataFrame; normalizenames=true)
select!(df, [:Food_Product, :Allergens])
# 399×2 DataFrame
#  Row │ Food_Product             Allergens
#      │ String                   String31
# ─────┼─────────────────────────────────────────────────
#    1 │ Almond Cookies           Almonds, Wheat, Dairy
#    2 │ Almond Cookies           Almonds, Wheat, Dairy
#    3 │ Chicken Noodle Soup      Chicken, Wheat, Celery
#    4 │ Chicken Noodle Soup      Chicken, Wheat, Celery
#    5 │ Cheddar Cheese           Dairy
#    6 │ Ranch Dressing           Dairy
#    7 │ Caramel Popcorn          Dairy
#    8 │ Caesar Salad             Dairy
#   ⋮  │            ⋮                       ⋮
#  393 │ Chocolate Chip Pancakes  Wheat, Dairy, Cocoa
#  394 │ Lemon Bars               Wheat, Dairy, Eggs
#  395 │ Lemon Bars               Wheat, Dairy, Eggs
#  396 │ Pecan Pie                Wheat, Dairy, Nuts
#  397 │ Zucchini Bread           Wheat, Dairy, Nuts
#  398 │ Banana Bread             Wheat, Dairy, Nuts
#  399 │ Hawaiian Pizza           Wheat, Pork, Dairy
#                                        384 rows omitted

We are going to focus on the allergens in these foods. We select to keep only the two column containing the names of the foods and the allergens contained in them. Unfortunately, the allergens are given as a single string of comma-separated values. This means we have to do a little processing of the data.

We ideally want a separate column for each allergen that contains a true (1) or false (0) to indicate if the allergen is present or not. To get this, we must process all the entries on allergens and split the strings into the individual allergens. We then need to identify the unique values, and add a true or false in a new column for each food and each allergen. This sounds like a lot of work, but Julia makes this easy.

allAllergens = String[] #empty array of strings
for row in eachrow(df)
    allergens = row.Allergens
    for s in eachsplit(allergens, ", ")
        push!(allAllergens, s)
    end
end
unique!(allAllergens)

# 24-element Vector{String}:
#  "Almonds"
#  "Wheat"
#  "Dairy"
#  "Chicken"
#  "Celery"
#  "Anchovies"
#  "Cocoa"
#  ⋮
#  "Oats"
#  "Peanuts"
#  "Pine nuts"
#  "Rice"
#  "Shellfish"
#  "Pork"
#  "Alcohol"

We create an empty vector of Strings to hold the allergen names - allAllergens. Then we iterate trough the DataFrame, one row at a time, using the eachrow iterator. We extract the string holding the allergens into a temporary string allergens with allergens = row.Allergens (where row is the current row in the iterator eachrow).

To split the string (allergens) into the various allergens, we use the Julia function eachsplit, which splits the string with the specified delimiter (“,”), and returns an iterator of substrings. We loop through this iterator and push each substring into the empty vector we created at the start.

Lastly, we use the Julia function unique! to remove all duplicates from this vector. We now have a list of 24 possible allergens.

for allergen in allAllergens
    transform!(df, :Allergens => ByRow(s -> contains(string(s), allergen)) => Symbol(allergen))
end

df[:, [:Food_Product, :Almonds, :Wheat, :Dairy]]
# 399×4 DataFrame
#  Row │ Food_Product             Almonds  Wheat  Dairy 
#      │ String                   Bool     Bool   Bool  
# ─────┼────────────────────────────────────────────────
#    1 │ Almond Cookies              true   true   true
#    2 │ Almond Cookies              true   true   true
#    3 │ Chicken Noodle Soup        false   true  false
#    4 │ Chicken Noodle Soup        false   true  false
#    5 │ Cheddar Cheese             false  false   true
#    6 │ Ranch Dressing             false  false   true
#   ⋮  │            ⋮                ⋮       ⋮      ⋮
#  394 │ Lemon Bars                 false   true   true
#  395 │ Lemon Bars                 false   true   true
#  396 │ Pecan Pie                  false   true   true
#  397 │ Zucchini Bread             false   true   true
#  398 │ Banana Bread               false   true   true
#  399 │ Hawaiian Pizza             false   true   true
#                                       387 rows omitted

We now run through our newly created list of allergens and call the DataFrames function transform! for each allergen. transform! is the mutating version of transform and will modify the specified DataFrame rather than return a new object. transform includes all of the columns in the original DataFrame, unlike select, which will include only the specified columns.

We specify here than we want to take the contents of the Allergens column, and pass it, one row at a time (ByRow()) to an anonymous function (s -> contains(string(s), allergen))) and store the result in a new column Symbol(allergen). Note that we are converting the name of the allergen, which is a String to a Symbol, which is what DataFrames expects here.

The anonymous function we define essentially only calls the Julia function contains, which will return a true or false depending on whether the string contains a specified substring. We convert the value from the DataFrame to a string (with the string() function), as DataFrames uses a more specialised data type for string values to save memory and improve performance. We have a quick look at three of the allergens and see that we indeed have exactly what we wanted.

We can now analyse the data any way we want to. Let’s group the data to see combinations of almonds and wheat in the various foods:

gdf = groupby(df, [:Almonds, :Wheat])

for (keys, sdf) in pairs(gdf)
    println("Number of foods with $keys: $(nrow(sdf))")
end

# Number of foods with GroupKey: (Almonds = true, Wheat = true): 2
# Number of foods with GroupKey: (Almonds = false, Wheat = true): 94
# Number of foods with GroupKey: (Almonds = false, Wheat = false): 303

Joining DataFrames

DataFrames allows for joining of tables, just like in a SQL database. We shall give a few key definitions here to illustrate the concept.

Firstly, what is a join? It is a combination of two tables, based on the values in a specified column, typically an ID or similar index value. The resulting table may include columns from one or both of the tables, for the rows that are included in the joined table, depending on the type of join.

We define two DataFrames to use in our examples:

df1 = DataFrame(ID = [1, 2, 3], Floor = ["A", "B", "C"], Office = [11, 12, 12])
# 3×3 DataFrame
#  Row │ ID     Floor   Office 
#      │ Int64  String  Int64  
# ─────┼───────────────────────
#    1 │     1  A           11
#    2 │     2  B           12
#    3 │     3  C           12

df2 = DataFrame(ID = [1, 2, 7, 11], Name = ["Tom", "Dan", "Harry", "Bob"])
# 4×2 DataFrame
#  Row │ ID     Name   
#      │ Int64  String 
# ─────┼───────────────
#    1 │     1  Tom
#    2 │     2  Dan
#    3 │     7  Harry
#    4 │    11  Bob

Inner join: An inner join between two tables will include only the rows where the field (e.g. ID) on which you join, are the same in each table. Rows that include a value of ID that is only present in one of the two tables will not be included. All columns from both tables are included.

innerjoin(df1, df2, on=:ID)
# 2×4 DataFrame
#  Row │ ID     Floor   Office  Name   
#      │ Int64  String  Int64   String 
# ─────┼───────────────────────────────
#    1 │     1  A           11  Tom
#    2 │     2  B           12  Dan

Left join: A left join between two tables will include all the rows from the left table, as well as those rows from right table where the joining field (e.g. ID) match. All columns are included, which may result in missing values in columns from the right table.

leftjoin(df1, df2, on=:ID)
# 3×4 DataFrame
#  Row │ ID     Floor   Office  Name    
#      │ Int64  String  Int64   String? 
# ─────┼────────────────────────────────
#    1 │     1  A           11  Tom
#    2 │     2  B           12  Dan
#    3 │     3  C           12  missing 

Right join: Similar to a left join, but all rows from the right table are included and only matching rows from the left table. This may result in missing values in columns from the left table.

rightjoin(df1, df2, on=:ID)
# 4×4 DataFrame
#  Row │ ID     Floor    Office   Name   
#      │ Int64  String?  Int64?   String 
# ─────┼─────────────────────────────────
#    1 │     1  A             11  Tom
#    2 │     2  B             12  Dan
#    3 │     7  missing  missing  Harry
#    4 │    11  missing  missing  Bob

Outer join: This will include all rows from both left and right tables, possibly resulting in missing values in all columns.

# outerjoin(df1, df2, on=:ID)
# 5×4 DataFrame
#  Row │ ID     Floor    Office   Name    
#      │ Int64  String?  Int64?   String? 
# ─────┼──────────────────────────────────
#    1 │     1  A             11  Tom
#    2 │     2  B             12  Dan
#    3 │     3  C             12  missing 
#    4 │     7  missing  missing  Harry
#    5 │    11  missing  missing  Bob

Semi join (a.k.a. Left semi join): Like an inner join, this will only return rows that match in both columns, but will only return columns from the left table. This is useful if you want to select, for example, names from a list that are also present in another list, without including additional data from that list. To do a right semi join, just reverse the order to the two DataFrames.

semijoin(df1, df2, on=:ID)
# 2×3 DataFrame
#  Row │ ID     Floor   Office 
#      │ Int64  String  Int64  
# ─────┼───────────────────────
#    1 │     1  A           11
#    2 │     2  B           12

Anti join (a.k.a. Left anti join): Like the semi join, but will only return rows from the left table that are NOT matched in the right table.

antijoin(df1, df2, on=:ID)
# 1×3 DataFrame
#  Row │ ID     Floor   Office 
#      │ Int64  String  Int64  
# ─────┼───────────────────────
#    1 │     3  C           12
Back to top

Footnotes

  1. Unless, of course, you used strings with spaces for column names.↩︎

  2. Note that to use the mean function, you need to first include the Statistics standard library↩︎

  3. Kaggle is a website for data science enthusiasts. It has many data sets and run competitions to generate predictions based on these sets.↩︎