1. Welcome to the world of data science

Data science offers many languages and tools for completing any given task. Although you can often use whichever tool you prefer, analysts benefit from working on similar platforms so that they can share code with one another. Learning what professionals in the data science industry use at work can give you a better understanding of what you may be asked to do in the future.

In this project, we will find out which tools and languages professionals use in their day-to-day work. Our data come from the Kaggle Data Science Survey, which includes responses from over 10,000 people who write code to analyze data in their daily work.

In [15]:
# Load necessary packages
library(tidyverse)

# Load the data
responses <- read_csv('datasets/kagglesurvey.csv')

# Print the first 10 rows
head(responses, n = 10)
Parsed with column specification:
cols(
  Respondent = col_double(),
  WorkToolsSelect = col_character(),
  LanguageRecommendationSelect = col_character(),
  EmployerIndustry = col_character(),
  WorkAlgorithmsSelect = col_character()
)
A tibble: 10 x 5
Respondent | WorkToolsSelect | LanguageRecommendationSelect | EmployerIndustry | WorkAlgorithmsSelect
<dbl> | <chr> | <chr> | <chr> | <chr>
1 | Amazon Web services,Oracle Data Mining/ Oracle R Enterprise,Perl | F# | Internet-based | Neural Networks,Random Forests,RNNs
2 | Amazon Machine Learning,Amazon Web services,Cloudera,Hadoop/Hive/Pig,Impala,Java,Mathematica,MATLAB/Octave,Microsoft Excel Data Mining,Microsoft SQL Server Data Mining,NoSQL,Python,R,SAS Base,SAS JMP,SQL,Tableau | Python | Mix of fields | Bayesian Techniques,Decision Trees,Random Forests,Regression/Logistic Regression
3 | C/C++,Jupyter notebooks,MATLAB/Octave,Python,R,TensorFlow | Python | Technology | Bayesian Techniques,CNNs,Ensemble Methods,Neural Networks,Regression/Logistic Regression,SVMs
4 | Jupyter notebooks,Python,SQL,TensorFlow | Python | Academic | Bayesian Techniques,CNNs,Decision Trees,Gradient Boosted Machines,Neural Networks,Random Forests,Regression/Logistic Regression
5 | C/C++,Cloudera,Hadoop/Hive/Pig,Java,NoSQL,R,Unix shell / awk | R | Government | NA
6 | SQL | Python | Non-profit | NA
7 | Jupyter notebooks,NoSQL,Python,R,SQL,Unix shell / awk | Python | Internet-based | CNNs,Decision Trees,Gradient Boosted Machines,Random Forests,Regression/Logistic Regression,SVMs
8 | Python,Spark / MLlib,Tableau,TensorFlow,Other | Python | Mix of fields | Bayesian Techniques,CNNs,HMMs,Neural Networks,Random Forests,Regression/Logistic Regression,SVMs
9 | Jupyter notebooks,MATLAB/Octave,Python,SAS Base,SQL | Python | Financial | Ensemble Methods,Gradient Boosted Machines
10 | C/C++,IBM Cognos,MATLAB/Octave,Microsoft Excel Data Mining,Microsoft R Server (Formerly Revolution Analytics),Microsoft SQL Server Data Mining,Perl,Python,R,SQL,Unix shell / awk | R | Technology | Bayesian Techniques,Regression/Logistic Regression
In [16]:
library("testthat")
library('IRkernel.testthat')

run_tests({
    test_that("Read in data correctly.", {
        expect_is(responses, "tbl_df", 
            info = 'You should use read_csv() (with an underscore) to read "datasets/kagglesurvey.csv" into responses.')
    })
    
    test_that("Read in data correctly.", {
        responses_test <- read_csv('datasets/kagglesurvey.csv')
        expect_equivalent(responses, responses_test, 
            info = 'responses should contain the data in "datasets/kagglesurvey.csv".')
    })
    
})
2/2 tests passed
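
The column specification message printed above lists the types that read_csv() guessed for each column. As a minimal sketch, assuming you would rather state those types up front, the example below writes a small stand-in file and reads it back with an explicit col_types argument. The temporary file, its two rows, and the name toy_responses are made up for illustration and are not part of the survey data.

```r
library(readr)

# Hypothetical stand-in for datasets/kagglesurvey.csv: two made-up rows
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("Respondent,WorkToolsSelect",
    "1,\"Python,R\"",
    "2,SQL"),
  toy_path
)

# Declare the column types instead of letting read_csv() guess them
toy_responses <- read_csv(
  toy_path,
  col_types = cols(
    Respondent = col_double(),
    .default = col_character()
  )
)

toy_responses
```

Declaring the types this way should also suppress the column specification message, since nothing is left to guess.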

2. Using multiple tools

Now that we have loaded in the survey results, we want to focus on the tools and languages that the survey respondents use at work.

To get a better idea of how the data are formatted, we will look at the first respondent's tool use. This survey-taker listed multiple tools, separated by commas. To learn how many people use each tool, we need to separate out the tools used by each individual. There are several ways to do this, but we will use str_split() from stringr to split the tools at each comma. Since that creates a list column inside the data frame, we can then use the tidyr function unnest() to give each list item its own row.

In [17]:
# Print the first respondent's tools and languages
responses[1, 2]

# Add a new column, and unnest the new column
tools <- responses  %>% 
    mutate(work_tools = str_split(WorkToolsSelect, ","))  %>% 
    unnest(work_tools)

# View the first 6 rows of tools
head(tools)
A tibble: 1 x 1
WorkToolsSelect
<chr>
Amazon Web services,Oracle Data Mining/ Oracle R Enterprise,Perl
A tibble: 6 x 6
Respondent | WorkToolsSelect | LanguageRecommendationSelect | EmployerIndustry | WorkAlgorithmsSelect | work_tools
<dbl> | <chr> | <chr> | <chr> | <chr> | <chr>
1 | Amazon Web services,Oracle Data Mining/ Oracle R Enterprise,Perl | F# | Internet-based | Neural Networks,Random Forests,RNNs | Amazon Web services
1 | Amazon Web services,Oracle Data Mining/ Oracle R Enterprise,Perl | F# | Internet-based | Neural Networks,Random Forests,RNNs | Oracle Data Mining/ Oracle R Enterprise
1 | Amazon Web services,Oracle Data Mining/ Oracle R Enterprise,Perl | F# | Internet-based | Neural Networks,Random Forests,RNNs | Perl
2 | Amazon Machine Learning,Amazon Web services,Cloudera,Hadoop/Hive/Pig,Impala,Java,Mathematica,MATLAB/Octave,Microsoft Excel Data Mining,Microsoft SQL Server Data Mining,NoSQL,Python,R,SAS Base,SAS JMP,SQL,Tableau | Python | Mix of fields | Bayesian Techniques,Decision Trees,Random Forests,Regression/Logistic Regression | Amazon Machine Learning
2 | Amazon Machine Learning,Amazon Web services,Cloudera,Hadoop/Hive/Pig,Impala,Java,Mathematica,MATLAB/Octave,Microsoft Excel Data Mining,Microsoft SQL Server Data Mining,NoSQL,Python,R,SAS Base,SAS JMP,SQL,Tableau | Python | Mix of fields | Bayesian Techniques,Decision Trees,Random Forests,Regression/Logistic Regression | Amazon Web services
2 | Amazon Machine Learning,Amazon Web services,Cloudera,Hadoop/Hive/Pig,Impala,Java,Mathematica,MATLAB/Octave,Microsoft Excel Data Mining,Microsoft SQL Server Data Mining,NoSQL,Python,R,SAS Base,SAS JMP,SQL,Tableau | Python | Mix of fields | Bayesian Techniques,Decision Trees,Random Forests,Regression/Logistic Regression | Cloudera
In [18]:
run_tests({
    test_that("Tools and Languages were Split and Unnested", {
        expect_true(nrow(tools) == 47409, 
            info = 'Make sure that you split the tools at the commas and unnested them.')
    })
    
    test_that("Tools and Languages were Unnested", {
        expect_is(tools$work_tools, "character", 
            info = 'The work_tools column should be of class "character". Make sure that you unnested the results of str_split().')
    })
    
})
2/2 tests passed
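
The split-then-unnest step is easier to follow on a tiny table. The sketch below uses three made-up respondents (the tibble toy and its values are illustrative, not survey rows) and shows that each comma-separated tool ends up on its own row, while a respondent with a missing answer keeps a single row whose work_tools value is NA.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Made-up respondents in the same comma-separated format as WorkToolsSelect
toy <- tibble(
  Respondent = c(1, 2, 3),
  WorkToolsSelect = c("Python,R,SQL", "SQL", NA)
)

toy_long <- toy %>%
  # str_split() returns a list: one character vector per respondent
  mutate(work_tools = str_split(WorkToolsSelect, ",")) %>%
  # unnest() expands each list element into its own row
  unnest(cols = work_tools)

toy_long
```

Respondent 1 contributes three rows, respondent 2 one row, and respondent 3 one row with a missing tool, which is worth keeping in mind when counting tools afterwards.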

3. Counting users of each tool

Now that we've split apart all of the tools used by each respondent, we can figure out which tools are the most popular.

In [19]:
# Group the data by work_tools, summarise the counts, and arrange in descending order
tool_count <- tools  %>% 
    group_by(work_tools)  %>% 
    summarise(count = n())  %>% 
    arrange(desc(count))

# Print the first 6 results
head(tool_count)
`summarise()` ungrouping output (override with `.groups` argument)
A tibble: 6 x 2
work_tools | count
<chr> | <int>
Python | 6073
R | 4708
SQL | 4261
Jupyter notebooks | 3206
TensorFlow | 2256
NA | 2198
In [20]:
run_tests({
    test_that("Tools were Grouped and Summarised", {
        expect_true(nrow(tool_count) == 50, 
            info = 'Make sure that you grouped by tools and then summarised the counts.')
    })
    
    test_that("Values were sorted correctly", {
        expect_true(tool_count[1, 2] == 6073, 
            info = 'Do not forget to sort your tool counts from largest to smallest.')
    })
    
})
2/2 tests passed
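
The sixth row of tool_count is NA, which presumably corresponds to respondents who did not list any tools. As a hedged sketch on made-up data (toy_tools is illustrative), the example below drops missing tools before counting and uses count() with sort = TRUE, which is a shorter equivalent of the group_by(), summarise(), and arrange() chain used above.

```r
library(dplyr)

# Made-up unnested tools, including two missing answers
toy_tools <- tibble(
  work_tools = c("Python", "R", "Python", NA, "SQL", "Python", "R", NA)
)

toy_count <- toy_tools %>%
  # Remove respondents with no tool listed
  filter(!is.na(work_tools)) %>%
  # Count rows per tool and sort from most to least common
  count(work_tools, sort = TRUE, name = "count")

toy_count
```

Whether to drop the NA row depends on the question being asked: it does not affect the ranking of real tools, but it would otherwise show up as a bar in any plot built from the counts.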

4. Plotting the most popular tools

Let's see how the most popular tools stack up against the rest.

In [21]:
# Create a bar chart of the work_tools column, most counts on the far right
ggplot(tool_count, aes(x = fct_reorder(work_tools, count), y = count)) + 
    geom_bar(stat = "identity") +
    theme(axis.text.x  = element_text(angle=90, vjust=0.5, hjust= 1))
In [22]:
run_tests({
   test_that("Plot is a bar chart",{
      p <- last_plot()
      q <- p$layers[[1]]
      expect_is(q$geom, "GeomBar", 
                info = "You should plot a bar chart with ggplot().")
    })
})
1/1 tests passed
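
The bar order in the chart comes from fct_reorder(), which rearranges factor levels by a second variable. A small sketch using the three largest counts printed earlier (the vector names are illustrative):

```r
library(forcats)

# Top three tools and their counts, copied from the tool_count output
top_tools <- c("Python", "R", "SQL")
top_counts <- c(6073, 4708, 4261)

# Levels are ordered by increasing count, so the largest bar is drawn last
ascending <- fct_reorder(top_tools, top_counts)
levels(ascending)

# .desc = TRUE reverses the order, putting the largest bar first
descending <- fct_reorder(top_tools, top_counts, .desc = TRUE)
levels(descending)
```

ggplot2 draws discrete axis values in level order, which is why the default ascending order places the highest counts on the far right.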

5. The R vs Python debate

Within the field of data science, there is a lot of debate among professionals about whether R or Python should reign supreme. You can see from our last figure that R and Python are the two most commonly used languages, but it's possible that many respondents use both R and Python. Let's take a look at how many people use R, Python, and both tools.

In [23]:
# Create a new column called language_preference
debate_tools <- responses  %>% 
    mutate(language_preference = case_when(
        str_detect(WorkToolsSelect, "R") & ! str_detect(WorkToolsSelect, "Python") ~ "R",
        str_detect(WorkToolsSelect, "Python") & ! str_detect(WorkToolsSelect, "R") ~ "Python",
        str_detect(WorkToolsSelect, "R") & str_detect(WorkToolsSelect, "Python")   ~ "both",
        TRUE ~ "neither"
    ))


# Print the first 6 rows
head(debate_tools)
A tibble: 6 x 6
Respondent | WorkToolsSelect | LanguageRecommendationSelect | EmployerIndustry | WorkAlgorithmsSelect | language_preference
<dbl> | <chr> | <chr> | <chr> | <chr> | <chr>
1 | Amazon Web services,Oracle Data Mining/ Oracle R Enterprise,Perl | F# | Internet-based | Neural Networks,Random Forests,RNNs | R
2 | Amazon Machine Learning,Amazon Web services,Cloudera,Hadoop/Hive/Pig,Impala,Java,Mathematica,MATLAB/Octave,Microsoft Excel Data Mining,Microsoft SQL Server Data Mining,NoSQL,Python,R,SAS Base,SAS JMP,SQL,Tableau | Python | Mix of fields | Bayesian Techniques,Decision Trees,Random Forests,Regression/Logistic Regression | both
3 | C/C++,Jupyter notebooks,MATLAB/Octave,Python,R,TensorFlow | Python | Technology | Bayesian Techniques,CNNs,Ensemble Methods,Neural Networks,Regression/Logistic Regression,SVMs | both
4 | Jupyter notebooks,Python,SQL,TensorFlow | Python | Academic | Bayesian Techniques,CNNs,Decision Trees,Gradient Boosted Machines,Neural Networks,Random Forests,Regression/Logistic Regression | Python
5 | C/C++,Cloudera,Hadoop/Hive/Pig,Java,NoSQL,R,Unix shell / awk | R | Government | NA | R
6 | SQL | Python | Non-profit | NA | neither
In [24]:
debate_tools_counts <- debate_tools %>% 
    count(language_preference)

run_tests({
    test_that("New column was created", {
        expect_is(debate_tools$language_preference, "character", 
            info = 'The language_preference column should be of class "character". Make sure that you filled this new column correctly.')
    })
    test_that("Language preferences are correct", {
        expect_equal(filter(debate_tools_counts, language_preference == "both")  %>% pull(n), 3660, 
            info = 'There is an incorrect amount of "both". Please check the case_when() statements.')
        expect_equal(filter(debate_tools_counts, language_preference == "neither")  %>% pull(n), 2860, 
            info = 'There is an incorrect amount of "neither". Please check the case_when() statements.')
        expect_equal(filter(debate_tools_counts, language_preference == "Python")  %>% pull(n), 2413, 
            info = 'There is an incorrect amount of "Python". Please check the case_when() statements.')
        expect_equal(filter(debate_tools_counts, language_preference == "R")  %>% pull(n), 1220, 
            info = 'There is an incorrect amount of "R". Please check the case_when() statements.')
        
    })
    
})
2/2 tests passed
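
One caveat about the classification above: str_detect(WorkToolsSelect, "R") matches any capital R in the string, so respondent 1, who lists "Oracle R Enterprise" but not R itself, is labeled "R". The sketch below compares that loose pattern with a stricter one that only matches R or Python as a complete comma-separated entry. The helper classify(), the patterns, and the toy rows (two copied from the output above, the rest made up) are illustrative assumptions, and the counts asserted in the tests above correspond to the loose pattern, so a stricter match would give different numbers on the real data.

```r
library(dplyr)
library(stringr)

# Rows 1 and 2 are copied from the output above; the rest are made up
toy <- tibble(
  WorkToolsSelect = c(
    "Amazon Web services,Oracle Data Mining/ Oracle R Enterprise,Perl",
    "Jupyter notebooks,Python,SQL,TensorFlow",
    "C/C++,Python,R",
    "SQL",
    NA
  )
)

# Hypothetical helper: label each respondent given two regex patterns
classify <- function(tools, r_pattern, python_pattern) {
  uses_r <- str_detect(tools, r_pattern)
  uses_python <- str_detect(tools, python_pattern)
  case_when(
    uses_r & !uses_python ~ "R",
    uses_python & !uses_r ~ "Python",
    uses_r & uses_python ~ "both",
    TRUE ~ "neither"  # also catches missing answers
  )
}

toy_debate <- toy %>%
  mutate(
    # Pattern used in the notebook: any capital R counts
    loose = classify(WorkToolsSelect, "R", "Python"),
    # Stricter: the entry between commas must be exactly R or Python
    strict = classify(WorkToolsSelect, "(^|,)R(,|$)", "(^|,)Python(,|$)")
  )

toy_debate
```

The two columns differ only for the first row, where the loose pattern reports "R" and the strict pattern reports "neither".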

6. Plotting R vs Python users

Now we just need to take a closer look at how many respondents use R, Python, and both!

In [25]:
# Group by language preference, calculate number of responses, and remove "neither"
debate_plot <- debate_tools %>% 
    group_by(language_preference)  %>%
    summarise(count = n())  %>% 
    filter(language_preference != "neither")

# Creating a bar chart
ggplot(debate_plot, aes(x = language_preference, y = count)) + 
  geom_bar(stat = "identity")
`summarise()` ungrouping output (override with `.groups` argument)
In [26]:
run_tests({
   test_that("Plot is a bar chart",{
      p <- last_plot()
      q <- p$layers[[1]]
      expect_is(q$geom, "GeomBar",
               info = "You should plot a bar chart with ggplot().")
    })
})
1/1 tests passed
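
Raw counts can be easier to compare as shares. As a small sketch, the tibble below re-enters the three counts asserted in the tests earlier (3660, 2413, and 1220) and adds each group's proportion of the respondents who use R, Python, or both. The names toy_debate_plot and shares are illustrative.

```r
library(dplyr)

# Counts copied from the test assertions for "both", "Python", and "R"
toy_debate_plot <- tibble(
  language_preference = c("both", "Python", "R"),
  count = c(3660, 2413, 1220)
)

shares <- toy_debate_plot %>%
  # Proportion of each group among the three groups shown in the plot
  mutate(share = count / sum(count))

shares
```

Respondents labeled "neither" are excluded here, just as they are in the plot, so the shares describe only R and Python users.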

7. Language recommendations

It looks like the largest group of professionals program in both Python and R. But what happens when they are asked which language they recommend to new learners? Do R lovers always recommend R?

In [27]:
# Group by, summarise, arrange, mutate, and filter
recommendations <- debate_tools  %>% 
    group_by(language_preference, LanguageRecommendationSelect)  %>%
    summarise(count = n())  %>% 
    arrange(language_preference, desc(count))  %>% 
    mutate(row = row_number()) %>% 
    filter(row <= 4)
`summarise()` regrouping output by 'language_preference' (override with `.groups` argument)
In [28]:
run_tests({
    test_that("Tools have been summarised", {
        expect_true(nrow(recommendations) == 16, 
            info = 'Make sure that you are only keeping the top 4 responses for each language used.')
    })
    
})
1/1 tests passed
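
The top-4 filter works because summarise() drops only the last grouping variable, as the message above notes: the result is still grouped by language_preference, so row_number() restarts at 1 within each group. A minimal sketch on made-up data, keeping the top 2 instead of the top 4 (toy and top2 are illustrative names):

```r
library(dplyr)

# Made-up preferences and recommendations
toy <- tibble(
  language_preference = c("R", "R", "R", "Python", "Python", "Python", "Python"),
  LanguageRecommendationSelect = c("R", "R", "Python", "Python", "Python", "R", "SQL")
)

top2 <- toy %>%
  group_by(language_preference, LanguageRecommendationSelect) %>%
  # Keep the grouping by language_preference explicitly
  summarise(count = n(), .groups = "drop_last") %>%
  arrange(language_preference, desc(count)) %>%
  # row_number() is computed separately within each language_preference
  mutate(row = row_number()) %>%
  filter(row <= 2)

top2
```

If the data were fully ungrouped before mutate(), row_number() would run from 1 to the total number of rows and the filter would keep only the first rows overall rather than the top rows per group.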

8. The most recommended language by the language used

Just one thing left. Let's graphically determine which languages are most recommended based on the language that a person uses.

In [29]:
# Create a faceted bar plot
ggplot(recommendations, aes(x = LanguageRecommendationSelect, y = count)) +
    geom_bar(stat = "identity") + 
    facet_wrap(~language_preference)
In [30]:
run_tests({
   test_that("Plot is a bar chart",{
      p <- last_plot()
      q <- p$layers[[1]]
      expect_is(q$geom, "GeomBar",
               info = "You should plot a bar chart with ggplot().")
    })
})
1/1 tests passed

9. The moral of the story

So we've made it to the end. We've found that Python is the most popular language used among Kaggle data scientists, but R users aren't far behind. And while Python users may highly recommend that new learners learn Python, would R users find the following statement TRUE or FALSE?

In [31]:
# Would R users find this statement TRUE or FALSE?
R_is_number_one <- TRUE
In [32]:
run_tests({
    test_that("The question has been answered", {
        expect_true(R_is_number_one, 
            info = 'Try again! Should R_is_number_one be set to TRUE or FALSE?')
    })
    
})
1/1 tests passed