Analyzing business ownership demographics from Annual Business Survey

Last month, U.S. Census Bureau released its first ever Annual Business Survey ABS, covering statistics of nonfarm employer businesses in 2017. ABS provides business and business owner characteristics such as firms, receipts, payroll and employment by demographic characteristics such as gender, race, ethnicity and veteran status, at regional level.

Today I’m going to demonstrate how to fetch the ABS data using Census API, and explore the demographics of business ownership across cities, counties, and metropolitan areas in the U.S.

Fetch ABS data using Census API

There are many great packages and tutorials that walk you through the process of using Census API, so I won’t go into the details. My favorite one is tidycensus, but unfortunately it’s available for American Community Survey and Decennial Survey only, so I’m going to use censusapi today, which offers a more flexible query structure.

Let’s first take a look at the ABS API endpoints. Navigate to its variables page, it lists all the variables you can query from this API.

Here are the variables I decide to include:

library(tidyverse)
abs_var <- c("GEO_ID", "NAICS2017", "NAICS2017_LABEL", "SEX", "SEX_LABEL", "ETH_GROUP", "ETH_GROUP_LABEL", "RACE_GROUP", "RACE_GROUP_LABEL", "VET_GROUP", "VET_GROUP_LABEL", "FIRMPDEMP", "RCPPDEMP", "EMP", "PAYANN")

Now I’m going to test my query with national data first.

censusapi::getCensus(
  name = "abscs",
  vintage = 2017,
  vars = abs_var,
  region = "us:*",
  key = Sys.getenv("CENSUS_API_KEY")
) %>% head()
## Warning in responseFormat(raw): NAs introduced by coercion

## Warning in responseFormat(raw): NAs introduced by coercion
##   us    GEO_ID NAICS2017 NAICS2017_LABEL SEX           SEX_LABEL ETH_GROUP
## 1  1 0100000US         0              NA 003                Male       001
## 2  1 0100000US         0              NA 004 Equally male/female       001
## 3  1 0100000US         0              NA 001               Total       020
## 4  1 0100000US         0              NA 002              Female       020
## 5  1 0100000US         0              NA 003                Male       020
## 6  1 0100000US         0              NA 001               Total       001
##   ETH_GROUP_LABEL RACE_GROUP RACE_GROUP_LABEL VET_GROUP VET_GROUP_LABEL
## 1           Total         00            Total       001           Total
## 2           Total         00            Total       001           Total
## 3        Hispanic         00            Total       001           Total
## 4        Hispanic         00            Total       001           Total
## 5        Hispanic         00            Total       001           Total
## 6           Total         00            Total       001           Total
##   FIRMPDEMP    RCPPDEMP       EMP     PAYANN
## 1   3480438 10018649629  44854631 1986477555
## 2    859735  1180988058   8030678  272302530
## 3    322076   422573589   2872550   90985526
## 4     78091    74506263    664947   18735668
## 5    206437   309414268   1895189   64179111
## 6   5744643 36579575617 127738274 6534271084

Okay, the query returned a data frame, but I noticed that for some reasons NAICS2017 and NAICS2017_LABEL are not displayed properly, and the warning message seems to suggest that this is the result of these two columns being coerced to numerical variables. I spent some time investigating the censusapi::getCensus()’s source code and fixed the bug, which you can see my pull request here.

Let’s run this again with the modified getCensus() function:

getCensus(
  name = "abscs",
  vintage = 2017,
  vars = abs_var,
  region = "us:*",
  key = Sys.getenv("CENSUS_API_KEY")
) %>% head()
##   us    GEO_ID NAICS2017       NAICS2017_LABEL SEX           SEX_LABEL
## 1  1 0100000US        00 Total for all sectors 003                Male
## 2  1 0100000US        00 Total for all sectors 004 Equally male/female
## 3  1 0100000US        00 Total for all sectors 001               Total
## 4  1 0100000US        00 Total for all sectors 002              Female
## 5  1 0100000US        00 Total for all sectors 003                Male
## 6  1 0100000US        00 Total for all sectors 001               Total
##   ETH_GROUP ETH_GROUP_LABEL RACE_GROUP RACE_GROUP_LABEL VET_GROUP
## 1       001           Total         00            Total       001
## 2       001           Total         00            Total       001
## 3       020        Hispanic         00            Total       001
## 4       020        Hispanic         00            Total       001
## 5       020        Hispanic         00            Total       001
## 6       001           Total         00            Total       001
##   VET_GROUP_LABEL FIRMPDEMP    RCPPDEMP       EMP     PAYANN
## 1           Total   3480438 10018649629  44854631 1986477555
## 2           Total    859735  1180988058   8030678  272302530
## 3           Total    322076   422573589   2872550   90985526
## 4           Total     78091    74506263    664947   18735668
## 5           Total    206437   309414268   1895189   64179111
## 6           Total   5744643 36579575617 127738274 6534271084

Looking great! Now the NAICS2017 and NAICS2017_LABEL are displaying correctly.

Now let’s get the data for every metro areas and counties.

cbsa_abs_naics <- getCensus(
  name = "abscs",
  vintage = 2017,
  vars = abs_var,
  region = "metropolitan statistical area/micropolitan statistical area:*",
  key = Sys.getenv("CENSUS_API_KEY")
) %>% 
  mutate(GEOID = str_sub(GEO_ID, -5,-1)) 

stco_abs_naics <- getCensus(
  name = "abscs",
  vintage = 2017,
  vars = abs_var,
  region = "county:*",
  key = Sys.getenv("CENSUS_API_KEY")
) %>% 
  mutate(GEOID = str_sub(GEO_ID, -5,-1))

Exploratory Data Analysis

How does business ownership demographics look like in Detroit and Birmingham?

First, let’s specify the GEOIDs of the place of interests. If you are not familiar with census geographies, MCDC has a very friendly GEOID lookup tool.

target_cbsa <- c("19820","13820")   # cbsa codes for Detroit-Warren-Dearborn, MI and Birmingham-Hoover, AL

Now we can find everything about these two metro areas by filtering the raw data. Use count() to get the codebook of NAICS2017, SEX, RACE_GROUP, ETH_GROUP, VET_GROUP. For example:

cbsa_abs_naics %>% 
  count(SEX, SEX_LABEL)
##   SEX           SEX_LABEL      n
## 1 001               Total 119619
## 2 002              Female  15558
## 3 003                Male  18796
## 4 004 Equally male/female  16123
## 5 096        Classifiable   8325
## 6 098      Unclassifiable   7937

Let’s take a look at female ownership. So we will get data for female (002), and all businesses that could be classified by gender (096).

cbsa_result <- cbsa_abs_naics %>% 
  filter(GEOID %in% target_cbsa) %>% 
  filter(SEX %in% c("002","096")) %>% 
  filter(NAICS2017 == "72") %>% 
  select(GEOID, contains("_LABEL"), FIRMPDEMP, EMP, RCPPDEMP, PAYANN) 

cbsa_result
##   GEOID                 NAICS2017_LABEL    SEX_LABEL ETH_GROUP_LABEL
## 1 13820 Accommodation and food services       Female           Total
## 2 13820 Accommodation and food services Classifiable    Classifiable
## 3 19820 Accommodation and food services       Female           Total
## 4 19820 Accommodation and food services Classifiable    Classifiable
##   RACE_GROUP_LABEL VET_GROUP_LABEL FIRMPDEMP    EMP RCPPDEMP  PAYANN
## 1            Total           Total       256   3404   184036   57926
## 2     Classifiable    Classifiable      1479  29483  1736740  482008
## 3            Total           Total      1195  20522  1548070  382517
## 4     Classifiable    Classifiable      6387 129103  7308642 2047629

I’m going to reshape the data to calculate number of firms (FIRMPDEMP), employment (EMP), receipts (RCPPDEMP), and payroll (PAYANN) of female-owed businesses in these two metro areas.

cbsa_result %>% 
  pivot_longer(FIRMPDEMP:PAYANN) %>%
  mutate(value = as.numeric(value)) %>% 
  select(GEOID, NAICS2017_LABEL, SEX_LABEL, name, value) %>% 
  pivot_wider(names_from = "SEX_LABEL", values_from = "value") %>% 
  mutate(pct_female = Female/Classifiable) %>% 
  
  ggplot(aes(x = name, y = pct_female, fill = name, label = scales::percent(pct_female, accuracy = 0.1)))+
  geom_col()+
  geom_text(vjust = -0.5)+
  facet_wrap(~GEOID)+
  scale_y_continuous(labels = scales::percent, "share of female owned businesses")+
  theme_classic()

We can see that in the accommodation sector, female business ownership is similar in Birmingham and Detroit (17.3% and 18.7%), but female-owned firms are much smaller in size in Birmingham. They only account for 11.5% of total employment, 12% of payroll, and 10.6% of total sales and receipts.

Scale it up

Let’s write a function to easily adapt this script to explore explore other demographic groups in other regions.

summarise_abs <- function(df, target, col, col_var, NAICS = "00"){
  col <- enquo(col)
  col_LABEL <- paste0((rlang::as_label(col)),"_LABEL")
    
df %>% 
  filter(GEOID %in% target) %>% 
  filter(!!col %in% col_var) %>% 
  filter(NAICS2017 == NAICS) %>%
    
  select(GEOID, contains("_LABEL"), FIRMPDEMP, EMP, RCPPDEMP, PAYANN) %>% 
  pivot_longer(FIRMPDEMP:PAYANN) %>%
  mutate(value = as.numeric(value)) %>% 
  select(GEOID, NAICS2017_LABEL, col_LABEL, name, value) %>% 
  pivot_wider(names_from = col_LABEL, values_from = value) 
}

Now, let’s look at Asian business ownership in Arlington County, VA (51013)

stco_abs_naics %>% 
  summarise_abs("51013",RACE_GROUP, c("60","96"))%>% 
  mutate(pct_asian = Asian/Classifiable) %>% 
  knitr::kable()
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(col_LABEL)` instead of `col_LABEL` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
GEOID NAICS2017_LABEL name Asian Classifiable pct_asian
51013 Total for all sectors FIRMPDEMP 622 3655 0.1701778
51013 Total for all sectors EMP 6587 50475 0.1305002
51013 Total for all sectors RCPPDEMP 1307860 9785745 0.1336495
51013 Total for all sectors PAYANN 384583 3182245 0.1208527

Or, Hispanic business ownership in professional services sector in Austin-Round Rock, TX metro area (12420)

cbsa_abs_naics %>% 
  summarise_abs("12420",ETH_GROUP, c("020","096"), "54") %>% 
  mutate(pct_hispanic = Hispanic/Classifiable) %>% 
  knitr::kable()
GEOID NAICS2017_LABEL name Hispanic Classifiable pct_hispanic
12420 Professional, scientific, and technical services FIRMPDEMP 430 8278 0.0519449
12420 Professional, scientific, and technical services EMP 2210 51394 0.0430011
12420 Professional, scientific, and technical services RCPPDEMP 289735 10400407 0.0278580
12420 Professional, scientific, and technical services PAYANN 123854 3865307 0.0320425
Avatar
Sifan Liu 刘思凡
Senior Research Analyst, Metropolitan Policy Program

I uses spatial analysis and data science tools to conduct data-driven policy research. I am a certified RStudio Instructor. My research interests include regional policy on innovation, entrepreneurship ecosystem, and economic development.

Related