This is an example to test the new API for the UN’s World Population Propects, described at https://population.un.org/dataportal/about/dataapi. This is not an exhaustive tutorial, just a single example.
My personal preference is to skip the JSON formatting and to download in CSV format. The API uses a vertical bar | as the csv separator character, and csv files all seem to start with a line
sep =|
that will need to be skipped when converting to a tibble or data frame.
library(tidyverse)
base_url = 'https://population.un.org/dataportalapi/api/v1'
We can get a list of “topics” like this
target = paste0(base_url, '/topics/?format=csv')
df_topics = read.csv(target, sep='|', skip=1)
df_topics
## Id Name ShortName SortOrder
## 1 0 Not applicable <NA> 0
## 2 1 Population Pop 1
## 3 2 Fertility Fert 2
## 4 3 Mortality Mort 3
## 5 4 International Migration iMigration 4
## 6 5 Family Planning FP 5
## 7 6 Marital Status MarStat 6
## 8 7 All Components All 7
## 9 8 Child Mortality IGME 8
## 10 9 Maternal Mortality MMEIG 9
Each topic has an id#, a full name, and a short name. We’ll try 2/Fertility/Fert and look at the available variables for that topic.
target = paste0(base_url,'/indicators/?format=csv')
df_indicators = read.csv(target, sep='|', skip=1) %>%
filter(TopicName == 'Fertility')
# what are the column names for the indicators info?
names(df_indicators)
## [1] "Id" "Name"
## [3] "ShortName" "Description"
## [5] "DisplayName" "DimAge"
## [7] "DimSex" "DimVariant"
## [9] "DimCategory" "DefaultAgeId"
## [11] "DefaultSexId" "DefaultVariantId"
## [13] "DefaultCategoryId" "VariableType"
## [15] "ValueType" "UnitScaling"
## [17] "Precision" "IsThousandSeparatorSpace"
## [19] "FormatString" "UnitShortLabel"
## [21] "UnitLongLabel" "NClassesDefault"
## [23] "DownloadFileName" "SourceId"
## [25] "SourceName" "SourceYear"
## [27] "SourceStartYear" "SourceEndYear"
## [29] "SourceCitation" "SourceUrl"
## [31] "TopicId" "TopicName"
## [33] "TopicShortName"
The indicators have lots of detailed information in all of those columns, but the most interesting are the first few columns. Let’s peek at those, plus the starting and ending years for the available data series.
df_indicators %>%
select(Id:ShortName, SourceStartYear, SourceEndYear)
## Id Name ShortName SourceStartYear
## 1 17 Fertility rates by age of mother (5-year) ASFR5 1950
## 2 18 Mean age of childbearing (5-year) MAC5 1950
## 3 19 Total fertility (5-year) TFR5 1950
## 4 55 Crude birth rate CBR 1950
## 5 56 Net Reproduction Rate NRR 1950
## 6 57 Total births by sex TBirths 1950
## 7 58 Sex ratio at birth SRB 1950
## 8 68 Fertility rates by age of mother (1-year) ASFR1 1950
## SourceEndYear
## 1 2100
## 2 2100
## 3 2100
## 4 2100
## 5 2100
## 6 2100
## 7 2100
## 8 2100
Let’s try the single-year age-specific rates: the ID# for that variable is 68 and the short name is ASFR1. Estimates are available from 1950 to 2100.
Before grabbing data for this variable, we’ll need to select a population (or populations) for which we want the ASFRs. To see the available list we can use the API again:
target = paste0(base_url, '/locations?sort=id&format=csv')
df_locations = read.csv(target, sep='|', skip=1)
# how many locations?
nrow(df_locations)
## [1] 277
# print the first 20 observations
df_locations %>%
head(20)
## Id ParentId Name Iso3 Iso2 LocationTypeId
## 1 4 5501 Afghanistan AFG AF 4
## 2 8 925 Albania ALB AL 4
## 3 12 912 Algeria DZA DZ 4
## 4 16 957 American Samoa ASM AS 4
## 5 20 925 Andorra AND AD 4
## 6 24 911 Angola AGO AO 4
## 7 28 915 Antigua and Barbuda ATG AG 4
## 8 31 922 Azerbaijan AZE AZ 4
## 9 32 931 Argentina ARG AR 4
## 10 36 927 Australia AUS AU 4
## 11 40 926 Austria AUT AT 4
## 12 44 915 Bahamas BHS BS 4
## 13 48 922 Bahrain BHR BH 4
## 14 50 5501 Bangladesh BGD BD 4
## 15 51 922 Armenia ARM AM 4
## 16 52 915 Barbados BRB BB 4
## 17 56 926 Belgium BEL BE 4
## 18 60 905 Bermuda BMU BM 4
## 19 64 5501 Bhutan BTN BT 4
## 20 68 931 Bolivia (Plurinational State of) BOL BO 4
## LocationType Longitude Latitude
## 1 Country 67.709953 33.93911
## 2 Country 20.168331 41.15333
## 3 Country 1.659626 28.03389
## 4 Country -170.696182 -14.30602
## 5 Country 1.521801 42.50629
## 6 Country 17.873886 -11.20269
## 7 Country -61.789180 17.07867
## 8 Country 47.576927 40.14310
## 9 Country -63.616673 -38.41610
## 10 Country 133.775131 -25.27440
## 11 Country 14.550072 47.51623
## 12 Country -77.396278 25.03428
## 13 Country 50.557701 26.06670
## 14 Country 90.356331 23.68499
## 15 Country 45.038189 40.06910
## 16 Country -59.556430 13.16705
## 17 Country 4.469936 50.50389
## 18 Country -64.750504 32.30780
## 19 Country 90.433601 27.51416
## 20 Country -63.588654 -16.29015
# find the codes for India and Bangladesh
df_locations %>%
filter( Name %in% c('India','Bangladesh'))
## Id ParentId Name Iso3 Iso2 LocationTypeId LocationType Longitude
## 1 50 5501 Bangladesh BGD BD 4 Country 90.35633
## 2 356 5501 India IND IN 4 Country 78.96288
## Latitude
## 1 23.68499
## 2 20.59368
Suppose we want to compare single-year fertility rates in India and Bangladesh during the 1990s. That is indicator 68 (ASFR1), for location IDs 50 (Bangladesh) and 356 (India).
The URL for that specific query is the base_url +
/data/indicators/68/locations/50,356/start/1990/end/1999/?format=csv
Let’s try it:
my_indicator = 68
my_location = '50,356'
my_startyr = 1990
my_endyr = 1999
target = paste0(base_url,
'/data/indicators/',my_indicator,
'/locations/',my_location,
'/start/',my_startyr,
'/end/',my_endyr,
'/?format=csv')
target
## [1] "https://population.un.org/dataportalapi/api/v1/data/indicators/68/locations/50,356/start/1990/end/1999/?format=csv"
df_data = read.csv(target, sep='|', skip=1)
# look at the data's structure
dim(df_data)
## [1] 1000 32
names(df_data)
## [1] "LocationId" "Location" "Iso3"
## [4] "Iso2" "LocationTypeId" "IndicatorId"
## [7] "Indicator" "IndicatorDisplayName" "SourceId"
## [10] "Source" "Revision" "VariantId"
## [13] "Variant" "VariantShortName" "VariantLabel"
## [16] "TimeId" "TimeLabel" "TimeMid"
## [19] "CategoryId" "Category" "EstimateTypeId"
## [22] "EstimateType" "EstimateMethodId" "EstimateMethod"
## [25] "SexId" "Sex" "AgeId"
## [28] "AgeLabel" "AgeStart" "AgeEnd"
## [31] "AgeMid" "Value"
# peek at a random bit of the dataframe
df_data %>%
sample_n(size=20) %>%
select(Location, starts_with('Age'), starts_with('Time'), Value) %>%
as_tibble()
## # A tibble: 20 × 10
## Location AgeId AgeLabel AgeStart AgeEnd AgeMid TimeId TimeLabel TimeMid
## <chr> <int> <int> <int> <int> <int> <int> <int> <dbl>
## 1 Bangladesh 145 52 52 52 52 48 1997 1998.
## 2 Bangladesh 116 19 19 19 19 45 1994 1994.
## 3 India 97 45 45 45 45 49 1998 1998.
## 4 India 131 36 36 36 36 50 1999 2000.
## 5 Bangladesh 146 53 53 53 53 49 1998 1998.
## 6 India 138 43 43 43 43 47 1996 1996.
## 7 Bangladesh 98 50 50 50 50 47 1996 1996.
## 8 Bangladesh 109 11 11 11 11 44 1993 1994.
## 9 India 140 46 46 46 46 47 1996 1996.
## 10 Bangladesh 95 35 35 35 35 47 1996 1996.
## 11 Bangladesh 140 46 46 46 46 50 1999 2000.
## 12 India 141 47 47 47 47 47 1996 1996.
## 13 India 127 31 31 31 31 42 1991 1992.
## 14 Bangladesh 137 42 42 42 42 49 1998 1998.
## 15 Bangladesh 151 59 59 59 59 48 1997 1998.
## 16 Bangladesh 129 33 33 33 33 49 1998 1998.
## 17 India 119 22 22 22 22 41 1990 1990.
## 18 India 142 48 48 48 48 45 1994 1994.
## 19 Bangladesh 142 48 48 48 48 41 1990 1990.
## 20 India 151 59 59 59 59 47 1996 1996.
## # … with 1 more variable: Value <dbl>
Now that we know what the data structure looks like, let’s plot the 1990, 1995, and 1999 ASFR schedules for the two countries.
# make a new data frame with the data of interest
df = df_data %>%
select(Country=Location, Year=TimeLabel,Age=AgeMid,ASFR=Value) %>%
filter(Year %in% c(1990,1995,1999))
# construct a plot
ggplot(data=df) +
aes(x=Age, y=ASFR, group=Country, color=Country) +
geom_line(lwd=2, alpha=.80) +
scale_y_continuous(breaks=seq(0,300,100),minor_breaks = seq(0,300,50)) +
theme_bw() +
facet_wrap(~Year, ncol=1)