Úvod do Pandas¶
Data¶
Data naleznete v databázi Eurostat.
Vyberte Detailed datasets -> Population and social conditions -> Demography, population stock and balance (demo) -> Population (regional level) (demo_pop) -> Population on 1 January by age group, sex and NUTS 3 region (demo_r_pjangrp3).
V prohlížeči dat stáhněte plný dataset ve formátu CSV (Full dataset [MIGR_POP1CTZ] -> SDMX-CSV 1.0).
Načtěte data do pandas¶
[1]:
filename = 'demo_r_pjangrp3_linear.csv.gz'
# pozn. k 'filename':
# a) data musi byt ve stejnem adresari, ve kterem se nachazi vas notebook
# b) nebo specifikujte absolutni cestu
import pandas as pd
df = pd.read_csv(filename, compression='gzip')
Nahraná data zobrazíme.
[2]:
df
[2]:
DATAFLOW | LAST UPDATE | freq | sex | unit | age | geo | TIME_PERIOD | OBS_VALUE | OBS_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|
0 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | F | NR | TOTAL | AL | 2014 | 1430827 | NaN |
1 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | F | NR | TOTAL | AL | 2015 | 1424597 | NaN |
2 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | F | NR | TOTAL | AL | 2016 | 1417141 | NaN |
3 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | F | NR | TOTAL | AL | 2017 | 1423050 | NaN |
4 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | F | NR | TOTAL | AL | 2018 | 1431715 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1125890 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | Y_LT5 | UKN16 | 2015 | 7764 | NaN |
1125891 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | Y_LT5 | UKN16 | 2016 | 7720 | NaN |
1125892 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | Y_LT5 | UKN16 | 2017 | 7712 | NaN |
1125893 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | Y_LT5 | UKN16 | 2018 | 7659 | NaN |
1125894 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | Y_LT5 | UKN16 | 2019 | 7654 | NaN |
1125895 rows × 10 columns
Jednotlivé hodnoty¶
[3]:
# pohlavi
df["sex"].unique()
[3]:
array(['F', 'M', 'T'], dtype=object)
[4]:
# datace informace
df["TIME_PERIOD"].unique()
[4]:
array([2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022])
Seřaďme hodnoty.
[5]:
import numpy as np
np.sort(df["TIME_PERIOD"].unique())
[5]:
array([2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022])
Filtrování dat¶
Filtrování podle jednoho sloupce.
[6]:
df[df["geo"] == "CZ"]
[6]:
DATAFLOW | LAST UPDATE | freq | sex | unit | age | geo | TIME_PERIOD | OBS_VALUE | OBS_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|
1865 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | F | NR | TOTAL | CZ | 2014 | 5350039 | NaN |
1866 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | F | NR | TOTAL | CZ | 2015 | 5361348 | NaN |
1867 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | F | NR | TOTAL | CZ | 2016 | 5367513 | NaN |
1868 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | F | NR | TOTAL | CZ | 2017 | 5378133 | NaN |
1869 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | F | NR | TOTAL | CZ | 2018 | 5390264 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1110293 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | Y_LT5 | CZ | 2018 | 560763 | NaN |
1110294 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | Y_LT5 | CZ | 2019 | 567172 | NaN |
1110295 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | Y_LT5 | CZ | 2020 | 568823 | NaN |
1110296 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | Y_LT5 | CZ | 2021 | 558225 | b |
1110297 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | Y_LT5 | CZ | 2022 | 558098 | NaN |
579 rows × 10 columns
Filtrování podle více sloupců.
[7]:
df[(df["geo"] == "CZ") & (df["TIME_PERIOD"] == 2022) & (df["sex"] == 'T') & (df["age"] == 'TOTAL')]
[7]:
DATAFLOW | LAST UPDATE | freq | sex | unit | age | geo | TIME_PERIOD | OBS_VALUE | OBS_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|
753989 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | CZ | 2022 | 10516707 | NaN |
Definujme funkci, která bude vracet počet obyvatel pro zvolený NUTS kód a rok.
[8]:
def population_by_country(code, year):
return int(df[(df["geo"] == code) & (df["TIME_PERIOD"] == year) & (df["sex"] == 'T') & (df["age"] == 'TOTAL')]["OBS_VALUE"].values[0])
population_by_country('CZ', 2022)
[8]:
10516707
[9]:
population_by_country('PL', 2022)
[9]:
37654247
Definujme funkci, která bude vracet počet obyvatel podle pohlaví pro zvolený NUTS kód a rok.
[10]:
def population_by_country_sex(code, year):
return df[(df["geo"] == code) & (df["TIME_PERIOD"] == year) & (df["age"] == 'TOTAL')][["sex", "OBS_VALUE"]].to_dict('list')
population_by_country_sex('CZ', 2022)
[10]:
{'sex': ['F', 'M', 'T'], 'OBS_VALUE': [5332932, 5183775, 10516707]}
Definujme funkci, která bude vracet počet obyvatel podle pohlaví a procentuální podíl žen pro zvolený NUTS kód a rok. Hodnota bude vrácena jako slovník.
[11]:
def population_by_country_sex_p(code, year):
res = df[(df["geo"] == code) & (df["TIME_PERIOD"] == year) & (df["age"] == 'TOTAL')][["sex", "OBS_VALUE"]].to_dict('list')
f_idx = res["sex"].index("F")
f = res["OBS_VALUE"][f_idx]
t_idx = res["sex"].index("T")
t = res["OBS_VALUE"][t_idx]
res_dict = {}
i = 0
for k in res["sex"]:
res_dict[k] = int(res["OBS_VALUE"][i])
i += 1
res_dict["F_p"] = round(f/t, 2)
return res_dict
population_by_country_sex_p('CZ', 2022)
[11]:
{'F': 5332932, 'M': 5183775, 'T': 10516707, 'F_p': 0.51}
[12]:
population_by_country_sex_p('PL', 2022)
[12]:
{'F': 19443954, 'M': 18210293, 'T': 37654247, 'F_p': 0.52}
Filtrování dat na úrovních NUTS¶
[13]:
df[(df["TIME_PERIOD"] == 2022) & (df["sex"] == 'T') & (df["age"] == 'TOTAL')]
[13]:
DATAFLOW | LAST UPDATE | freq | sex | unit | age | geo | TIME_PERIOD | OBS_VALUE | OBS_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|
752124 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | AL | 2022 | 2793592 | NaN |
752133 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | AL0 | 2022 | 2793592 | NaN |
752142 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | AL01 | 2022 | 783970 | NaN |
752151 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | AL011 | 2022 | 109585 | NaN |
752160 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | AL012 | 2022 | 291325 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
768174 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | TRC3 | 2022 | 2367645 | NaN |
768183 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | TRC31 | 2022 | 862757 | NaN |
768192 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | TRC32 | 2022 | 626319 | NaN |
768201 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | TRC33 | 2022 | 546589 | NaN |
768210 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | TRC34 | 2022 | 331980 | NaN |
1778 rows × 10 columns
Filtrujme NUTS úrovně 0 (NUTS identifikátor má přesně dva znaky).
[14]:
df[(df["TIME_PERIOD"] == 2022) & (df["sex"] == 'T') & (df["age"] == 'TOTAL') & (df["geo"].apply(lambda x: len(x) == 2))].head(10)
[14]:
DATAFLOW | LAST UPDATE | freq | sex | unit | age | geo | TIME_PERIOD | OBS_VALUE | OBS_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|
752124 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | AL | 2022 | 2793592 | NaN |
752286 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | AT | 2022 | 8978929 | NaN |
752718 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | BE | 2022 | 11617623 | NaN |
753305 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | BG | 2022 | 6838937 | NaN |
753638 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | CH | 2022 | 8738791 | NaN |
753953 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | CY | 2022 | 904705 | NaN |
753989 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | CZ | 2022 | 10516707 | NaN |
754205 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | DE | 2022 | 83237124 | NaN |
758309 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | DK | 2022 | 5873420 | NaN |
758471 | ESTAT:DEMO_R_PJANGRP3(1.0) | 28/09/23 23:00:00 | A | T | NR | TOTAL | EE | 2022 | 1331796 | NaN |
Definujme funkci, která bude vracet zemi a počet obyvatel pro zvolený rok. Volitelné argumenty budou NUTS úroveň, pohlaví a NUTS vzor.
[15]:
def population_by_nuts(year, level=0, sex='T', nuts=None):
data = df[(df["TIME_PERIOD"] == year) & (df["sex"] == sex) & (df["age"] == 'TOTAL') & (df["geo"].apply(lambda x: len(x) == level+2))]
if nuts is not None:
data = data[data["geo"].str.match(nuts)]
return data[["geo", "OBS_VALUE"]]
population_by_nuts(2022)
[15]:
geo | OBS_VALUE | |
---|---|---|
752124 | AL | 2793592 |
752286 | AT | 8978929 |
752718 | BE | 11617623 |
753305 | BG | 6838937 |
753638 | CH | 8738791 |
753953 | CY | 904705 |
753989 | CZ | 10516707 |
754205 | DE | 83237124 |
758309 | DK | 5873420 |
758471 | EE | 1331796 |
758551 | EL | 10459782 |
759181 | ES | 47432893 |
759976 | FI | 5548241 |
760219 | FR | 67871925 |
761521 | HR | 3862305 |
761750 | HU | 9689010 |
762059 | IE | 5060004 |
762176 | IS | 376248 |
762221 | IT | 59030133 |
763448 | LI | 39308 |
763484 | LT | 2805998 |
763610 | LU | 645397 |
763646 | LV | 1875757 |
763727 | ME | 617683 |
763763 | MK | 1837114 |
763883 | MT | 520971 |
763928 | NL | 17590672 |
764441 | NO | 5425270 |
764676 | PL | 37654247 |
765558 | PT | 10352042 |
765882 | RO | 19042455 |
766377 | RS | 6797105 |
766572 | SE | 10452326 |
766869 | SI | 2107180 |
767013 | SK | 5434712 |
767139 | TR | 84680273 |
Podívejme se, jak vypadá rozdělení žen v ČR podle krajů.
[16]:
population_by_nuts(2022, level=2, sex="F", nuts="CZ")
[16]:
geo | OBS_VALUE | |
---|---|---|
1891 | CZ01 | 652285 |
1909 | CZ02 | 701332 |
1927 | CZ03 | 613767 |
1954 | CZ04 | 547501 |
1981 | CZ05 | 755586 |
2017 | CZ06 | 854930 |
2044 | CZ07 | 608263 |
2071 | CZ08 | 599268 |
Grafy¶
Podívejme se na graf znázorňující počet obyvatel pro jednotlivé země.
[17]:
data = population_by_nuts(2022)
data.plot(kind="barh", x="geo", y="OBS_VALUE", xlabel="NUTS0", ylabel="Počet obyvatel",
title=f"Počet obyvatel (2021, total)", figsize=(20, 15))
[17]:
<AxesSubplot: title={'center': 'Počet obyvatel (2021, total)'}, xlabel='NUTS0', ylabel='Počet obyvatel'>
Definujme funkci, která bude generovat graf s počtem obyvatel pro zvolený rok. Volitelné argumenty budou NUTS úroveň, pohlaví a NUTS vzor.
[18]:
def print_graph(year, level=0, sex="T", nuts=None):
data = population_by_nuts(year, level, sex, nuts)
data.plot(kind="barh", x="geo", y="OBS_VALUE", xlabel=f"NUTS{level}", ylabel="Počet obyvatel",
title=f"Počet obyvatel ({year} {sex})", figsize=(20, 15))
print_graph(2022, level=2, nuts="CZ")
Podívejme se, jak vypadá rozdělení žen v ČR podle krajů.
[19]:
print_graph(2022, level=2, nuts="CZ", sex="F")