Data Pitfalls

Multiple sources for one statistic

Consider the following query without source information.

from datenguidepy import Query

q = Query.region('01')
q.add_field('BEVSTD')
result_df = q.results()
result_df.head().iloc[:,:4]

   id  name                year  BEVSTD
0  01  Schleswig-Holstein  1995  2725461
1  01  Schleswig-Holstein  1996  2742293
2  01  Schleswig-Holstein  1997  2756473
3  01  Schleswig-Holstein  1998  2766057
4  01  Schleswig-Holstein  1998  2766057

As the results show, the value for 1998 appears twice. The two rows come from different sources, which is why source information is part of the results by default. When unexpected values turn up, it is a good idea to check the sources for uniqueness.

It is also important to note that, unlike in the example above, different sources may report different values for the same year.
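The uniqueness check can be sketched with plain pandas. The frame below copies the rows from the table above; the `source` column and its values `A` and `B` are hypothetical stand-ins for whatever source columns a real result contains, so the column name needs to be adapted.

```python
import pandas as pd

# Rows copied from the result table above. The "source" column and its
# values are hypothetical placeholders, not real datenguidepy output.
df = pd.DataFrame(
    {
        "id": ["01"] * 5,
        "name": ["Schleswig-Holstein"] * 5,
        "year": [1995, 1996, 1997, 1998, 1998],
        "BEVSTD": [2725461, 2742293, 2756473, 2766057, 2766057],
        "source": ["A", "A", "A", "A", "B"],
    }
)

# All rows whose region/year pair occurs more than once.
repeated = df[df.duplicated(subset=["id", "year"], keep=False)]

# Per region and year: how many distinct sources and distinct values exist.
summary = df.groupby(["id", "year"]).agg(
    n_sources=("source", "nunique"),
    n_values=("BEVSTD", "nunique"),
)

# Region/year pairs where the sources disagree on the value.
conflicts = summary[summary["n_values"] > 1]
```

Here `repeated` holds the two rows for 1998, while `conflicts` stays empty because both sources report the same value.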

Changing region ids

Sometimes region ids change at a point in time without an apparent reason. For example, looking up the id of the small town Binz yields two distinct ids.

from datenguidepy import get_regions

reg = get_regions()
reg[reg.name.str.contains('Binz',case=False)]

region_id  name    level  parent
13061005   Binz    lau    13061
13073011   Binz    lau    13073
08336008   Binzen  lau    08336

If one uses these ids to query data, one id delivers data until 2010 and the other from 2011 onwards. The reason is an administrative change at the county level (nuts3), which is one level higher in the region hierarchy. Looking at the parents of the results above, one finds the following.

reg[(reg.index == '13061') | (reg.index == '13073')]

region_id  name                        level  parent
13061      Landkreis Rügen             nuts3  130
13073      Landkreis Vorpommern-Rügen  nuts3  130

This reflects the administrative change at the county level that happened in 2011. Because region ids reflect the region hierarchy, such a change causes all subregions to get new ids. Therefore Binz appears twice in the list of all regions.
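To get one continuous series for a region whose id changed, the results for the old and the new id can be joined by year. The helper below is only a sketch: `stitch_by_year` is a hypothetical name, not part of datenguidepy, and it assumes each frame has a `year` column like the results shown earlier.

```python
import pandas as pd


def stitch_by_year(frames):
    """Concatenate result frames and order the rows by year.

    Hypothetical helper, not part of datenguidepy. Assumes every frame
    has a "year" column.
    """
    combined = pd.concat(frames, ignore_index=True)
    # A stable sort keeps the original order of rows within the same year.
    return combined.sort_values("year", kind="stable").reset_index(drop=True)
```

The frames would typically be the `results()` of one `Query.region(...)` per id, each with the same field added. Keeping the `id` column in the combined frame makes it easy to see where the switch happens.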

Most of the time such a situation can be resolved with the help of the region name. It can be a little more difficult, however, if the id change coincides with a slight name change.
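Matching by name can be sketched as follows. The frame is rebuilt by hand from the Binz table above so that the example runs without network access; with the real library, the frame returned by `get_regions()` would take its place.

```python
import pandas as pd

# Rows copied from the Binz lookup above, with region_id as the index.
reg = pd.DataFrame(
    {
        "name": ["Binz", "Binz", "Binzen"],
        "level": ["lau", "lau", "lau"],
        "parent": ["13061", "13073", "08336"],
    },
    index=pd.Index(["13061005", "13073011", "08336008"], name="region_id"),
)

# Regions that share their exact name with at least one other id.
same_name = reg[reg.duplicated(subset="name", keep=False)]

# Number of distinct parents per name; more than one hints at a
# change higher up in the region hierarchy.
parents_per_name = same_name.groupby("name")["parent"].nunique()
```

This relies on exact name equality, so it finds both Binz entries but would miss a region whose name changed slightly together with its id.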