A Step-by-Step Guide to Uncovering National Digital Complexity with GitHub Innovation Graph Data

Overview

In the modern economy, software has become a critical driver of growth and innovation. Yet traditional economic indicators often fail to capture the full extent of a nation's software production, because code crosses borders invisibly through git push, cloud services, and package managers. This blind spot, sometimes called digital dark matter, hides a wealth of productive knowledge. Recent research published in Research Policy by Sándor Juhász, Johannes Wachs, Jermain Kaminski, and César A. Hidalgo demonstrates how the GitHub Innovation Graph can be used to measure the “digital complexity” of nations. This tutorial walks through using GitHub Innovation Graph data to compute a Software Economic Complexity Index (Software ECI), which the study reports predicts GDP, inequality, and emissions in ways traditional metrics miss.

Source: github.blog

Prerequisites

Before you begin, ensure you have the following:

Python 3.8+ installed with the pandas, numpy, matplotlib, and statsmodels libraries (plus requests if you download the data programmatically).
A GitHub Personal Access Token (if accessing the Innovation Graph API; otherwise, use the public datasets available as CSV/Parquet).
Basic understanding of Economic Complexity Index (ECI) methodology (see Step 3 for a refresher).
Familiarity with country-level GDP and inequality data (e.g., from World Bank or IMF) for validation.

Step-by-Step Instructions

Step 1: Understand the GitHub Innovation Graph Data

The GitHub Innovation Graph tracks the number of developers in each economy who push code in each programming language, based on IP addresses. It provides a quarterly breakdown of active developers per language per country. For this tutorial, we will use the Q4 2025 data release (as used in the original study). The data is available as a CSV file with columns: country_code, language, developer_count, and quarter.

import pandas as pd
# Load the dataset
df = pd.read_csv('github_innovation_graph_q4_2025.csv')
df.head()

This dataframe forms the raw material for computing digital complexity.
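
Before going further, it can help to check that the file actually has the layout described above. The sketch below is one way to do that, not part of the original study: the helper name load_innovation_graph is hypothetical, and the column names are taken from this tutorial's description, so adjust them if the release you downloaded uses different headers. It also passes keep_default_na=False, because pandas otherwise reads the country code "NA" (Namibia) as a missing value.

```python
import io
import pandas as pd

# Column names as described in this tutorial (an assumption about the file layout).
REQUIRED_COLUMNS = {"country_code", "language", "developer_count", "quarter"}

def load_innovation_graph(source):
    """Read the CSV and fail early if an expected column is missing."""
    # keep_default_na=False keeps the literal string "NA" instead of turning it into NaN.
    df = pd.read_csv(source, keep_default_na=False)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Blank or malformed counts become NaN and are dropped.
    df["developer_count"] = pd.to_numeric(df["developer_count"], errors="coerce")
    return df.dropna(subset=["developer_count"])

# Toy input (made-up numbers) with a Namibia row and one blank count.
sample = io.StringIO(
    "country_code,language,developer_count,quarter\n"
    "NA,Python,120,4\n"
    "DE,Python,5000,4\n"
    "DE,Rust,,4\n"
)
toy_df = load_innovation_graph(sample)
```

In the toy input, the Namibia row survives and the row with the blank count is dropped.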

Step 2: Construct the Country-Language Matrix

We need to create a matrix where rows represent countries and columns represent programming languages, and each cell contains the Revealed Comparative Advantage (RCA) of that country in that language. RCA measures whether a country specializes in a language relative to the global average. The formula:

RCA(c,l) = (dev_c,l / dev_c) / (dev_l / dev_total)

Where dev_c,l is developers in country c using language l, dev_c is total developers in country c, dev_l is total developers using language l globally, and dev_total is total developers globally.

# Pivot to country-language matrix
matrix = df.pivot_table(index='country_code', columns='language', values='developer_count', fill_value=0)

# Compute RCA
total_per_country = matrix.sum(axis=1)
total_per_language = matrix.sum(axis=0)
global_total = total_per_language.sum()

rca = matrix.div(total_per_country, axis=0).div(total_per_language / global_total, axis=1)

# Binarize: set RCA >= 1 to 1, else 0
rca_binary = (rca >= 1).astype(int)
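
To make the RCA formula concrete, here is a small worked example on a made-up two-country, two-language table. The counts and the helper name compute_rca are illustrative only. Each country has 100 developers; Python has 150 of the 200 developers globally and Rust has 50.

```python
import pandas as pd

def compute_rca(matrix):
    """Return the RCA table for a country-by-language count matrix."""
    country_share = matrix.div(matrix.sum(axis=1), axis=0)   # dev_c,l / dev_c
    global_share = matrix.sum(axis=0) / matrix.values.sum()  # dev_l / dev_total
    return country_share.div(global_share, axis=1)

# Made-up counts: rows are countries, columns are languages.
toy = pd.DataFrame({"Python": [90, 60], "Rust": [10, 40]}, index=["A", "B"])
toy_rca = compute_rca(toy)

# Same binarization rule as in the tutorial: RCA >= 1 becomes 1.
toy_binary = (toy_rca >= 1).astype(int)
print(toy_rca.round(2))
print(toy_binary)
```

Country A gets RCA(A, Python) = 0.9 / 0.75 = 1.2 and RCA(A, Rust) = 0.1 / 0.25 = 0.4, so it counts as specialized in Python only. Country B has more Python developers than Rust developers in absolute terms, yet its RCA is 0.8 for Python and 1.6 for Rust, so it counts as specialized in Rust.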

Step 3: Apply the Economic Complexity Index (ECI)

The ECI comes from the method of reflections, which iteratively averages the diversity of countries and the ubiquity of the languages they specialize in. Diversity k_c is the number of languages in which country c has RCA >= 1, and ubiquity k_p is the number of countries with RCA >= 1 in language p. The iteration converges to an eigenvector problem, so in practice the index is computed directly:

M~(c,c') = (1 / k_c) * sum_p ( M(c,p) * M(c',p) / k_p )
ECI_c = (K_c - mean(K)) / std(K)

Where M is the binarized RCA matrix and K is the eigenvector of M~ associated with its second-largest eigenvalue. Note that simply subtracting standardized ubiquity from standardized diversity does not give the ECI: the two quantities are indexed differently (countries versus languages).

import numpy as np

# Diversity and ubiquity (useful for inspection)
k_c = rca_binary.sum(axis=1)  # diversity per country
k_p = rca_binary.sum(axis=0)  # ubiquity per language

def compute_eci(rca_binary):
    # Drop countries and languages with no RCA >= 1 entries to avoid division by zero
    M_df = rca_binary.loc[rca_binary.sum(axis=1) > 0, rca_binary.sum(axis=0) > 0]
    M = M_df.values.astype(float)
    div = M.sum(axis=1)   # k_c
    ubiq = M.sum(axis=0)  # k_p
    # M~ = D_c^-1 M D_p^-1 M^T
    M_tilde = (M / div[:, None]) @ (M / ubiq).T
    eigvals, eigvecs = np.linalg.eig(M_tilde)
    # Eigenvalues are real in theory; take real parts to discard numerical noise
    idx = np.argsort(eigvals.real)[-2]  # second largest
    K = eigvecs[:, idx].real
    # The eigenvector sign is arbitrary; a common convention aligns it with diversity
    if np.corrcoef(K, div)[0, 1] < 0:
        K = -K
    eci = (K - K.mean()) / K.std()
    return pd.Series(eci, index=M_df.index, name='software_eci')

eci_scores = compute_eci(rca_binary)

# Create a DataFrame with columns 'country' and 'software_eci'
eci_df = eci_scores.rename_axis('country').reset_index()
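
The method of reflections mentioned in this step can be sketched on a toy matrix to show what each iteration means. This is an illustrative sketch, not the study's code: the function name method_of_reflections and the 3x3 matrix are made up. At iteration 0 the values are plain diversity and ubiquity; at iteration 1 each country's value is the average ubiquity of its languages, and each language's value is the average diversity of the countries specialized in it.

```python
import numpy as np

def method_of_reflections(M, n_iter):
    """Return (k_c, k_p) after n_iter reflections of a binary country-language matrix."""
    M = np.asarray(M, dtype=float)
    k_c0 = M.sum(axis=1)  # diversity
    k_p0 = M.sum(axis=0)  # ubiquity
    k_c, k_p = k_c0.copy(), k_p0.copy()
    for _ in range(n_iter):
        # Both updates use the values from the previous iteration.
        k_c, k_p = (M @ k_p) / k_c0, (M.T @ k_c) / k_p0
    return k_c, k_p

# Made-up nested matrix: country 0 has all three languages, country 2 only the most common one.
M_toy = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
])

k_c1, k_p1 = method_of_reflections(M_toy, 1)
k_c2, k_p2 = method_of_reflections(M_toy, 2)
print(k_c1)  # average ubiquity of each country's languages
print(k_c2)
```

Here k_c1 is [2.0, 2.5, 3.0]: the least diversified country only has the most ubiquitous language. After two reflections k_c2 is [2.5, 2.25, 2.0], which ranks the most diversified country highest.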

Step 4: Validate Against Macroeconomic Indicators

To see if Software ECI predicts GDP, inequality, or emissions, merge your results with external datasets. For example, use World Bank GDP per capita (PPP).

# Load GDP data
gdp = pd.read_csv('gdp_ppp_2025.csv')
merged = eci_df.merge(gdp, on='country')

# Linear regression (OLS via statsmodels)
import statsmodels.api as sm

X = sm.add_constant(merged['software_eci'])
y = merged['gdp_per_capita']
model = sm.OLS(y, X).fit()
print(model.summary())

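A positive and statistically significant coefficient indicates that Software ECI is associated with higher GDP per capita. On its own, this bivariate regression does not show that digital complexity captures information beyond traditional measures; for that, add controls such as a trade-based ECI or population and check whether the coefficient remains significant.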
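
An inner merge silently drops any country whose key does not match in both tables, for example when one file uses two-letter codes and the other uses a different scheme. The sketch below is one way to see what was lost before running the regression. The helper name merge_with_report and the toy values are hypothetical; it assumes both tables share a key column named country, as in the merge above.

```python
import pandas as pd

def merge_with_report(eci_df, gdp, key="country"):
    """Merge on `key` and report which keys appear in only one of the two tables."""
    merged = eci_df.merge(gdp, on=key, how="outer", indicator=True)
    only_eci = merged.loc[merged["_merge"] == "left_only", key].tolist()
    only_gdp = merged.loc[merged["_merge"] == "right_only", key].tolist()
    matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
    return matched.reset_index(drop=True), only_eci, only_gdp

# Made-up values for illustration only.
eci_toy = pd.DataFrame({"country": ["DE", "US", "XK"], "software_eci": [1.0, 0.5, -1.2]})
gdp_toy = pd.DataFrame({"country": ["DE", "US", "FR"], "gdp_per_capita": [60000, 75000, 55000]})

matched, only_eci, only_gdp = merge_with_report(eci_toy, gdp_toy)
print(len(matched), only_eci, only_gdp)
```

If either unmatched list is long, fix the country keys before interpreting the regression output.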

Common Mistakes

Ignoring IP address biases: The GitHub Innovation Graph uses IP addresses to assign developers to countries. Developers using VPNs may be misattributed. Consider cross-referencing with survey data.
Using raw developer counts instead of RCA: RCA normalization is critical to account for country size and language popularity.
Overlooking data sparsity: Many countries have few developers. Filter out countries with fewer than, say, 100 total developers to reduce noise.
Misapplying ECI ordering: The eigenvalue method yields eigenvectors whose sign may be flipped. Check the correlation with expected complexity (e.g., high GDP countries should have positive ECI).
Forgetting to update data: The Innovation Graph is released quarterly. Use the latest release for current analysis, but maintain consistency if replicating the study.
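
Two of the pitfalls above, sparse countries and flipped eigenvector signs, can be handled with small helpers. The sketch below is illustrative: the function names, the threshold of 100, and the toy numbers are assumptions, not values from the study. The sign check uses any reference series you expect to correlate positively with complexity, such as diversity or GDP per capita.

```python
import numpy as np
import pandas as pd

def drop_small_countries(matrix, min_developers=100):
    """Keep only countries whose total developer count reaches the threshold."""
    return matrix.loc[matrix.sum(axis=1) >= min_developers]

def orient_sign(scores, reference):
    """Flip the scores if they correlate negatively with the reference series."""
    scores = np.asarray(scores, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if np.corrcoef(scores, reference)[0, 1] < 0:
        return -scores
    return scores

# Made-up counts: country B has only 60 developers in total.
counts = pd.DataFrame(
    {"Python": [500, 40, 90], "Rust": [300, 20, 10]},
    index=["A", "B", "C"],
)
filtered = drop_small_countries(counts)

# Raw scores that run opposite to the reference get flipped.
oriented = orient_sign([-1.0, 0.0, 1.0], [30.0, 20.0, 10.0])
print(filtered.index.tolist(), oriented)
```

Apply the filter before computing RCA, since removing countries changes the global language shares.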

Summary

By following this guide, you have learned how to transform GitHub Innovation Graph data into a measure of national digital complexity. The Software ECI reveals the productive knowledge embedded in a country's software production, knowledge that traditional trade data misses. According to the study, the measure helps predict GDP, income inequality, and emissions in ways traditional metrics miss. The methodology, pioneered by Juhász, Wachs, Kaminski, and Hidalgo, opens new avenues for research on the economic impact of open-source software and developer collaboration.

Start exploring today at GitHub Innovation Graph and contribute to understanding the digital economy.
