A Step-by-Step Guide to Uncovering National Digital Complexity with GitHub Innovation Graph Data


Overview

In the modern economy, software has become a critical driver of growth and innovation. Yet, traditional economic indicators often fail to capture the full extent of a nation's software production because code crosses borders invisibly—through git push, cloud services, and package managers. This blind spot, sometimes called digital dark matter, hides a wealth of productive knowledge. Recent research published in Research Policy by Sándor Juhász, Johannes Wachs, Jermain Kaminski, and César A. Hidalgo demonstrates how the GitHub Innovation Graph can be used to measure the “digital complexity” of nations. This tutorial guides you through the process of using GitHub Innovation Graph data to compute a Software Economic Complexity Index (Software ECI) that predicts GDP, inequality, and emissions in ways traditional metrics miss.

Source: github.blog

Prerequisites

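Before you begin, ensure you have the following: a Python environment with pandas, NumPy, and statsmodels installed (the code in this guide uses all three), the GitHub Innovation Graph language data as a CSV file, and a GDP per capita dataset for the validation step.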

Step-by-Step Instructions

Step 1: Understand the GitHub Innovation Graph Data

The GitHub Innovation Graph tracks the number of developers in each economy who push code in each programming language, based on IP addresses. It provides a quarterly breakdown of active developers per language per country. For this tutorial, we will use the Q4 2025 data release (as used in the original study). The data is available as a CSV file with columns: country_code, language, developer_count, and quarter.

import pandas as pd
# Load the dataset
df = pd.read_csv('github_innovation_graph_q4_2025.csv')
df.head()

This dataframe forms the raw material for computing digital complexity.
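Before building on the file, it may help to confirm that it has the shape described above. The sketch below uses a tiny inline sample in place of the real CSV. The sample rows, the `check_columns` helper, and the `2025Q4` quarter label are assumptions made for illustration, so adjust them to the file you actually download. One detail worth guarding against: pandas treats the string `NA` as a missing value by default, which would silently drop Namibia's two-letter country code.

```python
import io
import pandas as pd

# Tiny made-up sample standing in for the real CSV (values are illustrative only)
sample_csv = io.StringIO(
    "country_code,language,developer_count,quarter\n"
    "AA,Python,120,2025Q4\n"
    "AA,Rust,30,2025Q4\n"
    "NA,Python,80,2025Q4\n"
    "NA,Python,75,2025Q3\n"
)

# keep_default_na=False stops pandas from turning the code "NA" into NaN;
# empty strings are still treated as missing
sample = pd.read_csv(sample_csv, keep_default_na=False, na_values=[""])

EXPECTED_COLUMNS = {"country_code", "language", "developer_count", "quarter"}

def check_columns(frame):
    """Hypothetical helper: raise if a column named in Step 1 is missing."""
    missing = EXPECTED_COLUMNS - set(frame.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return frame

# Keep a single quarter so each country-language pair appears at most once
latest = check_columns(sample)
latest = latest[latest["quarter"] == "2025Q4"]

# Duplicate country-language rows would distort the pivot in Step 2
assert not latest.duplicated(["country_code", "language"]).any()
print(latest)
```

If the real file contains several quarters, applying the same filter before Step 2 keeps the matrix to one observation per country and language.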

Step 2: Construct the Country-Language Matrix

We need to create a matrix where rows represent countries and columns represent programming languages, and each cell contains the Revealed Comparative Advantage (RCA) of that country in that language. RCA measures whether a country specializes in a language relative to the global average. The formula:

RCA(c,l) = (dev_c,l / dev_c) / (dev_l / dev_total)

Where dev_c,l is developers in country c using language l, dev_c is total developers in country c, dev_l is total developers using language l globally, and dev_total is total developers globally.

# Pivot to a country-language matrix of developer counts
# aggfunc='sum' avoids averaging (the pivot_table default) if a pair appears more than once
matrix = df.pivot_table(index='country_code', columns='language', values='developer_count', aggfunc='sum', fill_value=0)

# Compute RCA: a country's share in a language divided by that language's global share
total_per_country = matrix.sum(axis=1)
total_per_language = matrix.sum(axis=0)
global_total = total_per_language.sum()

rca = matrix.div(total_per_country, axis=0).div(total_per_language / global_total, axis=1)

# Binarize: 1 where RCA >= 1, else 0
rca_binary = (rca >= 1).astype(int)
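To make the RCA formula concrete, here is a small worked example with two hypothetical countries (A and B) and made-up developer counts. It applies the same arithmetic as the formula above, so the numbers can be checked by hand.

```python
import pandas as pd

# Made-up developer counts: rows are countries, columns are languages
counts = pd.DataFrame(
    {"Python": [90, 60], "Rust": [10, 40]},
    index=["A", "B"],
)

# Share of each language within a country: dev_c,l / dev_c
country_share = counts.div(counts.sum(axis=1), axis=0)

# Share of each language worldwide: dev_l / dev_total
global_share = counts.sum(axis=0) / counts.values.sum()

toy_rca = country_share.div(global_share, axis=1)
toy_binary = (toy_rca >= 1).astype(int)

print(toy_rca.round(2))
print(toy_binary)
```

Country A puts 90% of its developers into Python against a global share of 75%, giving an RCA of 0.9 / 0.75 = 1.2. Country B puts 40% into Rust against a global share of 25%, giving 0.4 / 0.25 = 1.6. After binarizing, A specializes in Python only and B in Rust only.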

Step 3: Apply the Economic Complexity Index (ECI)

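The ECI builds on two counts taken from the binary matrix M: the diversity of a country, k_c (the number of languages in which country c has RCA >= 1), and the ubiquity of a language, k_p (the number of countries with RCA >= 1 in language p). The method of reflections refines these counts iteratively, averaging the ubiquity of each country's languages and the diversity of each language's countries. In practice, the standard approach solves the same problem directly with an eigenvector. Define

M~(c,c') = (1 / k_c) * sum_p ( M(c,p) * M(c',p) / k_p )

and let K be the eigenvector of M~ associated with its second-largest eigenvalue. The Software ECI is the standardized form of that eigenvector:

ECI_c = (K_c - mean(K)) / std(K)

The code below first computes diversity and ubiquity, then the eigenvector-based index.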

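# Method of reflections: starting quantities
k_c = rca_binary.sum(axis=1)  # diversity: languages with RCA >= 1 per country
k_p = rca_binary.sum(axis=0)  # ubiquity: countries with RCA >= 1 per language

# k_c is indexed by country and k_p by language, so they cannot be subtracted directly.
# Standardized diversity serves as a rough first-pass score:
eci = (k_c - k_c.mean()) / k_c.std()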

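# Eigenvector method (the standard way to compute ECI)
import numpy as np

def compute_eci(rca_binary):
    M = rca_binary.values.astype(float)
    k_c = M.sum(axis=1)  # diversity
    k_p = M.sum(axis=0)  # ubiquity
    # Invert safely: empty rows or columns get weight 0 instead of inf
    k_c_inv = np.divide(1.0, k_c, out=np.zeros_like(k_c), where=k_c > 0)
    k_p_inv = np.divide(1.0, k_p, out=np.zeros_like(k_p), where=k_p > 0)
    # M_tilde[c, c'] = (1 / k_c) * sum_l (M[c, l] * M[c', l] / k_l)
    M_tilde = (M * k_c_inv[:, None]) @ (M * k_p_inv).T
    # Eigenvector of the second-largest eigenvalue (eig may return a complex dtype)
    eigvals, eigvecs = np.linalg.eig(M_tilde)
    idx = np.argsort(eigvals.real)[-2]
    K = eigvecs[:, idx].real
    # The sign of an eigenvector is arbitrary; orient it to correlate positively with diversity
    if np.corrcoef(K, k_c)[0, 1] < 0:
        K = -K
    # Standardize to mean 0 and standard deviation 1
    return (K - K.mean()) / K.std()

eci_scores = compute_eci(rca_binary)

# Create a DataFrame
eci_df = pd.DataFrame({'country': rca_binary.index, 'software_eci': eci_scores})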
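The method of reflections mentioned at the start of this step can be sketched in a few lines. The example below is a minimal illustration on a made-up binary matrix of four hypothetical countries and three hypothetical languages, not the study's implementation. The `reflections` function name is an assumption, and the sketch requires every row and column to contain at least one 1.

```python
import numpy as np

# Toy binary RCA matrix: rows are countries, columns are languages
M = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
], dtype=float)

def reflections(M, n_iter):
    """Hypothetical helper: iterate the method of reflections n_iter times."""
    k_c0 = M.sum(axis=1)  # diversity of each country
    k_p0 = M.sum(axis=0)  # ubiquity of each language
    k_c, k_p = k_c0.copy(), k_p0.copy()
    for _ in range(n_iter):
        # Country score: average of the previous scores of its languages
        k_c_next = (M @ k_p) / k_c0
        # Language score: average of the previous scores of its countries
        k_p_next = (M.T @ k_c) / k_p0
        k_c, k_p = k_c_next, k_p_next
    return k_c, k_p

k_c1, k_p1 = reflections(M, 1)
print("average ubiquity of each country's languages:", k_c1)
print("average diversity of each language's countries:", k_p1)
```

After one iteration, the first country scores (3 + 3 + 1) / 3 = 7/3 because it is the only one using the least ubiquitous language, while the other three countries score 3. Lower average ubiquity suggests rarer capabilities. Further iterations shrink the differences between countries, which is one reason the eigenvector formulation is usually preferred.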

Step 4: Validate Against Macroeconomic Indicators

To see if Software ECI predicts GDP, inequality, or emissions, merge your results with external datasets. For example, use World Bank GDP per capita (PPP).

# Load GDP data
gdp = pd.read_csv('gdp_ppp_2025.csv')
merged = eci_df.merge(gdp, on='country')

# Ordinary least squares regression of GDP per capita on Software ECI
import statsmodels.api as sm

X = sm.add_constant(merged['software_eci'])
y = merged['gdp_per_capita']
model = sm.OLS(y, X).fit()
print(model.summary())

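A positive and significant coefficient indicates that Software ECI is associated with GDP per capita. On its own, a single-variable regression does not show that digital complexity captures economic potential beyond traditional measures.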
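To check whether Software ECI carries information beyond an existing measure, one option is to add that measure to the regression as a control. The sketch below is a toy illustration only: the numbers are made up, `trade_eci` is a hypothetical control column, and `log_gdp` is constructed from a known formula so the recovered coefficients can be verified. It uses NumPy least squares to stay dependency-light.

```python
import numpy as np
import pandas as pd

# Made-up scores for six hypothetical countries
toy = pd.DataFrame({
    "software_eci": [-1.5, -0.5, 0.0, 0.5, 1.0, 1.5],
    "trade_eci": [-1.0, 0.2, -0.4, 0.9, 0.1, 1.2],  # hypothetical control
})

# Synthetic outcome built from known coefficients (no noise)
toy["log_gdp"] = 9.0 + 0.5 * toy["software_eci"] + 0.3 * toy["trade_eci"]

# Design matrix: intercept, variable of interest, control
X = np.column_stack([
    np.ones(len(toy)),
    toy["software_eci"].to_numpy(),
    toy["trade_eci"].to_numpy(),
])
coef, *_ = np.linalg.lstsq(X, toy["log_gdp"].to_numpy(), rcond=None)
intercept, b_software, b_trade = coef

print(f"intercept={intercept:.2f}, software_eci={b_software:.2f}, trade_eci={b_trade:.2f}")
```

With real data, the coefficient on `software_eci` in such a model describes its association with the outcome while holding the control fixed. Passing both columns to `sm.add_constant` and `sm.OLS` would additionally give standard errors and p-values.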

Common Mistakes

Summary

By following this guide, you have learned how to transform GitHub Innovation Graph data into a measure of national digital complexity. The Software ECI aims to capture the productive knowledge embedded in a country's software production, knowledge that traditional trade data misses. According to the research by Juhász, Wachs, Kaminski, and Hidalgo, this measure predicts GDP, income inequality, and emissions in ways traditional metrics miss. Their methodology opens new avenues for research on the economic impact of open-source software and developer collaboration.

Start exploring today at GitHub Innovation Graph and contribute to understanding the digital economy.
