Product details

GitHub Activity Data

GitHub

Includes activity from over 3M open source GitHub repositories

Overview

GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

This public dataset is hosted in Google BigQuery and is covered by BigQuery's free tier: each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. See also: What is BigQuery.

Additional details

  • Type: Data
  • Category: Encyclopedic, Other
  • Dataset source: GitHub
  • Cloud service: BigQuery
  • Expected update frequency: Weekly

    Samples

    Here are some examples of SQL queries you can run on this data in BigQuery. For more tips, resources, and community content, see the updated resource list at https://medium.com/@hoffa/b3576fd2b150

    Note: a single query against the full dataset can easily process more than 1TB of data, which would exceed your free monthly quota. The following sample queries use a set of smaller tables.
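
The quota arithmetic behind this note can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, 1TB is assumed to mean 10**12 bytes (check BigQuery's pricing documentation for the exact unit), and the byte counts are made-up inputs. In practice the number of bytes a query would process can be obtained from a BigQuery dry run before the query is executed.

```python
# Assumption: the 1TB free tier is treated as 10**12 bytes.
FREE_TIER_BYTES = 10**12


def remaining_quota(bytes_used_this_month: int) -> int:
    """Bytes of free processing left this month (never negative)."""
    return max(FREE_TIER_BYTES - bytes_used_this_month, 0)


def fits_free_tier(query_bytes: int, bytes_used_this_month: int = 0) -> bool:
    """True if a query scanning query_bytes stays within the remaining quota."""
    return query_bytes <= remaining_quota(bytes_used_this_month)


# A scan of the whole 3TB+ dataset (taken here as 3 * 10**12 bytes) does not fit.
print(fits_free_tier(3 * 10**12))  # False

# A hypothetical 25GB scan of a smaller sample table fits.
print(fits_free_tier(25 * 10**9))  # True
```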

    What are the most commonly used Go packages?
    This query uses one of the smaller sample tables to find the most popular Go packages.
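
The linked query is not reproduced in this text. As a rough local illustration of the kind of counting it performs, the Python sketch below pulls import paths out of Go source text with regular expressions and tallies them. The sample sources, the function name `count_go_imports`, and the regular expressions are all assumptions made for this example, not the query's actual logic.

```python
import re
from collections import Counter

# Grouped import block: import ( "fmt" "os" )
IMPORT_BLOCK = re.compile(r'import\s*\(([^)]*)\)')
# Single-line import, with an optional alias: import "fmt" / import f "fmt"
SINGLE_IMPORT = re.compile(r'^\s*import\s+(?:[\w.]+\s+)?"([^"]+)"', re.MULTILINE)
# Any double-quoted path inside a grouped block
QUOTED = re.compile(r'"([^"]+)"')


def count_go_imports(sources):
    """Count how often each import path appears across Go source strings."""
    counts = Counter()
    for src in sources:
        for block in IMPORT_BLOCK.findall(src):
            counts.update(QUOTED.findall(block))
        counts.update(SINGLE_IMPORT.findall(src))
    return counts


# Hypothetical file contents standing in for rows of a sample table.
SAMPLE = [
    'package main\n\nimport (\n\t"fmt"\n\t"os"\n)\n',
    'package util\n\nimport "fmt"\n',
    'package web\n\nimport (\n\t"fmt"\n\t"net/http"\n)\n',
]

print(count_go_imports(SAMPLE).most_common(1))  # [('fmt', 3)]
```

A regex-based pass like this is approximate: quoted strings inside comments within an import block would be counted as well.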

    What are the most commonly used Java packages?
    This query uses one of the smaller sample tables to find the most popular Java packages.
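
The linked query is likewise not reproduced here. A minimal Python sketch of one way to count Java packages follows, assuming that "package" means everything before the final component of an `import` statement and that static imports are skipped. The function name `count_java_packages` and the sample sources are hypothetical.

```python
import re
from collections import Counter

# Captures the package part of "import a.b.C;" or "import a.b.*;".
# Static imports do not match this pattern and are therefore skipped.
JAVA_IMPORT = re.compile(r'^\s*import\s+([\w.]+)\.[\w*]+\s*;', re.MULTILINE)


def count_java_packages(sources):
    """Count imported packages, one count per import statement."""
    counts = Counter()
    for src in sources:
        counts.update(JAVA_IMPORT.findall(src))
    return counts


# Hypothetical file contents standing in for rows of a sample table.
SAMPLE = [
    "package a;\nimport java.util.List;\nimport java.util.Map;\nimport java.io.File;\n",
    "package b;\nimport java.util.*;\nimport static org.junit.Assert.assertEquals;\n",
]

print(count_java_packages(SAMPLE).most_common(1))  # [('java.util', 3)]
```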

    How many times does 'This should never happen' appear?
    This query uses a smaller sample table to count how many times the comment "this should never happen" appears. If you run this query against the entire dataset, the answer is around 1,000,000!
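
As a local illustration of the counting step, the Python sketch below counts occurrences of the phrase across a few source strings. Matching is case-insensitive here, which is an assumption (the heading and the description above capitalize the phrase differently), and the sample sources and the function name `count_phrase` are hypothetical.

```python
import re

# Assumption: matching ignores case.
PHRASE = re.compile(re.escape("this should never happen"), re.IGNORECASE)


def count_phrase(sources):
    """Total number of occurrences of the phrase across all source strings."""
    return sum(len(PHRASE.findall(src)) for src in sources)


# Hypothetical file contents standing in for rows of a sample table.
SAMPLE = [
    "// This should never happen\nreturn nil\n",
    "/* this should never happen */ panic();\n# THIS SHOULD NEVER HAPPEN\n",
    "int x = 0; // nothing to see here\n",
]

print(count_phrase(SAMPLE))  # 3
```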

    Terms of Service

    This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://help.github.com/articles/github-terms-of-service/ - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
