Commits data

The file git/commits.jsonl.zip is produced as a by-product of the analysis. It contains one JSON array per line, such as this:

["2019-08-13 21:37 +0900", "deadbeef...", "project/repo", ["beefdead"], 1565699847, 1565699847, "dev@acme.org", "Firstname Developer", [["modify", 2]], 1, "commit summary", "full commit msg", [["file_changed.md", [2, ""]]]]

Explanations of the data fields in commits.jsonl.zip, with real-life examples:

"2019-08-13 21:37 +0900", Datetime string of commit author timestamp
"90ec7996cc4bc1a96410c1794965b8c5e1479f37", Commit SHA1 from Git
"databrickskoalas/koalas", Project/repo-name (repo-name as in Git)
["82e2e410817dc1728f97038f193d823f615d0d6a"], Parent commits SHA1 list
1565699847, Git author timestamp of the commit (Unix epoch seconds)
1565699847, Git committer timestamp of the commit (Unix epoch seconds)
"developer-name@gmail.com", Author of the commit as in Git log (email)
"Developer Name", Author of the commit as in Git log (displayname)
[["modify", 2]], Information about how many lines were modified / created.
1, How many files were changed/created
"correct pip installation command (#642)", Commit msg summary line
"correct pip installation command (#642)\n\n", Full commit message
[["CONTRIBUTING.md", [2, ""]]], List of the modified/created/removed files with the number of lines modified in each of them. [2, ""] means 2 lines modified.
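
The field list above can be mirrored by a small named structure, so that code refers to fields by name rather than by position. The following is a minimal sketch: the class name CommitRow, the helper parse_row and all attribute names are illustrative assumptions, not part of the file format.

```python
import json
from typing import NamedTuple


class CommitRow(NamedTuple):
    """One row of commits.jsonl. Attribute names are illustrative, not official."""
    author_datetime: str       # e.g. "2019-08-13 21:37 +0900"
    sha: str
    repo: str
    parent_shas: list
    author_timestamp: int      # Unix epoch seconds
    committer_timestamp: int   # Unix epoch seconds
    author_email: str
    author_name: str
    line_changes: list         # e.g. [["modify", 2]]
    files_changed: int
    summary: str
    message: str
    file_changes: list         # e.g. [["CONTRIBUTING.md", [2, ""]]]


def parse_row(line: str) -> CommitRow:
    """Parse one JSON line, keeping only the columns documented above."""
    values = json.loads(line)
    return CommitRow(*values[:len(CommitRow._fields)])


# The example row from the field list, serialized the way it appears in the file.
sample = json.dumps([
    "2019-08-13 21:37 +0900",
    "90ec7996cc4bc1a96410c1794965b8c5e1479f37",
    "databrickskoalas/koalas",
    ["82e2e410817dc1728f97038f193d823f615d0d6a"],
    1565699847,
    1565699847,
    "developer-name@gmail.com",
    "Developer Name",
    [["modify", 2]],
    1,
    "correct pip installation command (#642)",
    "correct pip installation command (#642)\n\n",
    [["CONTRIBUTING.md", [2, ""]]],
])

row = parse_row(sample)
print(row.repo, row.files_changed, row.file_changes[0][0])
```

Slicing to the known number of fields keeps the sketch working if further columns are appended to the rows later.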

The order of the columns is stable: if the file content is extended, new columns will be appended to the end. The lines in the file are not guaranteed to be in chronological order, but because the datetime string is the first column, the file is easy to sort with cat and sort.
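
Sorting can also be done in Python. Note that the first column is a local datetime string with a UTC offset, so a plain text sort orders rows by local wall-clock time; when commits come from several time zones, the epoch author timestamp in the fifth column gives an unambiguous order. A minimal sketch, assuming the rows fit in memory; the function name and the two shortened rows are made up for illustration.

```python
import json


def sort_rows_by_author_time(lines):
    """Return parsed rows ordered by the epoch author timestamp (5th column)."""
    rows = [json.loads(line) for line in lines if line.strip()]
    return sorted(rows, key=lambda row: row[4])


# Two made-up, shortened rows. "bbb" has the earlier local time string,
# but it was authored later in absolute time than "aaa".
lines = [
    json.dumps(["2019-08-13 21:37 +0900", "aaa", "project/repo", [], 1565699847, 1565699847]),
    json.dumps(["2019-08-13 20:00 -0700", "bbb", "project/repo", [], 1565751600, 1565751600]),
]

text_order = [json.loads(line)[1] for line in sorted(lines)]
time_order = [row[1] for row in sort_rows_by_author_time(lines)]
print(text_order, time_order)
```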

Example of how to handle it in Python

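import json

# fname: path to the unzipped commits.jsonl file
with open(fname, encoding="utf-8") as f:
    for line in f:
        commit_entry = json.loads(line)
        # Unpack the first 13 columns, in the order documented above;
        # slicing keeps this working if new columns are appended later
        (commit_time, sha, repo_path, parents_sha_list,
         commit_authored_date, commit_committed_date,
         commit_author_email, commit_author_name, changed_lines, changed_files,
         commit_summary, commit_message, commit_impact) = commit_entry[:13]
        # Do something with the above fields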
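
The rows can also be read straight from the archive without unzipping it first. The sketch below uses the standard zipfile module and assumes that the archive is a regular zip file whose members are UTF-8 text with one JSON array per line; the function name iter_commit_rows is hypothetical.

```python
import io
import json
import zipfile


def iter_commit_rows(zip_path):
    """Yield one parsed row per non-empty line from every member of the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            with archive.open(member) as raw:
                # archive.open() returns a binary stream; wrap it to read text lines
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    if line.strip():
                        yield json.loads(line)


# Usage (the path is an example):
# for row in iter_commit_rows("git/commits.jsonl.zip"):
#     print(row[1], row[10])  # commit SHA1 and summary line
```

Because the function is a generator, rows are parsed one at a time and the whole file never has to be held in memory.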

We have large aggregated commit datasets of open source projects and offer them free of charge for research purposes. If you are interested, contact us through the chat on our web page.

Other data sets

This help article will be extended as needed. Please contact us if you have special raw data needs.

Accessing raw data produced by Softagram
