<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Who maintains package X?</title><meta charset="UTF-8"/><link rel="stylesheet" href="/static/css/style.css"/><link rel="stylesheet" href="/static/css/highlight.css"/><script src="/static/js/highlight.pack.js"></script><script>hljs.initHighlightingOnLoad();</script><link rel="alternate" href="/atom" title="Who maintains package X?" type="application/atom+xml"/><meta name="viewport" content="width=device-width, initial-scale=1, viewport-fit=cover"/></head><body><nav class="navbar navbar-default navbar-fixed-top"><div class="container"><div class="navbar-header"><a class="navbar-brand" href="/Posts">full stack engineer</a></div><div class="collapse navbar-collapse collapse"><ul class="nav navbar-nav navbar-right"><li><a href="/About"><span>About</span></a></li><li><a href="/Posts"><span>Posts</span></a></li></ul></div></div></nav><main><div class="flex-container"><div class="post"><h2>Who maintains package X?</h2><span class="author">Written by hannes</span><br/><div class="tags">Classified under: <a href="/tags/package signing" class="tag">package signing</a><a href="/tags/security" class="tag">security</a></div><span class="date">Published: 2017-02-16 (last updated: 2017-03-09)</span><article><p>A very important data point for conex, the new opam signing utility, is who is authorised for a given package. We
could have written this down manually, or forced each author to create a
pull request for their packages, but this would be a long process and not
easy: the main opam repository has around 1500 unique packages, and 350
contributors. Fortunately, it is a git repository with 5 years of history, and
over 6900 pull requests. Each opam file may also contain a <code>maintainers</code> entry,
a list of strings (usually a mail address).</p>
<p>The data sources we correlate are the <code>maintainers</code> entry in each opam file and who
actually committed to the opam repository. This is inspired by <a href="https://github.com/ocaml/opam/issues/2693">a GitHub
discussion</a>.</p>
<h3 id="github-id-and-email-address">GitHub id and email address</h3>
<p>For simplicity, since conex uses any (unique) identifier for authors, and the opam
repository is hosted on GitHub, we use a GitHub id as author identifier.
Maintainer information is an email address, thus we need a mapping between them.</p>
<p>We wrote a <a href="https://raw.githubusercontent.com/hannesm/conex/master/analysis/loop-prs.sh">shell
script</a>
to find all PR merges, their GitHub id (in a brittle way: using the name of the
git remote), and email address of the last commit. It also saves a diff of the
PR for later. This results in 6922 PRs (opam repository version 38d908dcbc58d07467fbc00698083fa4cbd94f9d).</p>
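<p>To illustrate the kind of metadata extracted per PR, here is a minimal sketch in Python (not the linked shell script). It assumes the GitHub id is read from the subject line GitHub writes for a merged pull request, <code>Merge pull request #N from user/branch</code>; the function name <code>parse_merge_subject</code> is made up for this example.</p>
<pre><code class="language-python">import re

# Subject line GitHub writes for a merged pull request:
#   "Merge pull request #123 from user/branch"
MERGE_RE = re.compile(r"^Merge pull request #(\d+) from ([^/\s]+)/(\S+)$")


def parse_merge_subject(subject):
    """Return (pr_number, github_id, branch), or None if not a PR merge."""
    m = MERGE_RE.match(subject.strip())
    if m is None:
        return None
    return int(m.group(1)), m.group(2), m.group(3)
</code></pre>
<p>Like the remote name used by the script, this is brittle: a PR merged by hand or rebased leaves no such subject line.</p>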
<p>The metadata output is processed by
<a href="https://github.com/hannesm/conex/blob/dbdfc5337c97d62edc74f1c546023bcb5e719343/analysis/maintainer.ml#L134-L156">github_mail</a>:
we ignore PRs from GitHub organisations (<code>PR.ignore_github</code>), PRs whose commits
were manually picked from a different author (<code>PR.ignore_pr</code>), bad mail addresses,
and <a href="https://github.com/yallop">Jeremy's</a> mail address (it would otherwise be attached to too many GitHub ids). The
goal is to have a single GitHub id for each email address. 329 authors with 416 mail addresses are mapped.</p>
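<p>The idea of the mapping can be sketched in Python as follows (the real code is the OCaml function linked above). The parameter names, the lower-casing of addresses, and the separate <code>ambiguous</code> result are assumptions of this sketch, not taken from conex.</p>
<pre><code class="language-python">from collections import defaultdict


def github_mail(prs, ignore_github=frozenset(), ignore_mail=frozenset()):
    """Map each email address to a single GitHub id.

    prs is an iterable of (github_id, email) pairs, one per merged PR.
    Addresses seen with more than one id are returned separately so they
    can be resolved by hand.
    """
    seen = defaultdict(set)
    for github_id, email in prs:
        email = email.strip().lower()
        # Skip organisations, explicitly ignored addresses, and bad addresses.
        if github_id in ignore_github or email in ignore_mail or "@" not in email:
            continue
        seen[email].add(github_id)
    unique = {mail: next(iter(ids)) for mail, ids in seen.items() if len(ids) == 1}
    ambiguous = {mail: sorted(ids) for mail, ids in seen.items() if len(ids) != 1}
    return unique, ambiguous
</code></pre>
<p>An address that ends up in <code>ambiguous</code> is a candidate for the ignore list, which is how a widely reused address would be handled here.</p>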
<h3 id="maintainer-in-opam">Maintainer in opam</h3>
<p>As mentioned, lots of packages contain a <code>maintainers</code> entry. In
<a href="https://github.com/hannesm/conex/blob/dbdfc5337c97d62edc74f1c546023bcb5e719343/analysis/maintainer.ml#L40-L68"><code>maintainers</code></a>
we extract the mail addresses of the <a href="https://github.com/hannesm/conex/blob/dbdfc5337c97d62edc74f1c546023bcb5e719343/analysis/maintainer.ml#L70-L94">most recently released opam
file</a>.
Some matches are hardcoded for teams which do not properly maintain the maintainers
field (such as mirage and xapi-project ;). We're open to suggestions for extending
this massaging where needed. Additionally, the contact at ocamlpro mail address
was used for all packages before the maintainers entry was introduced (based on
a discussion with Louis Gesbert). 132 packages have an empty maintainers field.</p>
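<p>As a rough illustration of the extraction step, the following Python sketch pulls mail addresses out of the text of an opam file. It uses regular expressions rather than a real opam parser and assumes the field is written as <code>maintainer:</code> (the pattern also accepts <code>maintainers:</code>) followed by a quoted string or a bracketed list; <code>maintainer_mails</code> is a made-up name.</p>
<pre><code class="language-python">import re

# Matches the field value: either one quoted string or a bracketed list.
FIELD_RE = re.compile(r'^maintainers?:\s*(\[[^\]]*\]|"[^"]*")', re.MULTILINE)
# Deliberately simple address pattern, good enough for a sketch.
MAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def maintainer_mails(opam_text):
    """Return the lower-cased mail addresses found in the maintainer field."""
    m = FIELD_RE.search(opam_text)
    if m is None:
        return []
    return [mail.lower() for mail in MAIL_RE.findall(m.group(1))]
</code></pre>
<p>A package for which this returns an empty list would count as having empty maintainers in the terms used above.</p>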
<h3 id="fitness">Fitness</h3>
<p>Combining these two data sources, we hoped to find a strict, small set of whom to
authorise for which package. It turns out some people use different mail addresses
for git commits and opam maintainer entries, which <a href="https://github.com/hannesm/conex/blob/dbdfc5337c97d62edc74f1c546023bcb5e719343/analysis/maintainer.ml#L233-L269">is easily
fixed</a>.</p>
<p>While <a href="https://github.com/hannesm/conex/blob/dbdfc5337c97d62edc74f1c546023bcb5e719343/analysis/maintainer.ml#L169-L205">processing the full diffs of each
PR</a>
(using the diff parser of conex mentioned above), ignoring the 44% done by
<a href="https://github.com/hannesm/conex/blob/dbdfc5337c97d62edc74f1c546023bcb5e719343/analysis/maintainer.ml#L158-L165">janitors</a>
(a set created manually from log data; please report if it is wrong), we
categorise the modifications: authorised modification (the GitHub id is
authorised for the package), modification by an author to a team-owned package
(propose to add this author to the team), modification of a package where no
GitHub id is authorised, and unauthorised modification. We also ignore packages
which are no longer in the opam repository.</p>
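<p>The categories above can be expressed as a small decision function. This is a Python sketch under assumed data structures (a mapping from package to authorised ids, a mapping from team to members, a set of janitors, and the set of packages still in the repository); the names and the order of the checks are choices of this example, not the conex implementation.</p>
<pre><code class="language-python">def categorise(github_id, package, owners, teams, janitors, in_repo):
    """Classify one modification of package by github_id.

    owners maps a package to the set of authorised ids (GitHub ids or team
    names); teams maps a team name to the set of its member ids.
    """
    # Janitor changes and packages no longer in the repository are skipped.
    if github_id in janitors or package not in in_repo:
        return "ignored"
    authorised = owners.get(package, set())
    if not authorised:
        return "no-maintainer"
    member_of = {t for t in authorised if github_id in teams.get(t, set())}
    if github_id in authorised or member_of:
        return "authorised"
    # Owned by a team the author is not (yet) part of: propose adding them.
    if any(t in teams for t in authorised):
        return "team-owned"
    return "unauthorised"
</code></pre>
<p>Counting the returned labels over all modifications gives one total per category.</p>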
<p>2766 modifications were authorised, 418 were team-owned, 452 were to packages
with no maintainer, and 570 unauthorised. This results in 125 unowned packages.</p>
<p>Out of the 452 modifications to packages with no maintainer, 75 are a global
one-to-one author to package relation, and are directly authorised.</p>
<p>Inference of team members is an overapproximation (everybody who committed
changes to their packages); additionally, the janitors are missing. We will have
to fill these in manually.</p>
<pre><code>alt-ergo -&gt; OCamlPro-Iguernlala UnixJunkie backtracking bobot nobrowser
janestreet -&gt; backtracking hannesm j0sh rgrinberg smondet
mirage -&gt; MagnusS dbuenzli djs55 hannesm hnrgrgr jonludlam mato mor1 pgj pqwy pw374 rdicosmo rgrinberg ruhatch sg2342 talex5 yomimono
ocsigen -&gt; balat benozol dbuenzli hhugo hnrgrgr jpdeplaix mfp pveber scjung slegrand45 smondet vasilisp
xapi-project -&gt; dbuenzli djs55 euanh mcclurmc rdicosmo simonjbeaumont yomimono
</code></pre>
<h3 id="alternative-approach-github-urls">Alternative approach: GitHub urls</h3>
<p>An alternative approach (attempted earlier), which works only for GitHub-hosted projects, is to authorise
<a href="https://github.com/hannesm/conex/blob/github/analysis/maintainer.ml#L37-L91">the user part of the GitHub repository
URL</a>.
Results after filtering GitHub organisations are not yet satisfactory (but only
56 packages are left with no maintainer, see the <a href="https://github.com/hannesm/opam-repository/tree/github">output repo</a>). This approach
completely ignores the manually written maintainer field.</p>
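<p>Extracting the user part of a URL could look like the following Python sketch. Which opam fields supply the URL and which URL shapes occur in practice are not covered here; the accepted forms (plain URLs, a <code>git+</code> prefix, and the <code>git@github.com:</code> form) are assumptions, and <code>github_user</code> is a made-up name.</p>
<pre><code class="language-python">from urllib.parse import urlparse


def github_user(url):
    """Return the user (or organisation) part of a GitHub URL, else None."""
    if url.startswith("git@github.com:"):
        path = url[len("git@github.com:"):]
    else:
        # Strip a "git+" prefix as in "git+https://..." before parsing.
        parsed = urlparse(url[4:] if url.startswith("git+") else url)
        if parsed.hostname not in ("github.com", "www.github.com"):
            return None
        path = parsed.path
    parts = [p for p in path.split("/") if p]
    return parts[0] if parts else None
</code></pre>
<p>The result may be an organisation rather than a person, which is why organisations have to be filtered afterwards.</p>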
<h3 id="conclusion">Conclusion</h3>
<p>Manually maintained metadata is easily out of date, and not very useful on its own. But
combining automatically created metadata with the manually maintained metadata, plus some manual tweaking,
leads to reasonable data.</p>
<p>The resulting authorised inference is available <a href="https://github.com/hannesm/opam-repository/tree/auth">in this branch</a>.</p>
</article></div></div></main></body></html>