GitHub’s API is really handy when you want to get stats about one of your repositories. Its GraphQL API in particular is really powerful. I’ve shown how it could be used to fetch project information in my GitHub Actions tutorial.
While working on another project, I ran into a limitation that had me scratching my head for quite a while. I wanted to share my workaround for anyone who may run into this problem in the future.
tl;dr: the GitHub API returns a maximum of 1,000 results for the search endpoint, and that applies to GraphQL searches too.
You can use the GraphQL API and the search endpoint to search for specific items in your repo. In the example below, I’m searching for all Pull Requests merged in the past 6 months in the “Automattic/jetpack” repository, and returning the total number of PRs (issueCount) as well as each PR’s title.
query($cursor: String) {
  search(
    first: 100,
    query: "repo:automattic/jetpack is:pr is:merged merged:2022-10-05..2023-04-05 sort:updated-asc",
    type: ISSUE,
    after: $cursor
  ) {
    issueCount
    nodes {
      ... on PullRequest {
        title
      }
    }
    pageInfo {
      hasNextPage
      endCursor
    }
  }
}
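If you want to run a query like this from a script rather than from the GraphQL Explorer, here is a minimal Python sketch of how the request could be built. It assumes a personal access token stored in a `GITHUB_TOKEN` environment variable; the helper names `build_graphql_request` and `run_query` are made up for illustration.

```python
import json
import os
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"


def build_graphql_request(query, variables=None, token=None):
    """Build (but do not send) a POST request for GitHub's GraphQL endpoint."""
    # Assumption: the token lives in a GITHUB_TOKEN environment variable.
    token = token or os.environ.get("GITHUB_TOKEN", "")
    payload = json.dumps({"query": query, "variables": variables or {}})
    return urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=payload.encode("utf-8"),
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_query(query, variables=None, token=None):
    """Send the request and return the decoded JSON response."""
    request = build_graphql_request(query, variables, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The first page would be fetched with `run_query(query, {"cursor": None})`, and later pages by passing the previous page’s `endCursor` as `cursor`.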
By requesting hasNextPage and endCursor, I can paginate through the results. I can pass endCursor as the cursor query variable to fetch another page of 100 results, and then another, until hasNextPage returns false. You can read more about this cursor-based pagination in the GraphQL docs.
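The pagination loop described above could be sketched like this in Python. This is only an illustration: `fetch_page` is a hypothetical callable that runs the query with a given cursor and returns the `search` object from the response, and the two fake pages stand in for the real API.

```python
def fetch_all_pages(fetch_page):
    """Collect nodes from every page of a cursor-paginated connection.

    `fetch_page(cursor)` is a hypothetical callable: it runs the query with
    the given cursor (None for the first page) and returns a dict shaped like
    {"nodes": [...], "pageInfo": {"hasNextPage": ..., "endCursor": ...}}.
    """
    nodes = []
    cursor = None
    while True:
        connection = fetch_page(cursor)
        nodes.extend(connection["nodes"])
        page_info = connection["pageInfo"]
        if not page_info["hasNextPage"]:
            return nodes
        # The end cursor of this page is the starting point of the next one.
        cursor = page_info["endCursor"]


# Usage with a fake two-page response standing in for the real API.
fake_pages = {
    None: {"nodes": [{"title": "PR 1"}, {"title": "PR 2"}],
           "pageInfo": {"hasNextPage": True, "endCursor": "abc"}},
    "abc": {"nodes": [{"title": "PR 3"}],
            "pageInfo": {"hasNextPage": False, "endCursor": None}},
}
titles = [node["title"] for node in fetch_all_pages(fake_pages.__getitem__)]
print(titles)  # ['PR 1', 'PR 2', 'PR 3']
```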
This worked well enough, but I was only getting 1,000 Pull Requests. This was odd since there were more than 1,000 results in total; issueCount returned 2,129 results for that query.
I realized, after way too much time spent debugging, that I wasn’t getting new results after the 10th page (i.e. 10 times 100 results, since we have 100 results per page).
It turns out, GitHub’s API limits the number of results to 1,000 for the Search endpoint!
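To make the numbers concrete, here is a small Python sketch that compares issueCount with the 1,000-result cap before paginating. The function name and the `SEARCH_RESULT_CAP` constant are my own; only the cap itself and the figures come from the query above.

```python
import math

SEARCH_RESULT_CAP = 1000  # the search limit described above


def search_coverage(issue_count, page_size=100):
    """Return (pages_needed, pages_reachable, unreachable_results)."""
    pages_needed = math.ceil(issue_count / page_size)
    reachable_results = min(issue_count, SEARCH_RESULT_CAP)
    pages_reachable = math.ceil(reachable_results / page_size)
    return pages_needed, pages_reachable, issue_count - reachable_results


print(search_coverage(2129))  # (22, 10, 1129)
```

In other words, the query would need 22 pages, only 10 of them can be fetched, and 1,129 Pull Requests stay out of reach.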
My next stop was repository.pullRequests. It provides “a list of pull requests that have been opened in the repository.” That seemed like my best bet, and the documentation didn’t mention a limit.
query($cursor: String) {
  repository(
    owner: "automattic",
    name: "jetpack"
  ) {
    pullRequests(
      states: MERGED,
      first: 100,
      after: $cursor,
      orderBy: {
        direction: DESC,
        field: UPDATED_AT
      }
    ) {
      totalCount
      nodes {
        ... on PullRequest {
          mergedAt
          createdAt
          title
        }
      }
      pageInfo {
        hasNextPage
        endCursor
      }
    }
  }
}
This isn’t quite enough though:
- We’ll get all the merged Pull Requests in the repository, with no specific time frame.
- orderBy.field does not offer a way to sort by mergedAt, which would be useful for me here. I cannot just loop through all the PRs while looking at pullRequests.nodes.mergedAt and stop my query when I start seeing PRs that were merged more than 6 months ago: results are sorted by update date, and since a PR can be edited after it’s been merged, an old PR can show up ahead of recent ones. Stopping there may miss some recent PRs.
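Here is a small Python illustration of that second problem. The three PRs are made up; they are listed in the order the API would return them (by update date, newest first). PR "B" was merged long ago but edited recently, so it appears before "C".

```python
CUTOFF = "2022-10-05T00:00:00Z"  # start of the 6-month window

# Made-up PRs, already sorted by updatedAt, newest first.
prs = [
    {"title": "A", "mergedAt": "2023-03-01T10:00:00Z", "updatedAt": "2023-04-01T09:00:00Z"},
    {"title": "B", "mergedAt": "2022-01-15T10:00:00Z", "updatedAt": "2023-03-20T09:00:00Z"},
    {"title": "C", "mergedAt": "2023-02-10T10:00:00Z", "updatedAt": "2023-02-10T10:00:00Z"},
]


def stop_at_first_old_pr(prs, cutoff):
    """Naive approach: stop as soon as a PR merged before the cutoff shows up."""
    kept = []
    for pr in prs:
        # ISO 8601 UTC timestamps in the same format compare correctly as strings.
        if pr["mergedAt"] < cutoff:
            break
        kept.append(pr["title"])
    return kept


def filter_every_pr(prs, cutoff):
    """Reference result: look at every PR and keep those merged in the window."""
    return [pr["title"] for pr in prs if pr["mergedAt"] >= cutoff]


print(stop_at_first_old_pr(prs, CUTOFF))  # ['A']
print(filter_every_pr(prs, CUTOFF))       # ['A', 'C']
```

Stopping at "B" loses "C", even though "C" was merged well within the window.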
My solution was to use both endpoints:
query($cursor: String) {
  search(
    query: "repo:automattic/jetpack is:pr is:merged merged:2022-10-05..2023-04-05",
    type: ISSUE,
    first: 100
  ) {
    issueCount
  }
  repository(
    owner: "automattic",
    name: "jetpack"
  ) {
    pullRequests(
      states: MERGED,
      first: 100,
      after: $cursor,
      orderBy: {field: UPDATED_AT, direction: DESC}
    ) {
      pageInfo {
        hasNextPage
        endCursor
      }
      nodes {
        author {
          login
        }
        labels(first: 100) {
          edges {
            node {
              name
            }
          }
        }
        createdAt
        mergedAt
      }
    }
  }
}
In my logic, I build an array of Pull Requests from scratch, adding new Pull Requests as they are returned by repository.pullRequests. I stop adding to my array when it’s the same size as the total reported by search.issueCount.
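That logic could look roughly like the Python sketch below. It is not the exact code I run: `fetch_page` is a hypothetical callable returning the `pullRequests` object for a given cursor, and the sketch assumes that only PRs whose mergedAt falls inside the searched window are added to the array, so that its size can be compared with issueCount.

```python
def collect_merged_prs(fetch_page, issue_count, start, end):
    """Page through repository.pullRequests until `issue_count` PRs merged
    between `start` and `end` (ISO 8601 strings) have been collected."""
    collected = []
    cursor = None
    while len(collected) < issue_count:
        connection = fetch_page(cursor)
        for pr in connection["nodes"]:
            # Assumption: only PRs merged inside the window count towards the total.
            if start <= pr["mergedAt"] <= end and len(collected) < issue_count:
                collected.append(pr)
        page_info = connection["pageInfo"]
        if not page_info["hasNextPage"]:
            break  # ran out of PRs before reaching the expected total
        cursor = page_info["endCursor"]
    return collected


# Fake pages standing in for the real API; the third page is never needed.
pages = {
    None: {"nodes": [{"mergedAt": "2023-03-01T10:00:00Z"}, {"mergedAt": "2022-01-15T10:00:00Z"}],
           "pageInfo": {"hasNextPage": True, "endCursor": "p2"}},
    "p2": {"nodes": [{"mergedAt": "2023-02-10T10:00:00Z"}],
           "pageInfo": {"hasNextPage": True, "endCursor": "p3"}},
    "p3": {"nodes": [{"mergedAt": "2021-06-01T10:00:00Z"}],
           "pageInfo": {"hasNextPage": False, "endCursor": None}},
}
requested = []


def fake_fetch(cursor):
    requested.append(cursor)
    return pages[cursor]


result = collect_merged_prs(fake_fetch, 2, "2022-10-05T00:00:00Z", "2023-04-05T23:59:59Z")
print(len(result), requested)  # 2 [None, 'p2']
```

The loop stops as soon as the array holds as many PRs as issueCount reported, so older pages are never requested.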
It’s not perfect, but it does the trick for my needs. :) Hopefully in the future we can order pull requests by merge date and all this will be easier. I’ll go open an issue to make that suggestion in a bit.
Hopefully this can help you if you also run into that limitation.