Semantic Scholar API - Tutorial

Get Started with Semantic Scholar API

Learn to search for papers and authors, download datasets, and more

Quick guide to get started with Semantic Scholar API

Getting Started Guide

The Semantic Scholar REST API uses standard HTTP verbs, response codes, and authentication. Our Getting Started Guide will teach you how to interact with the Semantic Scholar API by sending a request and analyzing the response. All code examples are shown in Python. If you would prefer a code-free experience, you can try out the API using interactive tools like Postman (a popular, free API testing platform). We have included additional visuals of what each request looks like in Postman for your reference.

Checklist before making an API request

  • Do you know the endpoint and its URL?
  • Do you know the request parameters?
  • Do you need an API key?

See below for how to obtain each piece of information!

How do I know the endpoint URL?

An API endpoint URL consists of two main parts:

  • Base URL: Tells the API where to start looking for the data you want. Each Semantic Scholar service has its own base URL, which you can find below.
  • Resource path: Specifies the entity or action you want to perform.

For example, the paper relevance search endpoint in the Academic Graph API would have the following URL:

[Diagram: the base URL https://api.semanticscholar.org/graph/v1 followed by the resource path /paper/search]

Base URLs

  • Academic Graph API: https://api.semanticscholar.org/graph/v1
  • Recommendations API: https://api.semanticscholar.org/recommendations/v1
  • Datasets API: https://api.semanticscholar.org/datasets/v1

How do I make an API Request?

Once you have your endpoint URL, use our API documentation to determine any input parameters you are required to send with your request, and the format in which they must be sent. For example, the author details endpoint requires us to specify the author ID as a path parameter and the details we want about the author as a query parameter.

Once you have your endpoint URL and the required input parameters, you are ready to send your request! Each programming language has its own way of making an API request. Below you will find examples of how to send a request to the paper relevance search endpoint in Python and in Postman.

Python Example:

import requests

# Define the API endpoint URL
url = 'https://api.semanticscholar.org/graph/v1/paper/search'

# Define the query parameters (in this case, the keyword to search for)
query_params = {'query': 'quantum computing'}

# Directly define the API key (Reminder: Securely handle API keys in production environments)
api_key = 'your api key goes here'  # Replace with the actual API key

# Define headers with API key
headers = {'x-api-key': api_key}

# Send the API request
response = requests.get(url, params=query_params, headers=headers)

# Check response status
if response.status_code == 200:
   response_data = response.json()
   # Process and print the response data as needed
   print(response_data)
else:
   print(f"Request failed with status code {response.status_code}: {response.text}")

Postman Request Example:

Response: A successful response from the paper relevance search endpoint would look like the example below. The first three fields (total, offset, next) are pagination data we can use to page through our results. The data field is a list of objects, each containing information about a paper.

How do I use an API Key?

Although not every endpoint requires authentication via an API key, as a best practice we recommend always including your key with every request. Doing so will help Semantic Scholar better support you in the event you need additional help or debugging support. Additionally, all unauthenticated users share a limit of 5,000 requests per 5 minutes. 

To authenticate via an API key, include your key in a custom header called “x-api-key”, as shown in the Python example below:

Warning: It is advised to store and retrieve your API key values through environment variables instead of hard-coding them.

#define a custom header called x-api-key
headers = {'x-api-key': 'your-api-key-goes-here'}
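
If you prefer not to hard-code the key, below is a minimal sketch of the environment-variable approach suggested in the warning above (the variable name S2_API_KEY is just an example):

import os
import requests

# Read the key from an environment variable (S2_API_KEY is an arbitrary example name)
api_key = os.environ.get('S2_API_KEY')

# Only attach the header if a key was found; unauthenticated requests still work, with shared rate limits
headers = {'x-api-key': api_key} if api_key else {}

response = requests.get(
    'https://api.semanticscholar.org/graph/v1/paper/search',
    params={'query': 'quantum computing'},
    headers=headers
)
print(response.status_code)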


To include an API key as a header in Postman, you can switch to the Headers tab:

Common Error Codes

  • 400 Bad Request: The server could not understand your request. Check your parameters.
  • 401 Unauthorized: You're not authenticated or your credentials are invalid.
  • 403 Forbidden: The server understood the request but refused it. You don't have permission to access the requested resource.
  • 404 Not Found: The requested resource or endpoint does not exist.
  • 429 Too Many Requests: You've hit the rate limit; slow down your requests.
  • 500 Internal Server Error: Something went wrong on the server's side.
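
These status codes can also be handled in code. Below is a minimal sketch (not an official client) that retries a request a few times with a simple backoff whenever a 429 rate-limit response is returned:

import time
import requests

def get_with_retry(url, params=None, headers=None, max_retries=3):
    """Send a GET request, retrying with a simple backoff on 429 responses."""
    for attempt in range(max_retries):
        response = requests.get(url, params=params, headers=headers)
        if response.status_code == 429:
            # Rate limited: wait a little longer on each attempt before retrying
            time.sleep(2 ** attempt)
            continue
        return response
    return response

response = get_with_retry('https://api.semanticscholar.org/graph/v1/paper/search',
                          params={'query': 'quantum computing'})
print(response.status_code)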

Pagination

Pagination is a technique used in APIs to manage and retrieve large sets of data in smaller, manageable chunks. This is particularly useful when dealing with extensive datasets to improve efficiency and reduce the load on both the client and server.

Key Parameters:

  • Limit: Specifies the maximum number of items (e.g., papers) to be returned in a single API response. For example, in the request https://api.semanticscholar.org/graph/v1/paper/search?query=halloween&limit=3, the limit=3 indicates that the response should include a maximum of 3 papers.

  • Offset: Represents the starting point from which the API should begin fetching items. It helps skip a certain number of items. For example, if offset=10, the API will start retrieving items from the 11th item onward.
  • Next: A token or identifier provided in the response, pointing to the next set of items. It allows fetching the next page of results. For example, the next field in the response will contain information needed for the client to request the next set of items.

The client requests the API for the first page of results. The API responds with the specified number of items (limit) along with the total number of items (total). If there are more items to retrieve, the response includes a next token. The client can use the next token in subsequent requests to get the next page of results until all items are fetched. This way, pagination allows clients to retrieve large datasets efficiently, page by page, based on their needs.

Example Request

The following request asks the API to find papers related to "halloween" with a limit of 3 papers per response:
https://api.semanticscholar.org/graph/v1/paper/search?query=halloween&limit=3

Example Response

  • total: Indicates that there are a total of 3063 papers related to "halloween" in the Semantic Scholar database
  • offset: Shows that the current response starts from the first paper (position 0)
  • next: Contains a token (in this case, 3) that the client can use to fetch the next set of papers
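
As an illustration, the loop below is one possible way (a sketch, not the only approach) to page through the same search in Python, advancing the offset until the next token disappears:

import requests

url = 'https://api.semanticscholar.org/graph/v1/paper/search'
params = {'query': 'halloween', 'limit': 3, 'offset': 0}

all_papers = []
max_results = 9  # stop after a few pages for this demonstration

while len(all_papers) < max_results:
    response = requests.get(url, params=params)
    if response.status_code != 200:
        print(f"Request failed with status code {response.status_code}")
        break

    data = response.json()
    all_papers.extend(data.get('data', []))

    # The 'next' field gives the offset of the next page; if it is missing, there are no more results
    if 'next' not in data:
        break
    params['offset'] = data['next']

print(f"Retrieved {len(all_papers)} papers")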

Academic Graph: Paper

Searching and Retrieving Paper Details

Use Case: Search for papers using keywords and get additional details about a specific paper that interests you. The paper relevance search endpoint will perform a keyword search for papers using our custom-trained ranker.

Step 1: First search for papers you are interested in, using the paper relevance search endpoint.

https://api.semanticscholar.org/graph/v1/paper/search?query=semantic%20scholar%20platform&limit=3

The request above contains the following parameters:

  • query (query parameter): semantic scholar platform. The keyword to search for; here we are searching for papers related to the Semantic Scholar platform.
  • limit (query parameter): 3. Limits how many records we retrieve at a time; in this case we only want 3 papers per response.

Python Example:

import requests

# Define the paper search endpoint URL
url = 'https://api.semanticscholar.org/graph/v1/paper/search'

# Define the required query parameter and its value (in this case, the keyword we want to search for)
query_params = {
    'query': 'semantic scholar platform',
    'limit': 3
}

# Make the GET request with the URL and query parameters
searchResponse = requests.get(url, params=query_params)

Postman Request Example:

[Screenshot: Postman request]
  • Response: Below you will find the response we received from the API. The first three fields (total, offset, next) are pagination data we can use to page through our results. The data field is a list of objects, each containing information about a paper. In the next step, let's try to find out more information about the paper titled "The Semantic Scholar Open Data Platform" by using its paperId.
[Screenshot: Postman response]

NOTE: This endpoint supports pagination. Check out our Pagination guide for more information.

Step 2: Retrieve more details about the “The Semantic Scholar Open Data Platform” paper by using its paperId - cb92a7f9d9dbcf9145e32fdfa0e70e2a6b828eb1

https://api.semanticscholar.org/graph/v1/paper/cb92a7f9d9dbcf9145e32fdfa0e70e2a6b828eb1?fields=title,year,abstract,authors.name

The request above contains the following parameters:

  • paperId (path parameter): cb92a7f9d9dbcf9145e32fdfa0e70e2a6b828eb1. ID of the paper to be retrieved. In this example we are using the Semantic Scholar Paper ID, but the API also supports external paper IDs*.
  • fields (query parameter): title,year,abstract,authors.name. Comma-separated list of details we would like to know about the paper; here we are asking for the paper's title, year of publication, abstract, and the names of its authors.

Python Example:

import requests

# Define the paper search endpoint URL
url = 'https://api.semanticscholar.org/graph/v1/paper/search'

# Define the required query parameter and its value (in this case, the keyword we want to search for)
query_params = {
    'query': 'semantic scholar platform',
    'limit': 3
}

# Define a separate function to make a request to the paper details endpoint using a paper_id. This function will be used later on (after we call the paper search endpoint).
def get_paper_data(paper_id):
  url = 'https://api.semanticscholar.org/graph/v1/paper/' + paper_id

  # Define which details about the paper you would like to receive in the response
  paper_data_query_params = {'fields': 'title,year,abstract,authors.name'}

  # Send the API request and store the response in a variable
  response = requests.get(url, params=paper_data_query_params)
  if response.status_code == 200:
    return response.json()
  else:
    return None

# Make the GET request to the paper search endpoint with the URL and query parameters
search_response = requests.get(url, params=query_params)

# Check if the request was successful (status code 200)
if search_response.status_code == 200:
  search_response = search_response.json()

  # Retrieve the paper id corresponding to the 1st result in the list
  paper_id = search_response['data'][0]['paperId']

  # Retrieve the paper details corresponding to this paper id using the function we defined earlier.
  paper_details = get_paper_data(paper_id)

  # Check if paper_details is not None before proceeding
  if paper_details is not None:
    # Your code to work with the paper details goes here.
    # For example, print the title of the paper we looked up:
    print(paper_details['title'])
  else:
    print("Failed to retrieve paper details.")

else:
  # Handle potential errors or non-200 responses
  print(f"Relevance Search Request failed with status code {search_response.status_code}: {search_response.text}")

Postman Request Example:

[Screenshot: Postman request]
  • Response: As requested, we received the paper's title, abstract, year of publication, and authors' names in the response, shown below.
[Screenshot: Postman response]

Using Paper Search with External IDs

Semantic Scholar API also supports lookups through many external paper identifiers, in addition to the Semantic Scholar Paper ID. The table below lists the currently supported external IDs.

Alternative Paper Identifiers and Examples:

Semantic Scholar Paper ID https://api.semanticscholar.org/0796f6cd7f0403a854d67d525e9b32af3b277331

DOI https://api.semanticscholar.org/10.1038/nrn3241

ArXiv ID https://api.semanticscholar.org/arXiv:1705.10311

ACL ID https://api.semanticscholar.org/ACL:W12-3903

PubMed ID https://api.semanticscholar.org/PMID:19872477

Corpus ID https://api.semanticscholar.org/CorpusID:37220927

Use Case: Earlier we saw how to retrieve a paper’s details using its Semantic Scholar Paper ID. In this example, let's fetch details about a paper using its arXiv ID, one of the many external IDs supported by Semantic Scholar API:

  • Endpoint: https://api.semanticscholar.org/graph/v1/paper/{paperId}
  • Request:

https://api.semanticscholar.org/graph/v1/paper/arXiv:1705.10311?fields=title,year,abstract,authors.name

The request above contains the following parameters:

  • paperId (path parameter): arXiv:1705.10311. ID of the paper to be retrieved. In this case our ID references a paper from an external source (arXiv), so we supply the paper's arXiv ID.
  • fields (query parameter): title,year,abstract,authors.name. Comma-separated list of details we would like to know about the paper; here we are asking for the paper's title, year of publication, abstract, and the names of its authors.

Python Example:

import requests

# Define the external ID
arxiv_id = "arXiv:1705.10311"

# Construct the request URL
url = 'https://api.semanticscholar.org/graph/v1/paper/' + arxiv_id

# Define which details about the paper you would like to receive in the response
paper_data_query_params = {'fields': 'title,year,abstract,authors.name'}

# Send the API request and store the response in a variable
response = requests.get(url, params=paper_data_query_params)

if response.status_code == 200:
    response_data = response.json()

    # Your code to work with the response data goes here

else:
    print(f"Request failed with status code {response.status_code}: {response.text}")
    # Error handling code goes here
  • Response: In the response, we receive the fields we requested for the paper with arXiv ID arXiv:1705.10311

Filtering Search Results

Use Case: I want to search for papers on Natural Language Processing (NLP) that were published in Journals since 2018.

https://api.semanticscholar.org/graph/v1/paper/search?query=NLP&limit=5&publicationTypes=JournalArticle&year=2018-&fields=title,publicationTypes,publicationDate

The request above contains the following parameters:

  • query (query parameter): NLP. Keywords we want to search for; the response will contain papers related to NLP.
  • limit (query parameter): 5. Pagination parameter that lets us page through our results by limiting how many records we retrieve at a time; in this case, 5 papers per response.
  • publicationTypes (query parameter): JournalArticle. Restricts our results to papers that appear in journals.
  • year (query parameter): 2018-. Restricts our results to papers published in 2018 or later.
  • fields (query parameter): title,publicationTypes,publicationDate. Comma-separated list of details we would like to know about each paper; here we are asking for the paper's title, publication types, and publication date (YYYY-MM-DD) if available.

Python Example:

import requests

# Define the endpoint URL
url = "https://api.semanticscholar.org/graph/v1/paper/search"

# Define the query parameters
query_params = {
    'query': 'NLP',
    'limit': 5,
    'publicationTypes': 'JournalArticle',
    'year': '2018-',
    'fields': 'title,publicationTypes,publicationDate'
}

# Make the request with the specified parameters
response = requests.get(url, params=query_params)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Convert response to JSON format
    response = response.json()

    #code to process response data goes here

else:
    # Handle potential errors or non-200 responses
    print(f"Request failed with status code {response.status_code}: {response.text}")
  • Response: As shown below, we only receive papers that meet our filter criteria (published after 2018 in Journals)

Using Search Query Operators

Semantic Scholar’s Paper Bulk Search supports a variety of operators that enable advanced filtering and precise specifications in search queries. All keywords in the search query are matched against the paper’s title and abstract. Refer to the API Documentation for all operators supported. We have included examples of varying complexity below to help you get started.

Example:

((cloud computing) | virtualization) +security -privacy

Explanation: Matches papers containing the words "cloud" and "computing", or the word "virtualization", in their title or abstract. The paper title or abstract must also include the term "security" but must exclude the word "privacy". For example, a paper with the title "Ensuring Security in Cloud Computing Environments" could be included, unless its abstract contains the word "privacy".

Request:

https://api.semanticscholar.org/graph/v1/paper/search/bulk?query=%28%28cloud%20computing%29%20%7C%20virtualization%29%20%2Bsecurity%20-privacy&fields=title,abstract

Example:

"red blood cell" + artificial intelligence

Explanation: Matches papers where the title or abstract contains the exact phrase “red blood cell” along with the words “artificial” and “intelligence”. For example, a paper with the title "Applications of Artificial Intelligence in Healthcare" would be included if it also contained the phrase “red blood cell” in its abstract.

Request:

https://api.semanticscholar.org/graph/v1/paper/search/bulk?query=%22red%20blood%20cell%22%20%2B%20artificial%20intelligence&fields=title,abstract

Example:

fish*

Explanation: Matches papers where the title or abstract contains words beginning with "fish", such as "fishtank", "fishes", or "fishy". For example, a paper with the title "Ecology of Deep-Sea Fishes" would be included.

Request:

https://api.semanticscholar.org/graph/v1/paper/search/bulk?query=fish%2A&fields=title,abstract

Example:

bugs~3

Explanation: Matches papers where the title or abstract contains words within an edit distance of 3 from the word "bugs", such as "buggy", "but", "buns", or "busg". An edit is the addition, removal, or change of a single character.

Request:

https://api.semanticscholar.org/graph/v1/paper/search/bulk?query=bugs~3&fields=title,abstract

Example:

"blue lake"~3

Explanation: Matches papers where the title or abstract contains the phrase with up to 3 terms between the words specified. For example, a paper titled "Preserving blue lakes during the winter" or with an abstract containing a phrase such as "blue fishes in the lake" would be included.

Request:

https://api.semanticscholar.org/graph/v1/paper/search/bulk?query=%22blue%20lake%22~3&fields=title,abstract
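
The encoded URLs above can also be built in Python by passing the raw query string as a parameter and letting the requests library handle the URL encoding. A minimal sketch using the first example query:

import requests

url = 'https://api.semanticscholar.org/graph/v1/paper/search/bulk'

# requests will URL-encode the operators in the query string for us
query_params = {
    'query': '((cloud computing) | virtualization) +security -privacy',
    'fields': 'title,abstract'
}

response = requests.get(url, params=query_params)

if response.status_code == 200:
    data = response.json()
    print(f"Matched {data.get('total', 0)} papers")
else:
    print(f"Request failed with status code {response.status_code}: {response.text}")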

Academic Graph: Author

Searching and Retrieving Author Details

Use Case: You can search for an author by their author ID or their name. Let's search for an author by name and find out more details about the papers they have written.

https://api.semanticscholar.org/graph/v1/author/search?query=Bob%20H.%20Smith&fields=paperCount,papers.title,papers.fieldsOfStudy

The request above contains the following parameters:

  • query (query parameter): Bob H. Smith. Name of the author to be searched.
  • fields (query parameter): paperCount,papers.title,papers.fieldsOfStudy. Comma-separated list of details about the author to be returned; here we want to know how many papers this author has written, the title of each of those papers, and the fields of study each paper belongs to.

Python Example:

import requests

# Define the API endpoint URL
url = "https://api.semanticscholar.org/graph/v1/author/search"

# Define the required query parameters
query_params = {
    "query": "Bob H. Smith",
    "fields": "paperCount,papers.title,papers.fieldsOfStudy"
}

# Make the GET request
response = requests.get(url, params=query_params)

# Check if the request was successful
if response.status_code == 200:
    # Parse and work with the response data in JSON format
    data = response.json()

    # Your code to process the data goes here

else:
    # Handle the error, e.g., print an error message
    print(f"Request failed with status code {response.status_code}")

Postman Request Example:

  • Response: In the response, we receive pagination data, the number of papers this author has written (paperCount), and the title and fieldsOfStudy for each paper.

Recommendations

Retrieving Paper Recommendations

Use Case: I’m building a research tool and want to recommend other papers to my user based on the paper they are currently reading. How can the API help me do this?

https://api.semanticscholar.org/recommendations/v1/papers/forpaper/649def34f8be52c8b66281af98ae884c09aef38b?fields=title,year

The request above contains the following parameters:

  • paperId (path parameter): 649def34f8be52c8b66281af98ae884c09aef38b. ID of the paper to base the recommendations on.
  • fields (query parameter): title,year. Details about each recommended paper to be included in the response; in this case we just want the paper title and year of publication.

Python Example:

import requests

# Define the base URL for the API
base_url = "https://api.semanticscholar.org/recommendations/v1/papers/forpaper/"

# Define the paperId
paperId = "649def34f8be52c8b66281af98ae884c09aef38b"

# Construct the full URL with the paperId as a path parameter
url = base_url + paperId

# Request the same fields as in the example request above (title and year)
query_params = {'fields': 'title,year'}

# Send a GET request to the URL with the query parameters
response = requests.get(url, params=query_params)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
  
    # Extract the list of recommended papers from the response
    recommended_papers = data.get("recommendedPapers", [])

    # Your code to work with the recommended papers list goes here
    
else:
    # Handle the error, e.g., print an error message
    print(f"Request failed with status code {response.status_code}")

Postman Request Example:

  • Response: In the API response, we receive recommendations for papers similar to the one we specified in the request (paperId representing the paper titled Construction of the Literature Graph in Semantic Scholar).

Datasets

Use Case: I want to work with Semantic Scholar data locally. How do I download your datasets?

Warning: To download a dataset, you will need to provide your API key.

Note: Semantic Scholar datasets (papers, authors, embeddings, abstracts, etc.) are grouped by releases. Each release is a snapshot in time containing a particular version of a dataset. The below steps illustrate how to find the list of available releases, the datasets available in a given release, and how to retrieve download links to those datasets.

Step 1: Find list of available releases.

https://api.semanticscholar.org/datasets/v1/release/

Python Example:

import requests

# Define base URL for datasets API
base_url = "https://api.semanticscholar.org/datasets/v1/release/"

# To get the list of available releases make a request to the base url. No additional parameters needed.
response = requests.get(base_url)

# Check response status
if response.status_code == 200:
   response_data = response.json()
   # Process and print the response data as needed
   print(response_data)

Postman Request Example:

  • Response: The response contains the list of releases available at the time the request was made.

Step 2: Find datasets available in a given release. Let's assume we want to find the datasets available in the 10/31/2023 release.

https://api.semanticscholar.org/datasets/v1/release/2023-10-31

NOTE: For this endpoint the release_id is simply the release date, which must be specified in the URL path

  • release_id (path parameter): 2023-10-31. Release ID (YYYY-MM-DD) that represents the version of the dataset you are requesting.

Python Example:

import requests

base_url = "https://api.semanticscholar.org/datasets/v1/release/"

# Get available releases
response = requests.get(base_url)

# Assume we want data from the latest release, which will correspond to the last item in the response list since releases are ordered chronologically. The last item in a list can be found using the index value computed by len() - 1. A short-hand for this is to use an index value of -1.
release_id = response.json()[-1]

# Make a request to get datasets available in the latest release
datasets_response = requests.get(base_url + release_id)

Tip: To retrieve datasets from the latest release, the release_id can also be set to “latest” instead of the actual date value. For example, the following would also be valid: https://api.semanticscholar.org/datasets/v1/release/latest

This only works for the latest release. For any earlier releases, the actual release date must be specified.

Postman Request Example:

  • Response:

Step 3: Retrieve download links for a dataset of our choice. Let's assume that we want to download the papers dataset that was returned in the response of the previous call to the datasets endpoint. The request would be structured as follows, but in this step, we will submit the request via Python (shown below), and we must use an API key.

https://api.semanticscholar.org/datasets/v1/release/2023-10-31/dataset/papers

NOTE: For this endpoint, the release_id is simply the release date (YYYY-MM-DD), which must be specified in the URL path

The request above contains the following parameters:

  • x-api-key (header): <your api key>. To download a dataset, a user must authenticate themselves via an API key. New users can request an API key.
  • release_id (path parameter): 2023-10-31. ID of the release for which you are requesting datasets.
  • dataset name (path parameter): papers. The name of the dataset you want to download.

Python Example:

import requests

base_url = "https://api.semanticscholar.org/datasets/v1/release/"

# This endpoint requires authentication via api key
api_key = "your api key goes here"
headers = {"x-api-key": api_key}

# Get available releases
response = requests.get(base_url)

# Fetch latest release id
release_id = response.json()[-1]

# Define dataset name you want to download
dataset_name = 'papers'

# Send the GET request and store the response in a variable
download_links_response = requests.get(base_url + release_id + '/dataset/' + dataset_name, headers=headers)

# Check response status
if download_links_response.status_code == 200:
   response_data = download_links_response.json()
   # Process and print the response data as needed
   print(response_data)
else:
   print(f"Request failed with status code {download_links_response.status_code}: {download_links_response.text}")

Postman Request Example:

  • Response: The response contains the data set name, description, a README with license and usage information, and temporary, pre-signed download links for the dataset files.

Complete Python Example:

import requests

# Define base URL for datasets API
base_url = "https://api.semanticscholar.org/datasets/v1/release/"

# This endpoint requires authentication via api key
api_key = "YOUR_API_KEY"
headers = {"x-api-key": api_key}

# Make the initial request to get the list of releases
response = requests.get(base_url)

if response.status_code == 200:
    # Assume we want data from the latest release, which will correspond to the last item in the response list since releases are ordered chronologically
    release_id = response.json()[-1]

    # Make a request to get datasets available in the latest release (this endpoint url is the release id appended to the base url)
    datasets_response = requests.get(base_url + release_id)

    if datasets_response.status_code == 200:
        # Fetch the datasets list from the response
        datasets = datasets_response.json()['datasets']

        # Check if the 'papers' dataset exists
        papers_dataset_exists = any(dataset.get('name') == 'papers' for dataset in datasets)

        if papers_dataset_exists:
            # Make a request to get download links for the 'papers' dataset
            dataset_name = 'papers'
            download_links_response = requests.get(base_url + release_id + '/dataset/' + dataset_name, headers=headers)

            if download_links_response.status_code == 200:
                download_links = download_links_response.json()["files"]

                # Your code to process the download links goes here

            else:
                print(f"Failed to get download links. Status code: {download_links_response.status_code}")
        else:
            print("The 'papers' dataset does not exist in the list.")
    else:
        print(f"Failed to get datasets. Status code: {datasets_response.status_code}")
else:
    print(f"Failed to get releases. Status code: {response.status_code}")

Updating Datasets with Incremental Diffs

Semantic Scholar Datasets are updated every release, and ensuring your data stays current is crucial for leveraging the latest research information. When working with datasets locally, it is possible to miss the most recent updates. Given the substantial size of these datasets, downloading a new copy with each release may not be practical. With the Incremental Diffs endpoint we can fetch a concise list of diffs between two versions of a dataset, enabling efficient updates. 

Each "diff" represents changes between two sequential releases, and contains two lists of files: an "updated" list and a "deleted" list. Records in the "updated" list need to be inserted or replaced by their primary key (usually corpus id). Records in the "deleted" list should be removed from your dataset.

Let's see an example:

Use Case: Assume our authors dataset was downloaded on 10/31/2023, but the latest available release is 11/07/2023. Let's fetch the updates to the authors dataset between 10/31 and 11/07.

https://api.semanticscholar.org/datasets/v1/diffs/2023-10-31/to/2023-11-07/authors

The request above contains the following parameters:

NOTE: Release ID is simply the release date (YYYY-MM-DD)

  • x-api-key (header): <your-api-key>. This endpoint requires users to authenticate via their Semantic Scholar API key; request one if you haven't already.
  • start_release_id (path parameter): 2023-10-31. Release ID (YYYY-MM-DD) that represents the version of your current dataset.
  • end_release_id (path parameter): 2023-11-07. Release ID (YYYY-MM-DD) that represents the version you would like to update to; it must fall after start_release_id. In this example, we want our dataset to be up to date as of 11/07/2023.
  • dataset_name (path parameter): authors. Name of the dataset you would like to update (papers, authors, tldrs, etc.).

Complete Python Example:

import requests

# Set the path parameters
start_release_id = "2023-10-31"
end_release_id = "2023-11-07"
dataset_name = "authors"

# Set the API key. For best practice, store and retrieve API keys via environment variables
api_key = "<your-api-key>"

# Construct the complete endpoint URL with the path parameters
url = f"https://api.semanticscholar.org/datasets/v1/diffs/{start_release_id}/to/{end_release_id}/{dataset_name}"

# Make sure to include your api key in a header
headers = {"x-api-key": api_key}

# Make the API request
response = requests.get(url, headers=headers)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Extract the diffs from the response
    diffs = response.json()['diffs']

    #Your code to work with the diffs goes here
    
else:
    # Handle potential errors or non-200 responses
    print(f"Request failed with status code {response.status_code}: {response.text}")

Response: Our response returns a diff containing two sets of files. Each file in the update_files list contains records that must be inserted or replaced (by primary key) in your current dataset. Each file in the delete_files list contains records that must be removed from your current dataset.

NOTE: Each "diff" represents changes between two sequential releases. Since the next release after 10/31/23 was on 11/07/23, our API response only contained a single diff object. If we wanted to instead update our 10/31 dataset to the following 11/14/23 release, we would have received two diff objects. The first diff would contain a set of files representing the changes from 10/31 to 11/07, and the second diff would contain changes from 11/07 to 11/14.

Suggestions for Working with Downloaded Datasets

Explore the following sections for inspiration on leveraging your downloaded data. Please be aware that the tools, libraries, and frameworks mentioned below are not a comprehensive list, and their performance will vary based on the size of your data and your machine's capabilities. If you are unsure which tool is best suited to your needs, we have included some guidelines to help you decide. They are all external tools with no affiliation to Semantic Scholar, offered simply as suggestions to facilitate your initial exploration of our data.

NOTE: All Semantic Scholar Datasets are delivered in JSON format

Exploring Datasets

Command Line

The more command:

Perhaps the simplest mechanism to view your downloaded data without installing any external tool or library is via the command line through commands like more. This command is used to display the contents of a file in a paginated manner and lets you page through the contents of your downloaded file in chunks without loading up the entire dataset. It shows one screen of text at a time and allows you to navigate through the file using the following keyboard commands:

  • Spacebar: Move forward one screen.
  • Enter: Move forward one line.

Example: Assume we have downloaded the papers dataset, and renamed the file to “papersDataset”. Using the more command would produce the first ‘screen’ of text. To view the next screen, we could press Spacebar. To only view the next line of text we could press Enter.

  • Command: more papersDataset
  • Output:
[Screenshot: output of the more command]

The jq tool

'jq' is a lightweight and flexible command-line tool for exploring and manipulating JSON data. With jq, you can easily view formatted json output, select and view specific fields, filter data based on conditions, and more.

Example: Let's assume we have downloaded the papers dataset and named our file "PapersDataset". The jq command to format output is jq '.' <file-name>. Let's use jq to view our formatted paper data:

  • Command: jq . PapersDataset
  • Output:
[Screenshot: formatted output from jq]

Python Pandas Library

Pandas is a powerful and easy-to-use data analysis and manipulation library available in Python. Using Pandas, you can effortlessly import, clean, and explore your data. One of the key structures in Pandas is a DataFrame, which can be thought of as a table of information, akin to a spreadsheet with rows and columns. Each column has a name, similar to a header in Excel, and each row represents a set of related data. With a DataFrame, tasks like sorting, filtering, and analyzing your data become straightforward. In the following sections, we will see how to leverage basic Pandas functions to view and explore our Semantic Scholar data in a DataFrame.

The head function: In Pandas you can use the head( ) function to view the initial few rows of your dataframe.

Python Example:

import pandas as pd

# Read JSON file into Pandas DataFrame. The ‘lines’ parameter indicates that our file contains one json object per line
df = pd.read_json('publication venues dataset', lines=True)

# Print the first few rows of the DataFrame
print(df.head())

Output:

[Screenshot: output of df.head()]

NOTE: You will notice that this is a very wide dataframe, where each column represents a field in our json object (e.g. id, name, issn, url, etc.). When the dataframe is too wide to fit the display, pandas truncates the output and shows only the first and last few columns. To view all the columns, you can configure the pandas display settings before printing your output, with pd.set_option('display.max_columns', None)

The count function: We can use the count( ) function to count the number of rows that have data in them (e.g. not null). This can be useful to test the quality of your dataset.

Python Example:

# Display count of non-null values for each column
print(df.count())

Output:

[Screenshot: output of df.count()]

Apache Spark (via Python)

Apache Spark is a fast and powerful processing engine that can analyze large-scale data faster than traditional methods via in-memory caching and optimized query execution. Spark offers APIs for a variety of programming languages, so you can utilize its capabilities regardless of the language you are coding in. In our examples we will showcase the Spark Python API, commonly known as PySpark. Let’s see how to use PySpark functions to start viewing and exploring our data:

The show function: PySpark’s show( ) function is similar to print( ) or head( ) in pandas and will display the first few rows of data. Let’s load up our publication venues data into a PySpark DataFrame and see how it looks

Python Example:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("dataset_exploration").getOrCreate()

# Read the dataset file named 'publication venues dataset' into a PySpark DataFrame. Depending on the directory you are working from you may need to include the complete file path.
df = spark.read.json("publication venues dataset")

# Display the first few rows
df.show()

Output:

[Screenshot: output of df.show()]

The printSchema function: PySpark offers a handy printSchema( ) function if you want to explore the structure of your data

Python Example:

# Display the object schema

df.printSchema()

Output:

[Screenshot: output of df.printSchema()]

MongoDB

MongoDB is a fast and flexible database tool built for exploring and analyzing large scale datasets. Think of it as a robust digital warehouse where you can efficiently organize, store, and retrieve large volumes of data. In addition, MongoDB is a NoSQL database that stores data in a flexible schema-less format, scales horizontally, supports various data models, and is optimized for performance. MongoDB offers both on-premise and fully managed cloud options (Atlas) and can be accessed via the Mongo shell or a GUI (known as Mongo Compass). You can check out our guide on setting up Mongo if you need help getting started. In the example below, we have imported a papers dataset into a Mongo Atlas cluster and show you how to leverage the Mongo Compass GUI to view and explore your data.

Once you have imported your data, you can view it via Compass as shown in the example below. You can leverage the Compass documentation to discover all its capabilities. We have listed some key items on the user interface to get you acquainted:

  • Data can be viewed in the default list view (shown below), object view, or table view by toggling the button on the upper right hand corner. In the list view, each ‘card’ displays a single record, or in this case a paper object. Notice that MongoDB appends its own ID, known as ObjectId to each record.
  • You can filter and analyze your data using the filter pane at the top of the screen, and click on the Explain button to see how your filters were applied to obtain your result set. Note that since Mongo is a NoSQL database, it has a slightly different query language from SQL to use for filtering and manipulation.
  • The default tab is the Documents tab where you can view and scroll through your data. You can also switch to the Aggregations tab to transform, filter, group, and perform aggregate operations on your dataset. In the Schema tab, Mongo provides an analysis of the schema of your dataset. When you click on the Indexes tab, you will find that the default index for searches is Mongo’s ObjectId. If you believe you will perform frequent searches using another attribute (e.g. corpusid), you can add an additional index to optimize performance.
  • You can always add more data to your dataset via the green Add Data button right under the filter query bar
[Screenshot: MongoDB Compass interface]
Setting Up MongoDB

You have the option of installing MongoDB onto your machine, or using their managed database-as-a-service option in the cloud, known as Atlas. Once you set up your database, you can download the GUI tool (MongoDB Compass) and connect it to your database to visually interact with your data. If you are new to MongoDB and just want to explore, you can set up a free cluster on Atlas in a few easy steps:

Set Up a Free Cluster on MongoDB Atlas:

  1. Sign Up/Login:
    1.1. Visit the MongoDB Atlas website.
    1.2. Sign up for a new account or log in if you already have one.
  2. Create a New Cluster:
    2.1. After logging in, click on "Build a Cluster."
    2.2. Choose the free tier (M0) or another desired plan.
    2.3. Select your preferred cloud provider and region.
  3. Configure Cluster:
    3.1. Set up additional configurations, such as cluster name and cluster tier.
    3.2. Click "Create Cluster" to initiate the cluster deployment. It may take a few minutes.

Connect to MongoDB Compass:

  1. Download and Install MongoDB Compass:
    1.1. Download MongoDB Compass from the official website.
    1.2. Install the Compass application on your computer.
  2. Retrieve Connection String:
    2.1. In MongoDB Atlas, go to the "Clusters" section.
    2.2. Click on "Connect" for your cluster.
    2.3. Choose "Connect Your Application."
    2.4. Copy the connection string.
  3. Connect Compass to Atlas:
    3.1. Open MongoDB Compass.
    3.2. Paste the connection string in the connection dialog.
    3.3. Modify the username, password, and database name if needed.
    3.4. Click "Connect."

Import Data:

  1. Create a Database and Collection:
    1.1. In MongoDB Compass, navigate to the "Database" tab.
    1.2. Create a new database and collection by clicking "Create Database" and "Add My Own Data."
  2. Import Data:
    2.1. In the new collection, click "Add Data" and choose "Import File."
    2.2. Select your JSON or CSV file containing the data.
    2.3. Map fields if necessary and click "Import."
  3. Verify Data:
    3.1. Explore the imported data in MongoDB Compass to ensure it's displayed correctly.

Now, you have successfully set up a free cluster on MongoDB Atlas, connected MongoDB Compass to the cluster, and imported data into your MongoDB database. This process allows you to start working with your data using MongoDB's powerful tools.

TIP: We recommend checking the Mongo website for the latest installation instructions and FAQ in case you run into any issues.

Filtering and Analyzing Datasets

Python Pandas

Filtering: We can filter our data by specifying conditions. For example, let's assume we have loaded our authors dataset into a dataframe and want to filter for authors who have written at least 5 papers and been cited at least 10 times. After applying this filter, let's select and display only the authorid, name, papercount, and citationcount fields.

Python Example:

# Filter the dataframe to authors who have written at least 5 papers and have been cited at least 10 times
df = df[(df.papercount >= 5) & (df.citationcount >= 10)]

# Select and print a subset of the columns in our filtered dataframe
print(df[['authorid', 'name', 'papercount', 'citationcount']])

Output:

[Screenshot: filtered DataFrame output]

Sorting: Pandas offers a variety of sorting functions to organize our data. In the example below, we use the sort_values( ) function to sort the dataframe by the "name" column and display only the authorid and name columns. The default is ascending order, so in this case our output will list authors in alphabetical order.

Python Example:

# Let's sort our authors in alphabetical order
df = df.sort_values(by='name')

# Display only the authorid and name columns
print(df[['authorid', 'name']])

Output:

[Screenshot: sorted DataFrame output]

Checking for missing values: Let’s say we want to assess the quality of our data by checking for missing (null) values. We can count how many missing values we have by using the isnull() and sum() functions.

Python Example:

# Count and print the number of missing values for each author attribute
print(df.isnull().sum())

Output:

[Screenshot: missing-value counts per column]

Apache Spark (via Python)

Summary Statistics: PySpark offers a handy describe( ) function to delve into and display summary statistics for the specified columns in our dataset. In this example we describe the papercount, citationcount, and hindex attributes of our author data. In the results we can see the average papercount of authors in this dataset, along with their average citationcount, hindex, and other common statistical measures.

Python Example:

df.describe(["papercount", "citationcount", "hindex"]).show()

Output:

[Screenshot: output of describe().show()]

Sorting: Let's try sorting our data using PySpark. We can call the orderBy( ) function and specify the column we want to sort by, in this case papercount. We also call the desc( ) function to sort in descending order (from highest to lowest papercount). Finally, we select only the authorid, name, and papercount columns and display the top 3 records.

Python Example:

from pyspark.sql.functions import col

# Sort by papercount in descending order, then display the top 3 records
df = df.orderBy(col("papercount").desc())
df.select("authorid", "name", "papercount").show(3)

Output:

[Screenshot: top 3 authors by papercount]

MongoDB

Querying, Filtering, and Sorting in Mongo: Using the Mongo Compass GUI we can filter and sort our dataset per our needs. For example, let's see which papers in Medicine were cited the most in the last 5 years, excluding any papers with fewer than 50 citations. In the project field we choose which fields we would like to display in the output, and we sort in descending order by citationcount.

Example:

{
   's2fieldsofstudy.category': 'Medicine',
   'citationcount': {
       '$gte': 50
   },
   'year': {
       '$gte': 2019,
       '$lte': 2023
   }
}
[Screenshot: filter applied in MongoDB Compass]

Output:

[Screenshot: query results in MongoDB Compass]
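
The same query can also be run programmatically with the PyMongo driver. A minimal sketch, assuming a local connection string and that the papers were imported into a database called 'semanticscholar' with a collection called 'papers' (all three are placeholders):

from pymongo import MongoClient

# Connection string, database name, and collection name are placeholders
client = MongoClient('mongodb://localhost:27017')
papers = client['semanticscholar']['papers']

query = {
    's2fieldsofstudy.category': 'Medicine',
    'citationcount': {'$gte': 50},
    'year': {'$gte': 2019, '$lte': 2023}
}
projection = {'title': 1, 'year': 1, 'citationcount': 1, '_id': 0}

# Sort by citation count in descending order and show the top 10 matches
for paper in papers.find(query, projection).sort('citationcount', -1).limit(10):
    print(paper)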

Command Line (jq)

Let's say you want to filter publication venues that are only journals. You can use jq to filter json objects by a condition, as shown below:

Command: jq '. | select(has("type") and .type == "journal")' publicationVenues

Output:

[Screenshot: filtered output from jq]

Working with Multiple Datasets

Oftentimes we may want to combine information from multiple datasets to gather insights. Consider the following example:

Use case: Let’s delve into a publication venue, such as the “Journal of the Geological Society”, and learn more about the papers that have been published in it. Perhaps we would like to gather the names of authors who have published a paper in this journal, but only those whose papers have been cited at least 15 times. We can combine information from the publication venues dataset and the papers dataset to find the authors that meet this criteria. To do this, we can load our datasets into pandas dataframes and retrieve the publication venue ID associated with the “Journal of the Geological Society” from the publication venues dataset. Then we can search the papers dataset for papers that have a citationcount of at least 15 and are tagged to that venue ID. Finally we can collect the names of authors associated with each of those papers that met our criteria. From this point you can explore other possibilities, such as viewing other papers published by those authors, checking out their homepage on the Semantic Scholar website, and more.

Python Example:

import pandas as pd

# Create Pandas DataFrames
papers_df = pd.read_json('papersDataset', lines=True)
venues_df = pd.read_json('publicationVenuesDataset', lines=True)

# Find the venue id for our publication venue of interest - "Journal of the Geological Society"
publication_venue_id = venues_df.loc[venues_df["name"] == "Journal of the Geological Society", "id"].values[0]

# Filter papers based on the venue id with a citation count of at least 15
filtered_geology_papers = papers_df.loc[
    (papers_df["publicationvenueid"] == publication_venue_id) & (papers_df["citationcount"] >= 15)
]

# Traverse the list of authors for each paper that met our filter criteria and collect their names into a list
author_names = []
for authors_list in filtered_geology_papers["authors"]:
    author_names.extend(author["name"] for author in authors_list)

# Print the resulting author names, with each name on a new line
print("Authors associated with papers from the Journal of the Geological Society:")
print(*author_names, sep="\n")

Output:

[Screenshot: list of author names]

Which Tool Should I Use?

You may be wondering which of the tools we discussed is right for you. The answer depends on the size of the datasets you are working with, the complexity of the analysis and operations you are looking to perform, and your machine's capabilities, including processing power and available memory.

The Python Pandas library is typically more useful when your data can fit into memory and you are performing relatively simple operations like filtering, sorting, and basic aggregations. You can also utilize the optional chunksize parameter to read and process your data in smaller, more manageable chunks.
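
As a rough sketch of the chunksize approach (assuming a line-delimited JSON file named 'papersDataset', as in the earlier examples):

import pandas as pd

filtered_chunks = []

# Read the file 100,000 lines at a time instead of loading it all at once
for chunk in pd.read_json('papersDataset', lines=True, chunksize=100_000):
    # Keep only the rows we care about from each chunk
    filtered_chunks.append(chunk[chunk['citationcount'] >= 15])

result = pd.concat(filtered_chunks)
print(result.head())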

If your dataset size is significantly larger than available memory, you can use parallel computing tools such as the Python Dask library or Apache Spark.

  • Dask integrates easily with Pandas and can handle complex, larger-than-memory computations by efficiently distributing work across multiple cores (see the sketch after this list).
  • For very large datasets that exceed the capacity of a single machine, Apache Spark is a distributed computing framework that can scale horizontally across a cluster of machines.
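
For instance, a Dask version of the earlier Pandas filtering might look like the sketch below (the file name and columns are illustrative; Dask builds the computation lazily and only runs it when .compute() is called):

import dask.dataframe as dd

# Lazily read one or more line-delimited JSON files (a glob pattern such as 'papers-part*' also works)
df = dd.read_json('papersDataset', lines=True)

# Build the filter lazily, then trigger the computation with .compute()
top_cited = df[df['citationcount'] >= 1000][['corpusid', 'title', 'citationcount']].compute()
print(top_cited.head())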

If your data is too large for in-memory processing, you can also turn to database systems like MongoDB, PostgreSQL, and others that are specifically designed to handle large datasets. Many of these tools offer cloud-based solutions that can scale easily according to your needs.
