The guide you are reading contains:

  • a high-level introduction to the Fatcat catalog and software
  • a bibliographic style guide for editors, also useful for understanding metadata found in the catalog
  • technical details and guidance for use of the catalog's public REST API, for developers building bots, services, or contributing to the server software
  • policies and licensing details for all contributors and downstream users of the catalog

What is Fatcat?

Fatcat is an open bibliographic catalog of written works. The scope of works is somewhat flexible, with a focus on published research outputs like journal articles, pre-prints, and conference proceedings. Records are collaboratively editable, versioned, available in bulk form, and include URL-agnostic file-level metadata.

Both the Fatcat software and the metadata stored in the service are free (in both the libre and gratis sense) for others to share, reuse, fork, or extend. See Policies for licensing details, and Sources for attribution of the foundational metadata corpuses we build on top of.

Fatcat is currently used internally at the Internet Archive, but interested folks are welcome to contribute to its design and development. We hope to ultimately crowd-source corrections and additions to bibliographic metadata, and to receive direct automated feeds of new content.

You can contact the Archive by email at webservices@archive.org, or the author directly at bnewbold@archive.org.

Editing Quickstart

This tutorial describes how to make edits to the Fatcat catalog using the web interface. We will add a new file to an existing release, then update the release to point at a different container. You can follow these directions on either the QA (NOTE: QA not available as of Spring 2021) or production public catalogs. You will:

  • create an editor account and log-in
  • create a new file entity
  • update an existing release entity
  • submit editgroup for review

First create an editor account and log-in. If you don't have an account with any of the existing federated log-in services (eg, Wikipedia, ORCID, GitHub), you can create a free Internet Archive account, confirm your email, and then log-in to Fatcat using that. You should see your username in the upper right-hand corner of every page when you are successfully logged in.

Next find the fatcat identifier of the release for the paper we want to add a file to. You can search by title, or look up a paper by an identifier (such as a DOI or arXiv ID). If the release you are looking for doesn't exist yet, you'll need to create a new one. All of these actions are linked from the Fatcat front page for each entity type.

The release fatcat identifier is the garbled looking string like hsmo6p4smrganpb3fndaj2lon4 which you can find under the title of the paper's entity page, and also in the URL. You'll need this identifier to link the file to the release.

Before creating a new file entity (or any entity for that matter), check that there isn't already an entity referencing the exact same file. Download the file (eg, PDF) that you want to add to your local computer, and calculate the SHA-1 hash of the file using a tool like sha1sum on the command line. If you aren't familiar with command line tools, you can upload to a free online service. The SHA-1 hash will look like de9aefc4522b385121e72faaee75bda9fbb8bf6e, and you can do a file lookup. If a file already exists, you could edit it to add new URLs (locations), or add/update any release links.
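If you would rather script this step than use sha1sum, the hash can be computed with the Python standard library. This is only a minimal sketch: the function name file_sha1_and_size is made up for illustration and is not part of Fatcat. It also returns the size in bytes, which the file form asks for alongside the SHA-1.

```python
import hashlib


def file_sha1_and_size(path, chunk_size=1 << 20):
    """Return (sha1_hex, size_in_bytes) for the file at `path`.

    The file is read in chunks so that large PDFs do not need to fit in memory.
    """
    digest = hashlib.sha1()
    size = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size
```

Calling file_sha1_and_size("paper.pdf") on a downloaded file gives the 40-character lower-case hex string to use for the file lookup, plus the byte count.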

Assuming a file entity doesn't already exist, go to create file. We will want to start a new "editgroup" for these changes. If you don't have any editgroups in progress, you can just enter a description sentence and a new one will be created; if you did have edits in progress, you'll need to select the "create new editgroup" option from the drop-down of your existing editgroups.

Enter the basic file metadata in the fields provided. The red starred fields are required (size in bytes and SHA-1). Add a URL on the public web where the file can be found. It's best if PDFs are uploaded to repositories (eg, Zenodo) or hosted on the publisher's website. A second archival location can be added (eg, using the Wayback Machine's "save page now" feature), or you could skip this and wait for a bot to verify and archive the URL later. The left drop-down menu lets you set the "type" of each URL. Add the release identifier you found earlier to the "Releases" list.

Add a one-sentence description of your change, and submit the form. You will be redirected to a provisional ("work in progress") view of the new entity. Edits are not immediately merged into the catalog proper; they first need to be "submitted" and then accepted (eg, by a human moderator or robot).

Let's add a second edit to the same editgroup before continuing. The new file view should have a link to the release entity; follow that link, then click the "edit" button (either the tab or the blue link at the bottom of the infobox). This time, the most recent editgroup should already be selected, so you don't need to enter a description at the top. If there are any problems with basic metadata, go ahead and fix them, but otherwise skip down to the "Container" section and update the fatcat identifier ("FCID") to point to the correct journal. You can look up journals by ISSN-L, or search by title. Add a short description of your change ("Updated journal to XYZ") and then submit.

You now have two edits in your editgroup. There should be links to the editgroup itself from the "work-in-progress" pages, or you can find all your editgroups from the drop-down link in the upper right-hand corner of every page (your username, then "Edit History"). The editgroup page shows all the entities created, updated, or deleted, and allows you to make tweaks (re-edit) or remove changes. If the release/container update you made was bogus (just as a learning exercise), you could remove it here. It's a good practice to group related edits into the same editgroup, but only up to 50 or so edits at a time (more than that becomes difficult to review).

If things look good, click the "submit" button on the editgroup page. This will mark your changes as "ready for review", and they will show up on the global reviewable editgroups list. If you change your mind, you can "unsubmit" the editgroup and make more changes. Humans and bots can make annotations to editgroups, recommending changes. At the current time there are no email or other update notifications, so you need to check in on annotations and other status manually.

When your changes have been reviewed, a moderator will "accept" them, and the entities will be updated in the catalog. Every accepted editgroup ends up in the changelog.

And then you're done, thanks for your contribution!

High-Level Overview

This section gives an introduction to:

  • the goals of the project, and how it relates to the rest of the Open Access and archival ecosystem
  • how catalog data is represented as entities and revisions with full edit history, and how entities are referred to and cross-referenced with identifiers
  • how humans and bots propose changes to the catalog, and how these changes are reviewed
  • the major sources of bulk and continuously updated metadata that form the foundation of the catalog
  • a rough sketch of the software back-end, database, and libraries
  • roadmap for near-future work

Project Goals and Ecosystem Niche

The Internet Archive has two primary use cases for Fatcat:

  • Tracking the "completeness" of our holdings against all known published works. In particular, this allows us to monitor progress, identify gaps, and prioritize further collection work.
  • Providing a public-facing catalog and access mechanism for our open access holdings.

In the larger ecosystem, Fatcat could also provide:

  • A work-level (as opposed to title-level) archival dashboard: what fraction of all published works are preserved in archives? KBART, CLOCKSS, Portico, and other preservation networks don't provide granular metadata
  • A collaborative, independent, non-commercial, fully-open, field-agnostic, "completeness"-oriented catalog of scholarly metadata
  • Unified (centralized) foundation for discovery and access across repositories and archives: discovery projects can focus on user experience instead of building their own catalog from scratch
  • Research corpus for meta-science, with an emphasis on availability and reproducibility (metadata corpus itself is open access, and file-level hashes control for content drift)
  • Foundational infrastructure for distributed digital preservation
  • On-ramp for non-traditional digital works (web-native and "grey literature") into the scholarly web


What types of works should be included in the catalog?

The goal is to capture the "scholarly web": the graph of written works that cite other works. Any work that is both cited more than once and cites more than one other work in the catalog is likely to be in scope. "Leaf nodes" and small islands of intra-cited works may or may not be in scope.

Fatcat does not include any fulltext content itself, even for clearly licensed open access works, but does have verified hyperlinks to fulltext content, and includes file-level metadata (hashes and fingerprints) to help identify content from any source. File-level URLs with context ("repository", "publisher", "webarchive") should make Fatcat more useful for both humans and machines to quickly access fulltext content of a given mimetype than existing redirect or landing page systems. So another factor in deciding scope is whether a work has "digital fixity" and can be contained in immutable files or can be captured by web archives.

References and Previous Work

The closest overall analog of Fatcat is MusicBrainz, a collaboratively edited music database. Open Library is a very similar existing service, which exclusively contains book metadata.

Wikidata seems to be the most successful and actively edited/developed open bibliographic database at this time (early 2018), including the wikicite conference and related Wikimedia/Wikipedia projects. Wikidata is a general purpose semantic database of entities, facts, and relationships; bibliographic metadata has become a large fraction of all content in recent years. The focus there seems to be linking knowledge (statements) to specific sources unambiguously. Potential advantages Fatcat has are a focus on a specific scope (not a general-purpose database of entities) and a goal of completeness (capturing as many works and relationships as rapidly as possible). With so much overlap, the two efforts might merge in the future.

The technical design of Fatcat is loosely inspired by the git branch/tag/commit/tree architecture, and specifically inspired by Oliver Charles' "New Edit System" blog posts from 2012.

There are a number of proprietary, for-profit bibliographic databases, including Web of Science, Google Scholar, Microsoft Academic Graph, aminer, Scopus, and Dimensions. There are excellent field-limited databases like dblp, MEDLINE, and Semantic Scholar. Large, general-purpose databases also exist that are not directly user-editable, including the OpenCitations corpus, CORE, BASE, and Crossref. We do not know of any large (more than 60 million works), open (bulk-downloadable with permissive or no license), field-agnostic, user-editable corpus of scholarly publication bibliographic metadata.

Further Reading

"From ISIS to CouchDB: Databases and Data Models for Bibliographic Records" by Luciano G. Ramalho. code4lib, 2013. https://journal.code4lib.org/articles/4893

"Representing bibliographic data in JSON". github README file, 2017. https://github.com/rdmpage/bibliographic-metadata-json

"Citation Style Language", https://citationstyles.org/

"Functional Requirements for Bibliographic Records", Wikipedia article, https://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records

OpenCitations and I4OC: http://opencitations.net/, https://i4oc.org/

Data Model

Entity Types and Ontology

Loosely following "Functional Requirements for Bibliographic Records" (FRBR), but removing the "manifestation" abstraction, and favoring files (digital artifacts) over physical items, the primary bibliographic entity types are:

  • work: representing an abstract unit of creative output. Does not contain any metadata itself; used only to group release entities. For example, a journal article could be posted as a pre-print, published on a journal website, translated into multiple languages, and then re-published (with minimal changes) as a book chapter; these would all be variants of the same work.
  • release: a specific "release" or "publicly published" version of a work. Contains traditional bibliographic metadata (title, date of publication, media type, language, etc). Has relationships to other entities:
    • child of a single work (required)
    • multiple creator entities as "contributors" (authors, editors)
    • outbound references to multiple other release entities
    • member of a single container, for example a journal or book series
  • file: a single concrete, fixed digital artifact; a manifestation of one or more releases. Machine-verifiable metadata includes file hashes, size, and detected file format. Verified URLs link to locations on the open web where this file can be found or has been archived. Has relationships:
    • multiple release entities that this file is a complete manifestation of (almost always a single release)
  • fileset: a list of multiple concrete files, together forming a complete release manifestation. Primarily intended for datasets and supplementary materials; could also contain a paper "package" (source file and figures).
  • webcapture: a single snapshot (point in time) of a webpage or small website (multiple pages) which are a complete manifestation of a release. Not a landing page or page referencing the release.
  • creator: persona (pseudonym, group, or specific human name) that has contributed to one or more releases. Not necessarily one-to-one with a human person.
  • container (aka "venue", "serial", "title"): a grouping of releases from a single publisher.

Note that, compared to many similar bibliographic ontologies, the current one does not have entities to represent:

  • physical artifacts, either generically or specific copies
  • funding sources
  • publishing entities
  • "events at a time and place"

Each entity type has its own relations and fields (captured in a schema), but there are also generic operations and fields common across all entities. The API for creating, updating, querying, and inspecting entities is roughly the same regardless of type.

Identifiers and Revisions

A specific version of any entity in the catalog is called a "revision". Revisions are generally immutable (do not change and are not editable), and are not normally referred to directly. Instead, persistent "fatcat identifiers" can be created, which "point to" a single revision at a time. This distinction means that entities referred to by an identifier can change over time (as metadata is corrected and expanded). Revision objects do not "point" back to specific identifiers, so they are not the same as a simple "version number" for an identifier.

Identifiers also have the ability to be merged (by redirecting one identifier to another) and "deleted" (by pointing the identifier to no revision at all). All changes to identifiers are captured as an "edit" object. Edit history can be fetched and inspected on a per-identifier basis, and any changes can easily be reverted (even merges/redirects and "deletion").
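The relationship between identifiers, revisions, and edits can be sketched as a small in-memory model. This is a toy illustration under our own assumptions, not the Fatcat implementation: the class and method names (Catalog, Edit, update, merge, delete, revert_last) are hypothetical, and a real revert would itself go through an editgroup.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Edit:
    """One recorded change to an identifier, with enough state to revert it."""
    ident: str
    prev_rev: Optional[str]
    prev_redirect: Optional[str]
    new_rev: Optional[str]
    new_redirect: Optional[str]


class Catalog:
    """Toy catalog: identifiers are mutable pointers to immutable revisions."""

    def __init__(self) -> None:
        self.revisions: Dict[str, dict] = {}           # rev id -> metadata (write-once)
        self.current: Dict[str, Optional[str]] = {}    # ident -> rev id, None if deleted
        self.redirects: Dict[str, Optional[str]] = {}  # ident -> target ident, if merged
        self.edits: List[Edit] = []                    # every identifier change, in order

    def _point(self, ident, rev, redirect):
        # Record the old and new pointer values before changing anything.
        self.edits.append(
            Edit(ident, self.current.get(ident), self.redirects.get(ident), rev, redirect)
        )
        self.current[ident] = rev
        self.redirects[ident] = redirect

    def update(self, ident, rev_id, metadata):
        """Store a new revision and point the identifier at it."""
        self.revisions[rev_id] = dict(metadata)
        self._point(ident, rev_id, None)

    def merge(self, ident, target):
        """Redirect one identifier to another."""
        self._point(ident, self.current.get(target), target)

    def delete(self, ident):
        """'Delete' by pointing the identifier at no revision at all."""
        self._point(ident, None, None)

    def revert_last(self, ident):
        """Undo the most recent change, which is itself recorded as an edit."""
        last = self.history(ident)[-1]
        self._point(ident, last.prev_rev, last.prev_redirect)

    def history(self, ident):
        return [edit for edit in self.edits if edit.ident == ident]
```

Note that revisions never reference an identifier here, and that deletes and merges are just pointer changes, which is what makes them revertible.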

"Work in progress" or "proposed" updates are staged as edit objects without updating the identifiers themselves.

Controlled Vocabularies

Some individual fields have additional constraints, either in the form of pattern validation ("values must be upper case, contain only certain characters"), or membership in a fixed set of values. These may include:

  • license and open access status
  • work "types" (article vs. book chapter vs. proceeding, etc)
  • contributor types (author, translator, illustrator, etc)
  • human languages
  • identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifiers themselves)

Other fixed-set "vocabularies" become too large to easily maintain or express in code. These could be added to the backend databases, or be enforced by bots (instead of the system itself). These mostly include externally-registered identifiers or types, such as:

  • file mimetypes
  • identifiers themselves (DOI, ORCID, etc), by checking for registration against canonical APIs and databases

Global Edit Changelog

As part of the process of "accepting" an edit group, a row is written to an immutable, append-only table (which internally is a SQL table) documenting each identifier change. This changelog establishes a monotonically increasing version number for the entire corpus, and should make interaction with other systems easier (eg, search engines, replicated databases, alternative storage backends, notification frameworks, etc.).
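As a rough illustration of why a monotonically increasing changelog helps downstream systems, here is a sketch using an in-memory SQLite database. The table and column names are invented for this example and do not reflect the actual Fatcat schema; the point is only that a consumer can remember the last index it processed and ask for everything after it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE changelog (
        idx INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonically increasing
        editgroup_id TEXT NOT NULL,
        accepted_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)


def accept_editgroup(conn, editgroup_id):
    """Append one changelog row for an accepted editgroup; return its index."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO changelog (editgroup_id) VALUES (?)", (editgroup_id,)
        )
        return cursor.lastrowid


def entries_since(conn, last_seen_idx):
    """Rows that a downstream consumer has not processed yet, in order."""
    return conn.execute(
        "SELECT idx, editgroup_id FROM changelog WHERE idx > ? ORDER BY idx",
        (last_seen_idx,),
    ).fetchall()
```

A search indexer or replica would store the highest idx it has handled and poll entries_since with that value.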


Basic Editing Workflow and Bots

Both human editors and bots should have edits go through the same API, with humans using either the default web interface, client software, or third-party integrations.

The normal workflow is to create edits (or updates, merges, deletions) on individual entities. Individual changes are bundled into an "edit group" of related edits (eg, correcting authorship info for multiple works related to a single author). When ready, the editor "submits" the edit group for review. During the review period, human editors vote and bots can perform automated checks. During this period the editor can make tweaks if necessary. After some fixed time period (one week?) with no changes and no blocking issues, the edit group would be accepted if no merge conflicts have been created by other edits to the same entities. This process balances editing labor (reviews are easy, but optional) against quality (cool-down period makes it easier to detect and prevent spam or out-of-control bots). More sophisticated roles and permissions could allow certain humans and bots to push through edits more rapidly (eg, importing new works from a publisher API).

Bots need to be tuned to have appropriate edit group sizes (eg, daily batches, instead of millions of works in a single edit) to make human QA review and reverts manageable.
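One simple way for a bot to keep edit groups reviewable is to cap the number of edits per group. The helper below is a generic sketch, not taken from any Fatcat bot; the name batched and the idea of a fixed size cap are our own assumptions (a bot could equally batch by day, as suggested above).

```python
from itertools import islice


def batched(records, batch_size):
    """Yield lists of at most `batch_size` records from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

A bot would then open one edit group per yielded batch, for example `for batch in batched(new_records, 50): ...`, so that a bad import can be reverted one group at a time.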

Data provenance and source references are captured in the edit metadata, instead of being encoded in the entity data model itself. In the case of importing external databases, the expectation is that special-purpose bot accounts are used, and that they tag timestamps and external identifiers in the edit metadata. Human editors can leave edit messages to clarify their sources.

A style guide and discussion forum are intended to be hosted as separate stand-alone services for editors to propose projects and debate process or scope changes. These services should have unified accounts and logins (OAuth?) for consistent account IDs across all services.


The core metadata bootstrap sources, by entity type, are:

  • releases: Crossref metadata, with DOIs as the primary identifier, and PubMed (central), Wikidata, and CORE identifiers cross-referenced
  • containers: munged metadata from the DOAJ, ROAD, and Norwegian journal list, with ISSN-Ls as the primary identifier. ISSN provides an "ISSN to ISSN-L" mapping to normalize electronic and print ISSN numbers.
  • creators: ORCID metadata and identifier.

Initial file metadata and matches (file-to-release) come from earlier Internet Archive matching efforts, and in particular efforts to extract bibliographic metadata from PDFs (using GROBID) and fuzzy match (with conservative settings) to Crossref metadata.

The intent is to continuously ingest and merge metadata from a small number of large (~2-3 million or more records) general-purpose aggregators and catalogs in a centralized fashion, using bots, and then support volunteers and organizations in writing bots to merge high-quality metadata from field or institution-specific catalogs.

Provenance information (where the metadata comes from, or who "makes specific claims") is stored in edit metadata in the data model. Value-level attribution can be achieved by looking at the full edit history for an entity as a series of patches.


The canonical backend datastore exposes a microservice-like HTTP API, which could be extended with gRPC or GraphQL interfaces. The initial datastore is a transactional SQL database, but this implementation detail is abstracted by the API.

As little "application logic" as possible should be embedded in this back-end; as much as possible would be pushed to bots which could be authored and operated by anybody. A separate web interface project talks to the API backend and can be developed more rapidly with less concern about data loss or corruption.

A cronjob will create periodic database dumps, both in "full" form (all tables and all edit history, removing only authentication credentials) and "flattened" form (with only the most recent version of each entity).

One design goal is to be linked-data/RDF/JSON-LD/semantic-web "compatible", but not necessarily "first". It should be possible to export the database in a relatively clean RDF form, and to fetch data in a variety of formats, but internally Fatcat is not backed by a triple-store, and is not tied to any specific third-party ontology or schema.

Microservice daemons should be able to proxy between the primary API and standard protocols like ResourceSync and OAI-PMH, and third party bots can ingest or synchronize the database in those formats.

Fatcat Identifiers

Fatcat identifiers are semantically meaningless fixed-length random numbers, usually represented in case-insensitive base32 format. Each entity type has its own identifier namespace.

128-bit (UUID size) identifiers encode as 26 characters (but note that not all such strings decode to valid UUIDs), and in the backend can be serialized in UUID columns:


In comparison, 96-bit identifiers would have 20 characters and look like:


and 64-bit:


Fatcat identifiers can be used to interlink between databases, but are explicitly not intended to supplant DOIs, ISBNs, handles, ARKs, and other "registered" persistent identifiers for general use.
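The character counts quoted above follow from base32 carrying 5 bits per character: 128 bits need 26 characters (130 bits of capacity, which is why not every 26-character string maps back to a UUID), 96 bits need 20, and 64 bits need 13. A minimal sketch with the Python standard library, assuming the standard RFC 4648 alphabet with padding stripped and lower-case output (the helper names are ours, not Fatcat's):

```python
import base64
import uuid


def encode_ident(raw: bytes) -> str:
    """Encode raw bytes as lower-case base32 without '=' padding."""
    return base64.b32encode(raw).decode("ascii").rstrip("=").lower()


def decode_ident(text: str) -> bytes:
    """Inverse of encode_ident; accepts upper or lower case input."""
    padded = text.upper() + "=" * (-len(text) % 8)
    return base64.b32decode(padded)


# A random 128-bit identifier, as it might be stored in a UUID column.
example = uuid.uuid4()
encoded = encode_ident(example.bytes)
assert len(encoded) == 26
assert uuid.UUID(bytes=decode_ident(encoded)) == example
```

The same helpers give 20 characters for 12 random bytes and 13 characters for 8.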

Internal Schema

Internally, identifiers are lightweight pointers to "revisions" of an entity. Revisions are stored in their complete form, not as a patch or difference; if comparing to distributed version control systems (for managing changes to source code), this follows the git model, not the mercurial model.

The entity revisions are immutable once accepted; the editing process involves the creation of new entity revisions and, if the edit is approved, pointing the identifier to the new revision. Entities cross-reference between themselves by identifier not revision number. Identifier pointers also support (versioned) deletion and redirects (for merging entities).

Edit objects represent a change to a single entity; edits get batched together into edit groups (like "commits" and "pull requests" in git parlance).

SQL tables look something like this (with separate tables for each entity type, a la work_revision and work_edit):

    entity_ident
        id (uuid)
        current_revision (entity_revision foreign key)
        redirect_id (optional; points to another entity_ident)
        is_live (boolean; whether newly created entity has been accepted)

    entity_revision
        <all entity-style-specific fields>
        extra: json blob for schema evolution

    entity_edit
        editgroup_id (editgroup foreign key)
        ident (entity_ident foreign key)
        new_revision (entity_revision foreign key)
        new_redirect (optional; points to entity_ident table)
        previous_revision (optional; points to entity_revision)
        extra: json blob for provenance metadata

    editgroup
        editor_id (editor table foreign key)
        extra: json blob for provenance metadata

An individual entity can be in the following "states", from which the given actions (transition) can be made:

  • wip (not live; not redirect; has rev)
    • activate (to active)
  • active (live; not redirect; has rev)
    • redirect (to redirect)
    • delete (to deleted)
  • redirect (live; redirect; rev or not)
    • split (to active)
    • delete (to deleted)
  • deleted (live; not redirect; no rev)
    • redirect (to redirect)
    • activate (to active)
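The states and transitions listed above can be written down as a small lookup table. This is a sketch for illustration only; the function names classify and apply_action are hypothetical, and the real server enforces these rules in its own way.

```python
# Allowed actions per state, copied from the list above.
TRANSITIONS = {
    "wip": {"activate": "active"},
    "active": {"redirect": "redirect", "delete": "deleted"},
    "redirect": {"split": "active", "delete": "deleted"},
    "deleted": {"redirect": "redirect", "activate": "active"},
}


def classify(is_live, is_redirect, has_rev):
    """Map the three identifier flags to a named state."""
    if not is_live:
        if is_redirect or not has_rev:
            # A not-yet-accepted entity cannot be a redirect or be deleted.
            raise ValueError("invalid state: WIP entity must have a revision and no redirect")
        return "wip"
    if is_redirect:
        return "redirect"  # may or may not have a revision
    return "active" if has_rev else "deleted"


def apply_action(state, action):
    """Return the state reached by `action`, or raise if it is not allowed."""
    try:
        return TRANSITIONS[state][action]
    except KeyError:
        raise ValueError(f"action {action!r} is not allowed from state {state!r}") from None
```

For example, apply_action("deleted", "activate") gives "active", while apply_action("wip", "delete") raises an error.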

"WIP, redirect" or "WIP, deleted" are invalid states.

Additional entity-specific columns hold actual metadata. Additional tables (which reference both entity_revision and entity_id foreign keys as appropriate) represent things like authorship relationships (creator/release), citations between works, etc. Every revision of an entity requires duplicating all of these associated rows, which could end up being a large source of inefficiency, but is necessary to represent the full history of an object.


Core unimplemented features (as of February 2019) include:

  • rate-limiting and spam/abuse mitigation
  • actual entity creation, editing, deleting through the web interface
  • several web interface views (eg, editor-specific changelog, recent changes)
  • "work agglomeration", merging related releases under the same work
  • linking known citations (we know DOI or PMID of "target", but haven't updated reference to point to fatcat ident)

Contributions would be helpful to implement:

  • import (bulk and/or continuous updates) for more metadata sources
  • better handling of work/release distinction in, eg, search results and citation counting
  • de-duplication (via merging) for all entity types
  • matching improvements, eg, for references (citations), contributions (authorship), work grouping, and file/release matching
  • internationalization of the web interface (translation to multiple languages)
  • accessibility review of user interface

Possible breaking API and schema changes:

  • move all edit endpoints under /editgroup/<editgroup_id>/..., instead of having an editgroup_id query parameter
  • rename release_status to release_stage
  • handle retractions/withdrawals with withdrawn and withdrawn_date release fields, and retracted status
  • new entity type for research institutions, to track author affiliation. Use the new (2019) ROR identifier/registry
  • container nesting, or some method to handle conferences (event vs. series) and other "series" or "group" containers
  • include more author name metadata (display, sur, given) in contribs, and potentially references. Need this to format citations properly (CSL) when we don't have full author linkage

Other longer term projects could include:

  • full-text search over release files
  • bi-directional synchronization with other user-editable catalogs, such as Wikidata
  • alternate/enhanced backend to store full edit history without overloading traditional relational database
  • make external identifiers generic, instead of having a fixed (indexed) list. Eg, extid table for every entity rev, with string ("issn:1234-5678") or structure ('{type: "issn", value: "1234-5678"}')
  • URLs for entities. Have avoided so far, in lieu of external identifiers or web captures
  • "save paper now" feature in web interface
  • generic tagging of entities. Needs design/scoping; a separate service? editor-specific? tag by slugs, free-form text, or wikidata entities? "delicious for papers"? Something as an alternative to traditional hierarchical categorization.
  • first-class support for books: additional external identifiers, metadata tweaks, bulk import of MARC or other metadata records, matching to DOAB and other open-access book collections

Known Issues

  • changelog index may have gaps due to PostgreSQL sequence and transaction roll-back behavior
  • search is idiosyncratic: does not cover contrib names by default, and some queries cause errors (eg, "N/A" without quotes)

Unresolved Questions

How to handle translations of, eg, titles and author names? To be clear, not translations of works (which are just separate releases), these are more like aliases or "originally known as".

Should external identifiers be made generic? Eg, instead of having arxiv_id as a column, have a table of arbitrary identifiers, with either an extid_type or just use a prefix like arxiv:someid.

Should contributor/author affiliation and contact information be retained? It could be very useful for disambiguation, but we don't want to build a huge database for "marketing" and other spam.

Can general-purpose SQL databases like Postgres or MySQL scale well enough to hold several tables with billions of entity revisions? Right from the start there are hundreds of millions of works and releases, many of which have dozens of citations, many authors, and many identifiers, and then we'll have potentially dozens of edits for each of these. This multiplies out to `1e8 * 2e1 * 2e1 = 4e10`, or 40 billion rows in the citation table. If each row was 32 bytes on average (uncompressed, not including index size), that would be 1.3 TByte on its own, larger than common SSD disks. I do think a transactional SQL datastore is the right answer. In my experience locking and index rebuild times are usually the biggest scaling challenges; the largely-immutable architecture here should mitigate locking. Hopefully few indexes would be needed in the primary database, as user interfaces could rely on secondary read-only search engines for more complex queries and views.
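The back-of-the-envelope estimate can be reproduced in a few lines. The variable names are our reading of the three factors (works, citations per work, edits per work); the numbers are the rough orders of magnitude from the text, not measurements.

```python
works = 1e8               # "hundreds of millions" of works and releases
citations_per_work = 2e1  # "dozens" of citations each
edits_per_work = 2e1      # "dozens" of edits, each duplicating the citation rows
bytes_per_row = 32        # assumed average, uncompressed, without indexes

citation_rows = works * citations_per_work * edits_per_work
total_bytes = citation_rows * bytes_per_row

print(f"{citation_rows:.1e} rows")        # 4.0e+10 rows
print(f"{total_bytes / 1e12:.2f} TByte")  # 1.28 TByte
```

The 1.28 TByte result rounds to the 1.3 TByte figure quoted for the citation table alone.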

There is a tension between focus and scope creep. If a central database like Fatcat doesn't support enough fields and metadata, then it will not be possible to completely import other corpuses, and this becomes "yet another" partial bibliographic database. On the other hand, accepting arbitrary data leads to other problems: sparseness increases (we have more "partial" data), potential for redundancy is high, humans will start editing content that might be bulk-replaced, etc.

There might be a need to support "stub" references between entities. Eg, when adding citations from PDF extraction, the cited works are likely to be ambiguous. Could create "stub" works to be merged/resolved later, or could leave the citation hanging. Same with authors, containers (journals), etc.

Cataloging Style Guide

Language and Translation of Metadata

The Fatcat data model does not include multiple titles or names for the same entity, or even a "native"/"international" representation as seems common in other bibliographic systems. This most notably applies to release titles, but also to container and publisher names, and likely other fields.

For now, editors must use their own judgment over whether to use the title of the release listed in the work itself

This is not to be confused with translations of entire works, which should be treated as an entirely separate release.

External Identifiers

"Fake identifiers", which are actually registered and used in examples and documentation (such as DOI 10.5555/12345678) are allowed (and the entity should be tagged as a fake or example). Non-registered "identifier-like strings", which are semantically valid but not registered, should not exist in Fatcat metadata in an identifier column. Invalid identifier strings can be stored in "extra" metadata. Crossref has blogged about this distinction.


All DOIs stored in an entity column should be registered (aka, should be resolvable from doi.org). Invalid identifiers may be cleaned up or removed by bots.

DOIs should always be stored and transferred in lower-case form. Note that there are almost no other constraints on DOIs (and handles in general): they may contain multiple forward slashes or whitespace, be of arbitrary length, etc. Crossref has a number of examples of such "valid" but frustratingly formatted strings.
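A bot or client might normalize DOIs before storing or comparing them, along these lines. This is a hedged sketch: only the lower-casing comes from the guidance above. Stripping resolver URLs and a "doi:" prefix, trimming surrounding whitespace, and the function name normalize_doi are our own assumptions, and the check is deliberately loose because DOIs have so few formatting constraints. It does not verify that the DOI is actually registered.

```python
# Prefixes commonly seen in front of a bare DOI (an assumption, not a Fatcat rule).
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Return the DOI in lower-case form with any known prefix removed."""
    doi = raw.strip().lower()
    for prefix in _PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI starts with the "10." directory code and has a "/" before the suffix.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {raw!r}")
    return doi
```

Internal slashes and whitespace are left untouched, so a string like "10.1000/A/B c" simply becomes "10.1000/a/b c".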

In the Fatcat ontology, DOIs and release entities are one-to-one.

It is the intention to automatically (via bot) create a Fatcat release for every Crossref-registered DOI from a whitelist of media types ("journal-article" etc, but not all), and it would be desirable to auto-create entities for in-scope publications from all registrars. It is not the intention to auto-create a release for every registered DOI. In particular, "sub-component" DOIs (eg, for an individual figure or table from a publication) aren't currently auto-created, but could be stored in "extra" metadata, or on a case-by-case basis.

Human Names

Representing names of human beings in databases is a fraught subject. For some background reading, see:

Particularly difficult issues in the context of a bibliographic database include the non-universal concept of "family" vs. "given" names and their relationship to first and last names; the inclusion of honorary titles and other suffixes and prefixes to a name; the distinction between "preferred", "legal", and "bibliographic" names, or other situations where a person may not wish to be known under the name they are commonly referred to under; language and character set issues; and pseudonyms, anonymous publications, and fake personas (perhaps representing a group, like Bourbaki).

The general guidance for Fatcat is to:

  • not be a "source of truth" for representing a persona or human being; ORCID and Wikidata are better suited to this task
  • represent author personas, not necessarily 1-to-1 with human beings
  • prioritize the concerns of a reader or researcher over that of the author
  • enable basic interoperability with external databases, file formats, schemas, and style guides
  • when possible, respect the wishes of individuals

The data model for the creator entity has three name fields:

  • surname and given_name: needed for "aligning" with external databases, and to export metadata to many standard formats
  • display_name: the "preferred" representation for display of the entire name, in the context of international attribution of authorship of a written work

Names do not necessarily need to be expressed in a Latin character set, but also do not necessarily need to be in the native language of the creator or the language of their notable works.

Ideally all three fields are populated for all creators.

It seems likely that this schema and guidance will need review. "Extra" metadata can be used to store aliases and alternative representations, which may be useful for disambiguation and automated de-duplication.

Editgroups and Meta-Meta-Data

Editors are expected to group their edits in semantically meaningful editgroups of a reasonable size for review and acceptance. For example, merging two creators and updating related releases could all go in a single editgroup. Large refactors, conversions, and imports, which may touch thousands of entities, should be grouped into reasonable size editgroups; extremely large editgroups may cause technical issues, and make review unmanageable. 50 edits is a decent batch size, and 100 is a good upper limit (and may be enforced by the server).
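A bot doing a large import could split its pending edits into batches along these lines. This is only a sketch: the helper name is hypothetical, and the default of 50 simply mirrors the batch size suggested above.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batch_edits(edits: Sequence[T], batch_size: int = 50) -> Iterator[List[T]]:
    """Yield consecutive batches of edits, one batch per editgroup."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(edits), batch_size):
        yield list(edits[start:start + batch_size])


# Usage: 120 pending edits become three editgroups.
pending = [f"edit-{n}" for n in range(120)]
batches = list(batch_edits(pending))
```

Each batch would then be submitted as its own editgroup, which keeps individual reviews manageable.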

Common Entity Fields

All entities have:

  • extra: free-form JSON metadata

The "extra" field is an "escape hatch" to include extra fields not in the regular schema. It is intended to enable gradual evolution of the schema, as well as accommodating niche or field-specific content. Reasonable care should be taken with this extra metadata: don't include large text or binary fields, hundreds of fields, duplicate metadata, etc.

Container Entity Reference


  • name (string, required): The title of the publication, as used in international indexing services. Eg, "Journal of Important Results". Not necessarily in the native language, but also not necessarily in English. Alternative titles (and translations) can be stored in "extra" metadata (see below)
  • container_type (string): eg, journal vs. conference vs. book series. Controlled vocabulary is described below.
  • publisher (string): The name of the publishing organization. Eg, "Society of Curious Students".
  • issnl (string): an external identifier, with registration controlled by the ISSN organization. Registration is relatively inexpensive and easy to obtain (depending on world region), so almost all serial publications have one. The ISSN-L ("linking ISSN") is one of either the print ("ISSNp") or electronic ("ISSNe") identifiers for a serial publication; not all publications have both types of ISSN, but many do, which can cause confusion. The ISSN master list is not gratis/public, but the ISSN-L mapping is.
  • wikidata_qid (string): external linking identifier to a Wikidata entity.

extra Fields

  • abbrev (string): a commonly used abbreviation for the publication, as used in citations, following the ISO 4 standard. Eg, "Journal of Polymer Science Part A" -> "J. Polym. Sci. A"
  • acronym (string): acronym of publication name. Usually all upper-case, but sometimes a very terse, single-word truncated form of the name (eg, a pun).
  • coden (string): an external identifier, the CODEN code. 6 characters, all upper-case.
  • issnp (string): Print ISSN
  • issne (string): Electronic ISSN
  • default_license (string, slug): short name (eg, "CC-BY-SA") for the default/recommended license for works published in this container
  • original_name (string): native name (if name is translated)
  • platform (string): hosting platform: OJS, wordpress, scielo, etc
  • mimetypes (array of string): formats that this container publishes all works under (eg, 'application/pdf', 'text/html')
  • first_year (integer): first year of publication
  • last_year (integer): final year of publication (implies that container is no longer active)
  • languages (array of strings): ISO codes; the first entry is considered the "primary" language (if that makes sense)
  • country (string): ISO abbreviation (two characters) for the country this container is published in
  • aliases (array of strings): significant alternative names or abbreviations for this container (not just capitalization/punctuation)
  • region (string, slug): continent/world-region (vocabulary is TODO)
  • discipline (string, slug): highest-level subject area (vocabulary is TODO)
  • urls (array of strings): known homepage URLs for this container (first in array is default)

Additional fields used in analytics and "curation" tracking:

  • doaj (object)
    • as_of (string, ISO datetime): datetime of most recent check; if not set, not actually in DOAJ
    • seal (bool): has DOAJ seal
    • work_level (bool): whether work-level publications are registered with DOAJ
    • archive (array of strings): preservation archives
  • road (object)
    • as_of (string, ISO datetime): datetime of most recent check; if not set, not actually in ROAD
  • kbart (object)
    • lockss, clockss, portico, jstor etc (object)
      • year_spans (array of arrays of integers (pairs)): year spans (inclusive) for which the given archive has preserved this container
      • volume_spans (array of arrays of integers (pairs)): volume spans (inclusive) for which the given archive has preserved this container
  • sherpa_romeo (object):
    • color (string): the SHERPA/RoMEO "color" of the publisher of this container
  • doi: TODO: include list of prefixes and which (if any) DOI registrar is used
  • dblp (object):
    • prefix (string): prefix of dblp keys published as part of this container (eg, 'journals/blah' or 'conf/xyz')
  • ia (object): Internet Archive specific fields
    • sim (object): same format as kbart preservation above; coverage in microfilm collection
    • longtail (bool): is this considered a "long-tail" open access venue
  • publisher_type (string): controlled vocabulary

For KBART and other "coverage" fields, we "over-count" on the assumption that works with "in-progress" status will soon actually be preserved. Elements of these arrays are either an integer (meaning that single year is preserved), or an array of length two (meaning everything between the two numbers, inclusive, is preserved).
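The span format can be read with a few lines of code. The following sketch (the function name is hypothetical) checks whether a given year falls inside a year_spans array, accepting both single integers and inclusive two-element pairs:

```python
from typing import List, Union

Span = Union[int, List[int]]


def year_is_covered(year: int, year_spans: List[Span]) -> bool:
    """Return True if `year` is inside any span (pairs are inclusive)."""
    for span in year_spans:
        if isinstance(span, int):
            # A bare integer means that single year is preserved.
            if span == year:
                return True
        else:
            start, end = span
            if start <= year <= end:
                return True
    return False


# Usage: 1990 through 1995 plus the single year 2001.
spans = [[1990, 1995], 2001]
```

The same logic applies to volume_spans, as long as the volumes are integers.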

container_type Vocabulary

  • journal
  • proceedings
  • conference-series
  • book-series
  • blog
  • magazine
  • trade
  • test

Creator Entity Reference


  • display_name (string, required): Full name, as will be displayed in user interfaces. Eg, "Grace Hopper"
  • given_name (string): Also known as "first name". Eg, "Grace".
  • surname (string): Also known as "last name". Eg, "Hopper".
  • orcid (string): external identifier, as registered with ORCID.
  • wikidata_qid (string): external linking identifier to a Wikidata entity.

See also "Human Names" sub-section of style guide.

File Entity Reference


  • size (integer, positive, non-zero): Size of file in bytes. Eg: 1048576.
  • md5 (string): MD5 hash in lower-case hex. Eg: "d41efcc592d1e40ac13905377399eb9b".
  • sha1 (string): SHA-1 hash in lower-case hex. Not technically required, but the most-used of the hash fields and should always be included. Eg: "f013d66c7f6817d08b7eb2a93e6d0440c1f3e7f8".
  • sha256 (string): SHA-256 hash in lower-case hex. Eg: "a77e4c11a57f1d757fca5754a8f83b5d4ece49a2d28596889127c1a2f3f28832".
  • urls: An array of "typed" URLs. Order is not meaningful, and may not be preserved.
    • url (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
    • rel (string, required): Eg: "webarchive".
  • mimetype (string): Format of the file. If XML, specific schema can be included after a +. Example: "application/pdf"
  • release_ids (array of string identifiers): references to release entities that this file represents a manifestation of. Note that a single file can contain multiple release references (eg, a PDF containing a full issue with many articles), and that a release will often have multiple files (differing only by watermarks, or different digitizations of the same printed work, or variant MIME/media types of the same published work).
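The size and hash fields above can be computed in a single pass over the file. This is a minimal sketch using Python's standard hashlib; the function name and the returned dictionary layout are illustrative assumptions, not a Fatcat client API.

```python
import hashlib
import io
from typing import BinaryIO, Dict, Union


def file_entity_hashes(stream: BinaryIO, chunk_size: int = 1 << 20) -> Dict[str, Union[int, str]]:
    """Read a binary stream once and return size plus lower-case hex digests."""
    md5, sha1, sha256 = hashlib.md5(), hashlib.sha1(), hashlib.sha256()
    size = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        size += len(chunk)
        for digest in (md5, sha1, sha256):
            digest.update(chunk)
    # hexdigest() already returns lower-case hexadecimal strings.
    return {
        "size": size,
        "md5": md5.hexdigest(),
        "sha1": sha1.hexdigest(),
        "sha256": sha256.hexdigest(),
    }


# Usage with an in-memory stream; for a real file use open(path, "rb").
fields = file_entity_hashes(io.BytesIO(b"hello"))
```

Since the size field must be non-zero, a caller would presumably reject empty files before creating an entity.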

URL rel Vocabulary

  • web: generic public web sites; for http/https URLs, this should be the default
  • webarchive: full URL to a resource in a long-term web archive
  • repository: direct URL to a resource stored in a repository (eg, an institutional or field-specific research data repository)
  • academicsocial: academic social networks (such as academia.edu or ResearchGate)
  • publisher: resources hosted on publisher's website
  • aggregator: fulltext aggregator or search engine, like CORE or Semantic Scholar
  • dweb: content hosted on distributed/decentralized web protocols, such as dat:// or ipfs:// URLs

Fileset Entity Reference


Warning: This schema is not yet stable.

  • manifest (array of objects): each entry represents a file
    • path (string, required): relative path to file (including filename)
    • size (integer, required): in bytes
    • md5 (string): MD5 hash in lower-case hex
    • sha1 (string): SHA-1 hash in lower-case hex
    • sha256 (string): SHA-256 hash in lower-case hex
    • extra (object): any extra metadata about this specific file
  • urls: An array of "typed" URLs. Order is not meaningful, and may not be preserved.
    • url (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
    • rel (string, required): Eg: "webarchive".
  • release_ids (array of string identifiers): references to release entities

Web Capture Entity Reference


Warning: This schema is not yet stable.

  • cdx (array of objects): each entry represents a distinct web resource (URL). First is considered the primary/entry. Roughly aligns with CDXJ schema.
    • surt (string, required): sortable URL format
    • timestamp (string, datetime, required): ISO format, UTC timezone, with Z suffix required, with second (or finer) precision. Eg, "2016-09-19T17:20:24Z". Wayback timestamps (like "20160919172024") should be converted naively.
    • url (string, required): full URL
    • mimetype (string): content type of the resource
    • status_code (integer, signed): HTTP status code
    • sha1 (string, required): SHA-1 hash in lower-case hex
    • sha256 (string): SHA-256 hash in lower-case hex
  • archive_urls: An array of "typed" URLs where this snapshot can be found. Can be wayback/memento instances, or direct links to a WARC file containing all the capture resources. Often will only be a single archive. Order is not meaningful, and may not be preserved.
    • url (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
    • rel (string, required): Eg: "wayback" or "warc"
  • original_url (string): base URL of the resource. May reference a specific CDX entry, or maybe in normalized form.
  • timestamp (string, datetime): same format as CDX line timestamp (UTC, etc). Corresponds to the overall capture timestamp. Can be the earliest of CDX timestamps if that makes sense
  • release_ids (array of string identifiers): references to release entities
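The "naive" conversion of Wayback timestamps mentioned for the cdx timestamp field could look like the following sketch. The function name is hypothetical, and the input is assumed to be a full 14-digit timestamp in UTC.

```python
from datetime import datetime, timezone


def wayback_to_iso(timestamp: str) -> str:
    """Convert a 14-digit Wayback timestamp to ISO format with a Z suffix."""
    # strptime raises ValueError if the string is not in YYYYMMDDhhmmss form.
    parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


# Usage, with the example values from the field description above.
converted = wayback_to_iso("20160919172024")
```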

Release Entity Reference


  • title (string, required): the display title of the release. May include subtitle.
  • subtitle (string): intended primarily for use with books, not journal articles. Subtitle may also be appended to the title instead of populating this field.
  • original_title (string): the full original language title, if title is translated
  • work_id (fatcat identifier; required): the (single) work that this release is grouped under. If not specified in a creation (POST) action, the API will auto-generate a work.
  • container_id (fatcat identifier): a (single) container that this release is part of. When expanded the container field contains the full container entity.
  • release_type (string, controlled set): represents the medium or form-factor of this release; eg, "book" versus "journal article". Not necessarily the same across all releases of a work. See definitions below.
  • release_stage (string, controlled set): represents the publishing/review lifecycle status of this particular release of the work. See definitions below.
  • release_date (string, ISO date format): when this release was first made publicly available. Blank if only year is known.
  • release_year (integer): year when this release was first made publicly available; should match release_date if both are known.
  • withdrawn_status (optional, string, controlled set): if set, indicates that this release is no longer published. See definitions below.
  • withdrawn_date (optional, string, ISO date format): when this release was withdrawn. Blank if only year is known.
  • withdrawn_year (optional, integer): year when this release was withdrawn; should match withdrawn_date if both are known.
  • ext_ids (key/value object of string-to-string mappings): external identifiers. At least an empty ext_ids object is always required for release entities, so individual identifiers can be accessed directly.
  • volume (string): optionally, stores the specific volume of a serial publication this release was published in.
  • issue (string): optionally, stores the specific issue of a serial publication this release was published in.
  • pages (string): the pages (within a volume/issue of a publication) that this release can be looked up under. This is a free-form string, and could represent the first page, a range of pages, or even prefix pages (like "xii-xxx").
  • version (string): optionally, distinguishes this release version from others. Generally a number, software-style version, or other short/slug string, not a freeform description. Book "edition" descriptions can also go in an edition extra field. Often used in conjunction with external identifiers. If you're not certain, don't use this field!
  • number (string): an inherent identifier for this release (or work), often part of the title. For example, standards numbers, technical memo numbers, book series number, etc. Not a book chapter number however (which can be stored in extra). Depending on field or series-specific norms, the number may be stored here, in the title, or in both fields.
  • publisher (string): name of the publishing entity. This does not need to be populated if the associated container entity has the publisher field set, though it is acceptable to duplicate, as the publishing entity of a container may differ over time. Should be set for singleton releases, like books.
  • language (string, slug): the primary language used in this particular release of the work. Only a single language can be specified; additional languages can be stored in "extra" metadata (TODO: which field?). This field should be a valid RFC1766/ISO639 language code (two letters). AKA, a controlled vocabulary, not a free-form name of the language.
  • license_slug (string, slug): the license of this release. Usually a creative commons short code (eg, CC-BY), though a small number of other short names for publisher-specific licenses are included (TODO: list these).
  • contribs (array of objects): an array of authorship and other creator contributions to this release. Contribution fields include:
    • index (integer, optional): the (zero-indexed) order of this author. Authorship order has significance in many fields. Non-author contributions (illustration, translation, editorship) may or may not be ordered, depending on context, but index numbers should be unique per release (aka, there should not be "first author" and "first translator")
    • creator_id (identifier): if known, a reference to a specific creator
    • raw_name (string): the name of the contributor, as attributed in the text of this work. If the creator_id is linked, this may be different from the display_name; if a creator is not linked, this field is particularly important. Syntax and name order is not specified, but most often will be "display order", not index/alphabetical (in Western tradition, surname followed by given name).
    • role (string, of a set): the type of contribution, from a controlled vocabulary. TODO: vocabulary needs review.
    • extra (string): additional context can go here. For example, author affiliation, "this is the corresponding author", etc.
  • refs (array of objects): references (aka, citations) to other releases. References can only be linked to a specific target release (not a work), though it may be ambiguous which release of a work is being referenced if the citation is not specific enough. Reference fields include:
    • index (integer, optional): reference lists and bibliographies almost always have an implicit order. Zero-indexed. Note that this is distinct from the key field.
    • target_release_id (fatcat identifier): if known, and the release exists, a cross-reference to the Fatcat entity
    • extra (JSON, optional): additional citation format metadata can be stored here, particularly if the citation schema does not align. Common fields might be "volume", "authors", "issue", "publisher", "url", and external identifiers ("doi", "isbn13").
    • key (string): works often reference works with a short slug or index number, which can be captured here. For example, "[BROWN2017]". Keys generally supersede the index field, though both can/should be supplied.
    • year (integer): year of publication of the cited release.
    • container_title (string): if applicable, the name of the container of the release being cited, as written in the citation (usually an abbreviation).
    • title (string): the title of the work/release being cited, as written.
    • locator (string): a more specific reference into the work/release being cited, for example the page number(s). For web reference, store the URL in "extra", not here.
  • abstracts (array of objects): see below
    • sha1 (string, hex, required): reference to the abstract content (string). Example: "3f242a192acc258bdfdb151943419437f440c313"
    • content (string): The abstract raw content itself. Example: <jats:p>Some abstract thing goes here</jats:p>
    • mimetype (string): not formally required, but should effectively always get set. text/plain if the abstract doesn't have a structured format
    • lang (string, controlled set): the human language this abstract is in. See the language field of release for format and vocabulary.
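The rule that contribs index numbers should be unique per release can be checked mechanically before submitting an edit. This sketch assumes contribs is a plain list of dictionaries shaped like the fields above; the helper name is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def duplicate_contrib_indexes(contribs: List[Dict[str, Any]]) -> List[int]:
    """Return the sorted index values used by more than one contribution."""
    counts = Counter(
        contrib["index"] for contrib in contribs if contrib.get("index") is not None
    )
    return sorted(index for index, count in counts.items() if count > 1)


# Usage: the translator wrongly reuses index 0; the editor has no index at all.
contribs = [
    {"index": 0, "raw_name": "First Author", "role": "author"},
    {"index": 1, "raw_name": "Second Author", "role": "author"},
    {"index": 0, "raw_name": "Some Translator", "role": "translator"},
    {"raw_name": "Unordered Editor", "role": "editor"},
]
problems = duplicate_contrib_indexes(contribs)
```

Contributions without an index are ignored, since the field is optional.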

External Identifiers (ext_ids)

The ext_ids object name-spaces external identifiers and makes it easier to add new identifiers to the schema in the future.

Many identifier fields must match an internal regex (string syntax constraint) to ensure they are properly formatted, though these checks aren't always complete or correct in more obscure cases.

  • doi (string): full DOI number, lower-case. Example: "10.1234/abcde.789". See the "External Identifiers" section of style guide for more notes about DOIs specifically.
  • wikidata_qid (string): external identifier for Wikidata entities. These are integers prefixed with "Q", like "Q4321". Each release entity can be associated with at most one Wikidata entity (this field is not an array), and Wikidata entities should be associated with at most a single release. In the future it may be possible to associate Wikidata entities with work entities instead.
  • isbn13 (string): external identifier for books. ISBN-10 and other formats should be converted to canonical ISBN-13.
  • pmid (string): external identifier for PubMed database. These are bare integers, but stored in a string format.
  • pmcid (string): external identifier for PubMed Central database. These are integers prefixed with "PMC" (upper case), like "PMC4321". Versioned PMCIDs can also be stored (eg, "PMC4321.1"); future clarification of whether versions should always be stored will be needed.
  • core (string): external identifier for the CORE open access aggregator. These identifiers are integers, but stored in string format.
  • arxiv (string): external identifier to a (version-specific) arxiv.org work. For releases, must always include the vN suffix (eg, v3).
  • jstor (string): external identifier for works in JSTOR.
  • ark (string): ARK identifier
  • mag (string): Microsoft Academic Graph identifier
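For the isbn13 field, converting from ISBN-10 means prepending the "978" prefix and recomputing the check digit. The sketch below uses the standard ISBN-13 check digit calculation (alternating weights of 1 and 3). The function name is hypothetical and the old ISBN-10 check digit is not verified. It returns bare digits; whether the catalog expects a hyphenated form is outside the scope of this example.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 (hyphens and spaces allowed) to 13 bare digits."""
    chars = isbn10.replace("-", "").replace(" ", "")
    if len(chars) != 10:
        raise ValueError(f"expected 10 characters, got {len(chars)}")
    # Drop the old check digit (which may be "X") and add the "978" prefix.
    core = "978" + chars[:9]
    if not core.isdigit():
        raise ValueError(f"non-digit characters in ISBN-10: {isbn10!r}")
    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ... from the left.
    total = sum(int(ch) * (1 if pos % 2 == 0 else 3) for pos, ch in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


# Usage
isbn13 = isbn10_to_isbn13("0-306-40615-2")
```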

extra Fields

  • crossref (object), for extra crossref-specific metadata
    • subject (array of strings) for subject/category of content
    • type (string) raw/original Crossref type
    • alternative-id (array of strings)
    • archive (array of strings), indicating preservation services deposited
    • funder (object/dictionary)
  • aliases (array of strings) for additional titles this release might be known by
  • container_name (string) if not matched to a container entity
  • group-title (string) for releases within a collection/group
  • translation_of (release identifier) if this release is a translation of another (usually under the same work)
  • superceded (boolean) if there is another release under the same work that should be referenced/indicated instead. Intended as a temporary hint until proper work-based search is implemented. As an example use, all arxiv release versions except for the most recent get this set.

release_type Vocabulary

This vocabulary is based on the CSL types, with a small number of (proposed) extensions:

  • article-magazine
  • article-journal, including pre-prints and working papers
  • book
  • chapter is allowed as they are frequently referenced and read independent of the entire book. The data model does not currently support linking a subset of a release to an entity representing the entire release. The release/work/file distinctions should not be used to group multiple chapters under a single work; a book chapter can be its own work. A paper which is republished as a chapter (eg, in a collection, or "edited" book) can have both releases under one work. The criterion for whether to "split" a book and have release entities for each chapter is whether the chapter has been cited/referenced as such.
  • dataset
  • entry, which can be used for generic web resources like question/answer site entries.
  • entry-encyclopedia
  • manuscript
  • paper-conference
  • patent
  • post-weblog for blog entries
  • report
  • review, for things like book reviews, not the "literature review" form of article-journal, nor peer reviews (see peer_review)
  • speech can be used for eg, slides and recorded conference presentations themselves, as distinct from paper-conference
  • thesis
  • webpage
  • peer_review (fatcat extension)
  • software (fatcat extension)
  • standard (fatcat extension), for technical standards like RFCs
  • abstract (fatcat extension), for releases that are only an abstract of a larger work. In particular, translations. Many are granted DOIs.
  • editorial (custom extension) for columns, "in this issue", and other content published alongside peer-reviewed content in journals. Many are granted DOIs.
  • letter for "letters to the editor", "authors respond", and sub-article-length published content. Many are granted DOIs.
  • stub (fatcat extension) for releases which have notable external identifiers, and thus are included "for completeness", but don't seem to represent a "full work".
  • component (fatcat extension) for sub-components of a full paper (or other work). Eg, figures or tables.

An example of a stub might be a paper that gets an extra DOI by accident; the primary DOI should be a full release, and the accidental DOI can be a stub release under the same work. stub releases shouldn't be considered full releases when counting or aggregating (though if technically difficult this may not always be implemented). Other things that can be categorized as stubs (which seem to often end up mis-categorized as full articles in bibliographic databases):

  • commercial advertisements
  • "trap" or "honey pot" works, which are fakes included in databases to detect re-publishing without attribution
  • "This page is intentionally blank"
  • "About the author", "About the editors", "About the cover"
  • "Acknowledgments"
  • "Notices"

All other CSL types are also allowed, though they are mostly out of scope:

  • article (generic; should usually be some other type)
  • article-newspaper
  • bill
  • broadcast
  • entry-dictionary
  • figure
  • graphic
  • interview
  • legislation
  • legal_case
  • map
  • motion_picture
  • musical_score
  • pamphlet
  • personal_communication
  • post
  • review-book
  • song
  • treaty

For the purpose of statistics, the following release types are considered "papers":

  • article
  • article-journal
  • chapter
  • paper-conference
  • thesis
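A statistics script could apply this list as in the sketch below. The release dictionaries are simplified stand-ins rather than full API responses, and the names are illustrative only.

```python
from typing import Any, Dict, Iterable

# The release types counted as "papers", copied from the list above.
PAPER_RELEASE_TYPES = frozenset(
    {"article", "article-journal", "chapter", "paper-conference", "thesis"}
)


def is_paper(release: Dict[str, Any]) -> bool:
    """Return True if the release's release_type counts as a "paper"."""
    return release.get("release_type") in PAPER_RELEASE_TYPES


def count_papers(releases: Iterable[Dict[str, Any]]) -> int:
    return sum(1 for release in releases if is_paper(release))


# Usage: a dataset, a stub, and a release with no type are not counted.
sample = [
    {"title": "A", "release_type": "article-journal"},
    {"title": "B", "release_type": "dataset"},
    {"title": "C", "release_type": "stub"},
    {"title": "D", "release_type": "thesis"},
    {"title": "E"},
]
paper_count = count_papers(sample)
```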

release_stage Vocabulary

These roughly follow the DRIVER publication version guidelines, with the addition of a retraction status.

  • draft is an early version of a work which is not considered for peer review. Sometimes these are posted to websites or repositories for early comments and feedback.
  • submitted is the version that was submitted for publication. Also known as "pre-print", "pre-review", "under review". Note that this doesn't imply that the work was ever actually submitted, reviewed, or accepted for publication, just that this is the version that "would be". Most versions in pre-print repositories are likely to have this status.
  • accepted is a version that has undergone peer review and been accepted for publication, but has not gone through any publisher copy editing or re-formatting. Also known as "post-print", "author's manuscript", "publisher's proof".
  • published is the version that the publisher distributes. May include minor (grammatical, typographical, broken link, aesthetic) corrections. Also known as "version of record", "final publication version", "archival copy".
  • updated: post-publication significant updates (considered a separate release in Fatcat). Also known as "correction" (in the context of either a published "correction notice", or the full new version)
  • retraction for post-publication retraction notices (should be a release under the same work as the published release)

Note that in the case of a retraction, the original publication does not get the retraction stage; only the retraction notice does. The original publication does get a withdrawn_status metadata field set.

When blank, indicates status isn't known, and wasn't inferred at creation time. Can often be interpreted as published, but be careful!

withdrawn_status Vocabulary

We don't know of an existing controlled vocabulary for things like retractions or other reasons for marking papers as removed from publication, so we invented our own. These labels should be considered experimental and subject to change.

Note that some of these will apply more to pre-print servers or publishing accidents, and don't necessarily make sense as a formal change of status for a print journal publication.

Any value at all indicates that the release should be considered "no longer published by the publisher or primary host", which could mean different things in different contexts. As some concrete examples, works are often accidentally assigned a duplicate DOI; physics papers have been taken down in response to government order under national security justifications; papers have been withdrawn for public health reasons (above and beyond any academic-style retraction); entire journals may be found to be predatory and pulled from circulation; individual papers may be retracted by authors if a serious mistake or error is found; an author's entire publication history may be retracted in cases of serious academic misconduct or fraud.

  • withdrawn is generic: the work is no longer available from the original publisher. There may be no reason, or the reason may not be known yet.
  • retracted for when a work is formally retracted, usually accompanied by a retraction notice (a separate release under the same work). Note that the retraction itself should not have a withdrawn_status.
  • concern for when publishers release an "expression of concern", often indicating that the work is not reliable in some way, but not yet formally retracted. In this case the original work is probably still available, but should be marked as suspect. This is not the same as presence of errata.
  • safety for works pulled for public health or human safety concerns.
  • national-security for works pulled over national security concerns.
  • spam for content that is considered spam (eg, bogus pre-print or repository submissions). Not to be confused with advertisements or product reviews in journals.

contribs.role Vocabulary

  • author
  • translator
  • illustrator
  • editor

All other CSL role types are also allowed, though are mostly out of scope for Fatcat:

  • collection-editor
  • composer
  • container-author
  • director
  • editorial-director
  • editortranslator
  • interviewer
  • original-author
  • recipient
  • reviewed-author

If blank, indicates that type of contribution is not known; this can often be interpreted as authorship.

Work Entity Reference

Works have no fields! They just group releases.


The Fatcat HTTP API is mostly a classic REST "CRUD" (Create, Read, Update, Delete) API, with a few twists.

A declarative specification of all API endpoints, JSON data models, and response types is available in OpenAPI 2.0 format. Code generation tools are used to generate both server-side type-safe endpoint routes and client-side libraries. Auto-generated reference documentation is, for now, available at https://api.fatcat.wiki.

All API traffic is over HTTPS; there is no HTTP endpoint, even for read-only operations. All endpoints accept and return only JSON serialized content.

Entity Endpoints/Actions

Actions could, in theory, be directed at any of:

  • entities (ident)

Top-level entity actions (resulting in edits):

  • create (new rev)
  • update (new rev)
  • split (remove redirect)

On existing entity edits (within a group):


An edit group as a whole can be created, submitted for review, or accepted.


Other per-entity endpoints:

  • lookup (by external persistent identifier)
  • match (by field/context; unimplemented)


All mutating entity operations (create, update, delete) accept a required editgroup_id query parameter. Editgroups (with contextual metadata) should be created before starting edits.

Related edits (to multiple entities) should be collected under a single editgroup, up to a reasonable size. An editgroup with more than 50 edits per entity type, or more than 100 edits in total, becomes unwieldy.

After creating and modifying the editgroup, it may be "submitted", which flags it for review by bot and human editors. The editgroup may be "accepted" (merged), or if changes are necessary the edits can be updated and re-submitted.

Sub-Entity Expansion

To reduce the need for multiple GET queries when looking for common related metadata, it is possible to include linked entities in responses using the expand query parameter. For example, by default the release model only includes an optional container_id field which points to a container entity. If the expand parameter is set to container (eg, by appending ?expand=container to the request URL):

Then the full container model will be included under the container field. Multiple expand parameters can be passed, comma-separated.
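A minimal sketch of building such a request URL with the Python standard library follows. The `/v0/{entity_type}/{ident}` path pattern and the `entity_url` helper are assumptions made for illustration; check the API reference for the exact endpoints.

```python
from urllib.parse import urlencode

API_BASE = "https://api.fatcat.wiki/v0"


def entity_url(entity_type, ident, expand=None):
    """Build a GET URL for one entity, optionally with sub-entities expanded."""
    url = "{}/{}/{}".format(API_BASE, entity_type, ident)
    if expand:
        # multiple expansions are passed as a single comma-separated value
        url += "?" + urlencode({"expand": ",".join(expand)})
    return url


print(entity_url("release", "3j36alui7fcwncbc4xdaklywb4", expand=["container"]))
```

Note that `urlencode` percent-encodes the comma separating multiple values (as `%2C`), which servers decode transparently.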

Authentication and Authorization

There are two editor types: bots and humans. Additionally, either type of editor may have additional privileges which allow them to, eg, directly accept editgroups (as opposed to submitting edits for review).

All mutating API calls (POST, PUT, DELETE HTTP verbs) require token-based authentication using an HTTP Bearer token. New tokens can be generated in the web interface.
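For readers not using a generated client library, here is a sketch of attaching the Bearer token with only the Python standard library. The helper name and the `EXAMPLE_TOKEN` value are placeholders; the request is built but deliberately not sent.

```python
import json
import urllib.request

API_BASE = "https://api.fatcat.wiki/v0"


def authenticated_request(path, token, body=None, method="POST"):
    """Build (but do not send) a JSON request carrying a Bearer token."""
    data = json.dumps(body if body is not None else {}).encode("utf-8")
    return urllib.request.Request(
        API_BASE + path,
        data=data,
        method=method,
        headers={
            "Authorization": "Bearer {}".format(token),
            "Content-Type": "application/json",
        },
    )


# "EXAMPLE_TOKEN" is a placeholder; use a token generated in the web interface
req = authenticated_request("/editgroup", "EXAMPLE_TOKEN")
# urllib.request.urlopen(req) would actually send the request
```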

Autoaccept Flag

The autoaccept flag is currently only supported on batch creation (POST) of entities.

For all bulk operations, the optional editgroup query parameter overrides any editgroup parameters on individual entities.

If the autoaccept flag is set and editgroup is not, a new editgroup is automatically created and applied to all entities inserted. Note that this differs from the "use current or create new" default behavior for regular creation.

Unfortunately, "true" and "false" are the only values accepted for boolean query parameters (a limitation of the Rust/OpenAPI 2.0 tooling).
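When constructing query strings by hand, this means Python booleans need to be serialized explicitly, since `str(True)` gives `"True"` rather than `"true"`. A tiny sketch (the helper name is hypothetical, and the `autoaccept` key simply mirrors the flag name used above):

```python
def bool_query_param(value):
    """Serialize a Python bool as the literal string "true" or "false"."""
    if not isinstance(value, bool):
        raise TypeError("expected a bool, got {!r}".format(value))
    return "true" if value else "false"


params = {"autoaccept": bool_query_param(True)}
```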

QA Instance

The intent is to run a public "sandbox" QA instance of the catalog, using a subset of the full catalog, running the most recent development branch of the API specification. This instance can be used by developers for prototyping and experimentation, though note that all data is periodically wiped, and this endpoint is more likely to have bugs or be offline.

Bulk Exports

There are several types of bulk exports and database dumps folks might be interested in:

  • complete database dumps
  • changelog history with all entity revisions and edit metadata
  • identifier snapshot tables
  • entity exports

All exports and dumps get uploaded to the Internet Archive under the "Fatcat Database Snapshots and Bulk Metadata Exports" collection.

Complete Database Dumps

The simplest and most complete bulk export. Useful for disaster recovery, mirroring, or forking the entire service. The internal database schema is not stable, so these dumps are less useful for longitudinal analysis. They include edits-in-progress, deleted entities, old revisions, etc, which are potentially difficult or impossible to fetch through the API.

Public copies may have some tables redacted (eg, API credentials).

Dumps are in PostgreSQL pg_dump "tar" binary format, and can be restored locally with the pg_restore command. See ./extra/sql_dumps/ for commands and details. Dumps are on the order of 100 GBytes (compressed) and will grow over time.

Changelog History

These are currently unimplemented; would involve "hydrating" sub-entities into changelog exports. Useful for some mirrors, and analysis that needs to track provenance information. Format would be the public API schema (JSON).

All information in these dumps should be possible to fetch via the public API, including on a feed/streaming basis using the sequential changelog index. All information is also contained in the database dumps.

Identifier Snapshots

Many of the other dump formats are very large. To save time and bandwidth, a few simple snapshot tables can be exported directly in TSV format. Because these tables can be dumped in single SQL transactions, they are consistent point-in-time snapshots.

One format is per-entity identifier/revision tables. These contain active, deleted, and redirected identifiers, with revision and redirect references, and are used to generate the entity dumps below.

Other tables contain external identifier mappings or file hashes.

Release abstracts can be dumped in their own table (JSON format), allowing them to be included only by reference from other dumps. The copyright status and usage restrictions on abstracts are different from other catalog content; see the policy page for more context. Abstracts are immutable and referenced by hash in the database, so the consistency of these dumps is not as much of a concern as with other exports.

Unlike all other dumps and public formats, the Fatcat identifiers in these dumps are in raw UUID format (not base32-encoded), though this may be fixed in the future.
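Converting between the two forms is straightforward if one assumes, as this sketch does, that the public identifiers are the RFC 4648 base32 encoding of the 16 UUID bytes, lowercased and with the padding removed. That assumption matches the 26-character identifiers seen elsewhere in this guide, but the function names here are illustrative only.

```python
import base64
import uuid


def uuid_to_ident(raw_uuid):
    """Convert a UUID string to a 26-character lowercase base32 identifier."""
    raw = uuid.UUID(raw_uuid).bytes
    # 16 bytes encode to 26 base32 characters plus 6 "=" padding characters
    return base64.b32encode(raw)[:26].lower().decode("ascii")


def ident_to_uuid(ident):
    """Inverse of uuid_to_ident()."""
    if len(ident) != 26:
        raise ValueError("expected a 26-character identifier")
    raw = base64.b32decode(ident.upper() + "======")
    return str(uuid.UUID(bytes=raw))
```

A round trip through both functions should return the original UUID string, which is an easy sanity check to run against a handful of rows from a snapshot table.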

See ./extra/sql_dumps/ for scripts and details. Dumps are on the order of a couple GBytes each (compressed).

Entity Exports

Using the above identifier snapshots, the Rust fatcat-export program outputs single-entity-per-line JSON files with the same schema as the HTTP API. These might contain the default fields, or be in "expanded" format containing sub-entities for each record.

Only "active" entities are included (not deleted, work-in-progress, or redirected entities).

These dumps can be quite large when expanded (over 100 GBytes compressed), but do not include history so will not grow as fast as other exports over time. Not all entity types are dumped at the moment; if you would like specific dumps get in touch!
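Because each line is a standalone JSON document, these exports can be processed as a stream without loading the whole file. The sketch below assumes the files may be gzip-compressed (detected by a `.gz` suffix) and uses a tiny in-memory sample rather than real catalog data.

```python
import gzip
import io
import json


def iter_entities(fileobj):
    """Yield one parsed entity (a dict) per non-empty line of a JSON-lines export."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)


def open_export(path):
    """Open an export file as text, transparently handling gzip compression."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


# Tiny in-memory stand-in for an export file (not real catalog data)
sample = io.StringIO(
    '{"ident": "aaaa", "release_year": 2002}\n'
    '{"ident": "bbbb", "release_year": 2019}\n'
)
recent = [e["ident"] for e in iter_entities(sample) if e.get("release_year", 0) >= 2010]
print(recent)
```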


These quickstart examples gloss over a lot of details in the API. The canonical API documentation (generated from the OpenAPI specification) is available at https://api.fatcat.wiki/redoc.

The first two simple cookbook examples here include full headers. Later examples only show python client library code snippets.

Lookup Fulltext URLs by DOI

Often you have a DOI or other paper identifier and want to find open copies of the paper to read. In fatcat terms, you want to look up a release by external identifier, then sort through any associated file entities to find the best files and URLs to download. Note that the Unpaywall API is custom designed for this task and you should look into using that instead.

This is a read-only task and requires no authentication. The simple summary is to:

  1. GET the release lookup endpoint with the external identifier as a query parameter. Also set the hide parameter to elide unused fields, and the expand parameter to files to include related files in a single request.
  2. If you get a hit (HTTP 200), sort through the files field (an array) and for each file the urls field (also an array) to select the best URL(s).

The URL to use would look like https://api.fatcat.wiki/v0/release/lookup?doi=10.1088/0264-9381/19/7/380&expand=files&hide=abstracts,refs in a browser. The query parameters should be URL encoded (eg, the / characters in the DOI replaced with %2F), but almost all HTTP tools and libraries will do this automatically.
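To see what the encoding does to this particular query, the Python standard library can be used directly. This is just a demonstration of percent-encoding; no request is made.

```python
from urllib.parse import urlencode

params = {
    "doi": "10.1088/0264-9381/19/7/380",
    "expand": "files",
    "hide": "abstracts,refs",
}
# urlencode() percent-encodes reserved characters such as "/" and ","
query = urlencode(params)
url = "https://api.fatcat.wiki/v0/release/lookup?" + query
print(url)
```

The slashes in the DOI come out as `%2F` and the comma as `%2C`.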

The raw HTTP request would look like:

GET /v0/release/lookup?doi=10.1088%2F0264-9381%2F19%2F7%2F380&expand=files&hide=abstracts%2Crefs HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: api.fatcat.wiki
User-Agent: HTTPie/0.9.8

And the response (with some headers removed and JSON body paraphrased):

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 1996
Content-Type: application/json
Date: Tue, 17 Sep 2019 22:47:54 GMT
X-Frame-Options: SAMEORIGIN
X-Span-ID: caa70cff-967d-4429-96c6-71909738ab4c

{
    "ident": "3j36alui7fcwncbc4xdaklywb4",
    "title": "LIGO sensing system performance",
    "publisher": "IOP Publishing",
    "release_date": "2002-03-19",
    "release_stage": "published",
    "release_type": "article-journal",
    "release_year": 2002,
    "revision": "2e36dfbe-9a4b-4917-95bb-f02b04f6b5d0",
    "state": "active",
    "work_id": "ejllv7xq4rgrrffpsf3prqurwq",
    "container_id": "j5iizqxt2rainmxg6nfmpg2ds4",
    "contribs": [],
    "ext_ids": {
        "doi": "10.1088/0264-9381/19/7/380"
    },
    "files": [
        {
            "ident": "vmfyqb77r5gs3pkoekzfcjgsb4",
            "mimetype": "application/pdf",
            "release_ids": [
                "3j36alui7fcwncbc4xdaklywb4"
            ],
            "revision": "66639928-d9e2-45e2-a883-36616d5b0a67",
            "sha1": "54244fe8d35bff2db2a3ff946e60c194f68821ae",
            "state": "active",
            "urls": [
                {
                    "rel": "web",
                    "url": "http://www.gravity.uwa.edu.au/amaldi/papers/Landry.pdf"
                },
                {
                    "rel": "webarchive",
                    "url": "https://web.archive.org/web/20081011163648/http://www.gravity.uwa.edu.au/amaldi/papers/Landry.pdf"
                }
            ]
        },
        {
            "ident": "3ta26geysncdxlgswjoaiqlbyu",
            "mimetype": "application/pdf",
            "release_ids": [
                "3j36alui7fcwncbc4xdaklywb4"
            ],
            "revision": "5c7a8cb0-4710-415a-93d5-d7cb6c42dfd1",
            "sha1": "954c0fb370af7f72a0cb47505b8793e8e5e23136",
            "state": "active",
            "urls": [
                {
                    "rel": "webarchive",
                    "url": "https://web.archive.org/web/20050624182645/http://www.gravity.uwa.edu.au/amaldi/papers/Landry.pdf"
                },
                {
                    "rel": "webarchive",
                    "url": "https://web.archive.org/web/20091024040004/http://www.gravity.uwa.edu.au/amaldi/papers/Landry.pdf"
                },
                {
                    "rel": "web",
                    "url": "http://www.gravity.uwa.edu.au/amaldi/papers/Landry.pdf"
                }
            ]
        }
    ]
}

An httpie and jq one-liner to grab the first URL would be:

http https://api.fatcat.wiki/v0/release/lookup doi==10.1088/0264-9381/19/7/380 expand==files hide==abstracts,refs | jq '.files[0].urls[0].url' -r

Using the python client library (fatcat-openapi-client), you might do something like:

import sys

import fatcat_openapi_client
from fatcat_openapi_client.rest import ApiException

conf = fatcat_openapi_client.Configuration()
conf.host = "https://api.fatcat.wiki/v0"
api = fatcat_openapi_client.DefaultApi(fatcat_openapi_client.ApiClient(conf))
doi = "10.1088/0264-9381/19/7/380"

try:
    r = api.lookup_release(doi=doi, expand="files", hide="abstracts,refs")
except ApiException as ae:
    if ae.status == 404:
        print("DOI not found!")
        sys.exit(1)
    else:
        raise ae

print("Fatcat release found: https://fatcat.wiki/release/{}".format(r.ident))

# print the first Wayback Machine URL for any PDF file, then stop
for f in r.files:
    if f.mimetype != 'application/pdf':
        continue
    for u in f.urls:
        if u.rel == 'webarchive' and '//web.archive.org/' in u.url:
            print("Wayback PDF URL: {}".format(u.url))
            sys.exit(0)

print("No Wayback PDF URL found")

A more advanced lookup tool would check for sibling releases under the same work and provide both alternative links ("no version of record available, but here is the pre-print") and notify the end user about any updates or retractions to the work as a whole.
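One way such a tool might choose among sibling releases is sketched below, operating on plain dicts shaped like the JSON response above. The stage names other than "published" and their preference order are assumptions for illustration, as is the function name; fetching the sibling releases of a work is left out entirely.

```python
# Preference order is an assumption for this sketch: the version of record
# first, then progressively earlier stages.
STAGE_PREFERENCE = ["published", "accepted", "submitted", "draft"]


def pick_release_with_files(releases):
    """Pick the most-preferred sibling release that has at least one file.

    `releases` is a list of release dicts (as returned with expand=files).
    Returns (release, note), or (None, note) if nothing is available.
    """
    def rank(release):
        stage = release.get("release_stage")
        if stage in STAGE_PREFERENCE:
            return STAGE_PREFERENCE.index(stage)
        return len(STAGE_PREFERENCE)  # unknown or missing stage sorts last

    with_files = [r for r in releases if r.get("files")]
    if not with_files:
        return None, "no fulltext available for any version"
    best = min(with_files, key=rank)
    if best.get("release_stage") == "published":
        return best, "version of record available"
    return best, "no version of record available, but here is an earlier version"
```

The returned note is what would be shown to the end user alongside the link.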

Creating an Entity

Let's use a container (journal) entity as a simple example of mutation of the catalog. This assumes you already have an editor account and API token, both obtained through the web interface.

In summary:

  1. Create (POST) an editgroup
  2. Create (POST) the container entity as part of editgroup
  3. Submit the editgroup for review
  4. (privileged) Accept the editgroup

See the API docs for full details of authentication.

To create an editgroup, the raw HTTP request (to https://api.fatcat.wiki/v0/editgroup) and response would look like:

POST /v0/editgroup HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Bearer AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug=
Connection: keep-alive
Content-Length: 2
Content-Type: application/json
Host: api.fatcat.wiki
User-Agent: HTTPie/0.9.8

{}

HTTP/1.1 201 Created
Connection: keep-alive
Content-Length: 126
Content-Type: application/json
Date: Tue, 17 Sep 2019 23:25:55 GMT
X-Span-ID: cc016e0e-77ae-4ca0-b1da-b0a38e48a130

{
    "created": "2019-09-17T23:25:55.273836Z",
    "editgroup_id": "aqhyo2ulmzfbrewn3rv7dhl65u",
    "editor_id": "4vmpwdwxxneitkonvgm2pk6kya"
}

It is important to parse the response to get the editgroup_id. Next, POST to https://api.fatcat.wiki/v0/editgroup/EDITGROUP_ID/container (with the editgroup_id substituted), with the JSON container entity as the body:

POST /v0/editgroup/aqhyo2ulmzfbrewn3rv7dhl65u/container HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Bearer AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug=
Connection: keep-alive
Content-Length: 54
Content-Type: application/json
Host: api.fatcat.wiki
User-Agent: HTTPie/0.9.8

{
    "issnl": "1234-5678",
    "name": "Journal of Something"
}

HTTP/1.1 201 Created
Connection: keep-alive
Content-Length: 181
Content-Type: application/json
Date: Tue, 17 Sep 2019 23:30:32 GMT
X-Span-ID: eb2f4243-ed43-4a21-bbf0-d653590fcfe2

{
    "edit_id": "ea203496-ecb9-45c7-ac50-3cb24cdbb58f",
    "editgroup_id": "aqhyo2ulmzfbrewn3rv7dhl65u",
    "ident": "g3kyxylxjbej7drf6apqpfkl6i",
    "revision": "796429d2-44a4-4ece-a9b2-e80edcd4277a"
}

To submit an editgroup, use the update endpoint with the submit query parameter set to true. The body should be the editgroup object (as JSON), but is mostly ignored:

PUT /v0/editgroup/aqhyo2ulmzfbrewn3rv7dhl65u?submit=true HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Bearer AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug=
Connection: keep-alive
Content-Length: 131
Content-Type: application/json
Host: api.fatcat.wiki
User-Agent: HTTPie/0.9.8

{
    "created": "2019-09-17T23:25:55.273836Z",
    "editgroup_id": "aqhyo2ulmzfbrewn3rv7dhl65u",
    "editor_id": "4vmpwdwxxneitkonvgm2pk6kya"
}

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 168
Content-Type: application/json
Date: Tue, 17 Sep 2019 23:37:06 GMT
X-Span-ID: c0ac0406-83ce-4e07-a892-3f83c02ec207

{
    "created": "2019-09-17T23:25:55.273836Z",
    "editgroup_id": "aqhyo2ulmzfbrewn3rv7dhl65u",
    "editor_id": "4vmpwdwxxneitkonvgm2pk6kya",
    "submitted": "2019-09-17T23:37:06.288434Z"
}

Lastly, if your editor account has the admin role, you can "accept" the editgroup using the accept endpoint:

POST /v0/editgroup/aqhyo2ulmzfbrewn3rv7dhl65u/accept HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Authorization: Bearer AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug=
Connection: keep-alive
Content-Length: 0
Host: api.fatcat.wiki
User-Agent: HTTPie/0.9.8

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 36
Content-Type: application/json
Date: Tue, 17 Sep 2019 23:40:21 GMT
X-Span-ID: cb4d66f0-9e67-4908-8dff-97489cc87ca2

{
    "message": "horray!",
    "success": true
}

This whole exchange is, of course, much faster with the python library:

import fatcat_openapi_client
from fatcat_openapi_client.rest import ApiException

conf = fatcat_openapi_client.Configuration()
conf.host = "https://api.fatcat.wiki/v0"
conf.api_key_prefix["Authorization"] = "Bearer"
conf.api_key["Authorization"] = "AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug="
api = fatcat_openapi_client.DefaultApi(fatcat_openapi_client.ApiClient(conf))

c = fatcat_openapi_client.ContainerEntity(
    name="Test Journal",
    issnl="1234-5678",
)
editgroup = api.create_editgroup(
    fatcat_openapi_client.Editgroup(description="my test editgroup"))
c_edit = api.create_container(editgroup.editgroup_id, c)
api.update_editgroup(editgroup.editgroup_id, editgroup, submit=True)

# only if you have permissions
api.accept_editgroup(editgroup.editgroup_id)

Updating an Existing Entity

It is important to ensure that edits/updates are idempotent, in this case meaning that if you ran the same script twice in quick succession, no mutation or update would occur the second time. This is usually achieved by always fetching entities just before an edit and checking that updates are actually necessary.

The basic process is to:

  1. Fetch (GET) or Lookup (GET) the existing entity. Check that edit is actually necessary!
  2. Create (POST) a new editgroup
  3. Update (PUT) the entity
  4. Submit (PUT) the editgroup for review

Python example code:

import fatcat_openapi_client
from fatcat_openapi_client.rest import ApiException

conf = fatcat_openapi_client.Configuration()
conf.host = "https://api.fatcat.wiki/v0"
conf.api_key_prefix["Authorization"] = "Bearer"
conf.api_key["Authorization"] = "AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug="
api = fatcat_openapi_client.DefaultApi(fatcat_openapi_client.ApiClient(conf))

new_name = "Classical and Quantum Gravity"
c = api.get_container('j5iizqxt2rainmxg6nfmpg2ds4')
if c.name == new_name:
    # nothing to do; stop before creating an empty editgroup
    print("Already updated!")
    raise SystemExit(0)

c.name = new_name

editgroup = api.create_editgroup(
    fatcat_openapi_client.Editgroup(description="my test container editgroup"))
c_edit = api.update_container(editgroup.editgroup_id, c.ident, c)
api.update_editgroup(editgroup.editgroup_id, editgroup, submit=True)
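The "check that the edit is actually necessary" step generalizes beyond a single field. A small sketch, working on plain dicts rather than client library objects (the helper name and the sample values are made up for illustration):

```python
def fields_needing_update(current, desired):
    """Return the subset of `desired` that differs from `current`.

    Both arguments are plain dicts of field name to value. An empty result
    means no edit is necessary and the script should stop without creating
    an editgroup.
    """
    return {
        field: value
        for field, value in desired.items()
        if current.get(field) != value
    }


# made-up sample values, not real catalog data
current = {"name": "Class. Quantum Gravity", "publisher": "IOP Publishing"}
desired = {"name": "Classical and Quantum Gravity", "publisher": "IOP Publishing"}
changes = fields_needing_update(current, desired)
print(changes)
```

Running the same comparison after the update has been accepted yields an empty dict, which is what makes the script safe to re-run.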

Merging Duplicate Entities

Like other mutations, be careful that any merge operations do not clobber the catalog if run multiple times.

The basic process is to:


  1. Fetch (GET) both entities. Ensure that merging is still required.
  2. Decide which will be the "primary" entity (the other will redirect to it)
  3. Create (POST) a new editgroup
  4. Update (PUT) the "primary" entity with any updated metadata merged from the other entity (optional), and the editgroup id set
  5. Update (PUT) the "other" entity with the redirect flag set to the primary's identifier.
  6. Submit (PUT) the editgroup for review
  7. Somebody (human or bot) with admin privileges will Accept (POST) the editgroup.

Python example code:

import fatcat_openapi_client
from fatcat_openapi_client.rest import ApiException

conf = fatcat_openapi_client.Configuration()
conf.host = "https://api.fatcat.wiki/v0"
conf.api_key_prefix["Authorization"] = "Bearer"
conf.api_key["Authorization"] = "AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug="
api = fatcat_openapi_client.DefaultApi(fatcat_openapi_client.ApiClient(conf))

left = api.get_creator('iimvc523xbhqlav6j3sbthuehu')
right = api.get_creator('lav6j3sbthuehuiimvc523xbhq')

# check that merge/redirect hasn't happened yet
assert left.state == 'active' and right.state == 'active'
assert left.redirect is None and right.redirect is None
assert left.revision != right.revision

# decide to merge "right" into "left"
if not left.orcid:
    left.orcid = right.orcid
if not left.surname:
    left.surname = right.surname

editgroup = api.create_editgroup(
    fatcat_openapi_client.Editgroup(description="my test creator merge editgroup"))
left_edit = api.update_creator(editgroup.editgroup_id, left.ident, left)
# redirect the "other" entity to the primary
right_edit = api.update_creator(editgroup.editgroup_id, right.ident,
    fatcat_openapi_client.CreatorEntity(redirect=left.ident))
api.update_editgroup(editgroup.editgroup_id, editgroup, submit=True)
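The optional metadata merge (filling gaps in the primary entity from the other one) can be written once and reused across fields. The following is a sketch on plain dicts with placeholder values; the helper name is hypothetical.

```python
def merge_missing_fields(primary, other, fields):
    """Copy values from `other` into `primary` where `primary` has none.

    Returns the list of field names that were filled in. Existing values on
    `primary` are never overwritten, so running this twice changes nothing.
    """
    filled = []
    for field in fields:
        if not primary.get(field) and other.get(field):
            primary[field] = other[field]
            filled.append(field)
    return filled


# placeholder values for illustration only
left = {"surname": "Example", "orcid": None}
right = {"surname": "Example", "orcid": "ORCID-PLACEHOLDER"}
filled = merge_missing_fields(left, right, ["orcid", "surname"])
print(filled, left)
```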

Batch Create Entities

When importing large numbers (thousands) of entities, it can be faster to use the batch create operations instead of individual editgroup and entity creation. Using the batch endpoints requires care because the potential to pollute the catalog with bad entities (and the effort required to clean up) can be much larger.

These methods always require the admin role, because they are the equivalent of creation and editgroup accept.

It is not currently possible to do batch updates or deletes in a single request.

The basic process is:

  1. Confirm that input entities should be created (eg, using identifier lookups), and bundle into groups of 50-100 entities.
  2. Batch create (POST) a set of entities, with editgroup metadata included along with list of entities (all of a single type). Entire batch is inserted in a single request.

Python example code:

import fatcat_openapi_client
from fatcat_openapi_client.rest import ApiException

conf = fatcat_openapi_client.Configuration()
conf.host = "https://api.fatcat.wiki/v0"
conf.api_key_prefix["Authorization"] = "Bearer"
conf.api_key["Authorization"] = "AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug="
api = fatcat_openapi_client.DefaultApi(fatcat_openapi_client.ApiClient(conf))

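# the DOIs below are placeholders for illustration
releases = [
    fatcat_openapi_client.ReleaseEntity(
        ext_ids=fatcat_openapi_client.ReleaseExtIds(doi="10.123/456"),
        title="Dummy Release",
    ),
    fatcat_openapi_client.ReleaseEntity(
        ext_ids=fatcat_openapi_client.ReleaseExtIds(doi="10.123/789"),
        title="Another Dummy Release",
    ),
    # ... more releases here ...
]

# check that releases don't exist already; this could be a filter
for r in releases:
    existing = None
    try:
        existing = api.lookup_release(doi=r.ext_ids.doi)
    except ApiException as ae:
        assert ae.status == 404
    assert existing is None

# ensure our batch size isn't too large
assert len(releases) <= 100

editgroup = api.create_release_auto_batch(
    fatcat_openapi_client.ReleaseAutoBatch(
        editgroup=fatcat_openapi_client.Editgroup(
            description="my test batch",
        ),
        entity_list=releases,
    )
)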
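When there are more entities than fit comfortably in one request, the list can be split into batches first. A minimal sketch of the bundling step, with a default size taken from the upper end of the 50-100 range suggested above:

```python
def chunked(items, size=100):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# integers stand in for entities here
batches = list(chunked(list(range(250)), size=100))
print([len(b) for b in batches])
```

Each batch would then be passed to a separate batch create call.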

Import New Files Linked to Releases

Let's say you knew of many open access PDFs, including their SHA-1, size, and a URL:

10.123/456  7043946a7afe0ee32c9d4c22a9b3fc2ba6d34b42    7238    https://archive.org/download/open_access_files/456.pdf
10.123/789  350a8d5c6fac151ec2c81d4df5d58d14aeefc72f    1277    https://archive.org/download/open_access_files/789.pdf
10.123/900  9d9a9868a661b13c32fd38021addadb7b4a31122     166    https://archive.org/download/open_access_files/900.pdf

The process for adding these could be something like:

  1. For each row, check if file with SHA-1 exists; if so, skip
  2. For each row, lookup the release by DOI; if it doesn't exist, skip
  3. Transform into File entities
  4. Group entities into batches
  5. Submit batches to API
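Before any API calls are made, each input row needs to be parsed and sanity-checked. The sketch below assumes tab-separated columns in the order shown above (DOI, SHA-1, size, URL); the validation rules and the `parse_row` name are illustrative choices, not part of any Fatcat tooling.

```python
import re

SHA1_RE = re.compile(r"^[0-9a-f]{40}$")


def parse_row(line):
    """Parse one tab-separated input line into a dict, or return None if invalid."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    if len(fields) != 4:
        return None
    doi, sha1, size, url = fields
    sha1 = sha1.lower()
    if not doi.startswith("10.") or not SHA1_RE.match(sha1) or not size.isdigit():
        return None
    return {"doi": doi, "sha1": sha1, "size": int(size), "url": url}


row = parse_row(
    "10.123/456\t7043946a7afe0ee32c9d4c22a9b3fc2ba6d34b42\t7238\t"
    "https://archive.org/download/open_access_files/456.pdf\n"
)
print(row)
```

Rows that fail validation come back as None and can be logged and skipped rather than aborting the whole import.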

There are multiple ways to structure code to do this. You may want to look at the importer class under python/fatcat_tools/importers/common.py, and other existing import scripts in that directory for a framework to structure this type of import.

Here is a simpler example using only the python library:

# TODO: actually test this code

import sys
import fatcat_openapi_client
from fatcat_openapi_client import *
from fatcat_openapi_client.rest import ApiException

conf = fatcat_openapi_client.Configuration()
conf.host = "https://api.fatcat.wiki/v0"
conf.api_key_prefix["Authorization"] = "Bearer"
conf.api_key["Authorization"] = "AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug="
api = fatcat_openapi_client.DefaultApi(fatcat_openapi_client.ApiClient(conf))


def try_row(fields):
    # remove any extra whitespace
    fields = [f.strip() for f in fields]
    doi = fields[0]
    sha1 = fields[1]
    size = int(fields[2])
    url = fields[3]

    # check for existing file
    try:
        existing = api.lookup_file(sha1=sha1)
        print("File with SHA-1 exists: {}".format(sha1))
        return None
    except ApiException as ae:
        if ae.status != 404:
            raise ae

    # lookup release by DOI
    try:
        release = api.lookup_release(doi=doi)
    except ApiException as ae:
        if ae.status == 404:
            print("No existing release for DOI: {}".format(doi))
            return None
        else:
            raise ae

    fe = FileEntity(
        release_ids=[release.ident],
        sha1=sha1,
        size=size,
        urls=[FileUrl(rel="archive", url=url)],
    )
    return fe

# set to True only if your editor account has the admin role
HAVE_ADMIN = False

def run(input_file):
    file_entities = []
    for line in input_file:
        fe = try_row(line.split('\t'))
        if fe:
            file_entities.append(fe)
    if not file_entities:
        print("Tried all lines, nothing to do!")
        return

    # TODO: iterate over fixed-size batches
    first_batch = file_entities[:100]

    # easy way: create as a batch if you have permission
    if HAVE_ADMIN:
        editgroup = api.create_file_auto_batch(FileAutoBatch(
            editgroup=Editgroup(
                description="my test batch",
            ),
            entity_list=first_batch,
        ))
        return

    # longer way: create one-at-a-time
    editgroup = api.create_editgroup(Editgroup(
        description="batch import of files-by-DOI. Data from XYZ",
        extra={
            # put the name of your script/project here
            'agent': 'tutorial_example_script',
        },
    ))

    for fe in first_batch:
        edit = api.create_file(editgroup.editgroup_id, fe)

    # submit for review
    api.update_editgroup(editgroup.editgroup_id, editgroup, submit=True)
    print("Submitted editgroup: https://fatcat.wiki/editgroup/{}".format(editgroup.editgroup_id))


if __name__=='__main__':
    if len(sys.argv) != 2:
        print("Pass input TSV file as argument")
        sys.exit(-1)
    with open(sys.argv[1], 'r') as input_file:
        run(input_file)


Our aspiration is for this to be an open, collaborative project, with individuals and organizations of all sizes able to participate. There is not much structure or documentation on how volunteers can get started or be most helpful, but perhaps we can work together on that as well!

The best place to organize and coordinate right now is the gitter chatroom. Gitter is described as "for developers", but we use it for everybody, and you don't need an invitation.

Want to help out? Below are a few example roles you could play.

Anybody: Find Bugs, Suggest Improvements

The user sign-up and editing workflow on fatcat.wiki is currently pretty poor. How could this experience be improved and better documented? Specific ideas, suggestions and diagrams would be very helpful. You don't need to know how to program or about web technologies to contribute; hand drawings and example text can be sufficient.

Community Organizer: Partner and Volunteer Organizing

Are you passionate about Open Access and want to help build a community around preservation and universal access to knowledge? We could use help structuring an editing community, and communicating with partner projects like Wikidata to ensure we are not duplicating efforts.

A good example of a project to organize would be improving journal-level metadata in wikidata, including journal homepages, and linking to fatcat "container" entities.

Research Librarian: Identify Missing Content

If you have an interest in a specific scholarly field, you could give us feedback on how well fatcat is doing at preserving at-risk open access content. We know we have a lot of work to do, but both specific examples of missing publications and broader patterns and gaps are helpful to know about. Some missing content we know we don't have, but there are surely entire categories of in-scope content that we do not even know are missing!

Metadata Librarian: Schema Improvements

Are you an experienced wrangler of BibFrame, MARC, BibTeX, RDF, OAI-PMH, and Citation Style Language? Our data model and entity schemas are bespoke (sorry!) and designed to evolve over time. There might be related efforts and new controlled vocabularies we could adopt or align with, or small changes to the schema might enable new use cases. It could be as simple as identifying and prioritizing new external identifiers (PIDs) to allow. Let us know what we got right and what needs improvement!

Power Editor: Better Interfaces

Are you super experienced with data entry, editing, and corrections? Do you have ideas on how our interface could be improved, or what kinds of new interfaces and tools could be built to support effective editing? Our open API allows third-party interfaces to make edits on individuals' behalf, meaning new tools can be built for specific patterns of editing or user contribution.

Data Scientist: Wrangling and Visualization

We have hundreds of gigabytes of metadata to transform and normalize before importing, and already have a rich open dataset with millions of linked entities. Our elasticsearch analytics database has an open read-only endpoint (https://search.fatcat.wiki), which is used to power our coverage interface. What other interactive visualizations could be built? What tools should we be using to wrangle bibliographic metadata better and faster?

Author: Verify Metadata

Do you publish research documents, and want to ensure they are accessible to the broadest audience today and in the future? As with many academic search engines, you can add papers and link an author profile to specific publications. Unlike others, you can also ensure uploaded pre-prints and other open versions of your research are found and linked using the "save paper now" feature, and you can correct any errors made by publishers and bots.

Translation and Accessibility Advocate

Some of our web interfaces have existing internationalization infrastructure, and translations can be contributed directly.

Other projects need help getting translation infrastructure in place, and all of our projects could use review and recommendations for improvement by experts in web accessibility. For example, if you use a screen reader, feedback on which parts of our services are most difficult to use is very helpful.

Software Developer: Bot Wrangling

Fatcat is structured such that all changes to the catalog go through an open API. This includes human edits through the web interface, but the large majority of edits are made by bots. You could write a new bot to help...

  • review human edits (from the "reviewable" queue) to "lint" for typos, missing fields, or other problems, and then leave an annotation
  • harvest, transform, and import metadata from additional subject- and region-specific sources
  • find and clean-up patterns of poor or incorrect metadata already in the catalog

SQL Expert: Database Scaling

We have a large (500+ GByte) PostgreSQL database backing the catalog. This is working great so far, but we have concerns about how the catalog will scale further, especially if bots start making multiple updates per entity. You could review our SQL schema and recommend improvements, or give feedback and advice on how to switch to a distributed primary datastore.

Financial Supporter

Short on time? As a US 501(c)(3) non-profit, the Internet Archive always appreciates and makes good use of donations.

Software Contributions

Bugs and patches can be filed on Github at: https://github.com/internetarchive/fatcat

When considering making a non-trivial contribution, it can save review time and duplicated work to post an issue with your intentions and plan. New code and features must include unit tests before being merged, though we can help with writing them.

Norms and Policies

These social norms are explicitly expected to evolve and mature if the number of contributors to the project grows. It is important to have some policies as a starting point, but also important not to set these policies in stone until they have been reviewed.

See also the Code of Conduct and Privacy Policy.

Metadata Licensing

The Fatcat catalog content license is the Creative Commons Zero ("CC-0") license, which is effectively a public domain grant. This applies to the catalog metadata itself (titles, entity relationships, citation metadata, URLs, hashes, identifiers), as well as "meta-meta-data" provided by editors (edit descriptions, provenance metadata, etc).

The core catalog is designed to contain only factual information: "this work, known by this title and with these third-party identifiers, is believed to be represented by these files and published under such-and-such venue". As a norm, sourcing metadata (for attribution and provenance) is retained for each edit made to the catalog.

Abstracts are a notable exception to this policy: no copyright claim or license is made for them. Abstract content is kept separate from core catalog metadata; downstream users need to make their own decision regarding reuse and distribution of this material.

As a social norm, it is expected (and appreciated!) that downstream users of the public API and/or bulk exports provide attribution, and even transitive attribution (acknowledging the original source of metadata contributed to Fatcat). As an academic norm, researchers are encouraged to cite the corpus as a dataset (when this option becomes available). However, neither of these norms is enforced via the copyright mechanism.

As a strong norm, editors should expect full access to the full corpus and edit history, including all of their contributions.

Immutable History

All editors agree to the licensing terms, and understand that their full history of contributions is made irrevocably public. Edits and contributions may be reverted, but the history (and content) of their edits is retained. Edit history is not removed from the corpus on the request of an editor or when an editor closes their account.

In an emergency situation, such as non-bibliographic content getting encoded in the corpus by bypassing normal filters (eg, base64 encoding hate crime content or exploitative photos, as has happened to some blockchain projects), the ecosystem may decide to collectively, in a coordinated manner, expunge specific records from their history.

Documentation Licensing

This guide ("The Fatcat Guide") is licensed under the Creative Commons Attribution license.

Software Licensing

The Fatcat software project licensing policy is to adopt strong copyleft licenses for server software (where the majority of software development takes place), permissive licenses for client library and bot framework software, and CC-0 (public domain grant) licensing for declarative interface specifications (such as SQL schemas and REST API specifications).

Fatcat Code of Conduct

In this early stage of the project, this document is a work in progress. In particular there is no moderation team or policy for responding to concerns in online discussions. However, it is important to clarify norms and expectations as early as possible.

To contact the Internet Archive privately about conduct concerns or to report unacceptable behavior, you can email ethics@archive.org.


  • We are committed to providing a friendly, safe and welcoming environment for all, regardless of level of experience, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, religion, nationality, or other similar characteristic.

  • Please avoid using overtly sexual aliases or other nicknames that might detract from a friendly, safe and welcoming environment for all.

  • Please be kind and courteous. There’s no need to be mean or rude.

  • Respect that people have differences of opinion and that every design or implementation choice carries a trade-off and numerous costs. There is seldom a right answer.

  • Please keep unstructured critique to a minimum. If you have solid ideas you want to experiment with, make a fork and see how it works.

  • This Code of Conduct applies to all online project spaces (including the catalog itself, code repositories, mailing lists, chat rooms, forums, and comment threads), as well as any physical spaces such as conference gatherings or meetups.

  • All participants are expected to respect this code, regardless of their position or record of contributions to the project.

Unacceptable behavior

The following types of behavior are unacceptable in the Fatcat project, both online and in-person, and constitute code of conduct violations.

Abusive behavior

  • Harassment: including offensive verbal comments related to gender, sexual orientation, disability, physical appearance, body size, race, or religion, as well as sexual images in public spaces, deliberate intimidation, stalking, following, harassing photography or recording, inappropriate physical contact, and unwelcome sexual or romantic attention.

  • Threats: threatening someone physically or verbally. For example, threatening to publicize sensitive information about someone’s personal life.

Unwelcoming behavior

  • Blatant-isms: saying things that are explicitly racist, sexist, homophobic, etc. For example, arguing that some people are less intelligent because of their gender, race or religion. Subtle -isms and small mistakes made in conversation are not code of conduct violations. However, repeating something after it has been pointed out to you that you broke a social rule, or antagonizing or arguing with someone who has pointed out your subtle -ism, is considered unwelcoming behavior, and is not allowed in the project.

  • Maliciousness towards other participants: deliberately attempting to make others feel bad, name-calling, singling out others for derision or exclusion. For example, telling someone they’re not a real programmer or that they don’t belong in the project.

  • Being especially unpleasant: for example, if multiple community members report annoying, rude, or especially distracting behavior.

  • Spamming, trolling, flaming, baiting or other attention-stealing behavior is not welcome.

About This Document

The Fatcat Code of Conduct is inspired by, and derived from:

Privacy Policy

It is important to note that this section is currently aspirational: the servers hosting early deployments of Fatcat are largely in a default configuration and have not been audited to ensure that these guidelines are being followed.

It is a goal for Fatcat to conduct as little surveillance of reader and editor behavior and activities as possible. In practical terms, this means minimizing the overall amount of logging and collection of identifying information. This is in contrast to submitted edit content, which is captured, preserved, and republished as widely as possible.

The general intention is to:

  • not use third-party tracking (via extra browser-side requests or JavaScript)
  • collect aggregate metrics (overall hit numbers), but not log individual interactions ("this IP visited this page at this time")

Exceptions will likely be made:

  • temporary caching of IP addresses may be necessary to implement rate-limiting and debug traffic spikes
  • exception logging, abuse detection, and other exceptional circumstances
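
As a rough illustration of the intentions and exceptions above, the following Python sketch keeps only aggregate per-page hit counts, while holding IP addresses in a short-lived in-memory cache used solely for rate-limiting. This is not how the Fatcat servers are implemented; the class name `PrivateMetrics`, its parameters, and the default limits are all hypothetical.

```python
import time
from collections import Counter


class PrivateMetrics:
    """Hypothetical helper: aggregate hit counts plus a short-lived IP cache."""

    def __init__(self, window_seconds=60, max_hits=100):
        self.window_seconds = window_seconds  # assumed rate-limit window
        self.max_hits = max_hits              # assumed per-window request cap
        self.page_hits = Counter()            # aggregate counts only: no IPs, no timestamps
        self._recent = {}                     # ip -> (window_start, hits); purged on expiry

    def record(self, ip, path, now=None):
        """Count a hit; return False if the client exceeded the rate limit."""
        now = time.time() if now is None else now
        # Forget every cached address whose rate-limit window has expired.
        self._recent = {
            addr: (start, hits)
            for addr, (start, hits) in self._recent.items()
            if now - start < self.window_seconds
        }
        start, hits = self._recent.get(ip, (now, 0))
        if hits >= self.max_hits:
            return False
        self._recent[ip] = (start, hits + 1)
        # Only the page path is counted; the address never reaches the metrics.
        self.page_hits[path] += 1
        return True
```

In this sketch an address is held in memory only until its window expires and is never written to disk, so the retained data answers "how many hits did this page get" but not "which IP visited this page at this time".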

Some uncertain areas of privacy include:

  • should third-party authentication identities be linked to editor ids? what about the specific case of ORCID if used for login?
  • what about discussion and comments on edits? should conversations be included in full history dumps? should editors be allowed to update or remove comments?


2020 Workshop On Open Citations And Open Scholarly Metadata 2020 - Fatcat (video on archive.org)

2019-10-25 FORCE2019 - Perpetual Access Machines: Archiving Web-Published Scholarship at Scale (video on youtube.com)

Blog Posts And Press

2020-09-17 blog.dshr.org - Don't Say We Didn't Warn You

2020-09-15 blog.archive.org - How the Internet Archive is Ensuring Permanent Access to Open Access Journal Articles

2020-02-18 blog.dshr.org - The Scholarly Record At The Internet Archive

2019-04-18 blog.dshr.org - Personal Pods and Fatcat

2018-10-03 blog.dshr.org - Brief Talk At Internet Archive Event

2018-03-05 blog.archive.org - Andrew W. Mellon Foundation Awards Grant to the Internet Archive for Long Tail Journal Preservation


2020-09-08 sciencemag.org - Dozens of scientific journals have vanished from the internet, and no one preserved them

2020-09-10 nature.com - More than 100 scientific journals have disappeared from the Internet

2020-08-27 arxiv.org - Open is not forever: a study of vanished open access journals


Ito, Joichi. “Citing Blogs.” Joi Ito’s Web (2018). Accessed March 11, 2019. https://joi.ito.com/weblog/2018/05/28/citing-blogs.html.
Karaganis, Joe, ed. Shadow Libraries: Access to Knowledge in Global Higher Education. Cambridge, MA : Ottawa, ON: The MIT Press ; International Development Research Centre, 2018.
Khabsa, Madian, and C. Lee Giles. “The Number of Scholarly Documents on the Public Web.” PLOS ONE 9, no. 5 (May 9, 2014): e93949.
Knoth, Petr, and Zdenek Zdrahal. “CORE: Three Access Levels to Underpin Open Access.” D-Lib Magazine 18, no. 11/12 (November 2012). Accessed March 11, 2019. http://www.dlib.org/dlib/november12/knoth/11knoth.html.
Ortega, Jose Luis. Academic Search Engines: New Information Trends and Services for Scientists on the Web. Chandos information professional series. Philadelphia, PA: Elsevier, 2014.
Page, Roderic. “Notes on Bibliographic Metadata in JSON.” Last modified July 12, 2017. Accessed March 11, 2019. https://github.com/rdmpage/bibliographic-metadata-json.
Piwowar, Heather, Jason Priem, Vincent Larivière, Juan Pablo Alperin, Lisa Matthias, Bree Norlander, Ashley Farley, Jevin West, and Stefanie Haustein. “The State of OA: A Large-Scale Analysis of the Prevalence and Impact of Open Access Articles.” PeerJ 6 (February 13, 2018): e4375.
Ramalho, Luciano G. “From ISIS to CouchDB: Databases and Data Models for Bibliographic Records.” The Code4Lib Journal, no. 13 (April 11, 2011). Accessed March 11, 2019. https://journal.code4lib.org/articles/4893.
rclark1. “DOI-like Strings and Fake DOIs.” Website. Crossref. Accessed March 11, 2019. https://www.crossref.org/blog/doi-like-strings-and-fake-dois/.
Svenonius, Elaine. The Intellectual Foundation of Information Organization. First MIT Press paperback ed. Digital libraries and electronic publishing. Cambridge, Mass.: MIT Press, 2009.
Van de Sompel, Herbert, Robert Sanderson, Martin Klein, Michael L. Nelson, Bernhard Haslhofer, Simeon Warner, and Carl Lagoze. “A Perspective on Resource Synchronization.” D-Lib Magazine 18, no. 9/10 (September 2012). Accessed March 11, 2019. http://www.dlib.org/dlib/september12/vandesompel/09vandesompel.html.
Wright, Alex. Cataloging the World: Paul Otlet and the Birth of the Information Age. Oxford ; New York: Oxford University Press, 2014.
“Citation Style Language.” Citation Style Language. Accessed March 11, 2019. https://citationstyles.org/.
“Open Archives Initiative Protocol for Metadata Harvesting.” Accessed March 11, 2019. https://www.openarchives.org/pmh/.

About This Guide

This guide is generated from markdown text files using the mdBook tool. The source is mirrored on GitHub at https://github.com/internetarchive/fatcat.

Contributions and corrections are welcome! If you create a (free) account on GitHub you can submit comments and corrections as "Issues", or directly edit the source and submit "Pull Requests" with changes.

This guide is licensed under a Creative Commons Attribution (CC-BY) license, meaning you are free to redistribute, sell, and extend it without special permission, as long as you credit the original authors.