De novo design of protein structure and function with RFdiffusion

There has been considerable recent progress in designing new proteins using deep-learning methods1–9. Despite this progress, a general deep-learning framework for protein design that enables solution of a wide range of design challenges, including de novo binder design and design of higher-order symmetric architectures, has yet to be described. Diffusion models10,11 have had considerable success in image and language generative modelling but limited success when applied to protein modelling, probably due to the complexity of protein backbone geometry and sequence–structure relationships. Here we show that by fine-tuning the RoseTTAFold structure prediction network on protein structure denoising tasks, we obtain a generative model of protein backbones that achieves outstanding performance on unconditional and topology-constrained protein monomer design, protein binder design, symmetric oligomer design, enzyme active site scaffolding and symmetric motif scaffolding for therapeutic and metal-binding protein design. We demonstrate the power and generality of the method, called RoseTTAFold diffusion (RFdiffusion), by experimentally characterizing the structures and functions of hundreds of designed symmetric assemblies, metal-binding proteins and protein binders. The accuracy of RFdiffusion is confirmed by the cryogenic electron microscopy structure of a designed binder in complex with influenza haemagglutinin that is nearly identical to the design model. In a manner analogous to networks that produce images from user-specified inputs, RFdiffusion enables the design of diverse functional proteins from simple molecular specifications.
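To make the phrase "protein structure denoising tasks" concrete, the following is a minimal sketch of how training inputs for a denoising model can be produced by corrupting known coordinates with Gaussian noise. The abstract does not specify the noising process, so everything here is an assumption made for illustration: a generic DDPM-style forward process applied to coordinates only, a linear variance schedule, and hypothetical function names (`linear_beta_schedule`, `cumulative_alphas`, `noise_coordinates`). It is not the RFdiffusion implementation.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (an illustrative choice)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product over s <= t of (1 - beta[s])."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_coordinates(coords, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, I)."""
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [
        tuple(signal * c + sigma * rng.gauss(0.0, 1.0) for c in xyz)
        for xyz in coords
    ]


# Toy "backbone": five points on a line, 3.8 units apart (made-up coordinates).
backbone = [(3.8 * i, 0.0, 0.0) for i in range(5)]
alpha_bars = cumulative_alphas(linear_beta_schedule(200))
rng = random.Random(0)
lightly_noised = noise_coordinates(backbone, alpha_bars[0], rng)
heavily_noised = noise_coordinates(backbone, alpha_bars[-1], rng)
```

In this sketch a denoising network would be trained to recover `backbone` from a noised copy at a randomly chosen step. New structures would then be generated by starting from pure noise and applying the network repeatedly in reverse.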


Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences
Behavioural & social sciences
Ecological, evolutionary & environmental sciences

For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.

Sample size
Sample sizes varied with the analysis performed and are detailed in the figure legends. They were chosen before each experiment by the experimenter, arbitrarily rather than by statistical test, but were large enough to draw meaningful conclusions.

Replication
Each dataset contains many (n reported in figure legends) independent measurements.

Randomization
N/A (all analysis was automated, so each datapoint was generated computationally under controlled and uniform settings)

Blinding
N/A (all analysis was automated, so there was no user intervention that could have introduced bias)

nature portfolio | reporting summary
April 2023
