AbNatiV vs OASis and T20: Comparison of Antibody Humanization Scoring Tools

Many early therapeutic antibodies were derived from mice. When administered to humans, they often triggered human anti-mouse antibody (HAMA) responses, which led to rapid drug clearance. Early humanization relied on complementarity-determining region (CDR) grafting, which transfers mouse CDRs onto human antibody frameworks. This only partially addressed the problem: amino acids in the framework regions (FRs), even those far from the CDRs, can still affect antigen recognition, so grafted antibodies require repeated experimental validation.

This challenge led to the development of antibody humanization scoring methods. By measuring the similarity between an antibody sequence and human antibodies, these approaches estimate how “human-like” the antibody is and help predict potential immunogenicity. This article focuses on these scoring methods and provides a brief overview of the main tools available.

The Structure of the Antibody Variable Region

The variable region of an antibody consists of framework regions (FRs) and complementarity-determining regions (CDRs). CDRs, located at the tips of the variable region, make direct contact with the antigen, while FRs support the CDRs and maintain their three-dimensional structure.

FRs and CDRs differ across species in different ways. Framework regions carry species-specific residues, which is why non-human frameworks are recognized as foreign. CDRs, by contrast, are highly variable in sequence, yet most of them adopt a limited set of canonical loop conformations that are broadly conserved across species, allowing a finite repertoire of loop shapes to recognize an almost unlimited variety of antigens. Humanization therefore involves a delicate trade-off: CDRs are essential for maintaining binding affinity and must retain murine features, yet they can also carry T-cell epitopes that trigger immune responses. Managing this balance is one of the main challenges in predicting the immunogenicity of humanized antibodies.

Comparing Antibody Humanization Scoring Methods: T20/Z-score, OASis, MG Model vs. AbNatiV

T20/Z-score Humanization Scoring

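The Z-score method assesses how similar a given antibody sequence is to all sequences in a human reference database (such as human germline genes), typically measured as sequence identity, the complement of the Hamming distance between aligned sequences. The average similarity is then expressed as a Z-score, the statistical measure of how many standard deviations a value lies from the mean, here the mean score of known human sequences; this is where the method gets its name. T20 (Gao et al., 2013) is an improved version that averages the similarity to only the 20 most closely related human sequences. Clavero-Álvarez et al. (2018) demonstrated that this approach more effectively distinguishes between human and mouse sequences.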
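Both scores can be sketched in a few lines of Python. This is a minimal illustration, assuming the query and reference sequences are already aligned to the same numbering so that identity can be computed position by position. The function names and toy sequences are hypothetical, and a real implementation would use a curated human reference set and the published normalization.

```python
from statistics import mean, pstdev


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned sequences
    (equivalently, 1 minus the normalized Hamming distance)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def z_score(query: str, references: list[str], human_means: list[float]) -> float:
    """Mean identity of the query to all references, expressed in standard
    deviations from the mean of the same statistic for known human sequences."""
    query_mean = mean(identity(query, r) for r in references)
    sigma = pstdev(human_means)
    if sigma == 0:
        raise ValueError("human_means must not all be identical")
    return (query_mean - mean(human_means)) / sigma


def t20(query: str, references: list[str], k: int = 20) -> float:
    """Mean identity to the k most similar references (k = 20 gives T20)."""
    top = sorted((identity(query, r) for r in references), reverse=True)[:k]
    return sum(top) / len(top)


# Toy usage with five-residue fragments (illustrative only).
refs = ["EVQLV", "EVQLL", "QVQLV", "EVKLV"]
print(t20("EVQLV", refs, k=2))  # 0.9 (mean of 1.0 and 0.8)
print(z_score("EVQLV", refs, human_means=[0.7, 0.8, 0.9]))
```

Sorting every identity costs O(n log n) per query; with a very large reference set, `heapq.nlargest(k, ...)` would avoid the full sort.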

The main advantages of T20/Z-score are simplicity, intuitive logic, and low computational cost, making it suitable for rapid screening. However, it does not account for correlations between residues and cannot capture co-evolutionary signals. Noise from the CDRs can also mask humanization signals in the framework regions.

In a benchmark study published in Nature Machine Intelligence (2024), Ramon et al. reported classification accuracies of 0.751 for Z-score and 0.786 for T20.

MG Model: Improving Humanization Accuracy via Residue Coupling Analysis

Clavero-Álvarez et al. (2018) introduced a scoring method based on statistical physics. At its core is the maximum entropy principle: given the observed single-site and pairwise residue frequencies, the model infers the least-biased probability distribution that reproduces those statistics. The model defines an energy function with both single-site and pairwise terms: single-site terms describe amino acid preferences at individual positions, while pairwise terms capture correlations and couplings between residues. Sequences with lower energy under a model trained on human antibodies are more probable under that model.

A key insight is that correlations between the VH and VL chains cannot be ignored. Traditional approaches usually treat heavy and light chains separately, but the MG model shows that analyzing both chains together substantially improves classification accuracy.
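The single-site and pairwise energy described above can be evaluated with a short Python sketch, where E(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j) and the probability of a sequence is proportional to exp(-E(s)). It assumes the fields h and couplings J have already been inferred; the inference step itself (the Gaussian approximation mentioned below) is not shown, and all parameter values are hypothetical toy numbers. Passing a concatenated VH+VL sequence lets the pairwise terms include inter-chain couplings, which is what joint analysis of both chains adds.

```python
from itertools import combinations


def energy(seq, h, J):
    """E(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j).

    h: list with one dict per position, mapping residue -> field value.
    J: dict keyed by position pairs (i, j) with i < j, each mapping
       (residue_i, residue_j) -> coupling value. Missing entries count as 0.
    """
    e = 0.0
    for i, aa in enumerate(seq):
        e -= h[i].get(aa, 0.0)
    for i, j in combinations(range(len(seq)), 2):
        e -= J.get((i, j), {}).get((seq[i], seq[j]), 0.0)
    return e


# Hypothetical 4-position model: positions 0-1 stand in for VH, 2-3 for VL.
h = [{"E": 1.0}, {"V": 0.5}, {"D": 0.8}, {"I": 0.4}]
J = {
    (0, 1): {("E", "V"): 0.3},  # intra-chain coupling
    (1, 2): {("V", "D"): 0.6},  # inter-chain (VH-VL) coupling
}
print(energy("EVDI", h, J))  # joint VH+VL sequence: inter-chain term included
print(energy("EVDA", h, J))  # unfavoured residue at position 3: higher energy
```

Scoring VH and VL as two separate sequences would drop the (1, 2) coupling entirely, which is exactly the information the joint model captures.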

The strengths of the MG model include capturing long-range correlations, outperforming sequence similarity-based methods in classification, and providing interpretable scores. Its limitations are reliance on Gaussian approximations, higher computational demands, and the absence of readily available public implementations.

OASis Humanness Score: Assessing Humanization Using Real Antibody Repertoires

OASis, the humanness score implemented in BioPhi, draws on the Observed Antibody Space (OAS) natural antibody database, which contains over 500 million authentic human antibody sequences. To calculate a humanness score, the antibody of interest is broken down into overlapping 9-mer peptides. Each peptide is then looked up in the database to determine how prevalent it is across human repertoires; the more prevalent the peptide, the more "human-like" that segment is. These peptide-level results are combined into an overall humanness score for the full antibody sequence.
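The 9-mer procedure can be sketched as follows. This is not BioPhi's implementation: it assumes a hypothetical lookup table `prevalence` that maps each peptide to the fraction of human donors in which it was observed (a toy stand-in for an OAS query), treats a peptide as human when that fraction meets a user-chosen threshold, and reports the fraction of peptides that pass.

```python
def kmers(seq: str, k: int = 9) -> list[str]:
    """All overlapping peptides of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def humanness_fraction(seq: str, prevalence: dict[str, float],
                       threshold: float, k: int = 9):
    """Return (fraction of k-mers counted as human, list of non-human k-mers).

    prevalence: hypothetical table peptide -> fraction of human subjects in
    which the peptide was seen; unseen peptides count as 0.
    """
    peptides = kmers(seq, k)
    if not peptides:
        raise ValueError("sequence is shorter than k")
    non_human = [p for p in peptides if prevalence.get(p, 0.0) < threshold]
    return 1 - len(non_human) / len(peptides), non_human


# Toy example: a 10-residue fragment yields two overlapping 9-mers.
table = {"EVQLVESGG": 0.9}  # hypothetical prevalence value
score, flagged = humanness_fraction("EVQLVESGGG", table, threshold=0.5)
print(score, flagged)  # 0.5 ['VQLVESGGG']
```

Returning the non-human peptides alongside the score is what makes this kind of method interpretable: each flagged peptide points to a stretch of sequence worth revisiting.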

Unlike T20/Z-score methods, which rely on germline gene sequences, OASis is based on observed immune repertoires, capturing the diversity generated by somatic hypermutation (SHM).

Key advantages include granular interpretability (each 9-mer is scored independently, allowing precise identification of human-like and non-human regions), reliance on a large-scale, real-world antibody database, and demonstrated clinical relevance, as reported in the BioPhi study (Prihoda et al., 2022). The OASis humanness score has become a popular tool among pharmaceutical companies for evaluating antibody humanization.

Limitations include dependence on database coverage, which may reduce accuracy for rare mutations, and ongoing challenges in scoring the CDR regions.

AbNatiV: A 2024 Deep Learning Tool for Antibody and VHH Humanization

AbNatiV (Antibody Nativeness Evaluator), published in Nature Machine Intelligence in 2024, is a deep learning-based platform that combines antibody humanization scoring with sequence engineering capabilities.

Technical Foundation

AbNatiV, one of the most recent humanization scoring methods (2024), uses a multi-model training strategy, with separate models for VH (heavy chain), Vκ and Vλ (light chains), and VHH antibodies, built on the V2 architecture (VH2/VL2/VHH2). Its core training technique is masked learning: certain positions in the sequence are randomly hidden, and the model learns to predict the hidden amino acids from the surrounding context.
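The masking step can be illustrated with a short Python sketch. It shows only how a training example is prepared; it does not reproduce AbNatiV's network, alignment scheme, or masking rate, and the 15% rate and `*` mask token used here are illustrative assumptions.

```python
import random

MASK = "*"  # hypothetical mask token; real models define their own vocabulary


def mask_sequence(seq: str, mask_rate: float, rng: random.Random):
    """Hide a random subset of positions.

    Returns the masked sequence and a dict {position: original residue};
    a model is trained to predict those residues from the visible context.
    """
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = rng.sample(range(len(seq)), n_mask)
    masked = list(seq)
    for p in positions:
        masked[p] = MASK
    return "".join(masked), {p: seq[p] for p in sorted(positions)}


seq = "EVQLVESGGGLVQPGGSLRL"
rng = random.Random(0)  # fixed seed for a reproducible example
masked, targets = mask_sequence(seq, mask_rate=0.15, rng=rng)
print(masked)
print(targets)
```

The training loss is computed only on the hidden positions, so the model must learn which residues are plausible given the rest of the sequence.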

Dual Functionality: Scoring and Humanization

AbNatiV functions both as a scoring tool and a humanization engine:

· Scoring: It provides a humanness score that estimates how likely a sequence is to belong to the natural human antibody distribution. Outputs include sequence-level scores and residue-level nativeness profiles, which can help predict immunogenicity risks.

· Humanization Engineering: AbNatiV can directly generate humanized sequences. Using either a dual-control strategy (hum_vhh) or single-control strategies (hum_vh_vl, hum_vh_vl_paired), non-human sequences can be converted into more human-like versions while aiming to preserve their antigen-binding properties.
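As an illustration of how a residue-level profile can guide engineering, the sketch below takes a list of per-position nativeness values and ranks the weakest framework positions as mutation candidates. The profile format, the CDR index set, and the threshold are hypothetical assumptions; this is not AbNatiV's output format, and AbNatiV computes its own sequence-level score rather than the simple mean used here.

```python
def flag_positions(profile, cdr_positions, threshold):
    """profile: list of per-position nativeness values (higher = more native).

    Returns (mean score, framework positions below threshold, lowest first).
    CDR positions are skipped because they are usually kept for binding.
    """
    mean_score = sum(profile) / len(profile)
    candidates = [i for i, v in enumerate(profile)
                  if i not in cdr_positions and v < threshold]
    candidates.sort(key=lambda i: profile[i])
    return mean_score, candidates


profile = [0.95, 0.40, 0.90, 0.20, 0.85, 0.30]  # hypothetical values
score, to_mutate = flag_positions(profile, cdr_positions={3}, threshold=0.5)
print(to_mutate)  # [5, 1]
```

Position 3 has the lowest value but is excluded because it is marked as CDR, mirroring the trade-off between humanness and binding discussed earlier.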

Performance: PR-AUC 0.965, Surpassing Other Methods

For the task of classifying VH sequences as human versus rhesus macaque, AbNatiV achieved a PR-AUC of 0.965, substantially outperforming AbLSTM, which reached only 0.721. Even after retraining, AbLSTM improved only to 0.777. Sequence reconstruction accuracy was 96%, higher than Sapiens (92%) and AbLSTM (81%).
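PR-AUC summarizes how well a score ranks positives above negatives across all decision thresholds. A common estimate is average precision, sketched below in plain Python, here taking human sequences as the positive class (an assumption). Ties in scores are not handled specially, and Ramon et al. may have used a different estimator, so this illustrates the metric rather than reproducing their benchmark.

```python
def average_precision(labels: list[int], scores: list[float]) -> float:
    """Step-wise estimate of PR-AUC.

    labels: 1 for positives, 0 for negatives.
    scores: higher means "more likely positive".
    Averages the precision at each rank where a positive is retrieved.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Toy example: one negative is ranked above the second positive.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

A perfect ranking gives 1.0; every negative ranked above a positive lowers the precision at that step, which is why a PR-AUC of 0.965 indicates very few rhesus sequences scored as more human than true human ones.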

Advantages

AbNatiV stands out for its combined scoring and humanization capabilities, multi-strategy coverage across VH, VL, and VHH antibodies, and user-friendly web service for easy access.

Should AbNatiV Be Used for Nanobody Humanization Scoring?

Before AbNatiV, several methods had been proposed for antibody humanization scoring. BioPhi germline content relies on a single nearest-neighbor approach; Hu-mAb uses supervised learning with random forests; IgReconstruct computes scores from nucleotide frequencies; and AbLSTM employs an LSTM deep learning architecture. Although most of these methods have since been outperformed by AbNatiV, they illustrate how the field developed. Beyond these mainstream approaches, humanness scoring for nanobodies (VHH) has gained increasing attention.

Comparison of Humanization Scoring Methods

This table summarizes key characteristics of the leading antibody humanization scoring tools:

Evaluation Metric	T20/Z-score	MG Model	OASis	AbNatiV
Classification Performance	0.75–0.79 (accuracy)	High	High	Very high (PR-AUC 0.965)
Clinical Immunogenicity Relevance	Supported by data	Supported by data	Supported by data	Supported by data
Interpretability	Sequence-level	Sequence-level	Residue-level (9-mer)	Residue-level
Computational Cost	Low	Moderate	Moderate	Moderate
Humanization Engineering Capability	No	No	Via the BioPhi platform (Sapiens)	Yes (direct sequence generation)
Tool Accessibility	Requires custom implementation	Requires custom implementation	Open-source (BioPhi)	Open-source

Comparing the Use Cases for T20, OASis, AbNatiV, and MG Models

T20/Z-score: Well-suited for preliminary screening, situations with limited computational resources, or rapid assessments. Its simple calculations and intuitive principles make it an effective first-pass tool.

MG Model: Ideal for scenarios requiring detailed analysis of residue correlations and VH–VL interactions. The statistical physics framework provides probabilistic interpretability, with cross-chain correlation modeling as a key strength.

OASis: Best for routine applications and projects that demand fine-grained insights. Its open-source nature and residue-level scoring make it particularly useful for guiding targeted humanization design.

AbNatiV: Recommended for antibody humanization projects or cases where the highest classification performance is desired. It provides a dedicated VHH model alongside its VH and VL models, which makes it particularly suited to nanobodies, and it achieved a PR-AUC of 0.965 in the human-versus-rhesus VH classification benchmark. Combining scoring and humanization capabilities, AbNatiV offers a one-stop solution for both assessment and design.

In practical antibody humanization projects, limitations typically fall into four areas:

Difficulty scoring CDR regions, since humanness scores do not directly predict T-cell-mediated immunogenicity.
Inadequate handling of cross-chain (VH–VL) correlations in most methods.
Indirect relationship to immunogenicity: humanness scores do not equate to actual immunogenicity, which is also influenced by factors such as administration route, dosage regimen, and patient population.
Limited clinical data, with most scoring methods lacking extensive clinical validation.

Future directions include developing multidimensional scoring systems (for example, combining humanness, developability, and affinity), iteratively refining deep learning models, expanding public databases such as OAS (for instance with antibody sequences collected during the COVID-19 pandemic, which can help train more powerful models), and creating computational–experimental feedback loops in which scoring guides experiments and experimental results refine the models.

Choosing the Right Humanization Scoring Tool

Humanization scoring has evolved from sequence-identity methods (Z-score/T20) to statistical modeling (MG model), repertoire-based peptide statistics (OASis), and now deep learning approaches (AbNatiV). Each method has its own strengths and limitations in accuracy, interpretability, and computational cost. In practice, cross-checking results across multiple approaches, combined with experimental confirmation, is recommended to guide final decisions.

Antibody Humanization Workflow at AlpVHHs

CDR grafting remains the main method employed in antibody humanization. At AlpVHHs, a typical workflow proceeds as follows:

Antibody modeling using state-of-the-art deep learning-based methods to predict precise three-dimensional structures.
Sequence alignment and template selection based on overall sequence similarity, loop structural compatibility, expression potential, and other developability criteria.
CDR grafting combined with identification of critical binding sites through structural analysis.
Back-mutations at key framework positions to restore or maintain antigen-binding affinity and CDR conformation.
Immunogenicity assessment using specialized software together with annotation of potential post-translational modification (PTM) sites based on empirical data.
Further engineering of regions exhibiting high immunogenicity risk, PTM liability, proline content, hydrophobicity, or aggregation propensity.
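At the sequence level, the grafting and back-mutation steps above reduce to choosing, position by position, whether to keep the donor (non-human) residue or the human template residue. The sketch below assumes both sequences are already aligned to a common numbering scheme and that the CDR and back-mutation positions were chosen upstream, for example from the structural analysis in the earlier steps; the function name and toy sequences are hypothetical.

```python
def graft(donor: str, acceptor: str, cdr_positions, back_mutations=()):
    """Build a humanized variant from two aligned sequences of equal length.

    donor: non-human variable domain (source of the CDRs).
    acceptor: human framework template.
    cdr_positions: aligned indices taken from the donor (the CDRs).
    back_mutations: framework indices where the donor residue is restored.
    """
    if len(donor) != len(acceptor):
        raise ValueError("donor and acceptor must be aligned to the same length")
    from_donor = set(cdr_positions) | set(back_mutations)
    return "".join(d if i in from_donor else a
                   for i, (d, a) in enumerate(zip(donor, acceptor)))


variant = graft("AAAAAAAA", "HHHHHHHH", cdr_positions={2, 3, 4}, back_mutations={6})
print(variant)  # HHAAAHAH
```

Each variant built this way would then be scored and tested experimentally, with every added back-mutation trading some humanness for binding.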

Scoring tools such as T20, OASis, and AbNatiV are integrated at multiple stages to support informed decision-making and variant optimization.

About AlpVHHs

AlpVHHs® focuses on nanobody research and services, supporting innovative pharmaceutical companies in early nanobody discovery and humanization optimization. Leveraging a professional technology platform, AlpVHHs has partnered with more than 300 domestic and international pharmaceutical R&D companies and has delivered more than 1,500 early-stage nanobody projects.

If you are seeking support in antibody humanization or SdAb pre-discovery CRO service, we welcome you to contact us. We are happy to provide professional expertise and assistance to help advance your drug development projects.

References:
[1]. Clavero-Álvarez A et al. Humanization of Antibodies using a Statistical Inference Approach. Scientific Reports. 2018;8:14820. doi:10.1038/s41598-018-32986-y.

[2]. Prihoda D et al. BioPhi: A platform for antibody design, humanization, and humanness evaluation based on natural antibody repertoires and deep learning. mAbs. 2022;14(1):2020203. doi:10.1080/19420862.2021.2020203.

[3]. Gao SH et al. Monoclonal antibody humanness score and its applications. BMC Biotechnology. 2013;13:55. doi:10.1186/1472-6750-13-55.

[4]. Ramon A et al. Assessing antibody and nanobody nativeness for hit selection and humanization with AbNatiV. Nature Machine Intelligence. 2024;6:74-91. doi:10.1038/s42256-023-00778-3.
