Resources
|
Research Highlights

Research Highlights
Applied Analytics Advance Soybean Breeding

Highlights:

  • Soybean data, from observable traits to genetics, can be combined to better predict the potential for crosses and lines in the breeding process.
  • A Virginia Tech model is being developed to use data and machine learning to reduce the guesswork and improve decision-making in soybean breeding.

Aerial view of soybean breeding plots at Virginia Tech. Photo: Virginia Tech

By Laura Temple

Major league baseball teams use advanced analytics to help predict performance when scouting, drafting or acquiring players. Coaches and managers look at statistical tendencies to develop plans and strategy for each game and series. 

What would it look like to translate such data analysis to decisions involved in soybean breeding? Despite advances like genetic markers, soybean breeding still contains significant guesswork, trial and error. 

Bo Zhang, associate professor and soybean breeder at Virginia Tech, believes predicting soybean cross performance based on data will accelerate the breeding process, use resources more efficiently and increase the percentage of tested lines successfully released to farmers. She is building a predictive model for her public soybean breeding program, thanks to Soy Checkoff support from the Virginia Soybean Board.

“This practical breeding model incorporates new technology, like machine learning, into the state breeding program,” Zhang says. “Once the model is well established, using it to predict results for untested lines created from breeding crosses will shorten field trials, cut test entries and deliver high-performing varieties.” 

Large volumes of data exist on the soybean genome, and much more data can be gleaned from the field relatively efficiently. Plugging that information into equations would help her team develop gameplans for advancing new varieties from the breeding program.

Digital primary color, or RGB, image of Virginia Tech soybean breeding plots, used to collect additional phenotypic data from the field. Source: Virginia Tech

The model started with data from 181 known soybean lines. This information, including field-based data, trained machine-learning algorithms to predict yield results for new, untested soybean lines.

Zhang’s developing model will combine three types of data: phenotype, genotype and a breeder experience-based rating system. 

Phenotype

Phenotype refers to an organism’s observable characteristics and traits. This category includes physical and developmental traits. Zhang’s team gathers this information through traditional observation and measurements, but they also fly drones above the crop canopy of developmental lines to gather data like height, canopy coverage, canopy temperature and more.

This part of the model was developed first.

“In our first season, the phenotypic model achieved about 70% accuracy when validated against actual field data,” she says. “We expect this to improve with continued data accumulation.”

Genotype

Genotype describes an organism’s actual genetic information, the specific set of genes it inherited. As of summer 2025, this section of the model is still in development. 

The volume of information available from the soybean genome and past breeding efforts should lead to reliable predictions of line performance. The intent is to incorporate genetic data from parent lines into the model. It should enable breeders to improve prediction accuracy and evaluate the breeding potential of each cross more efficiently.

Corresponding thermal image of Virginia Tech soybean breeding plots, providing another layer of phenotypic data on soybean lines. Source: Virginia Tech

“Combining phenotypic and genotypic data will boost the model’s accuracy,” Zhang says. “Eventually, we will select just the most promising lines to test in the field, saving time and resources.”

Breeder Rating

The third component of Zhang’s breeding prediction model incorporates her breeder rating system, based on years of breeding experience.

“When evaluating soybean breeding lines in the field, I rate them from 1, which is the best, to 5, or the worst,” she says. “This rating adds depth about the potential for a line in our environment.”

She uses this rating to help decide what lines to move forward and which ones to drop. Training machine learning to understand and apply ratings will improve the efficiency of making these decisions.

When the model incorporates all three types of data, Zhang expects to improve both the efficiency of her breeding program and the quality of varieties released. 

The Virginia Tech soybean breeding program focuses on Maturity Groups 4 and early 5. However, she has been adding more late-MG 3 material to the program as farmers seek to plant soybeans earlier. This model will accelerate that process. 

Additional Resources

Creating a New Breeding Tool Based on Plant Proteins and Machine Learning – SRIN article 

Meet the researcher: Bo Zhang SRIN profile | University profile

Published: Sep 8, 2025