Life Sciences

Speaking Protein: How NLP and biology collided

An unlikely partnership led to discovering the secret grammar of the language of life

About the Author: Thomas Makkink is a writer and BioAI research engineer at InstaDeep. He works on applying machine learning methods for the development of next generation personalised therapeutics.

Michael Heinzinger was weighing a risky proposition. 

It was early 2018 and an AI research student from across campus at the Technical University of Munich had approached the biology lab where Heinzinger worked with a seemingly outlandish idea: Could the same models that were being used for spell check, autocomplete and Alexa learn the language of life itself?

Heinzinger, a computational biologist, had joined the lab earlier that year and was looking for a PhD research focus. His lab was dedicated to understanding protein sequences, often described as the building blocks of life. Coming from the world of biology, he was unaware of the fast-moving developments in Natural Language Processing, or NLP, so he floated the idea past colleagues. They were sceptical.

“This was an extremely high-risk project, to be honest,” he recalls, explaining that when you have three to four years to do a PhD, you need to spend your time wisely, which often means focusing on the most promising direction. “We were starting off with the assumption that we might beat the state-of-the-art that had been developed over 30 years. This could horribly fail, and then you just stand there after three years and you can only say: ‘Yeah, we tried it. It didn’t work.’”

This quandary was the beginning of ProtTrans, one of the first AI models to apply natural language processing to the world of protein sequences and reveal an underlying grammar within the protein universe. This unexpected application of AI to biology sparked huge excitement in both fields and a flurry of innovations.

The person who approached Michael with the idea was Ahmed Elnaggar, an AI researcher fascinated by the rapid developments in NLP and the self-supervised learning possibilities that came with it. Intuition told him it would work. He just needed the right partner. His email pitch to the Rostlab found its way to Heinzinger. The two set up a meeting. 

“I always say, ‘It’s better to read, read, read, or listen, listen, listen – and then talk,’” Elnaggar says. “When I started to read about these transformer models and this new type of processing, I saw a lot of people were already working in this area. I always like to try to find a new idea, a new, green field that no one’s tried before.”

“But when this new model came out, most people focused only on language use cases rather than trying to apply it in a different field.”

“This idea came when I started to walk on our campus,” Elnaggar says, recalling how he counted the number of “chairs” – professors in German academia with their own teams and budgets.

Searching for a suitable green field, he researched each chair, looking into their work and research focus and started crafting a short list. The list included Rostlab, led by Professor Burkhard Rost, a theoretical physicist and bioinformatics pioneer. “Rostlab was not the only lab that I thought of, but after discussing these ideas with different labs, I thought it was a use case that we could try right now.”

[Figure: antibody structure, PDB 1IGT] Molecular watchdogs, antibody proteins circulate in the blood, scrutinising all they touch. When they find viruses, bacteria or other threats, they bind to them to either eliminate them or coat their surface to shield us from infection.

In their discussions, the idea evolved as they explored the available data sets and how to test the NLP model. “We didn’t have this protein idea at first,” he recalls. “We just had the idea that maybe we can, one day, make a model that can extract features from a single gene sequence.”

Elnaggar knew language models, but he had no idea bioinformaticians had such a wealth of unstructured data. “Without these discussions with Mike, it wouldn’t have gone into this direction.”

The timing was perfect. “It was just an amazing coincidence that he stumbled into our glass office,” Heinzinger says. “At this time, we were just hitting a wall when we tried to predict protein-to-protein interactions. We were looking for other ways to actually represent protein sequences using only single protein sequences.”

“Then Ahmed knocks at the door and is like, ‘There are these natural language processing algorithms which are beating the benchmarks with new versions rolling out every week. You guys have sequential data, so what about applying these algorithms to it?’”

Heinzinger decided to take the plunge.

“We went the most straightforward way you can imagine,” he says. “Treat one single amino acid as a word and treat one sequence as a sentence. Then you are already there, more or less.”

Proteins, the building blocks of life, are made up of strings of amino acids. There are 20 standard amino acids, each represented by a single letter of the alphabet. Read sequentially, these letters behave just like words in a sentence and determine the structure of a protein.
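That mapping is simple enough to sketch in a few lines of Python. This is a toy illustration of the idea, not the actual ProtTrans code, and the sequence fragment is invented: each single-letter amino acid becomes a “word,” and the whole protein a “sentence” of integer token IDs, exactly as an NLP tokenizer handles text.

```python
# Toy sketch: treat each amino acid as a "word" and a protein as a "sentence".
# The 20 standard amino acids, one letter each.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: idx for idx, aa in enumerate(AMINO_ACIDS)}

def tokenize(sequence: str) -> list[int]:
    """Turn a protein sequence into a list of integer token IDs,
    the same way an NLP model turns a sentence into word IDs."""
    return [VOCAB[aa] for aa in sequence]

# A short, made-up protein "sentence":
print(tokenize("MKTAY"))  # → [10, 8, 16, 0, 19]
```

From here, the sequence looks to a language model no different from any other tokenized sentence.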

If he had lingering reservations, they evaporated early on. “We got pretty fast, pretty good results,” he says. “After that we just kept training. Every week or so, we took a checkpoint, and we saw the performance increase week after week. This was just magic.”

Reading data sets of millions of protein sequences, the language model extracted features and identified common patterns and combinations as it learned to parse and predict the protein “language.”
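The kind of pattern extraction described here can be illustrated with a deliberately tiny stand-in: a bigram model that, in the same self-supervised spirit as ProtTrans but at a vastly smaller scale, learns statistical regularities from raw sequences alone, with no labels. The sequence fragments are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count, for each amino acid, which amino acids follow it.
    No labels needed -- the sequences themselves are the supervision."""
    following = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            following[current][nxt] += 1
    return following

def predict_next(model, amino_acid):
    """Most frequent successor of `amino_acid` in the training data."""
    return model[amino_acid].most_common(1)[0][0]

corpus = ["MKTAY", "MKVLA", "MKTLL"]  # hypothetical fragments
model = train_bigram(corpus)
print(predict_next(model, "K"))  # "T" follows "K" twice, "V" once → "T"
```

A transformer does something far richer – attending over whole sequences rather than adjacent pairs – but the training signal is the same: predict the sequence from the sequence itself.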

The surprising success gave them courage, Heinzinger says. The model had outperformed word2vec, an older method of converting words into numerical representations. And it only kept improving. “Until of course – it hit a ceiling,” Heinzinger recalls, “which then required us to somehow find more sophisticated algorithms.”

Elnaggar was invited to give a talk about their progress at a high-performance computing conference. He called their prototype SeqVec (Sequence-to-Vector) and said it suggested a promising solution to the challenge of how to efficiently handle the exponentially increasing number of sequences in protein databases.

After the talk, representatives from Google, NVIDIA and Cornell University offered the pair the use of their systems to expand their work. “This is something I would advise researchers: build a prototype,” Elnaggar says, “and network.”

Inspired by traditional NLP approaches, the pair had initially looked for massive datasets, irrespective of quality. They trained the learning model on BFD, the “Big Fantastic Database,” the largest available database with 2.1 billion protein sequences. “It’s a dirty, noisy set. And then we saw that the problem is this also then reflects in training time. The larger the data set, of course, the longer you also have to train.”

[Figure: alcohol dehydrogenase, PDB 2OHX] Our primary defence against alcohol, the alcohol dehydrogenase proteins in our liver and stomach detoxify the equivalent of one stiff drink each hour.

Despite having more computing power, including access to IBM’s Summit, the world’s second-fastest supercomputer, the rate of improvement tapered off again. They wanted to produce results that could compete with Multiple Sequence Alignment (MSA), an information-rich way of representing an alignment of three or more related protein sequences.

“We got hungry,” Heinzinger says. “It was worth getting frustrated to get to the last one or two percentage points.” 

The next jump came when they found a leaner, cleaner collection of protein sequences called UniRef in a publication from Facebook’s AI team. “It meant you could cover the protein universe in a more uniform way than BFD did, which had a high bias towards large families.”

They had proven that transformers, which had achieved fame largely through NLP tasks, could also provide “embeddings” for proteins – encodings of the sequences in a mathematical space that cluster proteins similar in structure and function closer together than unrelated proteins. Transformers could refine our map of the protein universe.
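A toy sketch of what such an embedding space buys you – the vectors and protein names below are invented for illustration, not real ProtTrans output: each protein becomes a point in a vector space, and distances between points reflect relatedness.

```python
import math

# Hypothetical embedding vectors: two related proteins and one unrelated one.
embeddings = {
    "kinase_A":   [0.9, 0.1, 0.3],
    "kinase_B":   [0.8, 0.2, 0.4],
    "collagen_X": [0.1, 0.9, 0.1],
}

def cosine_similarity(u, v):
    """Standard cosine similarity: 1.0 for identical directions, lower for dissimilar ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim_related = cosine_similarity(embeddings["kinase_A"], embeddings["kinase_B"])
sim_unrelated = cosine_similarity(embeddings["kinase_A"], embeddings["collagen_X"])
print(sim_related > sim_unrelated)  # related proteins sit closer in the space
```

With real ProtTrans embeddings the vectors have hundreds of dimensions, but the principle is the same: nearness in the space stands in for similarity of structure and function.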

They called their learning model ProtTrans, combining the words protein and transformer. It offered valuable insights not just on the biology side but also to the world of AI, demonstrating that transformers are better at modelling relationships in much longer sequences than the previous best-performing AI models, which were mainly recurrent neural networks.

Transformers are also perfectly suited to modern supercomputer architectures, where a single transformer can train faster by using many chips in parallel: CPUs (the processors found in ordinary laptops), GPUs (the hardware of choice for deep-learning models), or TPUs (Google’s custom Tensor Processing Units).

They’re also data-hungry, perfect for learning from enormous data sets. And finally, Elnaggar points to recent research showing that transformers can be viewed as Graph Neural Networks – which means that, from the sequence of amino acids alone, a transformer is well suited to implicitly learning information about the structure of a protein.

ProtTrans continues to send ripples through the world of biotechnology. “The work these guys have done is saving lives,” says InstaDeep’s Nicolas Lopez Carranza. 

He credits Elnaggar and Heinzinger with influencing his team’s work in developing the AI protein design platform DeepChain.

“On the DeepChain team, we realised very quickly the value of ProtTrans,” he says, adding that the platform’s machine-learning predictors incorporate ProtTrans’s protein sequence-analysing powers.

“It allowed us to analyse proteins’ evolutionary landscape with a new pair of special goggles,” Lopez Carranza says, pointing to features like DeepChain’s Playground, which helps users analyse protein sequences in completely new ways thanks to ProtTrans. Tools like this are empowering researchers to design new disease therapeutics, vaccines and potential cures.

For those without access to a Summit supercomputer or thousands of GPUs, InstaDeep has helped provide open-source bio-transformers and hundreds of millions of pre-computed protein embeddings to help democratise ProtTrans advances. The aim is to help researchers solve their own protein problems as part of the open-source DeepChain Apps initiative (see page 9).

Heinzinger, still a PhD candidate and now co-author on several state-of-the-art papers, reflects on the journey.  

“Remember I mentioned I thought that it was highly unlikely that we could beat the state-of-the-art using only a single protein sequence?” Heinzinger asks. “I see how I was wrong there – luckily. 

“The reality is, looking back, it does not seem that difficult anymore,” he adds. “It was more or less having this idea first and realising it’s actually worth trying.”
