Build a Text Paraphraser Using Python with Pegasus Transformer for NLP


Picture of Nsikak Imoh, author of Macsika Blog


A text paraphrasing program comes in handy for numerous purposes, including rewriting a block of sentences in an article, post, or email.

The task of paraphrasing a text usually requires building and training a Natural Language Processing (NLP) model.

NLP is demanding for two reasons: language is a complex structure, and the amount of data used to train a model for a task such as paraphrasing sentences heavily affects how well the model performs.

Hence, a model that is not properly trained produces odd or inaccurate outputs.

Also, acquiring and labeling additional observations for an NLP model can be expensive and very time-consuming.

One common approach to building a text paraphraser, especially in Python, has been to apply data augmentation to the labeled text data and rewrite the text using back translation, e.g. translating English to German and back to English (en -> de -> en).
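
To make the back translation idea concrete, here is a minimal sketch of the data flow. The two toy translators are hypothetical stand-ins built from a tiny lookup table; they are assumptions for illustration only, and in practice each one would be a real machine translation model.

```python
from typing import Callable

# A translator is any function that maps a string to a string.
Translator = Callable[[str], str]


def back_translate(text: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Paraphrase `text` by translating it to a pivot language and back."""
    pivot_text = to_pivot(text)
    return from_pivot(pivot_text)


# Toy lookup tables (hypothetical): they only show the flow en -> de -> en.
EN_TO_DE = {"hello": "hallo", "friend": "freund"}
DE_TO_EN = {"hallo": "hi", "freund": "buddy"}


def toy_en_to_de(text: str) -> str:
    # Lower-case the input and replace each known word with its German entry.
    return " ".join(EN_TO_DE.get(word, word) for word in text.lower().split())


def toy_de_to_en(text: str) -> str:
    # Map each known German word back to a (different) English word.
    return " ".join(DE_TO_EN.get(word, word) for word in text.split())


print(back_translate("Hello friend", toy_en_to_de, toy_de_to_en))  # prints "hi buddy"
```

The rewrite comes from the round trip itself: the return translation rarely picks exactly the words it started with. The cost is two translation passes per text.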

What Is the PEGASUS Transformer Model?

Google’s research team introduced a summarization model called PEGASUS. The name stands for Pre-training with Extracted Gap-sentences for Abstractive Summarization.

PEGASUS is a seq2seq transformer model, so we can adapt this summarization model to paraphrase a sentence or a longer text.

Additionally, seq2seq transformer models make it easy to rewrite a text without using the back translation process.

This post does not in any way promote stealing content from other websites using a method popularly called article spinning. It is solely intended for research and testing purposes.

NB: Running this program will download some files, one of which is the model itself, at about 2 GB or more in size.

How to Build a Text Paraphraser Using Python with Pegasus Transformer for NLP

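Adopting this model for paraphrasing means fine-tuning the Google PEGASUS model on a paraphrasing task and converting the TF checkpoints to PyTorch with the conversion script in Hugging Face's transformers library. This tutorial loads an already fine-tuned checkpoint, tuner007/pegasus_paraphrase, so no training is needed here.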

  • Install the Dependencies

    The first step is to install the required dependencies for our paraphrasing model.

    
    pip install torch
    pip install sentence-splitter
    pip install transformers
    pip install SentencePiece

    We use PyTorch and the transformers package to work with the PEGASUS model.

    Also, we use the sentence-splitter package to split our paragraphs into sentences and the SentencePiece package to encode and decode sentences.
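
Before loading the model, it can help to confirm that the four packages are importable. The sketch below uses only the standard library; `check_dependencies` is a hypothetical helper name, and the import names listed are the ones these pip packages are normally imported under.

```python
from importlib.util import find_spec
from typing import Dict, List

# Import names for the four pip packages installed above.
REQUIRED_MODULES: List[str] = ["torch", "sentence_splitter", "transformers", "sentencepiece"]


def check_dependencies(modules: List[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether Python can locate it."""
    return {name: find_spec(name) is not None for name in modules}


status = check_dependencies(REQUIRED_MODULES)
for name, available in status.items():
    print(f"{name}: {'ok' if available else 'missing'}")
```

Any line that reports `missing` points to a package that still needs a `pip install` in the active environment.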

  • Set Up the PEGASUS Model

    Next, we set up the PEGASUS transformer model: import the dependencies, choose the device to run on, and load the tokenizer and the model.

    
    import torch
    from typing import List
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    # Pegasus checkpoint fine-tuned for paraphrasing, downloaded on first use.
    model_name = 'tuner007/pegasus_paraphrase'
    # Run on the GPU when one is available, otherwise fall back to the CPU.
    torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)

  • Access the Model

    
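    def get_response(input_text: str, num_return_sequences: int, num_beams: int) -> List[str]:
        # Tokenize the input; prepare_seq2seq_batch is deprecated, so call the tokenizer directly.
        batch = tokenizer([input_text], truncation=True, padding='longest', max_length=60, return_tensors='pt').to(torch_device)
        # Beam search keeps num_beams hypotheses and returns the best num_return_sequences.
        # temperature only applies when sampling, so it is left out here.
        translated = model.generate(**batch, max_length=60, num_beams=num_beams, num_return_sequences=num_return_sequences)
        # Decode the generated token ids back to text, dropping special tokens.
        return tokenizer.batch_decode(translated, skip_special_tokens=True)
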
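
With plain beam search, the model can return at most as many sequences as it keeps beams, so `num_return_sequences` should not exceed `num_beams`. The sketch below checks this before the model is called; `validate_beam_settings` is a hypothetical helper, not part of the transformers library.

```python
def validate_beam_settings(num_return_sequences: int, num_beams: int) -> None:
    """Raise ValueError when the settings cannot work with plain beam search."""
    if num_beams < 1 or num_return_sequences < 1:
        raise ValueError("num_beams and num_return_sequences must be at least 1")
    if num_return_sequences > num_beams:
        raise ValueError(
            f"num_return_sequences ({num_return_sequences}) must not exceed "
            f"num_beams ({num_beams})"
        )


# Valid settings pass silently.
validate_beam_settings(num_return_sequences=10, num_beams=10)

# Asking for more sequences than beams is rejected.
try:
    validate_beam_settings(num_return_sequences=10, num_beams=5)
except ValueError as error:
    print(error)
```

Calling a check like this at the top of `get_response` gives a clear error early instead of a failure inside generation.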
  • Test the Model

    Paraphrase a single sentence:

    
    context = "Which course should I take to get started in data science?"
    num_return_sequences = 10
    num_beams = 10
    get_response(context, num_return_sequences, num_beams)
    

    The output:

    
    ['Which data science course should I take?',
     'Which data science course should I take first?',
     'Should I take a data science course?',
     'Which data science class should I take?',
     'Which data science course should I attend?',
     'I want to get started in data science.',
     'Which data science course should I enroll in?',
     'Which data science course is right for me?',
     'Which data science course is best for me?',
     'Which course should I take to get started?'
    ]
    

    We got ten different paraphrases from the model because we set the number of returned sequences to 10.

    Paraphrase a paragraph:

    The model works efficiently on a single sentence. Hence, we have to break a paragraph into single sentences. The code below takes the input paragraph and splits it into a list of sentences. Then we loop over the list and paraphrase each sentence in turn.

    
    from sentence_splitter import SentenceSplitter

    # Split the paragraph into sentences, since the model handles one sentence at a time.
    splitter = SentenceSplitter(language='en')

    context = "I will be showing you how to build a web application in Python using the SweetViz and its dependent library. Data science combines multiple fields, including statistics, scientific methods, artificial intelligence (AI), and data analysis, to extract value from data. Those who practice data science are called data scientists, and they combine a range of skills to analyze data collected from the web, smartphones, customers, sensors, and other sources to derive actionable insights."
    sentence_list = splitter.split(context)
    num_beams = 5

    # Paraphrase each sentence, keeping a single candidate per sentence.
    paraphrase = []
    for sentence in sentence_list:
        result = get_response(sentence, 1, num_beams)
        paraphrase.append(result)
    paraphrase


    Output :

    
      [
          ['I will show you how to use the SweetViz and its dependent library to build a web application.'],
          ['Data science combines multiple fields, including statistics, scientific methods, and data analysis, to extract value from data.'],
          ['Data scientists combine a range of skills to analyze data collected from the web, smartphones, customers, sensors, and other sources to derive actionable insights.']
     ... ]
    

    Combine the separated lists into a paragraph:

    
    # Each entry of `paraphrase` is a one-element list, so flatten the lists and join the sentences.
    paraphrased_text = ' '.join(sentence for result in paraphrase for sentence in result)
    paraphrased_text


    Output :

    
      I will show you how to use the SweetViz and its dependent library to build a web application. Data science combines multiple fields, including statistics, scientific methods, and data analysis, to extract value from data. Data scientists combine a range of skills to analyze data collected from the web, smartphones, customers, sensors, and other sources to derive actionable insights.
    
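
The paragraph steps above (split, paraphrase each sentence, rejoin) can be wrapped in one reusable function. This is a minimal sketch under stated assumptions: `paraphrase_paragraph`, `naive_split`, and `fake_paraphrase` are hypothetical names, the regex splitter is only a rough stand-in for SentenceSplitter, and the fake paraphraser stands in for the model so the example runs without downloading anything.

```python
import re
from typing import Callable, List


def naive_split(paragraph: str) -> List[str]:
    """Split after '.', '!' or '?' followed by whitespace (rough stand-in for SentenceSplitter)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def paraphrase_paragraph(
    paragraph: str,
    paraphrase_fn: Callable[[str], List[str]],
    split_fn: Callable[[str], List[str]] = naive_split,
) -> str:
    """Split the paragraph, keep the first candidate per sentence, and rejoin."""
    rewritten = []
    for sentence in split_fn(paragraph):
        candidates = paraphrase_fn(sentence)
        # Fall back to the original sentence if no candidate is returned.
        rewritten.append(candidates[0] if candidates else sentence)
    return " ".join(rewritten)


def fake_paraphrase(sentence: str) -> List[str]:
    # Stand-in for the model: returns the sentence upper-cased as its only candidate.
    return [sentence.upper()]


print(paraphrase_paragraph("First sentence. Second one! Third?", fake_paraphrase))
# prints "FIRST SENTENCE. SECOND ONE! THIRD?"
```

With the real model loaded, the same function could be called with `paraphrase_fn=lambda s: get_response(s, 1, num_beams)` and `split_fn=splitter.split`.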

Wrap Up

You learned how to build a text paraphraser using NLP methods. You also learned about the PEGASUS transformer model and how it simplifies the paraphrasing process.

You may use the following resources to learn more: the PEGASUS model research paper, Paraphrase model using HuggingFace, and the User Guide to PEGASUS.

Get the complete code from Python Code Snippets on GitHub.
