A Comprehensive Guide to Calculating Statistical Features from Text using Python

Utilizing Statistical Metrics to Improve Text Accessibility and Understanding

In this blog, we will cover the following topics:

  • What is readability and why is it important?
  • How to assess the readability in Python?
  • Build a Python app to test the readability of text content
  • Sentiment analysis of the text content

“I write as straight as I can, just as I walk as straight as I can, because that is the best way to get there.” — H. G. Wells, English novelist and historian

What is readability and why is it important?

Readability tells us how easy or difficult a piece of text is to read and comprehend. The choice of words and the complexity of the topic play an important role in the overall reading experience. Other factors that influence readability include sentence length, syllables per word, and, most importantly, the audience who will consume the content, e.g., students, teachers, and business or technical professionals.

Highly readable content is more likely to engage the reader for longer. It also means that the content is less ambiguous and that the information presented is easy to interpret without spending too much energy or time. In business, technical, or marketing content, readability is very important. For example, a product description or user manual should be simple and yet convey its message to the audience without relying on technical jargon.

How to measure readability?

Readability is measured by collecting key metrics from the text and combining them with mathematical formulae. The end result is a readability score that indicates the ease of reading. For example: is the text easy to read for 5ᵗʰ graders or 9ᵗʰ graders?
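
To make the idea of collecting key metrics concrete, here is a minimal sketch that counts sentences, words, and syllables using only the standard library. The function names and the vowel-group syllable heuristic are illustrative assumptions; dedicated libraries apply more careful rules, so their counts will differ.

```python
import re

def count_sentences(text: str) -> int:
    """Count sentences by splitting on '.', '!' and '?' (a rough heuristic)."""
    parts = [p for p in re.split(r"[.!?]+", text) if p.strip()]
    return max(1, len(parts))

def get_words(text: str) -> list:
    """Extract alphabetic words, keeping internal apostrophes (e.g. "wasn't")."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)

def count_syllables(word: str) -> int:
    """Approximate syllables as the number of vowel groups (an assumption)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def text_metrics(text: str) -> dict:
    """Collect the raw counts and averages that readability formulas rely on."""
    words = get_words(text)
    sentences = count_sentences(text)
    syllables = sum(count_syllables(w) for w in words)
    return {
        "sentences": sentences,
        "words": len(words),
        "syllables": syllables,
        "words_per_sentence": len(words) / sentences,
        "syllables_per_word": syllables / max(1, len(words)),
    }

sample = "The cat sat on the mat. It was happy."
metrics = text_metrics(sample)
print(metrics)
```

Shorter sentences and fewer syllables per word generally lead to an easier score in the formulas that follow.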

There are many methods to calculate readability scores, and we will explore a few of them, listed below:

  • The Flesch Reading Ease formula
  • The Flesch-Kincaid Grade Level
  • The Fog Scale
  • The Coleman-Liau Index
  • Linsear Write Formula
  • Difficult words

Calculating the readability score using Python

Let us calculate the score for a piece of text with each of the methods from the previous section. We will be using the textstat library. Let us install and import the library and define a text for analysis.

The text content is from Cricinfo’s post-match commentary from the IPL match RCB VS SRH dated 19 May 2023.

# pip install textstat   (run in a terminal to install the library)
import textstat  # Load the library


# Adjacent string literals are joined into one string; each line ends with a space
test_data = (
    "Kohli century leads a clinical performance as RCB retain control over "
    "their fate for now. He took charge with scintillating strokeplay, "
    "it looked like approaching a slowdown but hit his way past it. "
    "Some of the cover drives and the whips were sensational. "
    "There were a quiet couple of overs but generally included a "
    "boundary each over. Faf was the aggressor in the powerplay and "
    "survived being dismissed by an athletic catch by Dagar because the "
    "second bouncer for the over was deemed to be a no-ball. "
    "Kohli wasn't shying away from playing attacking shots at all stages. "
    "Sunrisers' bowling couldn't apply consistent pressure, "
    "they weren't allowed to, and end their season at home."
)

The Flesch Reading Ease formula

textstat.flesch_reading_ease(test_data)
OUTPUT: 63.39

Conclusion: The text is classified as standard content
---------------------------------------
Score Difficulty Reference Table

Score     Difficulty
90-100    Very Easy
80-89     Easy
70-79     Fairly Easy
60-69     Standard
50-59     Fairly Difficult
30-49     Difficult
0-29      Very Confusing
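
As a rough illustration of what sits behind this score, the sketch below applies the published Flesch Reading Ease formula to raw counts and then looks up the difficulty label from the reference table above. The function names and the sample counts are hypothetical, and textstat's own counting rules may produce slightly different numbers.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: 206.835 - 1.015 * ASL - 84.6 * ASW."""
    avg_sentence_length = words / sentences      # ASL
    avg_syllables_per_word = syllables / words   # ASW
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

# Lower bound of each band, taken from the reference table above
FLESCH_BANDS = [
    (90, "Very Easy"),
    (80, "Easy"),
    (70, "Fairly Easy"),
    (60, "Standard"),
    (50, "Fairly Difficult"),
    (30, "Difficult"),
    (0, "Very Confusing"),
]

def classify_flesch(score: float) -> str:
    """Return the difficulty label for a Flesch Reading Ease score."""
    for lower_bound, label in FLESCH_BANDS:
        if score >= lower_bound:
            return label
    return "Very Confusing"  # the formula can go below 0 for very dense text

print(classify_flesch(63.39))  # Standard
```

Higher scores mean easier text, which is the opposite direction from the grade-level formulas in the next sections.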

The Flesch-Kincaid Grade Level

textstat.flesch_kincaid_grade(test_data)
OUTPUT: 8.5

Conclusion: An 8ᵗʰ grader should be able to read the content
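
For reference, the published Flesch-Kincaid Grade Level formula can be sketched from raw counts as follows. The function name and the counts are hypothetical and chosen only for illustration; textstat counts words and syllables with its own rules.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Hypothetical counts: 100 words, 5 sentences, 150 syllables
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(round(grade, 2))  # 9.91
```

The result maps to a US school grade, so a value of 8.5 sits between the 8th and 9th grades.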

The Fog Scale

textstat.gunning_fog(test_data)
OUTPUT: 11.43

Conclusion: An 11ᵗʰ grader should be able to read the content
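
The Gunning Fog index is commonly stated as 0.4 times the sum of the average sentence length and the percentage of complex words, where complex words are usually those with three or more syllables. A minimal sketch with a hypothetical function name and made-up counts:

```python
def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog index from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    avg_sentence_length = words / sentences
    percent_complex = 100 * (complex_words / words)
    return 0.4 * (avg_sentence_length + percent_complex)

# Hypothetical counts: 100 words, 5 sentences, 10 complex words
fog = gunning_fog(words=100, sentences=5, complex_words=10)
print(round(fog, 2))  # 12.0
```

Because complex words are weighted heavily, domain terms such as player or team names can push this score up.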

The Coleman-Liau Index

textstat.coleman_liau_index(test_data)
OUTPUT: 10.5

Conclusion: A 10ᵗʰ grader should be able to read the content
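
Unlike the syllable-based formulas, the Coleman-Liau Index works from letters and sentences per 100 words. The sketch below uses the published coefficients; the function name and the counts are hypothetical.

```python
def coleman_liau_index(letters: int, words: int, sentences: int) -> float:
    """Coleman-Liau Index: 0.0588 * L - 0.296 * S - 15.8."""
    if words <= 0:
        raise ValueError("words must be positive")
    letters_per_100_words = letters / words * 100      # L
    sentences_per_100_words = sentences / words * 100  # S
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8

# Hypothetical counts: 450 letters, 100 words, 5 sentences
cli = coleman_liau_index(letters=450, words=100, sentences=5)
print(round(cli, 2))  # 9.18
```

Counting letters avoids the need for a syllable heuristic, which makes this index easier to compute reliably.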

Linsear Write Formula

textstat.linsear_write_formula(test_data)
OUTPUT: 10.5
Conclusion: A 10ᵗʰ grader should be able to read the content
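
The Linsear Write formula is usually described for a 100-word sample: easy words (two syllables or fewer) score 1 point, hard words (three or more syllables) score 3 points, and the total is divided by the number of sentences and then adjusted. A minimal sketch of that description, with a hypothetical function name and made-up counts:

```python
def linsear_write(easy_words: int, hard_words: int, sentences: int) -> float:
    """Linsear Write grade from counts taken over a roughly 100-word sample."""
    if sentences <= 0:
        raise ValueError("sentences must be positive")
    provisional = (easy_words * 1 + hard_words * 3) / sentences
    # Adjustment step: halve the result, and subtract 1 when it is 20 or less
    if provisional > 20:
        return provisional / 2
    return provisional / 2 - 1

print(linsear_write(easy_words=80, hard_words=20, sentences=5))   # 14.0
print(linsear_write(easy_words=90, hard_words=10, sentences=10))  # 5.0
```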

Difficult words

textstat.difficult_words(test_data)

OUTPUT: 31

Conclusion: There are 31 difficult words in the text content

-------------------------------------------------------------------------

textstat.difficult_words_list(test_data)

OUTPUT: List of all difficult words
['control', 'apply', 'consistent', "sunrisers'", 'dagar', 'aggressor',
'pressure', 'approaching', 'boundary', 'bouncer', 'allowed', 'bowling',
'clinical', 'athletic', 'retain', 'powerplay', 'dismissed', 'slowdown',
'couple', 'scintillating', 'generally', 'performance', 'survived',
'attacking', 'playing',
...
'included', "weren't", 'strokeplay', 'shying', 'century']

There are many other methods for calculating readability scores; additional details can be found in the textstat documentation.
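
When trying several methods on the same text, it can help to collect the results in one place. The sketch below is one possible way to do that: `score_report` is a hypothetical helper that takes any mapping of names to scoring functions, and the stand-in scorers exist only so the example runs without extra installs.

```python
from typing import Callable, Dict

def score_report(text: str, scorers: Dict[str, Callable[[str], float]]) -> Dict[str, float]:
    """Apply each scoring function to the text and collect the rounded results."""
    return {name: round(scorer(text), 2) for name, scorer in scorers.items()}

# Stand-in scorers (not readability metrics) used purely for demonstration
demo_scorers = {
    "characters": lambda t: float(len(t)),
    "words": lambda t: float(len(t.split())),
}

report = score_report("Readable text wins.", demo_scorers)
print(report)  # {'characters': 19.0, 'words': 3.0}
```

With textstat installed, the dictionary could instead map names to functions used earlier, such as `textstat.flesch_reading_ease` and `textstat.gunning_fog`.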

Limitations of readability scores

  • Readability scores say nothing about the quality of the text content.
  • Readability scores can be misleading for short text content.
  • Different scoring methodologies give different results, so the choice of methodology should be well thought out before using it.
  • A lower score doesn’t necessarily mean the text is bad; it is just an indication that the text should be reviewed.
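
The point about different methodologies can be seen in the grade-level outputs reported earlier in this article. The short sketch below compares those four numbers; only the variable names are new.

```python
from statistics import mean

# Grade-level outputs reported earlier for the same commentary text
grade_scores = {
    "Flesch-Kincaid": 8.5,
    "Gunning Fog": 11.43,
    "Coleman-Liau": 10.5,
    "Linsear Write": 10.5,
}

lowest = min(grade_scores, key=grade_scores.get)
highest = max(grade_scores, key=grade_scores.get)
spread = round(grade_scores[highest] - grade_scores[lowest], 2)
average = round(mean(list(grade_scores.values())), 2)

print(f"Lowest:  {lowest} ({grade_scores[lowest]})")
print(f"Highest: {highest} ({grade_scores[highest]})")
print(f"Spread:  {spread} grades, average {average}")
```

A spread of almost three grade levels on one short passage is a reminder to pick a single method and apply it consistently.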

Sentiment analysis of the text content

In simple words, sentiment analysis is the process of analyzing content to determine whether it is positive, negative, or neutral. There are many sophisticated products and cloud services/APIs that provide NLP modules for sentiment analysis. Python also lets us build custom machine-learning models, but that requires writing code and combining various libraries to achieve the results.

In this section, we will explore a library named textcaret, which takes away these hassles by providing a unified framework for common NLP tasks. Sentiment analysis has two components:

  1. Polarity: It refers to the direction and strength of the opinion, which could be positive or negative. Words such as love, trust, admiration, and respect indicate a positive (+) opinion.
  2. Subjectivity: It refers to how much the text expresses personal opinions, emotions, or judgments rather than facts. A statement such as “I like my new car for its speed and performance” is a personal opinion based on experience, so it is subjective.
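
To build intuition for polarity, here is a toy lexicon-based sketch. It is not how textcaret computes its scores: the word list and the polarity values are made up for illustration, and the function name is hypothetical.

```python
import re

# Hypothetical mini-lexicon: word -> polarity in [-1, 1] (values are made up)
TOY_LEXICON = {
    "love": 0.8,
    "trust": 0.6,
    "admiration": 0.7,
    "respect": 0.6,
    "hate": -0.8,
    "distrust": -0.6,
}

def toy_polarity(text: str) -> float:
    """Average the polarity of known words; return 0.0 when none are found."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

print(round(toy_polarity("I love and respect this team"), 2))  # 0.7
print(toy_polarity("The match starts at noon"))                # 0.0
```

Real sentiment tools use much larger lexicons or trained models and also handle negation and intensity, which this sketch ignores.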

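The polarity score ranges from -1 to 1: a negative score indicates negative emotion, a positive score indicates positive emotion, and a score near 0 is classified as neutral. In the TextBlob-style Sentiment(polarity, subjectivity) output used in this article, the subjectivity score ranges from 0 (objective, fact-based) to 1 (highly subjective).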

# pip install textcaret
from textcaret import TextSentiment
docx = TextSentiment(test_data)
docx.sentiment()

OUTPUT:
{'sentence': "Kohli century leads a clinical performance .......at home.",
'sentiment': Sentiment(
polarity=0.10238095238095238,
subjectivity=0.2857142857142857)
}

We use the same text content as earlier. The polarity (0.10) is close to neutral and the subjectivity (0.29) is low, which suggests the commentary is largely factual.

test_data = "Medium is a great platform for writers and bloggers"
docx = TextSentiment(test_data)
docx.sentiment()

OUTPUT:
{'sentence': 'Medium is a great platform for writers and bloggers',
'sentiment': Sentiment(
polarity=0.8,
subjectivity=0.75)
}

In the above example, I have shared my positive opinion about the Medium platform, and as expected the polarity (0.8) and subjectivity (0.75) are close to 1.0.
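
The two polarity values seen so far can be turned into labels with a small helper. This is a minimal sketch: the function name and the `neutral_band` threshold of 0.2 are assumptions chosen for illustration, not part of textcaret.

```python
def label_polarity(polarity: float, neutral_band: float = 0.2) -> str:
    """Map a polarity score in [-1, 1] to 'positive', 'negative', or 'neutral'."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must be between -1 and 1")
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

print(label_polarity(0.10238095238095238))  # neutral  (cricket commentary)
print(label_polarity(0.8))                  # positive (Medium sentence)
```

The width of the neutral band is a judgment call and should be tuned to the kind of content being moderated.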

Closing Thoughts

By measuring a variety of readability metrics and using readability formulas, we can better understand the language level and complexity of the text in question. This can help us identify and create more effective texts and learning materials. We also looked into sentiment analysis to gauge the emotion and subjectivity of the content. Generally, content such as training material, research papers, and user manuals conveys fact-based information and should be neutral in tone, whereas product marketing material should be more positive. In such cases, the sentiment analysis score can help to moderate the content.

I hope you liked the article and found it helpful.

You can connect with me on LinkedIn and GitHub.

References

https://textstat.readthedocs.io/en/latest/

https://pypi.org/project/textcaret/
