Practical Text Analytics using spaCy v3
Overview
What is text analytics?
I like this definition: “Text analytics is the process of transforming unstructured text documents into usable, structured data. Text analysis works by breaking apart sentences and phrases into their components, and then evaluating each part’s role and meaning using complex software rules and machine learning algorithms.” [Source: Lexalytics website]
In spaCy, you can use machine learning algorithms in two ways:
1) pretrained models provided by spaCy and other organizations. For example, en_core_web_md, which I use in this course, is a pretrained model provided by Explosion, the company that created spaCy
2) custom machine learning models that you train on your own data, which the documentation often refers to as “statistical models”
Why not statistical models?
This is what the makers of spaCy say in their documentation:
“For complex tasks, it’s usually better to train a statistical entity recognition model. However, statistical models require training data, so for many situations, rule-based approaches are more practical. This is especially true at the start of a project: you can use a rule-based approach as part of a data collection process, to help you “bootstrap” a statistical model.
Training a model is useful if you have some examples and you want your system to be able to generalize based on those examples. It works especially well if there are clues in the local context. For instance, if you’re trying to detect person or company names, your application may benefit from a statistical named entity recognition model.
Rule-based systems are a good choice if there’s a more or less finite number of examples that you want to find in the data, or if there’s a very clear, structured pattern you can express with token rules or regular expressions. For instance, country names, IP addresses or URLs are things you might be able to handle well with a purely rule-based approach.”
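To make the rule-based idea in the quote above concrete, here is a minimal sketch of matching a finite list of country names with spaCy's PhraseMatcher. It assumes only that spaCy v3 is installed; it uses a blank English pipeline (tokenizer only), so no pretrained or trained model is involved. The country list and the find_countries helper are hypothetical names made up for this illustration, not something taken from the course material.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline: tokenizer only, no pretrained or trained model.
nlp = spacy.blank("en")

# Hypothetical, deliberately tiny country list used only for illustration.
COUNTRIES = ["Germany", "New Zealand", "India"]

# attr="LOWER" compares lowercased token text, so matching is case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("COUNTRY", [nlp.make_doc(name) for name in COUNTRIES])


def find_countries(text):
    """Return the country mentions found in text, in document order."""
    doc = nlp(text)
    spans = matcher(doc, as_spans=True)
    return [span.text for span in sorted(spans, key=lambda s: s.start)]


print(find_countries("We shipped orders to Germany and new zealand last year."))
```

Because the list is finite and the pattern is explicit, no training data is needed. The trade-off is that anything missing from the list will not be found.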
Just to clarify, I am not against developing statistical models, but as the documentation states quite clearly, it is often more practical to start with rule-based systems. One of my main aims in this course is to provide a solid understanding of what you can and cannot do using just a rule-based system. In fact, I use only one dataset in the entire course, which makes it much easier for students to see this distinction.
When you combine a rule-based system with the data visualization technique I describe in this course, you will also gain a very good understanding of your dataset. You can then use this understanding to improve your statistical model if you choose to build one.
In my view, most people barely scratch the surface when it comes to using spaCy rules for text analytics. I hope this course will give them plenty of new insight into how to approach this task.
Who this course is for:
- Data Science practitioners who want to use spaCy and Natural Language Processing
- Anyone who has a spreadsheet in which one column holds a paragraph of text, and who wants to extract useful information from that text to use alongside the filters (sort, less than, greater than, etc.) that spreadsheet tools like Excel and Airtable apply to the other columns
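
The spreadsheet scenario in the last bullet can be sketched roughly as follows. This is an illustration only: the CSV content, the column names (id, region, notes, amount) and the add_amount_column helper are all hypothetical, and a plain regular expression from the Python standard library stands in for whatever extraction rule you would actually write in spaCy.

```python
import csv
import io
import re

# Hypothetical spreadsheet export: "notes" is the free-text column.
CSV_TEXT = """id,region,notes
1,North,Customer asked for a refund of $120 after a late delivery.
2,South,Happy with the product and renewed for $45.
3,North,Called twice about login problems.
"""

# Simple rule: a dollar sign followed by digits, optionally with cents.
AMOUNT = re.compile(r"\$(\d+(?:\.\d{2})?)")


def add_amount_column(csv_text):
    """Parse the CSV and add an 'amount' column taken from the notes text."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        match = AMOUNT.search(row["notes"])
        # None marks rows where the rule found nothing.
        row["amount"] = float(match.group(1)) if match else None
        rows.append(row)
    return rows


rows = add_amount_column(CSV_TEXT)

# The extracted column can now be combined with filters on the other columns.
large_north = [
    r["id"]
    for r in rows
    if r["region"] == "North" and r["amount"] is not None and r["amount"] > 100
]
print(large_north)
```

Once the text has been turned into a structured column, the usual spreadsheet operations (sort, less than, greater than) apply to it just as they do to any other column.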