Python - Text Mining : Text Preprocessing using NLTK and Sastrawi

This next article is about text mining 😂

The analysis stages in text mining are to collect the data and then extract the features to be used.

By applying text mining processes, we can obtain patterns, trends, and potentially useful knowledge from text data. That's just a quick glimpse of what text mining is.

Let's discuss an important part of text mining: text pre-processing.

Stages of text pre-processing:

- Case folding converts all the text in a document into a standard form, usually lowercase.

- Tokenizing cuts the input string into the words that make it up; in other words, it divides sentences into tokens.

- Filtering keeps the important words from the tokens. It can be done by removing the words found in a stoplist (stopwords), that is, the less important words.

- Stemming returns the words left after filtering to their base form by removing prefixes and suffixes, so that the root word is obtained.
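To make the four stages concrete, here is a minimal sketch in plain Python. It is only an illustration: the stoplist `TOY_STOPWORDS`, the suffix list `TOY_SUFFIXES`, and the helper names are hypothetical, and the suffix stripping is far cruder than what NLTK or Sastrawi do.

```python
import string

# Hypothetical toy stoplist and suffix list, for illustration only.
TOY_STOPWORDS = {"the", "is", "a", "of", "and"}
TOY_SUFFIXES = ("ing", "ed", "s")

def case_fold(text):
    # Case folding: convert everything to lowercase.
    return text.lower()

def tokenize(text):
    # Tokenizing: strip ASCII punctuation, then split on whitespace.
    return text.translate(str.maketrans("", "", string.punctuation)).split()

def filter_tokens(tokens):
    # Filtering: drop tokens that appear in the stoplist.
    return [t for t in tokens if t not in TOY_STOPWORDS]

def naive_stem(token):
    # Stemming (very naive): remove the first matching suffix,
    # as long as at least three characters remain.
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

sentence = "The cat is chasing the painted birds."
tokens = filter_tokens(tokenize(case_fold(sentence)))
stems = [naive_stem(t) for t in tokens]
print(tokens)  # ['cat', 'chasing', 'painted', 'birds']
print(stems)   # ['cat', 'chas', 'paint', 'bird']
```

Notice that the naive stemmer turns "chasing" into "chas", which is not a real word. That is exactly why a proper stemmer is used in the implementation.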

Implementation

1. Import Library

The main libraries that we will use to do text pre-processing are NLTK and Sastrawi.

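import pandas as pd
import string

# openpyxl is not called directly, but pandas needs it installed to read .xlsx files
from openpyxl import load_workbook
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')
from nltk.stem import PorterStemmer
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Sastrawi stemmer and stopword remover for Indonesian text
stemmer = StemmerFactory().create_stemmer()
remover = StopWordRemoverFactory().create_stop_word_remover()
# Translation table that deletes ASCII punctuation
translator = str.maketrans('', '', string.punctuation)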
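A quick sketch of what the `translator` table does on its own, using only the standard library. One thing worth knowing: `string.punctuation` only covers ASCII punctuation, so curly quotes and the ellipsis character are left untouched by this table. The `extended` table below is my own assumption of how you might cover them, not part of the original code.

```python
import string

translator = str.maketrans('', '', string.punctuation)

ascii_text = "Hello, world! (test)"
curly_text = "“Kapan kamu mau melunasi utangmu?”"

ascii_clean = ascii_text.translate(translator)
curly_clean = curly_text.translate(translator)
print(ascii_clean)  # Hello world test
print(curly_clean)  # the "?" is gone, but the curly quotes remain

# Hypothetical extension: also delete a few common non-ASCII marks.
extended = str.maketrans('', '', string.punctuation + '“”‘’…')
extended_clean = curly_text.translate(extended)
print(extended_clean)  # Kapan kamu mau melunasi utangmu
```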

2. Load the Data

Don't forget to prepare a dataset in .xlsx or .csv format that contains the text data. Here I use a sample dataset named Corpus.xlsx, which contains 3 text documents in a column named Corpus.

data = pd.read_excel("Corpus.xlsx")
data
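If your data is in .csv form instead, the same column of documents can be read in a similar way. The sketch below uses only the standard library `csv` module and an in-memory string, so it runs without any file; the CSV content is a hypothetical stand-in for a Corpus.csv file. With pandas, `pd.read_csv` would be the usual choice.

```python
import csv
import io

# Hypothetical CSV content standing in for a Corpus.csv file
# with a single "Corpus" column.
csv_text = (
    'Corpus\n'
    '"First document, with a comma."\n'
    'Second document\n'
    'Third document\n'
)

reader = csv.DictReader(io.StringIO(csv_text))
documents = [row["Corpus"] for row in reader]

print(len(documents))  # 3
print(documents[0])    # First document, with a comma.
```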

3. Build a Custom Library

Create a small custom library: two functions that together cover each step, from case folding and tokenizing to filtering and stemming.

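def stemming(text):
    # English pass: remove English stopwords and punctuation, then Porter-stem each word
    porter = PorterStemmer()
    stop = set(stopwords.words('english'))
    tokens = [t for t in text.lower().split() if t not in stop]
    text = ' '.join(tokens).translate(translator)
    # PorterStemmer.stem() works on one word at a time, so stem token by token
    return ' '.join(porter.stem(t) for t in text.split())

def preprocessing(text):
    text = text.lower()                   # case folding
    text_clean = remover.remove(text)     # filtering: Indonesian stopwords (Sastrawi)
    text_stem = stemmer.stem(text_clean)  # stemming: Indonesian (Sastrawi)
    text_stem = stemming(text_stem)       # English stopwords and Porter stemming (NLTK)
    return text_stem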

4. Preprocessing

Run the preprocessing by calling the custom function on each document and collecting the results with Python's list append method.

preprocessed = []
for dt in data['Corpus']:
    preprocessed.append(preprocessing(dt))
preprocessed
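The loop-and-append pattern can also be written as a list comprehension. Here is a small sketch; `preprocessing_stub` is a hypothetical stand-in for the real `preprocessing` function so that the example runs without NLTK or Sastrawi.

```python
def preprocessing_stub(text):
    # Hypothetical stand-in for preprocessing():
    # it only lowercases and collapses whitespace.
    return ' '.join(text.lower().split())

documents = ["Suara  seorang Perempuan", "Bu Rumi sudah berdiri"]

# Loop and append, as in the article.
with_loop = []
for doc in documents:
    with_loop.append(preprocessing_stub(doc))

# Equivalent list comprehension.
with_comprehension = [preprocessing_stub(doc) for doc in documents]

print(with_comprehension)  # ['suara seorang perempuan', 'bu rumi sudah berdiri']
```

With a pandas column, `data['Corpus'].apply(preprocessing)` should give the same results as a Series, which is handy if you want to keep them next to the original text.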

Dataset Samples :

Suara seorang perempuan terdengar dari arah luar rumah. Dari caranya memanggil, bisa terlihat jelas sifat dan wataknya. “Kapan kamu mau melunasi utangmu? Sudah lebih dari enam bulan kontrakan rumah belum dibayar! Utangmu di warungku juga sudah numpuk! Janjinya bulan depan… bulan depan… bulan depannya lagi! Aku sudah muak dengan janji-janjimu!”

Ketika gagang pintu ditarik dan pintunya bergeser membuka dan membentuk sudut enam puluh derajat, Bu Rumi sudah berdiri tepat di tengah pintu. Seperti ratu kuntilanak menyeramkan. Mirip setan keorangan_bukan orang kesetanan_. Mukanya merah marah. Dua orang berkaos hitam ketat di samping kirinya. Badan kekar. Lengan penuh tato. Yang satu plontos, yang satunya rambut cepak mirip AKABRI masuk desa.

“Ummi, uang ini akan lebih bermanfaat untuk keluarga Bu Fatimah. Mungkin tidak akan ada lagi bulan depan untukku. Mohon pengertiannya, Mi. Ummi juga sudah dengar kata dokter dua minggu lalu.’’ Fatih menatap lekat-lekat Umminya. Mencoba memastikan. Memberikan pengertian sekaligus agar diberi izin memberikan uang yang sudah dipegangnya.

Text preprocessing results :

Thanks.. 😊
