Python - Text Preprocessing: Splitting Text without Spaces into a List of Basic Words Using a Naive Algorithm

In this article, I would like to share how to split text that has no spaces into a list of basic words, using a naive algorithm in Python 3.

Python already has the wordsegment library for this kind of text splitting. Unfortunately, wordsegment only works well on text made up of common, standard words. In the cases I have come across, such as splitting Twitter trending topics that contain slang like otw or btw, the library is less helpful.

I use Jupyter Notebook here. The complete steps for splitting the text are shown in the following source code:

1. Import the Library

import pandas as pd
from math import log

2. Load the Data

# sep is a read_csv option and is not needed for an Excel file
data = pd.read_excel('DataTrend.xlsx')
data



3. Build a Dictionary

Create a dictionary as a .txt file that stores the basic words: common, standard words as well as slang words.
For example, name the dictionary file KataDasar.txt.
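Here is a minimal sketch of how such a file could be prepared. The word list and the write_dictionary helper are hypothetical, only for illustration. Because the cost in step 4 grows with a word's position in the file, I assume the words are written from the most common to the least common.

```python
import os
import tempfile

# Hypothetical sample entries: earlier words are treated as more frequent.
sample_words = ["di", "otw", "btw", "jalan", "bandung", "macet"]

def write_dictionary(path, words):
    """Write one word per line, lower-cased, keeping the first occurrence only."""
    seen = set()
    with open(path, "w", encoding="utf8") as f:
        for w in words:
            w = w.strip().lower()
            if w and w not in seen:
                seen.add(w)
                f.write(w + "\n")

# Written to a temporary folder here so that no existing file is overwritten.
dict_path = os.path.join(tempfile.mkdtemp(), "KataDasar.txt")
write_dictionary(dict_path, sample_words + ["OTW"])  # the duplicate "OTW" is dropped

# Read it back the same way the article does in step 4.
with open(dict_path, encoding="utf8") as f:
    loaded = f.read().split()
print(loaded)
```

In a real project the file would simply be saved as KataDasar.txt next to the notebook.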


4. Calculate the Cost of Each Word

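# Read the dictionary; words are assumed to be listed from most to least frequent
with open("KataDasar.txt", encoding="utf8") as f:
    words = f.read().split()
# The cost of a word grows with its rank in the list (Zipf's law), so frequent words are cheaper
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)  # length of the longest dictionary word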
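To see what these lines produce, here is a small sketch with a hypothetical four-word list in place of KataDasar.txt. It only prints the cost of each word, so you can check that words nearer the top of the list get a lower cost.

```python
from math import log

# Hypothetical word list, assumed to be ordered from most to least frequent.
words = ["di", "otw", "btw", "bandung"]

# Same formula as in the article: the cost depends only on the rank of the word.
wordcost = {w: log((rank + 1) * log(len(words))) for rank, w in enumerate(words)}
maxword = max(len(w) for w in words)

for w in words:
    print(f"{w:10s} cost = {wordcost[w]:.3f}")
print("longest word length:", maxword)
```

Note that the formula needs more than one word in the list, because log(len(words)) is 0 for a single word and log(0) raises an error.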

 
5. Define the Split Function

Define a function that matches the characters of the text against the list of basic words in the .txt dictionary and picks the split with the lowest total cost.


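def infer_spaces(text):
    # Find the best match for the first i characters, assuming the cost
    # has already been built for the first i-1 characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(text[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array
    cost = [0]
    for i in range(1,len(text)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the split with the lowest cost
    out = []
    i = len(text)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(text[i-k:i])
        i -= k

    return " ".join(reversed(out))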
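To try the function without any files, the sketch below repeats the cost setup and the split function with a small hypothetical word list, then splits two sample strings. The word list and the sample strings are my own assumptions, not data from DataTrend.xlsx.

```python
from math import log

# Hypothetical dictionary, assumed to be ordered from most to least frequent.
words = ["di", "otw", "btw", "jalan", "bandung", "macet"]
wordcost = {w: log((rank + 1) * log(len(words))) for rank, w in enumerate(words)}
maxword = max(len(w) for w in words)

def infer_spaces(text):
    # Best (cost, length) for a word ending at position i.
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i - maxword):i]))
        return min((c + wordcost.get(text[i - k - 1:i], 9e999), k + 1)
                   for k, c in candidates)

    # cost[i] is the lowest total cost of splitting the first i characters.
    cost = [0]
    for i in range(1, len(text) + 1):
        c, k = best_match(i)
        cost.append(c)

    # Walk backwards from the end, taking the best word at each position.
    out = []
    i = len(text)
    while i > 0:
        c, k = best_match(i)
        out.append(text[i - k:i])
        i -= k
    return " ".join(reversed(out))

print(infer_spaces("otwbandung"))      # otw bandung
print(infer_spaces("jalandibandung"))  # jalan di bandung
```

The slang word otw is only split correctly because it is in the word list, which is the reason for building our own dictionary in step 3.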



6. Change the DataFrame into a List

pecah = []  # will hold the split results
Trendtopics = data.trend.tolist()  # the 'trend' column as a plain Python list
Trendtopics

 


7. Split Text

for topic in Trendtopics:
    tren = topic.lower()          # lower-case the text before matching
    trentop = infer_spaces(tren)  # returns the words joined by spaces
    pecah.append(trentop)
    print(trentop)
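The same loop can be tried without the Excel file. In this sketch the topic list, the word list, and the split_topics helper are hypothetical. The last topic contains a letter that is not covered by any dictionary word, to show what happens in that case: every unknown substring gets the cost 9e999 (infinity in Python), so from that point on the function falls back to single characters.

```python
from math import log

# Hypothetical dictionary, assumed to be ordered from most to least frequent.
words = ["di", "otw", "btw", "bandung", "macet"]
wordcost = {w: log((rank + 1) * log(len(words))) for rank, w in enumerate(words)}
maxword = max(len(w) for w in words)

def infer_spaces(text):
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i - maxword):i]))
        return min((c + wordcost.get(text[i - k - 1:i], 9e999), k + 1)
                   for k, c in candidates)

    cost = [0]
    for i in range(1, len(text) + 1):
        c, k = best_match(i)
        cost.append(c)

    out = []
    i = len(text)
    while i > 0:
        c, k = best_match(i)
        out.append(text[i - k:i])
        i -= k
    return " ".join(reversed(out))

def split_topics(topics):
    """Lower-case each topic and split it into dictionary words."""
    return [infer_spaces(t.lower()) for t in topics]

# Hypothetical trending topics; the last one contains the unknown letter "x".
topics = ["OTWBandung", "BandungMacet", "otwxbandung"]
pecah = split_topics(topics)
for original, separated in zip(topics, pecah):
    print(original, "->", separated)
```

If you see a result broken into single letters like the last one, it usually means a word is missing from KataDasar.txt and should be added.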



 
Thank you... I hope this article can help you 😊😊😊

Reference: Stack Overflow
