Python - Text Preprocessing: Split Text Without Spaces into a List of Basic Words Using the Naive Algorithm
1. Import the Library
import pandas as pd
from math import log
2. Load the Data
# read_excel does not take a sep argument (that belongs to read_csv)
data = pd.read_excel('DataTrend.xlsx')
data
3. Build a Dictionary
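Create a dictionary as a plain .txt file that stores the basic words: common words, standard words, and slang words. Separate the words with whitespace and order them from most to least frequent, because the cost computed in the next step grows with each word's position in the file. In this example, the dictionary file is named KataDasar.txt.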
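One possible way to get a frequency-ordered word list is sketched below. It assumes you already have some correctly spaced sample text to count words from; the `sample_text` string and the `rank_words` and `save_dictionary` helpers are hypothetical and are not part of the original steps.

```python
from collections import Counter

def rank_words(sample_text):
    """Return the distinct words of sample_text, most frequent first."""
    counts = Counter(sample_text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return [word for word, _ in counts.most_common()]

def save_dictionary(words, path):
    """Write one word per line so that .read().split() can load it back."""
    with open(path, "w", encoding="utf8") as f:
        f.write("\n".join(words))

# Hypothetical sample text; replace it with your own corpus.
sample_text = "saya suka makan nasi saya suka minum teh saya"
ordered_words = rank_words(sample_text)
print(ordered_words)
```

Calling `save_dictionary(ordered_words, "KataDasar.txt")` would then write the file in the layout that the next step reads.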
4. Define a Function to Calculate the Probability
words = open("KataDasar.txt", encoding="utf8").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
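The cost of a word is the logarithm of its rank multiplied by log(N), where N is the number of words in the dictionary. This is the negative log of a probability of roughly 1 / (rank * log N), which is what Zipf's law gives for a frequency-ordered list, so words near the top of the file are cheaper. A small sketch with a hypothetical four-word list (not the real KataDasar.txt) shows the costs increasing with position:

```python
from math import log

# Hypothetical four-word dictionary, most frequent word first.
words = ["yang", "dan", "di", "makan"]
wordcost = dict((k, log((i + 1) * log(len(words)))) for i, k in enumerate(words))

# Print each word with its cost; later words cost more.
for word in words:
    print(f"{word}: {wordcost[word]:.3f}")
```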
Define a function that matches substrings of the text against the basic words in the .txt dictionary and inserts spaces at the lowest-cost word boundaries.
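def infer_spaces(text):
    # Find the cheapest last word for the first i characters, assuming cost
    # has already been filled in for every shorter prefix.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(text[i-k-1:i], 9e999), k+1) for k,c in candidates)
    # Build the cost array: cost[i] is the lowest cost of splitting text[:i].
    cost = [0]
    for i in range(1,len(text)+1):
        c,k = best_match(i)
        cost.append(c)
    # Backtrack from the end to recover the lowest-cost sequence of words.
    out = []
    i = len(text)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(text[i-k:i])
        i -= k
    return " ".join(reversed(out))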
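To try the function without the Excel file or the full dictionary, here is a self-contained sketch. It repeats the function so that the snippet runs by itself, and it swaps KataDasar.txt for a tiny in-memory word list, which is an assumption made only for this illustration.

```python
from math import log

# Hypothetical mini-dictionary, most frequent first (stands in for KataDasar.txt).
words = ["saya", "suka", "makan", "nasi", "goreng"]
wordcost = {k: log((i + 1) * log(len(words))) for i, k in enumerate(words)}
maxword = max(len(x) for x in words)

def infer_spaces(text):
    cost = [0]

    def best_match(i):
        # Try every candidate last word of length 1..maxword ending at i.
        candidates = enumerate(reversed(cost[max(0, i - maxword):i]))
        return min((c + wordcost.get(text[i - k - 1:i], 9e999), k + 1)
                   for k, c in candidates)

    # Forward pass: lowest cost for each prefix of the text.
    for i in range(1, len(text) + 1):
        c, k = best_match(i)
        cost.append(c)

    # Backward pass: recover the words of the lowest-cost split.
    out = []
    i = len(text)
    while i > 0:
        c, k = best_match(i)
        assert c == cost[i]
        out.append(text[i - k:i])
        i -= k
    return " ".join(reversed(out))

result = infer_spaces("sayasukamakannasigoreng")
print(result)  # saya suka makan nasi goreng
```

Substrings that are not in the dictionary get an infinite cost (9e999 evaluates to inf in Python), so the quality of the split depends on how well the dictionary covers the text.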
5. Convert the DataFrame Column into a List
pecah = []
Trendtopics = data.trend.tolist()
Trendtopics
6. Split Text
# Lowercase each topic, split it into words, and collect the results
for topic in Trendtopics:
    tren = topic.lower()
    trentop = infer_spaces(tren)
    pecah.append(trentop)
    print(trentop)
Reference: Stack Overflow