Text Processing with Haskell and Python

In this article I will calculate some basic statistics for a text file with Python and Haskell. The text file is from here. The following are calculated:


1) Number of words
2) Number of sentences
3) Number of question marks
4) Number of unique words
5) Number of paragraphs
6) First paragraph of every section
7) Last paragraph of every section


Part 1: Haskell

Haskell is a purely functional language: a program is built by defining functions, composing them, and evaluating the resulting expressions.

The requirements take very little code in Haskell because the functions they need are straightforward and already available in its standard libraries. My code is as follows:

Screenshot from 2015-12-27 10:46:45

Basically, the main work is done by the filter function. It keeps only the elements of a list (here, the characters or lines read from the file) that satisfy a condition, and the result is what gets counted.

The import clause brings in the Data.List library, which provides list functions such as nub. The text is split into lists of strings (lines, words, and so on) so that paragraphs, words, etc. can be counted.

putStr prints a string to the command line.

length returns the number of elements in a list.

I used Hugs to run my code. Hugs is an interpreter for Haskell files and expressions. For example, to load a file into Hugs it is enough to type :l <file-path> at the prompt, after which the functions defined in the file can be evaluated. In my case, I ran /tmp/count.hs with the runhugs command. It looks similar to the following (the number of unique words is being calculated in the picture):

Screenshot from 2015-12-27 11:13:16

For question marks, filter keeps the question mark characters (?); for sentences, it keeps the full stops (.); for words, the text is split with the words function; for paragraphs, it looks for empty lines that come after a line of text; and for unique words, the nub function removes the duplicates of every word in the list.

Part 2: Python

Python is a reliable, flexible, and easy-to-use language. The same script generally runs without trouble on any of the major operating systems.

I had previously installed a LAMP server, and that included Python, so I did not need to install anything else on my computer. Typing python at the command line is enough to open the Python interactive interpreter. Running the Python code is similar to the Haskell case: typing python <file-path> runs the code in the file, which I wrote in gedit. The commented code for the first 5 requirements is as follows. It is very simple: a single for loop that increments the counters. Haskell has no loop statements, so there the same work was done with functions such as filter instead.


Screenshot from 2015-12-27 11:20:15

Screenshot from 2015-12-27 11:21:37
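
Since the code above survives only as screenshots, here is a rough sketch of how a single loop over the lines could produce the first five counts. It is not the code from the screenshots. The function name text_stats, the punctuation list, and the rules it applies (a sentence is counted per full stop, unique words are compared case-insensitively, a paragraph is a run of non-empty lines) are my assumptions for illustration.

```python
# Assumption: punctuation stripped from the edges of a word before comparing.
PUNCTUATION = ".,;:!?\"'()"


def text_stats(text):
    """Count words, sentences, question marks, unique words and paragraphs."""
    words = sentences = question_marks = paragraphs = 0
    unique = set()
    in_paragraph = False

    for line in text.splitlines():
        if line.strip():
            # First non-empty line after a blank line starts a new paragraph.
            if not in_paragraph:
                paragraphs += 1
                in_paragraph = True
        else:
            in_paragraph = False

        # Assumption: one sentence per full stop, as in the Haskell version.
        sentences += line.count(".")
        question_marks += line.count("?")

        for word in line.split():
            words += 1
            cleaned = word.strip(PUNCTUATION).lower()
            if cleaned:
                unique.add(cleaned)

    return {
        "words": words,
        "sentences": sentences,
        "question_marks": question_marks,
        "unique_words": len(unique),
        "paragraphs": paragraphs,
    }


sample = "Is this a test? Yes. It is a test.\n\nSecond paragraph here."
stats = text_stats(sample)
print(stats)
```

Counting sentences by full stops alone misses sentences that end with a question mark and overcounts abbreviations, so the number is only an approximation.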



For the sixth and seventh requirements, I saved the index of every subtitle for every chapter of the book and used those indexes to divide the book into sub-ranges of paragraphs. Taking the first and last element of each sub-range is enough to meet the requirements. Every paragraph found this way is written to another file, because the command line is not enough to see all of them at once. That file looks like the following:

Screenshot from 2015-12-27 11:23:40
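
The index-based splitting described above can be sketched roughly as follows. This is only an illustration under assumptions: the helper names are made up, and the is_heading rule (a one-line paragraph starting with "CHAPTER") is hypothetical, since how subtitles are recognised depends on the book being processed.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def is_heading(paragraph):
    # Hypothetical rule: a single-line paragraph that starts with "CHAPTER".
    return "\n" not in paragraph and paragraph.upper().startswith("CHAPTER")


def first_and_last_paragraphs(text):
    """Return (heading, first paragraph, last paragraph) for every section."""
    paragraphs = split_paragraphs(text)
    heading_indexes = [i for i, p in enumerate(paragraphs) if is_heading(p)]
    # Each section runs from one heading up to the next heading (or the end).
    section_ends = heading_indexes[1:] + [len(paragraphs)]

    result = []
    for start, end in zip(heading_indexes, section_ends):
        body = paragraphs[start + 1:end]
        if body:  # skip headings that have no paragraphs under them
            result.append((paragraphs[start], body[0], body[-1]))
    return result


book = (
    "CHAPTER 1\n\nFirst one.\n\nMiddle one.\n\nLast one.\n\n"
    "CHAPTER 2\n\nOnly one."
)
sections = first_and_last_paragraphs(book)
for heading, first, last in sections:
    print(heading, "|", first, "|", last)
```

A section with a single paragraph reports that paragraph as both its first and its last.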

Since there are lots of unique words (around 100k for that book), calculating them takes a long time in both Haskell and Python. It is especially slow in Haskell.

The unique words file looks like the following. It can be checked for duplicates by searching for a specific word in the file. I did that and did not find any duplicates, so the code is working!
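
One likely reason for the slowness, at least on the Haskell side, is that nub compares every element with all of the elements it has kept so far, so the work grows roughly with the square of the number of unique words. The sketch below contrasts that list-based strategy with a set-based one in Python. Both function names are made up for illustration, and whether the Python code in the screenshots used a list or a set is not something I can confirm here.

```python
def unique_like_nub(words):
    """Remove duplicates by scanning a list, the way nub does (quadratic)."""
    kept = []
    for word in words:
        if word not in kept:  # linear scan of everything kept so far
            kept.append(word)
    return kept


def unique_with_set(words):
    """Remove duplicates with a set for membership tests (roughly linear)."""
    seen = set()
    kept = []
    for word in words:
        if word not in seen:  # hash lookup instead of a scan
            seen.add(word)
            kept.append(word)
    return kept


words = "the cat and the dog and the bird".split()
slow_result = unique_like_nub(words)
fast_result = unique_with_set(words)
print(fast_result)
```

Both functions keep the first occurrence of each word in its original order, so they return the same list; only the cost of the membership test differs.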

Screenshot from 2015-12-27 11:25:55
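
Searching for single words by hand only checks the words that were tried. As a sketch of a more complete check, the whole list could be counted instead. The function name find_duplicates is made up, and the example uses an in-memory list rather than the actual output file, whose format is not shown here.

```python
from collections import Counter


def find_duplicates(words):
    """Return every word that appears more than once, in sorted order."""
    counts = Counter(words)
    return sorted(word for word, count in counts.items() if count > 1)


clean = ["apple", "banana", "cherry"]
dirty = ["apple", "banana", "apple", "cherry", "banana"]
print(find_duplicates(clean))
print(find_duplicates(dirty))
```

An empty result means the list contains no duplicates at all.
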
Any comments or suggestions would be very much appreciated. See you soon!

