Channel: SCN : All Content - All Communities

A SIMPLE TEXT SEARCH METHOD USING TEXT TO INTEGER TRANSFORMATION


ABSTRACT:


String searches in text files have a wide range of applications, from simple serialization of data and application variables to complex databases. This document proposes a simple search method that avoids performing compare operations on every character of the text. A pre-processing step transforms the text file into integer arrays, and fast binary searches are then performed on the pre-processed file. Searches are thus carried out without the need for indexing or pattern-matching algorithms.

 

 

INTRODUCTION:


Normally, a string search within a text file examines each character of the text, looking for the first character of the string to be searched. Once that is found, the subsequent characters of the text are compared with the remaining characters of the string. If no match occurs, the text is again checked character by character in an effort to find a match. Thus almost every character in the text needs to be examined.
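For reference, the character-by-character search described above can be sketched as follows. This is a minimal illustration, not code from the original article; the function name `naive_search` and the sample text are assumptions, and positions are counted from 1 to match the rest of this document.

```python
def naive_search(text, pattern):
    """Return every 1-based position at which pattern starts in text."""
    matches = []
    for start in range(len(text) - len(pattern) + 1):
        # Compare the pattern with the text one character at a time.
        offset = 0
        while offset < len(pattern) and text[start + offset] == pattern[offset]:
            offset += 1
        if offset == len(pattern):
            matches.append(start + 1)  # report 1-based positions
    return matches


print(naive_search("abc nkj xnkj", "nkj"))  # [5, 10]
```

Every start position in the text is visited, which is the cost the proposed method tries to avoid.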

 

To overcome this tedium, several algorithms and indexing techniques are available, each with its own implementation complexity and area of application. Techniques such as the Boyer–Moore string search algorithm and full-text indexing make the search less tedious and more resource friendly: the former pre-processes the search string and performs a pattern-based search, and the latter categorizes the data into indexes. The method proposed in this document simplifies text search by reducing the compare operations performed on characters to simple binary searches on integer arrays, which are produced by a pre-processing step.

 

The search is split into two stages. In the first stage, as part of pre-processing, one array of integers is generated for each ASCII character occurring in the text file. ASCII defines only 128 characters, of which 95 are printable (94 excluding the space), so a file of printable text yields at most 95 integer arrays. Each integer in an array represents a position at which that array's character occurs in the text file. For example, the character ‘c’ may occur in a particular text file at the twentieth, seventy-eighth and one hundred and twenty-second positions, and so on. The integers 20, 78, 122, ... are then stored as the integer array for ‘c’. All characters in the text file are converted into integer arrays in the same way. The original text file can be recreated from these integer arrays.
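The first stage can be sketched in a few lines. This is only an illustration under stated assumptions: the names `build_position_arrays` and `reconstruct` are hypothetical, the arrays are held in a dictionary keyed by character, and positions start at 1 as in the example above.

```python
from collections import defaultdict


def build_position_arrays(text):
    """Map each character to the sorted list of its 1-based positions."""
    positions = defaultdict(list)
    for pos, ch in enumerate(text, start=1):
        positions[ch].append(pos)  # appended in increasing order
    return dict(positions)


def reconstruct(positions):
    """Rebuild the original text from the position arrays."""
    length = sum(len(array) for array in positions.values())
    chars = [""] * length
    for ch, array in positions.items():
        for pos in array:
            chars[pos - 1] = ch
    return "".join(chars)


index = build_position_arrays("abc abc")
print(index)               # {'a': [1, 5], 'b': [2, 6], 'c': [3, 7], ' ': [4]}
print(reconstruct(index))  # abc abc
```

Because the text is scanned from left to right, each array comes out already sorted, which is what makes the binary search in the second stage possible.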

 

In the second stage, the pre-processed file obtained above is searched. Unlike a traditional search, which compares each character of the text with the characters of the string to be searched, here we have arrays of integers that record the positions at which each character occurs in the text. Because the positions are recorded in increasing order, each array is sorted and can be subjected to binary search.
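A binary search on one such sorted array might look like the sketch below. It uses `bisect_left` from Python's standard `bisect` module; the helper name `contains` is an assumption made for illustration, and the sample array reuses the positions 20, 78, 122 from the ‘c’ example.

```python
from bisect import bisect_left


def contains(sorted_positions, target):
    """Return True if target occurs in the sorted list of positions."""
    i = bisect_left(sorted_positions, target)
    return i < len(sorted_positions) and sorted_positions[i] == target


c_positions = [20, 78, 122]
print(contains(c_positions, 78))  # True
print(contains(c_positions, 79))  # False
```

Each lookup takes a number of steps proportional to the logarithm of the array length rather than to the length of the text.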

 

This method and its application are explained in detail in the following section.

 

 

METHODS AND IMPLEMENTATIONS:


To discuss the implementation of this method, we take the example text file shown below. Say it has a total of N characters, consisting only of lowercase English letters and spaces, as shown in the image below:

for_abs.png

 

The above text file is first subjected to pre-processing, as mentioned earlier. The pre-processing step creates a file that can later be used any number of times to perform searches quickly. In pre-processing, the entire text is scanned for each ASCII character and the positions at which it occurs are stored. In the above example we first check for the character ‘a’, which occurs in the text at positions 1, 6, 9, ... and so on; similarly, the space character ‘ ’ occurs at 4, 9, 18, ... and so on. Processing the file in this way yields one array of integers for each character, as shown below:


Array for character ‘a’: [1, 6, 9, ...]

Array for character ‘b’: [2, 7, 21, ...]


and so on.

 

The resulting arrays of integers are stored in a separate file, and all further searches are performed on the file containing these arrays. As explained, the end result of the pre-processing is a set of integer arrays, each holding the positions of one particular ASCII character in the text file. Since the above text uses only lowercase English letters and spaces, there would be at most 27 arrays (26 letters plus the space), and the total number of integers across all arrays would be N.
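The article does not specify a format for the pre-processed file. As one possible sketch, the arrays could be written out as JSON with Python's standard `json` module; the function names and the file name `positions.json` below are assumptions chosen for illustration only.

```python
import json
import os
import tempfile


def save_position_arrays(positions, path):
    """Write the character-to-positions mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(positions, fh)


def load_position_arrays(path):
    """Read the mapping back; keys are characters, values are sorted lists."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Round trip through a temporary directory so no stray file is left behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "positions.json")  # hypothetical file name
    save_position_arrays({"a": [1, 6, 9], "b": [2, 7, 21]}, path)
    print(load_position_arrays(path))  # {'a': [1, 6, 9], 'b': [2, 7, 21]}
```

A compact binary layout would be smaller on disk; JSON is used here only because it keeps the sketch short and readable.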

 

The file resulting from the above step is retained for subsequent searches. Suppose the string ‘nkj’ needs to be searched for in the text file. In the pre-processed file of integers, we first go to the array for ‘n’. Say that array consists of the integers shown below:

 

45, 55, 62, 70, 78, 85, 94

 

The above integers give the positions at which the character ‘n’ occurs in the text file. With these integers we then check the integer array of the second letter of the string to be searched, ‘k’, whose array might contain (for example):

 

10, 16, 18, 23, 26, 28, 32, 37, 41, 46, 56, 64, 79, ...

 

In the array for ‘k’ we have to search only for the integers that come immediately after the integers in the array for ‘n’. That is, since the ‘n’ array holds 45, 55, 62, 70, 78, 85, 94, we search the ‘k’ array for 46, 56, 63, 71, 79, 86, 95. This can be done with a simple binary search, since the arrays are in sorted order. Among the elements listed for ‘k’ there are three matches: 46, 56 and 79. The search for ‘nkj’ is therefore narrowed for the character ‘j’ to just three elements, 47, 57 and 80, to be searched for in its array. Say the array of integers for ‘j’ is

20, 28, 32, 45, 57, 63, 75, 80

We have two matches, 57 and 80. Hence the searched string is found at positions 55, 56, 57 and again at 78, 79, 80.

 

Thus the compare operation on every character of the text file is replaced by binary searches on three integer arrays.
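The walk-through above can be reproduced with a short sketch. The names `contains` and `search` are assumptions made for illustration, and the three arrays are copied exactly as listed in the text (the ‘k’ array only as far as it is printed).

```python
from bisect import bisect_left


def contains(sorted_positions, target):
    """Binary search: True if target occurs in the sorted list."""
    i = bisect_left(sorted_positions, target)
    return i < len(sorted_positions) and sorted_positions[i] == target


def search(positions, pattern):
    """Return the 1-based start positions of pattern, using only the arrays."""
    if not pattern:
        return []
    # Start with every occurrence of the first character.
    candidates = positions.get(pattern[0], [])
    for offset, ch in enumerate(pattern[1:], start=1):
        array = positions.get(ch, [])
        # Keep a candidate only if ch occurs exactly `offset` places after it.
        candidates = [s for s in candidates if contains(array, s + offset)]
    return candidates


example = {
    "n": [45, 55, 62, 70, 78, 85, 94],
    "k": [10, 16, 18, 23, 26, 28, 32, 37, 41, 46, 56, 64, 79],
    "j": [20, 28, 32, 45, 57, 63, 75, 80],
}
print(search(example, "nkj"))  # [55, 78]
```

With these arrays the candidates shrink from seven (‘n’) to three (45, 55, 78 after checking ‘k’) to two (55 and 78 after checking ‘j’), so ‘nkj’ occupies positions 55 to 57 and 78 to 80.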

 

 

IMPROVEMENTS:

 

1. The search need not start from the first character of the string to be searched. Instead, we may start from the character of the string that occurs least often in the text file, thereby reducing the number of integers to be searched.

2. As the size of the file increases, both the number of integers across all arrays and the number of bits needed for each integer increase. For example, a text of 1223 characters needs 11 bits to represent the position of each character, since 2^10 = 1024 is less than 1223. This bit length grows with the size of the text. To limit this growth, we may split the text suitably or represent positions with arrays of multiple dimensions.
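Both improvements can be sketched briefly. The names `search_rarest_first` and `bits_needed` are hypothetical, the arrays are assumed to be held in a dictionary keyed by character, and the sample arrays are the ones from the ‘nkj’ walk-through.

```python
from bisect import bisect_left


def contains(sorted_positions, target):
    """Binary search: True if target occurs in the sorted list."""
    i = bisect_left(sorted_positions, target)
    return i < len(sorted_positions) and sorted_positions[i] == target


def search_rarest_first(positions, pattern):
    """Like a plain array search, but seeded from the rarest pattern character."""
    if not pattern or any(ch not in positions for ch in pattern):
        return []
    by_rarity = sorted(range(len(pattern)), key=lambda i: len(positions[pattern[i]]))
    pivot = by_rarity[0]
    # Each occurrence of the rarest character implies one possible start.
    candidates = [p - pivot for p in positions[pattern[pivot]] if p - pivot >= 1]
    for i in by_rarity[1:]:
        array = positions[pattern[i]]
        candidates = [s for s in candidates if contains(array, s + i)]
    return candidates


def bits_needed(text_length):
    """Bits required to store any position from 1 to text_length."""
    return max(1, text_length.bit_length())


example = {
    "n": [45, 55, 62, 70, 78, 85, 94],
    "k": [10, 16, 18, 23, 26, 28, 32, 37, 41, 46, 56, 64, 79],
    "j": [20, 28, 32, 45, 57, 63, 75, 80],
}
print(search_rarest_first(example, "nkj"))  # [55, 78]
print(bits_needed(1223))                    # 11
```

In this example ‘n’ is already the rarest character, so the gain shows only when a later character of the search string is rarer than the first. By the bit count used here, a 1223-character text needs 11 bits per position because 2^10 = 1024 is smaller than 1223.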

