G-String ASCII file editor

Version of June, 2004

The GString class library is distributed as part of the string library. See the string library documentation on how to customise the include.h file for your compiler and for the file list.

The GString class library is an editor similar, in concept, to the Unix SED editor. One writes a series of commands to edit an ASCII file such as, for example, HTML or C++ source. This is useful if you want to do the same editing job over and over again on one or several files. For example, you might want to change the name of a class in a C++ program and have to go through all your source files to make the change.

This editor, however, is a C++ library, and the series of commands is written in C++. So one has the full power of C++ and does not have to learn a new language. This editor, by itself, is slightly less ambitious than the SED editor. However, when you add in the extra flexibility and capabilities provided by it being embedded in C++, you get a much more powerful product that is just as easy to use if you already know the basics of C++. At least that is the intention.

This is still a development version and there are a number of rough edges. It needs further testing and development.

There are three components:

The StringList class stores the file as a linked list of strings, each string corresponding to a line in the original file. The GString (generalised strings) family of classes provides the editing capability. The TaggedStringList family of classes allows one to edit subsets of the strings making up the original file.

See the example section for some simple examples.

This library uses my string library.

GString uses nested classes so will not run on older compilers.

 

I have tested this on Borland 5.02, Borland Builder 5, Microsoft VC++ 6 and 7, Intel 5 and 8 for Windows (all in console mode); Gnu G++ 2.96, 3.1 for Linux and Intel 6 for Linux; Gnu G++ 2.95 on Cygwin; and Gnu G++ 2.95 and Sun CC 6 on a Sun. There was a problem with the test program on Intel 5 for Linux, and CC generates a number of warning messages; otherwise everything was fine.

StringList class

This is similar to the standard template library list class, implemented specifically for Strings.

The member functions and friends

StringList SL; Construct a StringList.
StringList(StringList& SL); Construct a StringList from another StringList.
~StringList(); The destructor.
void operator=(StringList& SL); Copy a StringList.
void push_back(const String& s); Insert a new string at the end of a StringList.
void push_front(const String& s); Insert a new string at the beginning of a StringList.
String pop_back(); Return the string at the end of a StringList and erase it from the StringList.
String pop_front(); Return the string at the beginning of a StringList and erase it from the StringList.
void CleanUp(); Erase all the strings from the StringList.
int size(); Return the number of strings in the StringList.
void Format(int width, StringList& sl); Word-wrap the document represented by the StringList to fit in a page of the given width. Return the result to sl.
friend ostream& operator<<(ostream& os, StringList& sl); Output the entire StringList to a file.
friend void operator>>(istream& is, StringList& sl); Read an entire file to a StringList.
StringList_String operator()(const String& s); Return a StringList_String by selecting strings containing the string s.
StringList_GString operator()(GString& g); Return a StringList_GString by selecting strings conforming to the pattern g.
StringList_Range operator()(const String& s1, const String& s2, int ends = 3); Return a StringList_Range by selecting ranges delimited by the strings s1 and s2. See note below.
TaggedStringList All(); Return a TaggedStringList containing all of the StringList.
iterator begin(); Return an iterator pointing to the beginning of the StringList.
iterator end(); Return an iterator pointing to one past the end of the StringList.
reverse_iterator rbegin(); Return a reverse iterator pointing to the end of the StringList.
reverse_iterator rend(); Return a reverse iterator pointing one before the beginning of the StringList.
iterator find(const String& s); Return an iterator pointing to the first string equal to s.
reverse_iterator rfind(const String& s); Return a reverse iterator pointing to the last string equal to s.
iterator find(const String& s, iterator i); Return an iterator pointing to the first string equal to s after i.
reverse_iterator rfind(const String& s, reverse_iterator i); Return a reverse iterator pointing to the last string equal to s before i.
void erase(iterator i); Erase the string corresponding to i. See note below.
void erase(reverse_iterator i); Erase the string corresponding to i. See note below.
void insert_before(const String& s, iterator i); Insert a string, s, before iterator i.
void insert_after(const String& s, iterator i); Insert a string, s, after iterator i.

Notes

The parameter ends in

   StringList_Range operator()(const String& s1, const String& s2, int ends = 3);

determines which of the range ends are included in the range.

   ends   range ends included in TaggedStringList
   0      neither
   1      second
   2      first
   3      both
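
The effect of ends can be illustrated with a small stand-alone sketch. It does not use the library: select_range is a hypothetical function, std::vector stands in for StringList, and it assumes that a range opens at a line containing s1, closes at the next later line containing s2, and runs to the end of the file if it is never closed. The library's actual rules may differ.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for SL(s1, s2, ends): returns the indices of the
// selected lines. Assumption: a range opens at a line containing s1 and
// closes at the next later line containing s2.
std::vector<std::size_t> select_range(const std::vector<std::string>& lines,
                                      const std::string& s1,
                                      const std::string& s2,
                                      int ends = 3)
{
    const bool keep_first  = (ends & 2) != 0;   // ends 2 or 3: keep opening line
    const bool keep_second = (ends & 1) != 0;   // ends 1 or 3: keep closing line
    std::vector<std::size_t> selected;
    bool inside = false;
    for (std::size_t i = 0; i < lines.size(); ++i)
    {
        if (!inside)
        {
            if (lines[i].find(s1) != std::string::npos)
            {
                inside = true;                   // range opens here
                if (keep_first) selected.push_back(i);
            }
        }
        else if (lines[i].find(s2) != std::string::npos)
        {
            inside = false;                      // range closes here
            if (keep_second) selected.push_back(i);
        }
        else
        {
            selected.push_back(i);               // interior line, always kept
        }
    }
    return selected;
}
```

With the lines a, BEGIN, x, y, END, b and the delimiters BEGIN and END, ends = 3 selects lines 1 to 4 (counting from 0) and ends = 0 selects only lines 2 and 3.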

After erasing a string corresponding to an iterator i you cannot manipulate i, for example, with i++. This means that special care must be taken with loops which erase elements from a StringList. For example, use

StringList::iterator i1;
for (StringList::iterator i = SL.begin(); i != SL.end(); i=i1)
{
   i1 = i; ++i1; SL.erase(i);
}

rather than

for (StringList::iterator i = SL.begin(); i != SL.end(); ++i)
{
   SL.erase(i);
}
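
The same pattern applies when only some of the strings are to be erased. The sketch below uses std::list<std::string> as a stand-in for StringList (std::list has the same restriction on erased iterators); erase_containing is a hypothetical helper, not a library function.

```cpp
#include <list>
#include <string>

// Erase every line containing `word`. The successor is saved before the
// erase, because the erased iterator must not be incremented afterwards.
int erase_containing(std::list<std::string>& lines, const std::string& word)
{
    int erased = 0;
    std::list<std::string>::iterator next;
    for (std::list<std::string>::iterator i = lines.begin();
         i != lines.end(); i = next)
    {
        next = i; ++next;                 // remember the successor first
        if (i->find(word) != std::string::npos)
        {
            lines.erase(i);               // i is now invalid; next is still valid
            ++erased;
        }
    }
    return erased;
}
```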

Iterator functions

Function   Effect
++i        iterator: point to next string; reverse iterator: point to previous string
--i        iterator: point to previous string; reverse iterator: point to next string
i++        return current value of iterator, but increment value of iterator (both iterator types)
i--        return current value of iterator, but decrement value of iterator (both iterator types)
*i         return the string corresponding to the iterator (both iterator types)
i==j       true if the iterators are the same (both iterator types)
i!=j       true if the iterators are different (both iterator types)
i->        (*i). (both iterator types)

 

GString classes

These comprise the family of classes for describing a pattern that one might match a string to.

A GString expression is a C++ expression involving any of the operators +, |, &, ^, ~, < and >, together with character strings, Strings and other GString objects.

For a string to match a GString expression the entire string must be matched. This contrasts with SED, where an expression need only match a segment of the string. For example the GString expression

   DOTS + "quick" + DOTS

does match The quick brown fox whereas

   GS + "quick"

does not. In these expressions + means concatenate; DOTS will match any string; GS is included purely to make the C++ compiler recognise the character string as a GString expression.
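
The difference between matching the whole string and matching a segment can be seen with the standard <regex> library, used here purely as an analogy (GString does not use regular expressions). In this sketch the regular expression .*quick.* plays the role of DOTS + "quick" + DOTS, and the function names are assumptions made for illustration.

```cpp
#include <regex>
#include <string>

// Whole-string matching: the behaviour GString expressions require.
bool matches_whole(const std::string& line, const std::string& pattern)
{
    return std::regex_match(line, std::regex(pattern));
}

// Segment matching: the pattern may match anywhere inside the line, as in SED.
bool matches_segment(const std::string& line, const std::string& pattern)
{
    return std::regex_search(line, std::regex(pattern));
}
```

For the line The quick brown fox, matches_whole succeeds with .*quick.* but fails with quick alone, whereas matches_segment succeeds with quick.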

GString expressions typically occur in any of three places. In the statement

   SL(gs1).s(gs2,gs3);

where gs1, gs2 and gs3 are GString expressions, gs1 determines which strings from the StringList SL are selected for editing; gs2 (which may be the same as gs1) represents the target for editing; and gs3 determines the results of the editing. If gs2 fails to match a string, that string is not edited. In this statement gs2 and gs3 must have exactly the same pattern of operators, with two exceptions described below. Suppose I wish to find all strings in SL which contain the word quick and change brown to black in these strings. Then I could use the expression

   SL(DOTS + "quick" + DOTS).s(DOTS+"brown"+DOTS,
                               DOTS+"black"+DOTS);

where I have put the gs3 expression under the gs2 expression to make sure the operators match. (Where brown occurs more than once in an individual string, only the first will be changed). In this particular instance it would have been simpler to use

   SL("quick").sf("brown","black");

where sf means substitute first. Where only simple strings are involved, the program does follow the SED convention of requiring only that the search string be included rather than having an exact match of the whole string.

Here is the list of GString classes, objects and operators.

gs.Matches(s)            true if GString gs matches a String s
CI(A)                    matches if A matches with a case-insensitive compare
A | B                    matches if either A or B matches
A & B                    matches if both A and B match
A ^ B                    matches if one but not both of A and B match
A + B                    matches if the target string can be divided into two parts, the first matching A and the second matching B
A > B                    same as + but first match A, then see if B matches
A < B                    same as + but first match B, then see if A matches
~ A                      matches if A does not match
AnyString                class derived from GString: match any string (including a zero length string); remember the string that is matched
LongestString            class derived from GString: match any string but try to find the longest possibility; remember the string that is matched
ShortestString           class derived from GString: match any string but try to find the shortest possibility; remember the string that is matched
FixedLengthString(int n) class derived from GString: match any string of length n; remember the string that is matched
.Value()                 get string stored by any of previous 4 classes
Dots                     class derived from GString: match any string (including a zero length string)
LongestDots              class derived from GString: match any string but try to find the longest possibility
ShortestDots             class derived from GString: match any string but try to find the shortest possibility
FixedLengthDots(int n)   class derived from GString: match any string of length n
WhiteSpace               class derived from GString: match white space - try to match as much as possible (length must be greater than zero)
OptionalWhiteSpace       class derived from GString: match white space - try to match as much as possible (length may be zero)
InitialGString           a dummy class to enable the C++ compiler to recognise a character string or a String as a GString object
DOTS                     a globally declared Dots object
LDOTS                    a globally declared LongestDots object
SDOTS                    a globally declared ShortestDots object
DOT                      a globally declared FixedLengthDots object with length 1
WS                       a globally declared WhiteSpace object
OWS                      a globally declared OptionalWhiteSpace object
GS                       a globally declared InitialGString object
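
The logical operators in this list behave like ordinary combinations of yes/no tests on the whole string. The sketch below models a pattern as a function from a string to bool; it is an illustration only, and the type Matcher and all the function names are assumptions, not library names.

```cpp
#include <functional>
#include <string>

// A pattern is modelled as a predicate on the whole string.
typedef std::function<bool(const std::string&)> Matcher;

// Matches only the exact text, like GS + text.
Matcher literal(const std::string& text)
{
    return [text](const std::string& s) { return s == text; };
}

// Matches any string that includes text, like DOTS + text + DOTS.
Matcher containing(const std::string& text)
{
    return [text](const std::string& s) {
        return s.find(text) != std::string::npos;
    };
}

// A | B: either matches.
Matcher either(Matcher a, Matcher b)
{
    return [a, b](const std::string& s) { return a(s) || b(s); };
}

// A & B: both match.
Matcher both(Matcher a, Matcher b)
{
    return [a, b](const std::string& s) { return a(s) && b(s); };
}

// A ^ B: one but not both match.
Matcher one_of(Matcher a, Matcher b)
{
    return [a, b](const std::string& s) { return a(s) != b(s); };
}

// ~ A: A does not match.
Matcher negate(Matcher a)
{
    return [a](const std::string& s) { return !a(s); };
}
```

For example, either(literal("cat"), literal("dog")) corresponds to GS | "cat" | "dog" and accepts exactly the strings cat and dog.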

The objects constructed by LongestString, ShortestString, and FixedLengthString will match any string but remember the string that they matched. This is returned in the editing phase of the program. If there is any ambiguity in the way the match is carried out LongestString will try to match the longest possible string and ShortestString will try to match the shortest possible string. (The rules governing LongestString and ShortestString need more refining).

Use DOTS to match any string. Use LDOTS or SDOTS to remove ambiguities by matching the longest or shortest possible string.

Use DOT to match a single character.

The GS object should be used when a String object or a character string needs to be converted to a GString object and the C++ syntax rules don't do this automatically. For example,

   GS | "cat" | "dog"

means cat or dog,

   GS + "The" + DOTS + "dog" + DOTS

means The (or There, Then etc) at the beginning of the string and dog somewhere within the string. If the GS is left out you will get a C++ syntax error. The need for this trick is one of the downsides of embedding the editor in C++. At the moment +, |, &, etc are implemented as member functions so you need a GS as shown in the preceding expressions and also in expressions such as

   DOTS + "one" > GS + "two" + DOTS

In the final editing stage the values returned to the string are determined by the second argument (gs3) of s. DOTS means copy from the corresponding part of the string before editing. A String object or a character string means replace the corresponding text in the original text with this text. A LongestString, ShortestString or FixedLengthString object means copy the text remembered by that object. In each case where there are alternative matching strings in the GString expression only those strings leading to the match are copied.

It doesn't seem possible to track expressions preceded by ~ in a sensible way. So I don't try to do this. The whole ~ clause, including the ~, can be replaced by a single character string or by DOTS (if we want to copy the original string).

Some examples are given at the end of this document.

Concatenated GStrings:

Try to avoid concatenated sequences using + with more than 3 objects of unknown length. Otherwise matching may be very slow. If possible use < or > in place of + to improve speed. That is, use

   A > B > C > D

or

   A < B < C < D

in preference to

   A + B + C + D

where A, B, C and D are GStrings and we are trying to match the concatenation of A, B, C and D. If the length of these GStrings can't be determined in advance the expression with + will search all possible combinations of the lengths until a match is found. This can require an excessive search time. If the expression with > is used, then A is matched, scanning over possible lengths of A. If successful then B is matched and so on. If the expression with < is used then the search starts with D. These will tend to be faster but with the possibility of missing a match. You need to have a string that is matched explicitly (as opposed to DOTS, for example) immediately to the left of each > or to the right of each <. For example,

   GS + "The"  >  DOTS + "dog"  >  DOTS + "."

should be used in preference to

   GS + "The" + DOTS  >  GS + "dog" + DOTS  >  "."

The second version will not match DOTS correctly and the match will almost certainly fail.
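
The trade-off between + and > can be sketched with a toy model. This is one plausible reading of the description above, not the library's implementation: patterns are modelled as predicates on a whole string, prefixes are tried shortest first, and > commits to the first prefix that matches its left operand. All names are assumptions.

```cpp
#include <cstddef>
#include <functional>
#include <string>

typedef std::function<bool(const std::string&)> Matcher;

// Matches only the exact text.
Matcher literal(const std::string& text)
{
    return [text](const std::string& s) { return s == text; };
}

// Matches any string that ends with text, like DOTS + text.
Matcher ends_with(const std::string& text)
{
    return [text](const std::string& s) {
        return s.size() >= text.size() &&
               s.compare(s.size() - text.size(), text.size(), text) == 0;
    };
}

// Model of A + B: try every split point until both halves match.
Matcher concat_search(Matcher a, Matcher b)
{
    return [a, b](const std::string& s) {
        for (std::size_t n = 0; n <= s.size(); ++n)
            if (a(s.substr(0, n)) && b(s.substr(n))) return true;
        return false;
    };
}

// Model of A > B: commit to the first prefix that matches A,
// then test B once on the remainder (no backtracking).
Matcher concat_commit(Matcher a, Matcher b)
{
    return [a, b](const std::string& s) {
        for (std::size_t n = 0; n <= s.size(); ++n)
            if (a(s.substr(0, n))) return b(s.substr(n));
        return false;
    };
}
```

In this model, concat_search(ends_with("a"), literal("b")) matches aab, because the split aa / b is eventually tried. concat_commit with the same operands does not: it commits to the prefix a and then fails on the remainder ab. Nesting concat_search for several operands of unknown length multiplies the number of splits tried, which is the source of the slow searches described above.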

Longest and shortest strings:

The direction of the search (from left to right or right to left) when matching an expression with +, < or > is determined by the use of ShortestString, LongestString, SDOTS or LDOTS objects. It is easy to set up situations where the results are unpredictable and this needs further work.

White Space:

The WhiteSpace and OptionalWhiteSpace objects, WS and OWS, match sequences of spaces, tabs, line feeds and carriage returns. The surrounding characters must not be white space characters.

TaggedStringList classes

The TaggedStringList family is used to select a subset of the StringList for editing. It includes the TaggedStringList class and three classes, StringList_String, StringList_Range and StringList_GString, derived from it. Objects from these classes are generated from one of the following expressions:

   SL.All()
   SL(string)
   SL(string, string, ends)
   SL(gstring)

where SL is a StringList, string is a String, gstring is a GString, and ends is an int. These classes enable one to edit a subset of the strings in a StringList.

SL.All() includes all the strings in SL.

SL(s), where s is either a character string or an object of class String, includes those strings in SL which include the string s.

SL(s1, s2, ends), where s1 and s2 are either character strings or objects of class String, includes those strings in SL selected in the range s1 to s2. The ends parameter, if included, determines which ends of the ranges are included. See the description in the notes on the StringList class. The default is to include both ends.

SL(gs), where gs is a GString expression, includes those strings in SL which match the pattern gs. Note that here, as elsewhere, the pattern must match the complete string and not just a subset of it.

Objects of these classes should not be constructed as stand-alone objects but should be used as part of an editing function such as

   SL.All().sa("target","replace with");

They can also be passed to a function as a TaggedStringList& parameter. This provides a way of carrying out more complicated editing functions that cannot be carried out directly with the functions provided in the TaggedStringList family. These classes can access the TaggedStringList::iterator and TaggedStringList::reverse_iterator classes, which have the same properties as the corresponding classes associated with the StringList class.

Note: do not use the reverse_iterator class with a StringList_Range object.

Note: if possible, use a simple string as an argument to SL to reduce the number of strings that have to be searched with a GString.

The member functions and friends

friend ostream& operator<<(ostream& os, TaggedStringList& tsl); Output the entire TaggedStringList to a file.
iterator begin(); Return an iterator pointing to the beginning of the StringList.
iterator end(); Return an iterator pointing to one past the end of the StringList.
reverse_iterator rbegin(); Return a reverse iterator pointing to the end of the StringList.
reverse_iterator rend(); Return a reverse iterator pointing one before the beginning of the StringList.
void erase(iterator i); Erase the string corresponding to i. See note about the corresponding entry in the StringList class.
void erase(reverse_iterator i); Erase the string corresponding to i. See note about the corresponding entry in the StringList class.
void insert_before(const String& s, iterator i); Insert s before the string corresponding to i.
void insert_after(const String& s, iterator i); Insert s after the string corresponding to i.
int sf(const String& s1, const String& s2, iterator i); Substitute the first occurrence of s1 by s2 in the string corresponding to i. Return number of changes (0 or 1).
int sl(const String& s1, const String& s2, iterator i); Substitute the last occurrence of s1 by s2 in the string corresponding to i. Return number of changes (0 or 1).
int sa(const String& s1, const String& s2, iterator i); Substitute all occurrences of s1 by s2 in the string corresponding to i. Return number of changes.
int s(GString& g1, GString& g2, iterator i); Substitute the pattern of g1 by g2 in the string corresponding to i. Return number of changes (0 or 1).
void UpperCase(iterator i); Convert the string corresponding to i to upper case.
void LowerCase(iterator i); Convert the string corresponding to i to lower case.
int erase(); Erase all strings selected by the TaggedStringList class. Return number of erases.
int insert_before(const String& s); Insert s before each of the selected strings. Return number of inserts.
int insert_after(const String& s); Insert s after each of the selected strings. Return number of inserts.
int sf(const String& s1, const String& s2); Substitute the first occurrence of s1 by s2 in each of the selected strings. Return number of changes.
int sl(const String& s1, const String& s2); Substitute the last occurrence of s1 by s2 in each of the selected strings. Return number of changes.
int sa(const String& s1, const String& s2); Substitute all occurrences of s1 by s2 in each of the selected strings. Return number of changes.
int s(GString& g1, GString& g2); Substitute the pattern of g1 by g2 in each of the selected strings. Return number of changes.
void UpperCase(); Convert the selected strings to upper case.
void LowerCase(); Convert the selected strings to lower case.

The iterator versions of the editing functions are intended to be used within loops involving an iterator. Use a (forward) iterator rather than a reverse iterator. The editing functions without iterators refer to the whole TaggedStringList family object.
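
The behaviour of sf, sl and sa on a single string can be sketched with std::string. These are stand-alone illustrations with hypothetical names (subst_first, subst_last, subst_all), not the library's code; in particular, treating an empty search string as "no change" is an assumption.

```cpp
#include <string>

// Like sf: replace the first occurrence of s1 by s2. Returns 0 or 1.
int subst_first(std::string& line, const std::string& s1, const std::string& s2)
{
    if (s1.empty()) return 0;
    std::string::size_type pos = line.find(s1);
    if (pos == std::string::npos) return 0;
    line.replace(pos, s1.size(), s2);
    return 1;
}

// Like sl: replace the last occurrence of s1 by s2. Returns 0 or 1.
int subst_last(std::string& line, const std::string& s1, const std::string& s2)
{
    if (s1.empty()) return 0;
    std::string::size_type pos = line.rfind(s1);
    if (pos == std::string::npos) return 0;
    line.replace(pos, s1.size(), s2);
    return 1;
}

// Like sa: replace every occurrence of s1 by s2. Returns the number of changes.
int subst_all(std::string& line, const std::string& s1, const std::string& s2)
{
    if (s1.empty()) return 0;
    int changes = 0;
    std::string::size_type pos = 0;
    while ((pos = line.find(s1, pos)) != std::string::npos)
    {
        line.replace(pos, s1.size(), s2);
        pos += s2.size();   // skip the inserted text so it is not rescanned
        ++changes;
    }
    return changes;
}
```

Skipping past the inserted text in subst_all keeps the loop from running forever when s2 contains s1.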

Functions

int s(String& S, const GString& g1, const GString& g2); Substitute the pattern of g1 by g2 in string S. Return number of changes (0 or 1).

 

Examples

I have a file, fox.txt, containing the single line:

   The quick brown fox jumps over the lazy dog.

I demonstrate a variety of statements for editing this line.

Change quick to fast and print out the resulting line.

   ifstream is("fox.txt");
   StringList Fox; is >> Fox;
   Fox.All().sa("quick", "fast");
   cout << Fox;

or using GStrings

   ifstream is("fox.txt");
   StringList Fox; is >> Fox;
   Fox.All().s(DOTS + "quick" + DOTS,
               DOTS + "fast"  + DOTS);
   cout << Fox;

Suppose the fox line is one line in a long file. I want the editing to apply only to lines that include the word fox (where we are ignoring the possibility that fox is included in some other word).

   ifstream is("fox.txt");
   StringList Fox; is >> Fox;
   Fox("fox").s(DOTS + "quick" + DOTS,
                DOTS + "fast"  + DOTS);
   cout << Fox;

I don't know what the adjectives between The and fox are but I want to change them to slow grey. I show just the editing line.

   Fox.All().s(GS + "The" + DOTS          + "fox" + DOTS,
                    DOTS  + " slow grey " + DOTS  + DOTS);

I am not sure whether the verb is jump or jumps but I want to change it to leap or leaps respectively.

   Fox.All().s(DOTS + (GS | "jump" | "jumps") + DOTS,
               DOTS + (GS | "leap" | "leaps") + DOTS);

I don't know what the animals involved are but I want to swap them.

   ShortestString SS1, SS2;
   Fox.All().s(GS + "The quick brown " + SS1 + " jumps over the lazy " + SS2 + ".",
                    DOTS               + SS2 + DOTS                    + SS1 + DOTS);
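
For readers more familiar with regular expressions, the same swap can be written with the standard <regex> library. This is an analogy only, not how GString works internally: the two lazy capture groups play the role of SS1 and SS2, and swap_animals is a hypothetical function.

```cpp
#include <regex>
#include <string>

// Swap the two animals in the fox sentence. The whole line must match,
// as with a GString expression; otherwise the line is returned unchanged.
std::string swap_animals(const std::string& line)
{
    static const std::regex pattern(
        "The quick brown (.*?) jumps over the lazy (.*?)\\.");
    std::smatch m;
    if (!std::regex_match(line, m, pattern)) return line;
    // m[1] and m[2] hold the remembered text, like SS1.Value() and SS2.Value().
    return "The quick brown " + m[2].str() +
           " jumps over the lazy " + m[1].str() + ".";
}
```

Applied to the fox line, this returns The quick brown dog jumps over the lazy fox.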

 

History

To do

More testing.

Operator * to allow multiple copies of a target. I will need three types of counters: N0 (may be 0 or 1 repetitions), NN (may be any number of repetitions including 0), N1 (may be any number of repetitions > 0). Can also have a positive integer. Assume > combination rather than +. Assume longest when saying type of * expression. This is going to be a bit of a problem as Collect won't work correctly; may need an IntList class to hold switching data.

Review rules regarding shortest and longest.

Letter by letter translations (SED y operator) - need function R('a', 'z');

Fix problem with i.erase(); ++i;. Decide how to handle i.insert_next(S); ++i;.

In TaggedStringList make iterators forward only; make reverse iterator fully functional (?)

Introduce new editing phase, Load, between Matches and Collect to load values into ShortestString etc. This will ensure the value is not set if not involved in a match. Introduce boolean indicator into ShortestString etc to show if value has been set; reset when assigned a value with =. Run Load at search stage as well as edit phase.(??)

More internal checking - iterators correctly matched with object they are accessing, comparisons valid.

Check that operators in collect and translate phases agree.

Notes on implementing new facilities. What is needed for a new TaggedStringList class and what is needed for a new operator (operator - need to define action in both GString and InitialGstring; list functions required) or a new target object. Requirements on Match Load Collect and Translate. Interchangeable target object -> must push one string to list at collect time and pop one at translate time - note defaults.

Check all target objects are interchangeable between Collect and Translate phases. Collect must push a value; Translate must Pop a value.

Directory read into StringList - how to handle attributes?

Column oriented input and output

Regular expression input

Number input (eg recognise E format)

Extend and document substitution list facilities