Saturday, 5 July 2008
Extract Columns From Tabular Text - Cut
A quick way to extract one or more columns from tabular or character delimited data, such as Web pages or log files, is to use the GnuWin cut command.
Some examples:
- Print just the bug number and title from a list of bugs in a Web page (e.g. from Bugzilla):
cut -f1,8
- Print the URLs requested from an Apache log (in common format):
cut -d" " -f7
The -f switch specifies the column to extract. By default, the delimiter is TAB and the -d switch specifies an alternative delimiter.
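To make the field-selection rule concrete, here is a minimal Python sketch of what cut does with -d and -f in simple cases. The function name cut_fields and the sample log line are made up for illustration, and this is not how cut itself is implemented.

```python
def cut_fields(line, fields, delimiter="\t"):
    """Return the 1-based fields of line, joined by the delimiter, roughly like cut -f."""
    if delimiter not in line:
        # cut prints lines that contain no delimiter unchanged (unless -s is given)
        return line
    columns = line.split(delimiter)
    # cut emits fields in input order and skips numbers beyond the last column
    picked = [columns[n - 1] for n in sorted(set(fields)) if 1 <= n <= len(columns)]
    return delimiter.join(picked)


# Hypothetical line in Apache common log format
log_line = '127.0.0.1 - - [05/Jul/2008:10:00:00 +1000] "GET /index.html HTTP/1.0" 200 2326'
print(cut_fields(log_line, [7], delimiter=" "))
```

Splitting on a single space puts the requested URL in field 7, which is why the Apache example uses -f7.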
One limitation of cut is that you can't specify columns relative to the last column unless you know the index of the last column. If your data has a varying number of columns, like the path strings printed by the find . -type f command in the example below …
./Profiles/9ls0tqn1.default/blocklist.xml
./Profiles/9ls0tqn1.default/bookmarkbackups/bookmarks-2008-06-14.html
… you can't easily extract just the file name (the last column) in every line.
A related command is colrm, which removes character columns from the input. It's quite limited and does the opposite of what I expect, so I haven't used it.
Cut is a simple utility to extract columns of data, and it can't process the column data like a scripting language. I'll write a bit more about processing tabular data in future.
Labels: GnuWin
Tuesday, 24 June 2008
Gawk Print Last Field
gawk script to print the last field of each line:
gawk -F <delimiter> "{ print $NF }"
-F defines the field separator on the command line; otherwise you would prepend "BEGIN { FS=<delimiter> }" to the script.
NF is the number of fields in a line and $n is the value of the n'th field, so $NF outputs the last field.
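For comparison, the same idea can be sketched in Python, where a negative index plays the role of $NF. The helper name last_field is hypothetical; passing no delimiter splits on runs of whitespace, which is close to awk's default field splitting.

```python
def last_field(line, delimiter=None):
    """Return the last field of line, or an empty string if there are no fields."""
    fields = line.split(delimiter)
    return fields[-1] if fields else ""


print(last_field("./Profiles/9ls0tqn1.default/blocklist.xml", "/"))
```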
Labels: GnuWin
Wednesday, 4 June 2008
Match Multiple String Patterns
To find multiple string patterns in an input file or stream, these commands are roughly equivalent:
- sed -n -e "/pattern1/p" -e "/pattern2/p". -n suppresses printing all input lines. Note that a line matching both patterns is printed twice.
- sed -n -r -e "/pattern1|pattern2/p". -r enables extended regular expressions.
- grep -e "pattern1" -e "pattern2".
- grep -E "pattern1|pattern2". -E enables extended regular expressions.
- findstr "pattern1 pattern2". You have to delimit the patterns with spaces in a single string argument. To find a string that itself contains spaces, pass it with the /C: switch instead, e.g. findstr /C:"pattern one"; findstr's limited regular expressions have no \s (whitespace) character class.
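The alternation approach can also be sketched in Python. The function name matching_lines and the sample lines are invented for this example, and Python's re syntax differs in places from the sed, grep and findstr dialects.

```python
import re


def matching_lines(lines, patterns):
    """Yield each line that matches at least one of the regular expressions."""
    # Wrap each pattern in a non-capturing group so "|" cannot split a pattern
    combined = re.compile("|".join("(?:%s)" % p for p in patterns))
    for line in lines:
        if combined.search(line):
            yield line


sample = ["error: disk full", "all good", "warning: low memory"]
print(list(matching_lines(sample, ["error", "warning"])))
```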
Labels: GnuWin, Windows Cmd
Saturday, 24 May 2008
Fix Incorrectly Encoded Unicode Files with Python
The Problem
We had a lot of text files committed into our CVS repository as Unicode format. When these files were checked out later, we found that they weren't really text files nor Unicode files because CVS had only prepended two bytes to the start of these files, FF FE, but left only one byte for encoding each character. Some text editors such as Vim could open these files but other applications such as Notepad and Excel showed only gibberish.
Unicode Encoded Text in Files
Unicode is an encoding standard … for processing, storage and interchange of text data in any language. For the purpose of fixing this problem, we just have to know how to identify and write valid Unicode files.
We use two tools to experiment and visualize the effect of different encoding methods:
- Microsoft Notepad editor, because it can save text files using different encoding methods.
- GnuWin32 od utility to output the data in a file as byte values.
Open Notepad and enter this text: Hello World. Select the File / Save As menu item. In the Save As dialog, there are four encoding methods in the Encoding drop-down list: ANSI, Unicode, Unicode big endian and UTF-8. Save the same text using each of the encoding methods into four files, say HelloANSI.txt, HelloUnicode.txt, HelloUnicodeBigEndian.txt and HelloUTF8.txt, respectively.
Examine the contents of each file using od:
>od -A x -t x1 HelloANSI.txt
000000 48 65 6c 6c 6f 20 57 6f 72 6c 64
00000b
>od -A x -t x1 HelloUnicode.txt
000000 ff fe 48 00 65 00 6c 00 6c 00 6f 00 20 00 57 00
000010 6f 00 72 00 6c 00 64 00
000018
>od -A x -t x1 HelloUnicodeBigEndian.txt
000000 fe ff 00 48 00 65 00 6c 00 6c 00 6f 00 20 00 57
000010 00 6f 00 72 00 6c 00 64
000018
>od -A x -t x1 HelloUTF8.txt
000000 ef bb bf 48 65 6c 6c 6f 20 57 6f 72 6c 64
00000e
The ANSI encoded file contains 11 bytes representing the characters you typed. The Unicode encoded files contain 24 bytes, starting with a two-byte byte-order mark (BOM) and using two bytes to represent each character. If the first two bytes are FF FE, then the two bytes of each character are stored in low-byte, high-byte order. Conversely, if the first two bytes are FE FF, then the two bytes are stored in high-byte, low-byte order. Finally, when a file starts with the bytes EF BB BF, one byte is used to encode each ASCII character and two or more bytes are used to encode every other character (not demonstrated).
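The same byte layouts can be reproduced without Notepad. This is a small Python 3 sketch; it assumes that Notepad's "ANSI" means code page 1252, which holds on Western European Windows installations but not everywhere.

```python
import codecs

text = "Hello World"

encoded = {
    "ANSI (assumed cp1252)": text.encode("cp1252"),
    "Unicode": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "UTF-8": text.encode("utf-8-sig"),  # utf-8-sig prepends EF BB BF
}

for name, data in encoded.items():
    # Print the byte count and a hex dump comparable to the od output
    print(name, len(data), " ".join("%02x" % byte for byte in data))
```

The byte counts (11, 24, 24 and 14) match the final offsets od reports: 0b, 18, 18 and 0e in hexadecimal.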
Fixing Incorrectly Encoded Files in Python
Now we know the format of a Unicode encoded file: it starts with FF FE and stores each character in low-byte, high-byte order. Our text files in CVS just have ANSI characters, so we just have to insert a 0 byte after each character, starting from the third byte. Julian W. wrote a short Python script to do this. I don't have his code right now, so here's my version for correcting the Unicode encoding for a file:
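import codecs

# Python 2: read the file as a list of byte values ('rb' avoids newline translation on Windows)
raw = map(ord, file(r'HelloBadUnicode.txt', 'rb').read())
# FF FE is present, but the 4th byte is not the 0 that valid UTF-16 would have for ANSI text
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
    # The UTF-16 codec writes the BOM and two bytes per character
    output = codecs.open(r'HelloFixedUnicode.txt', 'w', 'UTF-16')
    for i in raw[2:]:
        # unichr (rather than chr) also copes with byte values above 127
        output.write(unichr(i))
    output.close()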
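For readers on Python 3, where file() and unichr() no longer exist, the same repair can be sketched on bytes objects. The function name fix_bad_unicode is made up for this illustration, and the sketch treats every byte after the BOM as one character, just as the script above does.

```python
import codecs


def fix_bad_unicode(data: bytes) -> bytes:
    """Turn 'FF FE + one byte per character' into valid little-endian UTF-16."""
    has_bom = data[:2] == codecs.BOM_UTF16_LE
    # In valid UTF-16 LE text made of ANSI characters, the 4th byte is 0
    already_valid = len(data) >= 4 and data[3] == 0
    if not has_bom or already_valid:
        return data
    # latin-1 maps each byte to the code point of the same value,
    # so re-encoding writes that byte followed by a 0 byte
    body = data[2:].decode("latin-1")
    return codecs.BOM_UTF16_LE + body.encode("utf-16-le")


broken = b"\xff\xfeHello World"
print(fix_bad_unicode(broken))
```

Working on bytes in memory keeps the check and the repair separate from file handling; reading and writing the files in binary mode ('rb' and 'wb') is then enough.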
References
- Unicode Consortium's FAQ on UTF-8, UTF-16, UTF-32 & BOM.
- Wikipedia's Byte-order mark.
Postscript
I started with a more complicated piece of Python code using lists and generators:
from itertools import repeat
from operator import concat

raw = map(ord, file(r'HelloBadUnicode.txt', 'rb').read())
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
    # Binary mode stops Windows from expanding LF to CR LF in the output
    output = file(r'HelloFixedUnicode.txt', 'wb')
    output.write(chr(255))
    output.write(chr(254))
    # Pair every byte with a 0, then flatten the pairs into one sequence
    for i in reduce(concat, zip(raw[2:], repeat(0, len(raw)-2))):
        output.write(chr(i))
    output.close()
But then I realised I just had to write a 0 byte after each ANSI character, so here's a simpler version:
raw = map(ord, file(r'HelloBadUnicode.txt', 'rb').read())
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
    # Binary mode stops Windows from expanding LF to CR LF in the output
    output = file(r'HelloFixedUnicode.txt', 'wb')
    output.write(chr(255))
    output.write(chr(254))
    for i in raw[2:]:
        # Low byte first, then a 0 high byte
        output.write(chr(i))
        output.write(chr(0))
    output.close()
2008-05-25. I remembered that Python had no problems with writing Unicode files, resulting in the even simpler code in the body of this article.
Labels: GnuWin, Python, Software, Windows
Tuesday, 20 May 2008
GnuWin32 find and missing argument for exec
A reminder on how to use the -exec action of the GnuWin32 find command in Windows cmd.exe. For example, if you want to find a string, the format is:
find . -type f -exec grep <pattern> {} ;
If you do any of the following, you can get this cryptic error message: find: missing argument to `-exec'
- Put double-quote marks around the command:
find . -type f -exec "grep <pattern> {} ;"
- Don't leave a space between braces and semi-colon:
find . -type f -exec "grep <pattern> {};"
- Use Unix shell escape character:
find . -type f -exec grep <pattern> {} \;
Finally, if all else fails and you lack time to investigate, use xargs:
find . -type f | xargs grep <pattern>
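If quoting in cmd.exe keeps getting in the way, the find-then-grep combination can also be sketched in a few lines of Python. The function name grep_tree is hypothetical, and decoding every file as UTF-8 with replacement characters is a simplifying assumption of this sketch.

```python
import os
import re


def grep_tree(root, pattern):
    """Yield (path, line_number, line) for every matching line in files under root."""
    regex = re.compile(pattern)
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(directory, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if regex.search(line):
                            yield path, number, line.rstrip("\n")
            except OSError:
                # Unreadable file: skip it and carry on with the next one
                continue
```

Calling list(grep_tree(".", "<pattern>")) collects the matches for the current directory tree.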
Labels: GnuWin, Windows Cmd
Wednesday, 7 May 2008
Sed Translate / Transform / Transliterate Command
Note to self: sed's (Stream EDitor) command y/list1/list2/ transforms / transliterates each character found in list1 to the character in the same position in list2. list1 and list2 must be explicit character lists, not regular expressions (and hence, not character classes). In other words, if you enter y/[a-z]/[A-Z]/, sed will look for these characters in the input, '[', 'a', '-', 'z' and ']', to replace with '[', 'A', '-', 'Z' and ']' respectively; sed does not expand a character class [a-z] to replace with [A-Z]. The same applies to POSIX character class names such as [:lower:] and [:upper:].
I had confused sed's transform command with the tr (translate) command, which supports interpreted sequences, e.g. tr [:lower:] [:upper:] will transform all lower case characters to upper case.
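Python's str.translate works position by position in the same way as sed's y command, so it makes a handy illustration of the pitfall. This is only an analogy, not sed's implementation, and the sample string is invented.

```python
import string

# Like y/[a-z]/[A-Z]/: the brackets and hyphen are literal characters,
# so only 'a' and 'z' are actually changed
literal_table = str.maketrans("[a-z]", "[A-Z]")
literal_result = "lazy cat".translate(literal_table)
print(literal_result)

# Spelling out every character gives the intended upper-casing
full_table = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)
upper_result = "lazy cat".translate(full_table)
print(upper_result)
```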
Labels: GnuWin
Thursday, 1 May 2008
More Uses of Getclip-Putclip
More uses of GnuWin32 / Cygutils tools getclip and putclip using this recipe: getclip | <command chain> | putclip.
- Copy m'th and n'th column of a table from a browser:
cut -fm,n
- Copy columns from Excel and replace tab character with space:
tr \t " "
- Capitalize letters:
tr [:lower:] [:upper:]. (Duh! Enter Shift-F3 in Microsoft Word, thanks to Maria H.)
- Remove indentation from e-mail messages:
sed "s/> //"
- Remove indentation from source code in Word document:
sed -e "s/^ //" (5-May-2008)
- Join lines broken into multiple lines by e-mail clients:
dos2unix | tr -d \n. On a Windows system, tr doesn't recognise CR-LF pairs for terminating a line, so you have to convert them to a Unix-style LF using dos2unix first (6-May-2008).
- Another way to join broken lines:
tr -d \r\n using escape codes for carriage return and line feed, respectively (11-May-2008).
- Remove formatting from string:
getclip | putclip. This is equivalent to Microsoft Word's Paste Special / Unformatted Text. It also works around an annoyance in Outlook 2003, where Edit / Paste Special is disabled when you are responding to an HTML-formatted document (7-May-2008).
- Remove HTML / XML formatting from input:
sed -e "s/<[^>]*>//g" (12-Jun-2008)
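When a filter grows beyond a one-liner, a small Python script can take the place of the command chain. The sketch below mirrors three of the recipes above; the function names are hypothetical.

```python
import re


def strip_tags(text):
    # Same idea as the sed expression s/<[^>]*>//g
    return re.sub(r"<[^>]*>", "", text)


def join_lines(text):
    # Same idea as tr -d \r\n: delete carriage returns and line feeds
    return text.replace("\r", "").replace("\n", "")


def unquote(text):
    # Same idea as sed "s/> //": drop the first "> " on each line
    return "\n".join(line.replace("> ", "", 1) for line in text.split("\n"))


print(strip_tags("<b>Hello</b> <i>World</i>"))
```

Saved in a script that reads sys.stdin and writes the result to sys.stdout, any of these functions could sit between getclip and putclip.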
A second recipe is (for /f %i in ('getclip') do @command %i) | putclip if command cannot be used in a pipeline. Two examples are basename (return name of file in a path string) and dirname (return path string without file name).
2005-05-01: Don't simply list transformations and filters that can be done with GnuWin32 tools, but ones where existing applications (e.g. Excel, Firefox, Outlook or Word) don't have an easy way to achieve a particular action.
Labels: GnuWin, Windows, Windows Cmd
Wednesday, 30 April 2008
Using Clipboard in the Command Line
GnuWin32 / Cygutils package has two tools for interacting with the Windows clipboard: getclip and putclip. The first copies text from the clipboard to standard output and the second copies text from standard input to the clipboard. These tools are useful when you want to process text from one Windows application before pasting the text into another application, in the following recipe: getclip | <filters> | putclip.
For example, I want to paste all DLL file names in a folder into a document:
- Navigate to the required folder using 2xExplorer browser.
- Type Alt+a to select all files.
- Type Alt+c to copy all file names. 2xExplorer copies the absolute path for each file.
- Start cmd.exe console.
- In cmd.exe console, enter:
getclip | cut -d\ -fn | grep dll$ | putclip. cut is a GnuWin32 tool which selects a column of data given a column delimiter (-d\ defines backslash) and field number (-fn defines column n). grep filters the output to list only files whose names end with "dll".
- Start editor.
- Paste the text in the clipboard in destination document.
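The filtering step in the middle of that pipeline can be sketched in Python, which sidesteps the need to know the column number n. The function name dll_names and the sample paths are invented for this example; ntpath is the standard library module for Windows-style paths.

```python
import ntpath


def dll_names(clipboard_text):
    """Return the file names ending in "dll" from Windows paths, one path per line."""
    names = [ntpath.basename(line.strip()) for line in clipboard_text.splitlines()]
    # Same case-sensitive test as grep dll$
    return [name for name in names if name.endswith("dll")]


# Hypothetical absolute paths, as a file manager might copy them
sample = "C:\\Tools\\bin\\libintl3.dll\nC:\\Tools\\bin\\cut.exe\nC:\\Tools\\bin\\libiconv2.dll"
print(dll_names(sample))
```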
Of course, you can do the same using Excel:
- Navigate to the required folder using 2xExplorer browser.
- Type Alt+a to select all files.
- Type Alt+c to copy all file names. 2xExplorer copies the absolute path for each file.
- Start Excel.
- Paste data in a worksheet column.
- Select all cells by typing Shift+Space.
- Open Convert Text to Columns Wizard by typing Alt+d+e.
- Select Delimited data type by typing Alt+d.
- Type Alt+n to go to page 2.
- Select Other delimiter by typing Alt+o, then enter "\" for paths.
- Type Alt+f to run the wizard.
- Start Auto Filter by typing Alt+d+f+f.
- Move to filter column using the mouse (no keyboard shortcuts?) then select from the drop down list (Custom …).
- Select ends with criteria, enter .dll, then press Enter.
- Move cursor to required column and select it using Control+Space.
- Copy column by typing Control+C.
- Start editor.
- Paste the text in the clipboard in destination document.
The Excel solution has many more steps than the getclip-putclip solution, but Excel leads you through to a solution step by step. If you're familiar with GNU tools, then the getclip-putclip recipe is faster to use and much more extensible.
2008-05-07. I should have remembered that the basename command would output the name of the file without the leading path string. See later article More Uses of Getclip-PutClip about how to use basename in a pipeline.
Labels: GnuWin, Windows, Windows Cmd
Friday, 25 April 2008
Strange GnuWin32 Invalid Argument Error Messages
When chaining GnuWin32 commands in Windows cmd.exe, you may encounter strange error messages like this:
> ls | grep …
ls: write error: Invalid argument
The first command reports a write error but the error is really in the second command after the pipe symbol.
You may also encounter a similar write error if the wrong command is found in your PATH variable. For instance, Windows and GnuWin32 both have find and sort commands which support different command-line options, so depending on the order of directories listed in your PATH variable, one version or the other is used. If you enter the wrong command-line options for these commands, they won't start, which causes the command earlier in the chain to report some sort of I/O error.
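One way to see which version would run is to list every match in PATH order. The sketch below is a simplified stand-in for the shell's lookup: the function name find_on_path and the fixed extension list are assumptions, and cmd.exe's real rules (PATHEXT, the current directory) are not modelled.

```python
import os


def find_on_path(name, path=None, extensions=(".exe", ".com", ".bat", ".cmd")):
    """Return every file called name (with or without an extension) in PATH order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        for ext in ("",) + tuple(extensions):
            candidate = os.path.join(directory, name + ext)
            if os.path.isfile(candidate):
                hits.append(candidate)
    return hits


# The first entry printed is the one an unqualified "find" would start
for hit in find_on_path("find"):
    print(hit)
```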
Labels: GnuWin, Windows Cmd


