Saturday, 5 July 2008

 

Extract Columns From Tabular Text - Cut

A quick way to extract one or more columns from tabular or character delimited data, such as Web pages or log files, is to use the GnuWin cut command.

Some examples:

The -f switch specifies the column to extract. By default, the delimiter is TAB and the -d switch specifies an alternative delimiter.

One limitation of cut is that you can't specify the columns relative to last column, unless you know the index of the last column. If your data has a varying number of columns, such as the path strings printed by the find . -type f command, such as the example below …

./Profiles/9ls0tqn1.default/blocklist.xml
./Profiles/9ls0tqn1.default/bookmarkbackups/bookmarks-2008-06-14.html

… you can't easily extract just the file name (the last column) in every line.

A related command is colrm, which removes character columns from the input. It's quite limited and does the opposite of what I expect, so I haven't used it.

Cut is a simple utility to extract columns of data, and it can't process the column data like a scripting language. I'll write a bit more about processing tabular data in future.

See Also

Labels:


Tuesday, 24 June 2008

 

Gawk Print Last Field

gawk script to print the last field of each line:

gawk -F <delimiter> "{ print $NF }".

-F defines the separator in the command line, otherwise you would prepend "BEGIN { FS=<delimiter> }".

NF is the number of fields in a line and $n is the value of the n'th field, so $NF outputs the last field.

See Also

Labels:


Wednesday, 4 June 2008

 

Match Multiple String Patterns

To find multiple string patterns in an input file or stream, these commands are equivalent:

Labels: ,


Saturday, 24 May 2008

 

Fix Incorrectly Encoded Unicode Files with Python

The Problem

We had a lot of text files committed into our CVS repository as Unicode format. When these files were checked out later, we found that they weren't really text files nor Unicode files because CVS had only prepended two bytes to the start of these files, FF FE, but left only one byte for encoding each character. Some text editors such as Vim could open these files but other applications such as Notepad and Excel showed only gibberish.

Unicode Encoded Text in Files

Unicode is an encoding standard … for processing, storage and interchange of text data in any language. For the purpose of fixing this problem, we just have to know how to identify and write valid Unicode files.

We use two tools to experiment and visualize the effect of different encoding methods:

  1. Microsoft Notepad editor, because it can save text files using different encoding methods.
  2. GnuWin32 od utility to output the data in a file as byte values.

Open Notepad and enter this text: Hello World. Select the File / Save As menu item. In the Save As dialog, there are four encoding methods in the Encoding drop down list: ANSI, Unicode, Unicode big endian and UTF-8. Save the same text using each of the encoding methods into four files, say TestANSI.txt, TestUnicode.txt, TestUnicodeBigEndian.txt and TestUTF8.txt, respectively.

Examine the contents of each file using od:

>od -A x -t x1 HelloANSI.txt
000000 48 65 6c 6c 6f 20 57 6f 72 6c 64
00000b

>od -A x -t x1 HelloUnicode.txt
000000 ff fe 48 00 65 00 6c 00 6c 00 6f 00 20 00 57 00
000010 6f 00 72 00 6c 00 64 00
000018

>od -A x -t x1 HelloUnicodeBigEndian.txt
000000 fe ff 00 48 00 65 00 6c 00 6c 00 6f 00 20 00 57
000010 00 6f 00 72 00 6c 00 64
000018

>od -A x -t x1 HelloUTF8.txt
000000 ef bb bf 48 65 6c 6c 6f 20 57 6f 72 6c 64
00000e

The ANSI encoded file contains 11 bytes representing the characters you typed. The Unicode encoded files contain 24 bytes, starting with a two-byte BOM and using two bytes to represent each character. If the first two bytes are FF FE, then the two bytes are stored in low-byte, high-byte order. Conversely, if the first two bytes are FE FF, then the two bytes are stored in high-byte, low-byte order. Finally, when a file starts with byte EF BB BF, only one byte is used to encode each ANSI character and two or more bytes are used to encode non-ANSI characters (not demonstrated).

Fixing Incorrectly Encoded Files in Python

Now we know the format of a Unicode encoded file: it starts with FF FE and stores each character in low-byte, high-byte order. Our text files in CVS just have ANSI characters, so we just have to insert a 0 byte between each character, starting from the third byte. Julian W. wrote a short Python script that to do this. I don't have his code right now, so here's my version for correcting the Unicode encoding for a file:

import codecs
raw = map(ord, file(r'HelloBadUnicode.txt').read())
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
  output = codecs.open(r'HelloFixedUnicode.txt', 'w', 'UTF-16')
  for i in raw[2:]:
    output.write(chr(i))
  output.close()

References

Postscript

I started with a more complicated piece of Python code using lists and generators:

from itertools import repeat
from operator import concat

raw = map(ord, file(r'HelloBadUnicode.txt').read())
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
  output = file(r'HelloFixedUnicode.txt','w')
  output.write(chr(255))
  output.write(chr(254))
  for i in reduce(concat, zip(raw[2:], repeat(0, len(raw)-2))):
    output.write(chr(i))
  output.close()

But then I realised I just had to write a 0 byte after each ANSI character, so here's a simpler version:

raw = map(ord, file(r'HelloBadUnicode.txt').read())
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
  output = file(r'HelloFixedUnicode.txt','w')
  output.write(chr(255))
  output.write(chr(254))
  for i in raw[2:]:
    output.write(chr(i))
    output.write(chr(0))
  output.close()

2008-05-25. I remembered that Python had no problems with writing Unicode files, resulting in the even simpler code in the body of this article.

Labels: , , ,


Tuesday, 20 May 2008

 

GnuWin32 find and missing argument for exec

Reminder on how to use -exec action in GnuWin32 find command in Windows cmd.exe. For example, if you want to find a string, the format is:

find . -type f -exec grep <pattern> {} ;

If you do any of the following, you can get this cryptic error message: find: missing argument to `-exec'

Finally, if all else fails and you lack time to investigate, use xargs:

find . -type f | xargs grep <pattern>

Labels: ,


Wednesday, 7 May 2008

 

Sed Translate / Transform / Transliterate Command

Note to self: sed's (Stream EDitor) command y/list1/list2/ to transform / transliterate each character is based on its position in list1 to a character in the same position in list2. list1 and list2 must be an explicit character list, not a regular expression (and hence, not a character class). In other words, if you enter y/[a-z]/[A-Z]/, sed will look for these characters in the input, '[', 'a', '-', 'z' and ']', to replace with '[', 'A', '-', 'Z' and ']' respectively; sed does not expand a character class [a-z] to replace with [A-Z]. Same with Posix character class names such as [:lower:] and [:upper:].

I incorrectly mixed up the idea that sed's transform command with the tr (translate) command, which supports interpreted sequences, e.g. tr [:lower:] [:upper:] will transform all lower case characters to upper case.

Labels:


Thursday, 1 May 2008

 

More Uses of Getclip-Putclip

More uses of GnuWin32 / Cygutils tools getclip and putclip using this recipe: getclip | <command chain> | putclip.

A second recipe is (for /f %i in ('getclip') do @command %i) | putclip if command cannot be used in a pipeline. Two examples are basename (return name of file in a path string) and dirname (return path string without file name).

2005-05-01: Don't simply list transformations and filters that can be done with GnuWin32 tools, but ones where existing applications (e.g. Excel, Firefox, Outlook or Word) don't have an easy way to achieve a particular action.

Labels: , ,


Wednesday, 30 April 2008

 

Using Clipboard in the Command Line

GnuWin32 / Cygutils package has two tools for interacting with the Windows clipboard: getclip and putclip. The first copies text from the clipboard to standard output and the second copies text from standard input to the clipboard. These tools are useful when you want to process text from one Windows application before pasting the text into another application, in the following recipe: getclip | <filters> | putclip.

For example, I want to paste all DLL file names in a folder into a document:

  1. Navigate to the required folder using 2xExplorer browser.
  2. Type Alt+a to select all files.
  3. Type Alt+c to copy all file names. 2xExplorer copies the absolute path for each file.
  4. Start cmd.exe console.
  5. In cmd.exe console, enter: getclip | cut -d\ -fn | grep dll$ | putclip. cut is GnuWin32 tool which selects a column of data given a column delimiter (-d\ defines backslash) and field number (-fn defines column n). grep filters the output to only list files with "dll" in their name.
  6. Start editor.
  7. Paste the text in the clipboard in destination document.

Of course, you can do the same using Excel:

  1. Navigate to the required folder using 2xExplorer browser.
  2. Type Alt+a to select all files.
  3. Type Alt+c to copy all file names. 2xExplorer copies the absolute path for each file.
  4. Start Excel.
  5. Paste data in a worksheet column.
  6. Select all cells by typing Shift+Space.
  7. Open Convert Text to Columns Wizard by typing Alt+d+e.
  8. Select Delimited data type by typing Alt+d.
  9. Type Alt+n to go to page 2.
  10. Select Other delimiter by typing Alt+o, then enter "\" for paths.
  11. Type Alt+f run the wizard.
  12. Start Auto Filter by typing Alt+d+f+f.
  13. Move to filter column using the mouse (no keyboard shortcuts?) then select from the drop down list (Custom …).
  14. Select ends width criteria, enter .dll, then press Enter.
  15. Move cursor to required column and select it using Control+Space.
  16. Copy column by typing Control+C.
  17. Start editor.
  18. Paste the text in the clipboard in destination document.

The Excel solution has many more steps than the getclip-putclip solution but Excel leads you through to a solution step-by-step. If you're familiar with GNU tools, then getclip-putclip recipe is faster to use and much more extensible.

2008-05-07. I should have remembered that the basename command would output the name of the file without the leading path string. See later article More Uses of Getclip-PutClip about how to use basename in a pipeline.

Labels: , ,


Friday, 25 April 2008

 

Strange GnuWin32 Invalid Argument Error Messages

When chaining GnuWin32 commands in Windows cmd.exe, you may encounter strange error messages like this:

> ls | grep
…
ls: write error: Invalid argument

The first command reports a write error but the error is really in the second command after the pipe symbol.

You may also encounter a similar write error if the wrong command is found in your PATH variable. For instance, Windows and GnuWin32 both have a find and sort command which support different command-line options, so depending on the order of directories listed in your PATH variable, one version or the other is used. If you enter the wrong command-line options for these commands, they won't start and cause the command earlier in the chain to report some sort of I/O error.

Labels: ,


This page is powered by Blogger. Isn't yours?