Tuesday, 20 May 2008
GnuWin32 find and missing argument for exec
A reminder on how to use the -exec action of the GnuWin32 find command in Windows cmd.exe. For example, to find a string in files, the format is:
find . -type f -exec grep <pattern> {} ;
Any of the following mistakes produces this cryptic error message: find: missing argument to `-exec'
- Putting double-quote marks around the command:
find . -type f -exec "grep <pattern> {} ;"
- Leaving no space between the braces and the semi-colon:
find . -type f -exec grep <pattern> {};
- Using the Unix shell escape character:
find . -type f -exec grep <pattern> {} \;
Finally, if all else fails and you lack time to investigate, use xargs:
find . -type f | xargs grep <pattern>
Labels: GnuWin, Scripting, Windows Cmd
Python Command Line (-c option) Test 2
Julian W. suggested that I write a one line Python loop for my command scripts instead of map(), as in my earlier article. Instead of map(lambda l: expression(l), sys.stdin), I could write for l in sys.stdin: expression(l). An example trivial command to echo all input lines would be:
python -c "import sys; for line in sys.stdin: print line,"
Problem is that the Python interpreter complains:
  File "<string>", line 1
    import sys; for line in sys.stdin: print line,
                  ^
SyntaxError: invalid syntax
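The error arises because a compound statement such as for cannot follow a semicolon on the same logical line; only simple statements can be chained that way. Here is a minimal sketch, written for Python 3 (an assumption, since the post used Python 2), that checks this with the built-in compile() and shows that the same code compiles once a real newline separates the two statements:

```python
# Sketch only: uses compile() to show which one-line forms Python accepts.
def compiles(source):
    """Return True if source parses as a Python program."""
    try:
        compile(source, "<string>", "exec")
        return True
    except SyntaxError:
        return False

# A compound statement (for) after a semicolon is rejected.
one_liner = "import sys; for line in sys.stdin: print(line, end='')"
# The same code with a newline between the statements is accepted.
two_lines = "import sys\nfor line in sys.stdin: print(line, end='')"

print(compiles(one_liner))
print(compiles(two_lines))
```

This suggests the map() or generator-expression form remains the practical choice for a single-line -c command.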
Wednesday, 7 May 2008
Sed Translate / Transform / Transliterate Command
Note to self: sed's (Stream EDitor) command y/list1/list2/ transforms (transliterates) each character in list1 to the character at the same position in list2. list1 and list2 must be explicit character lists, not regular expressions (and hence, not character classes). In other words, if you enter y/[a-z]/[A-Z]/, sed looks for the characters '[', 'a', '-', 'z' and ']' in the input and replaces them with '[', 'A', '-', 'Z' and ']' respectively; sed does not expand the character class [a-z] to replace it with [A-Z]. The same applies to POSIX character class names such as [:lower:] and [:upper:].
I had confused sed's transform command with the tr (translate) command, which supports interpreted sequences, e.g. tr [:lower:] [:upper:] transforms all lower-case characters to upper case.
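The positional mapping can be illustrated outside sed. The sketch below is in Python 3 and uses str.maketrans(), which also pairs characters by position; the helper name sed_y is made up for this illustration and is not part of sed or Python.

```python
# Sketch: sed's y/list1/list2/ maps characters by position, like str.maketrans.
def sed_y(text, list1, list2):
    """Replace each character found in list1 with the one at the same position in list2."""
    if len(list1) != len(list2):
        raise ValueError("lists must have the same length")
    return text.translate(str.maketrans(list1, list2))

# "[a-z]" is treated as five literal characters, so only 'a' and 'z' change case.
print(sed_y("jazz [a-z]", "[a-z]", "[A-Z]"))
# Spelling out the whole alphabet gives the effect that tr [:lower:] [:upper:] has.
print(sed_y("hello", "abcdefghijklmnopqrstuvwxyz", "ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
```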
Monday, 5 May 2008
Obscure Cmd.exe Output Replacement (Back Tick)
With Unix shells such as bash and [t]csh, you can set the value of a variable to the result of a command using the back-tick operator (or output replacement). For example, in bash, LINES=`wc -l filename` sets the variable LINES to the output of wc -l, which reports the number of lines in filename. This technique is useful when you want to pass the value of a computed variable to subsequent commands in a script.
Windows' cmd.exe also supports this feature, in an obscure way, using the for command: for /f %i in ('command') do set VARIABLE=%i. To reproduce the previous example in cmd.exe: for /f %i in ('wc -l filename') do set LINES=%i.
Notes:
- Use %%i in a script.
- Use single-quote marks to delimit a command. If you use double-quote marks, for treats the argument in parentheses as a string.
I saw this and other cmd.exe hacks somewhere but I didn't take a note of it. Grr. Remind myself to update this page when I find that site again.
Labels: Scripting, Windows Cmd
Thursday, 1 May 2008
More Uses of Getclip-Putclip
More uses of GnuWin32 / Cygutils tools getclip and putclip using this recipe: getclip | <command chain> | putclip.
- Copy m'th and n'th column of a table from a browser: cut -fm,n.
- Copy columns from Excel and replace tab character with space: tr \t " ".
- Capitalize letters: tr [:lower:] [:upper:]. (Duh! Enter Shift-F3 in Microsoft Word, thanks to Maria H.).
- Remove indentation from e-mail messages: sed "s/> //".
- Remove indentation from source code in Word document: sed -e "s/^ //" (5-May-2008).
- Join lines broken into multiple lines by e-mail clients: dos2unix | tr -d \n. On a Windows system, tr doesn't recognise CR-LF pairs for terminating a line, so you have to convert them to a Unix-style LF using dos2unix first (6-May-2008).
- Another way to join broken lines: tr -d \r\n using escape codes for carriage return and line feed, respectively (11-May-2008).
- Remove formatting from string: getclip | putclip. This is equivalent to Microsoft Word's Paste Special / Unformatted Text. It also works around an annoyance in Outlook 2003, where Edit / Paste Special is disabled when you are responding to an HTML-formatted document (7-May-2008).
- Remove HTML / XML formatting from input: sed -e "s/<[^>]*>//g" (12-Jun-2008).
A second recipe is (for /f %i in ('getclip') do @command %i) | putclip if command cannot be used in a pipeline. Two examples are basename (return name of file in a path string) and dirname (return path string without file name).
2005-05-01: Don't simply list transformations and filters that can be done with GnuWin32 tools, but ones where existing applications (e.g. Excel, Firefox, Outlook or Word) don't have an easy way to achieve a particular action.
Labels: GnuWin, Scripting, Windows, Windows Cmd
Wednesday, 30 April 2008
Using Clipboard in the Command Line
GnuWin32 / Cygutils package has two tools for interacting with the Windows clipboard: getclip and putclip. The first copies text from the clipboard to standard output and the second copies text from standard input to the clipboard. These tools are useful when you want to process text from one Windows application before pasting the text into another application, in the following recipe: getclip | <filters> | putclip.
For example, I want to paste all DLL file names in a folder into a document:
- Navigate to the required folder using 2xExplorer browser.
- Type Alt+a to select all files.
- Type Alt+c to copy all file names. 2xExplorer copies the absolute path for each file.
- Start cmd.exe console.
- In cmd.exe console, enter: getclip | cut -d\ -fn | grep dll$ | putclip. cut is a GnuWin32 tool which selects a column of data given a column delimiter (-d\ defines backslash) and field number (-fn defines column n). grep filters the output to only list files whose names end in "dll".
- Start editor.
- Paste the text in the clipboard in destination document.
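For comparison, the filter stage of the recipe can be sketched in Python 3. This is only an illustration: the function name dll_names and the sample paths are made up, reading and writing the clipboard is left to getclip and putclip, and ntpath.basename stands in for cut so that the column number does not need to be known.

```python
import ntpath  # Windows path rules, usable on any platform

def dll_names(clipboard_text):
    """Return file names ending in 'dll' from newline-separated absolute paths."""
    names = []
    for line in clipboard_text.splitlines():
        name = ntpath.basename(line.strip())
        if name.endswith("dll"):  # same case-sensitive test as grep dll$
            names.append(name)
    return names

# Hypothetical clipboard content, as 2xExplorer might copy it.
sample = "C:\\app\\core.dll\r\nC:\\app\\readme.txt\r\nC:\\app\\bin\\ui.dll\r\n"
print("\n".join(dll_names(sample)))
```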
Of course, you can do the same using Excel:
- Navigate to the required folder using 2xExplorer browser.
- Type Alt+a to select all files.
- Type Alt+c to copy all file names. 2xExplorer copies the absolute path for each file.
- Start Excel.
- Paste data in a worksheet column.
- Select all cells by typing Shift+Space.
- Open Convert Text to Columns Wizard by typing Alt+d+e.
- Select Delimited data type by typing Alt+d.
- Type Alt+n to go to page 2.
- Select Other delimiter by typing Alt+o, then enter "\" for paths.
- Type Alt+f to run the wizard.
- Start Auto Filter by typing Alt+d+f+f.
- Move to filter column using the mouse (no keyboard shortcuts?) then select from the drop down list (Custom …).
- Select the ends with criterion, enter .dll, then press Enter.
- Move cursor to required column and select it using Control+Space.
- Copy column by typing Control+C.
- Start editor.
- Paste the text in the clipboard in destination document.
The Excel solution has many more steps than the getclip-putclip solution, but Excel leads you through to a solution step-by-step. If you're familiar with GNU tools, then the getclip-putclip recipe is faster to use and much more extensible.
2008-05-07. I should have remembered that the basename command would output the name of the file without the leading path string. See later article More Uses of Getclip-PutClip about how to use basename in a pipeline.
Labels: GnuWin, Scripting, Windows, Windows Cmd
Friday, 25 April 2008
Strange GnuWin32 Invalid Argument Error Messages
When chaining GnuWin32 commands in Windows cmd.exe, you may encounter strange error messages like this:
> ls | grep …
ls: write error: Invalid argument
The first command reports a write error but the error is really in the second command after the pipe symbol.
You may also encounter a similar write error if the wrong command is found in your PATH variable. For instance, Windows and GnuWin32 both have a find and sort command which support different command-line options, so depending on the order of directories listed in your PATH variable, one version or the other is used. If you enter the wrong command-line options for these commands, they won't start and cause the command earlier in the chain to report some sort of I/O error.
Labels: GnuWin, Scripting, Windows Cmd
Thursday, 24 April 2008
Python Command Line (-c option)
Perl has a -n option which implicitly runs a while-loop over all lines in STDIN (while (<>) { }). This mode is handy in a command shell when Perl is the recipient of the output of another command and you don't want to write a script. Can we do the same for Python?
Python has a -c option which runs a command in the string following it. While it's not entirely clear to me what is a Python command, I found that you can write some useful functions using list functions and statements using this template:
python -c "import <package>; print '\n'.join(<list function>(lambda x: <expression>, (s.strip() for s in sys.stdin)))"
To use this template, replace <package> with a package name (e.g. os), <list function> with a list function (e.g. filter()) and <expression> with, well, an expression. The rest of the template just constructs a list of strings (without a trailing "\n") from the input and prints the results.
For simple string processing, the list function and expression are not required, resulting in a simplified version of this template:
python -c "import <package>; print '\n'.join(<fn>(s.strip()) for s in sys.stdin)"
While researching this topic, I found an ASPN Python Recipe called Pyline to help write commands. Here are the examples in that recipe rewritten using my template:
Print the first 20 characters of each line:
tail test.txt | python -c "import sys; print '\n'.join(s.strip()[:20] for s in sys.stdin)"
Print the 7th word in each line, assuming the separator is ' ':
tail test.txt | python -c "import sys; print '\n'.join(' '.join(s.strip().split(' ')[6:7]) for s in sys.stdin)"
Note that you can also get columns of text from a file using the cut command. Also note that the reason for using the array slice is to avoid getting an IndexError exception if the string is not long enough.
List all files that are greater than 1024 bytes in size:
ls | python -c "import os, sys; print '\n'.join(filter(lambda x: os.path.isfile(x) and os.stat(x).st_size > 1024, (s.strip() for s in sys.stdin)))"
Generate MD5 digest values for a list of files, like md5sum.
ls *.txt | python -c "import md5, sys; print ''.join('%s %s' % (md5.new(file(s.strip()).read()).hexdigest(), s) for s in sys.stdin)"
26-Apr-2008: Replaced list comprehension statement (for-in with square brackets) with generator expression (for-in with parentheses) in the template to avoid very large lists stored in memory.
Added MD5 digest example, and realised that we only need to use list functions (e.g. filter()) if you want to change the members of the resulting list. Otherwise, the simpler template suffices.
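The two templates can also be sketched as one small Python 3 function, which may be easier to test than a quoted one-liner. The name process_lines and its keep and transform parameters are made up for this illustration; pass sys.stdin as lines to use it in a pipeline.

```python
def process_lines(lines, transform=None, keep=None):
    """Strip each line, optionally filter with keep, optionally map with transform."""
    stripped = (s.strip() for s in lines)        # generator, so nothing is stored
    if keep is not None:
        stripped = (s for s in stripped if keep(s))
    if transform is not None:
        stripped = (transform(s) for s in stripped)
    return "\n".join(stripped)

sample = ["alpha beta\n", "a considerably longer line of text\n"]
# Equivalent of the "first 20 characters of each line" example.
print(process_lines(sample, transform=lambda s: s[:20]))
```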
Tuesday, 8 April 2008
Functional Python Palindromes
To find all palindromes from a list of words in a file, one word per line, you could write a procedural Python program like this:
for row in file('test.txt'):
    s = row.strip()
    if s == s[::-1]:
        print s
Here's a functional Python version, with notes below:
from itertools import imap
filter(lambda s: s == s[::-1], imap(str.strip, file('test.txt')))
5      3              4        2               1
1. Create a file iterator.
2. When we read a line from a file into a string, the string has a trailing newline character (e.g. 'add\n'). We want to remove that trailing newline character, so we use the itertools.imap() function to create a new iterator that applies str.strip() to each line read. The result is an iterator that provides strings without the newline character.
3. Define an anonymous function using the lambda keyword that returns true if the input string is a palindrome.
4. Python idiom for returning a reversed sequence (a string is a sequence of characters).
5. Use the filter function to return a list of palindromes.
Using this input file …
add
dad
dam
mad
made
madam
set
… the result of running the functional script is:
['dad', 'madam']
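In Python 3 the built-in map() is already lazy, so itertools.imap is no longer needed. A minimal sketch of the same idea follows; the function name palindromes is made up here, and a list of lines stands in for the file.

```python
def palindromes(lines):
    """Return the stripped lines that read the same forwards and backwards."""
    words = map(str.strip, lines)  # built-in map is lazy in Python 3
    return list(filter(lambda s: s == s[::-1], words))

words = ["add\n", "dad\n", "dam\n", "mad\n", "made\n", "madam\n", "set\n"]
print(palindromes(words))
```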
Monday, 7 April 2008
Reading CSV Files in Python
Python has a csv module for reading and writing CSV files (usually exported by Excel or database tables). The basic use of this module is documented in the on-line help. My CSV files usually have a header row, so the idiomatic way to skip this line is to open the CSV file and use the next() function immediately:
from csv import reader
f = open("blah.csv", "rb")
f.next()
for row in reader(f):
    print row
If your CSV files are pretty simple (e.g. only single line data, no quotes, etc.), you can use list comprehension and array slicing:
                 1       2                                       3
for row in [line.strip().split(',') for line in file("blah.csv")][1:]:
    print row
Notes:
1. You have to remove the trailing "\n" from each line.
2. Split the input line using the delimiter, typically a comma.
3. The list comprehension statement returns all lines, so to ignore the first line, you take a slice of the array starting from the second line.
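A Python 3 sketch of the header-skipping idiom is shown below. It uses the built-in next() on the reader itself rather than on the file, which also copes with quoted fields that contain commas. The function name read_rows and the sample data are made up for illustration.

```python
import csv
import io

def read_rows(stream):
    """Return the data rows of a CSV stream, skipping the header row."""
    rows = csv.reader(stream)
    next(rows, None)  # discard the header; None avoids StopIteration on empty input
    return list(rows)

# io.StringIO stands in for an open file here.
data = io.StringIO('name,qty\n"Smith, J",3\nJones,5\n')
print(read_rows(data))
```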
Saturday, 2 February 2008
Prune Directories with Python
I converted my earlier PowerShell script to prune directories to Python:
 1 from os import listdir, rmdir
 2 from os.path import isdir, exists, join
 3
 4 def prune_directory(path):
 5     if len(path) < 1:
 6         print "Empty path"
 7         return
 8     if not exists(path):
 9         print "Invalid path:", path
10         return
11     if not isdir(path):
12         return
13     if len(listdir(path)) <= 0:
14         rmdir(path)
15         return
16     for elem in listdir(path):
17         prune_directory(join(path, elem))
18     if len(listdir(path)) <= 0:
19         rmdir(path)
It's almost a one-to-one translation from the initial PowerShell version. The difference is that in Python, you have to ensure that path is a directory before you call listdir(path).
Python (since version 2.3) has a generator function os.walk() for traversing a directory tree. Below, the recursive function call in lines 13-19 is replaced by a for loop in lines 13-14.
 1 from os import listdir, rmdir, walk
 2 from os.path import isdir, exists
 3
 4 def prune_directory_walk(path):
 5     if len(path) < 1:
 6         print "Empty path"
 7         return
 8     if not exists(path):
 9         print "Invalid path:", path
10         return
11     if not isdir(path):
12         return
13     for curr, dirs, files in walk(path, topdown=False):
14         if len(listdir(curr)) < 1: rmdir(curr)
In this sample, os.walk() returns a tuple (curr, dirs, files) for each directory it visits. curr is the current directory being traversed, and dirs and files are the directories and files in that directory. Using the parameter topdown=False, os.walk() starts producing these tuples from the lowest descendant directory and working up to the start directory, path.
Note that the loop's conditional statement uses len(listdir(curr)) instead of just len(dirs). os.walk() generates the dirs list before it visits each of the directories in dirs; if all child directories in dirs have been deleted, dirs would still contain an unchanged list and the parent directory, curr, would not be deleted. At least, that's what I think happens; the Python help doesn't say so explicitly.
In earlier versions of Python, there is a similar function called os.path.walk() but os.walk() is much easier to use.
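As a check on the bottom-up behaviour, here is a Python 3 sketch that prunes a throwaway tree created with tempfile. The function name prune_empty_dirs and the directory names are made up for this illustration; the re-listing with os.listdir() follows the reasoning above about the dirs list being computed before the children are removed.

```python
import os
import tempfile

def prune_empty_dirs(path):
    """Remove empty directories under path, bottom-up, including path if it ends up empty."""
    if not os.path.isdir(path):
        return
    for curr, dirs, files in os.walk(path, topdown=False):
        # Re-list the directory: dirs was computed before the children were removed.
        if not os.listdir(curr):
            os.rmdir(curr)

# Throwaway tree: root/a/b is empty, root/keep holds a file.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))
os.makedirs(os.path.join(root, "keep"))
with open(os.path.join(root, "keep", "file.txt"), "w") as f:
    f.write("data")

prune_empty_dirs(root)
print(sorted(os.listdir(root)))
```

As with the PowerShell version, try it on a scratch directory first, because removed directories are not recoverable.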
Labels: Programming, Python, Scripting
Thursday, 31 January 2008
Cmd.exe Environment Variables with Colon and Tilde
Some commands in Windows cmd.exe batch files have a leading tilde (~) character or a trailing colon-tilde (:~) pair of characters attached to environment variable names. What's the purpose of these characters?
Trailing Colon-Tilde Pair
You can find more about colon-tilde in the help using set /?. Briefly, you can …
- Slice a string in a variable: %NAME:~s,n% where s is the start position (zero-offset) and n is the number of characters. If s is negative, the offset is counted from the right (i.e. -1 refers to the rightmost character). If n is negative, the slice stops that many characters before the end of the string.
- Replace a substring with another string in a variable: %NAME:s1=s2% where s1 is the substring to be replaced and s2 is the replacement.
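The slicing rules can be approximated with Python 3 string slices. The sketch below is only an illustration of the rules as described by set /?; the function name cmd_substring is made up, and edge cases such as offsets beyond the string length have not been compared against cmd.exe.

```python
def cmd_substring(value, start, length=None):
    """Approximate %NAME:~start,length% for a string value."""
    if start < 0:
        start = max(len(value) + start, 0)  # negative start counts from the right
    if length is None:
        end = len(value)                    # no length: take the rest of the string
    elif length < 0:
        end = len(value) + length           # stop |length| characters before the end
    else:
        end = start + length
    return value[start:max(end, start)]

print(cmd_substring("abcdef", 0, 2))   # like %NAME:~0,2%
print(cmd_substring("abcdef", -2))     # like %NAME:~-2%
print(cmd_substring("abcdef", 0, -2))  # like %NAME:~0,-2%
```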
Leading Tilde
The leading tilde is used to decompose elements in a batch file parameter formatted as a path, such as the parent directory or file extension. The best reference is Frequently Asked Questions Regarding The Windows 2000 Command Processor, "How do I parse a file name parameter into its' constituent parts?" (sic). Note that you can only use a leading tilde for batch file parameters, not environment variables (!).
4-Mar-08. Another version of the reference is Frequently Asked Questions Regarding The Windows 2000 Command Processor. 09-Sep-02.
Labels: Scripting, Windows Cmd
Wednesday, 30 January 2008
Add Path Environment Variable in PowerShell
Adding a path to the PATH environment variable in the current PowerShell session is simpler than I thought:
> $env:path += ";path"
Note: remember to prepend the semi-colon to the new path to make a valid path list.
You can make your PATH variable persistent using the SetEnvironmentVariable() .Net method:
[System.Environment]::SetEnvironmentVariable("PATH", $Env:Path + ";path", "target")
… where target is "Machine", "User" or "Process". Check the .Net documentation for what these values mean.
Labels: PowerShell, Scripting
Saturday, 26 January 2008
Prune Directories with PowerShell
I made a backup of all files matching a certain pattern from one directory to another. If the pattern was, say, PostScript (*.ps) files, you can use the following PowerShell statement:
Copy-Item -recurse -filter *.ps <source> <destination>
Now I had a new directory with the same structure as the original one, but Copy-Item made many new empty directories because there were files in the source directories but these files were not copied. Just to be tidy, I wanted to prune the empty directories in the destination path. The Remove-Item cmdlet does not have an option to remove empty directories, so I wrote the following short PowerShell script:
 1 function prune-directory {
 2     param ([string]$path)
 3     if ($path.Length -le 0) {
 4         write-host "Empty path."
 5         return
 6     }
 7     if (-not (test-path -literalPath $path)) {
 8         write-host "Invalid path:", $path
 9         return
10     }
11     if (@(get-childitem $path).Count -le 0) {
12         remove-item $path
13         return
14     }
15     get-childitem $path | where-object { $_.PsIsContainer} | foreach { prune-directory $_.FullName }
16     if (@(get-childitem $path).Count -le 0) {
17         remove-item $path
18     }
19 }
To use it, just enter:
prune-directory <path>
You should verify that the function works the way you expect before using it. Once your directories or files are deleted, they're GONE.
prune-directory() is a recursive function that walks a directory tree and deletes any empty directory it finds. Lines 3-10 check for invalid parameters, lines 11-14 delete the current directory if it is empty and line 15 calls this function for all children which are containers in the current directory. Lines 16-18 are required in case the current directory has no children because they were all deleted by line 15.
In lines 11 and 16, we use @(…) to force the result of get-childitem $path to be an array, otherwise we may not be able to count the number of children in a directory. It's a known - uh - nuance in PowerShell that if a cmdlet finds zero or one object, it returns a scalar value rather than an array.
2008-05-15: This change should fix the problem of escape characters in the path string: test-path -literalPath $path.
Labels: PowerShell, Programming, Scripting
Sunday, 20 January 2008
PowerShell File Version Information
Compiled files in your Windows computer, such as executables and libraries (or files with a .exe and .dll suffix in their names), can contain some additional information stored in a FileVersion structure. You can see these properties in Explorer's Properties dialog, Details tab.
Before releasing a Windows-based product, I wanted to check that the Copyright and Product Version fields in all compiled files were correctly updated. We always increment the number in Product Version and if the product was released in the start of the year, we also update the Copyright field. You can use Windows Explorer to view these properties in several files at once (just select all relevant files and show the Properties dialog), but if a file had a different value from the others, the Properties / Details tab shows the unhelpful multiple values text. Which file has a value different from the others? It isn't that hard to check each field individually but why not automate the test?
You can use the following PowerShell one-liner to list the Copyright field of all compiled files:
> get-childitem * -include *.dll,*.exe | foreach-object { "{0}`t{1}" -f $_.Name, [System.Diagnostics.FileVersionInfo]::GetVersionInfo($_).LegalCopyright }
To test on the Windows PowerShell folder:
gpowershell.exe Copyright (c) Microsoft Corporation. All rights reserved.
powershell.exe © Microsoft Corporation. All rights reserved.
pwrshmsg.dll © Microsoft Corporation. All rights reserved.
pwrshsip.dll © Microsoft Corporation. All rights reserved.
The first command get-childitem * -include *.dll,*.exe retrieves a list of files in the current directory that have a *.dll or *.exe suffix. The Get-ChildItem cmdlet has a -filter option but it only accepts one pattern.
The second command outputs the filename and the Copyright information of each file using the format (-f) operator. We use the .Net [System.Diagnostics.FileVersionInfo]::GetVersionInfo() method to obtain the file's LegalCopyright (or Copyright) field.
To check the Product Version field, use .ProductVersion instead of .LegalCopyright. If you are interested in other fields, check MSDN for a complete list of fields returned by GetVersionInfo.
Labels: PowerShell, Scripting
Wednesday, 9 January 2008
PowerShell and Cat Extract Lines with Line Numbers
I wanted to extract some lines of text, each line prefixed with its line number, from a text file. Frustratingly, while IDEs such as Visual Studio and NetBeans and text editors such as Vim happily (if software were said to have emotion) show line numbers in their display, you can't select the line numbers with text! In the last century, I would use cat -n | head -n1 | tail -n2 and copy the required lines from the output. Fast forward to yesterday where I found myself using cat -n again in PowerShell. This time, I could use Select-Object to extract only the lines I wanted …
> cat -n <file> | select-object -first n1 | select-object -last n2
Get-Content : A parameter cannot be found that matches parameter name 'n'. …
It turns out that cat is aliased to Get-Content, which doesn't process the -n parameter. Shay Levi and Richard Siddaway provided me with some solutions (see newsgroup microsoft.public.windows.powershell, topic "Temporarily ignore alias") and I was on my way again:
> cat.exe -n <file> | select-object -first n1 | select-object -last n2
Of course, since the first command creates an array of strings, you can slice it and end up with a much shorter statement:
> (cat.exe -n <file>)[r1..r2]
Note the following relationship: r1 = n1-n2 and r2 = n1-1, because array indexes start at zero.
But if you don't have cat.exe installed, you can reproduce the behaviour of cat -n with this PowerShell solution:
> get-content <file> | foreach-object { $i=1 } { "{0,4} {1}" -f $i, $_; $i++ }
Let's analyze the second command: -f is the PowerShell format operator which formats the right-hand argument ($i, $_) using the left-hand argument ("{0,4} {1}"). The first format control string ({0,4}) means "format input 0 in 4 spaces, right aligned". The second format control string ({1}) just writes each line without any formatting.
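The same numbering and range selection can be sketched in Python 3 for comparison. The function name numbered and the sample text are made up for this illustration; the format string "{0:>4} {1}" plays the role of PowerShell's "{0,4} {1}".

```python
def numbered(lines, first, last):
    """Return lines first..last (1-based, inclusive) prefixed with right-aligned line numbers."""
    out = []
    for number, line in enumerate(lines, start=1):
        if first <= number <= last:
            out.append("{0:>4} {1}".format(number, line.rstrip("\n")))
    return out

text = ["alpha\n", "beta\n", "gamma\n", "delta\n"]
print("\n".join(numbered(text, 2, 3)))
```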
Labels: PowerShell, Scripting
Tuesday, 1 January 2008
PowerShell Group-Object and Anagrams
In an earlier article, we used an associative array to group words with the same property (in this case, the same set of letters) to find anagrams. While that solution worked, it seemed to me that there should be an easier solution using the Group-Object cmdlet.
> "add", "dad", "dam", "mad", "made", "madam", "set" | group { $_.toCharArray() | sort-object }
Count Name Group
----- ---- -----
2 a d d {add, dad}
2 a d m {dam, mad}
1 a d e m {made}
1 a a d m m {madam}
1 e s t {set}
Looking good, so let's try a bigger set of words in a file:
> get-content test.txt | group-object { $_.toCharArray() | sort-object }
Count Name Group
----- ---- -----
2 a d d {test.txt, test.txt}
2 a d m {test.txt, test.txt}
1 a d e m {test.txt}
1 a a d m m {test.txt}
1 e s t {test.txt}
That's mighty weird. For some reason, the group has the name of the file rather than the actual word while the signature in the Name column is computed correctly. Is the problem to do with the expression for the group-object? Let's try a simpler expression:
> get-content test.txt | group-object { $_.length }
Count Name Group
----- ---- -----
5 3 {test.txt, test.txt, test.txt, test.txt...}
1 4 {test.txt}
1 5 {test.txt}
It's very puzzling and it seems like group-object was treating each file rather than each word as an input. But then, why is the expression being computed for each word?
Even stranger is when you assign the contents of a file to a variable and get the same result!
> $l = get-content test.txt
> move-item test.txt test2.txt #Ensure original file is no longer available.
> $l | group-object {$_.length}
Count Name Group
----- ---- -----
5 3 {test.txt, test.txt, test.txt, test.txt...}
1 4 {test.txt}
1 5 {test.txt}
In this case, I would have thought that group-object would operate on a list of words and not refer to the original file.
Later … .Net has a function string[] ReadAllLines() that returns an array of strings, so the following works a treat:
> [System.IO.File]::ReadAllLines("C:\temp\download\doc\language\test.txt") | group-object {$_.ToCharArray() | sort-object}
Count Name Group
----- ---- -----
2 a d d {add, dad}
2 a d m {dam, mad}
1 a d e m {made}
1 a a d m m {madam}
1 e s t {set}
At least PowerShell's integration with the .Net Framework makes it possible to solve a problem if the pre-defined cmdlets don't work as you expect.
2-Jan-2008. If you're using PowerShell 2.0 CTP, the Get-Content version works.
Labels: DotNet, PowerShell, Scripting
Friday, 21 December 2007
PowerShell Associative Arrays and Anagrams
Jon Bentley's Programming Pearls describes the following pipeline for finding anagrams from a list of words: generate a signature for the word, then group together all words with the same signature. The signature is just a sorted list of all letters in a word. For instance, dame, edam, made and mead all have the same signature adem.
Associative Arrays
To implement an anagram-finder in PowerShell, let's use an associative array: the signatures are stored as keys, and each word that has a given signature is stored in an array related to that key. Below is a concrete example of what we plan to do:
> $a = @{adem = ("dame","edam","made","mead")}
> $a
Name Value
---- -----
adem {dame, edam, made, mead}
Notes
- The key for an associative array does not have to be quoted if it is a string without a whitespace.
- Oddly, arrays in the value column are printed delimited by braces instead of parentheses.
Generating Word Signatures
A word signature is just a string with the letters in the original word sorted. We split a string into a char[] type, sort it, then make it into a string again:
> $sig = [string]("edam".ToCharArray() | sort-object)
> $sig
a d e m
> $sig.length
7
Note that the signature sig is a string with a space between each character. We could prettify the signature, but the spaces don't hurt because all the signatures will have the same format.
Find Anagrams in a Word List
Now that we can create a signature, we can find all anagrams in a word list by using each word's signature as a key in the associative array and adding the word to that key's array:
> get-content <test.txt> | foreach-object { $h = @{} } { $t = $_.clone(); $sig = [string]($t.ToCharArray() | sort-object); if (!$h.containsKey($sig)) { $h[$sig] = @() } $h[$sig] += $t } { $h }
Name Value
---- -----
adem {dame, edam, made, mead}
Let's decompose this longish statement to understand what it is doing:
get-content <test.txt> |
- Send every line in the input file into a pipeline.
foreach-object
- Apply some operation on each object in the pipeline.
{ $h = @{} }
- Initialize the associative array h in the begin script block.
{ process block }
- Here is where words with the same signature are grouped together in the associative array.
$t = $_.clone();
- Copy the input word from the current pipeline object before it is overwritten in the next statement.
$sig = [string]($t.ToCharArray() | sort-object);
- Get the signature of the input word.
if (!$h.containsKey($sig)) { $h[$sig] = @() }
- Create an array of anagrams if the signature does not already exist.
$h[$sig] += $t
- Add the current word to the array of anagrams.
{ $h }
- Output the associative array h in the end script block.
Conclusion
This article presented an imperative approach to grouping data with the same property using a loop and an associative array. You could apply the same style to any programming language and get the same result. In a future article, I will explore how to use a more streamlined approach to solve similar problems.
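To illustrate the point about other languages, here is a sketch of the same loop-and-dictionary approach in Python 3. The function name group_anagrams is made up for this illustration, and collections.defaultdict replaces the explicit containsKey test.

```python
from collections import defaultdict

def group_anagrams(words):
    """Map each signature (sorted letters) to the list of words that share it."""
    groups = defaultdict(list)
    for word in words:
        signature = "".join(sorted(word))  # e.g. "edam" -> "adem"
        groups[signature].append(word)
    return dict(groups)

print(group_anagrams(["dame", "edam", "made", "mead", "set"]))
```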
Labels: PowerShell, Programming, Scripting
Wednesday, 19 December 2007
Code Noise Ratio
Using Scott Hickey's article on reducing code noise in Groovy as a starting point, can we measure the noisiness of a programming language and environment? Let's say that the quietest code is where the developer has to add the least amount of boilerplate code to implement a particular feature. Just to keep the metric simple, let's measure the size of the source files for different implementations of the same feature and assume that the developer is trying to write sensible code. We assume that the shortest version is the quietest and calculate the ratio between the shortest version and all other versions.
Let's test this ratio on several versions of Hello World implemented using different scripting languages in previous articles. What is the noisiness of each implementation?
| Version | Platform | Size | Ratio |
|---|---|---|---|
| Groovy + SwingBuilder | Java | 240 | 0% |
| Jython | Java | 253 | 5.42% |
| Groovy | Java | 270 | 12.50% |
| IronPython | .Net | 473 | 91.25% |
| PowerShell | .Net | 579 | 141.25% |
What does this table tell us about reducing code noise for small programs?
- Script environment should pre-import common classes. Groovy + SwingBuilder and plain Groovy make it very easy to write a small GUI program because all the Java Swing references are pre-imported. The Jython and IronPython implementations are nearly the same but the IronPython version is longer because it has to load .Net references.
- Use class aliases. One reason the PowerShell version is very long is you can't add a class name into the current namespace (such as Python's from <library> import <class> or C#'s using <namespace>). You can define class aliases like this: $Form = [System.Windows.Forms.Form] but that only starts reducing noise when you use that class more than once.
- Define properties in constructors. Another reason the PowerShell version is long is that only the .Net constructors for GUI controls are available through the platform interface, so to define a control, you have to write a sequence of statements starting with creating a new object followed by some SetX() methods. I wonder if PowerShell adaptors can overload constructors?
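The Ratio column in the table above appears to be the percentage by which each file exceeds the smallest one. That formula is an inference rather than something stated in the article; it reproduces the Jython, Groovy and PowerShell rows, which are the sizes used in this Python 3 sketch. The function name noise_ratio is made up for illustration.

```python
def noise_ratio(size, baseline):
    """Percentage by which size exceeds the baseline (shortest) size."""
    return round((size - baseline) / baseline * 100, 2)

# Sizes taken from the table; the smallest one is the baseline.
sizes = {"Groovy + SwingBuilder": 240, "Jython": 253, "Groovy": 270, "PowerShell": 579}
baseline = min(sizes.values())
for version, size in sizes.items():
    print(version, noise_ratio(size, baseline))
```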
Labels: Programming, Scripting
Find Longest (or Shortest) Line in a File
Quick scripts to find the longest line in a file. To find the shortest line, just invert the appropriate test.
Gawk
gawk "{ if (length > max) { max = length; m = $0 } } END { print m }" <file>
Gawk + Sort + Tail
gawk "{ printf("""%4.0f %s\n""", length,$0) }" <file> | sort | tail -1
In Windows Cmd, three double-quotes are required to escape a double-quote, and the %4.0f format control ensures that lines up to 9999 characters long are sorted correctly.
PowerShell
get-content <file> | sort-object -property length | select-object -last 1
WC
wc -L gives the length of the longest line, but not the line itself.
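Python
For comparison, a minimal Python sketch of the same idea. The function name longest_line is made up for this example; on a tie it returns the first longest line, which matches the > test in the gawk version.

```python
def longest_line(lines):
    """Return the longest line without its line ending, or '' if there are no lines."""
    stripped = (line.rstrip("\r\n") for line in lines)
    # max() keeps the first of several equally long lines; use min() for the shortest.
    return max(stripped, key=len, default="")

sample = ["short\n", "the longest line here\n", "medium line\n"]
print(longest_line(sample))
```

Any iterable of lines works, so an open file object can be passed in place of the sample list.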
2008-04-20: Consolidated all samples into this article.
Labels: PowerShell, Scripting
Monday, 17 December 2007
PowerShell String and Char[] Sort and Conversion
Introduction
I wanted to sort the characters in a string to use as a signature for that string. A string is basically an array of characters but the System.String class does not have a Sort() method. Hmm, looks like we have to break the process into the following steps:
- Convert a string to a character array.
- Sort the character array.
- Convert the character array back into a string.
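As a point of comparison only (this is Python, not PowerShell), the same three steps can be sketched as follows. The function name signature is invented for this example.

```python
def signature(s):
    chars = list(s)         # 1. convert the string to a list of characters
    chars.sort()            # 2. sort the characters in place
    return "".join(chars)   # 3. convert the characters back into a string

print(signature("mad"))  # adm
```

Strings made of the same characters, such as "mad" and "dam", end up with the same signature.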
First cut
A hitch: System.String can only be converted into a char[], so we have to use an external sorter such as the Sort-Object cmdlet. If we could have made an Array of char, then we could have used the Array's Sort() method. The first cut looks like this:
> $s = "mad"
> $sig = sort-object -inputObject $s.ToCharArray()
> $sig
m
a
d
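Eh? Why isn't $sig sorted? Either Sort-Object or PowerShell doesn't do what is expected when presented with an array through the -inputObject parameter; it appears that Sort-Object receives the whole array as a single object, which leaves nothing to sort. Let's test this: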
> "m","a","d" | sort-object
a
d
m
> sort-object -inputObject "m","a","d"
m
a
d
Second Cut
Because of the strange behaviour found in the previous section, we call Sort-Object in a pipeline. In addition, we reconstruct the output into a string:
> $s = "mad"
> $sig = [string]($s.ToCharArray() | sort-object)
> $sig
a d m
> $sig.length
5
What's wrong now? Why does $sig have a space between each pair of characters? This was getting a bit deep into PowerShell for me at the moment, so let's use an appropriate constructor in System.String such as String(char[]).
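21-Dec-2007: The solution is to set OFS (Output Field Separator) to an empty string, like this: $OFS = "". PowerShell inserts OFS between the elements when it converts an array to a string, and the default is a single space.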
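The length of 5 is simply three characters plus two separators. A small Python analogy (not PowerShell itself) shows the effect of the separator; the variable names are made up for this illustration.

```python
chars = sorted("mad")             # ['a', 'd', 'm']

with_space = " ".join(chars)      # mimics the default separator, a single space
without_space = "".join(chars)    # mimics setting the separator to an empty string

print(repr(with_space), len(with_space))        # 'a d m' 5
print(repr(without_space), len(without_space))  # 'adm' 3
```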
Third Cut
We try the System.String(char[]) constructor and make the sorted array output into a string:
> $s = "mad"
> $sig = new-object String(($s.ToCharArray() | sort-object))
New-Object : Exception calling ".ctor" with "3" argument(s): "Index was out of range. Must be non-negative and less than the size of the collection. Parameter name: startIndex"
At line:1 char:16
+ $t = new-object <<<< String($s.ToCharArray())
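What gives? The error indicates that a different constructor, String(char[], startIndex, length), should be used. It's not clear why the first constructor is not available; one likely explanation is that New-Object passes the three array elements as three separate constructor arguments instead of as a single char[].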
Fourth Cut
Finally, taking into account all that we learnt above, we end up with the following statement which gives us the result we wanted:
> $s = "mad"
> $sig = new-object String(($s.ToCharArray() | sort-object), 0, $s.length)
> $sig
adm
Conclusion
The resulting statement to sort characters in a string is mostly noise because the purpose of the statement is obscured by the need to convert an object from one type to another. An earlier article about palindromes used another way to make a string from a character array: the static method [string]::Join(). That is:
> $sig = [string]::Join("", ($s.ToCharArray() | sort-object))
The second method is shorter but still rather obscure because it relies on the side-effect of the empty string argument when calling the Join() method. It's a rather disappointing end to this exercise because I spent most of the time fighting instead of using PowerShell.
19-Dec-2007. PowerShell 2.0 will have a new Join operator that should make this exercise moot.
21-Dec-2007. Fourth method is to change OFS first, leading to:
> $OFS = ""
> $s = "mad"
> $sig = [string]($s.ToCharArray() | sort-object)
> $sig
adm
> $sig.length
3
Labels: PowerShell, Scripting
Saturday, 15 December 2007
Two Hello World Windows in Groovy
Groovy is a scripting language on the Java platform. Groovy is interesting because it can be compiled into Java byte-code and makes heavy use of closures.
Groovy Hello World
In the beginning, I ported my Jython version into Groovy. It looks pretty much the same as the Jython version:
// Hello World in Groovy
f = new javax.swing.JFrame("Hello World")
f.setSize(170,70)
f.contentPane.layout = new java.awt.FlowLayout()
f.defaultCloseOperation = javax.swing.JFrame.EXIT_ON_CLOSE
f.add(new javax.swing.JLabel("Label me:"))
f.add(new javax.swing.JButton("Press me"))
f.show()
Groovy + SwingBuilder Hello World
Groovy has a helper class called SwingBuilder to make it much easier to write Swing applications. Here's one way to re-write the program:
// Hello World Window in Groovy + SwingBuilder
sb = new groovy.swing.SwingBuilder()
f = sb.frame(
    title:"Hello World"
    ,size:[170,70]
    ,defaultCloseOperation: javax.swing.JFrame.EXIT_ON_CLOSE) {
    flowLayout()
    label(text:"Label me:")
    button(text:"Press me")
}
f.show()
SwingBuilder is quite wonderful because it allows you to define a GUI with a minimum of code noise. The definition of a JFrame probably maps a keyword X to a setX() function, but I haven't worked out how SwingBuilder converts words such as flowLayout(), label and button to Swing objects and functions.
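To make the guess about keywords concrete, here is a toy Python sketch of a builder that turns each keyword X into a call to setX(). It only illustrates the idea; it is not how SwingBuilder is actually implemented, and the Frame class and build function are hypothetical names invented for this example.

```python
class Frame:
    """Hypothetical stand-in for a widget class with setX() methods."""

    def __init__(self):
        self.title = None
        self.size = None

    def setTitle(self, title):
        self.title = title

    def setSize(self, size):
        self.size = tuple(size)


def build(cls, **properties):
    """Create cls() and call setX(value) for each keyword x=value."""
    obj = cls()
    for name, value in properties.items():
        setter_name = "set" + name[0].upper() + name[1:]  # title -> setTitle
        getattr(obj, setter_name)(value)
    return obj


f = build(Frame, title="Hello World", size=[170, 70])
print(f.title, f.size)  # Hello World (170, 70)
```

The point of the sketch is that the caller states only the property names and values; the create-then-set sequence moves into the builder.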
Labels: Groovy, Java, Scripting