Analyzing a CSV file for trends or aberrations

I'm often faced with data (spreadsheets, config files, etc.) that I have to analyze to find what might be causing something to happen. Sometimes it's good things, but usually it's bad things, often urgently, and in data I've never seen before and may be unfamiliar with in general.

I tried looking for an advanced analysis tool, something that looks for repeated phrases or other patterns that would make it easier to understand trends in the data generically, but couldn't find anything.

I'm posting for two reasons:

  • I'm hoping for a recommendation for software that can do this kind of analysis.
  • I wrote a PowerShell script that does a very basic analysis. I wanted to share it, and I'm hoping for improvements to it (including encapsulating it into a function).

The code I came up with just counts the number of times each entry shows up in each column, sorts based on that count, and outputs formatted results.


    # Before you begin, set the following
    $SourceFile = Get-Item ".\SomeFile.csv"
    # Append _Stats to the source file name, e.g. SomeFile_Stats.csv
    $OutputFile = Join-Path $SourceFile.DirectoryName ($SourceFile.BaseName + "_Stats" + $SourceFile.Extension)

    #$Data = Get-ChildItem . # For testing
    $Data = Import-Csv $SourceFile.FullName
    # Column names are the NoteProperty members of the imported rows
    $ColumnList = $Data | Get-Member -MemberType NoteProperty | ForEach-Object { $_.Name }
    # Count how many times each value occurs in each column
    $CountedData = $ColumnList | ForEach-Object {
        $ThisColumn = $_
        $Data | Group-Object $ThisColumn | Select-Object @{
            n = "ColumnName"
            e = { $ThisColumn }
        }, Count, @{
            n = "Value"
            e = { $_.Name }
        }
    } | Sort-Object -Descending Count, ColumnName, Value # Alternative order: ColumnName, Count, Value
    # Build one output line per column: ColumnName,(count) value,(count) value,...
    $Results = ""
    $CountedData | Group-Object ColumnName | ForEach-Object {
        $ThisColumn = $_.Name
        $ThisGroup = $_.Group
        $Results = "$Results`n$ThisColumn"
        $ThisGroup | ForEach-Object {
            $ThisCount = $_.Count
            $ThisValue = $_.Value
            $Results = $Results + ",($ThisCount) $ThisValue"
        }
    }
    $Results | Out-File $OutputFile
    # Open the folder that contains the source and output files
    Invoke-Item $SourceFile.DirectoryName

Answers


  • You should check out Google Refine (which is downloadable software that runs in your browser). It does a fantastic job of cleaning up messy CSV files.
  • csvstudio is a set of Python tools (and a full CLI app) for generating stats off CSV files.

But if you really want to get serious about data mining, you should take a look at http://www.rdatamining.com/


I would take a look at the R language and RStudio. It is built for doing statistical analysis on large data sets. Tons and tons of libraries.
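
On the function part of the question, here is a minimal sketch of how the per-column counting could be wrapped up. The function name `Get-CsvValueCount` and its `-Path` parameter are made up for illustration and are not from the original script. It assumes PowerShell 3.0 or later, and it emits objects rather than a formatted string, so the caller decides how to display or save them.

```powershell
# Hypothetical helper: the name and parameter are illustrative only.
function Get-CsvValueCount {
    [CmdletBinding()]
    param(
        # Path to the CSV file to analyze.
        [Parameter(Mandatory = $true)]
        [string]$Path
    )

    # Force an array so that a single-row file still behaves like a collection.
    $data = @(Import-Csv -Path $Path)
    if ($data.Count -eq 0) { return }

    # Take column names from the first row, in the order they appear in the file.
    $columns = $data[0].PSObject.Properties | ForEach-Object { $_.Name }

    foreach ($column in $columns) {
        # Group rows by this column's value, most frequent value first.
        $data |
            Group-Object -Property $column |
            Sort-Object -Property @{ Expression = 'Count'; Descending = $true },
                                  @{ Expression = 'Name'; Descending = $false } |
            ForEach-Object {
                [pscustomobject]@{
                    ColumnName = $column
                    Count      = $_.Count
                    Value      = $_.Name
                }
            }
    }
}
```

Something like `Get-CsvValueCount -Path .\SomeFile.csv | Export-Csv .\SomeFile_Stats.csv -NoTypeInformation` would then write one row per distinct value, which is easier to filter than the single comma-joined line per column that the original script builds.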

