Why Visual Data Manipulation?
Posted by Andy Kerfonta in Uncategorized

Visual Data Manipulation (VDM) differs from traditional methods such as spreadsheets or programming languages by allowing you to visually choose, configure, and combine the transformations and calculations that act upon your data.  While spreadsheets certainly have a visual component, they focus on showing the data itself rather than the operations.  Programming languages like Python, R, or MATLAB hide even more, leaving you with just the input and output data as well as the code itself (if you are a programmer).

Simple for Beginners

Full-blown programming languages like R or Python are incredibly powerful for data work, but the critical drawback for all of them is that in order to use that power, you must be an experienced user.  That’s no problem if you have a team of programmers, but what about the actual data consumers who aren’t computer scientists?  They’re often stuck using spreadsheets because it’s just not practical to learn a programming language.

VDM splits apart the data and the operations so that you can focus on the flow of your data rather than on deciphering opaque algorithms or learning complicated syntax.  This makes designing your model much closer to drawing a simple flowchart — something programmers often do anyway prior to coding.  Not only is a flowchart natural and intuitive for most people, but it lets you build your model one piece at a time even if you aren’t certain of all the steps at the start.

Since VDM tools provide sets of nodes which each do a single, simple operation, building the steps in your flowchart only requires that you understand those simple operations.  From there, it’s just a question of connecting them together to get the result you need.  
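To make the idea concrete, here's a rough sketch in Python (not Transdata code — the node names and record format are purely illustrative): each node is a small function that does one simple thing, and a model is those functions wired together.

```python
# Illustrative sketch only: each "node" is a small function that performs one
# simple operation, and a model is just those functions connected in sequence.
# The names (filter_rows, add_field) are made up, not Transdata's API.

def filter_rows(records, predicate):
    """Keep only the records matching the predicate."""
    return [r for r in records if predicate(r)]

def add_field(records, name, fn):
    """Add a computed field to every record."""
    return [{**r, name: fn(r)} for r in records]

# "Connecting the nodes": filter first, then derive a new field.
data = [{"item": "A", "qty": 3, "price": 2.0},
        {"item": "B", "qty": 0, "price": 5.0}]
shipped = filter_rows(data, lambda r: r["qty"] > 0)
result = add_field(shipped, "total", lambda r: r["qty"] * r["price"])
```

Understanding either of those two "nodes" on its own is easy; the model is nothing more than the order in which they are connected.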

Models Mirror Logic

The ability to logically create a model step by step the way you think about it is arguably the most important part of VDM.  It means that mistakes are less likely and easier to fix, the time required to get something done is reduced, and the learning curve is gentle.

If you’ve used spreadsheets for very long, you’re familiar with “Excel Hell.”  Workbooks become so complicated and patched together that deciphering the logic, let alone the errors, is next to impossible.  Tracing dependencies can take huge amounts of time since the only real way to do it is to trace each affected cell back to its source.  In VDM, connections between nodes show dependencies at a glance, and they also show the data at that specific stage in the process.  If your result is wrong, it’s a simple matter of following the data until the error is uncovered.

It has been estimated that “…88 percent of all spreadsheets have errors in them, while 50 percent of spreadsheets used by large companies have material defects” (1), and that one in five large companies have suffered financial losses due to errors in spreadsheets (2).

The Model as a Report

Once your flowchart-style model is built, it becomes both a fully functioning data process as well as a useful tool for explaining that process to others.  The report is intrinsic to the model.  Not only is the end result shown, but all of the steps and the results of those steps are clearly visible.

There is no substitute for traditional reporting – graphs, charts, dashboards, etc.  And while most VDM tools (Transdata included) can send data to a reporting tool for presentation, those tools have a critical flaw: they only show the end result dataset(s).  All of the steps taken to arrive at that result are lost or, at best, obscured.
With a standard dashboard, you don’t get much in the way of an explanation of where the data came from.  How did you arrive at the results shown?  What logic went into the process?  All too often, these explanations are left to a supplemental discussion or, perhaps, a PowerPoint presentation.  VDM combines the data and the explanation in one intuitive model so that the process is the story.


  1. https://www.cnbc.com/id/100923538
  2. http://www.businessinsider.com/stupid-spreadsheet-mistakes-could-lead-to-corporate-disaster-2015-4
TRANSDATA Updates
Posted by Andy Kerfonta in Uncategorized

This was quite a large release.  Not only were there a lot of visual and UX improvements, but we made some substantial under-the-hood improvements and added 6 functions.  Here are the highlights:

  • User-defined node and layer colors
  • User-defined layer names
  • New layer toolbox
  • Merge speed improvement (in many cases >10x)
  • Model notes
  • Input names automatically updated in formulas and other nodes when a node name is changed in the model
  • Improvements to date / time handling
  • Solver improvements
  • New functions:

Nodes and Layers

Node and layer colors

Transdata has had layers for a long time, but they were pretty simple.  They didn’t have names (just numbers) and there was no visual indication of what layer(s) a node might be in.  Not only did we address both of those things, but there’s now a handy toolbox that appears next to your model for toggling layers, changing their names, and changing their colors (the toolbox itself can be toggled from the View or Layers menu).

We didn’t stop at improving the look and feel of just the layers though.  You can now change the color of any node type.  There is also a new default color scheme to maximize the visibility of different node types.

In the screenshot below, you can see the colors and controls in action.  Layer colors alter the border of nodes while changing the node color changes the interior.  Note that layer colors and names are saved with each model, but node colors are specific to the Transdata installation.

Behind-the-Scenes Improvements

While they might not be as eye-catching as all the colors, the under-the-hood improvements in this release are quite significant.  Merging got a huge speed improvement for the vast majority of use cases.  You can expect speedups up to 10x for some models.  Date and time handling is also vastly improved thanks to some new libraries.

Changing node names was an easy way to break other nodes since the name changes weren’t automatically propagated throughout the model.  Now, however, formulas are automatically updated for you.

While most of the new functions are straightforward, MAKERECTANGULAR is a bit special.  It forces any sort of ragged (i.e. non-rectangular) data to be rectangular by making sure all fields are present in every record, including duplicates.  This was already done automatically when exporting to things like CSVs that don’t handle ragged data, but now you can explicitly include it in your model.
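For the curious, here's roughly what a MAKERECTANGULAR-style operation amounts to, sketched in Python (illustrative only — this is not Transdata's implementation, and it sidesteps the duplicate-field case since Python dicts can't hold duplicate keys):

```python
# Illustrative sketch of a MAKERECTANGULAR-style operation (not Transdata's
# implementation): make sure every record carries every field seen anywhere
# in the dataset, filling the gaps with a placeholder value.

def make_rectangular(records, fill=""):
    fields = []
    for r in records:
        for k in r:
            if k not in fields:
                fields.append(k)  # preserve first-seen field order
    return [{k: r.get(k, fill) for k in fields} for r in records]

ragged = [{"name": "x", "a": 1},
          {"name": "y", "b": 2}]
rect = make_rectangular(ragged)
# Every record now has the fields name, a, and b.
```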

Cleanup and Aggregation of QuickBooks Report Data
Posted by Andy Kerfonta in TRANSDATA Application

It’s almost a given that small to mid-sized businesses (SMBs) use QuickBooks to handle their accounting.  The combination of ease of use and power is hard to beat.

For some businesses, the built-in reports are all the information they ever need.  For others, additional analysis is needed to do things like aggregate based on item type, location, address, or customer.  One of our early users wanted to see the total amount of each product type they shipped to each state.  Nothing crazy, but not something that QuickBooks just gives you.

Doing this sort of analysis in Excel is certainly possible.  Generally, it involves some copying and pasting followed by a pivot table.  To see a great how-to for doing it with Excel, check out this blog by David Ringstrom: Analyze QuickBooks Data with Excel Pivot Tables

Let’s take a look at an example using TRANSDATA® Platform.  We’ll be working with the Profit and Loss Statement Detail report.  The Excel export you get out of QuickBooks is frequently not ideal for analysis.  There are extra columns and rows that are really there to make it look nice to a person.

Here is the QuickBooks export data as it appears:


Our first task, then, is to tidy up the data to make it more analysis-friendly.  In this case, we will simply get rid of the rows which contain subheadings.  This is easily done with a simple Filter Node in TRANSDATA® that checks whether the Type field is empty, since subheading rows have no Type.

Here are the results of that filter:


It looks a lot better, though we do end up with an extra column called “1”.  That’s not really an issue: since we’re going to aggregate this data, it won’t show up in the end result.  If you really hate it, you can always remove it (hint: Field Order Node).
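If you're curious what that filter amounts to, here's the equivalent logic sketched in plain Python (made-up rows; the field names follow the QuickBooks export):

```python
# Made-up rows shaped like the QuickBooks export: subheading rows have an
# empty Type field, so keeping only rows with a non-empty Type drops them.
rows = [
    {"Type": "",        "Name": "Income", "Amount": ""},       # subheading
    {"Type": "Invoice", "Name": "Acme",   "Amount": "120.00"},
    {"Type": "Invoice", "Name": "Globex", "Amount": "80.00"},
]
cleaned = [r for r in rows if r["Type"].strip()]
# cleaned now holds only the two transaction rows.
```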

Now that our data is cleaned up, let’s try a simple aggregation using an Aggregate Node.  We’ll combine records based on the Type field and sum the Amount field like so:


Here are the results showing the totals for each Type:


Much more useful already, isn’t it?
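The Aggregate Node's behavior here is easy to sketch in plain Python (made-up rows, illustrative only): group records by Type and sum the Amount within each group.

```python
from collections import defaultdict

# Made-up rows, illustrative only: group by Type and sum Amount within each
# group, mirroring what the Aggregate Node produces.
rows = [
    {"Type": "Invoice", "Name": "Acme",   "Amount": 120.0},
    {"Type": "Invoice", "Name": "Globex", "Amount": 80.0},
    {"Type": "Check",   "Name": "Acme",   "Amount": 50.0},
]
totals = defaultdict(float)
for r in rows:
    totals[r["Type"]] += r["Amount"]
```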


Now let’s try aggregating the data using multiple fields.  We’ll combine based on unique combinations of Type, Name, and Class while still summing the Amount:


Here are the results of this new aggregation:


This is the kind of valuable business information that helps you make decisions quickly.  At a glance, we get the amount from each account.
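As a plain-Python sketch (made-up rows, illustrative only), grouping on unique combinations of several fields just means using a compound key:

```python
from collections import defaultdict

# Made-up rows: group on unique combinations of Type, Name, and Class by
# using a tuple as the grouping key, still summing Amount.
rows = [
    {"Type": "Invoice", "Name": "Acme",   "Class": "East", "Amount": 120.0},
    {"Type": "Invoice", "Name": "Acme",   "Class": "East", "Amount": 30.0},
    {"Type": "Invoice", "Name": "Globex", "Class": "West", "Amount": 80.0},
]
totals = defaultdict(float)
for r in rows:
    totals[(r["Type"], r["Name"], r["Class"])] += r["Amount"]
```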
Here is the completed model:


Just 6 nodes.  It’s available for download on our website.  Try it out with our free version and your own data, or use it as a starting point to build your own QuickBooks analysis model.

TRANSDATA® and Weather
Posted by Andy Kerfonta in TRANSDATA Application

Did you know that NOAA (the National Oceanic and Atmospheric Administration) makes ALL of their local weather data available for download?  It’s an astonishingly thorough dataset going back years and showing hundreds of stations with multiple sample intervals.

Through the National Centers for Environmental Information, you can download datasets for specific locations or the entire country with monthly, daily, and even hourly readings.  While reading up on the data online, I came across this question on Stack Exchange.  The issue is that the data has a field for WeatherType and one for HourlyPrecip, but sometimes the former shows rain while the latter shows no precipitation.  You would think that this would be impossible.  As it turns out, this is not actually an error, but it illustrates a fairly general type of problem you may want to filter out of your data (i.e. apparent inconsistencies between variables).

I went with the hourly file for the entire country in November of 2016.  It weighs in at a modest 3,978,056 records, each with 44 fields.  If you’re following along, I’d recommend turning on caching.

In addition to the problem mentioned above, I wanted to check if we might be missing hourly reports from some of the stations.  If we are, there will be different numbers of records for each station’s unique ID (the WBAN field).

Without further ado, here’s the model:


The four output data nodes correspond to the answers we are seeking:

  • Output_1 – The number of reports per station
  • Output_2 – The station with the highest number of reports with location data merged onto the record
  • Output_3 – The average number of reports per station
  • Output_4 – Those reports which show rain but no hourly precipitation
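For anyone who'd rather see the logic than the screenshots, here's a rough Python sketch of the record-level checks behind three of those outputs (tiny made-up sample; the field names WBAN, WeatherType, and HourlyPrecip come from the NOAA file, while treating "RA" as the rain indicator is my assumption for illustration):

```python
from collections import Counter

# Tiny made-up sample; WBAN, WeatherType, and HourlyPrecip are field names
# from the NOAA hourly file.  Treating "RA" as the rain indicator is an
# assumption for illustration.
reports = [
    {"WBAN": "03802", "WeatherType": "RA", "HourlyPrecip": "0.00"},
    {"WBAN": "03802", "WeatherType": "",   "HourlyPrecip": "0.05"},
    {"WBAN": "13877", "WeatherType": "RA", "HourlyPrecip": "0.10"},
]

# Output_1: number of reports per station.
per_station = Counter(r["WBAN"] for r in reports)

# Output_3: average number of reports per station.
average = sum(per_station.values()) / len(per_station)

# Output_4: reports showing rain but no hourly precipitation.
rain_no_precip = [r for r in reports
                  if "RA" in r["WeatherType"]
                  and float(r["HourlyPrecip"] or 0) == 0.0]
```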

Here is the data in Output_1, and right away we see there is an issue:


Every station has a different number of data points.  As it turns out, this is because “hourly” actually means “at least once per hour, maybe more”.  If we take a look at Output_3, we see that on average each station reports 1727 hourly readings (for November, anyway).  Note that there are only 720 hours in November (30 days × 24 hours).  Why all the extras?  A lot of the stations actually report every 15 (or fewer) minutes.  That wasn’t immediately obvious to me when I found the data.  Often, this sort of thing is assumed, and further analysis is done based on these (incorrect) assumptions.

Output_2 shows us the station with the most reports (8053 of them):


By adding a simple filter to the original model, we can see that the station in Elkins reports every 5 minutes:


Now back to our original question: which data points show rain but no hourly precipitation?  Let’s take a look at Output_4:


We end up with 64,468 records that fit this description.  While we do know that in this case it’s not actually an error, this is exactly the sort of error that is easy to overlook in data.  Maybe instead of rain information, the fields pertain to whether a particular object has been shipped and whether it arrived.  It’s easy to assume that these are always the same, but if that isn’t the case, incorrect and very costly business decisions can happen. -Andy

TransData Updates
Posted by Matt Sakurada in TRANSDATA Development

We stretched this release out a bit so that we could add two major new features: the Reporting Node and the Inception Node. Both are things we’ve wanted to add for quite a while, but other more pressing matters always got in the way.

Reporting Node

Generally speaking, people want to see some sort of graphical summary of their data. While there’s nothing stopping you from using the reporting tool of your choosing with Transdata, we’ve added a node that will export data directly to Zoho Reports. We chose Zoho for its ease of use, professional features, and availability of a free version.

Using the Reporting Node is easy. Simply connect the outputs of any nodes you want to send to the same Zoho Reports database, put in your account credentials, and run your model. Want to send data to multiple databases (or use multiple Zoho accounts)? No problem. Just use multiple Reporting Nodes. Data is sent every time you run your model, so your reports will always be up to date.

Inception Node

We went back and forth (we argued) on what to name this node for a couple weeks. Model Node was just confusing. Nesting Node tends to conjure up visions of horrible Excel formulas or IF statements. That’s not what’s happening here.

The Inception Node is a self-contained model-as-a-node. Build a model and save it as you normally would, then load it into a different model as a single node. You could think of it as a loadable module, or a black box, or just a way to tidy things up. It’s a great way to reduce the complexity of a large model. Break up the pieces into separate models and then load them as Inception Nodes.

When you run your model containing an Inception Node, the nested model receives input data, executes, and then sends output data back out into the parent model. If the nested model is saved with input data (and you don’t choose to overwrite it) it will be loaded and used as it normally would.
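Conceptually (this is only an analogy, not how Transdata is implemented), an Inception Node behaves like calling one model from inside another, as if the saved model were a function:

```python
# Analogy only (not Transdata internals): a saved model behaves like a
# function from input records to output records, and an Inception Node is
# a call to that function from inside another model.

def nested_model(records):
    """A saved model: filter out empty records, then tag the rest."""
    kept = [r for r in records if r.get("qty", 0) > 0]
    return [{**r, "checked": True} for r in kept]

def parent_model(records):
    # The "Inception Node" step: the nested model receives the parent's
    # data, executes, and hands its output back to the parent model.
    return nested_model(records)

out = parent_model([{"qty": 2}, {"qty": 0}])
```

Just as a well-named function hides its internals, the parent model only sees the nested model's inputs and outputs.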

Other Stuff

  • More import options
  • Fixes to caching
  • Animations on nodes


Reinventing the Wheel
Posted by Andy Kerfonta in TRANSDATA Development

When we tell people that Transdata is not built on top of an existing database, the standard response is an immediate: “Why not?”. If you are going to store, retrieve, and transform data, why would you go to the trouble of reinventing the wheel? After all, these days there are plenty of choices. Surely one of them would work, right?

The answer is twofold:
1. It presented an interesting challenge.
2. By starting from scratch, we could guarantee that we wouldn’t be held back by any of the inherent limitations of existing paradigms and that future innovation and improvement would be possible.

The first reason is largely what brought me to the project from my graduate work in artificial intelligence – clearly no small subject change. Getting to create a totally new system in a field which is not known for change promised to be fascinating. On a side note, I worked with graph-like data structures, so this isn’t really such a big leap. The second reason is the one that really matters to users. One of the main goals for Transdata, from the very beginning, was to make sure that the tool was flexible enough that it didn’t hinder the sort of tasks that data workers find themselves doing day in and day out. Things like combining disparate data sources and cleaning up messy or inconsistent data. The sorts of things that are painful (or impossible) to do in spreadsheets. Ragged data was a requirement from the start.

It wasn’t immediately obvious that we would start from scratch, but we quickly discovered that existing databases don’t offer the flexibility that we needed, or, even worse, they attempt to do everything. They would require so much abstraction between what the user was doing and the actual data manipulation that any efficiency would be completely lost. Whenever possible, I like to write software so that what is happening under the hood is as close to what the user sees as possible. In Transdata, data is stored and moved around very much like in the flowchart you see in your model. Not only is that good from a UX standpoint, but it keeps me sane.

So, was it the right call? Well, we have yet to run into any unsolvable data issues, our codebase is comparatively small and manageable, we get great performance, and we’ve never had anything break due to a change in an external database. Works for me.


QuickBooks and Transdata
Posted by Andy Kerfonta in TRANSDATA Application


Twenty years ago, I left a big company to work for myself providing consulting services to small businesses. The consulting company is called EmPower Resources, Inc., which provides management, financial, and human resource services. The first thing I did when the company was organized was to purchase and set up QuickBooks (QB). It is a good product that millions of companies and individuals use. It becomes even better when coupled with knowledge of bookkeeping and accounting.

In 1992 I received my MBA from James Madison University after four years of part-time study. I have worked with numerous accounting folks who did great jobs doing all of the detailed work required for corporate accounting. In 1996, I was on my own. I needed to take my MBA education and apply it to my own company. For the accounting issues, QB met the need. EmPower survives today, and has helped other companies become successful with most of them using QB.

Beyond the checking account and tax preparation value QB provides, small company owners are challenged to use all of the data (project, cost, revenue, profit, etc.) that QB accumulates. QB has numerous pre-made reports that can be generated, and users can create and memorize their own reports. The information can then be exported to xlsx or csv format. I’ll discuss the challenges of the xlsx format in a future post. Expecting a small business owner to use all of that data across marketing, operations, and financial aspects is daunting.

Using a few of the reports, I helped a two-year-old company get back on track after I helped them start the company. Business seemed good, but they were short on cash. I found that they needed to raise prices, as their margin had eroded after a few years of inflation. It isn’t that easy to raise prices, but we found a way to do it. I also showed them that their Accounts Receivable had grown from 5 days to 45 days. One of the owners made collections a much higher priority so cash could be generated.

So how does an owner use the data for maximum benefit? One way is for the owners to do the analysis themselves, and the TRANSDATA™ Platform is just what one needs for a simpler way to do analytics. Another way is to get accountants and consultants to do the analysis for them. That is costly, and it would likely be a slower, more iterative process.

I use the TRANSDATA™ Platform, and am always looking for new ways to “eat our own dogfood”. Please contact me at msakurada@blueflashsolutions.com, for more information.

What is Ragged Data?
Posted by Andy Kerfonta in TRANSDATA Application

What is ragged data (and why do we care)?
By Andy Kerfonta

Ragged data (also known as a ragged matrix, irregular data, non-rectangular data, or an irregular matrix) is nothing more than a dataset in which rows can have different numbers of columns. Why is this useful? Let’s look at a quick scenario using some baseball data.

We’ll start with two files: Master.csv and Salaries.csv. Master.csv contains a list of all players with their personal information, one row per player. Salaries.csv contains a list of all recorded salaries, one row per salary, so each player may have many rows or none at all.

What happens if we want to merge (join) these two tables together? In traditional rectangular data, you either end up with duplicated players in a vertical table (one row per salary) or a lot of nulls in a more horizontal table (every player has the same number of salary fields – some unused). Here is what happens when Transdata is used to create ragged data:


What we end up with is something much more logical: each player gets a single record with all available salary data. We can then reference particular salary values or do operations over them all (like average or sum by record). It is even possible to do something simple yet useful such as finding the average of every player’s first salary.
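Here's a rough Python sketch of the same ragged merge (made-up numbers; playerID is the join key in the baseball files, while the salary_1, salary_2, … field names are just my illustration):

```python
# Made-up numbers; playerID is the join key in the baseball files, and the
# salary_1, salary_2, ... field names are just for illustration.
players = [{"playerID": "ruthba01", "name": "Babe Ruth"}]
salaries = [
    {"playerID": "ruthba01", "salary": 70000},
    {"playerID": "ruthba01", "salary": 80000},
]

# One record per player, with however many salary fields that player has:
# no duplicated player rows, and no padding with nulls.
merged = []
for p in players:
    rec = dict(p)
    matches = [s for s in salaries if s["playerID"] == p["playerID"]]
    for i, s in enumerate(matches, start=1):
        rec[f"salary_{i}"] = s["salary"]
    merged.append(rec)
```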

So why is ragged data important? The most obvious answer is that it makes merging data a lot simpler and more intuitive. SQL JOINs are easily derailed by messy or inconsistent input data. Since ragged data by definition cannot depend on data shape, field names must be used. This means input data can be ordered differently or even be missing fields altogether without breaking merges. Data integrity is maintained.

Ragged data can also reduce or eliminate data transpositions (swapping of columns and rows). In SAS, “…it is more efficient to store your data in a vertical format and processing the data is easier in a horizontal format.” The same limitation holds true for most rectangular data software. By being able to merge and operate on ragged data efficiently, the number of steps required for even simple transformations can be significantly reduced, and with it, the time spent.