15 May
Isn’t it time to KNIME again?
By Mark Cooke – Audit Report Issue 51.
As you have probably read in other articles, I love data. I’m not ashamed. It’s a passion and I’m proud. From social science behavioral data to pure quantitative performance metrics; from big data to little data; from messy, dirty data to well-formed, beautifully curated data repositories. Come one, come all. I won’t discriminate and I won’t judge.
When it comes to working with data, there is no tool I love more than KNIME (www.knime.org). It has a lot going for it, obviously. It’s open source (think “it’s free!”). The graphical user interface is easy to work with, and it makes it easy to show others exactly what you have done. All of the workflows are re-usable. There is also easy support for other technologies I like to use, such as the R language and multiple SQL and NoSQL databases.
To be honest, I am so spoiled now that the thought of ever doing data analysis in a spreadsheet gives me the heebie-jeebies. Literally, the hair on the back of my neck stands up, I break into a cold sweat, and I feel an instant need to run. Those days were awful. When I think of all the time I wasted re-doing what I had already done a thousand times before…well, you just can’t get that back. How things have changed!
KNIME 3.3.2 is the current release. It contains a bunch of goodies, and the look and feel has improved over the last year. The team has continued with the “read from anywhere, write to anywhere” mentality and expanded those capabilities. This can get pretty technical, but suffice it to say that KNIME handles small datasets and huge ones (think Hadoop clusters) beautifully. It also has nodes for RESTful data requests; if you don’t know what that means, it is simply a way to pull data down from remote servers, including those out on the internet.
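For the curious, here is a minimal sketch in Python of what a RESTful pull boils down to under the hood. The URL and the sample payload are hypothetical, made up purely for illustration; inside KNIME the equivalent request node does this for you and hands the result back as a table.

```python
# A rough sketch of a REST "GET" request: ask a remote server for data,
# then parse the JSON it sends back. Endpoint and payload are made up.
import json
from urllib.request import urlopen

def fetch_json(url):
    """Pull a JSON payload down from a remote server (not called here)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))

# Parsing works the same whether or not the payload came over the wire;
# here is a hypothetical response body for illustration:
sample_payload = '{"rows": [{"id": 1, "amount": 250.0}, {"id": 2, "amount": 75.5}]}'
data = json.loads(sample_payload)
total = sum(row["amount"] for row in data["rows"])
```

Once the response is parsed, the rows behave like any other tabular data you would feed into the rest of a workflow.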
Recent additions to KNIME around so-called “big data” mean small machines can do big things. KNIME was always built with performance in mind, and the addition of Spark nodes and parallel-processing enhancements means even very large datasets are within reach. I have personally chewed through hundreds of millions (100,000,000+) of rows of data on a laptop without much trouble. This can be really helpful if you are doing complex analysis and want to compare many features inside your data.
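The trick that makes huge row counts laptop-friendly is streaming: process the data a chunk at a time instead of loading it all into memory. A small Python sketch of that idea (the field name and chunk size are my own illustrative choices, not anything KNIME-specific):

```python
# Streaming aggregation sketch: total a huge source one chunk at a time,
# so memory use stays flat no matter how many rows flow through.
from itertools import islice

def stream_sum(rows, chunk_size=100_000):
    """Sum a 'value' field over an arbitrarily long iterator of rows."""
    rows = iter(rows)
    total = 0.0
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            break
        total += sum(r["value"] for r in chunk)
    return total

# A generator stands in for an enormous file; no more than one chunk
# of rows is ever materialized in memory at once.
simulated = ({"value": 1.0} for _ in range(1_000_000))
grand_total = stream_sum(simulated)
```

Scale the generator up or down and the memory footprint stays the same; only the running time changes.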
Mapping has always been a part of KNIME, too, and these nodes have also been expanded and improved. There are nodes to read ESRI data files, and nodes to draw beautiful maps with custom colors and features that represent qualities of the data. Simply put, any data that contain latitude and longitude coordinates can be turned into rich visuals. If you really want to geek out (oh, me too!), you can pull in popular mapping libraries from tools like R to create even richer geospatial analyses.
The data transformation side of KNIME (cleaning up all that legacy data, or splitting names or addresses, for example) is robust; I haven’t run into anything yet that I couldn’t do. The analytics side, though, is where the fun starts. There is a full suite of statistics tools, and I mean full: regression, correlation, ANOVA…Cronbach’s alpha, anyone? The real treats are inside the machine learning suite. This is where you can start to build models around similarity (is this property really like the others?), distance (which entities cluster together and which stand apart?), or trends. There are nodes to guide you through time series analysis (looking at how patterns change over time). You can even finish off by building predictive models: analyses that help you classify unknown entities into groups.
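Since I name-dropped Cronbach’s alpha: KNIME gives you it as a node, but the statistic itself is just a short formula measuring how consistently a set of survey items hang together. A Python sketch, with a tiny made-up response matrix purely for illustration:

```python
# Cronbach's alpha for k survey items:
#   alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
import statistics

def cronbach_alpha(items):
    """items: one list of scores per survey item (columns of the data)."""
    k = len(items)
    item_vars = sum(statistics.pvariance(col) for col in items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent totals
    return k / (k - 1) * (1 - item_vars / statistics.pvariance(totals))

# Three items answered by four respondents (made-up data).
responses = [
    [4, 5, 3, 4],  # item 1
    [4, 4, 3, 5],  # item 2
    [5, 5, 2, 4],  # item 3
]
alpha = cronbach_alpha(responses)  # works out to about 0.82 here
```

Values near 1 suggest the items are measuring the same underlying construct; the point of the KNIME node is that you get this without writing the formula yourself.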
Since I have your attention, because only my like-minded brethren would have made it this far into the article, did you know I teach KNIME courses? Yes! You don’t have to go it alone. I typically teach classes of anywhere from five to twenty participants. A good introduction to the breadth and depth of KNIME, along with some guided “what’s possible” scenarios, takes about four days. We start with an introduction to the platform and end with a BYOD (“bring your own data”) section. By that point you will have a solid grounding and should be able to accomplish many tasks independently.
After that, your community can take up the mantle and continue to grow skills and share tips, tricks, and even whole workflows. Workflows can be exported and shipped to other interested users, which means that once one person has built the wheel, no one else has to reinvent it. The broader KNIME community is a brilliant group of international users who contribute frequently to the KNIME community website, answering questions and offering advice. It’s always a good idea to check in and see what they are up to from time to time.
So, isn’t it time to KNIME? Or should I say, can’t it be time to KNIME? I would love to hear from you if you want some advice on your data, or just want to chat about all things analytical. Give me a call at (704) 847-1234.