Pentaho Data Mining (Weka)

Welcome to the community home for Pentaho Data Mining Community Edition (CE), also known as Weka. Pentaho Data Mining is a comprehensive set of tools for machine learning and data mining. Its broad suite of classification, regression, association rule and clustering algorithms can help you understand your business better and improve future performance through predictive analytics.

Community Edition is self-supported open source software. An Enterprise Edition (EE) of Pentaho Data Mining, which includes technical support and managed upgrades, is also available. For more information about EE, or for screenshots and datasheets, visit Pentaho Data Mining EE on Pentaho's corporate site.

Recent News and Releases
- 07/30/10 New Weka 3.4.17, 3.6.3 and 3.7.2 releases available.
- 07/30/10 Weka 3.6.3 - stable GUI version now available.
- 07/30/10 Weka 3.4.17 - stable book version now available.
- 07/30/10 Weka 3.7.2 - development version now available.
Stable
Weka 3.6.3 GA (Release Notes)
This is a stable version created from the head of the development code line. The 3.6 code line will receive bug-fixes only (development of new features continues in 3.7). For a detailed list of improvements, please refer to the release notes.
New Features since 3.4
- 35 new learning schemes
- 17 new filters
- Grouping of steps (MetaBean) in Knowledge Flow
- New SQL viewer and visualization plugin support in Explorer
- Area under ROC (AUC) evaluation type
- Relation-valued attributes (supports multi-instance learning)
- Support for incremental clusterers
- XML format for instances
- Text directory to ARFF tool
- Several new data generators
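
As an illustration of the "Area under ROC (AUC)" evaluation type listed above, here is a minimal stand-alone Python sketch of how AUC can be computed from predicted scores. It uses the pairwise (Mann-Whitney) formulation and is not Weka's implementation; the function name `auc` and its arguments are assumptions made for this example.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: sequence of 0/1 values (1 = positive class)
    scores: predicted scores, where higher means "more likely positive"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    # Count positive/negative pairs that are ranked correctly; ties count half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 1.0 means every positive example is scored above every negative one, while 0.5 corresponds to random ranking. The pairwise loop is quadratic in the number of examples, so it is meant for explanation rather than for large datasets.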

Weka 3.4.17 GA (Book version) (Release Notes)
This is a patch release to Weka 3.4 containing a number of bug fixes. For a detailed list of improvements, please refer to the release notes.
In Development
Weka 3.7.2 Development
This is the new development branch of Weka. It continues from 3.5.8 and will include new features as well as bug fixes. Weka 3.7.2 moves many algorithms and tools out of the main Weka distribution and into "packages", which are managed by a new package management system. Information on the package management system can be found in the WekaManual.pdf included in the 3.7.2 distribution and on the Weka Wiki:

- How to use the package manager wiki article.
- Package management system package structure wiki article.
- Packages can be browsed online here.

New Features in 3.7.2
- New package management system with both command-line and GUI interfaces
- Denormalize filter for flattening transactional data
- Import of PMML support vector machine models
- SGD (stochastic gradient descent) for learning binary SVMs, logistic regression and linear regression (can be trained incrementally)
- scatterPlot3D - a new package that adds a 3D scatter plot visualization to the Explorer
- associationRulesVisualizer - a new package that adds a 3D visualization of association rules to the Associations panel of the Explorer
- prefuseTree - a new package that encapsulates the visualization of trees using the Prefuse visualization library (originally available as a source code download from the Weka wiki)
- prefuseGraph - a new package that encapsulates the visualization of graphs using the Prefuse visualization library (originally available as a source code download from the Weka wiki)
- massiveOnlineAnalysis - a new connector package for the MOA data stream learning tool
- kfGroovy - a new package that adds a component to the Knowledge Flow that allows a Knowledge Flow step to be coded and compiled dynamically at runtime from a Groovy script
- GridSearch (now in a package called gridSearch) is now multi-threaded to take advantage of multi-core machines
- MathExpression filter can now reference values other than that of the attribute being processed
- FPGrowth now has a special command-line-only option that lets the two passes over the data required by the algorithm be done incrementally, reading one instance at a time from disk
- Scatter plot matrix visualization now has a "fast scroll" feature (consumes more memory than regular scrolling)
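
To make the "can be trained incrementally" note on the SGD item above more concrete, here is a minimal stand-alone Python sketch of stochastic gradient descent for logistic regression, updated one instance at a time. It illustrates the general technique only and is not Weka's SGD class; the class name `OnlineLogisticRegression`, its parameters and the toy data are assumptions made for this example.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression trained by SGD, one instance per update (illustrative)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        # Probability of the positive class for feature vector x.
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        # One SGD step on the log-loss for a single instance (y is 0 or 1).
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


# Toy stream: the single feature is 1.0 for positives and 0.0 for negatives.
stream = [([0.0], 0), ([1.0], 1)]
model = OnlineLogisticRegression(n_features=1)
for _ in range(1000):
    for x, y in stream:
        model.update(x, y)  # the model never needs the whole dataset at once

p_positive = model.predict_proba([1.0])
p_negative = model.predict_proba([0.0])
print(p_positive > 0.5, p_negative < 0.5)
```

Because each update touches only the current instance, the model can keep learning as new data arrives, which is the property that incremental training relies on.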
Contribute to the Project
You can participate by contributing new code, reporting bugs, testing new releases, answering questions and more. Email us the proposed contribution and any other relevant details. Welcome to the team.
- Write a tech tip
- Report a bug in JIRA
- Answer posts on the forums
- Write some code