Pentaho Data Mining (Weka)

Welcome to the community home for Pentaho Data Mining Community Edition (CE), also known as Weka. Pentaho Data Mining is a comprehensive set of tools for machine learning and data mining. Its broad suite of classification, regression, association rule and clustering algorithms can help you understand your business better and improve future performance through predictive analytics.

Community Edition is self-supported open source software. An Enterprise Edition (EE) of Pentaho Data Mining, which includes technical support and managed upgrades, is also available. For more information about EE, or for screenshots and datasheets, visit Pentaho Data Mining EE on Pentaho's corporate site.

Recent News and Releases

- 10/31/11 New Weka 3.6.6 and 3.7.5 releases available, more info.
- 06/30/11 New Weka 3.4.19, 3.6.5 and 3.7.4 releases available, more info.
- 06/30/11 Weka 3.6.5 - stable book 3rd ed. version now available.
- 06/30/11 Weka 3.4.19 - stable book 2nd ed. version is now available.
- 06/30/11 Weka 3.7.4 - development version now available.
 

Stable
Weka 3.6.6 GA (Book 3rd ed. version) (Release Notes)
This is a stable version, created from the head of the 3.5 development code line, and corresponds to what is described in the Witten, Frank and Hall data mining book. The 3.6 code line will receive bug-fixes only (development of new features continues in 3.7). For a detailed list of improvements, please refer to the release notes.

New Features since 3.4
- 35 new learning schemes
- 17 new filters
- Grouping of steps (MetaBean) in Knowledge Flow
- New SQL viewer and visualization plugin support in Explorer
- Area under ROC (AUC) evaluation type
- Relation-valued attributes (supports multi-instance learning)
- Support for incremental clusterers
- XML format for instances
- Text directory to ARFF tool
- Several new data generators
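
For readers unfamiliar with the area under ROC (AUC) measure listed above, the following is a minimal sketch of what the statistic computes. It is plain Java and does not use or reproduce Weka's own implementation. The class name AucSketch and the sample scores are hypothetical, chosen only for illustration. The sketch treats AUC as the fraction of positive/negative pairs in which the positive instance receives the higher score, counting ties as half.

```java
public class AucSketch {

    /**
     * Computes AUC as the fraction of (positive, negative) pairs in which
     * the positive instance has the higher score. Ties count as 0.5.
     * Returns NaN if either class is absent.
     */
    public static double auc(double[] scores, boolean[] isPositive) {
        if (scores.length != isPositive.length) {
            throw new IllegalArgumentException("scores and labels differ in length");
        }
        long positives = 0;
        long negatives = 0;
        for (boolean label : isPositive) {
            if (label) {
                positives++;
            } else {
                negatives++;
            }
        }
        if (positives == 0 || negatives == 0) {
            return Double.NaN;
        }
        double wins = 0.0;
        // Simple quadratic pairwise comparison; fine for small examples.
        for (int i = 0; i < scores.length; i++) {
            if (!isPositive[i]) {
                continue;
            }
            for (int j = 0; j < scores.length; j++) {
                if (isPositive[j]) {
                    continue;
                }
                if (scores[i] > scores[j]) {
                    wins += 1.0;
                } else if (scores[i] == scores[j]) {
                    wins += 0.5;
                }
            }
        }
        return wins / (positives * negatives);
    }

    public static void main(String[] args) {
        // Hypothetical predicted probabilities for the positive class.
        double[] scores = {0.9, 0.8, 0.4, 0.3};
        boolean[] labels = {true, false, true, false};
        System.out.println(auc(scores, labels)); // prints 0.75
    }
}
```

In the sample data, three of the four positive/negative pairs are ordered correctly, giving 0.75. A value of 1.0 means every positive is ranked above every negative, and 0.5 corresponds to random ranking.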

Weka 3.4.19 GA (Book 2nd ed. version) (Release Notes)
This is a patch release to Weka 3.4 containing a number of bug fixes. For a detailed list of improvements, please refer to the release notes. Because the 3rd ed. of the data mining book was published in January 2011, this is the last GA release from the 3.4 code line.


 

In Development
Weka 3.7.x Development
This is the new development branch of Weka. It continues from 3.5.8 and will include new features as well as bug fixes. Weka 3.7.2 moved many algorithms and tools out of the main Weka distribution and into "packages", managed by a new package management system. Information on the package management system can be found in the WekaManual.pdf included in distributions from 3.7.2 onward and on the Weka Wiki:

- How to use the package manager wiki article.
- Package management system package structure wiki article.
- Packages can be browsed online here.

New Features in 3.7.5

In core Weka:
 
- weka.classifiers.functions.SGDText - stochastic gradient descent for learning linear SVMs and logistic regression for text problems. Operates incrementally and directly on string attributes.
- New incremental version of the multi-class meta classifier (weka.classifiers.meta.MultiClassClassifierUpdateable).
- RandomForest now supports building trees in parallel.
- DatabaseLoader is now much faster when loading data sets with many nominal attributes.
- Database access now allows custom property files to be set at runtime, allowing access to databases different from the default one without having to restart Weka.
- TextDirectoryLoader can now operate incrementally.
- CSVLoader now supports files without a header row.
- Charts can now be exported to files from running Knowledge Flow processes via an offscreen rendering process.
- RemoveUseless filter now removes attributes with all missing values.
- Histogram visualization in the Explorer and Knowledge Flow is now faster.
- ClassifierPerformanceEvaluator in the Knowledge Flow is now multi-threaded to allow folds to be evaluated in parallel.
- File-based savers now support gzip compression.
- File-based loaders now support loading files as a resource from the classpath (including jars).
 
 
 
In packages:
- multiInstanceLearning - added MITI multi-instance tree learner and MIRI rule learner variant.
- RerankingSearch - a feature selection meta-search algorithm that speeds up the base search algorithm, contributed by Pablo Bermejo.
- timeseriesForecasting - now supports timestamp-based data that contains gaps in the regular time period.
- sasLoader - SAS sas7bdat file reader.
- CHIRP - A new classifier based on Composite Hypercubes on Iterated Random Projections, contributed by Leland Wilkinson.
- PSOSearch - An implementation of the Particle Swarm Optimization (PSO) algorithm to explore the space of attributes, contributed by Sebastian Luna Valero.
- wekaServer - A simple servlet-based server for executing data mining tasks (Explorer and Knowledge Flow so far). Documentation: http://wiki.pentaho.com/display/DATAMINING/Weka+Server
- jfreechartOffscreenRenderer - Offscreen (headless) chart rendering in Knowledge Flow processes using the JFreeChart library.

Contribute to the Project

You can participate by contributing new code, reporting bugs, testing new releases, answering questions and more. Email us your proposed contribution and any other relevant details. Welcome to the team.
- Write a tech tip
- Report a bug in JIRA
- Answer posts on the forums
- Write some code