Pentaho Data Integration (PDI) is an extract, transform, and load (ETL)
solution that uses an innovative metadata-driven approach. It includes an
easy-to-use graphical design environment for building ETL jobs and transformations,
resulting in faster development, lower maintenance costs, interactive debugging, and
simplified deployment.

Common Uses
Pentaho Data Integration is an extremely flexible tool that addresses a broad range
of use cases, including:
- Data warehouse population with built-in support for slowly changing dimensions
and surrogate key creation
- Data migration between different databases and applications
- Loading huge data sets into databases, taking full advantage of cloud, clustered,
and massively parallel processing environments
- Data cleansing with steps ranging from very simple to very complex
transformations
- Data integration, including the ability to use real-time ETL as a data
source for Pentaho Reporting
- Rapid prototyping of ROLAP schemas
- Hadoop functions: Hadoop job execution and scheduling, simple Hadoop MapReduce
design, Amazon EMR integration
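
The first use case above mentions slowly changing dimensions and surrogate keys. The sketch below illustrates the general idea (a "Type 2" dimension, where each change creates a new row version with its own warehouse-generated key). It is a minimal in-memory illustration, not PDI's implementation or API; the CustomerDimension class, its upsert method, and the sample customer data are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import date
from itertools import count
from typing import Dict, List, Optional


@dataclass
class DimensionRow:
    surrogate_key: int                 # warehouse-generated key, independent of the source
    natural_key: str                   # business key from the source system
    attributes: Dict[str, str]         # tracked attributes, e.g. {"city": "Lyon"}
    valid_from: date
    valid_to: Optional[date] = None    # None marks the current version


class CustomerDimension:
    """Hypothetical in-memory dimension table that keeps Type 2 history."""

    def __init__(self) -> None:
        self.rows: List[DimensionRow] = []
        self._next_key = count(1)      # simple surrogate key sequence

    def current(self, natural_key: str) -> Optional[DimensionRow]:
        """Return the open (current) version for a business key, if any."""
        for row in self.rows:
            if row.natural_key == natural_key and row.valid_to is None:
                return row
        return None

    def upsert(self, natural_key: str, attributes: Dict[str, str], as_of: date) -> int:
        """Return the surrogate key that facts loaded as of `as_of` should reference."""
        existing = self.current(natural_key)
        if existing is not None:
            if existing.attributes == attributes:
                return existing.surrogate_key   # nothing changed, reuse the key
            existing.valid_to = as_of           # close the old version
        new_row = DimensionRow(next(self._next_key), natural_key, dict(attributes), as_of)
        self.rows.append(new_row)
        return new_row.surrogate_key


dim = CustomerDimension()
k1 = dim.upsert("C-100", {"city": "Lyon"}, date(2023, 1, 1))
k2 = dim.upsert("C-100", {"city": "Lyon"}, date(2023, 6, 1))    # unchanged: same key
k3 = dim.upsert("C-100", {"city": "Paris"}, date(2024, 1, 1))   # changed: new version
```

After these three calls the dimension holds two rows for customer C-100: the closed Lyon version and the current Paris version, each with its own surrogate key, so older facts can still point at the attributes that were valid when they were loaded.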

Key Benefits
Pentaho Data Integration features and benefits include:
- Installs in minutes; you can be productive in one afternoon
- 100% Java with cross-platform support for Windows, Linux, and Macintosh
- Easy-to-use graphical designer with over 100 out-of-the-box mapping objects
including inputs, transforms, and outputs
- Simple plug-in architecture for adding your own custom extensions
- Enterprise Data Integration server providing security integration, scheduling,
and robust content management including full revision history for jobs and
transformations
- Integrated designer (Spoon) combining ETL with metadata modeling and data
visualization, providing the perfect environment for rapidly developing new
Business Intelligence solutions
- Streaming engine architecture that can work with extremely large
data volumes
- Enterprise-class performance and scalability with a broad range of deployment
options including dedicated, clustered, and/or cloud-based ETL servers
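
The streaming point in the list above can be illustrated with a small row-at-a-time pipeline. This is only a conceptual sketch using Python generators, not a description of how the PDI engine is built; the step functions (read_rows, clean, keep_positive) and the sample data are hypothetical. The idea it shows is that each row flows through input, transform, and filter steps one at a time, so the full data set never has to sit in memory at once.

```python
from typing import Dict, Iterable, Iterator

Row = Dict[str, str]


def read_rows(lines: Iterable[str]) -> Iterator[Row]:
    """Input step: turn each 'name,amount' line into a row."""
    for line in lines:
        name, amount = line.split(",")
        yield {"name": name, "amount": amount}


def clean(rows: Iterable[Row]) -> Iterator[Row]:
    """Transform step: trim whitespace and normalize name casing."""
    for row in rows:
        yield {"name": row["name"].strip().title(), "amount": row["amount"].strip()}


def keep_positive(rows: Iterable[Row]) -> Iterator[Row]:
    """Filter step: drop rows whose amount is not positive."""
    for row in rows:
        if float(row["amount"]) > 0:
            yield row


source = ["  alice , 10.5", "bob, -3", "carol,7"]

# Steps are chained lazily; nothing is read until the output is consumed.
pipeline = keep_positive(clean(read_rows(source)))
result = list(pipeline)
```

Here `source` is a short list for readability, but it could equally be a file object or database cursor, since each step only ever holds the row it is currently processing.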