Artifacts
Most machine learning pipelines aim to create one or more machine learning artifacts, such as a model, dataset, evaluation metrics, etc.
KFP provides first-class support for creating machine learning artifacts via the dsl.Artifact class and other artifact subclasses. KFP maps these artifacts to their underlying ML Metadata schema title, the canonical name for the artifact type.
In general, artifacts and their associated annotations serve several purposes:
- To provide logical groupings of component/pipeline input/output types
- To provide a convenient mechanism for writing to object storage via the task’s local filesystem
- To enable type checking of pipelines that create ML artifacts
- To make the contents of some artifact types easily observable via special UI rendering
The following training_component demonstrates standard usage of both input and output artifacts:
```python
from kfp import dsl
from kfp.dsl import Input, Output, Dataset, Model

@dsl.component
def training_component(dataset: Input[Dataset], model: Output[Model]):
    """Trains an output Model on an input Dataset."""
    with open(dataset.path) as f:
        contents = f.read()

    # ... train a tf_model on the contents of dataset ...

    tf_model.save(model.path)
    model.metadata['framework'] = 'tensorflow'
```
This training_component does the following:
- Accepts an input dataset
- Reads the input dataset’s content from the local filesystem
- Trains a model (omitted)
- Saves the model as a component output
- Sets some metadata about the saved model
As illustrated by training_component, artifacts are simply a thin wrapper around some artifact properties, including the .path from which the artifact can be read/written and the artifact's .metadata. The following sections describe these properties and other aspects of artifacts in detail.
Artifact types
The artifact annotation indicates the type of the artifact. KFP provides several artifact types within the DSL:
| DSL object | Artifact schema title |
| --- | --- |
| Artifact | system.Artifact |
| Dataset | system.Dataset |
| Model | system.Model |
| Metrics | system.Metrics |
| ClassificationMetrics | system.ClassificationMetrics |
| SlicedClassificationMetrics | system.SlicedClassificationMetrics |
| HTML | system.HTML |
| Markdown | system.Markdown |
Artifact, Dataset, Model, and Metrics are the most generic and commonly used artifact types. Artifact is the default artifact base type and should be used in cases where the artifact type does not fit neatly into another artifact category. Artifact is also compatible with all other artifact types. In this sense, the Artifact type is also an artifact "any" type.
On the KFP open source UI, ClassificationMetrics, SlicedClassificationMetrics, HTML, and Markdown provide special UI rendering to make the contents of the artifact easily observable.
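For example, a component can log scalar values to a Metrics output and write a rendered report to an HTML output's .path; both then surface in the UI. The following is a minimal sketch, assuming the accuracy value has already been computed (the component and output names are illustrative, not part of KFP):

```python
from kfp import dsl
from kfp.dsl import HTML, Metrics, Output

@dsl.component
def evaluation_component(metrics: Output[Metrics], report: Output[HTML]):
    # Hypothetical evaluation result; a real component would compute this.
    accuracy = 0.95

    # Scalar metrics logged via log_metric are rendered by the KFP UI.
    metrics.log_metric('accuracy', accuracy)

    # HTML artifacts are rendered from the file written to .path.
    with open(report.path, 'w') as f:
        f.write(f'<h2>Model accuracy: {accuracy}</h2>')
```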
Declaring Input/Output artifacts
In components, an artifact annotation must always be wrapped in an Input or Output type marker to indicate the artifact's I/O type. This is required, as it would otherwise be ambiguous whether an artifact is an input or an output, since input and output artifacts are both declared via Python function parameters.
In pipelines, input artifact annotations should be wrapped in an Input type marker and, unlike in components, output artifacts should be provided as a return annotation, as shown in concat_pipeline's Dataset output:
```python
from kfp import dsl
from kfp.dsl import Dataset, Input, Output

@dsl.component
def concat_component(
    dataset1: Input[Dataset],
    dataset2: Input[Dataset],
    out_dataset: Output[Dataset],
):
    with open(dataset1.path) as f:
        contents1 = f.read()
    with open(dataset2.path) as f:
        contents2 = f.read()
    with open(out_dataset.path, 'w') as f:
        f.write(contents1 + contents2)

@dsl.pipeline
def concat_pipeline(
    d1: Input[Dataset],
    d2: Input[Dataset],
) -> Dataset:
    return concat_component(
        dataset1=d1,
        dataset2=d2
    ).outputs['out_dataset']
```
You can specify multiple pipeline artifact outputs, just as you would for parameters. This is shown by concat_pipeline2's outputs intermediate_dataset and final_dataset:
```python
from typing import NamedTuple

from kfp import dsl
from kfp.dsl import Dataset, Input

@dsl.pipeline
def concat_pipeline2(
    d1: Input[Dataset],
    d2: Input[Dataset],
    d3: Input[Dataset],
) -> NamedTuple('Outputs',
                intermediate_dataset=Dataset,
                final_dataset=Dataset):
    Outputs = NamedTuple('Outputs',
                         intermediate_dataset=Dataset,
                         final_dataset=Dataset)
    concat1 = concat_component(
        dataset1=d1,
        dataset2=d2
    )
    concat2 = concat_component(
        dataset1=concat1.outputs['out_dataset'],
        dataset2=d3
    )
    return Outputs(intermediate_dataset=concat1.outputs['out_dataset'],
                   final_dataset=concat2.outputs['out_dataset'])
```
The KFP SDK compiler will type check artifact usage according to the rules described in Type Checking.
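For example, wiring a Model output into a parameter annotated Input[Dataset] is a type mismatch that the compiler is expected to reject, whereas an Input[Artifact] parameter would accept any artifact type. A minimal sketch, reusing training_component from above (analyze_dataset is a hypothetical component):

```python
from kfp import dsl
from kfp.dsl import Dataset, Input

@dsl.component
def analyze_dataset(dataset: Input[Dataset]):
    ...

@dsl.pipeline
def bad_pipeline(raw_data: Input[Dataset]):
    train_task = training_component(dataset=raw_data)
    # training_component's 'model' output is a Model, not a Dataset,
    # so compiling this pipeline should fail type checking.
    analyze_dataset(dataset=train_task.outputs['model'])
```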
Using output artifacts
When you use an input or output annotation in a component, your component effectively makes a request at runtime for a URI path to the artifact.
For output artifacts, the artifact being created does not yet exist (your component is going to create it!). To make it easy for components to create artifacts, the KFP backend provides a unique system-generated URI where the component should write the output artifact. For both input and output artifacts, the URI is a path within the cloud object storage bucket specified as the pipeline root. The URI uniquely identifies the output by its name, producer task, and pipeline. The system-generated URI is accessible via the .uri attribute of the artifact instance automatically passed to the component at runtime:
```python
from kfp import dsl
from kfp.dsl import Model
from kfp.dsl import Output

@dsl.component
def print_artifact(model: Output[Model]):
    print('URI:', model.uri)
```
Note that you will never pass an output artifact to a component directly when composing your pipeline. For example, in concat_pipeline2 above, we do not pass out_dataset to concat_component. The output artifact will be passed to the component automatically with the correct system-generated URI at runtime.
While you can write output artifacts directly to the URI, KFP provides an even easier mechanism via the artifact's .path attribute:
```python
from kfp import dsl
from kfp.dsl import Dataset
from kfp.dsl import Output

@dsl.component
def print_and_create_artifact(dataset: Output[Dataset]):
    print('path:', dataset.path)
    with open(dataset.path, 'w') as f:
        f.write('my dataset!')
```
After the task executes, KFP handles copying the file at .path to the URI at .uri automatically, allowing you to create artifact files by only interacting with the local filesystem. This approach works when the output artifact is stored as a file or directory.
For cases where the output artifact is not easily represented by a file (for example, the output is a container image containing a model), you should override the system-generated .uri by setting it on the artifact directly, then write the output to that location. KFP will store the updated URI in ML Metadata. The artifact's .path attribute will not be useful.
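A minimal sketch of this pattern, assuming the model is packaged and pushed as a container image; the registry path and the build/push step are illustrative and not provided by KFP:

```python
from kfp import dsl
from kfp.dsl import Model
from kfp.dsl import Output

@dsl.component
def build_model_image(model: Output[Model]):
    # Hypothetical destination for the image; not a KFP-generated value.
    image_uri = 'us-docker.pkg.dev/my-project/my-repo/my-model:latest'

    # ... build the image and push it to image_uri (omitted) ...

    # Override the system-generated URI so ML Metadata records where the
    # artifact actually lives; .path is not used for this artifact.
    model.uri = image_uri
```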
Using input artifacts
For input artifacts, the artifact URI already exists, since the artifact has already been created. KFP handles passing the correct URI to your component based on the data exchange established in your pipeline. As with output artifacts, KFP handles copying the existing file at .uri to the path at .path so that your component can read it from the local filesystem.
Input artifacts should be treated as immutable. You should not try to modify the contents of the file at .path, and any changes to .metadata will not affect the artifact's metadata in ML Metadata.
Artifact name and metadata
In addition to .uri and .path, artifacts also have a .name and .metadata.
```python
from kfp import dsl
from kfp.dsl import Dataset
from kfp.dsl import Input

@dsl.component
def count_rows(dataset: Input[Dataset]) -> int:
    with open(dataset.path) as f:
        lines = f.readlines()

    print('Information about the artifact:')
    print('Name:', dataset.name)
    print('URI:', dataset.uri)
    print('Path:', dataset.path)
    print('Metadata:', dataset.metadata)

    return len(lines)
```
In KFP, artifacts can have metadata, which can be accessed in a component via the artifact's .metadata attribute. Metadata is useful for recording information about the artifact, such as which ML framework generated the artifact, what its downstream uses are, etc. For output artifacts, metadata can be set directly on the .metadata dictionary, as shown for model in the preceding training_component.
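For example, a downstream component that takes the model as an input can read that metadata back. A minimal sketch, assuming the 'framework' key was set by training_component above (check_framework is a hypothetical component):

```python
from kfp import dsl
from kfp.dsl import Input, Model

@dsl.component
def check_framework(model: Input[Model]) -> str:
    # Reads metadata recorded by the producing task. Input artifacts are
    # read-only, so modifying model.metadata here would not be persisted.
    return model.metadata.get('framework', 'unknown')
```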