:original_name: modelarts_23_0004.html

.. _modelarts_23_0004:

Creating a Dataset
==================

To manage data using ModelArts, create a dataset. You can then perform operations on the dataset, such as labeling data, importing data, and publishing the dataset.

Prerequisites
-------------

-  The data management function requires permission to access OBS and cannot be used without it. Before using the function, go to the **Settings** page and complete access authorization using an agency.
-  You have created OBS buckets and folders for storing data, and the OBS buckets are in the same region as ModelArts.
-  You have uploaded the data to be used to OBS.

Procedure
---------

#. Log in to the ModelArts management console. In the left navigation pane, choose **Data Management** > **Datasets**. The **Datasets** page is displayed.

#. Click **Create Dataset**. On the **Create Dataset** page, create datasets of different types based on the data type and data labeling requirements.

   a. Set the basic information: the name and description of the dataset.

      .. _modelarts_23_0004__en-us_topic_0170886809_fig17294143617510:

      .. figure:: /_static/images/en-us_image_0000001157080905.png
         :alt: **Figure 1** Basic information about a dataset

         **Figure 1** Basic information about a dataset

   b. Select a labeling scene and type as required. For details about the types supported by ModelArts, see :ref:`Dataset Types `.

      .. _modelarts_23_0004__en-us_topic_0170886809_fig3599174864:

      .. figure:: /_static/images/en-us_image_0000001110761058.png
         :alt: **Figure 2** Selecting a labeling scene and type

         **Figure 2** Selecting a labeling scene and type

   c. Set the parameters based on the dataset type. For details, see the parameters of the following dataset types:

      -  :ref:`Images (Image Classification, Object Detection, and Image Segmentation) `
      -  :ref:`Audio (Sound Classification, Speech Labeling, and Speech Paragraph Labeling) `
      -  :ref:`Text (Text Classification, Named Entity Recognition, and Text Triplet) `
      -  :ref:`Table `
      -  :ref:`Video `
      -  :ref:`Other (Free Format) `

   d. Click **Create** in the lower right corner of the page.

      After the dataset is created, the dataset management page is displayed. You can perform the following operations on the dataset: label data, publish dataset versions, manage dataset versions, modify the dataset, import data, and delete the dataset. For details about the operations supported by different types of datasets, see .

.. _modelarts_23_0004__en-us_topic_0170886809_section8625131415541:

Images (Image Classification, Object Detection, and Image Segmentation)
-----------------------------------------------------------------------

.. _modelarts_23_0004__en-us_topic_0170886809_fig773235071210:

.. figure:: /_static/images/en-us_image_0000001157080911.png
   :alt: **Figure 3** Parameters of datasets for image classification and object detection

   **Figure 3** Parameters of datasets for image classification and object detection

.. 
table:: **Table 1** Dataset parameters +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Parameter | Description | +===================================+=====================================================================================================================================================================================================================================================================================================================================================================================+ | Input Dataset Path | Select the OBS path to the input dataset. | +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Output Dataset Path | Select the OBS path to the output dataset. | | | | | | .. note:: | | | | | | The output dataset path cannot be the same as the input dataset path or cannot be the subdirectory of the input dataset path. Select an empty directory as the **Output Dataset Path**. 
| +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Label Set | - **Label Name**: Enter a label name. The label name can contain only letters, digits, underscores (_), and hyphens (-). The name contains 1 to 32 characters. | | | | | | - **Add Label**: Click **Add Label** to add more labels. | | | | | | - Setting a label color: This function is available only for datasets of the object detection type. Select a color from the color palette on the right of a label, or enter the hexadecimal color code to set the color. | | | | | | - Setting label attributes: For an object detection dataset, you can click the plus sign (+) on the right to add label attributes after setting a label color. Label attributes are used to distinguish different attributes of the objects with the same label. For example, yellow kittens and black kittens have the same label **cat** and their label attribute is **color**. | +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Team Labeling | Enable or disable team labeling. Image segmentation does not support team labeling. Therefore, this parameter is unavailable when you use image segmentation. | | | | | | After enabling team labeling, enter the name and type of the team labeling task, and select the labeling team and team members. 
For details about the parameter settings, see :ref:`Creating Team Labeling Tasks `. | | | | | | Before enabling team labeling, ensure that you have added a team and members on the **Labeling Teams** page. If no labeling team is available, click the link on the page to go to the **Labeling Teams** page, and add your team and members. For details, see :ref:`Introduction to Team Labeling `. | | | | | | After a dataset is created with team labeling enabled, you can view the **Team Labeling** mark in **Labeling Type**. | +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ .. _modelarts_23_0004__en-us_topic_0170886809_section17893314546: Audio (Sound Classification, Speech Labeling, and Speech Paragraph Labeling) ---------------------------------------------------------------------------- .. _modelarts_23_0004__en-us_topic_0170886809_fig107351821153417: .. 
figure:: /_static/images/en-us_image_0000001157080903.png :alt: **Figure 4** Parameters of datasets for sound classification, speech labeling, and speech paragraph labeling **Figure 4** Parameters of datasets for sound classification, speech labeling, and speech paragraph labeling +----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Parameter | Description | +==============================================+======================================================================================================================================================================================================================================================================================================================================================================================================================================================+ | Input Dataset Path | Select the OBS path to the input dataset. | +----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Output Dataset Path | Select the OBS path to the output dataset. | | | | | | .. 
note:: | | | | | | The output dataset path cannot be the same as the input dataset path or cannot be the subdirectory of the input dataset path. Select an empty directory as the **Output Dataset Path**. | +----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Label Set (Sound Classification) | Set labels only for datasets of the sound classification type. | | | | | | - **Label Name**: Enter a label name. The label name can contain only letters, digits, underscores (_), and hyphens (-). The name contains 1 to 32 characters. | | | - **Add Label**: Click **Add Label** to add more labels. | +----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Label Management (Speech Paragraph Labeling) | Only datasets for speech paragraph labeling support multiple labels. | | | | | | - **Single Label** | | | | | | A single label is used to label a piece of audio that has only one class. | | | | | | - **Label Name**: Enter a label name. The label name can contain contains 1 to 32 characters. Only letters, digits, underscores (_), and hyphens (-) are allowed. 
| | | - **Label Color**: Set the label color in the **Label Color** column. You can select a color from the color palette or enter a hexadecimal color code to set the color. | | | | | | - **Multiple Labels** | | | | | | Multiple labels are suitable for multi-dimensional labeling. For example, you can label a piece of audio as both noise and speech. For speech, you can label the audio with different speakers. You can click **Add Label Class** to add multiple label classes. A label class can contain multiple labels. The label class and name can contain contains 1 to 32 characters. Only letters, digits, underscores (_), and hyphens (-) are allowed. | | | | | | - **Label Class**: Set a label class. | | | - **Label Name**: Enter a label name. | | | - **Add Label**: Click **Add Label** to add more labels. | +----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Speech Labeling (Speech Paragraph Labeling) | Only datasets for speech paragraph labeling support speech labeling. By default, speech labeling is disabled. If this function is enabled, you can label speech content. 
| +----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Team Labeling | Only datasets of speech paragraph labeling support team labeling. | | | | | | After enabling team labeling, set the name and type of the team labeling task, and select the team and team members. For details about the parameter settings, see :ref:`Creating Team Labeling Tasks `. | | | | | | Before enabling team labeling, ensure that you have added a team and members on the **Labeling Teams** page. If no labeling team is available, click the link on the page to go to the **Labeling Teams** page, and add your team and members. For details, see :ref:`Introduction to Team Labeling `. | | | | | | After a dataset is created with team labeling enabled, you can view the **Team Labeling** mark in **Labeling Type**. | +----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ .. _modelarts_23_0004__en-us_topic_0170886809_section16230452125420: Text (Text Classification, Named Entity Recognition, and Text Triplet) ---------------------------------------------------------------------- .. _modelarts_23_0004__en-us_topic_0170886809_fig13128845173710: .. 
figure:: /_static/images/en-us_image_0000001110920960.png :alt: **Figure 5** Parameters of datasets for text classification, named entity recognition, and text triplet **Figure 5** Parameters of datasets for text classification, named entity recognition, and text triplet .. table:: **Table 2** Dataset parameters +------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Parameter | Description | +==================================================================+=======================================================================================================================================================================================================================================================================================================================================================+ | Input Dataset Path | Select the OBS path to the input dataset. | | | | | | .. note:: | | | | | | Labeled text classification data can be identified only when you import data. When creating a dataset, set an empty OBS directory. After the dataset is created, import the labeled data into it. For details about the format of the data to be imported, see :ref:`Specifications for Importing Data from an OBS Directory `. 
| +------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Output Dataset Path | Select the OBS path to the output dataset. | | | | | | .. note:: | | | | | | The output dataset path cannot be the same as the input dataset path or cannot be the subdirectory of the input dataset path. Select an empty directory as the **Output Dataset Path**. | +------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Label Set (for text classification and named entity recognition) | - **Label Name**: Enter a label name. The label name can contain only letters, digits, underscores (_), and hyphens (-). The name contains 1 to 32 characters. | | | | | | - **Add Label**: Click **Add Label** to add more labels. | | | | | | - Setting a label color: Select a color from the color palette or enter the hexadecimal color code to set the color. 
| +------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Label Set (for text triplet) | For datasets of the text triplet type, set entity labels and relationship labels. | | | | | | - **Entity Label**: Set the label name and label color. You can click the plus sign (+) on the right of the color area to add multiple labels. | | | - **Relationship Label**: a relationship between two entities. Set the source entity and target entity. Therefore, add at least two entity labels before adding a relationship label. | | | | | | |image1| | +------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Team Labeling | Enable or disable team labeling. | | | | | | After enabling team labeling, enter the name and type of the team labeling task, and select the labeling team and team members. For details about the parameter settings, see :ref:`Creating Team Labeling Tasks `. | | | | | | Before enabling team labeling, ensure that you have added a team and members on the **Labeling Teams** page. If no labeling team is available, click the link on the page to go to the **Labeling Teams** page, and add your team and members. For details, see :ref:`Introduction to Team Labeling `. 
| | | | | | After a dataset is created with team labeling enabled, you can view the **Team Labeling** mark in **Labeling Type**. | +------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ .. _modelarts_23_0004__en-us_topic_0170886809_section4103145619546: Table ----- .. note:: When using a CSV file, pay attention to the following: - When the data type is set to **String**, the data in the double quotation marks is regarded as one record by default. Ensure that the double quotation marks in the same row are closed. Otherwise, the data will be too large to display. - If the number of columns in a row of the CSV file is different from that defined in the schema, the row will be ignored. .. table:: **Table 3** Dataset parameters +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Parameter | Description | +===================================+==============================================================================================================================================================================================================================================================================================+ | Storage Path | Select the OBS path for storing table data. The data imported from the data source is stored in this path. The path cannot be the same as or a subdirectory of the file path in the OBS data source. 
| | | | | | After a table dataset is created, the following four directories are automatically generated in the storage path: | | | | | | - **annotation**: version publishing directory. Each time a version is published, a subdirectory with the same name as the version is generated in this directory. | | | - **data**: data storage directory. Imported data is stored in this directory. | | | - **logs**: directory for storing logs | | | - **temp**: temporary working directory | +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Import | If you have stored table data on other cloud services, you can enable this function to import data stored on OBS, DLI, or MRS. | +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Data Source (OBS) | - **File Path**: Browse all OBS buckets of the account and select the directory where the data file to be imported is located. | | | - **Contain Table Header**: If this parameter is enabled, the imported file contains table headers. In this case, the first row of the imported file is used as the column name. Otherwise, the default column name is added and automatically filled in the schema information. | | | | | | For details about OBS functions, see *Object Storage Service Console Operation Guide*. 
| +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Schema | Names and types of table columns, which must be the same as those of the imported data. Set the column name based on the imported data and select the column type. For details about the supported types, see :ref:`Table 4 `. | | | | | | Click **Add Schema** to add a new record. When creating a dataset, you must specify a schema. Once created, the schema cannot be modified. | | | | | | When data is imported from OBS, the schema of the CSV file in the file path is automatically obtained. If the schemas of multiple CSV files are inconsistent, an error is reported. | +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ .. _modelarts_23_0004__en-us_topic_0170886809_table1916832104917: .. 
table:: **Table 4** Migration data types +-----------+------------------------------------------------------------------------+---------------+---------------------------------------------+ | Type | Description | Storage Space | Range | +===========+========================================================================+===============+=============================================+ | String | String | - | - | +-----------+------------------------------------------------------------------------+---------------+---------------------------------------------+ | Short | Signed integer | 2 bytes | -32768 to 32767 | +-----------+------------------------------------------------------------------------+---------------+---------------------------------------------+ | Int | Signed integer | 4 bytes | -2147483648 to 2147483647 | +-----------+------------------------------------------------------------------------+---------------+---------------------------------------------+ | Long | Signed integer | 8 bytes | -9223372036854775808 to 9223372036854775807 | +-----------+------------------------------------------------------------------------+---------------+---------------------------------------------+ | Double | Double-precision floating point | 8 bytes | - | +-----------+------------------------------------------------------------------------+---------------+---------------------------------------------+ | Float | Single-precision floating point | 4 bytes | - | +-----------+------------------------------------------------------------------------+---------------+---------------------------------------------+ | Byte | Signed integer | 1 byte | -128 to 127 | +-----------+------------------------------------------------------------------------+---------------+---------------------------------------------+ | Date | Date type in the format of *yyyy-MM-dd*, for example, 2014-05-29 | - | - | 
+-----------+------------------------------------------------------------------------+---------------+---------------------------------------------+ | Timestamp | Timestamp that represents date and time. Format: *yyyy-MM-dd HH:mm:ss* | - | - | +-----------+------------------------------------------------------------------------+---------------+---------------------------------------------+ | Boolean | Boolean | 1 byte | TRUE or FALSE | +-----------+------------------------------------------------------------------------+---------------+---------------------------------------------+ .. _modelarts_23_0004__en-us_topic_0170886809_section1357212065510: Video ----- .. _modelarts_23_0004__en-us_topic_0170886809_fig973555618557: .. figure:: /_static/images/en-us_image_0000001157080907.png :alt: **Figure 6** Parameters of datasets of the video type **Figure 6** Parameters of datasets of the video type .. table:: **Table 5** Dataset parameters +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Parameter | Description | +===================================+============================================================================================================================================================================================+ | Input Dataset Path | Select the OBS path to the input dataset. | +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Output Dataset Path | Select the OBS path to the output dataset. | | | | | | .. note:: | | | | | | The output dataset path cannot be the same as the input dataset path or a subdirectory of it. 
Select an empty directory as the **Output Dataset Path**. | +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Label Set | - **Label Name**: Enter a label name. The label name can contain only letters, digits, underscores (_), and hyphens (-). The name contains 1 to 32 characters. | | | | | | - **Add Label**: Click **Add Label** to add more labels. | | | | | | - Setting a label color: Select a color from the color palette or enter the hexadecimal color code to set the color. | +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ .. _modelarts_23_0004__en-us_topic_0170886809_section359415145517: Other (Free Format) ------------------- .. _modelarts_23_0004__en-us_topic_0170886809_fig1957792145712: .. figure:: /_static/images/en-us_image_0000001156920933.png :alt: **Figure 7** Parameters of datasets of the free format type **Figure 7** Parameters of datasets of the free format type .. table:: **Table 6** Dataset parameters +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Parameter | Description | +===================================+============================================================================================================================================================================================+ | Input Dataset Path | Select the OBS path to the input dataset. 
| +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Output Dataset Path | Select the OBS path to the output dataset. | | | | | | .. note:: | | | | | | The output dataset path cannot be the same as the input dataset path or a subdirectory of it. Select an empty directory as the **Output Dataset Path**. | +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ .. |image1| image:: /_static/images/en-us_image_0000001156920935.png
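
The output path rule stated in the parameter tables above (the output dataset path must not be the same as the input dataset path and must not lie beneath it) can be checked locally before you fill in the form. The following is a minimal sketch and not part of ModelArts: the function names ``is_valid_output_path`` and ``_normalize`` and the example ``obs://`` paths are hypothetical. The check only compares path strings; it does not contact OBS and does not verify that the output directory is empty.

```python
from posixpath import normpath


def _normalize(obs_path: str) -> str:
    """Drop the scheme (for example "obs://") and collapse extra slashes."""
    without_scheme = obs_path.split("://", 1)[-1]
    return normpath("/" + without_scheme.strip("/"))


def is_valid_output_path(input_path: str, output_path: str) -> bool:
    """Return False if the output path equals the input path or lies under it."""
    src = _normalize(input_path)
    dst = _normalize(output_path)
    if dst == src:
        return False
    # Compare against the input path plus a trailing slash so that
    # "/bucket/data-out" is not mistaken for a subdirectory of "/bucket/data".
    return not dst.startswith(src.rstrip("/") + "/")


if __name__ == "__main__":
    # Hypothetical bucket and folder names, for illustration only.
    print(is_valid_output_path("obs://bucket/data/", "obs://bucket/output/"))
    print(is_valid_output_path("obs://bucket/data/", "obs://bucket/data/out/"))
```

With the hypothetical paths above, the first call accepts a sibling directory and the second rejects a directory nested under the input path.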