
Read Data from Files

Load structured data from files using FileSource, which supports CSV, JSON, JSONL, and Parquet formats with automatic format detection and type conversion.

QType YAML

steps:
  - id: read_data
    type: FileSource
    path: batch_inputs.csv
    outputs:
      - query
      - topic

Explanation

  • FileSource: Step that reads structured data from files using fsspec-compatible URIs
  • path: File path (relative to the YAML file, or absolute); supports local files and cloud storage URIs such as s3:// or gs:// (see the sketch after this list)
  • outputs: Column names from the file to extract as variables (must match actual column names)
  • Format detection: Automatically determined by file extension (.csv, .json, .jsonl, .parquet)
  • Type conversion: Automatically converts data to match variable types (primitives, domain types, custom types)
  • Streaming: Emits one FlowMessage per row, enabling downstream steps to process data in parallel
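
Because path accepts fsspec-compatible URIs, the same step can read straight from object storage; the bucket and key below are hypothetical:

steps:
  - id: read_remote_data
    type: FileSource
    path: s3://my-bucket/batch_inputs.csv   # hypothetical bucket and key
    outputs:
      - query
      - topic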

Automatic Type Conversion

FileSource automatically converts data from files to match your variable types:

  • Primitive types (int, float, bool, text): Direct conversion from file data (see the sketch after this list)
  • Domain types (ChatMessage, SearchResult, etc.): Validated from dict/object columns
  • Custom types: Your defined types are validated and instantiated from dict/object columns
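
As a minimal sketch of that mapping (file name and columns are illustrative), each column is declared as a variable of the matching primitive type, and FileSource fills those variables row by row:

variables:
  - id: user_id
    type: int
  - id: rating
    type: float
  - id: active
    type: bool
  - id: comment
    type: text

steps:
  - id: read_survey
    type: FileSource
    path: survey_results.csv   # illustrative file
    outputs:
      - user_id
      - rating
      - active
      - comment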

Format Recommendations:

  • CSV: Best for simple primitive types (strings, numbers, booleans)
  • JSON/JSONL: Recommended for nested objects, custom types, and domain types (an illustrative JSONL file follows this list)
  • Parquet: Best for large datasets with mixed types and efficient storage
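
JSONL keeps one JSON object per line, which matches FileSource's one-message-per-row streaming. A JSONL equivalent of the CSV used in the first example might look like this (rows are illustrative):

{"query": "What is the capital of France?", "topic": "geography"}
{"query": "How does photosynthesis work?", "topic": "biology"}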

Example with Custom Types (JSON format):

[
  {"person": {"name": "Alice", "age": 30}, "score": 95},
  {"person": {"name": "Bob", "age": 25}, "score": 87}
]

JSON preserves nested objects, making it ideal for complex types. CSV stores everything as strings, requiring nested objects to be serialized as JSON strings within the CSV.
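
For comparison, storing the same two records in CSV means serializing each person object as a JSON string inside its column, with quotes doubled per CSV escaping:

person,score
"{""name"": ""Alice"", ""age"": 30}",95
"{""name"": ""Bob"", ""age"": 25}",87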

Complete Example

id: read_file_example
description: Read data from a CSV file

models:
  - type: Model
    id: nova
    provider: aws-bedrock
    model_id: amazon.nova-lite-v1:0

flows:
  - type: Flow
    id: process_file_data
    description: Read and process data from a CSV file

    variables:
      - id: query
        type: text
      - id: topic
        type: text
      - id: prompt
        type: text
      - id: answer
        type: text

    inputs: []

    outputs:
      - query
      - topic
      - answer

    steps:
      - id: read_data
        type: FileSource
        path:
          uri: examples/data_processing/batch_inputs.csv
        outputs:
          - query
          - topic

      - id: create_prompt
        type: PromptTemplate
        template: |
          Topic: {topic}
          Question: {query}

          Provide a concise answer:
        inputs:
          - query
          - topic
        outputs:
          - prompt

      - id: generate_answer
        type: LLMInference
        model: nova
        inputs:
          - prompt
        outputs:
          - answer
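
For reference, a batch_inputs.csv compatible with this flow would provide query and topic columns; the rows below are illustrative:

query,topic
What is the capital of France?,geography
How does photosynthesis work?,biology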
