Read Data from Files
Load structured data from files using FileSource, which supports CSV, JSON, JSONL, and Parquet formats with automatic format detection and type conversion.
QType YAML
Explanation
- `FileSource`: Step that reads structured data from files using fsspec-compatible URIs
- `path`: File path (relative to the YAML file or absolute); supports local files and cloud storage (`s3://`, `gs://`, etc.)
- `outputs`: Column names from the file to extract as variables (must match the actual column names)
- Format detection: Determined automatically by file extension (`.csv`, `.json`, `.jsonl`, `.parquet`)
- Type conversion: Data is converted automatically to match variable types (primitives, domain types, custom types)
- Streaming: Emits one `FlowMessage` per row, enabling downstream steps to process data in parallel
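The detection and streaming behavior described above can be sketched in plain Python. This is an illustrative sketch only, not QType's implementation; the helper names `detect_format` and `stream_rows` are invented for this example.

```python
import csv
import io
import json
from pathlib import Path

def detect_format(path: str) -> str:
    """Map a file extension to a format name, as extension-based detection might do."""
    formats = {".csv": "csv", ".json": "json", ".jsonl": "jsonl", ".parquet": "parquet"}
    ext = Path(path).suffix.lower()
    if ext not in formats:
        raise ValueError(f"Unsupported extension: {ext}")
    return formats[ext]

def stream_rows(text: str, fmt: str):
    """Yield one dict per row, so each row can flow downstream independently."""
    if fmt == "csv":
        yield from csv.DictReader(io.StringIO(text))
    elif fmt == "jsonl":
        for line in text.splitlines():
            if line.strip():
                yield json.loads(line)
    elif fmt == "json":
        yield from json.loads(text)  # expects a top-level JSON array

# Each CSV row becomes one message-like dict.
rows = list(stream_rows("query,topic\nWhat is YAML?,formats\n", "csv"))
```

In practice the per-row dicts would be wrapped in flow messages, which is what lets downstream steps start work before the whole file is read.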
Automatic Type Conversion
FileSource automatically converts data from files to match your variable types:
- Primitive types (`int`, `float`, `bool`, `text`): Direct conversion from file data
- Domain types (`ChatMessage`, `SearchResult`, etc.): Validated from dict/object columns
- Custom types: Your defined types are validated and instantiated from dict/object columns
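As an illustration of these conversion rules, the sketch below converts primitives directly and instantiates a custom type from a dict column. The `convert` helper and `Person` type are invented for this example; QType's actual validation machinery differs.

```python
from dataclasses import dataclass

@dataclass
class Person:
    """Stands in for a user-defined custom type."""
    name: str
    age: int

def convert(value, target):
    """Sketch: convert a raw file value to a target variable type."""
    if target in (int, float, str):
        return target(value)      # primitives: direct conversion
    if isinstance(value, dict):
        return target(**value)    # domain/custom types: instantiate from a dict column
    raise TypeError(f"Cannot convert {value!r} to {target}")

age = convert("30", int)                           # CSV cells arrive as strings
person = convert({"name": "Alice", "age": 30}, Person)  # JSON preserves the dict
```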
Format Recommendations:
- CSV: Best for simple primitive types (strings, numbers, booleans)
- JSON/JSONL: Recommended for nested objects, custom types, and domain types
- Parquet: Best for large datasets with mixed types and efficient storage
Example with Custom Types (JSON format):
```json
[
  {"person": {"name": "Alice", "age": 30}, "score": 95},
  {"person": {"name": "Bob", "age": 25}, "score": 87}
]
```
JSON preserves nested objects, making it ideal for complex types. CSV stores every value as a string, so nested objects must be serialized as JSON strings inside a single CSV column.
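To see what that workaround costs, the sketch below round-trips the nested person object through a CSV cell: the object must be JSON-encoded on write and explicitly decoded on read, and the numeric column comes back as a string.

```python
import csv
import io
import json

# Write: the nested object is serialized to a JSON string inside one CSV cell.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["person", "score"])
writer.writeheader()
writer.writerow({"person": json.dumps({"name": "Alice", "age": 30}), "score": 95})

# Read: every CSV value comes back as a string, so both columns
# need explicit decoding/conversion.
row = next(csv.DictReader(io.StringIO(buf.getvalue())))
person = json.loads(row["person"])   # back to a dict
score = int(row["score"])            # "95" -> 95
```

With JSON or JSONL input, neither step is needed, which is why those formats are recommended for nested and custom types.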
Complete Example
```yaml
id: read_file_example
description: Read data from a CSV file

models:
  - type: Model
    id: nova
    provider: aws-bedrock
    model_id: amazon.nova-lite-v1:0

flows:
  - type: Flow
    id: process_file_data
    description: Read and process data from a CSV file
    variables:
      - id: query
        type: text
      - id: topic
        type: text
      - id: prompt
        type: text
      - id: answer
        type: text
    inputs: []
    outputs:
      - query
      - topic
      - answer
    steps:
      - id: read_data
        type: FileSource
        path:
          uri: examples/data_processing/batch_inputs.csv
        outputs:
          - query
          - topic
      - id: create_prompt
        type: PromptTemplate
        template: |
          Topic: {topic}
          Question: {query}

          Provide a concise answer:
        inputs:
          - query
          - topic
        outputs:
          - prompt
      - id: generate_answer
        type: LLMInference
        model: nova
        inputs:
          - prompt
        outputs:
          - answer
```
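To make the flow concrete, the sketch below pairs hypothetical rows for `batch_inputs.csv` (the real file's contents are not shown in this guide) with the prompt each row would produce. Python's `str.format` is used here only to approximate the template substitution of the `create_prompt` step.

```python
# Hypothetical rows with the query and topic columns the flow expects.
rows = [
    {"query": "What is format detection?", "topic": "file handling"},
    {"query": "How are rows streamed?", "topic": "flow execution"},
]

# The template from the create_prompt step, with {topic}/{query} placeholders.
TEMPLATE = "Topic: {topic}\nQuestion: {query}\n\nProvide a concise answer:"

# FileSource emits one row at a time, so each row yields one prompt
# for the generate_answer step.
prompts = [TEMPLATE.format(**row) for row in rows]
```

Because each row becomes its own flow message, the two prompts can be sent to the model independently rather than after the whole file is loaded.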