# CREATE SOURCE `CREATE SOURCE` connects Materialize to an external data source. A [source](/concepts/sources/) describes an external system you want Materialize to read data from, and provides details about how to decode and interpret that data. ## Syntax summary **PostgreSQL (New):** To create a source from an external PostgreSQL: ```mzsql CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ] FROM POSTGRES CONNECTION (PUBLICATION '') ; ``` For details, see [CREATE SOURCE: PostgreSQL (New Syntax)](/sql/create-source/postgres-v2/). **PostgreSQL (Legacy):** ```mzsql CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ] FROM POSTGRES CONNECTION ( PUBLICATION '' [, TEXT COLUMNS ( [, ...] ) ] [, EXCLUDE COLUMNS ( [, ...] ) ] ) [, ...] ) | FOR TABLES ( [AS ] [, ...] )> [EXPOSE PROGRESS AS ] [WITH (RETAIN HISTORY FOR )] ``` For details, see [CREATE SOURCE: PostgreSQL (Legacy)](/sql/create-source/postgres/). **MySQL:** ```mzsql CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ] FROM MYSQL CONNECTION [ ( [TEXT COLUMNS ( [, ...] ) ] [, EXCLUDE COLUMNS ( [, ...] ) ] ) ] [, ...] ) | FOR TABLES ( [AS ] [, ...] )> [EXPOSE PROGRESS AS ] [WITH (RETAIN HISTORY FOR )] ``` For details, see [CREATE SOURCE: MySQL](/sql/create-source/mysql/). **SQL Server (New):** ```mzsql CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ] FROM SQL SERVER CONNECTION ``` For details, see [CREATE SOURCE: SQL Server (New Syntax)](/sql/create-source/sql-server-v2/). **SQL Server (Legacy):** ```mzsql CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ] FROM SQL SERVER CONNECTION [ ( EXCLUDE COLUMNS ( [, ...]) ) ] [ ( TEXT COLUMNS ( [, ...]) ) ] [AS ] [, ...] )> [WITH (RETAIN HISTORY FOR )] ``` For details, see [CREATE SOURCE: SQL Server(Legacy)](/sql/create-source/sql-server/). **Kafka/Redpanda:** **Format Avro:** ```mzsql CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ] FROM KAFKA CONNECTION ( TOPIC '' [, GROUP ID PREFIX ''] [, START OFFSET ( [, ...] 
) ] [, START TIMESTAMP ] ) FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY CONNECTION [KEY STRATEGY ] [VALUE STRATEGY ] [INCLUDE KEY [AS ] | PARTITION [AS ] | OFFSET [AS ] | TIMESTAMP [AS ] | HEADERS [AS ] | HEADER '' AS [BYTES] [, ...] ] [ENVELOPE NONE | DEBEZIUM | UPSERT [ ( VALUE DECODING ERRORS = INLINE [AS ] ) ] ] [EXPOSE PROGRESS AS ] [WITH (RETAIN HISTORY FOR )] ``` **Format JSON:** ```mzsql CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ] FROM KAFKA CONNECTION ( TOPIC '' [, GROUP ID PREFIX ''] [, START OFFSET ( [, ...] ) ] [, START TIMESTAMP ] ) FORMAT JSON [INCLUDE PARTITION [AS ] | OFFSET [AS ] | TIMESTAMP [AS ] | HEADERS [AS ] | HEADER '' AS [BYTES] [, ...] ] [ENVELOPE NONE] [EXPOSE PROGRESS AS ] [WITH (RETAIN HISTORY FOR )] ``` **Format TEXT/BYTES:** ```mzsql CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ] FROM KAFKA CONNECTION ( TOPIC '' [, GROUP ID PREFIX ''] [, START OFFSET ( [, ...] ) ] [, START TIMESTAMP ] ) FORMAT TEXT | BYTES [INCLUDE PARTITION [AS ] | OFFSET [AS ] | TIMESTAMP [AS ] | HEADERS [AS ] | HEADER '' AS [BYTES] [, ...] ] [ENVELOPE NONE] [EXPOSE PROGRESS AS ] [WITH (RETAIN HISTORY FOR )] ``` **Format CSV:** ```mzsql CREATE SOURCE [IF NOT EXISTS] ( [, ...] ) [IN CLUSTER ] FROM KAFKA CONNECTION ( TOPIC '' [, GROUP ID PREFIX ''] [, START OFFSET ( [, ...] ) ] [, START TIMESTAMP ] ) FORMAT CSV WITH COLUMNS | WITH HEADER [ ( [, ...] ) ] [INCLUDE PARTITION [AS ] | OFFSET [AS ] | TIMESTAMP [AS ] | HEADERS [AS ] | HEADER '' AS [BYTES] [, ...] ] [ENVELOPE NONE] [EXPOSE PROGRESS AS ] [WITH (RETAIN HISTORY FOR )] ``` **Format Protobuf:** ```mzsql CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ] FROM KAFKA CONNECTION ( TOPIC '' [, GROUP ID PREFIX ''] [, START OFFSET ( [, ...] ) ] [, START TIMESTAMP ] ) FORMAT PROTOBUF USING CONFLUENT SCHEMA REGISTRY CONNECTION | FORMAT PROTOBUF MESSAGE '' USING SCHEMA '' [INCLUDE KEY [AS ] | PARTITION [AS ] | OFFSET [AS ] | TIMESTAMP [AS ] | HEADERS [AS ] | HEADER '' AS [BYTES] [, ...] 
] [ENVELOPE NONE | UPSERT [ ( VALUE DECODING ERRORS = INLINE [AS ] ) ] ] [EXPOSE PROGRESS AS ] [WITH (RETAIN HISTORY FOR )] ``` **KEY FORMAT VALUE FORMAT:** ```mzsql CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ] FROM KAFKA CONNECTION ( TOPIC '' [, GROUP ID PREFIX ''] [, START OFFSET ( [, ...] ) ] [, START TIMESTAMP ] ) KEY FORMAT VALUE FORMAT -- and can be: -- AVRO USING CONFLUENT SCHEMA REGISTRY CONNECTION -- [KEY STRATEGY ] -- [VALUE STRATEGY ] -- | CSV WITH COLUMNS DELIMITED BY -- | JSON | TEXT | BYTES -- | PROTOBUF USING CONFLUENT SCHEMA REGISTRY CONNECTION -- | PROTOBUF MESSAGE '' USING SCHEMA '' [INCLUDE KEY [AS ] | PARTITION [AS ] | OFFSET [AS ] | TIMESTAMP [AS ] | HEADERS [AS ] | HEADER '' AS [BYTES] [, ...] ] [ENVELOPE NONE | DEBEZIUM | UPSERT [(VALUE DECODING ERRORS = INLINE [AS name])] ] [EXPOSE PROGRESS AS ] [WITH (RETAIN HISTORY FOR )] ``` For details, see [CREATE SOURCE: Kafka/Redpanda](/sql/create-source/kafka/). **Webhook:** ```mzsql CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ] FROM WEBHOOK BODY FORMAT [INCLUDE HEADER AS [BYTES] | INCLUDE HEADERS [ ( [NOT] [, [NOT] ... ] ) ] ][...] [CHECK ( [WITH ( > [AS ] [BYTES] [, ...])] ) ] ``` For details, see [CREATE SOURCE: Webhook](/sql/create-source/webhook/). ## Privileges The privileges required to execute `CREATE SOURCE` are: - `CREATE` privileges on the containing schema. - `CREATE` privileges on the containing cluster if the source is created in an existing cluster. - `CREATECLUSTER` privileges on the system if the source is not created in an existing cluster. - `USAGE` privileges on all connections and secrets used in the source definition. - `USAGE` privileges on the schemas that all connections and secrets in the statement are contained in. ## Available guides The following guides step you through setting up sources:

## Best practices

### Separate cluster(s) for sources

In production, if possible, use a dedicated cluster for [sources](/concepts/sources/); i.e., avoid putting sources on the same cluster that hosts compute objects, sinks, and/or serves queries.
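As a rough sketch of this recommendation, the statements below create a cluster reserved for ingestion and place a source in it. The names `ingest` and `demo_auction` and the size `'100cc'` are illustrative assumptions; the size names you can use depend on your Materialize deployment. A load generator source is used so that the sketch needs no external connection.

```mzsql
-- Hypothetical cluster name and size; choose a size your deployment offers.
CREATE CLUSTER ingest (SIZE = '100cc');

-- Place the source on the dedicated cluster instead of a cluster that
-- also hosts indexes, materialized views, or sinks.
CREATE SOURCE demo_auction
  IN CLUSTER ingest
  FROM LOAD GENERATOR AUCTION (TICK INTERVAL '1s')
  FOR ALL TABLES;
```

Keeping ingestion on its own cluster means that a heavy query or a large index on another cluster does not compete with the source for CPU and memory.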

In addition, for upsert sources:

- Consider separating upsert sources from your other sources. Upsert sources have higher resource requirements because Materialize maintains each key and the last value associated with it, and also performs deduplication. As such, if possible, use a separate source cluster for upsert sources.
- Consider using a larger cluster size during snapshotting for upsert sources. Once the snapshotting operation is complete, you can downsize the cluster to align with the steady-state ingestion.
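One possible way to follow the snapshot-sizing advice above is sketched here. The cluster name `upsert_ingest` and both sizes are assumptions chosen for illustration, not recommendations; substitute sizes that exist in your environment and that fit your data volume.

```mzsql
-- Hypothetical name and sizes. Start with a larger size so the initial
-- snapshot of the upsert source has enough memory and CPU.
CREATE CLUSTER upsert_ingest (SIZE = '400cc');

-- Create the upsert source(s) with IN CLUSTER upsert_ingest, then wait
-- for the initial snapshot to complete before resizing.

-- Downsize to match steady-state ingestion once snapshotting is done.
ALTER CLUSTER upsert_ingest SET (SIZE = '100cc');
```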

### Sizing a source

Some sources are low traffic and require relatively few resources to handle data ingestion, while others are high traffic and require hefty resource allocations. The cluster in which you place a source determines the amount of CPU, memory, and disk available to the source.

It's a good idea to size up the cluster hosting a source when:

* You want to **increase throughput**. Larger clusters will typically ingest data faster, as there is more CPU available to read and decode data from the upstream external system.

* You are using the [upsert envelope](/sql/create-source/kafka/#upsert-envelope) or [Debezium envelope](/sql/create-source/kafka/#debezium-envelope), and your source contains **many unique keys**. These envelopes maintain state proportional to the number of unique keys in the upstream external system. Larger sizes can store more unique keys.

Sources share the resource allocation of their cluster with all other objects in the cluster. Colocating multiple sources onto the same cluster can be more resource efficient when you have many low-traffic sources that occasionally need some burst capacity.

## Related pages

- [Sources](/concepts/sources/)
- [`SHOW SOURCES`](/sql/show-sources/)
- [`SHOW COLUMNS`](/sql/show-columns/)
- [`SHOW CREATE SOURCE`](/sql/show-create-source/)

---

## Appendix: Load generator

[`CREATE SOURCE`](/sql/create-source/) connects Materialize to an external system you want to read data from, and provides details about how to decode and interpret that data. Load generator sources produce synthetic data for use in demos and performance tests.
## Syntax ```mzsql CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ] FROM LOAD GENERATOR [ ( [TICK INTERVAL ] [, AS OF ] [, UP TO ] [, SCALE FACTOR ] [, MAX CARDINALITY ] [, KEYS ] [, SNAPSHOT ROUNDS ] [, TRANSACTIONAL SNAPSHOT ] [, VALUE SIZE ] [, SEED ] [, PARTITIONS ] [, BATCH SIZE ] ) ] FOR ALL TABLES [EXPOSE PROGRESS AS ] [WITH (RETAIN HISTORY FOR )] ``` | Syntax element | Description | | --- | --- | | `` | The name for the source. | | **IF NOT EXISTS** | Optional. If specified, do not throw an error if a source with the same name already exists. Instead, issue a notice and skip the source creation. | | **IN CLUSTER** `` | Optional. The [cluster](/sql/create-cluster) to maintain this source. | | **FROM LOAD GENERATOR** `` | The type of load generator to use. Valid generator types: \| Generator \| Description \| \|-----------\|-------------\| \| `AUCTION` \| Use the [auction](#auction) load generator. \| \| `MARKETING` \| Use the [marketing](#marketing) load generator. \| \| `TPCH` \| Use the [tpch](#tpch) load generator. \| \| `KEY VALUE` \| Use the key-value load generator. \| | | **TICK INTERVAL** `` | Optional. The interval at which the next datum should be emitted. Defaults to one second. | | **AS OF** `` | Optional. {{< warn-if-unreleased-inline "v0.101" >}} The tick at which to start producing data. Defaults to 0. | | **UP TO** `` | Optional. {{< warn-if-unreleased-inline "v0.101" >}} The tick before which to stop producing data. Defaults to infinite. | | **SCALE FACTOR** `` | Optional. The scale factor for the `TPCH` generator. Defaults to `0.01` (~ 10MB). | | **MAX CARDINALITY** `` | Optional. The maximum cardinality for the generator. | | **KEYS** `` | Optional. The number of keys for the generator. | | **SNAPSHOT ROUNDS** `` | Optional. The number of snapshot rounds for the generator. | | **TRANSACTIONAL SNAPSHOT** `` | Optional. Whether to use transactional snapshots. | | **VALUE SIZE** `` | Optional. The size of values for the generator. 
| | **SEED** `` | Optional. The seed for random number generation. | | **PARTITIONS** `` | Optional. The number of partitions for the generator. | | **BATCH SIZE** `` | Optional. The batch size for the generator. | | **FOR ALL TABLES** | Creates subsources for all tables in the load generator. | | **EXPOSE PROGRESS AS** `` | Optional. The name of the progress subsource for the source. If this is not specified, the subsource will be named `_progress`. For more information, see [Monitoring source progress](#monitoring-source-progress). | | **WITH** (`` [, ...]) | Optional. The following ``s are supported: \| Option \| Description \| \|--------\|-------------\| \| `RETAIN HISTORY FOR ` \| ***Private preview.** This option has known performance or stability issues and is under active development.* Duration for which Materialize retains historical data, which is useful to implement [durable subscriptions](/transform-data/patterns/durable-subscriptions/#history-retention-period). Accepts positive [interval](/sql/types/interval/) values (e.g. `'1hr'`). Default: `1s`. \| | ## Description Materialize has several built-in load generators, which provide a quick way to get up and running with no external dependencies before plugging in your own data sources. If you would like to see an additional load generator, please submit a [feature request]. ### Auction The auction load generator simulates an auction house, where users are bidding on an ongoing series of auctions. The auction source will be automatically demuxed into multiple subsources when the `CREATE SOURCE` command is executed. This will create the following subsources: * `organizations` describes the organizations known to the auction house. Field | Type | Description ------|------------|------------ id | [`bigint`] | A unique identifier for the organization. name | [`text`] | The organization's name. * `users` describes the users that belong to each organization. 

  Field     | Type       | Description
  ----------|------------|------------
  `id`      | [`bigint`] | A unique identifier for the user.
  `org_id`  | [`bigint`] | The identifier of the organization to which the user belongs. References `organizations.id`.
  `name`    | [`text`]   | The user's name.

* `accounts` describes the account associated with each organization.

  Field     | Type       | Description
  ----------|------------|------------
  `id`      | [`bigint`] | A unique identifier for the account.
  `org_id`  | [`bigint`] | The identifier of the organization to which the account belongs. References `organizations.id`.
  `balance` | [`bigint`] | The balance of the account in dollars.

* `auctions` describes all past and ongoing auctions.

  Field      | Type                         | Description
  -----------|------------------------------|------------
  `id`       | [`bigint`]                   | A unique identifier for the auction.
  `seller`   | [`bigint`]                   | The identifier of the user selling the item. References `users.id`.
  `item`     | [`text`]                     | The name of the item being sold.
  `end_time` | [`timestamp with time zone`] | The time at which the auction closes.

* `bids` describes the bids placed in each auction.

  Field        | Type                         | Description
  -------------|------------------------------|------------
  `id`         | [`bigint`]                   | A unique identifier for the bid.
  `buyer`      | [`bigint`]                   | The identifier of the user placing the bid. References `users.id`.
  `auction_id` | [`bigint`]                   | The identifier of the auction in which the bid is placed. References `auctions.id`.
  `amount`     | [`bigint`]                   | The bid amount in dollars.
  `bid_time`   | [`timestamp with time zone`] | The time at which the bid was placed.

The organizations, users, and accounts are fixed at the time the source is created. Each tick interval, either a new auction is started, or a new bid is placed in the currently ongoing auction.

### Marketing

The marketing load generator simulates a marketing organization that is using a machine learning model to send coupons to potential leads.
The marketing source will be automatically demuxed into multiple subsources when the `CREATE SOURCE` command is executed. This will create the following subsources: * `customers` describes the customers that the marketing team may target. Field | Type | Description ----------|------------|------------ `id` | [`bigint`] | A unique identifier for the customer. `email` | [`text`] | The customer's email. `income` | [`bigint`] | The customer's income in pennies. * `impressions` describes online ads that have been seen by a customer. Field | Type | Description ------------------|------------------------------|------------ `id` | [`bigint`] | A unique identifier for the impression. `customer_id` | [`bigint`] | The identifier of the customer that saw the ad. References `customers.id`. `impression_time` | [`timestamp with time zone`] | The time at which the ad was seen. * `clicks` describes clicks of ads. Field | Type | Description ------------------|------------------------------|------------ `impression_id` | [`bigint`] | The identifier of the impression that was clicked. References `impressions.id`. `click_time` | [`timestamp with time zone`] | The time at which the impression was clicked. * `leads` describes a potential lead for a purchase. Field | Type | Description --------------------|------------------------------|------------ `id` | [`bigint`] | A unique identifier for the lead. `customer_id` | [`bigint`] | The identifier of the customer we'd like to convert. References `customers.id`. `created_at` | [`timestamp with time zone`] | The time at which the lead was created. `converted_at` | [`timestamp with time zone`] | The time at which the lead was converted. `conversion_amount` | [`bigint`] | The amount the lead converted for in pennies. * `coupons` describes coupons given to leads. Field | Type | Description --------------------|------------------------------|------------ `id` | [`bigint`] | A unique identifier for the coupon. 
  `lead_id`           | [`bigint`]                   | The identifier of the lead we're attempting to convert. References `leads.id`.
  `created_at`        | [`timestamp with time zone`] | The time at which the coupon was created.
  `amount`            | [`bigint`]                   | The amount the coupon is for in pennies.

* `conversion_predictions` describes the predictions made by a highly sophisticated machine learning model.

  Field               | Type                         | Description
  --------------------|------------------------------|------------
  `lead_id`           | [`bigint`]                   | The identifier of the lead we're attempting to convert. References `leads.id`.
  `experiment_bucket` | [`text`]                     | Whether the lead is a control or experiment.
  `created_at`        | [`timestamp with time zone`] | The time at which the prediction was made.
  `score`             | [`numeric`]                  | The predicted likelihood the lead will convert.

### TPCH

The TPCH load generator implements the [TPC-H benchmark specification](https://www.tpc.org/tpch/default5.asp). The TPCH source must be used with `FOR ALL TABLES`, which will create the standard TPCH relations. If `TICK INTERVAL` is specified, after the initial data load, an order and its lineitems will be changed at this interval. If not specified, the dataset will not change over time.

### Monitoring source progress

By default, load generator sources expose progress metadata as a subsource that you can use to monitor source **ingestion progress**. The name of the progress subsource can be specified when creating a source using the `EXPOSE PROGRESS AS` clause; otherwise, it will be named `_progress`.

The following metadata is available for each source as a progress subsource:

Field          | Type        | Meaning
---------------|-------------|--------
`offset`       | [`uint8`]   | The minimum offset for which updates to this source are still undetermined.

The progress subsource can be queried using:

```mzsql
SELECT "offset" FROM _progress;
```

As long as the offset continues increasing, Materialize is generating data.
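As a concrete illustration, the sketch below assumes a load generator source named `auction_house` that was created without an `EXPOSE PROGRESS AS` clause, so its progress subsource is `auction_house_progress`. The second statement is an optional variation that streams offset changes instead of polling.

```mzsql
-- Assumes a source named auction_house with the default progress subsource.
SELECT "offset" FROM auction_house_progress;

-- Optional: watch the offset advance over time rather than re-running
-- the query by hand.
SUBSCRIBE (SELECT "offset" FROM auction_house_progress);
```

If repeated reads return the same offset for much longer than the configured tick interval, that may indicate the source has stalled and is worth investigating.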
For more details on monitoring source ingestion progress and debugging related issues, see [Troubleshooting](/ops/troubleshooting/). ## Examples ### Creating an auction load generator To create a load generator source that simulates an auction house and emits new data every second: ```mzsql CREATE SOURCE auction_house FROM LOAD GENERATOR AUCTION (TICK INTERVAL '1s') FOR ALL TABLES; ``` To display the created subsources: ```mzsql SHOW SOURCES; ``` ```nofmt name | type ------------------------+---------------- accounts | subsource auction_house | load-generator auction_house_progress | progress auctions | subsource bids | subsource organizations | subsource users | subsource ``` To examine the simulated bids: ```mzsql SELECT * from bids; ``` ```nofmt id | buyer | auction_id | amount | bid_time ----+-------+------------+--------+---------------------------- 10 | 3844 | 1 | 59 | 2022-09-16 23:24:07.332+00 11 | 1861 | 1 | 40 | 2022-09-16 23:24:08.332+00 12 | 3338 | 1 | 97 | 2022-09-16 23:24:09.332+00 ``` ### Creating a marketing load generator To create a load generator source that simulates an online marketing campaign: ```mzsql CREATE SOURCE marketing FROM LOAD GENERATOR MARKETING FOR ALL TABLES; ``` To display the created subsources: ```mzsql SHOW SOURCES; ``` ```nofmt name | type ------------------------+--------------- clicks | subsource conversion_predictions | subsource coupons | subsource customers | subsource impressions | subsource leads | subsource marketing | load-generator marketing_progress | progress ``` To find all impressions and clicks associated with a campaign over the last 30 days: ```mzsql WITH click_rollup AS ( SELECT impression_id AS id, count(*) AS clicks FROM clicks WHERE click_time - INTERVAL '30' DAY <= mz_now() GROUP BY impression_id ), impression_rollup AS ( SELECT id, campaign_id, count(*) AS impressions FROM impressions WHERE impression_time - INTERVAL '30' DAY <= mz_now() GROUP BY id, campaign_id ) SELECT campaign_id, sum(impressions) AS 
impressions, sum(clicks) AS clicks FROM impression_rollup LEFT JOIN click_rollup USING(id) GROUP BY campaign_id; ``` ```nofmt campaign_id | impressions | clicks -------------+-------------+-------- 0 | 350 | 33 1 | 325 | 28 2 | 319 | 24 3 | 315 | 38 4 | 305 | 28 5 | 354 | 31 6 | 346 | 25 7 | 337 | 36 8 | 329 | 38 9 | 305 | 24 10 | 345 | 27 11 | 323 | 30 12 | 320 | 29 13 | 331 | 27 14 | 310 | 22 15 | 324 | 28 16 | 315 | 32 17 | 329 | 36 18 | 329 | 28 ``` ### Creating a TPCH load generator To create the load generator source and its associated subsources: ```mzsql CREATE SOURCE tpch FROM LOAD GENERATOR TPCH (SCALE FACTOR 1) FOR ALL TABLES; ``` To display the created subsources: ```mzsql SHOW SOURCES; ``` ```nofmt name | type ---------------+--------------- tpch | load-generator tpch_progress | progress supplier | subsource region | subsource partsupp | subsource part | subsource orders | subsource nation | subsource lineitem | subsource customer | subsource ``` To run the Pricing Summary Report Query (Q1), which reports the amount of billed, shipped, and returned items: ```mzsql SELECT l_returnflag, l_linestatus, sum(l_quantity) AS sum_qty, sum(l_extendedprice) AS sum_base_price, sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge, avg(l_quantity) AS avg_qty, avg(l_extendedprice) AS avg_price, avg(l_discount) AS avg_disc, count(*) AS count_order FROM lineitem WHERE l_shipdate <= date '1998-12-01' - interval '90' day GROUP BY l_returnflag, l_linestatus ORDER BY l_returnflag, l_linestatus; ``` ```nofmt l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order --------------+--------------+----------+----------------+-----------------+-------------------+--------------------+--------------------+---------------------+------------- A | F | 37772997 | 56604341792 | 54338346989.17 | 57053313118.2657 | 25.490380624798817 | 
38198.351517998075 | 0.04003729114831228 | 1481853 N | F | 986796 | 1477585066 | 1418531782.89 | 1489171757.0798 | 25.463731840115603 | 38128.27564317601 | 0.04007431682708436 | 38753 N | O | 74281600 | 111337230039 | 106883023012.04 | 112227399730.9018 | 25.49430183051871 | 38212.221432873834 | 0.03999775539657235 | 2913655 R | F | 37770949 | 56610551077 | 54347734573.7 | 57066196254.4557 | 25.496431466814634 | 38213.68205054471 | 0.03997848687172654 | 1481421 ``` ## Related pages - [`CREATE SOURCE`](../) [`bigint`]: /sql/types/bigint [`numeric`]: /sql/types/numeric [`text`]: /sql/types/text [`bytea`]: /sql/types/bytea [`interval`]: /sql/types/interval [`uint8`]: /sql/types/uint/#uint8-info [`timestamp with time zone`]: /sql/types/timestamp [feature request]: https://github.com/MaterializeInc/materialize/discussions/new?category=feature-requests --- ## CREATE SOURCE: Kafka/Redpanda [`CREATE SOURCE`](/sql/create-source/) connects Materialize to an external system you want to read data from, and provides details about how to decode and interpret that data. To connect to a Kafka/Redpanda broker (and optionally a schema registry), you first need to [create a connection](#prerequisite-creating-a-connection) that specifies access and authentication parameters. Once created, a connection is **reusable** across multiple `CREATE SOURCE` and `CREATE SINK` statements. > **Note:** The same syntax, supported formats and features can be used to connect to a > [Redpanda](/integrations/redpanda/) broker. ## Syntax **Format Avro:** ### Format Avro Materialize can decode Avro messages by integrating with a schema registry to retrieve a schema, and automatically determine the columns and data types to use in the source. ```mzsql CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ] FROM KAFKA CONNECTION ( TOPIC '' [, GROUP ID PREFIX ''] [, START OFFSET ( [, ...] 
) ] [, START TIMESTAMP ] ) FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY CONNECTION [KEY STRATEGY ] [VALUE STRATEGY ] [INCLUDE KEY [AS ] | PARTITION [AS ] | OFFSET [AS ] | TIMESTAMP [AS ] | HEADERS [AS ] | HEADER '' AS [BYTES] [, ...] ] [ENVELOPE NONE | DEBEZIUM | UPSERT [ ( VALUE DECODING ERRORS = INLINE [AS ] ) ] ] [EXPOSE PROGRESS AS ] [WITH (RETAIN HISTORY FOR )] ``` | Syntax element | Description | | --- | --- | | `` | The name for the source. | | **IF NOT EXISTS** | Optional. If specified, do not throw an error if a source with the same name already exists. Instead, issue a notice and skip the source creation. | | **IN CLUSTER** `` | Optional. The [cluster](/sql/create-cluster) to maintain this source. | | `` | The name of the Kafka connection to use in the source. For details on creating connections, check the [`CREATE CONNECTION`](/sql/create-connection) documentation page. | | `''` | The Kafka topic you want to subscribe to. | | **GROUP ID PREFIX** `` | Optional. The prefix of the consumer group ID to use. See [Monitoring consumer lag](#monitoring-consumer-lag).
Default: `materialize-{REGION-ID}-{CONNECTION-ID}-{SOURCE_ID}` | | **START OFFSET** (`` [, ...]) | Optional. Read partitions from the specified offset. You cannot update the offsets once a source has been created; you will need to recreate the source. Offset values must be zero or positive integers. See [Setting start offsets](#setting-start-offsets) for details. | | **START TIMESTAMP** `` | Optional. Use the specified value to set `START OFFSET` based on the Kafka timestamp. Negative values will be interpreted as relative to the current system time in milliseconds (e.g. `-1000` means 1000 ms ago). See [Time-based offsets](#time-based-offsets) for details. | | `` | The Confluent Schema Registry connection to use in the source. | | **KEY STRATEGY** `` | Optional. Define how an Avro reader schema will be chosen for the message key. \| Strategy \| Description \| \|--------\|-------------\| \| **LATEST** \| (Default) Use the latest writer schema from the schema registry as the reader schema. \| \| **ID** \| Use a specific schema from the registry. \| \| **INLINE** \| Use the inline schema. \| | | **VALUE STRATEGY** `` | Optional. Define how an Avro reader schema will be chosen for the message value. \| Strategy \| Description \| \|--------\|-------------\| \| **LATEST** \| (Default) Use the latest writer schema from the schema registry as the reader schema. \| \| **ID** \| Use a specific schema from the registry. \| \| **INLINE** \| Use the inline schema. \| | | **INCLUDE** `` | Optional. If specified, include the additional information as column(s) in the table. The following ``s are supported: \| Option \| Description \| \|--------\|-------------\| \| **KEY [AS \]** \| Include a column containing the Kafka message key. If the key is encoded using a format that includes schemas, the column will take its name from the schema. For unnamed formats (e.g. `TEXT`), the column will be named `key`. The column can be renamed with the optional **AS** *name* statement. 
\| **PARTITION [AS \]** \| Include a `partition` column containing the Kafka message partition. The column can be renamed with the optional **AS** *name* clause. \| **OFFSET [AS \]** \| Include an `offset` column containing the Kafka message offset. The column can be renamed with the optional **AS** *name* clause. \| **TIMESTAMP [AS \]** \| Include a `timestamp` column containing the Kafka message timestamp. The column can be renamed with the optional **AS** *name* clause.

Note that the timestamp of a Kafka message depends on how the topic and its producers are configured. See the [Confluent documentation](https://docs.confluent.io/3.0.0/streams/concepts.html?#time) for details. \| **HEADERS [AS \]** \| Include a `headers` column containing the Kafka message headers as a list of records of type `(key text, value bytea)`. The column can be renamed with the optional **AS** *name* clause. \| **HEADER \ AS \ [**BYTES**]** \| Include a *name* column containing the Kafka message header *key* parsed as a UTF-8 string. To expose the header value as `bytea`, use the `BYTES` option. | | **ENVELOPE** `` | Optional. Specifies how Materialize interprets incoming records. Valid envelope types: \| Envelope \| Description \| \|----------\|-------------\| \| `NONE` \| Append-only envelope (default). Each message is inserted as a new row. \| \| `DEBEZIUM` \| Decode Kafka messages produced by [Debezium](https://debezium.io/). \| \| `UPSERT [ ( VALUE DECODING ERRORS = INLINE [AS ] ) ]` \| Use the standard key-value convention to support inserts, updates, and deletes. Required to consume [log compacted topics](https://docs.confluent.io/platform/current/kafka/design.html#log-compaction). \| | | **EXPOSE PROGRESS AS** `` | Optional. The name of the progress collection for the source. If this is not specified, the progress collection will be named `_progress`. See [Monitoring source progress](#monitoring-source-progress) for details. | | **WITH** (`` [, ...]) | Optional. The following ``s are supported: \| Option \| Description \| \|--------\|-------------\| \| `RETAIN HISTORY FOR ` \| ***Private preview.** This option has known performance or stability issues and is under active development.* Duration for which Materialize retains historical data, which is useful to implement [durable subscriptions](/transform-data/patterns/durable-subscriptions/#history-retention-period). Accepts positive [interval](/sql/types/interval/) values (e.g. `'1hr'`). 
Default: `1s`. \| | #### Schema versioning The _latest_ schema is retrieved using the [`TopicNameStrategy`](https://docs.confluent.io/current/schema-registry/serdes-develop/index.html) strategy at the time the `CREATE SOURCE` statement is issued. #### Schema evolution As long as the writer schema changes in a [compatible way](https://avro.apache.org/docs/++version++/specification/#schema-resolution), Materialize will continue using the original reader schema definition by mapping values from the new to the old schema version. To use the new version of the writer schema in Materialize, you need to **drop and recreate** the source. #### Name collision To avoid [case-sensitivity](/sql/identifiers/#case-sensitivity) conflicts with Materialize identifiers, we recommend double-quoting all field names when working with Avro-formatted sources. #### Supported types Materialize supports all [Avro types](https://avro.apache.org/docs/++version++/specification/), _except for_ recursive types and union types in arrays. **Format JSON:** ### Format JSON Materialize can decode JSON messages into a single column named `data` with type `jsonb`. Refer to the [`jsonb` type](/sql/types/jsonb) documentation for the supported operations on this type. ```mzsql CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ] FROM KAFKA CONNECTION ( TOPIC '' [, GROUP ID PREFIX ''] [, START OFFSET ( [, ...] ) ] [, START TIMESTAMP ] ) FORMAT JSON [INCLUDE PARTITION [AS ] | OFFSET [AS ] | TIMESTAMP [AS ] | HEADERS [AS ] | HEADER '' AS [BYTES] [, ...] ] [ENVELOPE NONE] [EXPOSE PROGRESS AS ] [WITH (RETAIN HISTORY FOR )] ``` | Syntax element | Description | | --- | --- | | `` | The name for the source. | | **IF NOT EXISTS** | Optional. If specified, do not throw an error if a source with the same name already exists. Instead, issue a notice and skip the source creation. | | **IN CLUSTER** `` | Optional. The [cluster](/sql/create-cluster) to maintain this source. 
| | **CONNECTION** `` | The name of the Kafka connection to use in the source. For details on creating connections, check the [`CREATE CONNECTION`](/sql/create-connection) documentation page. |
| **TOPIC** `''` | **Required.** The Kafka topic you want to subscribe to. |
| **GROUP ID PREFIX** `` | Optional. The prefix of the consumer group ID to use. See [Monitoring consumer lag](#monitoring-consumer-lag). Default: `materialize-{REGION-ID}-{CONNECTION-ID}-{SOURCE_ID}` |
| **START OFFSET** (`` [, ...]) | Optional. Read partitions from the specified offset. You cannot update the offsets once a source has been created; you will need to recreate the source. Offset values must be zero or positive integers. See [Setting start offsets](#setting-start-offsets) for details. |
| **START TIMESTAMP** `` | Optional. Use the specified value to set `START OFFSET` based on the Kafka timestamp. Negative values will be interpreted as relative to the current system time in milliseconds (e.g. `-1000` means 1000 ms ago). See [Time-based offsets](#time-based-offsets) for details. |
| **FORMAT JSON** | Decode JSON-formatted messages. JSON-formatted messages are ingested as a JSON blob. We recommend creating a parsing view on top of your Kafka source that maps the individual fields to columns with the required data types. |
| **INCLUDE** `` | Optional. If specified, include the additional information as column(s) in the table. The following ``s are supported: \| Option \| Description \| \|--------\|-------------\| \| `PARTITION [AS ]` \| Expose the Kafka partition as a column. See [Partition, offset, timestamp](#partition-offset-timestamp) for details. \| \| `OFFSET [AS ]` \| Expose the Kafka offset as a column. See [Partition, offset, timestamp](#partition-offset-timestamp) for details. \| \| `TIMESTAMP [AS ]` \| Expose the Kafka timestamp as a column. See [Partition, offset, timestamp](#partition-offset-timestamp) for details. \| \| `HEADERS [AS ]` \| Expose all message headers as a column with type `record(key: text, value: bytea?) list`. See [Headers](#headers) for details. \| \| `HEADER '' AS [BYTES]` \| Expose a specific message header as a column. The `bytea` value is automatically parsed into a UTF-8 string unless `BYTES` is specified. See [Headers](#headers) for details. \| |
| **ENVELOPE** `` | Optional. Specifies how Materialize interprets incoming records. Valid envelope types: \| Envelope \| Description \| \|----------\|-------------\| \| `NONE` \| Append-only envelope (default). Each message is inserted as a new row. See [Append-only envelope](/sql/create-source/kafka/#append-only-envelope) for details. \| |
| **EXPOSE PROGRESS AS** `` | Optional. The name of the progress collection for the source. If this is not specified, the progress collection will be named `_progress`. See [Monitoring source progress](#monitoring-source-progress) for details. |
| **WITH** (`` [, ...]) | Optional. The following ``s are supported: \| Option \| Description \| \|--------\|-------------\| \| `RETAIN HISTORY FOR ` \| ***Private preview.** This option has known performance or stability issues and is under active development.* Duration for which Materialize retains historical data, which is useful to implement [durable subscriptions](/transform-data/patterns/durable-subscriptions/#history-retention-period). Accepts positive [interval](/sql/types/interval/) values (e.g. `'1hr'`). Default: `1s`. \| |

If your JSON messages have a consistent shape, we recommend creating a parsing [view](/concepts/views) that maps the individual fields to columns with the required data types:

```mzsql
-- extract jsonb into typed columns
CREATE VIEW my_typed_source AS
  SELECT
    (data->>'field1')::boolean AS field_1,
    (data->>'field2')::int AS field_2,
    (data->>'field3')::float AS field_3
  FROM my_jsonb_source;
```

To avoid doing this task manually, you can use [this **JSON parsing widget**](/sql/types/jsonb/#parsing).

#### Schema registry integration

Retrieving schemas from a schema registry is not supported yet for JSON-formatted sources. This means that Materialize cannot decode messages serialized using the [JSON Schema](https://docs.confluent.io/platform/current/schema-registry/serdes-develop/serdes-json.html#json-schema-serializer-and-deserializer) serialization format (`JSON_SR`).
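
As a rough illustration of how the JSON source options described above fit together, the statement below sketches an append-only JSON source that also exposes the Kafka partition, offset, and timestamp as columns. The source name `json_events`, the connection name `kafka_connection`, the topic `events`, and the column aliases are hypothetical placeholders, not names defined on this page.

```mzsql
-- Hypothetical names: json_events, kafka_connection, events, kafka_* aliases.
CREATE SOURCE IF NOT EXISTS json_events
  FROM KAFKA CONNECTION kafka_connection (TOPIC 'events')
  FORMAT JSON
  INCLUDE PARTITION AS kafka_partition, OFFSET AS kafka_offset, TIMESTAMP AS kafka_ts
  ENVELOPE NONE;
```

Because no schema is retrieved for JSON, a source like this would still need a parsing view to turn the JSON blob into typed columns.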
**Format TEXT/BYTES:**

### Format Text/Bytes

Materialize can:

- Parse **new-line delimited** data as plain text. Data is assumed to be **valid unicode** (UTF-8), and discarded if it cannot be converted to UTF-8. Text-formatted sources have a single column, by default named `text`. For details on casting, check the [`text`](/sql/types/text/) documentation.
- Read raw bytes without applying any formatting or decoding. Raw byte-formatted sources have a single column, by default named `data`. For details on encodings and casting, check the [`bytea`](/sql/types/bytea/) documentation.

```mzsql
CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ]
FROM KAFKA CONNECTION (
  TOPIC ''
  [, GROUP ID PREFIX '']
  [, START OFFSET ( [, ...] ) ]
  [, START TIMESTAMP ]
)
FORMAT TEXT | BYTES
[INCLUDE PARTITION [AS ] | OFFSET [AS ] | TIMESTAMP [AS ] | HEADERS [AS ] | HEADER '' AS [BYTES] [, ...] ]
[ENVELOPE NONE]
[EXPOSE PROGRESS AS ]
[WITH (RETAIN HISTORY FOR )]
```

| Syntax element | Description |
| --- | --- |
| `` | The name for the source. |
| **IF NOT EXISTS** | Optional. If specified, do not throw an error if a source with the same name already exists. Instead, issue a notice and skip the source creation. |
| **IN CLUSTER** `` | Optional. The [cluster](/sql/create-cluster) to maintain this source. |
| **CONNECTION** `` | The name of the Kafka connection to use in the source. For details on creating connections, check the [`CREATE CONNECTION`](/sql/create-connection) documentation page. |
| **TOPIC** `''` | **Required.** The Kafka topic you want to subscribe to. |
| **GROUP ID PREFIX** `` | Optional. The prefix of the consumer group ID to use. See [Monitoring consumer lag](#monitoring-consumer-lag). Default: `materialize-{REGION-ID}-{CONNECTION-ID}-{SOURCE_ID}` |
| **START OFFSET** (`` [, ...]) | Optional. Read partitions from the specified offset. You cannot update the offsets once a source has been created; you will need to recreate the source. Offset values must be zero or positive integers. See [Setting start offsets](#setting-start-offsets) for details. |
| **START TIMESTAMP** `` | Optional. Use the specified value to set `START OFFSET` based on the Kafka timestamp. Negative values will be interpreted as relative to the current system time in milliseconds (e.g. `-1000` means 1000 ms ago). See [Time-based offsets](#time-based-offsets) for details. |
| **FORMAT TEXT\|BYTES** | - If `TEXT`, decode new-line delimited data as plain text. Data is assumed to be valid unicode (UTF-8), and discarded if it cannot be converted to UTF-8. Text-formatted sources have a single column, by default named `text`. - If `BYTES`, read raw bytes without applying any formatting or decoding. Raw byte-formatted sources have a single column, by default named `data`. |
| **INCLUDE** `` | Optional. If specified, include the additional information as column(s) in the table. The following ``s are supported: \| Option \| Description \| \|--------\|-------------\| \| `PARTITION [AS ]` \| Expose the Kafka partition as a column. See [Partition, offset, timestamp](#partition-offset-timestamp) for details. \| \| `OFFSET [AS ]` \| Expose the Kafka offset as a column. See [Partition, offset, timestamp](#partition-offset-timestamp) for details. \| \| `TIMESTAMP [AS ]` \| Expose the Kafka timestamp as a column. See [Partition, offset, timestamp](#partition-offset-timestamp) for details. \| \| `HEADERS [AS ]` \| Expose all message headers as a column with type `record(key: text, value: bytea?) list`. See [Headers](#headers) for details. \| \| `HEADER '' AS [BYTES]` \| Expose a specific message header as a column. The `bytea` value is automatically parsed into a UTF-8 string unless `BYTES` is specified. See [Headers](#headers) for details. \| |
| **ENVELOPE** `` | Optional. Specifies how Materialize interprets incoming records. Valid envelope types: \| Envelope \| Description \| \|----------\|-------------\| \| `NONE` \| Append-only envelope (default). Each message is inserted as a new row. See [Append-only envelope](/sql/create-source/kafka/#append-only-envelope) for details. \| |
| `EXPOSE PROGRESS AS ` | Optional. The name of the progress collection for the source. If this is not specified, the progress collection will be named `_progress`. See [Monitoring source progress](#monitoring-source-progress) for details. |
| `WITH ( [, ...])` | Optional. The following ``s are supported: \| Option \| Description \| \|--------\|-------------\| \| `RETAIN HISTORY FOR ` \| ***Private preview.** This option has known performance or stability issues and is under active development.* Duration for which Materialize retains historical data, which is useful to implement [durable subscriptions](/transform-data/patterns/durable-subscriptions/#history-retention-period). Accepts positive [interval](/sql/types/interval/) values (e.g. `'1hr'`). Default: `1s`. \| |

**Format CSV:**

### Format CSV

Materialize can parse CSV-formatted data. The data in CSV sources is read as [`text`](/sql/types/text).

```mzsql
CREATE SOURCE [IF NOT EXISTS] ( [, ...] ) [IN CLUSTER ]
FROM KAFKA CONNECTION (
  TOPIC ''
  [, GROUP ID PREFIX '']
  [, START OFFSET ( [, ...] ) ]
  [, START TIMESTAMP ]
)
FORMAT CSV WITH COLUMNS | WITH HEADER [ ( [, ...] ) ]
[INCLUDE PARTITION [AS ] | OFFSET [AS ] | TIMESTAMP [AS ] | HEADERS [AS ] | HEADER '' AS [BYTES] [, ...] ]
[ENVELOPE NONE]
[EXPOSE PROGRESS AS ]
[WITH (RETAIN HISTORY FOR )]
```

| Syntax element | Description |
| --- | --- |
| ` ( [, ...] )` | The name for the source and the column names. Column names are required for CSV-formatted sources. |
| **IF NOT EXISTS** | Optional. If specified, do not throw an error if a source with the same name already exists. Instead, issue a notice and skip the source creation. |
| **IN CLUSTER** `` | Optional. The [cluster](/sql/create-cluster) to maintain this source. |
| **CONNECTION** `` | The name of the Kafka connection to use in the source. For details on creating connections, check the [`CREATE CONNECTION`](/sql/create-connection) documentation page. |
| **TOPIC** `''` | **Required.** The Kafka topic you want to subscribe to. |
| **GROUP ID PREFIX** `` | Optional. The prefix of the consumer group ID to use. See [Monitoring consumer lag](#monitoring-consumer-lag). Default: `materialize-{REGION-ID}-{CONNECTION-ID}-{SOURCE_ID}` |
| **START OFFSET** (`` [, ...]) | Optional. Read partitions from the specified offset. You cannot update the offsets once a source has been created; you will need to recreate the source. Offset values must be zero or positive integers. See [Setting start offsets](#setting-start-offsets) for details. |
| **START TIMESTAMP** `` | Optional. Use the specified value to set `START OFFSET` based on the Kafka timestamp. Negative values will be interpreted as relative to the current system time in milliseconds (e.g. `-1000` means 1000 ms ago). See [Time-based offsets](#time-based-offsets) for details. |
| **FORMAT CSV WITH** `` | CSV format options: \| Option \| Description \| \|--------\|-------------\| \| `WITH COLUMNS` \| Treat the source data as if it has `` columns. By default, columns are named `column1`, `column2`...`columnN`, but you can override these names by specifying column names in the source definition. \| \| `WITH HEADER [ ( [, ...] ) ]` \| Materialize determines the number of columns and the name of each column using the header row. The header is not ingested as data. Optionally, you can provide a list of column names to validate against the header or override the source column names. \| Any row that does not match the number of columns determined by the format is ignored, and Materialize logs an error. |
| **INCLUDE** `` | Optional. If specified, include the additional information as column(s) in the table. The following ``s are supported: \| Option \| Description \| \|--------\|-------------\| \| `PARTITION [AS ]` \| Expose the Kafka partition as a column. See [Partition, offset, timestamp](#partition-offset-timestamp) for details. \| \| `OFFSET [AS ]` \| Expose the Kafka offset as a column. See [Partition, offset, timestamp](#partition-offset-timestamp) for details. \| \| `TIMESTAMP [AS ]` \| Expose the Kafka timestamp as a column. See [Partition, offset, timestamp](#partition-offset-timestamp) for details. \| \| `HEADERS [AS ]` \| Expose all message headers as a column with type `record(key: text, value: bytea?) list`. See [Headers](#headers) for details. \| \| `HEADER '' AS [BYTES]` \| Expose a specific message header as a column. The `bytea` value is automatically parsed into a UTF-8 string unless `BYTES` is specified. See [Headers](#headers) for details. \| |
| **ENVELOPE** `` | Optional. Specifies how Materialize interprets incoming records. CSV format only supports `NONE`: \| Envelope \| Description \| \|----------\|-------------\| \| `NONE` \| Append-only envelope (default). Each message is inserted as a new row. See [Append-only envelope](/sql/create-source/kafka/#append-only-envelope) for details. \| |
| **EXPOSE PROGRESS AS** `` | Optional. The name of the progress collection for the source. If this is not specified, the progress collection will be named `_progress`. See [Monitoring source progress](#monitoring-source-progress) for details. |
| **WITH** (`` [, ...]) | Optional. The following ``s are supported: \| Option \| Description \| \|--------\|-------------\| \| `RETAIN HISTORY FOR ` \| ***Private preview.** This option has known performance or stability issues and is under active development.* Duration for which Materialize retains historical data, which is useful to implement [durable subscriptions](/transform-data/patterns/durable-subscriptions/#history-retention-period). Accepts positive [interval](/sql/types/interval/) values (e.g. `'1hr'`). Default: `1s`. \| |

**Format Protobuf:**

### Format Protobuf

Materialize can decode Protobuf messages by integrating with a schema registry or parsing an inline schema to retrieve a `.proto` schema definition. It can then automatically define the columns and data types to use in the source.

```mzsql
CREATE SOURCE [IF NOT EXISTS] [IN CLUSTER ]
FROM KAFKA CONNECTION (
  TOPIC ''
  [, GROUP ID PREFIX '']
  [, START OFFSET ( [, ...] ) ]
  [, START TIMESTAMP ]
)
FORMAT PROTOBUF USING CONFLUENT SCHEMA REGISTRY CONNECTION
  | FORMAT PROTOBUF MESSAGE '' USING SCHEMA ''
[INCLUDE KEY [AS ] | PARTITION [AS ] | OFFSET [AS ] | TIMESTAMP [AS ] | HEADERS [AS ] | HEADER '' AS [BYTES] [, ...] ]
[ENVELOPE NONE | UPSERT [ ( VALUE DECODING ERRORS = INLINE [AS ] ) ] ]
[EXPOSE PROGRESS AS ]
[WITH (RETAIN HISTORY FOR )]
```

| Syntax element | Description |
| --- | --- |
| `` | The name for the source. |
| **IF NOT EXISTS** | Optional. If specified, do not throw an error if a source with the same name already exists. Instead, issue a notice and skip the source creation. |
| **IN CLUSTER** `` | Optional. The [cluster](/sql/create-cluster) to maintain this source. |
| `` | The name of the Kafka connection to use in the source. For details on creating connections, check the [`CREATE CONNECTION`](/sql/create-connection) documentation page. |
| `''` | The Kafka topic you want to subscribe to. |
| **GROUP ID PREFIX** `` | Optional. The prefix of the consumer group ID to use. See [Monitoring consumer lag](#monitoring-consumer-lag). Default: `materialize-{REGION-ID}-{CONNECTION-ID}-{SOURCE_ID}` |
| **START OFFSET** (`` [, ...]) | Optional. Read partitions from the specified offset. You cannot update the offsets once a source has been created; you will need to recreate the source. Offset values must be zero or positive integers. See [Setting start offsets](#setting-start-offsets) for details. |
| **START TIMESTAMP** `

| MySQL Configuration | Value | Notes |
| --- | --- | --- |
| `log_bin` | `ON` | |
| `binlog_format` | `ROW` | Deprecated as of MySQL 8.0.34. Newer versions of MySQL default to row-based logging. |
| `binlog_row_image` | `FULL` | |
| `gtid_mode` | `ON` | |
| `enforce_gtid_consistency` | `ON` | |
| `replica_preserve_commit_order` | `ON` | Only required when connecting Materialize to a read-replica. |
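
One way to confirm these settings is to query them on the MySQL server itself (not in Materialize) before creating the source. The statement below is a minimal sketch using MySQL's standard `SHOW VARIABLES` statement; the values it returns should match the table above.

```sql
-- Run against the upstream MySQL server.
-- Lists the current value of each setting from the table above.
SHOW VARIABLES
WHERE Variable_name IN (
  'log_bin',
  'binlog_format',
  'binlog_row_image',
  'gtid_mode',
  'enforce_gtid_consistency',
  'replica_preserve_commit_order'
);
```

On MySQL versions that predate the `replica_preserve_commit_order` name, that row may simply be absent from the result.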