The Pub/Sub to ClickHouse template is a streaming pipeline that reads JSON-encoded messages from a Pub/Sub subscription and writes them into a ClickHouse table. Messages that fail to parse or fail to map to the target schema are routed to a dead-letter destination: a ClickHouse table, a Pub/Sub topic, or both.Documentation Index
Fetch the complete documentation index at: https://private-7c7dfe99-mintlify-bd738d80.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
Pipeline requirements
- The source Pub/Sub subscription must exist.
- Messages published to the subscription must be valid JSON.
- The target ClickHouse table must exist, and its column names must match the field names in the JSON payload.
- The ClickHouse host must be accessible from the Dataflow worker machines.
- At least one dead-letter destination (
clickHouseDeadLetterTableordeadLetterTopic) must be provided. If both are provided, failed messages are routed to both destinations simultaneously. - When
clickHouseDeadLetterTableis set, the dead-letter table must already exist in ClickHouse with the schema shown in Dead-letter handling. - When
deadLetterTopicis set, the Pub/Sub topic must already exist.
Template parameters
| Parameter Name | Parameter Description | Required | Notes |
|---|---|---|---|
inputSubscription | The Pub/Sub subscription to read messages from. Example: projects/<PROJECT_ID>/subscriptions/<SUBSCRIPTION_NAME>. | ✅ | Messages must be JSON-encoded. |
clickHouseUrl | The ClickHouse endpoint URL. Use https:// for SSL connections (ClickHouse Cloud) or http:// for non-SSL connections. Example: https://<HOST>:8443 or http://<HOST>:8123. | ✅ | For ClickHouse Cloud, use the HTTPS endpoint on port 8443. |
clickHouseDatabase | The name of the ClickHouse database where the target table resides. Example: default. | ✅ | |
clickHouseTable | The name of the ClickHouse table to write data into. | ✅ | The table must exist before running the pipeline. |
clickHouseUsername | The username to authenticate with ClickHouse. | ✅ | |
clickHousePassword | The password to authenticate with ClickHouse. | ✅ | |
clickHouseDeadLetterTable | The ClickHouse table to write failed messages into. Example: my_table_dead_letter. | At least one of clickHouseDeadLetterTable or deadLetterTopic must be provided. The table must exist with the dead-letter schema shown in Dead-letter handling. | |
deadLetterTopic | The Pub/Sub topic to publish failed messages to. Example: projects/<PROJECT_ID>/topics/<TOPIC_NAME>. | At least one of clickHouseDeadLetterTable or deadLetterTopic must be provided. Failed payloads are published to the topic with errorMessage and failedAt set as message attributes. | |
windowSeconds | Duration in seconds for time-based batching windows. | See Batching and windowing for the interaction with batchRowCount. If neither is set, combined mode uses defaults of 30s and 1000 rows. | |
batchRowCount | Number of rows to accumulate before flushing to ClickHouse. | See Batching and windowing for the interaction with windowSeconds. | |
maxInsertBlockSize | Maximum number of rows per INSERT statement sent to ClickHouse. Defaults to 1,000,000. | A ClickHouseIO option. | |
maxRetries | Maximum number of retry attempts for failed ClickHouse inserts. Defaults to 5. | A ClickHouseIO option. | |
insertDeduplicate | Whether to enable deduplication for INSERT queries in replicated ClickHouse tables. Defaults to true. | A ClickHouseIO option. | |
insertQuorum | For INSERT queries in replicated tables, wait for the specified number of replicas to acknowledge the write and linearize the data addition. 0 disables quorum writes. | A ClickHouseIO option. Disabled in default server settings. | |
insertDistributedSync | If enabled, INSERT queries into distributed tables wait until data is sent to all nodes in the cluster. Defaults to true. | A ClickHouseIO option. |
Default values for all
ClickHouseIO parameters can be found in ClickHouseIO Apache Beam Connector.Message format and schema mapping
Pub/Sub messages must be JSON objects whose top-level field names exactly match the column names of the target ClickHouse table. To map incoming messages onto the target table, the pipeline performs the following at startup:- Fetches the schema of the target ClickHouse table.
- Builds a Beam
Rowschema from that ClickHouse schema. - For each incoming Pub/Sub message, parses the JSON payload and assembles a row by reading the fields named in the ClickHouse schema.
Type conversion
JSON values are coerced into the corresponding ClickHouse column type:| ClickHouse Type | Notes |
|---|---|
Float32 | Parsed via Float.valueOf. |
Float64 | Parsed via Double.valueOf. |
Date | Parsed as an ISO-8601 date string. |
DateTime | Parsed as an ISO-8601 datetime string (e.g. 2026-01-15T12:34:56Z). |
Array(T) | JSON array; each element is converted to the element type T. Empty or missing arrays produce an empty array. |
Integer types (Int8/Int16/Int32/Int64, UInt8/UInt16/UInt32/UInt64) | Parsed from the JSON number or its string representation. |
String | Used as-is for textual fields; non-textual JSON nodes are serialized to their JSON string form. |
Batching and windowing
Because the pipeline is streaming, incoming rows are accumulated into windows before being flushed to ClickHouse. The windowing strategy is selected from the parameters you provide:windowSeconds | batchRowCount | Behavior |
|---|---|---|
| set | unset | Time-based fixed windows of windowSeconds. |
| unset | set | Global window with a count trigger; fires every batchRowCount rows. |
| both set | both set | Global window with a combined trigger; fires on whichever condition is met first (time or row count). |
| neither set | neither set | Combined mode with defaults: 30 seconds or 1000 rows, whichever comes first. |
INSERT batches.
Dead-letter handling
Messages that fail JSON parsing, schema mapping, or type coercion are routed to the configured dead-letter destination(s). At least one ofclickHouseDeadLetterTable or deadLetterTopic must be provided; if both are set, failed messages are sent to both.
ClickHouse dead-letter table
WhenclickHouseDeadLetterTable is set, the dead-letter table must already exist with this fixed schema:
| Column | Type | Description |
|---|---|---|
raw_message | String | The original Pub/Sub message payload as UTF-8 text. |
error_message | String | The exception message describing why the row failed. |
stack_trace | String | The full Java stack trace captured at failure time. |
failed_at | DateTime | The processing-time timestamp at which the row failed. |
Adapt the engine and
ORDER BY clause for your deployment — use ReplicatedMergeTree for replicated tables, add ON CLUSTER for distributed setups, and adjust partitioning or TTL as needed.Pub/Sub dead-letter topic
WhendeadLetterTopic is set, each failed message is republished to the topic with:
- Payload: the original message bytes.
- Attribute
errorMessage: the exception message captured at failure time. - Attribute
failedAt: the processing-time timestamp at which the row failed.
Running the template
The Pub/Sub to ClickHouse template is available from the Google Cloud Console.Be sure to review this document, and specifically the above sections, to fully understand the template’s configuration requirements and prerequisites.
-
Press the
CREATE JOB FROM TEMPLATEbutton. - Once the template form is open, enter a job name and select the desired region.
-
In the
Dataflow Templateinput, typeClickHouseorPub/Sub, and select thePub/Sub to ClickHousetemplate. -
Once selected, the form expands. Fill in:
- The Pub/Sub input subscription, in the form
projects/<PROJECT_ID>/subscriptions/<SUBSCRIPTION_NAME>. - The ClickHouse endpoint URL — for ClickHouse Cloud use
https://<HOST>:8443. - The ClickHouse database, target table, username and password.
- At least one dead-letter destination: a ClickHouse table or a Pub/Sub topic (or both).
- The Pub/Sub input subscription, in the form
-
Optionally customize batching (
windowSeconds,batchRowCount) andClickHouseIOtuning parameters, as detailed in the Template parameters section.
Monitor the job
Navigate to the Dataflow Jobs tab in your Google Cloud Console to monitor the status of the job. You’ll find the job details, including progress and any errors: The template also emits the following custom metrics under thePubSubToClickHouse namespace, viewable from the Dataflow job page:
| Metric | Type | Description |
|---|---|---|
messages-received | Counter | Total Pub/Sub messages received by the parsing step. |
rows-parsed-ok | Counter | Messages successfully converted to a row and routed to the main output. |
rows-parse-failed | Counter | Messages that failed parsing or schema mapping and were routed to dead-letter. |
message-payload-bytes | Distribution | Distribution of incoming Pub/Sub message payload sizes, in bytes. |
Troubleshooting
Memory limit (total) exceeded error (code 241)
This error occurs when ClickHouse runs out of memory while processing large batches of data. To resolve this issue:- Increase the instance resources: Upgrade your ClickHouse server to a larger instance with more memory to handle the data processing load.
- Decrease the batch size: Reduce
batchRowCount(and/ormaxInsertBlockSize) in your Dataflow job configuration to send smaller chunks of data to ClickHouse, reducing memory consumption per batch.
All messages are going to the dead-letter destination
The most common causes are:- The JSON field names do not match the ClickHouse column names exactly (matching is case-sensitive).
- A column type cannot be coerced from the JSON value (for example, a non-ISO-8601 string in a
DateTimecolumn). - The target table schema has changed since the pipeline started — the schema is fetched once at startup. Restart the job after applying schema changes.
error_message and stack_trace columns of the ClickHouse dead-letter table (or the errorMessage attribute on Pub/Sub dead-letter messages) to identify the root cause.
Pipeline starts but no rows arrive in ClickHouse
- Confirm the subscription is receiving messages — check the
messages-receivedmetric on the Dataflow job page. - In time-based mode (
windowSecondsonly), rows are flushed only at window boundaries. LowerwindowSecondsto verify flushes are occurring. - Verify network reachability between Dataflow workers and the ClickHouse endpoint (firewall, VPC peering, or private service connect).
Template source code
The template’s source code is available in:GoogleCloudPlatform/DataflowTemplates— the upstream Google Cloud Platform repository.ClickHouse/DataflowTemplates— ClickHouse’s fork.