
News

Posted over 4 years ago by Bhuvanesh R
We are living in the data lake world, and now almost every organization wants its reporting in near real time. Kafka is one of the best streaming platforms for real-time reporting. On top of the Kafka Connect framework, Red Hat designed Debezium, an open-source product that is highly recommended for real-time CDC from transactional databases. I referred to many blogs while setting up this cluster, but most of them covered only the basic installation steps, so I set up this cluster on AWS with a production-grade configuration and am publishing it here.

A short intro: Debezium is a set of distributed services that capture changes in your databases so that your applications can see those changes and respond to them. Debezium records all row-level changes within each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.

Basic tech terms:

Kafka Broker: Brokers are the core of Kafka streaming; they keep your messages and serve them to consumers.
Zookeeper: It maintains the cluster and node status and helps provide Kafka's availability.
Producers: The components that send messages (data) to the broker.
Consumers: The components that read messages from the queue for further analytics.
Confluent: Confluent has its own streaming platform, which uses Apache Kafka under the hood but adds more features.

Here Debezium is our data producer and the S3 sink is our consumer. For this setup, I'm going to stream MySQL data changes to S3 in a customized format.

AWS architecture: Kafka and Zookeeper are installed on the same EC2 instances. We'll deploy a 3-node Confluent Kafka cluster, with each node in a different availability zone.

172.31.47.152 - Zone A
172.31.38.158 - Zone B
172.31.46.207 - Zone C

The producer (Debezium) and consumer (S3 sink) will be hosted on the same EC2 instances, and we'll use 3 nodes for this as well.

172.31.47.12 - Zone A
172.31.38.183 - Zone B
172.31.46.136 - Zone C

Instance type: Kafka nodes generally need to be memory and network optimized. You can choose either persistent or ephemeral storage; I prefer persistent SSD disks for Kafka storage, so add an n GB disk to your Kafka broker nodes. For normal workloads it's better to go with the R5 instance family. Mount the volume at /kafkadata.

Security group: Use a new security group which allows the ports used in this setup between the nodes (2181, 2888 and 3888 for Zookeeper, 9092 for Kafka, and 8083 for Kafka Connect).

Installation: Install Java and Kafka on all the broker nodes.
-- Install OpenJDK
apt-get -y update
sudo apt-get -y install default-jre

-- Install the Confluent Kafka platform
wget -qO - https://packages.confluent.io/deb/5.3/archive.key | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://packages.confluent.io/deb/5.3 stable main"
sudo apt-get update && sudo apt-get install confluent-platform-2.12

Configuration: We need to configure the Zookeeper and Kafka properties. Edit /etc/kafka/zookeeper.properties on all the Kafka nodes.

-- On Node 1
dataDir=/var/lib/zookeeper
clientPort=2181
maxClientCnxns=0
server.1=0.0.0.0:2888:3888
server.2=172.31.38.158:2888:3888
server.3=172.31.46.207:2888:3888
autopurge.snapRetainCount=3
autopurge.purgeInterval=24
initLimit=5
syncLimit=2

-- On Node 2
dataDir=/var/lib/zookeeper
clientPort=2181
maxClientCnxns=0
server.1=172.31.47.152:2888:3888
server.2=0.0.0.0:2888:3888
server.3=172.31.46.207:2888:3888
autopurge.snapRetainCount=3
autopurge.purgeInterval=24
initLimit=5
syncLimit=2

-- On Node 3
dataDir=/var/lib/zookeeper
clientPort=2181
maxClientCnxns=0
server.1=172.31.47.152:2888:3888
server.2=172.31.38.158:2888:3888
server.3=0.0.0.0:2888:3888
autopurge.snapRetainCount=3
autopurge.purgeInterval=24
initLimit=5
syncLimit=2

We need to assign a unique ID to each Zookeeper node.

-- On Node 1
echo "1" > /var/lib/zookeeper/myid
-- On Node 2
echo "2" > /var/lib/zookeeper/myid
-- On Node 3
echo "3" > /var/lib/zookeeper/myid

Now we need to configure the Kafka brokers, so edit /etc/kafka/server.properties on all the Kafka nodes.

-- On Node 1
broker.id.generation.enable=true
delete.topic.enable=true
listeners=PLAINTEXT://:9092
zookeeper.connect=172.31.47.152:2181,172.31.38.158:2181,172.31.46.207:2181
log.dirs=/kafkadata/kafka
log.retention.hours=168
num.partitions=1

-- On Node 2
broker.id.generation.enable=true
delete.topic.enable=true
listeners=PLAINTEXT://:9092
log.dirs=/kafkadata/kafka
zookeeper.connect=172.31.47.152:2181,172.31.38.158:2181,172.31.46.207:2181
log.retention.hours=168
num.partitions=1

-- On Node 3
broker.id.generation.enable=true
delete.topic.enable=true
listeners=PLAINTEXT://:9092
log.dirs=/kafkadata/kafka
zookeeper.connect=172.31.47.152:2181,172.31.38.158:2181,172.31.46.207:2181
num.partitions=1
log.retention.hours=168

The next step is tuning the JVM heap size. In many deployments Kafka goes down because the heap is too small, so I'm allocating 50% of the memory to the heap. Keep in mind that an oversized heap is also harmful; refer to the tuning documentation before setting this value on very heavy systems.

vi /usr/bin/kafka-server-start
export KAFKA_HEAP_OPTS="-Xmx2G -Xms2G"

Another major problem in Kafka systems is running out of open file descriptors, so we need to allow Kafka to open at least 100000 files.

vi /etc/pam.d/common-session
session required pam_limits.so

vi /etc/security/limits.conf
* soft nofile 10000
* hard nofile 100000
cp-kafka soft nofile 10000
cp-kafka hard nofile 100000

Here cp-kafka is the default user for the Kafka process.

Create the Kafka data dir:

mkdir -p /kafkadata/kafka
chown -R cp-kafka:confluent /kafkadata/kafka
chmod 710 /kafkadata/kafka

Start the Kafka cluster:

sudo systemctl start confluent-zookeeper
sudo systemctl start confluent-kafka
sudo systemctl start confluent-schema-registry

Make sure Kafka starts automatically after an EC2 restart.

sudo systemctl enable confluent-zookeeper
sudo systemctl enable confluent-kafka
sudo systemctl enable confluent-schema-registry

Now our Kafka cluster is ready.
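Before moving on, it's worth confirming that all three brokers have registered themselves in Zookeeper. A minimal check, assuming the Confluent packages put the stock zookeeper-shell tool on the PATH (as the deb packages normally do):

zookeeper-shell localhost:2181 ls /brokers/ids
# Expected to end with something like: [1001, 1002, 1003]
# (with broker.id.generation.enable=true the broker IDs are auto-generated,
#  so the exact numbers may differ)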
To check the list of system topics, run the following command.

kafka-topics --list --zookeeper localhost:2181
__confluent.support.metrics

Setup Debezium: Install the Confluent connector packages and the Debezium MySQL connector on all the producer nodes.

apt-get update
sudo apt-get install default-jre
wget -qO - https://packages.confluent.io/deb/5.3/archive.key | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://packages.confluent.io/deb/5.3 stable main"
sudo apt-get update && sudo apt-get install confluent-hub-client confluent-common confluent-kafka-connect-s3 confluent-kafka-2.12

Configuration: Edit /etc/kafka/connect-distributed.properties on all the producer nodes so that our producers run in distributed mode.

-- On all the connector nodes
bootstrap.servers=172.31.47.152:9092,172.31.38.158:9092,172.31.46.207:9092
group.id=debezium-cluster
plugin.path=/usr/share/java,/usr/share/confluent-hub-components

Install the Debezium MySQL connector:

confluent-hub install debezium/debezium-connector-mysql:latest

It'll ask whether to make some changes; just select Y for everything.

Run the distributed connector as a service:

vi /lib/systemd/system/confluent-connect-distributed.service

[Unit]
Description=Apache Kafka - connect-distributed
Documentation=http://docs.confluent.io/
After=network.target

[Service]
Type=simple
User=cp-kafka
Group=confluent
ExecStart=/usr/bin/connect-distributed /etc/kafka/connect-distributed.properties
TimeoutStopSec=180
Restart=no

[Install]
WantedBy=multi-user.target

Start the service:

systemctl enable confluent-connect-distributed
systemctl start confluent-connect-distributed

Configure the Debezium MySQL connector: Create a mysql.json file which contains the MySQL connection information and other formatting options.

{
  "name": "mysql-connector-db01",
  "config": {
    "name": "mysql-connector-db01",
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.server.id": "1",
    "tasks.max": "3",
    "database.history.kafka.bootstrap.servers": "172.31.47.152:9092,172.31.38.158:9092,172.31.46.207:9092",
    "database.history.kafka.topic": "schema-changes.mysql",
    "database.server.name": "mysql-db01",
    "database.hostname": "172.31.84.129",
    "database.port": "3306",
    "database.user": "bhuvi",
    "database.password": "my_stong_password",
    "database.whitelist": "proddb,test",
    "internal.key.converter.schemas.enable": "false",
    "key.converter.schemas.enable": "false",
    "internal.key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "internal.value.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false",
    "internal.value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.add.source.fields": "ts_ms",
    "tombstones.on.delete": "false"
  }
}

"database.history.kafka.bootstrap.servers" - the Kafka servers' IPs.
"database.whitelist" - the list of databases to capture CDC from.
key.converter, value.converter and the transforms parameters - by default the Debezium output carries much more detailed information, but I don't want all of it; I'm only interested in the new row and the timestamp when it was inserted. (The MySQL user referenced above also needs replication privileges; see the sketch below.)
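Before registering the connector, make sure the MySQL user referenced in mysql.json has the privileges Debezium needs to read the binlog. A minimal sketch based on Debezium's documented requirements (the host pattern '%' is just for illustration; restrict it in production):

CREATE USER 'bhuvi'@'%' IDENTIFIED BY 'my_stong_password';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'bhuvi'@'%';
FLUSH PRIVILEGES;

-- The source server must also run with binary logging enabled and binlog_format=ROW
-- so that Debezium can capture row-level changes.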
If you don't want to customize anything, just remove everything after database.whitelist.

Register the MySQL connector:

curl -X POST -H "Accept: application/json" -H "Content-Type: application/json" http://localhost:8083/connectors -d @mysql.json

Check the status:

curl GET localhost:8083/connectors/mysql-connector-db01/status

{
  "name": "mysql-connector-db01",
  "connector": { "state": "RUNNING", "worker_id": "172.31.94.191:8083" },
  "tasks": [
    { "id": 0, "state": "RUNNING", "worker_id": "172.31.94.191:8083" }
  ],
  "type": "source"
}

Test the MySQL consumer: Now insert something into any table in proddb or test (because we have whitelisted only these databases for CDC).

use test;
create table rohi (
  id int,
  fn varchar(10),
  ln varchar(10),
  phone int
);
insert into rohi values (2, 'rohit', 'ayare','87611');

We can read these values from the Kafka brokers. Open any one of the Kafka nodes and run the command below. I prefer the Confluent CLI for this; it's not available by default, so download it manually.

curl -L https://cnfl.io/cli | sh -s -- -b /usr/bin/

Listen to the topic mysql-db01.test.rohi. The topic name is the combination of servername.databasename.tablename, where servername is the database.server.name you set in the mysql.json file.

confluent local consume mysql-db01.test.rohi
----
The local commands are intended for a single-node development environment only, NOT for production usage. https://docs.confluent.io/current/cli/index.html
-----
{"id":1,"fn":"rohit","ln":"ayare","phone":87611,"__ts_ms":1576757407000}

Setup the S3 sink connector on all producer nodes: I want to send this data to an S3 bucket, so you must have an EC2 IAM role which has access to the target S3 bucket. Alternatively, install awscli and configure an access and secret key (not recommended).

Install the S3 connector:

confluent-hub install confluentinc/kafka-connect-s3:latest

Create the s3.json file.

{
  "name": "s3-sink-db01",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "s3.bucket.name": "bhuvi-datalake",
    "name": "s3-sink-db01",
    "tasks.max": "3",
    "s3.region": "us-east-1",
    "s3.part.size": "5242880",
    "s3.compression.type": "gzip",
    "timezone": "UTC",
    "locale": "en",
    "flush.size": "10000",
    "rotate.interval.ms": "3600000",
    "topics.regex": "mysql-db01.(.*)",
    "internal.key.converter.schemas.enable": "false",
    "key.converter.schemas.enable": "false",
    "internal.key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "internal.value.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false",
    "internal.value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "partitioner.class": "io.confluent.connect.storage.partitioner.HourlyPartitioner",
    "path.format": "YYYY/MM/dd/HH",
    "partition.duration.ms": "3600000",
    "rotate.schedule.interval.ms": "3600000"
  }
}

"topics.regex": "mysql-db01.(.*)" - it sends data only from topics that have mysql-db01 as the prefix; in our case all the MySQL-related topics start with this prefix.
"flush.size" - data is uploaded to S3 only after this many records are stored, or after the "rotate.schedule.interval.ms" duration, whichever comes first.
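Since the S3 sink relies on the EC2 IAM role mentioned above, here is a minimal policy sketch for the example bucket. The bucket name comes from s3.json; the action list is an assumption of what the sink typically needs (multipart uploads plus bucket listing), so treat it as a starting point rather than a definitive policy.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::bhuvi-datalake"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:AbortMultipartUpload", "s3:ListMultipartUploadParts"],
      "Resource": "arn:aws:s3:::bhuvi-datalake/*"
    }
  ]
}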
Register this S3 sink connector:

curl -X POST -H "Accept: application/json" -H "Content-Type: application/json" http://localhost:8083/connectors -d @s3.json

Check the status:

curl GET localhost:8083/connectors/s3-sink-db01/status

{
  "name": "s3-sink-db01",
  "connector": { "state": "RUNNING", "worker_id": "172.31.94.191:8083" },
  "tasks": [
    { "id": 0, "state": "RUNNING", "worker_id": "172.31.94.191:8083" },
    { "id": 1, "state": "RUNNING", "worker_id": "172.31.94.191:8083" },
    { "id": 2, "state": "RUNNING", "worker_id": "172.31.94.191:8083" }
  ],
  "type": "sink"
}

Test the S3 sink: Insert 10000 rows into the rohi table and then check the S3 bucket. The data is saved in JSON format with GZIP compression, in hourly partitions.

Monitoring: Refer to the monitoring post in the Debezium series list below to set up monitoring for the MySQL connector.

More tuning:

The replication factor is the other main parameter for data durability.
Use internal IP addresses as much as you can.
By default Debezium uses 1 partition per topic. You can configure this based on your workload, but more partitions need more throughput (an example command follows at the end of this post).

References:

Setup Kafka in production by Confluent
How to choose the number of partitions
Open file descriptors for Kafka
Kafka best practices in AWS
Debezium documentation
Customize Debezium output with SMT

Debezium series blogs:

Build Production Grade Debezium Cluster With Confluent Kafka
Monitor Debezium MySQL Connector With Prometheus And Grafana
Debezium MySQL Snapshot From Read Replica With GTID
Debezium MySQL Snapshot From Read Replica And Resume From Master
Debezium MySQL Snapshot For AWS RDS Aurora From Backup Snaphot
RealTime CDC From MySQL Using AWS MSK With Debezium
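Following up on the partition and replication factor notes under More tuning, here is a hedged example of creating one of the CDC topics explicitly instead of relying on auto-creation. The topic name follows the servername.databasename.tablename convention used earlier; the partition and replication counts are illustrative only.

kafka-topics --create --zookeeper localhost:2181 \
  --topic mysql-db01.test.rohi \
  --partitions 3 \
  --replication-factor 3

# Partitions of an existing topic can only be increased, e.g.:
kafka-topics --alter --zookeeper localhost:2181 --topic mysql-db01.test.rohi --partitions 6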
Posted over 4 years ago by Sveta Smirnova
First I want to thank everyone who attended my December 5, 2019 webinar "Introduction to MySQL Query Tuning for DevOps". Recording and slides are available on the webinar page. Here are answers to the questions from participants which I was not able to provide during the webinar.

Q: How to find stored execution plans and optimizer metadata stored in the mysql data dictionary (i.e. PS, IS, sys schema)?

A: The Optimizer creates the query execution plan each time MySQL Server executes the query. These plans are never stored. However, some of the information the optimizer uses to create the execution plan is stored and available. It includes:

Index statistics. You can find details using the SHOW INDEX command:

mysql> show index from employees;
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| employees | 0 | PRIMARY | 1 | emp_no | A | 299556 | NULL | NULL | | BTREE | | |
| employees | 1 | first_name | 1 | first_name | A | 1196 | NULL | NULL | | BTREE | | |
| employees | 1 | first_name | 2 | last_name | A | 280646 | NULL | NULL | | BTREE | | |
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
3 rows in set (0,00 sec)

Or by querying the information_schema.statistics table:

mysql> select * from information_schema.statistics where TABLE_SCHEMA='employees' and table_name='employees';
+---------------+--------------+------------+------------+--------------+------------+--------------+-------------+-----------+-------------+----------+--------+----------+------------+---------+---------------+
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | NON_UNIQUE | INDEX_SCHEMA | INDEX_NAME | SEQ_IN_INDEX | COLUMN_NAME | COLLATION | CARDINALITY | SUB_PART | PACKED | NULLABLE | INDEX_TYPE | COMMENT | INDEX_COMMENT |
+---------------+--------------+------------+------------+--------------+------------+--------------+-------------+-----------+-------------+----------+--------+----------+------------+---------+---------------+
| def | employees | employees | 0 | employees | PRIMARY | 1 | emp_no | A | 299556 | NULL | NULL | | BTREE | | |
| def | employees | employees | 1 | employees | first_name | 1 | first_name | A | 1196 | NULL | NULL | | BTREE | | |
| def | employees | employees | 1 | employees | first_name | 2 | last_name | A | 280646 | NULL | NULL | | BTREE | | |
+---------------+--------------+------------+------------+--------------+------------+--------------+-------------+-----------+-------------+----------+--------+----------+------------+---------+---------------+
3 rows in set (0,00 sec)

For InnoDB tables, you can additionally query the mysql.innodb_index_stats table, which stores the physical data the engine passes to the Optimizer:

mysql> select * from mysql.innodb_index_stats where database_name='employees' and table_name='employees';
+---------------+------------+------------+---------------------+--------------+------------+-------------+-----------------------------------+
| database_name | table_name | index_name | last_update | stat_name | stat_value | sample_size | stat_description |
+---------------+------------+------------+---------------------+--------------+------------+-------------+-----------------------------------+
| employees | employees | PRIMARY | 2019-12-12 18:22:40 | n_diff_pfx01 | 299556 | 20 | emp_no |
| employees | employees | PRIMARY | 2019-12-12 18:22:40 | n_leaf_pages | 886 | NULL | Number of leaf pages in the index |
| employees | employees | PRIMARY | 2019-12-12 18:22:40 | size | 929 | NULL | Number of pages in the index |
| employees | employees | first_name | 2019-12-12 21:49:02 | n_diff_pfx01 | 1196 | 20 | first_name |
| employees | employees | first_name | 2019-12-12 21:49:02 | n_diff_pfx02 | 280646 | 20 | first_name,last_name |
| employees | employees | first_name | 2019-12-12 21:49:02 | n_diff_pfx03 | 298471 | 20 | first_name,last_name,emp_no |
| employees | employees | first_name | 2019-12-12 21:49:02 | n_leaf_pages | 460 | NULL | Number of leaf pages in the index |
| employees | employees | first_name | 2019-12-12 21:49:02 | size | 545 | NULL | Number of pages in the index |
+---------------+------------+------------+---------------------+--------------+------------+-------------+-----------------------------------+
8 rows in set (0,01 sec)

The difference between the SHOW INDEX output and the mysql.innodb_index_stats table content is that the information in SHOW INDEX is purely virtual and calculated on each access, while the data in mysql.innodb_index_stats is physically stored and updated only when the InnoDB storage engine updates its statistics.

Since version 8.0: optimizer statistics, aka histograms. You can find details by querying the information_schema.column_statistics table:

mysql> select HISTOGRAM from information_schema.column_statistics
    -> where table_name='example'\G
*************************** 1. row ***************************
HISTOGRAM: {"buckets": [[1, 0.6], [2, 0.8], [3, 1.0]], "data-type": "int", "null-values": 0.0, "collation-id": 8, "last-updated": "2018-11-07 09:07:19.791470", "sampling-rate": 1.0, "histogram-type": "singleton", "number-of-buckets-specified": 3}
1 row in set (0.00 sec)

Optimizer trace: the actions the Optimizer performed to resolve the last optimizer_trace_limit queries. This data is stored in memory, is disabled by default, and is available in the information_schema.optimizer_trace table:

mysql> set optimizer_trace=1;
Query OK, 0 rows affected (0,00 sec)

mysql> select count(*) from employees where first_name like 'A%';
+----------+
| count(*) |
+----------+
| 22039 |
+----------+
1 row in set (0,16 sec)

mysql> select * from information_schema.optimizer_trace\G
*************************** 1. row ***************************
QUERY: select count(*) from employees where first_name like 'A%'
TRACE: {
  "steps": [
    {
      "join_preparation": {
        "select#": 1,
        "steps": [
          {
            "expanded_query": "/* select#1 */ select count(0) AS `count(*)` from `employees` where (`employees`.`first_name` like 'A%')"
          }
        ]
      }
    },
    {
      "join_optimization": {
        "select#": 1,
        "steps": [
          {
...

Q: Is it possible to list all possible plans available to the optimizer for an individual query? How?

A: Since query execution plans are created each time the query is executed, it is not possible to list them all.

Q: How to clear an existing plan for a particular query, or all optimizer metadata, from the data dictionary?

A: Since query execution plans are not stored anywhere, there is no need to clear them.
Q: Hey there, I have been working with MySQL for a long time. Now I want to build a system that will have complex queries with a combination of GROUP BY columns; I want them to complete in a couple of seconds and use as little RAM as possible. Your advice would be helpful.

A: This is a complicated question. I recommend you learn how MySQL can use indexes for GROUP BY. Start from the "GROUP BY Optimization" chapter of the MySQL Reference Manual.
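To make that advice concrete, here is a small illustrative sketch (the table and index names are made up for this example, not from the webinar): with a composite index whose leading column matches the GROUP BY, MySQL can often resolve the grouping from the index alone, sometimes with a loose index scan.

-- Hypothetical example schema
CREATE TABLE orders (
  id INT PRIMARY KEY,
  customer_id INT,
  order_date DATE,
  amount DECIMAL(10,2)
);

-- Composite index whose leftmost column matches the GROUP BY column
ALTER TABLE orders ADD INDEX idx_cust_date (customer_id, order_date);

-- MySQL can satisfy this from the index; when a loose index scan applies,
-- EXPLAIN shows "Using index for group-by" in the Extra column.
EXPLAIN SELECT customer_id, MIN(order_date)
FROM orders
GROUP BY customer_id;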
Posted over 4 years ago by The Pythian Group
When you have multiple database servers working together as nodes in a cluster, it's beneficial to understand how data consistency is established. In this post, we're going to take a look at the various cluster consistency levels that you can choose from within InnoDB Group Replication and see how they impact data flow and failure recovery.

Let's start with a quick refresher on what data consistency is. In a nutshell, consistency is just a way of thinking about how data is duplicated across multiple nodes in a cluster. If you write a row of data to one node, that data is not considered to be consistent within the cluster until it has been written to all of the other participating nodes. Without consistency, it's possible to write data to one node in the cluster but not see it immediately when reading from another node in the cluster. In some cases the aforementioned scenario, better known as eventual consistency, is acceptable, but sometimes you need to ensure the data you wrote is available on all your nodes before any subsequent reads occur. For that, we need to look at the system variable group_replication_consistency, which allows you to control consistency levels globally or per transaction. Let's have a look at each option and see what it means for consistency and your workload.

Eventual

The value of 'EVENTUAL' implies that you are okay with having eventual consistency in your cluster. This means that as data gets written on one node, it's possible that immediate reads on other nodes may not see that data. This is the default value for group_replication_consistency, and this is going to be the cluster consistency level that most of us are familiar with if you've been working with Galera or earlier versions of InnoDB Group Replication. When you get an 'OK' return status on your transaction, you are only getting confirmation that the write you're performing doesn't conflict with anything pending on the other participating nodes and that the other participating nodes have the change logged so that it can eventually be replayed. This is part of why we consider Galera to be 'virtually synchronous' replication.

The advantage of using this cluster consistency level is speed. You lose the overhead of consistency verification as part of your transaction, which allows you to commit data changes much faster. But what if we want to introduce a bit more consistency?

Before

We may want to consider using the 'BEFORE' option for group_replication_consistency. When set to this value, any transaction or request will wait for the node to complete any pending transactions in its queue before allowing the request to occur. This ensures that the request is looking at the most up-to-date version of the data, which in turn assures consistency for the transaction or request.

The advantage here is obviously the fact that we get greater consistency, but the disadvantage is greater overhead and potential delays. Depending on the number of data changes pending in the node's queue, you could notice delays in getting the results for your request. Keep in mind there is no data checking on any of the other nodes in order to ensure consistency, like you might see in similar shared-nothing database clustering products such as Cassandra. It simply checks the log and assumes that if the queue is up to date with the time the request was submitted, the data is consistent. However, the fact that it has to wait for transactions to process prior to the execution of the request will still add overhead.
If you have a write-heavy workload and want to ensure that your reads are consistent, you can consider working with this option.

After

Another option for consideration is 'AFTER'. When set to this value, any transaction will wait to complete until the associated changes are applied to all the other participating nodes in the cluster. This takes things one step further than the basic conflict checking that occurs with eventual consistency. The advantage, once again, is that you have greater assured cluster consistency, but it comes at the cost of performance, as you'll have to wait for all participating nodes to write the request before getting an 'OK'. If you have a node with a large queue, this can cause further delays.

Another limitation to consider is that if you use this option, there will be an impact on other concurrently running transactions, as they have to be committed in the order in which they were received. For example, if you have one transaction executing with the 'AFTER' option enabled, any other transaction that commits after it will have to wait, even if the other transactions are using the 'EVENTUAL' consistency option.

If you have a read-heavy workload and want to ensure that your writes are consistent, you can consider working with this option.

Before and After

The option 'BEFORE_AND_AFTER' is simply a combination of the logic found in the 'BEFORE' and 'AFTER' options. Before any request occurs, it will make sure the node is up to date with the transactions waiting in its queue. Once the transaction is complete, it will verify that the write is done on all nodes prior to giving an 'OK'. This offers the highest level of consistency and may sometimes be applicable when you are not operating in single-primary mode, but it obviously comes with the highest overhead cost.

If you are using multi-primary, have a read-heavy workload, and want to ensure that your writes with read dependencies are consistent, you can consider working with this option.

Before on Primary Failover

If you've checked the documentation for this variable, you will notice that there is a fifth option, called 'BEFORE_ON_PRIMARY_FAILOVER'. This does have an impact on cluster-level consistency, but mainly for when a failover occurs while running in single-primary mode. Single-primary mode in InnoDB Group Replication means that you have one node in your cluster designated as the primary. All write requests should be directed to this node, whereas read requests can be evenly distributed over the remaining nodes in the cluster. Should the single primary node fail, a promotion will need to occur, and the value of 'BEFORE_ON_PRIMARY_FAILOVER' denotes how soon traffic should be directed to the promoted node. If you use the option 'EVENTUAL', the newly promoted node will start taking write traffic immediately after the promotion is complete. If you use 'BEFORE_ON_PRIMARY_FAILOVER', then it will wait for all pending transactions in its queue to complete before accepting any new write traffic. The big trade-off here is availability vs consistency. You'll have to determine what is best for your use case.

Recommended Variable Configuration

Given that the value of this variable impacts cluster-level recovery and data flow, I would be inclined to set the global value of this variable to either 'EVENTUAL' or 'BEFORE_ON_PRIMARY_FAILOVER' so you have a default configuration for how write requests are handled during a promotion.
Beyond that, I would consider setting the variable to other values like 'BEFORE', 'AFTER', or 'BEFORE_AND_AFTER' at the session level, in accordance with the consistency requirements of my specific transaction.

Conclusion

InnoDB Group Replication takes us a step forward when it comes to providing options for cluster-level data consistency in MySQL-based virtually synchronous clustering solutions, but we have to be aware of the cost of overhead that comes with them. Ensure that you run adequate performance tests before increasing your cluster consistency level!
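As an illustration of the per-session approach described above, group_replication_consistency can be set globally for the default behavior and overridden in a session that needs stronger guarantees. The variable and its values are real; the table in the read is just a hypothetical placeholder.

-- Cluster-wide default (could also be set in my.cnf)
SET GLOBAL group_replication_consistency = 'EVENTUAL';

-- In a session that must see the latest committed writes before reading:
SET SESSION group_replication_consistency = 'BEFORE';
SELECT balance FROM accounts WHERE id = 42;  -- 'accounts' is a made-up example table

-- Revert to the default for the remainder of the session
SET SESSION group_replication_consistency = 'EVENTUAL';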
Posted over 4 years ago by The Pythian Group
When you have multiple database servers working together as nodes in a cluster, it's beneficial to understand how data consistency is established. In this post, we're going to take a look at the various cluster consistency levels that you can choose ... [More] from within InnoDB Group Replication and see how they impact data flow and failure recovery. Let's start with a quick refresher on what data consistency is. In a nutshell, consistency is just a way of thinking about how data is duplicated across multiple nodes in a cluster. If you write a row of data to one node, that data is not considered to be consistent within the cluster until it has been written to all of the other participating nodes. Without consistency, it's possible to write data to one node in the cluster but not see it immediately when reading from another node in the cluster. In some cases the aforementioned scenario, better known as eventual consistency, is acceptable, but sometimes you need to ensure the data you wrote is available on all your nodes before any subsequent reads occur. For that, we need to look at the server system variable group_replication_consistency, which allows you to control consistency levels globally or per transaction. Let's have a look at each option and see what it means for consistency and your workload. Eventual The value of 'EVENTUAL' implies that you are okay with having eventual consistency in your cluster. This means that as data gets written on one node, it's possible that immediate reads on other nodes may not see that data. This is the default value for group_replication_consistency, and it is the cluster consistency level that most of us are familiar with if you've been working with Galera or earlier versions of InnoDB Group Replication. When you get an 'OK' return status on your transaction, you are only getting confirmation that the write you're performing doesn't conflict with anything pending on the other participating nodes and that the other participating nodes have the change logged so that it can eventually be replayed. This is part of why we consider Galera to be 'virtually synchronous' replication. The advantage of using this cluster consistency level is speed. You avoid the overhead of consistency verification as part of your transaction, which allows you to commit data changes much faster. But what if we want to introduce a bit more consistency? Before We may want to consider using the 'BEFORE' option for group_replication_consistency. When set to this value, any transaction or request will wait for the node to complete any pending transactions in its queue before allowing the request to occur. This ensures that the request is looking at the most up-to-date version of the data, which in turn assures consistency for the transaction or request. The advantage here is obviously the fact that we get greater consistency, but the disadvantage is greater overhead and potential delays. Depending on the number of data changes pending in the node's queue, you could notice delays in getting the results for your request. Keep in mind there is no data checking on any of the other nodes to ensure consistency, as you might see in similar shared-nothing database clustering products like Cassandra. It simply checks the log and assumes that if the queue is up to date as of the time the request was submitted, the data is consistent. However, the fact that it has to wait for transactions to process prior to the execution of the request will still add overhead.
If you have a write-heavy workload and want to ensure that your reads are consistent, you can consider working with this option. After Another option for consideration is 'AFTER'. When set to this value, any transaction will wait to complete until the associated changes are applied to all the other participating nodes in the cluster. This takes things one step further than the basic conflict checking that occurs with eventual consistency. The advantage, once again, is greater assured cluster consistency, but it comes at the cost of performance, as you'll have to wait for all participating nodes to write the request before getting an 'OK'. If you have a node with a large queue, this can cause further delays. Another limitation to consider is that if you use this option, there will be an impact on other concurrently running transactions, as they have to be committed in the order in which they were received. For example, if you have one transaction executing with the 'AFTER' option enabled, any other transaction that commits afterward will have to wait, even if the other transactions are using the 'EVENTUAL' consistency option. If you have a read-heavy workload and want to ensure that your writes are consistent, you can consider working with this option. Before and After The option 'BEFORE_AND_AFTER' is simply a combination of the logic found in the 'BEFORE' and 'AFTER' options. Before any request occurs, it will make sure the node is up to date with the transactions waiting in its queue. Once the transaction is complete, it will verify that the write is done on all nodes prior to giving an 'OK'. This offers the highest level of consistency and may sometimes be applicable when you are not operating in single-primary mode, but it obviously comes with the highest overhead cost. If you are using multi-primary, have a read-heavy workload, and want to ensure that your writes with read dependencies are consistent, you can consider working with this option. Before on Primary Failover If you've checked the documentation for this variable, you will notice that there is a fifth option, called 'BEFORE_ON_PRIMARY_FAILOVER'. This does have an impact on cluster-level consistency, but mainly when a failover occurs while running in single-primary mode. Single-primary mode in InnoDB Group Replication means that you have one node in your cluster designated as the primary. All write requests should be directed to this node, whereas read requests can be evenly distributed over the remaining nodes in the cluster. Should the single primary node fail, a promotion will need to occur, and the value of 'BEFORE_ON_PRIMARY_FAILOVER' denotes how soon traffic should be directed to the promoted node. If you use the option 'EVENTUAL', the newly promoted node will start taking write traffic immediately after the promotion is complete. If you use 'BEFORE_ON_PRIMARY_FAILOVER', it will wait for all pending transactions in its queue to complete before accepting any new read or write traffic. The big trade-off here is availability vs. consistency. You'll have to determine what is best for your use case. Recommended Variable Configuration Given that the value of this variable impacts cluster-level recovery and data flow, I would be inclined to set its global value to either 'EVENTUAL' or 'BEFORE_ON_PRIMARY_FAILOVER' so you have a default configuration for how write requests are handled during a promotion.
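To make that recommendation concrete, here is a minimal sketch, assuming MySQL 8.0.14 or later where group_replication_consistency and these values are available; the accounts table and query used to illustrate the session-level override are hypothetical:
-- Cluster-wide default on each member; SET PERSIST also survives a restart
SET PERSIST group_replication_consistency = 'BEFORE_ON_PRIMARY_FAILOVER';
-- Tighten the guarantee for a single session that must read the latest writes
SET @@SESSION.group_replication_consistency = 'BEFORE';
SELECT balance FROM accounts WHERE account_id = 42;   -- hypothetical read that needs up-to-date data
SET @@SESSION.group_replication_consistency = 'EVENTUAL';
Dropping the session back to 'EVENTUAL' (or simply closing the session) keeps the stronger consistency level from slowing down unrelated work.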
Beyond that, I would consider setting the variable to other values like 'BEFORE', 'AFTER', or 'BEFORE_AND_AFTER' at the session level in accordance with the consistency requirements of my specific transaction. Conclusion InnoDB Group Replication takes us a step forward when it comes to providing options for cluster-level data consistency in MySQL-based virtually synchronous clustering solutions, but we have to be aware of the overhead that comes with them. Ensure that you run adequate performance tests before increasing your cluster consistency level! Thank you to Nuno Carvalho for clarifying points regarding BEFORE_ON_PRIMARY_FAILOVER. You can see his comment below for full details. [Less]
Posted over 4 years ago by Severalnines
Slow queries, inefficient queries, or long-running queries are problems that regularly plague DBAs. They are ubiquitous, yet an inevitable part of life for anyone responsible for managing a database. Poor database design can affect ... [More] the efficiency of the query and its performance. Lack of knowledge or improper use of function calls, stored procedures, or routines can also cause database performance degradation and can even harm the entire MySQL database cluster. In master-slave replication, a very common cause of these issues is tables which lack primary or secondary indexes. This causes slave lag which can last for a very long time (in a worst-case scenario). In this two-part blog series, we'll give you a refresher course on how to tackle maximizing your database queries in MySQL to drive better efficiency and performance. Always Add a Unique Index To Your Table Tables that do not have primary or unique keys typically create huge problems when data gets bigger. When this happens, a simple data modification can stall the database. If proper indexes are lacking and an UPDATE or DELETE statement is applied to the particular table, a full table scan will be chosen as the query plan by MySQL. That can cause high disk I/O for reads and writes and degrade the performance of your database. See an example below:
root[test]> show create table sbtest2\G
*************************** 1. row ***************************
Table: sbtest2
Create Table: CREATE TABLE `sbtest2` (
  `id` int(10) unsigned NOT NULL,
  `k` int(10) unsigned NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT ''
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

root[test]> explain extended update sbtest2 set k=52, pad="xx234xh1jdkHdj234" where id=57;
+----+-------------+---------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table   | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+---------+------------+------+---------------+------+---------+------+---------+----------+-------------+
|  1 | UPDATE      | sbtest2 | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 1923216 |   100.00 | Using where |
+----+-------------+---------+------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.06 sec)
Whereas a table with a primary key has a very good query plan:
root[test]> show create table sbtest3\G
*************************** 1. row ***************************
Table: sbtest3
Create Table: CREATE TABLE `sbtest3` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `k` int(10) unsigned NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=2097121 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

root[test]> explain extended update sbtest3 set k=52, pad="xx234xh1jdkHdj234" where id=57;
+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
| id | select_type | table   | partitions | type  | possible_keys | key     | key_len | ref   | rows | filtered | Extra       |
+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
|  1 | UPDATE      | sbtest3 | NULL       | range | PRIMARY       | PRIMARY | 4       | const |    1 |   100.00 | Using where |
+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
Primary or unique keys provide a vital component of a table's structure, which is especially important when performing maintenance on a table. For example, tools from the Percona Toolkit (such as pt-online-schema-change or pt-table-sync) require that you have unique keys. Keep in mind that a PRIMARY KEY is already a unique key, and that a primary key cannot hold NULL values, but a unique key can. Assigning a NULL value to a primary key causes an error like:
ERROR 1171 (42000): All parts of a PRIMARY KEY must be NOT NULL; if you need NULL in a key, use UNIQUE instead
For slave nodes, it is also common on certain occasions for the primary/unique key to be missing from the table, creating a discrepancy in the table structure. You can use mysqldiff to detect this, or you can run mysqldump --no-data … and diff the output to compare the table structures and check for any discrepancy. Scan Tables With Duplicate Indexes, Then Drop Them Duplicate indexes can also cause performance degradation, especially when the table contains a huge number of records. MySQL has to evaluate more candidate query plans when optimizing the query, which includes scanning larger index statistics, and that adds performance overhead as it can cause memory contention or high I/O utilization. Duplicate indexes on a table also contribute to saturating the buffer pool. This can also affect the performance of MySQL when checkpointing flushes the transaction logs to disk, due to the processing and storing of an unwanted index (which is in fact a waste of space in the tablespace of that table). Take note that duplicate indexes are also stored in the tablespace, which in turn has to be cached in the buffer pool. Take a look at the table below, which contains multiple duplicate keys:
root[test]#> show create table sbtest3\G
*************************** 1. row ***************************
Table: sbtest3
Create Table: CREATE TABLE `sbtest3` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `k` int(10) unsigned NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k` (`k`,`pad`,`c`),
  KEY `kcp2` (`id`,`k`,`c`,`pad`),
  KEY `kcp` (`k`,`c`,`pad`),
  KEY `pck` (`pad`,`c`,`id`,`k`)
) ENGINE=InnoDB AUTO_INCREMENT=2048561 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
and has a size of 2.3GiB:
root[test]#> \! du -hs /var/lib/mysql/test/sbtest3.ibd
2.3G /var/lib/mysql/test/sbtest3.ibd
Let's drop the duplicate indexes and rebuild the table with a no-op alter:
root[test]#> drop index kcp2 on sbtest3; drop index kcp on sbtest3; drop index pck on sbtest3;
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0

root[test]#> alter table sbtest3 engine=innodb;
Query OK, 0 rows affected (28.23 sec)
Records: 0 Duplicates: 0 Warnings: 0

root[test]#> \! du -hs /var/lib/mysql/test/sbtest3.ibd
945M /var/lib/mysql/test/sbtest3.ibd
This saved roughly 59% of the old tablespace size, which is a huge reduction. To determine duplicate indexes, you can use pt-duplicate-key-checker from the Percona Toolkit to handle the job for you. Tune Up Your Buffer Pool For this section I'm referring only to the InnoDB storage engine. The buffer pool is an important component of InnoDB. It is where InnoDB caches table and index data when accessed, and it speeds up processing because frequently used data is stored efficiently in memory (organized in B-tree structures). For instance, if you have multiple tables totaling >= 100GiB that are accessed heavily, then we suggest provisioning fast volatile memory starting from a size of 128GiB and assigning about 80% of the physical memory to the buffer pool. That 80% has to be monitored; you can use SHOW ENGINE INNODB STATUS\G or leverage monitoring software such as ClusterControl, which offers fine-grained monitoring that includes the buffer pool and its relevant health metrics. Also set the innodb_buffer_pool_instances variable accordingly. You might set this larger than 8 (the default if innodb_buffer_pool_size >= 1GiB), such as 16, 24, 32, or 64, or higher if necessary. When monitoring the buffer pool, check the global status variable Innodb_buffer_pool_pages_free, which gives you an idea of whether there's a need to adjust the buffer pool, or whether there are unwanted or duplicate indexes consuming the buffer. SHOW ENGINE INNODB STATUS\G also offers a more detailed view of the buffer pool, including the individual buffer pool instances based on the number of innodb_buffer_pool_instances you have set. Use FULLTEXT Indexes (But Only If Applicable) Using queries like
SELECT bookid, page, context FROM books WHERE context like '%for dummies%';
where context is a string-type (char, varchar, text) column, is an example of a super bad query! Pulling a large set of records with a filter that has to be this greedy ends up in a full table scan, and that is just crazy. Consider using a FULLTEXT index instead. FULLTEXT indexes have an inverted index design. Inverted indexes store a list of words, and for each word, a list of documents that the word appears in. To support proximity search, position information for each word is also stored, as a byte offset.
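As a rough sketch of what that looks like in practice, assuming a FULLTEXT index is added to the context column of the books table from the query above (the index name is made up, and keep in mind that FULLTEXT matches whole words under the default stopword and minimum-token-size settings rather than arbitrary substrings):
-- One-time: build the inverted index on the text column (hypothetical index name)
CREATE FULLTEXT INDEX context_fts ON books(context);
-- Phrase search that can use the index instead of a full table scan
SELECT bookid, page, context
FROM books
WHERE MATCH(context) AGAINST ('"for dummies"' IN BOOLEAN MODE);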
To use FULLTEXT for searching or filtering data, you need to use the MATCH() ... AGAINST syntax, as in the sketch above, rather than the LIKE pattern shown earlier. Of course, you need to specify the field to be your FULLTEXT index field. To create a FULLTEXT index, just specify FULLTEXT as the index type. See the example below:
root[minime]#> CREATE FULLTEXT INDEX aboutme_fts ON users_info(aboutme);
Query OK, 0 rows affected, 1 warning (0.49 sec)
Records: 0 Duplicates: 0 Warnings: 1

root[jbmrcd_date]#> show warnings;
+---------+------+--------------------------------------------------+
| Level   | Code | Message                                          |
+---------+------+--------------------------------------------------+
| Warning | 124  | InnoDB rebuilding table to add column FTS_DOC_ID |
+---------+------+--------------------------------------------------+
1 row in set (0.00 sec)
Although FULLTEXT indexes can offer benefits when searching for words within a very large text column, they can also create issues when used incorrectly. When doing a FULLTEXT search on a large table that is constantly accessed (where a number of client requests are searching for different, unique keywords), it can be very CPU intensive. There are also certain occasions where FULLTEXT is not applicable; see this external blog post. Although I haven't tried this with 8.0, I don't see any changes relevant to this. We suggest that you do not use FULLTEXT for searching in a big-data environment, especially on high-traffic tables. Otherwise, try to leverage other technologies such as Apache Lucene, Apache Solr, tsearch2, or Sphinx. Avoid Using NULL in Columns Columns that contain NULL values are totally fine in MySQL. But if you include columns with NULL values in an index, it can affect query performance, as the optimizer cannot provide the right query plan due to poor index distribution. However, there are certain ways to optimize queries that involve NULL values, if this suits the requirements. Please check the MySQL documentation on IS NULL optimization. You may also check this external post, which is helpful as well. Design Your Schema Topology and Table Structure Efficiently To some extent, normalizing your database tables from 1NF (First Normal Form) to 3NF (Third Normal Form) provides some benefit for query efficiency, because normalized tables tend to avoid redundant records. Proper planning and design of your tables is very important because this is how you retrieve or pull data, and every one of these actions has a cost. With normalized tables, the goal of the database is to ensure that every non-key column in every table is directly dependent on the key; the whole key and nothing but the key. If this goal is reached, it pays off in the form of reduced redundancies, fewer anomalies, and improved efficiency. While normalizing your tables has many benefits, it doesn't mean you need to normalize all your tables in this way. You can also implement a design for your database using a Star Schema. Designing your tables using a Star Schema has the benefit of simpler queries (avoiding complex cross joins), easier data retrieval for reporting, and performance gains because there's no need for unions or complex joins, as well as fast aggregations. A Star Schema is simple to implement, but you need to plan carefully, because it can create big problems and disadvantages when your table gets bigger and requires maintenance. A Star Schema (and its underlying tables) is prone to data integrity issues, so there is a high probability that a bunch of your data will be redundant. If you think the table has to stay constant in structure and design and is built for query efficiency, then it's an ideal case for this approach. Mixing your database designs (as long as you are able to determine and identify what kind of data has to be pulled from your tables) is very important, since you can benefit from more efficient queries as well as help the DBA with backups, maintenance, and recovery. Get Rid of Constant and Old Data We recently wrote about Best Practices for Archiving Your Database in the Cloud. It covers how you can take advantage of data archiving before it goes to the cloud. So how does getting rid of old data or archiving your constant and old data help query efficiency? As stated in my previous blog, for larger tables that are constantly modified and have new data inserted, the tablespace can grow quickly. MySQL and InnoDB perform efficiently when records or data are contiguous to each other and have significance to the next row in the table. Meaning, if old records that no longer need to be used are gone, the optimizer does not need to include them in its statistics, offering much more efficient results. Makes sense, right? Also, query efficiency is not only about the application side; it also needs to be considered when performing a backup and during maintenance or failover. For example, if you have a bad, long-running query that can affect your maintenance period or a failover, that can be a problem. Enable Query Logging As Needed Always set your MySQL slow query log in accordance with your needs. If you are using Percona Server, you can take advantage of its extended slow query logging. It allows you to define certain variables: you can filter types of queries in combination, such as full_scan, full_join, tmp_table, and so on, and you can also dictate the rate of slow query logging through the variable log_slow_rate_type, among many others. Enabling query logging in MySQL (such as the slow query log) is beneficial for inspecting your queries so that you can optimize or tune MySQL by adjusting certain variables to suit your requirements. To enable the slow query log, ensure that these variables are set up (see the short sketch after this post): long_query_time - assign the right value for how long queries can take. If queries take more than 10 seconds (the default), they will land in the slow query log file you assigned. slow_query_log - to enable it, set it to 1. slow_query_log_file - this is the destination path for your slow query log file. The slow query log is very helpful for query analysis and for diagnosing bad queries that cause stalls, slave delays, long run times, heavy memory or CPU usage, or even server crashes. If you use pt-query-digest or pt-index-usage, use the slow query log file as the source for reporting on these queries. Conclusion We have discussed some ways you can use to maximize database query efficiency in this blog. In the next part we'll discuss even more factors which can help you maximize performance. Stay tuned!   Tags:  database performance query management query tuning query latency MySQL MariaDB galera cluster [Less]
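Tying the slow query log settings above together, a minimal sketch; the threshold and file path are arbitrary examples, and the same options can go under [mysqld] in my.cnf so they persist across restarts:
-- Enable the slow query log at runtime
SET GLOBAL slow_query_log = 1;
SET GLOBAL slow_query_log_file = '/var/log/mysql/mysql-slow.log';  -- hypothetical path
SET GLOBAL long_query_time = 2;  -- log statements slower than 2 seconds; applies to new connections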
Posted over 4 years ago by Frederic Descamps
The schedule of the preFOSDEM Day is now available! We had a lot of proposals to deal with. Also, this is a MySQL event where we, the MySQL Team, have the possibility to show you, our Community, everything we have been working on to improve MySQL ... [More] but also new stuff. We also invited some of our friends from the MySQL Community to talk about their experience. I think we made a good selection and are proposing new content to you. We are extremely happy to have Saverio Miroddi from TicketSolve talking about MySQL 8.0, Uber talking about InnoDB Cluster, and Facebook about Binlog. As you can see, we will have 2 rooms, where one will be dedicated mostly to SQL and Optimizer topics but also to tutorials. You can also see that this year we want to put some spotlight on MySQL NDB Cluster. You will see what it is, how it is used, and what's new. We will also have 2 community speakers sharing their MySQL NDB knowledge: Giuseppe Maxia will show you how to get familiar with NDB without having to deploy it on a complicated architecture, and Marco Tusa from Percona will show you how to use it with ProxySQL. I've also heard that there will be new stuff…. Don't forget to register if you want to join this event, which will be held in Brussels, January 30 and 31: https://mysqldays2020.eventbrite.com
Day 1 – Thursday, January 30 (rooms: Madera / Azzar; where two sessions ran in parallel they are listed as Madera | Azzar)
9:30-10:00   MySQL Community Welcome – Lenka Kasporova, David Stokes, Frédéric Descamps (Oracle)
10:00-10:30  Simplifying MySQL – Geir Hoydalsvik (Oracle)
10:30-11:05  MySQL Replication – Kenny Gryp (Oracle)
11:05-11:25  Coffee Break
11:25-11:55  MySQL Clone: A better way to migrate databases – Georgi Kodinov (Oracle) | Indexing JSON Arrays in MySQL – Dag Wanvik (Oracle)
12:00-12:30  MySQL Group Replication: Best Practices for Handling Network Glitches – Pedro Gomes (Oracle) | Everything you always wanted to know about datetime types but didn't have time to ask – How to avoid the most common pitfalls of date and time types in MySQL – Martin Hansson (Oracle)
12:35-13:30  Lunch Break
13:30-14:00  Friends let real friends use MySQL 8.0 – Saverio Miroddi (TicketSolve) | CHECK Constraints in MySQL 8.0 – Dmitry Lenev (Oracle)
14:05-14:35  MySQL Connectors – Kenny Gryp (Oracle) | Table value constructors in MySQL 8.0 – Catalin Beleaga (Oracle)
14:40-15:10  MySQL InnoDB Cluster: Making Provisioning and Troubleshooting as easy as pie – Miguel Araújo (Oracle) | Checking & Hardening MySQL 8.0 Security via SQL and without OS Access – Georgi Kodinov (Oracle)
15:15-15:40  Coffee Break
15:40-16:10  Best practices to upgrade to MySQL 8.0 – Frédéric Descamps (Oracle) | MySQL 8.0 Document Store Tutorial – David Stokes (Oracle)
16:15-16:45  New Redo Log in MySQL 8.0 InnoDB: Improvements for high concurrency – Pawel Olchawa (Oracle)
16:50-17:20  Hash Join in MySQL 8.0 – Erik Frøseth (Oracle)
17:25-17:55  MySQL 8.0 EXPLAIN ANALYZE – Norvald H. Ryeng (Oracle)
Day 2 – Friday, January 31 (rooms: Madera / Azzar; parallel sessions listed as Madera | Azzar)
9:30-10:00   MySQL Community Welcome – Lenka Kasporova, David Stokes, Frédéric Descamps (Oracle)
10:00-10:30  Extreme Performance with MySQL Analytics Service – Nipun Agarwal (Oracle)
10:30-11:05  MySQL Replication Performance in the Cloud – Vitor Oliveira (Oracle) | MySQL Server Usability Guidelines – Morgan Tocker & Geir Høydalsvik (PlanetScale / Oracle)
11:05-11:25  Coffee Break
11:25-11:55  MySQL NDB 8.0 101 – Bernd Ocklin (Oracle) | MySQL Database Architectures Tutorial – part I – Miguel Araújo, Luis Soares (Oracle)
12:00-12:30  MySQL NDB 8.0 clusters in your laptop with DBdeployer – Giuseppe Maxia (DBdeployer)
12:35-13:30  Lunch Break
13:30-14:00  SQL with MySQL NDB 8.0 faster than your NoSQL allow – Bernd Ocklin (Oracle) | MySQL Database Architectures Tutorial – part II – Miguel Araújo, Luis Soares (Oracle)
14:05-14:35  Boosting MySQL NDB Cluster & MySQL InnoDB Cluster with ProxySQL V2 – Marco Tusa (Percona)
14:40-15:10  Machine Learning for automating MySQL service – Nipun Agarwal (Oracle)
15:15-15:40  Coffee Break
15:40-16:10  Binlog and Engine Consistency Under Reduced Durability – Yoshinori Matsunobu (Facebook) | MySQL NDB 8.0 Cluster Tutorial – Frazer Clement (Oracle)
16:15-16:45  Benchmarks -vs- Benchmarks – Dimitri Kravtchuk (Oracle)
16:50-17:20  Onboarding to MySQL Group Replication at Uber – Giedrius Jaraminas, Henrik Korku (Uber)
17:25-17:55  Vitess: the sharding solution for MySQL 8.0 – Liz Van Dijk (PlanetScale)
Please note that the current schedule is subject to change. [Less]