@Khaled0Ebrahim, I completely agree with you that the diagram strongly suggests that you need one token to release one packet regardless of its size. I assume this is because the diagram was never meant to express the relationship between a single token and the amount of data sent, so the author oversimplified it without anticipating the possible impact.
But to respond to your OP - you have misunderstood multiple aspects. @Chechito has already tried to explain that, so I'll try to use other words.
Put both the bucket and the packets aside at first. There is a constant flow of tokens, and there is a byte stream of data in the queue. If you prefer, you can think of each token as permitting a certain number of bits or bytes to be taken from the queue and sent. In fact, though, the question of "how much data does a token represent exactly" is irrelevant, because there are no discrete tokens at all; it is a continuous flow.
Without the bucket, the data are taken from the queue at exactly the same rate (in bits per second) as the rate at which the tokens arrive (also in bits per second). If there are no data in the queue, the tokens for that queue nevertheless keep arriving, but they are wasted. So when further data arrive in the queue, they cannot be released from it faster than new tokens arrive. As a result, even if the average rate of data arrival is lower than the token rate, some of the data get dropped whenever they arrive in bursts larger than the queue capacity.
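If it helps, here is a minimal sketch of the "no bucket" case in Python. It is only an illustration under my own assumptions, not how the router implements it: time is cut into 1-second steps, all amounts are in Mbit, and the function name, the parameters and the 20 Mbit queue capacity are made up for the example.

```python
def simulate_without_bucket(arrivals_mbit, token_rate_mbps, queue_capacity_mbit, step_s=1.0):
    """Hypothetical model: tokens that are not used within a step are lost."""
    queued = 0.0
    sent_total = 0.0
    dropped_total = 0.0
    for arriving in arrivals_mbit:
        # Data that do not fit into the queue are dropped on arrival.
        room = queue_capacity_mbit - queued
        accepted = min(arriving, room)
        dropped_total += arriving - accepted
        queued += accepted
        # Only the tokens arriving during this step can release data.
        sendable = token_rate_mbps * step_s
        sent = min(queued, sendable)
        queued -= sent
        sent_total += sent
    return sent_total, dropped_total, queued


# Average arrival rate is 40 Mbit / 6 s, well below the 10 Mbit/s token rate,
# yet the single 40 Mbit burst does not fit into the 20 Mbit queue.
sent, dropped, still_queued = simulate_without_bucket(
    arrivals_mbit=[0, 0, 0, 40, 0, 0],
    token_rate_mbps=10,
    queue_capacity_mbit=20,
)
print(sent, dropped, still_queued)
```

In this run half of the burst is dropped, although the tokens that arrived during the three idle seconds would have been enough to cover it had they not been wasted.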
The "bucket" is a reservoir of tokens, whose only purpose is to allow the data to be taken from the queue at an unlimited rate for a certain amount of time - i.e. to allow bursting not only above the constant flow rate but even above the normal burst rate. The idea is that while there are no data in the queue, the tokens are collected in the bucket rather than being wasted immediately, so once some data arrive in the queue, they can use the tokens accumulated in the bucket. Using those accumulated tokens, the data are taken from the queue at an unlimited rate until the bucket becomes empty; once there are no more tokens in the bucket, the sending rate falls back to the normal token arrival rate.
When there are no data for an extended period of time, the bucket becomes full of tokens and further tokens don't fit into it and get wasted as they arrive.
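The behaviour described above can be sketched the same way. Again, this is a simplified model under my own assumptions (1-second steps, amounts in Mbit, an unbounded queue, a bucket that starts empty, hypothetical names), meant only to show the three phases: filling, bursting, and falling back to the token rate.

```python
def simulate_with_bucket(arrivals_mbit, token_rate_mbps, bucket_capacity_mbit, step_s=1.0):
    """Hypothetical model: unused tokens are kept in the bucket up to its capacity."""
    bucket = 0.0   # tokens currently stored, expressed in Mbit of data
    queued = 0.0   # data waiting in the queue (unbounded here, for simplicity)
    wasted = 0.0   # tokens that did not fit into a full bucket
    sent_per_step = []
    for arriving in arrivals_mbit:
        queued += arriving
        # Stored tokens plus the tokens arriving during this step can be spent.
        available = bucket + token_rate_mbps * step_s
        sent = min(queued, available)
        queued -= sent
        # Whatever was not spent goes back into the bucket; the overflow is lost.
        leftover = available - sent
        bucket = min(leftover, bucket_capacity_mbit)
        wasted += leftover - bucket
        sent_per_step.append(sent)
    return sent_per_step, wasted


sent_per_step, wasted = simulate_with_bucket(
    arrivals_mbit=[0, 0, 0, 0, 0, 60, 0, 0],
    token_rate_mbps=10,
    bucket_capacity_mbit=30,
)
print(sent_per_step, wasted)
```

With these numbers the bucket is full after three idle seconds, so the tokens of the next two seconds (20 Mbit worth) are wasted. When the 60 Mbit burst arrives, 40 Mbit leave within one step (30 from the bucket plus 10 from the regular flow), and the remaining 20 Mbit drain at the plain 10 Mbit/s token rate.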
And now we finally get to the bucket-size parameter. It controls the amount of tokens that can be accumulated. Here it becomes complicated to imagine, as the physical dimension of this value is actually time - it says for how long, in seconds, the tokens can be accumulated while no data arrive in the queue. In other words, the bucket capacity equals the data rate multiplied by the bucket "size". So for a queue with a 10 Mbit/s data rate, a bucket of 0.1 second "size" will hold tokens worth 1 Mbit of data; a bucket of 10 second "size" will hold tokens worth 100 Mbit of data.
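The arithmetic behind those two numbers is just rate times time. A tiny sketch, with a helper name I made up for the illustration:

```python
def bucket_capacity_bits(rate_bps, bucket_size_s):
    """Tokens the bucket can hold, in bits of data: data rate times the bucket 'size' in seconds."""
    return rate_bps * bucket_size_s


rate_bps = 10_000_000  # the 10 Mbit/s queue from the example above

for bucket_size_s in (0.1, 10):
    capacity_mbit = bucket_capacity_bits(rate_bps, bucket_size_s) / 1_000_000
    print(f"bucket-size {bucket_size_s} s -> about {capacity_mbit:g} Mbit of tokens")
```

Read the other way round, the bucket "size" is also the time an empty bucket needs to fill up completely while the queue stays idle.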