The packet flow diagram shows what happens when.
Destination NAT happens at the end of prerouting, and therefore before forward/input and postrouting. For traffic returning to a client, destination NAT is what undoes source NAT, so the client's real address is not visible until after prerouting. Therefore, marking based on the real address for traffic to the client has to happen in forward, or later. Source NAT happens after postrouting, so for traffic from the client you can mark wherever you want. That is, if you mark by IP at all. If everyone gets the same mark anyway, I find it easier to mark by interface.
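To illustrate marking by real address, here is a minimal sketch. The subnet 192.168.88.0/24 and the mark name "download" are assumptions for the example, not something taken from your configuration. Because the rule sits in forward, destination NAT has already restored the client's real address by the time the rule is evaluated.

```
# Assumed: clients live in 192.168.88.0/24; the mark name is arbitrary.
/ip firewall mangle
add chain=forward dst-address=192.168.88.0/24 action=mark-packet \
    new-packet-mark=download passthrough=no
```

The same match placed in prerouting would not catch NATed return traffic, since at that point the destination is still the router's public address.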
Global-in processing (queues that have that as a parent) happens after prerouting, global-out happens after postrouting, interface HTB (interfaces as parents) happens after that.
It's mainly a matter of preference and what you find easier to understand, as there are several valid approaches. Personally, on simple routers like yours, I like to mark upload traffic in prerouting based on in-interface=LAN and use PCQ attached to global-in, and mark download traffic in postrouting based on out-interface=LAN and use PCQ attached to global-out. That works fine. Other people prefer to use interfaces as parents.
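A rough sketch of that setup might look like the following. It assumes a RouterOS version that still has global-in and global-out, an interface literally named LAN, and made-up values for the queue names, mark names and max-limit; adjust all of those to your own router.

```
# Mark upload on the way in from the LAN, download on the way out to the LAN.
/ip firewall mangle
add chain=prerouting in-interface=LAN action=mark-packet \
    new-packet-mark=upload passthrough=no
add chain=postrouting out-interface=LAN action=mark-packet \
    new-packet-mark=download passthrough=no

# PCQ types: one sub-queue per client source address (upload)
# and one per client destination address (download).
/queue type
add name=pcq-up kind=pcq pcq-classifier=src-address pcq-rate=0
add name=pcq-down kind=pcq pcq-classifier=dst-address pcq-rate=0

# Assumed line speeds: 2M up, 10M down.
/queue tree
add name=upload parent=global-in packet-mark=upload queue=pcq-up max-limit=2M
add name=download parent=global-out packet-mark=download queue=pcq-down max-limit=10M
```

With pcq-rate=0 the available bandwidth is shared equally between active clients rather than capped per client.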
One thing to keep in mind is that interface HTB queues only see packets leaving the interface, and never packets entering the specified interface (which is why you shouldn't use the Internet facing interface for upload). See the manual: http://wiki.mikrotik.com/wiki/Manual:Queue#Queues
In RouterOS, these hierarchical structures can be attached at 4 different places:
global-in: represents all the input interfaces in general (INGRESS queue). Queues attached to global-in apply to traffic that is received by the router before the packet filtering.
global-out: represents all the output interfaces in general (EGRESS queue).
global-total: represents all input and output interfaces together (in other words it is aggregation of global-in and global-out). Used in case when customers have single limit for both, upload and download.
<interface name>: represents one particular outgoing interface. Only traffic that is designated to go out via this interface will pass this HTB queue.
As for the PCQ limits: pcq-limit is simply the number of packets any single dynamic sub-queue can hold, and pcq-total-limit is the number of packets held across ALL dynamic sub-queues. The total limit should be equal to pcq-limit multiplied by the maximum number of simultaneous sub-queues you are expecting. If you have lots of RAM, just assume that number to be the total number of clients if you are using src-address and dst-address PCQ classifiers.
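As a worked example of that multiplication, assume 100 clients and 50 packets per sub-queue, which gives 50 x 100 = 5000 for the total. The client count and the queue type name below are made up for illustration.

```
# Assumed: up to 100 simultaneous clients, 50 packets per sub-queue.
# pcq-total-limit = pcq-limit x sub-queues = 50 x 100 = 5000
/queue type
add name=pcq-down-100 kind=pcq pcq-classifier=dst-address pcq-rate=0 \
    pcq-limit=50 pcq-total-limit=5000
```

If the total is set lower than that product, sub-queues can start dropping packets before any single one of them is actually full.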
Hope that helps. Really, it's all in the packet flow diagram (which is easily the best and most valuable document in the wiki manual).
Specific answers require specific questions. When in doubt, post the output of "/ip address print detail", "/ip route print detail", "/interface print detail", "/ip firewall export", and an accurate network diagram.